Notes on Sphinx full-text search

Date: 2022-04-25 00:40:09
I recently set up full-text search with sphinx and am writing the steps down for future reference.
 
Reference manual: http://www.coreseek.cn/docs/coreseek_3.2-sphinx_0.9.9.html
 
This setup uses:

Software: coreseek  Server: Linux  Language: PHP  Database: MySQL
 
1. Set up the coreseek service on the server:
 
Switch to the root user so that you have full permissions to install software
$ su root
 
Installation steps:
Reference: http://www.coreseek.cn/products-install/install_on_bsd_linux/
 
(1). Download coreseek 4.1:
    wget -c http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.1-beta.tar.gz
 
(2). Unpack the coreseek archive:
    tar xzvf coreseek-4.1-beta.tar.gz
 
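(3). Install mmseg, the tokenizer developed by coreseek, which provides Chinese word segmentation for coreseek
    ①. Enter the directory and run bootstrap: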
    cd coreseek-4.1-beta
    cd mmseg-3.2.14
    [root@localhost mmseg-3.2.14]# ./bootstrap
 
    Output:
    + aclocal -I config
config/sys_siglist.m4:20: warning: underquoted definition of SIC_VAR_SYS_SIGLIST
config/sys_siglist.m4:20:   run info '(automake)Extending aclocal'
config/sys_siglist.m4:20:   or see http://sources.redhat.com/automake/automake.html#Extending-aclocal
+ libtoolize --force --copy
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, `config'.
libtoolize: copying file `config/ltmain.sh'
libtoolize: Consider adding `AC_CONFIG_MACRO_DIR([m4])' to configure.in and
libtoolize: rerunning libtoolize, to keep the correct libtool macros in-tree.
libtoolize: Consider adding `-I m4' to ACLOCAL_AMFLAGS in Makefile.am.
+ autoheader
+ automake --add-missing --copy
+ autoconf
    ②. Configure, build and install:
[root@localhost mmseg-3.2.14]#./configure --prefix=/usr/local/webserver/mmseg3
[root@localhost mmseg-3.2.14]#make
[root@localhost mmseg-3.2.14]#make install

(4). Test Chinese word segmentation

    [root@localhost mmseg-3.2.14]#/usr/local/webserver/mmseg3/bin/mmseg -d /usr/local/webserver/mmseg3/etc src/t1.txt
        Output:
        中文/x 分/x 词/x 测试/x
        中国人/x 上海市/x
 
         Word Splite took: 0 ms.
 
         This output means mmseg is working correctly.
 
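(5). Install coreseek: go back to the top-level source directory, then enter csft-4.1
     [root@localhost mmseg-3.2.14]#cd ..
     [root@localhost coreseek-4.1-beta]#cd csft-4.1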
 
(6). Run configure to prepare the build
     #sh buildconf.sh

     #./configure --prefix=/usr/local/webserver/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/webserver/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/webserver/mmseg3/lib/ --with-mysql=/usr/local/webserver/mysql

 
Output:
configuration done
------------------
 
You can now run 'make install' to build and install Sphinx binaries.
On a multi-core machine, try 'make -j4 install' to speed up the build.
 
Updates, articles, help forum, and commercial support, consulting, training,
and development services are available at http://sphinxsearch.com/
 
Thank you for choosing Sphinx!
 
Build:
[root@localhost csft-4.1]#make
 
Output:
/home/centos/coreseek-4.1-beta/csft-4.1/src/sphinx.cpp:22292: undefined reference to `libiconv_open'
/home/centos/coreseek-4.1-beta/csft-4.1/src/sphinx.cpp:22310: undefined reference to `libiconv'
/home/centos/coreseek-4.1-beta/csft-4.1/src/sphinx.cpp:22316: undefined reference to `libiconv_close'
 
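This is a link error: the iconv library is not being linked. Fix it by editing the Makefile:

Edit ./src/Makefile and change line 157 from
LIBS = -ldl -lm -lz -lexpat  -L/usr/local/lib -lrt  -lpthread
to
LIBS = -ldl -lm -lz -lexpat -liconv -L/usr/local/lib -lrt  -lpthread

Then run make again.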
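
The Makefile edit above can also be scripted so it is repeatable. The following is only a sketch: the function name add_iconv_to_libs is my own, and it matches the LIBS line by pattern instead of relying on it being line 157, which may differ between builds.

```shell
#!/bin/sh
# Add -liconv to the LIBS line of a Makefile, at most once.
# $1: path to the Makefile (for this build: ./src/Makefile)
add_iconv_to_libs() {
    makefile=$1
    # Nothing to do if the LIBS line already links iconv.
    if grep -q '^LIBS = .*-liconv' "$makefile"; then
        return 0
    fi
    # Keep a copy of the untouched file next to it.
    cp "$makefile" "$makefile.orig"
    # Insert -liconv right after -lexpat on the LIBS line;
    # a LIBS line without -lexpat is left unchanged.
    sed 's/^\(LIBS = .*-lexpat\)/\1 -liconv/' "$makefile.orig" > "$makefile"
}
```

Usage from the csft-4.1 directory would be: add_iconv_to_libs ./src/Makefile, followed by make.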
 
Alternative approach: make ZEND_EXTRA_LIBS='-liconv'
 
make install
cd ..

yum whatprovides lexpat
 
 

(7). Test the installation:

     [root@localhost csft-4.1]# /usr/local/webserver/coreseek/bin/indexer -c /usr/local/webserver/coreseek/etc/sphinx-min.conf.dist
     Output:
     Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]

     Copyright (c) 2007-2011,
     Beijing Choice Software Technologies Inc (http://www.coreseek.com)

     ERROR: nothing to do.

 
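     The "nothing to do" error is expected here, because no index name was given on the command line; it shows that indexer itself runs correctly.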
 
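(8). Edit the configuration file
    ①. Enter the etc directory:
    cd /usr/local/webserver/coreseek/etc/
    ②. Rename the configuration file to csft.conf:
    mv sphinx.conf csft.conf
    ③. Edit the configuration file:
     vi csft.conf
     (vi basics: a enters insert mode, Esc leaves it, :wq saves and quits)
    Two indexes are needed: a main index named library and a delta (incremental) index named delta.
   The file contents are as follows: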
 
   # Index sources #
   source library_src {
     # data source type
     type = mysql
     # mysql host
     sql_host = 192.168.1.206
     # mysql user
     sql_user = root
     # mysql password
     sql_pass = 123
     # mysql database name
     sql_db = pp_library
     # mysql port
     sql_port = 3306
     # connection charset; pay close attention here: Chinese searches often return nothing
     # because the database uses GBK or another non-UTF-8 encoding
     sql_query_pre = SET NAMES UTF8
     sql_query_pre = SET SESSION query_cache_type=OFF
     # SQL that fetches the data to index
     sql_query = SELECT cid,cid AS id,top_status,status,tid,totalnum,pubtime,lasttime, creatdate,content FROM wb_library WHERE status = 0 ORDER BY cid asc

     # attributes, used for filtering and sorting
     sql_attr_uint = id
     sql_attr_uint = top_status
     sql_attr_uint = status
     sql_attr_uint = tid
     sql_attr_uint = totalnum
     sql_attr_timestamp = pubtime
     sql_attr_timestamp = lasttime
  }
 
   source delta_src : library_src
  {
     type = mysql

     sql_host = 192.168.1.206
     sql_user = root
     sql_pass = 123
     sql_db = pp_library
     sql_port = 3306
     sql_query_pre = SET NAMES utf8

     sql_query_pre = SET SESSION query_cache_type=OFF
     # before building the delta index, move the upper marker to the newest row;
     # ON DUPLICATE KEY UPDATE is used so that the stored min_doc_id is kept
     # (REPLACE INTO would delete the row and reset min_doc_id to its default)
     sql_query_pre = INSERT INTO search_counter (counterid, max_doc_id) SELECT 1, MAX(cid) FROM wb_library ON DUPLICATE KEY UPDATE max_doc_id = VALUES(max_doc_id)
     # after building the delta index, move the lower marker up to the upper marker
     sql_query_post = UPDATE search_counter SET min_doc_id=max_doc_id WHERE counterid=1
     sql_query = SELECT cid,cid AS id,top_status,status,tid,totalnum,pubtime,lasttime,creatdate,content FROM wb_library WHERE cid > (SELECT min_doc_id FROM search_counter WHERE counterid=1) AND cid <= (SELECT max_doc_id FROM search_counter WHERE counterid=1)
     sql_attr_uint = id
     sql_attr_uint = top_status
     sql_attr_uint = status
     sql_attr_uint = tid
     sql_attr_uint = totalnum
     sql_attr_timestamp = pubtime
     sql_attr_timestamp = lasttime

     sql_range_step = 1000
     sql_ranged_throttle = 1000
  }
 
# Indexes #
 
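index library {
     # source this index is built from
     source = library_src
     # index file location and base file name
     path = /home/sphinxdata/indexer/library
     # document attribute storage mode
     docinfo = extern
     # lock cached data in memory
     mlock = 0
     # morphology processing (has no effect on Chinese)
     morphology = none
     # minimum indexed word length
     min_word_len = 1
     # data encoding
     charset_type = utf-8
     charset_dictpath =  /usr/local/webserver/mmseg3/etc

     # character table; note: with this approach sphinx splits Chinese text into single characters,
     # i.e. character-level indexing; real Chinese word segmentation requires a segmentation plugin such as coreseek or sfc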
     charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
     A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\
     U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\
     U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\
     U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\
     U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, \
     U+0116->U+0117,U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D,\
     U+011D,U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, \
     U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, \
     U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, \
     U+0143->U+0144,U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, \
     U+014B,U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, \
     U+0152->U+0153,U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159,\
     U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, \
     U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, \
     U+0167,U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, \
     U+016E->U+016F,U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175,\
     U+0175,U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, \
     U+017B->U+017C,U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, \
     U+0430..U+044F,U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, \
     U+0621..U+063A, U+01B9,U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, \
     U+0671..U+06D3, U+06F0..U+06FF,U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, \
     U+0966..U+096F, U+097B..U+097F,U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, \
     U+0A05..U+0A39, U+0A59..U+0A5E,U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, \
     U+0AE6..U+0AEF, U+0B05..U+0B39,U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, \
     U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, \
     U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, \
     U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822, U+0386->U+03B1, \
     U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, \
     U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9, \
     U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5, U+03B0->U+03C5, \
     U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3, \
     U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, \
     U+03C3..U+03C9, U+0E01..U+0E2E,U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, \
     U+A000..U+A48F, U+4E00..U+9FBF,U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
     U+2F800..U+2FA1F, U+2E80..U+2EFF,U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
     U+3040..U+309F, U+30A0..U+30FF,U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
     U+3130..U+318F, U+A000..U+A48F,U+A490..U+A4CF
     # minimum prefix length
     min_prefix_len = 0
     # minimum infix length
     min_infix_len = 1
     # n-gram length used to split non-alphabetic (CJK) text
     ngram_len = 1
}
 
# Delta index #
index delta : library
{
    source = delta_src
    path = /home/sphinxdata/indexer/delta
    charset_type = utf-8
    charset_dictpath = /usr/local/webserver/mmseg3/etc
    docinfo = extern
    ngram_len = 0
}

# Indexer settings #
indexer{
     mem_limit = 2047M # memory limit; 2047M is the maximum Sphinx accepts
}

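# sphinx daemon #
searchd {
     # listen address and port; 9312 is the port officially assigned to Sphinx by IANA, earlier versions defaulted to 3312
     listen = 192.168.1.239:9312
     # daemon log; when sphinx misbehaves, this is usually where to look, and rotation problems generally show up here too
     log = /usr/local/webserver/coreseek/var/log/searchd.log
     # client query log; to gather statistics on search keywords, analyze this file
     query_log = /usr/local/webserver/coreseek/var/log/query.log
     # request read timeout, in seconds
     read_timeout = 5
     # maximum number of concurrent searchd child processes
     max_children = 50
     # PID file
     pid_file = /usr/local/webserver/coreseek/var/log/searchd.pid
     # maximum number of matches returned for a query
     max_matches = 2000000
     # seamless index rotation, usually needed when using a delta index
     seamless_rotate = 1
     preopen_indexes = 0
     # SphinxQL compatibility mode
     compat_sphinxql_magics = 0
}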
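
The delta_src queries read and update a search_counter table that these notes never create. As a rough sketch, something like the following could set it up before the first indexer run. The column names come from the queries in csft.conf; the column types, the primary key on counterid and the zero-initialised row are my assumptions, and counter_table_sql is just a helper name I made up.

```shell
#!/bin/sh
# Print DDL for the counter table referenced by delta_src.
# Types, primary key and the initial row are assumptions, not from the original setup.
counter_table_sql() {
    cat <<'SQL'
CREATE TABLE IF NOT EXISTS search_counter (
    counterid  INT UNSIGNED NOT NULL PRIMARY KEY,
    min_doc_id INT UNSIGNED NOT NULL DEFAULT 0,
    max_doc_id INT UNSIGNED NOT NULL DEFAULT 0
);
INSERT IGNORE INTO search_counter (counterid, min_doc_id, max_doc_id)
VALUES (1, 0, 0);
SQL
}
```

It could be applied with: counter_table_sql | mysql -h 192.168.1.206 -u root -p pp_library. With min_doc_id starting at 0, the first delta build would cover every row, and later builds only rows added since the previous one.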
 

(9). Build the index files
    /usr/local/webserver/coreseek/bin/indexer -c /usr/local/webserver/coreseek/etc/csft.conf --all    

 
(10). Start the service
       /usr/local/webserver/coreseek/bin/searchd -c /usr/local/webserver/coreseek/etc/csft.conf
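
For day-to-day use it can help to wrap start, stop and status in one helper. This is a minimal sketch, assuming the install paths used in these notes; searchd_ctl and the SEARCHD/CONF variables are names I chose, while --stop and --status are standard searchd options.

```shell
#!/bin/sh
# Paths from the install steps in these notes; override via environment if yours differ.
SEARCHD=${SEARCHD:-/usr/local/webserver/coreseek/bin/searchd}
CONF=${CONF:-/usr/local/webserver/coreseek/etc/csft.conf}

# Start, stop or query the searchd daemon described by $CONF.
searchd_ctl() {
    case "$1" in
        start)  "$SEARCHD" --config "$CONF" ;;
        stop)   "$SEARCHD" --config "$CONF" --stop ;;
        status) "$SEARCHD" --config "$CONF" --status ;;
        *)      echo "usage: searchd_ctl start|stop|status" >&2
                return 2 ;;
    esac
}
```

For example, searchd_ctl status prints the daemon's counters when it is running.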
 
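Because a delta index is used, scheduled jobs are needed to keep the indexes up to date. searchd is already running at this point, so each command passes --rotate to swap the new index files in:

Rebuild the main index once a day:
/usr/local/webserver/coreseek/bin/indexer library --rotate --config /usr/local/webserver/coreseek/etc/csft.conf

Rebuild the delta index every 10 minutes:
/usr/local/webserver/coreseek/bin/indexer delta --rotate --config /usr/local/webserver/coreseek/etc/csft.conf
Then merge the delta index into the main index:
/usr/local/webserver/coreseek/bin/indexer --merge library delta --rotate --config /usr/local/webserver/coreseek/etc/csft.conf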
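
The schedule above could be driven from cron by one small script. This is a sketch under a few assumptions: it uses the coreseek install paths from the earlier steps, the function names and the main/delta arguments are my own, and --rotate is passed because searchd is serving the indexes.

```shell
#!/bin/sh
# Paths from the install steps in these notes; override via environment if yours differ.
INDEXER=${INDEXER:-/usr/local/webserver/coreseek/bin/indexer}
CONF=${CONF:-/usr/local/webserver/coreseek/etc/csft.conf}

# Rebuild the main index (intended to run once a day).
rebuild_main() {
    "$INDEXER" library --rotate --config "$CONF"
}

# Rebuild the delta index, then merge it into the main index
# (intended to run every 10 minutes). The merge is skipped if
# the delta build fails.
rebuild_delta() {
    "$INDEXER" delta --rotate --config "$CONF" &&
    "$INDEXER" --merge library delta --rotate --config "$CONF"
}

# Dispatch on the first argument so cron can call one script two ways.
case "${1:-}" in
    main)  rebuild_main ;;
    delta) rebuild_delta ;;
esac
```

Saved as, say, /usr/local/bin/reindex.sh (a hypothetical path), the crontab entries would look like "30 3 * * * /usr/local/bin/reindex.sh main" and "*/10 * * * * /usr/local/bin/reindex.sh delta". searchd picks up rotated files asynchronously, so a short pause between the delta build and the merge may turn out to be necessary; I have not verified that here.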
 
With that, the coreseek service is set up on Linux. The next part calls it through the API.
 
 

2. Call the API to run full-text searches

      PHP API documentation: http://docs.php.net/manual/zh/book.sphinx.php
 
    // sphinx test
    require WEIBO_ROOT . 'source/class/class_sphinx.php';
    $cl = new SphinxClient();
    $sphinx = getglobal('config/sphinx');
    $cl->SetServer($sphinx['host'], $sphinx['port']);
    $cl->SetFilter("id", array(27, 28), true);   // id is neither 27 nor 28
    $cl->SetFilter("tid", array(68, 69));        // tid is 68 or 69
    $cl->SetFilter("status", array(0));
    $cl->SetFilter("top_status", array(2));
    $cl->SetFilterRange('pubtime', 0, $_G['timestamp']);   // pubtime range
    // SetSortMode keeps only the most recent call, so both sort keys go into one extended-mode clause:
    // top_status descending, then totalnum ascending
    $cl->SetSortMode(SPH_SORT_EXTENDED, "top_status DESC, totalnum ASC");
    // paging (offset, limit); an optional third argument sets the maximum number of matches (default 1000),
    // see http://docs.php.net/manual/zh/sphinxclient.setlimits.php
    $cl->SetLimits(0, 5);
    $res = $cl->Query('我', "library");
    echo '<pre>';
    print_r($res); exit;
 
    
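    In the result, total is the number of matches and matches holds the matched documents. The cid values collected from matches can then be used to fetch the full records from the database.

    By default an empty keyword returns no results. To list everything, switch the match mode to SPH_MATCH_FULLSCAN:
    $cl->SetMatchMode(SPH_MATCH_FULLSCAN);
    $res = $cl->Query('', "library");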
    
     
 
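    That covers basic use of coreseek; most of the work is creating the indexes and keeping them updated on a schedule. The bundled word segmentation is generally good enough. If you need better results you would probably have to modify its dictionary,
    which I have not looked into yet.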