Using Chinese and pinyin analyzers in Elasticsearch, and defining a custom analyzer


1. Download the analyzers from GitHub

The releases pages linked below have prebuilt packages. After downloading, create two directories, ik and pinyin, under the plugins/ directory of your ES installation and unzip the packages into them; restart ES and they take effect. The README of each project has usage instructions. Make sure to download the release matching your ES version: my ES is 5.6.16, so I downloaded the 5.6.16 analyzers as well.

IK Chinese analyzer: https://github.com/medcl/elasticsearch-analysis-ik/releases

pinyin analyzer: https://github.com/medcl/elasticsearch-analysis-pinyin/releases

You can also download the source and build the package yourself with mvn, but that is very slow: building the pinyin package took me over two hours, which may also have had to do with my network (no proxy).
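
After restarting, a quick way to check that both plugins were picked up (assuming the default port) is the _cat API, which should list analysis-ik and analysis-pinyin:

    GET http://localhost:9200/_cat/plugins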

2. Using the analyzers

Once the zips are unpacked and ES restarted, the analyzers are ready to use. An analyzer belongs to an index, so when testing one you have to specify which index to run it against.

ik_smart: performs the coarsest-grained segmentation; for example, it splits “中华人民共和国国歌” into “中华人民共和国” and “国歌”. Suitable for phrase queries.

    GET http://localhost:9200/user_index/_analyze?analyzer=ik_smart&text=张三李四

Response:

    {
      "tokens": [
        {
          "token": "张三李四",
          "start_offset": 0,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 0
        }
      ]
    }
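
Because ik_smart emits coarse, non-overlapping tokens, it pairs naturally with phrase matching; a minimal sketch (the name.ik field here is hypothetical, standing for any text field analyzed with ik):

    POST http://localhost:9200/user_index/_search
    {
      "query": {
        "match_phrase": {
          "name.ik": "张三李四"
        }
      }
    }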

ik_max_word: performs the finest-grained segmentation; for example, it splits “中华人民共和国国歌” into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination. Suitable for term queries.

    GET http://localhost:9200/user_index/_analyze?analyzer=ik_max_word&text=张三李四

Response:

    {
      "tokens": [
        {
          "token": "张三李四",
          "start_offset": 0,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "张三",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "三",
          "start_offset": 1,
          "end_offset": 2,
          "type": "TYPE_CNUM",
          "position": 2
        },
        {
          "token": "李四",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "四",
          "start_offset": 3,
          "end_offset": 4,
          "type": "TYPE_CNUM",
          "position": 4
        }
      ]
    }

The pinyin analyzer, with its default options:

    GET http://localhost:9200/user_index/_analyze?analyzer=pinyin&text=张三李四

Response:

    {
      "tokens": [
        {
          "token": "zhang",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0
        },
        {
          "token": "zsls",
          "start_offset": 0,
          "end_offset": 4,
          "type": "word",
          "position": 0
        },
        {
          "token": "san",
          "start_offset": 1,
          "end_offset": 2,
          "type": "word",
          "position": 1
        },
        {
          "token": "li",
          "start_offset": 2,
          "end_offset": 3,
          "type": "word",
          "position": 2
        },
        {
          "token": "si",
          "start_offset": 3,
          "end_offset": 4,
          "type": "word",
          "position": 3
        }
      ]
    }

3. Custom analyzer: combining ik and pinyin

The IK analyzer does not seem to have any configurable options; you simply use it as-is.

The pinyin analyzer, by contrast, has many options you can set. Out of the box it produces the joined full pinyin, the joined first letters, and the full pinyin of every individual character; the per-character pinyin strikes me as too fine-grained to be useful. What I want is for the terms produced by the Chinese analyzer to then be analyzed into pinyin. That can be done as follows: use the ik_max_word Chinese tokenizer as the tokenizer, and run a pinyin token filter over its output as the final step.

    {
      "index": {
        "number_of_replicas": "0",
        "number_of_shards": "1",
        "analysis": {
          "analyzer": {
            "ik_pinyin_analyzer": {
              "tokenizer": "my_ik_pinyin",
              "filter": "pinyin_first_letter_and_full_pinyin_filter"
            },
            "pinyin_analyzer": {
              "tokenizer": "my_pinyin"
            }
          },
          "tokenizer": {
            "my_ik_pinyin": {
              "type": "ik_max_word"
            },
            "my_pinyin": {
              "type": "pinyin",
              "keep_first_letter": true,
              "keep_separate_first_letter": false,
              "keep_full_pinyin": false,
              "keep_joined_full_pinyin": true,
              "keep_none_chinese": true,
              "none_chinese_pinyin_tokenize": false,
              "keep_none_chinese_in_joined_full_pinyin": true,
              "keep_original": false,
              "limit_first_letter_length": 16,
              "lowercase": true,
              "trim_whitespace": true,
              "remove_duplicated_term": true
            }
          },
          "filter": {
            "pinyin_first_letter_and_full_pinyin_filter": {
              "type": "pinyin",
              "keep_first_letter": true,
              "keep_separate_first_letter": false,
              "keep_full_pinyin": false,
              "keep_joined_full_pinyin": true,
              "keep_none_chinese": true,
              "none_chinese_pinyin_tokenize": false,
              "keep_none_chinese_in_joined_full_pinyin": true,
              "keep_original": false,
              "limit_first_letter_length": 16,
              "lowercase": true,
              "trim_whitespace": true,
              "remove_duplicated_term": true
            }
          }
        }
      }
    }
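
These settings take effect when an index is created with them. In section 4 they are loaded from settings.json via Spring's @Setting, but you can also create the index by hand, sending the JSON above as the "settings" of a create-index request (a sketch, assuming the drug_index used below):

    PUT http://localhost:9200/drug_index
    {
      "settings": { ...the JSON above... }
    }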

Let's test it:

    GET http://localhost:9200/drug_index/_analyze?analyzer=ik_pinyin_analyzer&text=阿莫西林胶囊

The tokens returned are the ik_max_word segmentation of the Chinese text, each further analyzed according to the pinyin rules.

    {
      "tokens": [
        {
          "token": "amoxilin",
          "start_offset": 0,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "amxl",
          "start_offset": 0,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "moxi",
          "start_offset": 1,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "mx",
          "start_offset": 1,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "xilin",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "xl",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "jiaonang",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "jn",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 3
        }
      ]
    }
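
For comparison, the second analyzer defined above, pinyin_analyzer, runs my_pinyin over the whole string. Given its options (keep_joined_full_pinyin and keep_first_letter on, keep_full_pinyin off), it should emit roughly the joined full pinyin and the joined first letters of the entire name, i.e. tokens along the lines of amoxilinjiaonang and amxljn (not verified here):

    GET http://localhost:9200/drug_index/_analyze?analyzer=pinyin_analyzer&text=阿莫西林胶囊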

4. Testing from code

    package com.boot.es.model;

    import lombok.Data;
    import org.springframework.data.annotation.Id;
    import org.springframework.data.elasticsearch.annotations.Document;
    import org.springframework.data.elasticsearch.annotations.Field;
    import org.springframework.data.elasticsearch.annotations.FieldType;
    import org.springframework.data.elasticsearch.annotations.InnerField;
    import org.springframework.data.elasticsearch.annotations.MultiField;
    import org.springframework.data.elasticsearch.annotations.Setting;

    /**
     * Author: susq
     * Date: 2019-06-30 10:12
     */
    @Data
    @Document(indexName = "drug_index", type = "drug")
    @Setting(settingPath = "settings.json") // the analysis settings JSON from section 3
    public class Drug {

        @Id
        private Long id;

        @Field(type = FieldType.Keyword)
        private String price;

        // keyword main field, plus ik / ik+pinyin / pinyin sub-fields
        @MultiField(
                mainField = @Field(type = FieldType.Keyword),
                otherFields = {
                        @InnerField(type = FieldType.Text, suffix = "ik", analyzer = "ik_max_word", searchAnalyzer = "ik_max_word"),
                        @InnerField(type = FieldType.Text, suffix = "ik_pinyin", analyzer = "ik_pinyin_analyzer", searchAnalyzer = "ik_pinyin_analyzer"),
                        @InnerField(type = FieldType.Text, suffix = "pinyin", analyzer = "pinyin_analyzer", searchAnalyzer = "pinyin_analyzer")
                })
        private String name;

        @MultiField(
                mainField = @Field(type = FieldType.Keyword),
                otherFields = {
                        @InnerField(type = FieldType.Text, suffix = "ik", analyzer = "ik_max_word", searchAnalyzer = "ik_smart"),
                        @InnerField(type = FieldType.Text, suffix = "ik_pinyin", analyzer = "ik_pinyin_analyzer", searchAnalyzer = "ik_pinyin_analyzer"),
                        @InnerField(type = FieldType.Text, suffix = "pinyin", analyzer = "pinyin_analyzer", searchAnalyzer = "pinyin_analyzer")
                })
        private String effect;
    }
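
The tests below use a drugRepository, which the original listing does not define; a standard Spring Data repository like the following would fit (an assumed sketch):

    package com.boot.es.repository;

    import com.boot.es.model.Drug;
    import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

    // Assumed definition: the tests use drugRepository without showing it
    public interface DrugRepository extends ElasticsearchRepository<Drug, Long> {
    }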
    // From the test class (assumed context): drugRepository is the repository
    // sketched above, log is a Lombok @Slf4j logger, Lists is Guava's.
    @Test
    public void drugSaveTest() {
        Drug drug = new Drug();
        drug.setId(1L);
        drug.setName("阿莫西林胶囊");
        drug.setPrice("10");
        drug.setEffect("阿莫西林适用于敏感菌(不产β内酰胺酶菌株)所致的感染");

        Drug drug1 = new Drug();
        drug1.setId(3L);
        drug1.setName("阿莫西林");
        drug1.setPrice("10");
        drug1.setEffect("阿莫西林适用于敏感菌(不产β内酰胺酶菌株)所致的感染");

        Drug drug2 = new Drug();
        drug2.setId(2L);
        drug2.setName("999感冒灵颗粒");
        drug2.setPrice("20");
        drug2.setEffect("本品解热镇痛。用于感冒引起的头痛,发热,鼻塞,流涕,咽痛等");

        drugRepository.saveAll(Lists.newArrayList(drug, drug1, drug2));
        List<Drug> drugs = Lists.newArrayList(drugRepository.findAll());
        log.info("saved drugs: {}", drugs);
    }
    /**
     * Here name without a suffix is the Keyword type, i.e. not analyzed; if it
     * matches, it is an exact match and should score higher, so its boost is
     * set to twice that of the analyzed match query.
     */
    @Test
    public void drugIkSearchTest() {
        NativeSearchQueryBuilder builder = new NativeSearchQueryBuilder();
        NativeSearchQuery query = builder.withQuery(QueryBuilders.boolQuery()
                // boost the exact keyword match over the ik-analyzed one
                .should(QueryBuilders.matchQuery("name", "阿莫西林").boost(2))
                .should(QueryBuilders.matchQuery("name.ik", "阿莫西林").boost(1)))
                .build();
        log.info("DSL:{}", query.getQuery().toString());
        Iterable<Drug> iterable = drugRepository.search(query);
        List<Drug> drugs = Lists.newArrayList(iterable);
        log.info("result: {}", drugs);
    }

    /**
     * name.pinyin only produces the joined full pinyin of the whole name plus
     * the joined first letters, so a match there is effectively an exact match
     * and should score higher.
     */
    @Test
    public void drugPinyinSearchTest() {
        NativeSearchQueryBuilder builder = new NativeSearchQueryBuilder();
        NativeSearchQuery query = builder.withQuery(QueryBuilders.boolQuery()
                .should(QueryBuilders.matchQuery("name.ik_pinyin", "阿莫西林").boost(1))
                .should(QueryBuilders.matchQuery("name.pinyin", "阿莫西林").boost(2)))
                .withSort(SortBuilders.scoreSort())
                .build();
        log.info("DSL:{}", query.getQuery().toString());
        Iterable<Drug> iterable = drugRepository.search(query);
        List<Drug> drugs = Lists.newArrayList(iterable);
        log.info("result: {}", drugs);
    }
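
What the pinyin sub-fields buy you is lookup by pinyin input. A sketch under the analyzers defined above (the query string amxl, the first letters of 阿莫西林, matches the amxl token that ik_pinyin_analyzer produced earlier):

    @Test
    public void drugPinyinInputSearchTest() {
        // Search with pinyin initials instead of Chinese; the search analyzer
        // keeps "amxl" as a single token, matching the first-letter tokens
        // that were indexed for 阿莫西林.
        NativeSearchQuery query = new NativeSearchQueryBuilder()
                .withQuery(QueryBuilders.matchQuery("name.ik_pinyin", "amxl"))
                .build();
        log.info("result: {}", Lists.newArrayList(drugRepository.search(query)));
    }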
