Custom Analyzers
In Elasticsearch, a single field can be mapped with multiple sub-field types, each with its own analyzer.
An Elasticsearch analyzer consists of three parts: character filters, which transform the raw text before tokenization; a tokenizer, which splits the text into tokens according to some rule; and token filters, which post-process the resulting tokens.
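The three-stage pipeline can be sketched in plain Python (a minimal illustration of the flow, not Elasticsearch's actual implementation; the stages chosen here mirror the `mapping` char filter, `whitespace` tokenizer, and `lowercase` token filter used later in this post):

```python
def analyze(text):
    # Character filter stage: transform the raw text before tokenization
    # (here: a "- => _" mapping, like the mapping char filter).
    text = text.replace("-", "_")
    # Tokenizer stage: split text into tokens (like the whitespace tokenizer).
    tokens = text.split()
    # Token filter stage: post-process tokens (like the lowercase filter).
    return [t.lower() for t in tokens]

print(analyze("Foo-Bar baz"))  # ['foo_bar', 'baz']
```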
1 Tokenize with the keyword tokenizer (output equals input), stripping HTML before tokenization
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "hello world"
}
2 Tokenize on the directory separator, producing one token per directory level
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/java/elasticsearch"
}
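The `path_hierarchy` tokenizer emits one token for each level of the path. A rough Python equivalent (assuming `/` as the separator; illustrative only):

```python
def path_hierarchy(path, sep="/"):
    # Emit one token per directory level, like the path_hierarchy tokenizer.
    parts = [p for p in path.split(sep) if p]
    return [sep + sep.join(parts[:i]) for i in range(1, len(parts) + 1)]

print(path_hierarchy("/usr/local/java/elasticsearch"))
# ['/usr', '/usr/local', '/usr/local/java', '/usr/local/java/elasticsearch']
```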
3 Split into words, converting - to _ before tokenization
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test, test-990 650-555-1234"
}
4 Replace emoticons with words
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am feeling :)", "feeling :("]
}
5 After tokenization, remove stop words and apply snowball stemming
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["stop", "snowball"],
  "text": ["The girls in China are playing this games!"]
}
6 The same stop-word removal and snowball stemming, with the whitespace tokenizer
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop", "snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
7 After lowercasing, "The" is removed as a stop word
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": ["The girls in China are playing this games!"]
}
8 Regular-expression character filter
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
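The `pattern_replace` char filter applies the regex substitution before tokenization. The same substitution in plain Python, assuming the intent is to strip the `http://` scheme and keep only the capture group:

```python
import re

def strip_scheme(text):
    # Keep only the capture group, dropping the "http://" prefix,
    # like a pattern_replace char filter with replacement "$1".
    return re.sub(r"http://(.*)", r"\1", text)

print(strip_scheme("http://www.elastic.co"))  # www.elastic.co
```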
9 Custom analyzer
To define multiple custom analyzers, add multiple child nodes under the analyzer node; each child node defines one analyzer.
DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[.,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
Test the custom analyzer:
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "i am a :) person , and you ?"
}
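A rough Python walk-through of what my_custom_analyzer does to this test text (an illustration of the stage order only, not Elasticsearch's exact output; the stop list here is a small illustrative subset, and note the pattern tokenizer produces whole phrases between punctuation marks, so the stop filter rarely matches them):

```python
import re

def my_custom_analyzer(text):
    # char_filter "emoticons": map ":)" to "happy" before tokenization
    text = text.replace(":)", "happy")
    # tokenizer "punctuation": split on [.,!?], like the pattern tokenizer
    tokens = [t.strip() for t in re.split(r"[.,!?]", text) if t.strip()]
    # token filters: lowercase, then drop stop-word tokens
    stop = {"and", "a", "i", "am"}  # illustrative subset of _english_
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stop]

print(my_custom_analyzer("i am a :) person , and you ?"))
# ['i am a happy person', 'and you']
```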