Custom Analyzers
In Elasticsearch, a single field can be mapped with multiple sub-field types, each with its own analyzer.
An Elasticsearch analyzer consists of three parts: character filters, which transform the raw text before tokenization; a tokenizer, which splits the text into tokens according to some rule; and token filters, which post-process the resulting tokens.
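The three-stage pipeline can be sketched in plain Python (a minimal illustration of the flow, not Elasticsearch's actual implementation; the stages chosen here mirror the `mapping` char filter, `whitespace` tokenizer, and `lowercase` token filter used later in this post):

```python
def analyze(text):
    # Character filter stage: transform the raw text before tokenization
    # (here: a "- => _" mapping, like the mapping char filter).
    text = text.replace("-", "_")
    # Tokenizer stage: split text into tokens (like the whitespace tokenizer).
    tokens = text.split()
    # Token filter stage: post-process tokens (like the lowercase filter).
    return [t.lower() for t in tokens]

print(analyze("Foo-Bar baz"))  # ['foo_bar', 'baz']
```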
1 Tokenize with the keyword tokenizer (output equals input), stripping HTML before tokenization
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "hello world"
}
2 Tokenize on the directory separator, producing one token per directory level
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/java/elasticsearch"
}
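The `path_hierarchy` tokenizer emits one token for each level of the path. A rough Python equivalent (assuming `/` as the separator; illustrative only):

```python
def path_hierarchy(path, sep="/"):
    # Emit one token per directory level, like the path_hierarchy tokenizer.
    parts = [p for p in path.split(sep) if p]
    return [sep + sep.join(parts[:i]) for i in range(1, len(parts) + 1)]

print(path_hierarchy("/usr/local/java/elasticsearch"))
# ['/usr', '/usr/local', '/usr/local/java', '/usr/local/java/elasticsearch']
```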
3 Split into words, converting - to _ before tokenization
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test, test-990 650-555-1234"
}
4 Replace emoticons with words
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am feeling :)", "feeling :("]
}
5 After tokenization, remove stop words and apply snowball stemming
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["stop", "snowball"],
  "text": ["The girls in China are playing this games!"]
}
6 The same stop-word removal and snowball stemming, with the whitespace tokenizer
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop", "snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
7 After lowercasing, "The" is removed as a stop word
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": ["The girls in China are playing this games!"]
}
8 Regular-expression character filter
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
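The `pattern_replace` char filter applies the regex substitution before tokenization. The same substitution in plain Python, assuming the intent is to strip the `http://` scheme and keep only the capture group:

```python
import re

def strip_scheme(text):
    # Keep only the capture group, dropping the "http://" prefix,
    # like a pattern_replace char filter with replacement "$1".
    return re.sub(r"http://(.*)", r"\1", text)

print(strip_scheme("http://www.elastic.co"))  # www.elastic.co
```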
9 Custom analyzer
To define multiple custom analyzers, add multiple child nodes under the analyzer node; each child node defines one analyzer.
DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[.,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
Test the custom analyzer:
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "i am a :) person , and you ?"
}
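A rough Python walk-through of what my_custom_analyzer does to this test text (an illustration of the stage order only, not Elasticsearch's exact output; the stop list here is a small illustrative subset, and note the pattern tokenizer produces whole phrases between punctuation marks, so the stop filter rarely matches them):

```python
import re

def my_custom_analyzer(text):
    # char_filter "emoticons": map ":)" to "happy" before tokenization
    text = text.replace(":)", "happy")
    # tokenizer "punctuation": split on [.,!?], like the pattern tokenizer
    tokens = [t.strip() for t in re.split(r"[.,!?]", text) if t.strip()]
    # token filters: lowercase, then drop stop-word tokens
    stop = {"and", "a", "i", "am"}  # illustrative subset of _english_
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stop]

print(my_custom_analyzer("i am a :) person , and you ?"))
# ['i am a happy person', 'and you']
```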