Multi-Fields and Custom Analyzers

Multi-Fields

  • Add a keyword sub-field to support exact matching (a mapping sketch follows this list)
  • Use a different analyzer per sub-field
    • Different languages
    • Pinyin search: a sub-field analyzed with a pinyin tokenizer
    • Specify different analyzers for indexing and for search
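
A minimal mapping sketch of these points (the users index, the name field, and the query value are made up; search_analyzer just shows where a different search-time analyzer would go, and the pinyin sub-field assumes the elasticsearch-analysis-pinyin plugin is installed):

PUT users
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard",
        "fields": {
          "keyword": { "type": "keyword" },
          "pinyin": { "type": "text", "analyzer": "pinyin" }
        }
      }
    }
  }
}

# Exact match against the keyword sub-field
POST users/_search
{
  "query": { "term": { "name.keyword": "Apple store" } }
}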

Exact Values and Full Text

  • Exact values: numbers, dates, or a specific string (e.g. "Apple store"); the keyword type in Elasticsearch
  • Full text: unstructured text data; the text type in Elasticsearch, which is analyzed into terms

Exact values need no special analysis at indexing time; the whole value is indexed as a single term (compare the two analyzers below).

https://i.bmp.ovh/imgs/2022/06/04/d18a8156fbc8d4f3.png
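
To see the difference, compare how the keyword and standard analyzers treat the same string:

# keyword analyzer: the whole value stays as a single term "Apple store"
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple store"
}

# standard analyzer: the value is split into the terms "apple" and "store"
POST _analyze
{
  "analyzer": "standard",
  "text": "Apple store"
}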

Custom Analysis

  • When the built-in analyzers do not meet your needs, you can define a custom analyzer by combining different components (a combined sketch follows this list):
    • Character Filter
    • Tokenizer
    • Token Filter
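
A sketch of how the three components combine in index settings (the index name my_index and the emoticons/my_analyzer names are made up):

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ ":) => happy", ":( => sad" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "emoticons" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}

# Test the custom analyzer
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I am :) today"
}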
Character Filter
  • Processes the text before the Tokenizer, e.g. adding, removing, or replacing characters. Multiple Character Filters can be configured. They can affect the position and offset information seen by the Tokenizer.

  • Some built-in Character Filters:

    • HTML strip: removes HTML tags
    • Mapping: string replacement
    • Pattern replace: regex match-and-replace
Tokenizer
  • Splits the raw text into terms (tokens) according to certain rules
  • Built-in Tokenizers:
    • whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy (a uax_url_email sketch follows this list)
  • You can also implement your own Tokenizer as a Java plugin
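
For example, uax_url_email keeps URLs and email addresses as single tokens where standard would split them (the address and URL below are just sample input):

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Write to ymruan@example.com or visit https://www.elastic.co"
}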
Token Filters
  • Add, remove, or modify the terms (tokens) produced by the Tokenizer
  • Some built-in Token Filters:
    • lowercase / stop / synonym (adds synonyms; a sketch follows the examples below)
# Index a sample document and inspect the dynamically generated mapping
PUT logs/_doc/1
{"level":"DEBUG"}

GET /logs/_mapping

# Process data scraped by a web crawler: strip HTML tags
POST _analyze
{
  "tokenizer":"keyword",
  "char_filter":["html_strip"],
  "text": "<b>hello world</b>"
}

# Tokenize a file path
POST _analyze
{
  "tokenizer":"path_hierarchy",
  "text":"/user/ymruan/a/b/c/d/e"
}



# Use a char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ "- => _"]
      }
    ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

# Use a char filter to replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ ":) => happy", ":( => sad"]
      }
    ],
    "text": ["I am felling :)", "Feeling :( today"]
}

# Regular expression replacement
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "pattern_replace",
        "pattern" : "http://(.*)",
        "replacement" : "$1"
      }
    ],
    "text" : "http://www.elastic.co"
}



# whitespace tokenizer with stop and snowball filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The gilrs in China are playing this game!"]
}



# whitespace with stop: the lowercase "the" is removed, but the capitalized "The" is kept
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}


# After adding lowercase, "The" is treated as a stopword and removed
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase","stop","snowball"],
  "text": ["The gilrs in China are playing this game!"]
}
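
The synonym filter listed above is not demonstrated; a minimal sketch using an inline synonym definition (the happy/glad pair is made up):

# Equivalent synonyms: analyzing "happy" emits both "happy" and "glad"
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "synonym",
      "synonyms": [ "happy, glad" ]
    }
  ],
  "text": "I am happy today"
}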