Elasticsearch Study Notes 35: Index Management

Basic index operations

Creating an index

PUT /{index}
{
  "settings": {},
  "mappings": {
    "properties": {}
  }
}

Example of creating an index:

PUT /test_index
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "fields": {
          "keyword": {"type": "keyword"}
        }
      },
      "ctime": {"type": "date"}
    }
  }
}

Modifying an index

PUT /{index}/_settings
{
  "settings": {}
}

Example:

PUT /test_index/_settings
{
  "settings": {"number_of_replicas": 2}
}
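
Not every setting can be updated on a live index this way. As a rough illustration, the Python sketch below separates number_of_replicas, which can be changed at any time, from number_of_shards, which is fixed when the index is created. The helper name split_settings_update and the one-entry STATIC_SETTINGS set are hypothetical and only illustrative; the real set of static settings is larger.

```python
from typing import Dict, Tuple

# Illustrative subset only: settings that are fixed at index creation time.
STATIC_SETTINGS = {"number_of_shards"}


def split_settings_update(requested: Dict) -> Tuple[Dict, Dict]:
    """Split a requested settings update into (applicable, rejected) parts."""
    applicable = {k: v for k, v in requested.items() if k not in STATIC_SETTINGS}
    rejected = {k: v for k, v in requested.items() if k in STATIC_SETTINGS}
    return applicable, rejected


if __name__ == "__main__":
    ok, bad = split_settings_update({"number_of_replicas": 2, "number_of_shards": 10})
    print("can be sent to _settings:", ok)
    print("requires a new index:", bad)
```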

Deleting an index

DELETE /{index}

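The delete index API can also be applied to several indices at once by passing a comma-separated list, or to all indices by using _all or * as the index name (be careful!).

To disallow deleting indices through wildcards or _all, set action.destructive_requires_name to true in the configuration. This setting can also be changed through the cluster update settings API.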
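
The same idea can be applied on the client side before a DELETE is ever sent. The sketch below is a hypothetical guard written in Python; the function name needs_explicit_name is made up for illustration, and it only approximates what the server checks when action.destructive_requires_name is true.

```python
def needs_explicit_name(index_expression: str) -> bool:
    """Return True if any comma-separated target is _all or contains a wildcard."""
    targets = [t.strip() for t in index_expression.split(",")]
    return any(t == "_all" or "*" in t for t in targets)


if __name__ == "__main__":
    for expr in ("test_index", "test_index,my_index", "logs-*", "_all"):
        verdict = "refuse" if needs_explicit_name(expr) else "allow"
        print(f"DELETE /{expr}: {verdict}")
```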

Changing analyzers and defining your own

Elasticsearch ships with a range of built-in analyzers that can be used in any index without further configuration:

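standard analyzer:
Splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
simple analyzer:
Splits text into terms whenever it encounters a character that is not a letter, then lowercases every term.
whitespace analyzer:
Splits text into terms whenever it encounters a whitespace character. It does not lowercase terms.
stop analyzer:
Like the simple analyzer, but also supports removing stop words.
keyword analyzer:
A "no-op" analyzer that accepts whatever text it is given and outputs exactly the same text as a single term. The text is not tokenized, which suits exact matching.
pattern analyzer:
Uses a regular expression to split the text into terms. It supports lowercasing and stop words.
language analyzers:
Elasticsearch provides many language-specific analyzers, such as english or french.
fingerprint analyzer:
A specialist analyzer that creates a fingerprint of the text, which can be used for duplicate detection.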
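
To make the differences concrete, here is a rough Python approximation of three of these analyzers. It is only a sketch: the function names are invented for illustration, and the regular expression stands in for "letters" loosely rather than reproducing Lucene's exact character classes.

```python
import re
from typing import List


def simple_analyze(text: str) -> List[str]:
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def whitespace_analyze(text: str) -> List[str]:
    """Approximate the whitespace analyzer: split on whitespace, keep case."""
    return text.split()


def keyword_analyze(text: str) -> List[str]:
    """Approximate the keyword analyzer: the whole input is one term."""
    return [text]


if __name__ == "__main__":
    sample = "The 2 QUICK Brown-Foxes"
    print(simple_analyze(sample))
    print(whitespace_analyze(sample))
    print(keyword_analyze(sample))
```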

Changing analyzer settings

Enabling the English stop-word token filter

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "the",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
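
Note that the surviving tokens keep their original positions (1 and 5), so the gaps left by removed stop words remain visible. The Python sketch below imitates that behaviour under simplifying assumptions: a \w+ regular expression stands in for the standard tokenizer, and the stop-word set is written out from memory of Lucene's default English list, so treat it as illustrative rather than authoritative.

```python
import re
from typing import Dict, FrozenSet, List

# Assumed to match the _english_ list; illustrative only.
ENGLISH_STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def analyze_with_stopwords(
    text: str, stopwords: FrozenSet[str] = ENGLISH_STOPWORDS
) -> List[Dict]:
    """Tokenize, lowercase, and drop stop words while keeping positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        if term in stopwords:
            continue  # the position counter still advances, leaving a gap
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    for token in analyze_with_stopwords("a dog is in the house"):
        print(token)
```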

Building a custom analyzer

PUT /test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=>and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
GET /test_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
{
  "tokens" : [
    {
      "token" : "tomandjerry",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "friend",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "in",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "haha",
      "start_offset" : 42,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
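
The custom analyzer runs three stages in order: character filters, then the tokenizer, then token filters. The Python sketch below mirrors that order for the request above. It is an approximation under stated assumptions: a crude tag-stripping regular expression stands in for html_strip, and \w+ stands in for the standard tokenizer; the function name my_analyzer simply echoes the analyzer name in the settings.

```python
import re
from typing import List

MY_STOPWORDS = {"the", "a"}


def my_analyzer(text: str) -> List[str]:
    """Imitate char_filter -> tokenizer -> filter for the analyzer above."""
    # Character filters: strip HTML-like tags, then map "&" to "and".
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.replace("&", "and")
    # Tokenizer: a simplified stand-in for the standard tokenizer.
    tokens = re.findall(r"\w+", text)
    # Token filters: lowercase, then remove the custom stop words.
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in MY_STOPWORDS]


if __name__ == "__main__":
    print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
```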

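Customizing your own dynamic mapping strategy

The dynamic parameter

true: unknown fields are dynamically mapped
false: unknown fields are ignored (they are not indexed, although they remain in _source)
strict: unknown fields cause an error
Example: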

PUT /test_index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": {"type": "text"},
      "address": {
        "type": "object",
        "dynamic": "true"
      }
    }
  }
}
PUT /test_index/_doc/1
{
  "title": "my article",
  "content": "this is my article",
  "address": {
    "province": "guangdong",
    "city": "guangzhou"
  }
}
{
  "error": {
    "root_cause": [
      {
        "type": "strict_dynamic_mapping_exception",
        "reason": "mapping set to strict, dynamic introduction of [content] within [_doc] is not allowed"
      }
    ],
    "type": "strict_dynamic_mapping_exception",
    "reason": "mapping set to strict, dynamic introduction of [content] within [_doc] is not allowed"
  },
  "status": 400
}
PUT /test_index/_doc/1
{
  "title": "my article",
  "address": {
    "province": "guangdong",
    "city": "guangzhou"
  }
}
{
  "_index" : "test_index",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
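
The behaviour shown above, where the top level is strict but the address object accepts new fields, can be modelled in a few lines of Python. This is a simplified sketch, not how Elasticsearch implements it: the names apply_dynamic and StrictDynamicMappingError are hypothetical, every new scalar field is typed as text, and the mapping is mutated in place.

```python
from typing import Dict


class StrictDynamicMappingError(Exception):
    """Raised when a strict mapping sees an unknown field."""


def apply_dynamic(mapping: Dict, doc: Dict, inherited: str = "true", path: str = "") -> None:
    """Walk a document and apply the dynamic setting level by level."""
    # An object without its own "dynamic" value inherits the parent's.
    mode = str(mapping.get("dynamic", inherited)).lower()
    props = mapping.setdefault("properties", {})
    for field, value in doc.items():
        full_name = path + field
        if field not in props:
            if mode == "strict":
                raise StrictDynamicMappingError(
                    f"dynamic introduction of [{full_name}] is not allowed"
                )
            if mode == "false":
                continue  # unknown field is ignored by the mapping
            # mode == "true": add the field (simplified type inference)
            props[field] = {"type": "object"} if isinstance(value, dict) else {"type": "text"}
        if isinstance(value, dict):
            apply_dynamic(props[field], value, mode, full_name + ".")


if __name__ == "__main__":
    mapping = {
        "dynamic": "strict",
        "properties": {
            "title": {"type": "text"},
            "address": {"type": "object", "dynamic": "true"},
        },
    }
    apply_dynamic(mapping, {"title": "my article", "address": {"city": "guangzhou"}})
    print(mapping["properties"]["address"])
```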

date_detection

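By default Elasticsearch recognizes dates written in certain formats, such as yyyy-MM-dd. If the first value that arrives for a field looks like 2017-01-01, the field is dynamically mapped as date, and a later value such as "hello world" is then rejected with an error. One fix is to turn off date_detection in the index mappings and, where needed, explicitly map individual fields as date yourself.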

PUT /{index}
{
  "mappings": {"date_detection": false}
}
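
The effect of the flag can be sketched in Python as a type-inference step applied to the first string value seen for a field. This is a deliberately narrow model: it only tries ISO dates via date.fromisoformat, whereas Elasticsearch checks its configured dynamic date formats, and the function name infer_string_field_type is hypothetical.

```python
from datetime import date


def infer_string_field_type(first_value: str, date_detection: bool = True) -> str:
    """Guess the mapping type for a new string field from its first value."""
    if date_detection:
        try:
            date.fromisoformat(first_value)
            return "date"
        except ValueError:
            pass  # not a date, fall through to text
    return "text"


if __name__ == "__main__":
    print(infer_string_field_type("2017-01-01"))
    print(infer_string_field_type("2017-01-01", date_detection=False))
```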

dynamic template

"dynamic_templates": [
    {
        "my_template_name": {
            ... match conditions ...
            "mapping": {...}
        }
    }
]

Example:

PUT /test_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "en": {
          "match": "*_en",
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }  
    ]
  }
}
PUT /test_index/_doc/1
{"title": "this is my first article"}
PUT /test_index/_doc/2
{"title_en": "this is my first article"}
GET /test_index/_mapping
{
  "test_index" : {
    "mappings" : {
      "dynamic_templates" : [
        {
          "en" : {
            "match" : "*_en",
            "match_mapping_type" : "string",
            "mapping" : {
              "analyzer" : "english",
              "type" : "text"
            }
          }
        }
      ],
      "properties" : {
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "title_en" : {
          "type" : "text",
          "analyzer" : "english"
        }
      }
    }
  }
}
GET /test_index/_search?q=is
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {"title" : "this is my first article"}
      }
    ]
  }
}

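Here title does not match any dynamic template, so it gets the default standard analyzer, which does not remove stop words: is goes into the inverted index, and a search for is finds the document. title_en does match the dynamic template and is analyzed with the english analyzer, which removes stop words such as is, so a search for is does not find that document.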
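
The template selection itself can be pictured as a first-match lookup on the field name. The Python sketch below uses fnmatch to imitate the *_en pattern; this is an approximation, because fnmatch also gives special meaning to ? and [], and the function name mapping_for_new_string_field is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

# What a new string field gets when no template matches.
DEFAULT_STRING_MAPPING = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}


def mapping_for_new_string_field(field_name: str, dynamic_templates: List[Dict]) -> Dict:
    """Return the mapping of the first template whose pattern matches the name."""
    for template in dynamic_templates:
        for _template_name, rule in template.items():
            if rule.get("match_mapping_type", "string") != "string":
                continue
            if fnmatchcase(field_name, rule.get("match", "*")):
                return rule["mapping"]
    return DEFAULT_STRING_MAPPING


if __name__ == "__main__":
    templates = [{
        "en": {
            "match": "*_en",
            "match_mapping_type": "string",
            "mapping": {"type": "text", "analyzer": "english"},
        }
    }]
    print(mapping_for_new_string_field("title", templates))
    print(mapping_for_new_string_field("title_en", templates))
```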

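Zero-downtime reindexing with scroll + bulk + index aliases

1. Rebuilding the index

The mapping of an existing field cannot be changed. To change a field, create a new index with the new mapping, read the data out of the old index in batches, and write it into the new index with the bulk API. The scroll API is recommended for the batch reads, and the reindex can run on several threads concurrently: each scroll queries one date range of the data and is handled by one thread.
(1) At first we rely on dynamic mapping when inserting data, but some values happen to be in a date format such as 2017-01-01, so the title field is automatically mapped as date when it should really be a string (text) field.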

PUT /test_index/_doc/1
{"title": "2017-01-01"}

GET /test_index/_mapping

{
  "test_index" : {
    "mappings" : {
      "properties" : {
        "title" : {"type" : "date"}
      }
    }
  }
}

(2) Later, adding a document whose title is an ordinary string fails with an error

PUT /test_index/_doc/2
{"title": "my first article"}

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [title] of type [date] in document with id'2'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [title] of type [date] in document with id'2'","caused_by": {"type":"illegal_argument_exception","reason":"failed to parse date field [my first article] with format [strict_date_optional_time||epoch_millis]","caused_by": {"type":"date_time_parse_exception","reason":"Failed to parse with all enclosed parsers"}
    }
  },
  "status": 400
}

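(3) At this point the type of title cannot be changed. Sending PUT /test_index again with a new mapping fails because the index already exists, and the mapping update API likewise does not allow changing the type of an existing field.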

PUT /test_index
{
  "mappings": {
    "properties": {
      "title": {"type": "text"}
    }
  }
}
{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [test_index/mZALkQ8IQV67SjCVqkhq4g] already exists",
        "index_uuid": "mZALkQ8IQV67SjCVqkhq4g",
        "index": "test_index"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [test_index/mZALkQ8IQV67SjCVqkhq4g] already exists",
    "index_uuid": "mZALkQ8IQV67SjCVqkhq4g",
    "index": "test_index"
  },
  "status": 400
}

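(4) The only option now is to reindex: create a new index, read the data out of the old index, and import it into the new one.
(5) Suppose the old index is called old_index and the new one new_index, and the client application is already working against old_index. Do we really have to stop the application, change the index it uses to new_index, and start it again?
(6) To avoid that, use an alias. Give the application an alias that points to the old index and let it work through the alias; for now the alias still resolves to the old index.
Format: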

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test",
        "alias": "alias1"
      }
    }
  ]
}

Example:

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test_index",
        "alias": "test_index_alias"
      }
    }
  ]
}

(7) Create a new index in which title is mapped as a string (text) field

PUT /test_index_new
{
  "mappings": {
    "properties": {
      "title": {"type": "text"}
    }
  }
}

(8) Use the scroll API to read the data out in batches

GET /test_index/_search?scroll=1m
{
  "query": {"match_all": {}
  },
  "sort": ["_doc"],
  "size": 1
}
{
  "_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAACz3UWUC1iLVRFdnlRT3lsTXlFY01FaEFwUQ==",
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : null,
        "_source" : {"title" : "2017-01-01"},
        "sort" : [0]
      }
    ]
  }
}
POST /_search/scroll
{
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAC0GYWUC1iLVRFdnlRT3lsTXlFY01FaEFwUQ==",
  "scroll": "1m"
}
{
  "_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAC0GYWUC1iLVRFdnlRT3lsTXlFY01FaEFwUQ==",
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : []}
}

(9) Use the bulk API to write each batch returned by scroll into the new index

POST /_bulk
{"index": {"_index": "test_index_new", "_id": "1"}}
{"title": "2017-01-01"}

GET /test_index_new/_search
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_index_new",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {"title" : "2017-01-01"}
      }
    ]
  }
}
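
The scroll and bulk steps can be tied together in a loop: fetch a page, turn its hits into a bulk body, send it, and stop when a page comes back empty. The Python sketch below shows only that control flow. The three callables (first_page, next_page, send_bulk) are hypothetical placeholders for whatever HTTP client is used, so nothing here talks to a real cluster.

```python
import json
from typing import Callable, Dict, List


def hits_to_bulk_body(hits: List[Dict], target_index: str) -> str:
    """Build a bulk body: one action line and one source line per hit."""
    lines = []
    for hit in hits:
        lines.append(json.dumps({"index": {"_index": target_index, "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    # Each JSON object sits on its own line, and the body ends with a newline.
    return "\n".join(lines) + "\n"


def reindex(
    first_page: Callable[[], Dict],
    next_page: Callable[[str], Dict],
    send_bulk: Callable[[str], None],
    target_index: str,
) -> int:
    """Copy every scrolled hit into target_index; return the number copied."""
    copied = 0
    page = first_page()              # e.g. GET /test_index/_search?scroll=1m
    while page["hits"]["hits"]:      # an empty page means the scroll is done
        hits = page["hits"]["hits"]
        send_bulk(hits_to_bulk_body(hits, target_index))   # e.g. POST /_bulk
        copied += len(hits)
        page = next_page(page["_scroll_id"])               # e.g. POST /_search/scroll
    return copied
```

In a multi-threaded setup, each worker could run this loop over its own date range with its own scroll.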

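(10) Repeat: keep reading batch after batch and bulk-writing each one into the new index.
(11) Switch test_index_alias over to test_index_new. The application immediately uses the data in the new index through the alias, with no downtime, so availability is preserved.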

POST /_aliases
{
    "actions" : [{ "remove" : { "index" : "test_index", "alias" : "test_index_alias"} },
        {"add" : { "index" : "test_index_new", "alias" : "test_index_alias"} }
    ]
}

GET /test_index_alias/_search

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_index_new",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {"title" : "2017-01-01"}
      }
    ]
  }
}

2. Switching indices transparently to clients with an alias

Format:

POST /_aliases
{
    "actions" : [{ "remove" : { "index" : "test1", "alias" : "alias1"} },
        {"add" : { "index" : "test2", "alias" : "alias1"} }
    ]
}

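Note: put the remove and the add in the same actions array so that the switch happens in a single request. The _aliases body is ordinary JSON and may be spread over several lines, as in the earlier add example. The one-object-per-line restriction applies to the _bulk API, whose body fails to parse if a JSON object is broken across lines.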
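
When the switch is scripted, the request body can be generated rather than typed by hand. The Python sketch below builds the two-action body for an alias swap; the function name swap_alias_actions is hypothetical, and sending the result to POST /_aliases is left to whichever HTTP client is in use.

```python
import json
from typing import Dict


def swap_alias_actions(alias: str, old_index: str, new_index: str) -> Dict:
    """Build one _aliases body that moves an alias from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    body = swap_alias_actions("test_index_alias", "test_index", "test_index_new")
    print(json.dumps(body))
```

Keeping both actions in one body means the alias never points at neither index or at both during the switch.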
