In Analysis and analyzers we said that an analyzer is a wrapper that combines three functions into a single package, executed in sequence:
- Character filters
Character filters are used to “tidy up” a string before it is tokenized. For instance, if our text is in HTML format, it will contain HTML tags like <p> or <div> that we don’t want to be indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character: Á. An analyzer may have zero or more character filters.
- Tokenizers
An analyzer must have a single tokenizer. The tokenizer breaks the string up into individual terms or tokens. The standard tokenizer, which is used in the standard analyzer, breaks up a string into individual terms on word boundaries and removes most punctuation, but other tokenizers exist that behave differently.
For instance, the keyword tokenizer outputs exactly the same string as it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer can be used to split text on a matching regular expression.
- Token filters
After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified.
Token filters may change, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but there are many more available in Elasticsearch. Stemming token filters “stem” words to their root form. The asciifolding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete.
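The three stages described above can be imitated in a few lines of plain Python. This is only a rough sketch of the idea, not how Elasticsearch implements analysis: every function name here (strip_html, word_tokenizer, remove_stopwords, fold_to_ascii, analyze) is a hypothetical stand-in, the regular expressions are far cruder than the real html_strip character filter and standard tokenizer, and the stopword list is an arbitrary assumption.

```python
import html
import re
import unicodedata


def strip_html(text):
    """Character filter (stand-in for html_strip): drop tags, decode entities."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(without_tags)


def word_tokenizer(text):
    """Tokenizer (crude stand-in for standard): keep runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "and"})):
    """Token filter: drop tokens found in an (assumed) stopword list."""
    return [t for t in tokens if t not in stopwords]


def fold_to_ascii(tokens):
    """Token filter (stand-in for asciifolding): strip combining diacritics."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, exactly one tokenizer, then token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


sample = "<p>The <b>Tr&egrave;s</b> Quick fox</p>"

tokens = analyze(sample, [strip_html], word_tokenizer,
                 [lowercase, remove_stopwords, fold_to_ascii])
print(tokens)  # ['tres', 'quick', 'fox']

# Same filters, different order: "The" reaches the stop filter before lowercasing.
reordered = analyze(sample, [strip_html], word_tokenizer,
                    [remove_stopwords, lowercase, fold_to_ascii])
print(reordered)  # ['the', 'tres', 'quick', 'fox']
```

The second call shows why the order of token filters matters: with the stop filter placed before lowercase, the capitalized "The" does not match the lowercase stopword list and survives into the output.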
Read full article from Custom analyzers