Custom analyzers



In Analysis and Analyzers, we said that an analyzer is a wrapper that combines three functions into a single package, which are executed in sequence:

Character filters

Character filters are used to “tidy up” a string before it is tokenized. For instance, if our text is in HTML format, it will contain HTML tags like <p> or <div> that we don’t want to be indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character: Á.

An analyzer may have zero or more character filters.
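
You can see a character filter in isolation with the _analyze API. The exact request syntax depends on your Elasticsearch version; in recent versions (5.x and later) it looks roughly like the sketch below. The sample text is purely illustrative, and the keyword tokenizer is used only so that the cleaned-up string comes back as a single token:

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>Some text with &Aacute; inside <b>tags</b></p>"
}

The returned token should contain no HTML markup, with &Aacute; converted to Á.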

Tokenizers

An analyzer must have a single tokenizer. The tokenizer breaks the string up into individual terms or tokens. The standard tokenizer, which is used in the standard analyzer, breaks a string up into individual terms on word boundaries and removes most punctuation, but other tokenizers exist that have different behaviour.

For instance, the keyword tokenizer outputs exactly the same string as it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer can be used to split text on a matching regular expression.
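
As a rough illustration (again assuming a recent _analyze API, with a made-up sentence), the same input produces very different token streams depending on the tokenizer:

GET /_analyze
{ "tokenizer": "standard", "text": "The quick-brown Fox!" }

GET /_analyze
{ "tokenizer": "whitespace", "text": "The quick-brown Fox!" }

GET /_analyze
{ "tokenizer": "keyword", "text": "The quick-brown Fox!" }

The standard tokenizer should return the terms The, quick, brown, Fox; the whitespace tokenizer returns The, quick-brown, Fox!; and the keyword tokenizer returns the whole sentence as a single token.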

Token filters

After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified.

Token filters may change, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but there are many more available in Elasticsearch. Stemming token filters “stem” words to their root form. The asciifolding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete.
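
Putting the three stages together, a custom analyzer is declared under the analysis section of the index settings. The sketch below is illustrative only: the index name my_index, the analyzer name my_analyzer, and the my_stopwords filter are made-up names, but the structure, a custom analyzer naming its char_filter, tokenizer, and filter chain, is the general pattern:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": [ "the", "a" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_stopwords" ]
        }
      }
    }
  }
}

Once the index exists, the analyzer can be tested with GET /my_index/_analyze, passing "analyzer": "my_analyzer" along with some sample text, or applied to a string/text field in the mapping via that field's analyzer parameter.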

