Testing Lucene Analyzers with elasticsearch - Control+R



Indexing and Analysis

Putting scoring aside for a moment, the first step is to find matches. Since humans don’t always use the same exact words to describe something, the words need to be massaged a bit to normalize case and remove suffixes such as -ed or -ing. This massaging is called analysis, and it is performed on both the documents being searched and the search query itself.

Analyzing a lot of documents takes time, so it is usually done up front. This process is called indexing. Analyzed documents are stored in a format that is efficient for searching, called an index.
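To make the idea of a search-friendly format concrete, here is a minimal sketch of an inverted index, which maps each term to the documents that contain it. It is a toy illustration, not Lucene's actual storage format; the `build_index` function, the sample documents, and the naive "lowercase and split on whitespace" analysis are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased, whitespace-separated token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Stand-in for real analysis: lowercase, then split on whitespace.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


# Hypothetical sample documents, keyed by document id.
docs = {1: "Searching for answers", 2: "Answers for everyone"}
index = build_index(docs)

print(sorted(index["answers"]))    # [1, 2]
print(sorted(index["searching"]))  # [1]
```

Because the work of analyzing each document happens once, when the index is built, a lookup at query time is just a dictionary access on the analyzed term.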

Consider a document that contains the word “Searching”. An analyzer might lowercase the word and remove the -ing suffix, leaving just “search”. This analyzed term is what gets stored in the index. Later, a user might come along and search for the word “searched”. The query is similarly analyzed, yielding “search” as the search term. This term matches the previously indexed document, so the document is returned to the user.
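The "Searching" and "searched" example can be sketched in a few lines of Python. The `analyze` function below is a hypothetical toy: it lowercases each word and strips a trailing -ing or -ed. Real Lucene analyzers use proper tokenizers and stemmers, so treat this only as an illustration of the idea.

```python
def analyze(text):
    """Toy analyzer: lowercase each word, then strip one trailing -ing or -ed suffix."""
    terms = []
    for word in text.lower().split():
        for suffix in ("ing", "ed"):
            # Only strip when at least three characters of the word remain.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return terms


print(analyze("Searching"))  # ['search']
print(analyze("searched"))   # ['search']
```

Both inputs reduce to the same term, which is why the query finds the document even though the surface forms differ.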

For all of this to work, the analyzer used during indexing and the analyzer used on the query must be compatible. If the analyzer used during indexing converts all words to uppercase but the analyzer used on the query converts all words to lowercase, there will never be a match!
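The uppercase-versus-lowercase mismatch is easy to reproduce in a small sketch. Here `index_terms` and `matches` are hypothetical helpers, and `str.upper` and `str.lower` stand in for two incompatible analyzers.

```python
def index_terms(text, analyzer):
    """Apply the analyzer to each word and return the set of indexed terms."""
    return {analyzer(word) for word in text.split()}


def matches(query, indexed, analyzer):
    """Return True if any analyzed query word appears among the indexed terms."""
    return any(analyzer(word) in indexed for word in query.split())


# Index time: every word is converted to uppercase.
indexed = index_terms("Searching made simple", str.upper)

# Query time with an incompatible (lowercasing) analyzer: no match.
print(matches("searching", indexed, str.lower))  # False

# Query time with the same analyzer used for indexing: match.
print(matches("searching", indexed, str.upper))  # True
```

The two analyzers need not be identical in practice, but they must produce the same terms for words that are supposed to match.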

Fortunately, you don’t have to code up any of this yourself — elasticsearch (and the Lucene library it uses under the hood) provides all of this functionality. But as with any tool, and especially a tool as deep as elasticsearch, you’ll be able to use it more effectively if you understand how it works.


Read full article from Testing Lucene Analyzers with elasticsearch – Control+R

