Indexing and Analysis
Putting scoring aside for a moment, the first step is to find matches. Since humans don’t always use the same exact words to describe something, the words need to be massaged a bit to normalize case and remove suffixes such as -ed or -ing. This massaging is called analysis, and it is performed on both the documents being searched and the search query itself.
Analyzing a lot of documents takes time, so it is usually done up front. This process is called indexing. Analyzed documents are stored in a format that is efficient for searching, called an index.
Consider a document that contains the word “Searching”. An analyzer might lowercase the word and remove the -ing suffix, leaving just “search”. This analyzed term is what gets stored in the index. Later, a user might come along and search for the word “searched”. The query is analyzed the same way, yielding “search” as the search term. That term matches the term stored in the index for the document, so the document is returned to the user.
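The flow described above can be sketched in a few lines of Python. This is only a toy illustration, not how Lucene or elasticsearch implement analysis: the names analyze, build_index and search are hypothetical, and the suffix stripping is a crude regular expression rather than a real stemmer (it would mangle short words such as "red").

```python
import re

def analyze(text):
    """Toy analyzer: lowercase, split into words, strip a trailing -ing or -ed."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [re.sub(r"(ing|ed)$", "", token) for token in tokens]

def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Analyze the query the same way, then look up each resulting term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

docs = {1: "Searching", 2: "Indexing documents"}
index = build_index(docs)

print(analyze("Searching"))       # ['search']
print(search(index, "searched"))  # {1}
```

Both the document and the query pass through the same analyze function, which is why “Searching” and “searched” meet at the term “search”.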
For all of this to work, the analyzer used during indexing and the analyzer used on the query must be compatible. If the analyzer used during indexing converts all words to uppercase but the analyzer used on the query converts all words to lowercase, there will never be a match!
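To make the mismatch concrete, here is a small Python sketch of the uppercase-versus-lowercase case. The two analyzer functions are hypothetical stand-ins that only split on whitespace and change case; they are not real Lucene analyzers.

```python
def index_analyzer(text):
    """Hypothetical index-time analyzer: uppercases every word."""
    return [word.upper() for word in text.split()]

def query_analyzer(text):
    """Hypothetical query-time analyzer: lowercases every word."""
    return [word.lower() for word in text.split()]

indexed_terms = set(index_analyzer("Searching documents"))

# Incompatible analyzers: the query terms never equal the indexed terms.
mismatched = indexed_terms & set(query_analyzer("Searching"))
print(mismatched)  # set()

# Using the same analyzer on both sides produces a match.
matched = indexed_terms & set(index_analyzer("Searching"))
print(matched)  # {'SEARCHING'}
```

Term lookup is an exact comparison of analyzed terms, so "SEARCHING" and "searching" count as different terms even though a human reader would treat them as the same word.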
Fortunately, you don’t have to code up any of this yourself: elasticsearch (and the Lucene library it uses under the hood) provides all of this functionality. But as with any tool, and especially a tool as deep as elasticsearch, you’ll be able to use it more effectively if you understand how it works.
Read full article from Testing Lucene Analyzers with elasticsearch – Control+R