All About Programming: Introducing Lucene Index Doc Values « Trifork Blog / Trifork: Enterprise Java, Open Source, software solutions

From day one, Apache Lucene has provided a solid inverted index data structure and the ability to store text and binary chunks in stored fields. In a typical use case, the inverted index is used to retrieve and score documents matching one or more terms. Once the matching documents have been scored, stored fields are loaded for the top N documents for display purposes.

However, the retrieval process is essentially limited to the information available in the inverted index, such as term and document frequency, boosts, and normalization factors. So what if you need custom information to score or filter documents? Stored fields are designed for bulk reads, meaning they perform best if you load all their data during document retrieval. We need more fine-grained data.

Lucene provides a RAM-resident FieldCache that is built from the inverted index the first time the FieldCache for a specific field is requested, or during index reopen. Internally we call this process un-inverting the field, since the inverted index is a value-to-document mapping whereas the FieldCache is a document-to-value data structure. For simplicity, think of an array indexed by Lucene's internal document IDs. When the FieldCache is loaded, Lucene iterates over all terms in a field, parses the term values, and fills the array slots based on the document IDs associated with each term. Figure 1 illustrates the process.

Figure 1. Un-inverting a field to FieldCache
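
To make the un-inverting step concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the class name UninvertSketch, the uninvert method, and the sample "price" terms are all hypothetical, and the inverted index is modeled as a simple map from term text to the IDs of the documents containing that term. It also assumes each document has at most one value per field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy model of un-inverting, not Lucene's FieldCache code.
final class UninvertSketch {

    /**
     * Builds a document-to-value array from a term-to-documents mapping.
     *
     * @param postings term text mapped to the IDs of documents containing it
     * @param maxDoc   number of documents; valid IDs are 0..maxDoc-1
     */
    public static int[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        // Slots stay 0 for documents that have no value in this field.
        int[] values = new int[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            // Parse the term's value once, then reuse it for every document.
            int parsed = Integer.parseInt(entry.getKey());
            for (int docId : entry.getValue()) {
                values[docId] = parsed;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed sample data: term -> documents (value-to-document mapping).
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put("10", Arrays.asList(0, 3));
        postings.put("25", Arrays.asList(1));
        postings.put("42", Arrays.asList(2));

        // After un-inverting, a value is a single array lookup by document ID.
        int[] prices = uninvert(postings, 4);
        System.out.println(Arrays.toString(prices)); // prints [10, 25, 42, 10]
    }
}
```

The cost of building the array is paid once, by walking every term in the field. After that, looking up the value for a document is a constant-time array access, which is what makes this layout attractive for scoring and filtering.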

Read full article from Introducing Lucene Index Doc Values « Trifork Blog / Trifork: Enterprise Java, Open Source, software solutions
