Changing Bits: New index statistics in Lucene 4.0
In the past, Lucene recorded only the bare minimum of aggregate index statistics necessary to support its hard-wired classic vector space scoring model.
Fortunately, this situation is wildly improved in trunk (to be 4.0), where we have a selection of modern scoring models, including Okapi BM25, Language models, Divergence from Randomness models and Information-based models. To support these, we now save a number of commonly used index statistics per index segment, and make them available at search time.
To understand the new statistics, let's pretend we've indexed the following two example documents, each with only one field "title":
- document 1: The Lion, the Witch, and the Wardrobe
- document 2: The Da Vinci Code
- TermsEnum.docFreq(): How many documents contain at least one occurrence of the term in the field; 3.x indices also save this (TermEnum.docFreq()). For term "lion" docFreq is 1, and for term "the" it's 2.
- Terms.getSumDocFreq(): Number of postings, i.e. the sum of TermsEnum.docFreq() across all terms in the field. For our example documents this is 9.
- TermsEnum.totalTermFreq(): Number of occurrences of this term in the field, across all documents. For term "the" it's 4, for term "vinci" it's 1.
- Terms.getSumTotalTermFreq(): Number of term occurrences in the field, across all documents; this is the sum of TermsEnum.totalTermFreq() across all unique terms in the field. For our example documents this is 11.
- Terms.getDocCount(): How many documents have at least one term for this field. In our example documents, this is 2, but if for example one of the documents was missing the title field, it would be 1.
- Terms.getUniqueTermCount(): How many unique terms were seen in this field. For our example documents this is 8. Note that this statistic is of limited utility for scoring, because it's only available per-segment and you cannot (efficiently!) compute this across all segments in the index (unless there is only one segment).
- Fields.getUniqueTermCount(): Number of unique terms across all fields; this is the sum of Terms.getUniqueTermCount() across all fields. In our example documents this is 8. Note that this is also only available per-segment.
- Fields.getUniqueFieldCount(): Number of unique fields. For our example documents this is 1; if we also had a body field and an abstract field, it would be 3. Note that this is also only available per-segment.
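To make the numbers above easy to check, here is a minimal sketch in plain Java that recomputes them for the two example titles. It does not use Lucene at all: the class IndexStatsSketch, its methods and the toy tokenizer (lowercase, split on non-letters, no stop-word removal) are assumptions made for illustration, chosen so that the counts match the ones quoted in the list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names are Lucene APIs.
class IndexStatsSketch {
    // term -> number of documents containing it (mirrors docFreq)
    final Map<String, Integer> docFreq = new TreeMap<>();
    // term -> number of occurrences across all documents (mirrors totalTermFreq)
    final Map<String, Long> totalTermFreq = new TreeMap<>();
    // number of documents with at least one term in the field (mirrors docCount)
    int docCount = 0;

    // Assumed analyzer: lowercase, split on non-letters, keep every token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    void addDocument(String title) {
        List<String> tokens = tokenize(title);
        if (tokens.isEmpty()) {
            return; // a document without terms in this field is not counted
        }
        docCount++;
        for (String t : tokens) {
            totalTermFreq.merge(t, 1L, Long::sum);
        }
        // Each distinct term in this document adds one posting.
        for (String t : new HashSet<>(tokens)) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    long sumDocFreq() {
        return docFreq.values().stream().mapToLong(Integer::longValue).sum();
    }

    long sumTotalTermFreq() {
        return totalTermFreq.values().stream().mapToLong(Long::longValue).sum();
    }

    int uniqueTermCount() {
        return docFreq.size();
    }

    // Builds the statistics for the two example documents from the article.
    static IndexStatsSketch example() {
        IndexStatsSketch stats = new IndexStatsSketch();
        stats.addDocument("The Lion, the Witch, and the Wardrobe");
        stats.addDocument("The Da Vinci Code");
        return stats;
    }

    public static void main(String[] args) {
        IndexStatsSketch s = example();
        System.out.println("docFreq(the)       = " + s.docFreq.get("the"));
        System.out.println("totalTermFreq(the) = " + s.totalTermFreq.get("the"));
        System.out.println("sumDocFreq         = " + s.sumDocFreq());
        System.out.println("sumTotalTermFreq   = " + s.sumTotalTermFreq());
        System.out.println("docCount           = " + s.docCount);
        System.out.println("uniqueTermCount    = " + s.uniqueTermCount());
    }
}
```

A real analyzer may tokenize differently (for example by dropping stop words such as "the"), in which case the counts would change accordingly.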
Of these statistics, 3.x indices record only TermsEnum.docFreq(), so if you want to experiment with the new scoring models in Lucene 4.0, you should either re-index or upgrade your index using IndexUpgrader. Note that the new scoring models all use the same single-byte norms format, so you can freely switch between them without re-indexing.

In addition to what's stored in the index, there are also these statistics available per-field, per-document while indexing, in the FieldInvertState passed to the Similarity.computeNorm method, for both 3.x and 4.0:
- length: How many tokens are in the document. For document 1 it's 7; for document 2 it's 4.
- uniqueTermCount: For this field in this document, how many unique terms there are. For document 1 it's 5; for document 2 it's 4.
- maxTermFrequency: The count for the most frequent term in this document. For document 1 it's 3 ("the" occurs 3 times); for document 2 it's 1.
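The three per-document values can be reproduced with a few lines of plain Java. This is only a sketch: PerDocStatsSketch is a hypothetical stand-in that borrows the field names of FieldInvertState, not the Lucene class itself, and it assumes the same toy tokenization (lowercase, split on non-letters, no stop words).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the per-field, per-document counters; not a Lucene class.
class PerDocStatsSketch {
    final int length;           // number of tokens in the field
    final int uniqueTermCount;  // number of distinct terms in the field
    final int maxTermFrequency; // count of the most frequent term

    PerDocStatsSketch(String fieldText) {
        Map<String, Integer> freqs = new HashMap<>();
        int tokenCount = 0;
        // Assumed analyzer: lowercase, split on non-letters, keep every token.
        for (String t : fieldText.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (t.isEmpty()) {
                continue;
            }
            tokenCount++;
            freqs.merge(t, 1, Integer::sum);
        }
        int max = 0;
        for (int f : freqs.values()) {
            max = Math.max(max, f);
        }
        this.length = tokenCount;
        this.uniqueTermCount = freqs.size();
        this.maxTermFrequency = max;
    }

    public static void main(String[] args) {
        String[] titles = {
            "The Lion, the Witch, and the Wardrobe",
            "The Da Vinci Code"
        };
        for (String title : titles) {
            PerDocStatsSketch s = new PerDocStatsSketch(title);
            System.out.println(title + ": length=" + s.length
                + " uniqueTermCount=" + s.uniqueTermCount
                + " maxTermFrequency=" + s.maxTermFrequency);
        }
    }
}
```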
From these available statistics you're now free to derive other commonly used statistics:
- Average field length across all documents is Terms.getSumTotalTermFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).
- Average within-document field term frequency is FieldInvertState.length divided by FieldInvertState.uniqueTermCount.
- Average number of unique terms per field across all documents is Terms.getSumDocFreq() divided by maxDoc (or Terms.getDocCount(), if not all documents have the field).
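As a quick worked example, the three derived statistics can be computed for the example documents from the raw numbers given earlier (sumTotalTermFreq 11, sumDocFreq 9, docCount 2, and length 7 with uniqueTermCount 5 for document 1). The sketch below is plain Java with hypothetical helper names; in a real Similarity the inputs would come from the Lucene statistics named above.

```java
// Hypothetical helpers; the arguments stand in for values read from the index.
class DerivedStatsSketch {
    // Average field length: total term occurrences divided by documents with the field.
    static double avgFieldLength(long sumTotalTermFreq, int docCount) {
        return (double) sumTotalTermFreq / docCount;
    }

    // Average within-document term frequency: tokens divided by distinct terms.
    static double avgTermFreqInDoc(int length, int uniqueTermCount) {
        return (double) length / uniqueTermCount;
    }

    // Average number of unique terms per document: postings divided by documents with the field.
    static double avgUniqueTermsPerDoc(long sumDocFreq, int docCount) {
        return (double) sumDocFreq / docCount;
    }

    public static void main(String[] args) {
        // Values for the two example titles, taken from the article.
        System.out.println("avg field length       = " + avgFieldLength(11, 2));
        System.out.println("avg tf in document 1   = " + avgTermFreqInDoc(7, 5));
        System.out.println("avg tf in document 2   = " + avgTermFreqInDoc(4, 4));
        System.out.println("avg unique terms / doc = " + avgUniqueTermsPerDoc(9, 2));
    }
}
```

For the example this gives an average field length of 5.5 tokens, an average of 4.5 unique terms per document, and within-document average term frequencies of 1.4 for document 1 and 1.0 for document 2. Both documents have the title field here, so dividing by maxDoc or by the document count gives the same result.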
Read full article from Changing Bits: New index statistics in Lucene 4.0