All About Programming: The Lucene formula: TF & IDF

For now, I want to look at the tf() function. The default implementation is in DefaultSimilarity.Tf, and is defined as:
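As a rough illustration (a Python sketch, not Lucene's own source), the default tf() can be pictured as the square root of the term's count. The function name and the `freq` parameter mirror the text; everything else is an assumption made for the example.

```python
import math

def tf(freq: float) -> float:
    """Term-frequency factor: the square root of how often the term occurs.

    Sketch of the default behaviour described above, not Lucene's code.
    """
    return math.sqrt(freq)

# The square root dampens repetition: four occurrences only double the factor.
print(tf(1), tf(4), tf(9))
```

The dampening is the point of the design: a document that repeats a term many times gains score, but far less than linearly.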

That isn’t really helpful, however. Not without knowing what freq is. And I’m pretty sure that Sqrt isn’t cheap, which probably explains this:
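A minimal Python sketch of that kind of caching, assuming a small table of precomputed scores for low counts and a fallback to direct computation for larger ones. The table size of 32 and the names `build_score_cache` and `score` are assumptions for illustration, not taken from the original listing.

```python
import math

SCORE_CACHE_SIZE = 32  # assumed size: small counts are by far the most common

def tf(freq: float) -> float:
    return math.sqrt(freq)

def build_score_cache(weight_value: float, size: int = SCORE_CACHE_SIZE) -> list:
    """Precompute tf(i) * weight_value for every count i below `size`."""
    return [tf(i) * weight_value for i in range(size)]

def score(freq: int, weight_value: float, cache: list) -> float:
    """Use the cached value when the count is small, else compute it directly."""
    if freq < len(cache):
        return cache[freq]
    return tf(freq) * weight_value

cache = build_score_cache(weight_value=2.0)
print(score(4, 2.0, cache))    # served from the cache
print(score(100, 2.0, cache))  # too large for the cache, computed on the fly
```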

So it caches the score function, and it appears that tf() is based purely on the count and nothing else. Before I go and find out what is done with the score cache, let’s look at the weight value. This is calculated here:
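As a hedged sketch of how such a weight value might be put together, assume the query weight starts as idf times the query boost, is then scaled by a query normalization factor, and is finally multiplied by idf once more. That sequence is my recollection of how Lucene's term weight works, not something the text above confirms, and the names `term_weight_value`, `boost` and `query_norm` are illustrative.

```python
def term_weight_value(idf: float, boost: float, query_norm: float) -> float:
    """Assumed shape of the weight value: idf * boost * query_norm * idf."""
    query_weight = idf * boost       # weight before normalization
    query_weight *= query_norm       # scaled so scores are comparable across queries
    return query_weight * idf        # idf is applied a second time

print(term_weight_value(idf=2.0, boost=1.0, query_norm=0.5))
```

Under these assumptions idf ends up squared in the final value, which is why rare terms dominate the score so strongly.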

And idf stands for inverse document frequency. That is defined by default to be:
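For reference, here is a Python sketch of the commonly cited default formula, log(numDocs / (docFreq + 1)) + 1. Treat it as an assumption about the default rather than a copy of the listing; the parameter names `doc_freq` and `num_docs` are mine.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: rarer terms get a larger factor.

    The +1 in the denominator avoids dividing by zero for unseen terms.
    """
    return math.log(num_docs / (doc_freq + 1)) + 1.0

# A term found in 1 of 1000 documents weighs far more than one found in 900.
print(idf(1, 1000), idf(900, 1000))
```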

And the actual implementation as:

And the caller for that is:

During indexing, there is a manually maintained hash table that contains information about each unique term, and when we move from one document to the next, we write out the number of times each term appeared in the document. For fun, this is written to what I think is an in-memory buffer for safekeeping, but it is really hard to follow.
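The bookkeeping described above can be sketched in a few lines of Python. This is a deliberately naive model: a dictionary stands in for the hand-rolled hash table, whitespace splitting stands in for analysis, and the name `index_documents` is hypothetical.

```python
from collections import Counter

def index_documents(docs: list) -> dict:
    """Return term -> list of (doc_id, count of the term in that document)."""
    postings = {}
    for doc_id, text in enumerate(docs):
        # Count every term of the current document in a per-document table...
        counts = Counter(text.lower().split())
        # ...and write the counts out before moving on to the next document.
        for term, freq in counts.items():
            postings.setdefault(term, []).append((doc_id, freq))
    return postings

postings = index_documents(["lucene scores lucene", "scores matter"])
print(postings["lucene"])   # the per-document counts for one term
print(len(postings["scores"]))  # number of documents containing the term
```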

Let us look at what happens when we use multiple segments. It is actually quite trivial: we just need to sum the frequency of each term across all segments. This gets more interesting when we involve deletes. Because of the way Lucene handles deletes, it can’t really handle this scenario, and deleting a document does not remove its frequency counts for the terms that it had. That remains the case until the index does a merge, which fixes everything.
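A small Python model of that behaviour, assuming each segment keeps its own per-term counts and a set of deleted document ids. The structures and the names `make_segment`, `delete`, `doc_freq` and `merge` are invented for the sketch and do not correspond to Lucene classes.

```python
def make_segment(docs: dict) -> dict:
    """docs maps doc_id -> list of terms; counts are fixed when the segment is built."""
    freq = {}
    for terms in docs.values():
        for term in set(terms):
            freq[term] = freq.get(term, 0) + 1
    return {"docs": dict(docs), "doc_freq": freq, "deleted": set()}

def delete(segment: dict, doc_id: int) -> None:
    # A delete only marks the document; the stored counts are left untouched.
    segment["deleted"].add(doc_id)

def doc_freq(term: str, segments: list) -> int:
    # Across segments, the count for a term is simply the sum of the parts.
    return sum(seg["doc_freq"].get(term, 0) for seg in segments)

def merge(segments: list) -> dict:
    # Merging rewrites only the live documents, so the counts become accurate again.
    live = {}
    for seg in segments:
        for doc_id, terms in seg["docs"].items():
            if doc_id not in seg["deleted"]:
                live[doc_id] = terms
    return make_segment(live)

s1 = make_segment({1: ["lucene", "tf"], 2: ["lucene"]})
s2 = make_segment({3: ["lucene", "idf"]})
delete(s1, 2)
print(doc_freq("lucene", [s1, s2]))          # still counts the deleted document
print(doc_freq("lucene", [merge([s1, s2])])) # corrected after the merge
```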

What about another important quality, the number of times a term appears in a specific document? Those counts are stored in the frq file, and they are accessible during queries. They are then used in conjunction with the overall term frequency to generate the boost factor per result.
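To see how the two numbers could combine, here is a simplified Python sketch that ranks the documents matching a single term. It assumes a boost and query normalization of 1, ignores field length norms entirely, and uses the square-root and logarithm formulas as assumed defaults; the name `rank` is hypothetical.

```python
import math

def rank(postings: list, num_docs: int) -> list:
    """postings: (doc_id, count in that document) pairs for one term."""
    # The index-wide count drives idf, which is the same for every match.
    idf = math.log(num_docs / (len(postings) + 1)) + 1.0
    # The per-document count drives tf, which differs from match to match.
    scored = [(doc_id, math.sqrt(freq) * idf * idf) for doc_id, freq in postings]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for doc_id, value in rank([(0, 1), (1, 4), (2, 2)], num_docs=100):
    print(doc_id, round(value, 3))
```

Since idf is shared by every match for the term, the ordering within a single-term query comes entirely from the per-document counts.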
