The Lucene formula: TF & IDF



The Lucene formula: TF & IDF
image
For now, I want to look at the tf() function. The default implementation is in DefaultSimilarity.Tf, and is defined as:
image
That isn’t really helpful, however. Not without knowing what freq is.  And I’m pretty sure that Sqrt isn’t cheap, which probably explains this:
image
So it caches the score function, and it appears that the tf() is purely based on the count, not on anything else. Before I’ll go and find out what is done with the score cache, let’s look at the weight value. This is calculated here:
image
And idf stands for inverse document frequency.  That is defined by default to be:
image
And the actual implementation as:
image
And the caller for that is:
image

During indexing, there is a manually maintained hash table that contains information about each unique term, and when we move from one document to another, we write the number of times each term appeared in the document. For fun, this is written to what I think is an in memory buffer for safe keeping, but it is really hard to follow.

let us look what happens when we use multiple segments. It is actually quite trivial. We just need to sum the term frequency each term across all segments. This gets more interesting when we involve deletes. Because of the way Lucene handle deletes, it can’t really handle this scenario, and deleting a document do not remove its frequency counts for the terms that it had. That is the case until the index does a merge, and fix everything that way.

What about another important quality, the number of times this term appears in a specific document? Those are stored in the frq file, and they are accessible during queries. This is then used in conjunction with the overall term frequency to generate the boost factor per result.

Please read full article from The Lucene formula: TF & IDF

No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts