Lucene 4.10.1 Package lucene.search.similarities



Lucene 4.10.1 Package lucene.search.similarities
This package contains the various ranking models that can be used in Lucene. The abstract class Similarity serves as the base for ranking functions. For searching, users can employ the models already implemented or create their own by extending one of the classes in this package.

Summary of the Ranking Methods

DefaultSimilarity is the original Lucene scoring function. It is based on a highly optimized Vector Space Model. For more information, seeTFIDFSimilarity.
BM25Similarity is an optimized implementation of the successful Okapi BM25 model.
SimilarityBase provides a basic implementation of the Similarity contract and exposes a highly simplified interface, which makes it an ideal starting point for new ranking functions. Lucene ships the following methods built on SimilarityBase:
Since SimilarityBase is not optimized to the same extent as DefaultSimilarity and BM25Similarity, a difference in performance is to be expected when using the methods listed above. However, optimizations can always be implemented in subclasses

To change Similarity, one must do so for both indexing and searching, and the changes must happen before either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen.
To make this change, implement your own Similarity (likely you'll want to simply subclass an existing method, be it DefaultSimilarity or a descendant of SimilarityBase), and then register the new class by calling IndexWriterConfig.setSimilarity(Similarity) before indexing and IndexSearcher.setSimilarity(Similarity) before searching.

Extending SimilarityBase
The easiest way to quickly implement a new ranking method is to extend SimilarityBase, which provides basic implementations for the low level . Subclasses are only required to implement the SimilarityBase.score(BasicStats, float, float) andSimilarityBase.toString() methods.
Another option is to extend one of the frameworks based on SimilarityBase. These Similarities are implemented modularly, e.g.DFRSimilarity delegates computation of the three parts of its formula to the classes BasicModelAfterEffect and Normalization. Instead of subclassing the Similarity, one can simply introduce a new basic model and tell DFRSimilarity to use it.

Changing DefaultSimilarity

If you are interested in use cases for changing your similarity, see the Lucene users's mailing list at Overriding Similarity. In summary, here are a few use cases:
  1. The SweetSpotSimilarity in org.apache.lucene.misc gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.
  2. Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.
  3. Changing Length Normalization — By overriding Similarity.computeNorm(FieldInvertState state), it is possible to discount how the length of a field contributes to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".
In general, Chris Hostetter sums it up best in saying (from the Lucene users's mailing list):
[One would override the Similarity in] ... any situation where you know more about your data then just that it's "text" is a situation where it *might* make sense to to override your Similarity method.
Please read full article from Lucene 4.10.1 Package lucene.search.similarities

No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts