Custom Per-Field Similarity in Solr 4.1



Let's say a search term appears 20 times in the “long description” field of a Solr document and 5 times in the “short description” field. Because of the nature of the data in “short description”, we want to apply a multiplier to the term frequency (tf) that the Lucene Similarity calculates for that field.
For example, the tf score returned for “short description” should be multiplied by 10. For “long description”, we don't want to give any additional score once the search term occurs more than 10 times; the reasoning is that occurrences beyond a threshold do not necessarily make a document more relevant.
Solr provides a way to override the default Lucene Similarity class by declaring it in schema.xml, as shown below:
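To make the two rules concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the baseline is DefaultSimilarity's tf, which is the square root of the term frequency. The class name, method names, and constants are hypothetical and only mirror the numbers in the example above; in a real custom Similarity this logic would live in an overridden tf(float freq) method.

```java
// Sketch only: plain Java, no Lucene on the classpath.
// All names and constants below are hypothetical, chosen to match the example in the text.
final class TfAdjustments {

    // Multiplier for the "short description" field (assumed value from the example).
    static final float SHORT_DESCRIPTION_BOOST = 10f;

    // Occurrence threshold for the "long description" field (assumed value from the example).
    static final float LONG_DESCRIPTION_MAX_FREQ = 10f;

    // Baseline: DefaultSimilarity computes tf as sqrt(freq).
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // "short description": multiply the baseline tf by a constant factor.
    static float shortDescriptionTf(float freq) {
        return SHORT_DESCRIPTION_BOOST * defaultTf(freq);
    }

    // "long description": occurrences beyond the threshold add nothing to the score.
    static float longDescriptionTf(float freq) {
        return defaultTf(Math.min(freq, LONG_DESCRIPTION_MAX_FREQ));
    }

    public static void main(String[] args) {
        System.out.println("short, freq=5:  " + shortDescriptionTf(5f));
        System.out.println("long,  freq=20: " + longDescriptionTf(20f));
        System.out.println("long,  freq=10: " + longDescriptionTf(10f));
    }
}
```

With these rules, 20 occurrences in “long description” score the same as 10, while every occurrence in “short description” still counts and is scaled up.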
<similarity class="com.sdudhara.MyCustomSimilarity"/>
The MyCustomSimilarity class extends DefaultSimilarity (or any other Lucene Similarity class) and can override its methods, e.g. tf(). Declared this way, however, it applies to all fields and uses the same logic to calculate the tf() score for every one of them.
To do this at the field level, go to the fieldType definition in schema.xml that the given field uses and add a <similarity class="com.sdudhara.MyCustomSimilarity"/> element inside it. Note that in Solr 4.x, similarities declared on a fieldType are only honored when the global similarity is solr.SchemaSimilarityFactory. The following fieldType definitions show the general shape, using Solr's built-in DFR and IB similarity factories:
 <fieldType name="text_dfr" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <similarity class="solr.DFRSimilarityFactory">
      <str name="basicModel">I(F)</str>
      <str name="afterEffect">B</str>
      <str name="normalization">H2</str>
    </similarity>
 </fieldType>
 <fieldType name="text_ib" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <similarity class="solr.IBSimilarityFactory">
      <str name="distribution">SPL</str>
      <str name="lambda">DF</str>
      <str name="normalization">H2</str>
    </similarity>
 </fieldType>
  • You will also need to delete the existing data and reindex, since the Similarity class is also used at index time. Unless you reindex, the custom similarities will not take effect.
A related Stack Overflow post from when I was resolving this issue:
http://stackoverflow.com/questions/15751766/solr-4-1-dismax-pf-not-returning-expected-results/15868556#15868556