Lucene Study Notes: Similarity (Similarity in Lucene) | Something Technical



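Similarity defines how the relevance between a document and terms is scored. Lucene ships with several similarity measures that are common in the IR field, and all of these classes extend Similarity. Here zhangdx starts with a diagram of the inheritance relationships among these built-in similarity measures.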

[Figure: inheritance hierarchy of the Similarity classes]

Next, a brief introduction to each Similarity:

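  • BM25Similarity: in the same family as tf-idf; it implements the Okapi BM25 ranking function (see the References for the formula)
  • MultiSimilarity: combines the results of several Similarity implementations
  • PerFieldSimilarityWrapper: uses a different Similarity for each Field; this is an abstract class, so the concrete implementation has to be defined by the user
  • SimilarityBase: an abstract class that factors out the common parts, so subclasses only need to implement score and toString
    • DFRSimilarity: divergence from randomness
    • IBSimilarity: information-based model
    • LMSimilarity: language modeling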
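  • TFIDFSimilarity: the classic similarity, and the basis of Lucene's default implementation at the time of writing (since Lucene 6.0 the default is BM25Similarity)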
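
To make the BM25 entry above concrete, the textbook Okapi BM25 score for a single query term, as given in the Wikipedia article listed in the References, can be sketched in plain Java. This is only a minimal sketch and not Lucene's BM25Similarity code: the class name Bm25Sketch and its method names are hypothetical, k1 = 1.2 and b = 0.75 are assumed as the commonly used defaults, and Lucene stores document lengths in a lossy encoded form, so the scores it reports can differ slightly.

```java
// Hypothetical helper class for illustration; it does not use or mirror Lucene's API.
class Bm25Sketch {
    // Assumed parameter values (commonly used defaults).
    static final double K1 = 1.2;   // term-frequency saturation
    static final double B = 0.75;   // strength of length normalization

    // IDF(q) = ln(1 + (N - n + 0.5) / (n + 0.5))
    // N = number of documents, n = number of documents containing the term.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // score(q, D) = IDF(q) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * |D| / avgdl))
    // tf = frequency of the term in D, |D| = length of D, avgdl = average length.
    static double termScore(double tf, double docLen, double avgDocLen,
                            long docCount, long docFreq) {
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (K1 + 1.0) / (tf + norm);
    }

    public static void main(String[] args) {
        // Made-up statistics: 1000 documents, the term occurs in 50 of them,
        // average document length 100 tokens.
        for (double tf = 1; tf <= 4; tf++) {
            System.out.println("tf=" + tf + " score=" + termScore(tf, 100, 100, 1000, 50));
        }
    }
}
```

The score of a whole query is the sum of termScore over its terms. Compared with plain tf-idf, the contribution of tf saturates as it grows (controlled by k1), and documents longer than average are penalized (controlled by b).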

References

  1. http://en.wikipedia.org/wiki/Okapi_BM25
