A Simple Explanation of Lucene's Scoring (score) Mechanism



You can inspect exactly how a document's score is composed with the Searcher.explain(Query query, int doc) method.

In Lucene, the score is, simply put, the product tf * idf * boost * lengthNorm.

tf: the square root of the number of times the query term appears in the document.
idf: the inverse document frequency. For a given term it is the same for every matching document, so it does not differentiate the results of a single-term query.
boost: a boost factor set via setBoost. Note that it can be set on both a Field and a Document, and the two values both take effect (they are multiplied together).
lengthNorm: determined by the length of the searched field; the longer the field, the lower the score.

So the factor we can control programmatically is the boost value.
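The product above can be sketched in plain Java, with no Lucene dependency (the factor values below are illustrative, taken from the explain output shown later in this post):

```java
public class ScoreSketch {
    // One-term, one-field score per the formula above:
    // score = tf * idf * boost * lengthNorm
    static float score(int termFreq, float idf, float boost, float lengthNorm) {
        float tf = (float) Math.sqrt(termFreq); // tf is the square root of the term count
        return tf * idf * boost * lengthNorm;
    }

    public static void main(String[] args) {
        // term appears twice, idf 0.71231794, no boost, two-term field norm 0.625
        System.out.println(score(2, 0.71231794f, 1.0f, 0.625f)); // ~0.6296
        // term appears once in a longer field (norm 0.5)
        System.out.println(score(1, 0.71231794f, 1.0f, 0.5f));   // ~0.3562
    }
}
```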

One more question: why is the top score of a query always 1.0?
Because when the highest computed score exceeds 1.0, Lucene uses that maximum as a denominator and divides every document's score by it to produce the final scores.
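A minimal sketch of that normalization step, using the raw scores that appear later in this post (this mimics the behavior described above; it is not Lucene's own code):

```java
public class NormalizeSketch {
    // Divide every raw score by the maximum when the maximum exceeds 1.0
    static float[] normalize(float[] raw) {
        float max = 0f;
        for (float s : raw) max = Math.max(max, s);
        if (max <= 1.0f) return raw.clone(); // nothing to normalize
        float[] out = new float[raw.length];
        for (int i = 0; i < raw.length; i++) out[i] = raw[i] / max;
        return out;
    }

    public static void main(String[] args) {
        float[] scores = normalize(new float[]{1.7807949f, 0.629606f, 0.35615897f});
        for (float s : scores) System.out.println(s); // top score becomes 1.0
    }
}
```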
Document doc1 = new Document();
Document doc2 = new Document();
Document doc3 = new Document();

Field f1 = new Field("bookname", "bc bc", Field.Store.YES, Field.Index.TOKENIZED);
Field f2 = new Field("bookname", "ab bc", Field.Store.YES, Field.Index.TOKENIZED);
Field f3 = new Field("bookname", "ab bc cd", Field.Store.YES, Field.Index.TOKENIZED);

doc1.add(f1);
doc2.add(f2);
doc3.add(f3);

writer.addDocument(doc1);
writer.addDocument(doc2);
writer.addDocument(doc3);

writer.close();

IndexSearcher searcher = new IndexSearcher(INDEX_STORE_PATH);
TermQuery q = new TermQuery(new Term("bookname", "bc"));
q.setBoost(2f);
Hits hits = searcher.search(q);
Run output:
bc bc 0.629606
0.629606 = (MATCH) fieldWeight(bookname:bc in 0), product of:
  1.4142135 = tf(termFreq(bookname:bc)=2)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.625 = fieldNorm(field=bookname, doc=0)

ab bc 0.4451987
0.4451987 = (MATCH) fieldWeight(bookname:bc in 1), product of:
  1.0 = tf(termFreq(bookname:bc)=1)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.625 = fieldNorm(field=bookname, doc=1)

ab bc cd 0.35615897
0.35615897 = (MATCH) fieldWeight(bookname:bc in 2), product of:
  1.0 = tf(termFreq(bookname:bc)=1)
  0.71231794 = idf(docFreq=3, numDocs=3)
  0.5 = fieldNorm(field=bookname, doc=2) 

From the output we can see:
In the "bc bc" document, bc appears twice, so tf is the square root of 2, i.e. 1.4142135. It appears once in each of the other two documents, so their tf is 1.0.
idf is the same for all three documents, 0.71231794, since it depends only on the term, not on the individual document.
By default every boost is 1.0, so lengthNorm is exactly the fieldNorm value shown. The first two documents have the same field length and get 0.625, while the last document is longer, so its value is lower, 0.5, and it scores lower.
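The tf and idf values can be reproduced with plain arithmetic. The idf formula below is the one used by Lucene's DefaultSimilarity, and the 0.625/0.5 norms are what the raw 1/sqrt(fieldLength) values become after Lucene's lossy one-byte norm encoding:

```java
public class ExplainArithmetic {
    public static void main(String[] args) {
        // tf = sqrt(termFreq)
        System.out.println(Math.sqrt(2)); // 1.4142135...

        // DefaultSimilarity: idf = 1 + ln(numDocs / (docFreq + 1))
        double idf = 1 + Math.log(3.0 / (3 + 1));
        System.out.println(idf); // 0.7123179...

        // raw lengthNorm = 1 / sqrt(number of terms in the field)
        System.out.println(1 / Math.sqrt(2)); // ~0.707, stored as 0.625 after byte encoding
        System.out.println(1 / Math.sqrt(3)); // ~0.577, stored as 0.5
    }
}
```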

Now add a boost factor to the f2 field: f2.setBoost(2.0f);
The output becomes:
ab bc 0.8903974
0.8903974 = (MATCH) fieldWeight(bookname:bc in 1), product of:
  1.0 = tf(termFreq(bookname:bc)=1)
  0.71231794 = idf(docFreq=3, numDocs=3)
  1.25 = fieldNorm(field=bookname, doc=1)

Notice that fieldNorm changed from 0.625 to 1.25, i.e. it was multiplied by the 2.0 boost.
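Checking the arithmetic: the stored norm is lengthNorm times the boost, and the new raw score follows directly from the product formula:

```java
public class BoostCheck {
    public static void main(String[] args) {
        float fieldNorm = 0.625f * 2.0f;                     // lengthNorm * field boost
        System.out.println(fieldNorm);                       // 1.25
        System.out.println(1.0f * 0.71231794f * fieldNorm);  // ~0.8903974 (tf * idf * fieldNorm)
    }
}
```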

Next, also add a boost factor to the second document: doc2.setBoost(2.0f);
The output becomes:
ab bc 1.0
1.7807949 = (MATCH) fieldWeight(bookname:bc in 1), product of:
  1.0 = tf(termFreq(bookname:bc)=1)
  0.71231794 = idf(docFreq=3, numDocs=3)
  2.5 = fieldNorm(field=bookname, doc=1)

fieldNorm was multiplied by 2 again (0.625 × 2 × 2 = 2.5), which shows that the Document-level and Field-level setBoost values are multiplied together.

Because this document's raw score now exceeds 1.0 (it is 1.7807949), the final scores of the other two documents are each divided by that value,
becoming:
bc bc 0.35355335
ab bc cd 0.19999999
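These final numbers check out: they are the raw scores from the first run divided by the new maximum, 1.7807949:

```java
public class FinalScores {
    public static void main(String[] args) {
        float max = 1.7807949f;                  // raw score of the boosted "ab bc" document
        System.out.println(0.629606f / max);     // ~0.35355335 ("bc bc")
        System.out.println(0.35615897f / max);   // ~0.2 ("ab bc cd", printed as 0.19999999)
    }
}
```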