Lucene's Grouping (GroupBy) Feature « 克己服人,礼智谦让!
How top-N grouping works: two search passes.
1) Run the first search with a FirstPassGroupingCollector, which sorts the groups and keeps those in the (offset, offset + topN) range.
2) Run the second search with a SecondPassGroupingCollector, which collects the top n documents for each of those groups.
To make the second pass cheaper, a CachingCollector can cache the documents matched during the first pass and replay them to the second pass instead of re-executing the query.
To report the total number of groups, an AllGroupsCollector is added as well.
Note: if you need to adjust Lucene's scores, subclass TermFirstPassGroupingCollector and TermSecondPassGroupingCollector.
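The two-pass idea itself is independent of Lucene. A minimal standalone sketch, using a hypothetical in-memory list of (author, score) records in place of an index: the first pass ranks groups (here, by each group's best score) and keeps the top N; the second pass keeps the top documents inside each selected group.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical illustration of two-pass grouping over in-memory records,
// not Lucene's actual collector API.
public class TwoPassGrouping {
    record Doc(String author, double score) {}

    // First pass: rank groups by their best score and keep the top N group keys.
    static List<String> firstPass(List<Doc> docs, int topNGroups) {
        Map<String, Double> bestPerGroup = new HashMap<>();
        for (Doc d : docs) {
            bestPerGroup.merge(d.author(), d.score(), Math::max);
        }
        return bestPerGroup.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(topNGroups)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Second pass: for each selected group, keep the top docsPerGroup records.
    static Map<String, List<Doc>> secondPass(List<Doc> docs, List<String> groups, int docsPerGroup) {
        Map<String, List<Doc>> result = new LinkedHashMap<>();
        for (String g : groups) {
            result.put(g, docs.stream()
                    .filter(d -> d.author().equals(g))
                    .sorted(Comparator.comparingDouble(Doc::score).reversed())
                    .limit(docsPerGroup)
                    .collect(Collectors.toList()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("alice", 0.9), new Doc("bob", 0.8),
                new Doc("alice", 0.7), new Doc("carol", 0.5),
                new Doc("bob", 0.4), new Doc("alice", 0.3));
        List<String> topGroups = firstPass(docs, 2);   // [alice, bob]
        Map<String, List<Doc>> grouped = secondPass(docs, topGroups, 2);
        System.out.println(topGroups);
        grouped.forEach((g, ds) -> System.out.println(g + ": " + ds));
    }
}
```

In Lucene the second pass normally means re-reading the matching documents, which is why the real implementation adds the CachingCollector used in the code below.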

  public void groupBy(IndexSearcher searcher, Query query, Sort groupSort) throws IOException {
    int topNGroups = 10;     // how many groups per page
    int groupOffset = 0;     // first group to return
    boolean fillFields = true;
    // groupSort orders the groups; docSort orders documents within each group.
    // They are usually the same, but need not be:
    Sort docSort = groupSort;
    // Sort docSort = new Sort(new SortField[] { new SortField("page", SortField.INT, true) });
    int docOffset = 0;       // first document within each group (for paging inside a group)
    int docsPerGroup = 2;    // how many documents to return per group
    boolean requiredTotalGroupCount = true; // whether to compute the total number of groups

    // To adjust Lucene's scores, subclass TermFirstPassGroupingCollector
    TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector(searcher.getIndexReader(), "author", groupSort, groupOffset + topNGroups);

    boolean cacheScores = true;
    double maxCacheRAMMB = 16.0;
    CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
    searcher.search(query, cachedCollector);

    Collection<SearchGroup<String>> topGroups = c1.getTopGroups(groupOffset, fillFields);

    if (topGroups == null) {
      // No groups matched
      return;
    }

    Collector secondPassCollector = null;

    boolean getScores = true;
    boolean getMaxScores = true;
    // To adjust Lucene's scores, subclass TermSecondPassGroupingCollector
    TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector(searcher.getIndexReader(), "author", topGroups, groupSort, docSort, docOffset + docsPerGroup, getScores, getMaxScores, fillFields);

    // Optionally compute total group count
    TermAllGroupsCollector allGroupsCollector = null;
    if (requiredTotalGroupCount) {
      allGroupsCollector = new TermAllGroupsCollector(searcher.getIndexReader(), "author");
      secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
    } else {
      secondPassCollector = c2;
    }

    if (cachedCollector.isCached()) {
      // Cache fit within maxCacheRAMMB, so we can replay it:
      cachedCollector.replay(secondPassCollector);
    } else {
      // Cache was too large; must re-execute query:
      searcher.search(query, secondPassCollector);
    }

    int totalGroupCount = -1;       // total number of groups
    int totalHitCount = -1;         // total number of matching documents
    int totalGroupedHitCount = -1;  // matching documents inside groups (usually equal to totalHitCount)
    if (requiredTotalGroupCount) {
      totalGroupCount = allGroupsCollector.getGroupCount();
    }
    System.out.println("groupCount: " + totalGroupCount);

    TopGroups<String> groupsResult = c2.getTopGroups(docOffset);
    totalHitCount = groupsResult.totalHitCount;
    totalGroupedHitCount = groupsResult.totalGroupedHitCount;
    System.out.println("groupsResult.totalHitCount:" + totalHitCount);
    System.out.println("groupsResult.totalGroupedHitCount:" + totalGroupedHitCount);

    int groupIdx = 0;
    // Iterate over the groups
    for (GroupDocs<String> groupDocs : groupsResult.groups) {
      groupIdx++;
      System.out.println("group[" + groupIdx + "]:" + groupDocs.groupValue); // the group key
      System.out.println("group[" + groupIdx + "]:" + groupDocs.totalHits);  // hits inside the group
      int docIdx = 0;
      // Iterate over the documents inside the group
      for (ScoreDoc scoreDoc : groupDocs.scoreDocs) {
        docIdx++;
        System.out.println("group[" + groupIdx + "][" + docIdx + "]:" + scoreDoc.doc + "/" + scoreDoc.score);
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.println("group[" + groupIdx + "][" + docIdx + "]:" + doc);
      }
    }
  }
Read full article from Lucene的分组(Grouping/GroupBy)功能 « 克己服人,礼智谦让!
