[ lucene扩展 ] 自定义Collector实现统计功能 - MR-fox - 博客园



[ lucene扩展 ] 自定义Collector实现统计功能 - MR-fox - 博客园

从上面这一坨坨代码我们可以大概看清collector在search中的应用情况。

那么统计呢?

首先我们来分析最简单的统计――"一维统计",就只对一个字段的统计。例如统计图书每年的出版量、专利发明人发明专利数量的排行榜等。

统计的输入:检索式、统计字段

统计的输出:<统计项、数量>的集合

其中关键是我们怎么拿到统计项。这个又分成以下一种情况:

1)统计字段没有存储、不分词

我们可以使用FieldCache.DEFAULT.getStrings(reader, f);获取统计项。

2)统计字段没有存储、分词

需要通过唯一标识从数据库(如果正向信息存在数据库的话)取出统计项(字段内容),然后统计分析。可想而知效率极低。

3)统计字段存储、分词

可以通过doc.get(fieldName)取出统计项,依然比较低效

4)统计字段存储、不分词

和1)类似

因此我们如果要对某个字段进行统计,那么最好选用不分词(Index.NOT_ANALYZED),这个和排序字段的要求类似!

拿到统计项后,我们可以通过累加然后排序。(这里可以借助map)


Read full article from [ lucene扩展 ] 自定义Collector实现统计功能 - MR-fox - 博客园


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts