Lucene中的堆(Heap)[ScorerDocQueue,TopScoreDocCollector] - quweiprotoss的日志 - 网易博客



         一个经典的问题,也就是10^N个数,远超过内存的大小,如何排序。答案虽然我自己也想到了,但别人更早想到,经典做法,把文件拆成多份,然后多线程对文件分别进行排序,然后进行多路归并,多路归并时,经典做法就是用优先队列。这也是LuceneAnd操作时选择的方法,在DisjunctionSumScorer中有ScorerDocQueue scorerDocQueue,它就是一个优先队列。

         ScorerDocQueue的成员有:

/* 保存堆中的元素 */

private final HeapedScorerDoc[] heap;

/* 堆最多元素数量 */

private final int maxSize;

/* 堆中当前元素数量 */

private int size;

/* heap[1]值相同,它是为了提高速度 */

private HeapedScorerDoc topHSD;

         HeapedScoreDoc类的成员只有两个,一个是Scorer本身,另一个是Scorer的第一个doc id也就是最小的一个doc id

private void initScorerDocQueue() throws IOException {

    scorerDocQueue = new ScorerDocQueue(nrScorers);

    for (Scorer se : subScorers) {

       if (se.nextDoc() != NO_MORE_DOCS) {

           scorerDocQueue.insert(se);

       }

    }

}

         DisjunctionSumScorer类中initScorerDocQueue函数将subScorers放入优先队列中。


Read full article from Lucene中的堆(Heap)[ScorerDocQueue,TopScoreDocCollector] - quweiprotoss的日志 - 网易博客


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts