Lucene40PostingsWriter - jollyjumper's Column - Blog Channel - CSDN.NET



This class controls the output of the two postings files, freq and prox. It is fairly simple.

The default skip interval is 16 and the default max skip level is 10.
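To get a feel for what these two defaults mean, here is a rough sketch of how many skip entries a term with a given document frequency would get on each level. It assumes that every 16th document produces a level-0 entry and that each higher level keeps every 16th entry of the level below; that multiplier is my assumption, not something stated above, and `SkipLevelSketch` is a made-up class, not part of Lucene.

```java
// Hypothetical illustration class; not part of Lucene.
final class SkipLevelSketch {
    static final int SKIP_INTERVAL = 16;    // default skip interval
    static final int MAX_SKIP_LEVELS = 10;  // default max skip level

    /** entries[k] = number of skip entries on level k for a term with df documents. */
    static int[] entriesPerLevel(int df) {
        int[] entries = new int[MAX_SKIP_LEVELS];
        long span = SKIP_INTERVAL; // documents covered by one entry on the current level
        for (int level = 0; level < MAX_SKIP_LEVELS; level++) {
            entries[level] = (int) (df / span);
            span *= SKIP_INTERVAL; // assumed: each level is 16x sparser than the one below
        }
        return entries;
    }

    /** Number of levels that hold at least one entry. */
    static int usedLevels(int df) {
        int used = 0;
        for (int count : entriesPerLevel(df)) {
            if (count > 0) {
                used++;
            }
        }
        return used;
    }
}
```

Under these assumptions a term that occurs in 1000 documents only uses two levels, so the cap of 10 levels matters only for extremely frequent terms.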

Judging from the source, it still uses VInt encoding (rather than the reputedly fast PForDelta).
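For reference, a minimal sketch of VInt encoding: each byte carries 7 bits of the value, low-order bits first, and the high bit is set on every byte except the last. `VIntSketch` is a made-up helper for illustration and writes to a plain byte array rather than to a Lucene output.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of Lucene.
final class VIntSketch {
    /** Encodes an int as a VInt: 7 data bits per byte, continuation bit 0x80 on all but the last byte. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits plus continuation bit
            value >>>= 7;
        }
        out.write(value); // last byte, continuation bit clear
        return out.toByteArray();
    }

    /** Decodes a single VInt starting at the beginning of the array. */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // continuation bit clear: this was the last byte
            }
        }
        return value;
    }
}
```

Small values, which is what deltas between neighbouring doc ids usually are, fit in a single byte; anything from 128 upward needs two or more.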

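When writing documents, it stores the deltas of the docid list. If term frequencies are not stored, each document is a single VInt holding the delta. If they are stored and the frequency is 1, it writes (delta << 1) | 1; otherwise it writes two VInts (delta << 1 and termDocFreq). Finally it buffers the skip list data.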
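The doc/freq rule above can be sketched as follows. The method returns the integer values in the order they would be written, each of which would then go out as one VInt; `DocFreqSketch` and its parameter names are made up for illustration, and skip data is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class DocFreqSketch {
    /** Returns the values that would each be written as one VInt for a term's postings. */
    static List<Integer> encode(int[] docIds, int[] freqs, boolean storeFreqs) {
        List<Integer> out = new ArrayList<>();
        int lastDocId = 0;
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - lastDocId; // docIds are assumed to be increasing
            lastDocId = docIds[i];
            if (!storeFreqs) {
                out.add(delta);                // frequencies omitted: just the delta
            } else if (freqs[i] == 1) {
                out.add((delta << 1) | 1);     // low bit set means "freq is 1"
            } else {
                out.add(delta << 1);           // low bit clear: freq follows
                out.add(freqs[i]);
            }
        }
        return out;
    }
}
```

For docids 5, 8, 20 with frequencies 1, 3, 1 this gives 11, 6, 3, 25: the common case of a term occurring once in a document costs no extra VInt.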

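Positions are stored in a similar way. How the position delta is written depends on whether payloads are stored and whether the payload length is the same as last time; how the offset is written depends on whether offsets are stored and whether the offset length is the same as last time. If there is a payload, the payload bytes are written last.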
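A sketch of the position rule, in the same style: the low bit of the shifted delta signals that the length changed and is written again. It covers the positions of one document only, records the integer values instead of writing VInts, and leaves out the payload bytes themselves. `PositionSketch` and all its field names are made up for illustration, and starting the "last length" fields at -1 so that the first length is always written is my assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class PositionSketch {
    private final boolean storePayloads;
    private final boolean storeOffsets;
    private int lastPosition = 0;
    private int lastOffset = 0;
    private int lastPayloadLength = -1; // assumed: -1 forces the first length to be written
    private int lastOffsetLength = -1;
    final List<Integer> out = new ArrayList<>(); // values that would each become one VInt

    PositionSketch(boolean storePayloads, boolean storeOffsets) {
        this.storePayloads = storePayloads;
        this.storeOffsets = storeOffsets;
    }

    void addPosition(int position, int payloadLength, int startOffset, int endOffset) {
        int delta = position - lastPosition;
        lastPosition = position;
        if (!storePayloads) {
            out.add(delta);
        } else if (payloadLength != lastPayloadLength) {
            lastPayloadLength = payloadLength;
            out.add((delta << 1) | 1);       // low bit set: a new payload length follows
            out.add(payloadLength);
        } else {
            out.add(delta << 1);             // same payload length as last time
        }
        if (storeOffsets) {
            int offsetDelta = startOffset - lastOffset;
            int offsetLength = endOffset - startOffset;
            if (offsetLength != lastOffsetLength) {
                out.add((offsetDelta << 1) | 1); // low bit set: a new offset length follows
                out.add(offsetLength);
            } else {
                out.add(offsetDelta << 1);
            }
            lastOffset = startOffset;
            lastOffsetLength = offsetLength;
        }
        // The payload bytes, if any, would be written here, after the integers above.
    }
}
```

When consecutive positions share the same payload length and token length, each position costs only one or two VInts.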

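Once a term has been fully added, the buffered skip list is written out (judging from the source, into the freq file, after that term's postings).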

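Once all documents of a segment have been added and its terms are flushed, flushTermsBlock is called: it first writes each term's freqStart, proxStart, and skipStart into a RAMOutputStream, and finally flushes that buffer into the tim file (the terms dictionary).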
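The buffer-then-flush pattern can be sketched like this. An in-memory list stands in for the RAMOutputStream and another list for the target file. Writing file pointers as deltas against the previous term in the block, prefixing the flushed block with its size, and using a negative skipStart to mean "no skip data" are all simplifying assumptions of mine; `TermMetaSketch` is a made-up class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
final class TermMetaSketch {
    private final List<Long> buffer = new ArrayList<>(); // stand-in for the in-memory stream
    final List<Long> termsFile = new ArrayList<>();      // stand-in for the on-disk output
    private long lastFreqStart = 0;
    private long lastProxStart = 0;

    /** Buffers one term's metadata; a negative skipStart means the term has no skip data. */
    void addTerm(long freqStart, long proxStart, long skipStart) {
        buffer.add(freqStart - lastFreqStart); // assumed: delta against the previous term
        buffer.add(proxStart - lastProxStart);
        if (skipStart >= 0) {
            buffer.add(skipStart);
        }
        lastFreqStart = freqStart;
        lastProxStart = proxStart;
    }

    /** Copies the buffered block to the file, prefixed with its size, then resets the state. */
    void flushTermsBlock() {
        termsFile.add((long) buffer.size());
        termsFile.addAll(buffer);
        buffer.clear();
        lastFreqStart = 0; // the first term of the next block is written as absolute values
        lastProxStart = 0;
    }
}
```

Buffering in memory first means the size of the block is known before anything reaches the file, so a reader can skip the whole block without decoding it.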

