[LUCENE-3003] Move UnInvertedField into Lucene core - ASF JIRA



It is inefficient - but I never saw a way around it since the lists are all being built in parallel (due to the fact that we are uninverting).

Lucene's indexer (TermsHashPerField) has precisely this same problem:
every unique term must point to two (well, one if omitTFAP is set)
growable byte arrays. We use "slices" into a single big (paged)
byte[], where the first slice is tiny and can only hold about 5 bytes,
but then points to the next slice, which is a bit bigger, etc.
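
To make the slice idea concrete, here is a minimal sketch of several byte lists growing in parallel inside one shared byte[]. It is not Lucene's actual code: the class name SlicePool, the slice sizes, and the 4-byte forward pointer at the end of each slice are assumptions chosen for illustration, and a single growable array stands in for the paged byte[].

```java
import java.util.Arrays;

/** Simplified sketch of slice-chained byte lists; not Lucene's implementation. */
class SlicePool {
    // Hypothetical payload sizes per level; the first slice holds only 5 bytes.
    static final int[] LEVEL_PAYLOAD = {5, 14, 30, 60, 120};
    static final int POINTER_BYTES = 4; // room for the address of the next slice

    private byte[] pool = new byte[64];
    private int used = 0;

    /** Reserves a slice of the given level and returns its start address. */
    private int allocSlice(int level) {
        int size = LEVEL_PAYLOAD[level] + POINTER_BYTES;
        if (used + size > pool.length) {
            pool = Arrays.copyOf(pool, Math.max(pool.length * 2, used + size));
        }
        int start = used;
        used += size;
        return start;
    }

    private void writeInt(int at, int v) {
        pool[at] = (byte) (v >>> 24);
        pool[at + 1] = (byte) (v >>> 16);
        pool[at + 2] = (byte) (v >>> 8);
        pool[at + 3] = (byte) v;
    }

    private int readInt(int at) {
        return ((pool[at] & 0xFF) << 24) | ((pool[at + 1] & 0xFF) << 16)
             | ((pool[at + 2] & 0xFF) << 8) | (pool[at + 3] & 0xFF);
    }

    /** One growable byte list stored as a chain of slices in the shared pool. */
    final class ByteList {
        private final int head;
        private int pos, end, level, length;

        ByteList() {
            head = allocSlice(0);
            pos = head;
            end = head + LEVEL_PAYLOAD[0];
        }

        void add(byte b) {
            if (pos == end) { // slice is full: chain a bigger one
                level = Math.min(level + 1, LEVEL_PAYLOAD.length - 1);
                int next = allocSlice(level);
                writeInt(end, next); // forward pointer sits right after the payload
                pos = next;
                end = next + LEVEL_PAYLOAD[level];
            }
            pool[pos++] = b;
            length++;
        }

        byte[] toArray() {
            byte[] out = new byte[length];
            int p = head, lvl = 0, e = head + LEVEL_PAYLOAD[0];
            for (int i = 0; i < length; i++) {
                if (p == e) { // follow the forward pointer to the next slice
                    lvl = Math.min(lvl + 1, LEVEL_PAYLOAD.length - 1);
                    p = readInt(e);
                    e = p + LEVEL_PAYLOAD[lvl];
                }
                out[i] = pool[p++];
            }
            return out;
        }
    }

    ByteList newList() { return new ByteList(); }

    int bytesUsed() { return used; }

    public static void main(String[] args) {
        SlicePool p = new SlicePool();
        ByteList a = p.newList(), b = p.newList();
        // Interleave writes, as happens when many lists are built at once.
        for (int i = 0; i < 50; i++) {
            a.add((byte) i);
            if (i % 10 == 0) b.add((byte) -i);
        }
        System.out.println(Arrays.toString(b.toArray()));
        System.out.println(a.toArray().length + " bytes in a, pool used: " + p.bytesUsed());
    }
}
```

A list that stays short costs only its first tiny slice, while a long list pays a few pointer hops; that trade-off is what makes the scheme attractive when most lists are small.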

We could look at refactoring that for this use too...

Though this is "just" the one-time startup cost.

Another small & easy optimization I hadn't gotten around to yet was to lower the indexIntervalBits and make it configurable.

I did make it configurable in the Lucene class (you can pass it in to
the ctor), but for Solr I left it indexing every 128th term.
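
As a rough illustration of what indexIntervalBits trades off, the sketch below keeps every (1 << bits)-th term in memory and resolves an ord by jumping to the nearest indexed term and stepping forward. The class IntervalIndexSketch and its methods are hypothetical, and a plain String[] stands in for the terms dictionary; with bits = 7 this corresponds to indexing every 128th term.

```java
/** Illustrative sketch of an interval-sampled term index; names are hypothetical. */
class IntervalIndexSketch {
    final int bits;
    final String[] allTerms; // stand-in for the full (on-disk) sorted terms
    final String[] indexed;  // every (1 << bits)-th term, held in RAM

    IntervalIndexSketch(String[] sortedTerms, int bits) {
        this.bits = bits;
        this.allTerms = sortedTerms;
        int interval = 1 << bits;
        indexed = new String[(sortedTerms.length + interval - 1) / interval];
        for (int i = 0; i < indexed.length; i++) {
            indexed[i] = sortedTerms[i << bits];
        }
    }

    /** Which in-memory index entry to seek to for this ord. */
    int indexSlot(int ord) { return ord >>> bits; }

    /** How many terms must be scanned after the seek. */
    int scanSteps(int ord) { return ord & ((1 << bits) - 1); }

    String lookup(int ord) {
        int pos = indexSlot(ord) << bits; // position of indexed[indexSlot(ord)]
        for (int i = scanSteps(ord); i > 0; i--) {
            pos++; // simulates advancing a terms enumerator by one term
        }
        return allTerms[pos];
    }
}
```

Lowering bits grows the in-memory array but shortens the worst-case scan (at most (1 << bits) - 1 steps), which is why making it configurable is useful.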

Another small optimization would be to store an array of offsets to length-prefixed byte arrays, rather than a BytesRef[]. At least the values are already in packed byte arrays via PagedBytes.

Both FieldCache and docvalues (branch) store an array-of-terms like
this (the array of offsets is packed ints).
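
A minimal sketch of the offsets-plus-length-prefixed-bytes layout follows. It assumes a one-byte length prefix (so terms of at most 127 bytes) and a plain int[] for the offsets where the comment above mentions packed ints; the class name PackedTerms is made up for illustration and is not a Lucene class.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Sketch: terms stored as length-prefixed bytes in one blob, addressed by offset. */
class PackedTerms {
    private final byte[] blob;   // [len][bytes...][len][bytes...]...
    private final int[] offsets; // offsets[ord] = start of that term's length byte

    PackedTerms(String[] terms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            byte[] utf8 = terms[i].getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                // Assumption of this sketch: a single-byte length prefix.
                throw new IllegalArgumentException("term longer than 127 bytes");
            }
            offsets[i] = out.size();
            out.write(utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        blob = out.toByteArray();
    }

    /** ord -> term: one array read for the offset, one for the length. */
    String term(int ord) {
        int start = offsets[ord];
        int len = blob[start];
        return new String(blob, start + 1, len, StandardCharsets.UTF_8);
    }

    int size() { return offsets.length; }
}
```

Compared with one object per term, this keeps a single int per term plus the shared blob, so the per-term object header and reference overhead disappear.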

We should also look at using an FST, which'd be the most compact but
the ord -> term lookup cost goes up.

Anyway I think we can pursue these cool ideas on new [future]

