It is inefficient, but I never saw a way around it since the lists are all being built in parallel (because we are uninverting).
Lucene's indexer (TermsHashPerField) has precisely this same problem: every unique term must point to two (well, one if omitTermFreqAndPositions is set) growable byte arrays. We use "slices" into a single big (paged) byte[], where the first slice is tiny and can hold only about 5 bytes, but then points to the next slice, which is a bit bigger, and so on. We could look at refactoring that for this use too...
Though this is "just" the one-time startup cost.
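For illustration, here is a much-simplified sketch of that slice scheme, assuming only the general idea rather than the actual ByteBlockPool/TermsHashPerField code (which packs a level marker into the slice itself and uses fixed-size pages instead of a growing array): many per-term byte lists share one big byte[]; each list starts with a tiny slice, and when a slice fills up, its last few bytes are overwritten with a forward pointer to the next, larger slice.

    import java.util.Arrays;

    // Simplified slice pool: names, sizes, and layout here are illustrative, not Lucene's.
    class SlicePool {
      static final int[] SLICE_SIZE = {5, 14, 20, 30, 40, 80, 160, 200};

      byte[] pool = new byte[1 << 10];
      int used = 0;

      // Per-list write cursor: next write position, end of the current slice's
      // payload area, and the level of the current slice.
      static class Cursor { int start, upto, end, level; }

      Cursor newList() {
        Cursor c = new Cursor();
        c.level = 0;
        c.start = c.upto = alloc(SLICE_SIZE[0] + 4);   // +4 bytes reserved for a forward pointer
        c.end = c.upto + SLICE_SIZE[0];
        return c;
      }

      void writeByte(Cursor c, byte b) {
        if (c.upto == c.end) {                          // current slice is full: chain a bigger one
          int nextLevel = Math.min(c.level + 1, SLICE_SIZE.length - 1);
          int next = alloc(SLICE_SIZE[nextLevel] + 4);
          writeInt(c.end, next);                        // forward pointer lives in the reserved bytes
          c.level = nextLevel;
          c.upto = next;
          c.end = next + SLICE_SIZE[nextLevel];
        }
        pool[c.upto++] = b;
      }

      // A reader starts at Cursor.start and, whenever it reaches a slice's payload end,
      // follows the 4-byte forward pointer to the next slice.

      private int alloc(int size) {
        if (used + size > pool.length) {
          pool = Arrays.copyOf(pool, Math.max(pool.length * 2, used + size));
        }
        int start = used;
        used += size;
        return start;
      }

      private void writeInt(int offset, int value) {
        for (int i = 0; i < 4; i++) pool[offset + i] = (byte) (value >>> (24 - 8 * i));
      }
    }

The point of the tiny first slice is that most terms are rare, so most lists never need more than a few bytes; only frequent terms pay for the larger follow-on slices.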
Another small & easy optimization I hadn't gotten around to yet was to lower the indexIntervalBits and make it configurable.
I did make it configurable to the Lucene class (you can pass it in to
ctor), but for Solr I left it using every 128th term.
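As a hedged sketch of what the index interval buys (this is not the actual DocTermOrds/UnInvertedField code; it just assumes terms are packed back-to-back as single-byte length-prefixed entries in ord order): only every (1 << indexIntervalBits)-th ord keeps a direct offset, and any other ord is found by seeking to the nearest indexed term and scanning forward, so a smaller interval spends more memory on offsets to make ord -> term lookups cheaper.

    // indexIntervalBits = 7 means every 128th term has an entry in indexedOffsets;
    // the worst-case lookup then scans forward over up to 127 length-prefixed terms.
    static String lookupOrd(byte[] termBytes, int[] indexedOffsets,
                            int indexIntervalBits, int ord) {
      int offset = indexedOffsets[ord >>> indexIntervalBits];   // nearest indexed term at or before ord
      for (int i = ord & ((1 << indexIntervalBits) - 1); i > 0; i--) {
        offset += 1 + (termBytes[offset] & 0xff);               // skip one length-prefixed term
      }
      int len = termBytes[offset] & 0xff;
      return new String(termBytes, offset + 1, len, java.nio.charset.StandardCharsets.UTF_8);
    }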
Another small optimization would be to store an array of offsets to length-prefixed byte arrays, rather than a BytesRef[]. At least the values are already in packed byte arrays via PagedBytes.
Both FieldCache and docvalues (branch) store an array-of-terms like
this (the array of offsets is packed ints).
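Purely as an illustration of that layout (the real code keeps the bytes in PagedBytes and the offsets in packed ints, not a plain int[] and ByteArrayOutputStream), the writer side pairs naturally with the lookup sketched above: instead of one BytesRef object per term, each typically wrapping its own small byte[], all term bytes go into one length-prefixed buffer and a parallel offsets array records where each ord starts.

    import java.io.ByteArrayOutputStream;
    import java.util.List;

    // Pack terms (already in ord order) into a single length-prefixed buffer;
    // offsets[ord] is where that ord's entry starts.  One int per term replaces
    // one object (plus, often, one small byte[]) per term.
    static int[] packTerms(List<byte[]> termsInOrdOrder, ByteArrayOutputStream packed) {
      int[] offsets = new int[termsInOrdOrder.size()];
      for (int ord = 0; ord < termsInOrdOrder.size(); ord++) {
        offsets[ord] = packed.size();
        byte[] term = termsInOrdOrder.get(ord);
        packed.write(term.length);               // single-byte length prefix, as in the lookup sketch
        packed.write(term, 0, term.length);
      }
      return offsets;
    }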
We should also look at using an FST, which'd be the most compact but
the ord -> term lookup cost goes up.
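As a rough, hedged sketch of the FST direction (the exact class names vary across Lucene versions, e.g. Builder vs FSTCompiler, and this follows roughly the Lucene 4.x/5.x-era API, so treat it as an approximation rather than the proposed code): build a term -> ord FST from the sorted terms, which is very compact, but going from ord back to the term means walking the FST rather than a simple array dereference.

    import java.io.IOException;

    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.IntsRefBuilder;
    import org.apache.lucene.util.fst.Builder;
    import org.apache.lucene.util.fst.FST;
    import org.apache.lucene.util.fst.PositiveIntOutputs;
    import org.apache.lucene.util.fst.Util;

    // Build a term -> ord FST; inputs must be added in sorted (ord) order.
    static FST<Long> buildTermOrdFst(BytesRef[] termsInOrdOrder) throws IOException {
      PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
      Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
      IntsRefBuilder scratch = new IntsRefBuilder();
      for (int ord = 0; ord < termsInOrdOrder.length; ord++) {
        builder.add(Util.toIntsRef(termsInOrdOrder[ord], scratch), (long) ord);
      }
      return builder.finish();
    }

    // term -> ord stays cheap: Util.get(fst, term).
    // ord -> term has to walk the FST by output (Util.getByOutput for these
    // monotonic long outputs), which is the extra lookup cost mentioned above.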
Anyway I think we can pursue these cool ideas on new [future] issues.
Read full article from [LUCENE-3003] Move UnInvertedField into Lucene core - ASF JIRA