Column Stride Fields aka. DocValues
Stored Fields serve a different purpose
• loading body or title fields for result rendering / highlighting
• well suited for loading multiple values
• Stored Fields add one indirection per document, so each document lookup goes to disk twice
• on-disk random access is too slow
• remember: Lucene may score millions of documents even if you render only the top 10 or 20!
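The "going to disk twice" point can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class `StoredFieldsSketch`, its length-prefixed record layout, and the `accesses` counter are hypothetical and do not reflect Lucene's actual stored-fields file format. The idea it shows is that a lookup first resolves the docID to an offset and only then reads the record.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical model of a stored-fields layout; not Lucene's actual file format.
public class StoredFieldsSketch {
    private final int[] offsets;    // stands in for the index file: docID -> record start
    private final ByteBuffer data;  // stands in for the data file: length-prefixed records
    public int accesses = 0;        // counts simulated disk accesses

    public StoredFieldsSketch(String[] docs) {
        byte[][] encoded = new byte[docs.length][];
        int size = 0;
        for (int i = 0; i < docs.length; i++) {
            encoded[i] = docs[i].getBytes(StandardCharsets.UTF_8);
            size += Integer.BYTES + encoded[i].length;
        }
        offsets = new int[docs.length];
        data = ByteBuffer.allocate(size);
        for (int i = 0; i < docs.length; i++) {
            offsets[i] = data.position();   // remember where this record starts
            data.putInt(encoded[i].length);
            data.put(encoded[i]);
        }
    }

    public String load(int docID) {
        int start = offsets[docID];         // access 1: look up the record offset
        accesses++;
        int length = data.getInt(start);    // access 2: read the record itself
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = data.get(start + Integer.BYTES + i);
        }
        accesses++;
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Two accesses per document are acceptable when rendering 10 or 20 hits, but not when a value is needed for every document that gets scored.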
Lucene can un-invert a field into FieldCache
FieldCache is fast once it has been loaded (once!)
• Constant time lookup DocID to value
• Efficient representation
• primitive array
• low GC overhead
• loading can be slow (realtime can be a problem)
• must parse values
• builds unnecessary term dictionary
• always memory resident
• Stored Fields are not fast enough for random access
• FieldCache is fast once loaded
• abuses an inverted index
• must convert to String and from String
• requires a fair amount of memory
• Lucene is missing a native data structure for primitive per-document values
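To make the un-inverting cost concrete, here is a minimal sketch of the idea, assuming the inverted index is modeled as a sorted map from term text to the docIDs containing that term. The class `UnInvertSketch` and its `unInvert` method are hypothetical names for illustration, not Lucene's FieldCache implementation.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of un-inverting a numeric field; not Lucene code.
public class UnInvertSketch {

    /**
     * postings maps each term (the value encoded as a String) to the docIDs
     * that contain it. The result maps docID -> value in a primitive array.
     */
    public static float[] unInvert(Map<String, int[]> postings, int maxDoc) {
        float[] values = new float[maxDoc];   // documents without a term keep 0.0f
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            // Every term has to be parsed back from its String form.
            float value = Float.parseFloat(entry.getKey());
            for (int docID : entry.getValue()) {
                values[docID] = value;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new TreeMap<>();
        postings.put("1.5", new int[] {0, 2});
        postings.put("10.3", new int[] {1});
        float[] values = unInvert(postings, 4);
        System.out.println(values[1]);        // constant-time lookup by docID
    }
}
```

The whole term dictionary is walked and parsed before the first lookup can be served, which is the loading cost the bullets describe; after that, each lookup is a single array access.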
A dense column-based storage
• 1 value per document
• accepts primitives - no conversion from / to String
• int & long
• float & double
• byte[]
• each field has a DocValues Type but can still be indexed or stored
• Entirely optional
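The properties listed above can be sketched as a fixed-width column addressed by docID. This is an illustrative model only: the class `DenseFloatColumn` is a hypothetical name, and the layout (one 4-byte float per document in a `ByteBuffer`) is an assumption, not the on-disk format DocValues uses.

```java
import java.nio.ByteBuffer;

// Hypothetical dense float column: exactly one fixed-width slot per document.
public class DenseFloatColumn {
    private final ByteBuffer buffer;

    public DenseFloatColumn(int maxDoc) {
        // Every document gets a slot, whether or not a value is ever set.
        buffer = ByteBuffer.allocate(maxDoc * Float.BYTES);
    }

    public void set(int docID, float value) {
        // The value is written as a primitive; no String conversion is involved.
        buffer.putFloat(docID * Float.BYTES, value);
    }

    public float get(int docID) {
        // The slot address is computed directly from the docID: no indirection.
        return buffer.getFloat(docID * Float.BYTES);
    }

    public static void main(String[] args) {
        DenseFloatColumn pageRank = new DenseFloatColumn(3);
        pageRank.set(1, 10.3f);
        System.out.println(pageRank.get(1));
    }
}
```

Because the slot position follows from the docID alone, the same layout works whether the bytes sit in RAM or on disk, and nothing has to be parsed or un-inverted at load time.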
Let's look at the API - Indexing
Adding DocValues follows existing patterns: simply use Fieldable
Document doc = new Document();
float pageRank = 10.3f;
DocValuesField valuesField = new DocValuesField("pageRank");
valuesField.setFloat(pageRank);
doc.add(valuesField);
writer.addDocument(doc);
// A field can be indexed as usual and carry DocValues at the same time
String titleText = "The quick brown fox";
Field field = new Field("title", titleText, Store.NO, Index.ANALYZED);
DocValuesField titleDV = new DocValuesField("title");
titleDV.setBytes(new BytesRef(titleText), Type.BYTES_VAR_DEREF);
// attach the DocValues to the indexed field
field.setDocValues(titleDV);
Looking at the API - Search / Retrieve
IndexReader reader = ...;
DocValues values = reader.docValues("pageRank");
// load the RAM resident source for random access
Source source = values.getSource();
// x is the docID whose value is looked up
double value = source.getFloat(x);
// still allows iterating over the RAM resident values
DocValuesEnum floatEnum = source.getEnum();
int doc;
// the enum fills this shared reference on each call to nextDoc()
FloatsRef ref = floatEnum.getFloat();
while ((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) {
    value = ref.floats[0];
}
The RAM-resident API is very similar to FieldCache
DocValuesEnum is still available on the RAM-resident API
Please read full article from Column Stride Fields aka. DocValues