Column Stride Fields aka. DocValues



Stored Fields serve a different purpose
• loading body or title fields for result rendering / highlighting
• well suited for loading multiple values per document
• Stored Fields add one indirection per document, so each lookup goes to disk twice
• on-disk random access is too slow for per-document lookups
• remember: Lucene may score millions of documents even if you render only the top 10 or 20!
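The double disk access mentioned above can be made concrete with a small model. The following is only a sketch and not Lucene's actual file format: the class `StoredFieldsSketch`, its `seeks` counter, and the length-prefixed record layout are hypothetical stand-ins, with an in-memory array and buffer playing the role of the two on-disk files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Toy model of a stored-fields layout: an offsets "file" pointing into a data "file". */
final class StoredFieldsSketch {
    private final long[] offsets;   // stands in for the index file: docID -> record offset
    private final ByteBuffer data;  // stands in for the data file with variable-length records
    int seeks = 0;                  // counts simulated random accesses

    StoredFieldsSketch(String[] values) {
        byte[][] encoded = new byte[values.length][];
        int size = 0;
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            size += Integer.BYTES + encoded[i].length;  // length prefix + payload
        }
        offsets = new long[values.length];
        data = ByteBuffer.allocate(size);
        for (int doc = 0; doc < values.length; doc++) {
            offsets[doc] = data.position();
            data.putInt(encoded[doc].length).put(encoded[doc]);
        }
    }

    /** Loads the value of one document: two random accesses per call. */
    String load(int docID) {
        seeks++;                          // first access: look up the record offset
        int pos = (int) offsets[docID];
        seeks++;                          // second access: jump to the record itself
        int length = data.getInt(pos);
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = data.get(pos + Integer.BYTES + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Calling `load` for every scored document doubles the number of random accesses, which is the cost the list above describes.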

Lucene can un-invert a field into FieldCache
FieldCache is fast, but only once it has been loaded!
• Constant-time lookup from DocID to value
• Efficient representation
  • primitive array
  • low GC overhead
• loading can be slow (realtime can be a problem)
  • must parse values
  • builds an unnecessary term dictionary
• always memory resident
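To illustrate what un-inverting involves, here is a minimal sketch. It is not Lucene's FieldCache implementation: the class `UninvertSketch` and the use of a plain `Map` from term text to a list of docIDs as the inverted field are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

/** Toy un-inversion: turns term -> docIDs into a docID -> value array. */
final class UninvertSketch {
    /**
     * @param invertedField hypothetical inverted field: term text -> docIDs containing the term
     * @param maxDoc        number of documents; docs without a term keep the default 0.0f
     */
    static float[] uninvert(Map<String, List<Integer>> invertedField, int maxDoc) {
        float[] values = new float[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : invertedField.entrySet()) {
            // Every term has to be parsed from its String form back to a primitive.
            float parsed = Float.parseFloat(entry.getKey());
            for (int docID : entry.getValue()) {
                values[docID] = parsed;
            }
        }
        return values;  // constant-time lookup afterwards: values[docID]
    }
}
```

The whole term dictionary has to be walked and parsed before the first lookup can be served, which is why loading is the slow part while lookups afterwards are a single array access.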

• Stored Fields are not fast enough for random access
• FieldCache is fast once loaded
  • abuses the inverted index
  • must convert to String and from String
  • requires a fair amount of memory
• Lucene is missing a native data structure for primitive per-document values

Dense column-based storage
• 1 value per document
• accepts primitives - no conversion from / to String
  • int & long
  • float & double
  • byte[]
• each field has a DocValues Type but can still be indexed or stored
• Entirely optional
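The idea of a dense column with one primitive value per document can be sketched in a few lines. This is an illustration under stated assumptions, not the DocValues on-disk format: the class `FloatColumnSketch` is hypothetical and uses a fixed-width `ByteBuffer` as the column.

```java
import java.nio.ByteBuffer;

/** Toy dense column: one fixed-width float per document, addressed by docID. */
final class FloatColumnSketch {
    private final ByteBuffer column;

    FloatColumnSketch(int maxDoc) {
        // Dense: space is reserved for every document, whether it has a value or not.
        column = ByteBuffer.allocate(maxDoc * Float.BYTES);
    }

    /** Writes the primitive directly; no conversion to or from String. */
    void set(int docID, float value) {
        column.putFloat(docID * Float.BYTES, value);
    }

    /** The value position is computed from the docID, so no offset table is needed. */
    float get(int docID) {
        return column.getFloat(docID * Float.BYTES);
    }
}
```

Because every value has the same width, the position of a document's value is simply `docID * width`, which removes the extra indirection that a variable-length record layout needs.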

Let's look at the API - Indexing
Adding DocValues follows existing patterns: simply use Fieldable
// Example 1: a standalone DocValues field holding a float per document
Document doc = new Document();
float pageRank = 10.3f;
DocValuesField valuesField = new DocValuesField("pageRank");
valuesField.setFloat(pageRank);
doc.add(valuesField);
writer.addDocument(doc);

// Example 2: attaching DocValues to a field that is also indexed
String titleText = "The quick brown fox";
Field field = new Field("title", titleText, Store.NO, Index.ANALYZED);
DocValuesField titleDV = new DocValuesField("title");
titleDV.setBytes(new BytesRef(titleText), Type.BYTES_VAR_DEREF);
field.setDocValues(titleDV);

Looking at the API - Search / Retrieve
IndexReader reader = ...;
DocValues values = reader.docValues("pageRank");
// RAM-resident random access; x is the DocID to look up
Source source = values.getSource();
double value = source.getFloat(x);

// still allows iterating over the RAM-resident values
DocValuesEnum floatEnum = source.getEnum();
int doc;
FloatsRef ref = floatEnum.getFloat();
while ((doc = floatEnum.nextDoc()) != DocValuesEnum.NO_MORE_DOCS) {
    value = ref.floats[0];
}
• The RAM-resident API is very similar to FieldCache
• DocValuesEnum is still available on the RAM-resident API
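To show why a cheap DocID-to-value lookup matters at search time, here is a small sketch of scoring with a per-document value. It does not use the Lucene API shown above: the class `BoostedTopKSketch`, the plain `float[]` standing in for the RAM-resident source, and the "base score times pageRank" formula are all assumptions made for illustration.

```java
import java.util.PriorityQueue;

/** Toy collector: scores every hit with a per-document value, keeps only the top k. */
final class BoostedTopKSketch {
    /** Returns the docIDs of the k best hits, best first. */
    static int[] topK(int[] matchingDocs, float[] baseScores, float[] pageRank, int k) {
        float[] scores = new float[matchingDocs.length];
        // Min-heap over hit positions, so the weakest of the current top k is evicted first.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
        for (int i = 0; i < matchingDocs.length; i++) {
            // One array lookup per scored document, however many documents match.
            scores[i] = baseScores[i] * pageRank[matchingDocs[i]];
            heap.add(i);
            if (heap.size() > k) {
                heap.poll();
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = matchingDocs[heap.poll()];  // weakest remaining hit fills the last free slot
        }
        return result;
    }
}
```

Every matching document pays for one value lookup even though only `k` results are returned, so the lookup has to be far cheaper than a stored-field read.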