Hacking Lucene - The Index format - Hacker Labs



Hacking Lucene - The Index format - Hacker Labs
Field Information (.fnm)
Field info file (with suffix .fnm) records the index time attributes and field names for every field. Fields are numbered by their order in this file. Thus field zero is the first field in the file, field one the next, and so on. Note that, like document numbers, field numbers are segment relative.
Stored Fields
Field Index.fdxContains pointers to field data
Field Data.fdtThe stored fields for documents
Stored fields are the original raw text values that were given to Lucene.  This is not really part of the inverted index structure – its simply a mapping from document id’s to stored field data. Stored fields are represented by two files:
  1. The field index, or .fdx file. This contains, for each document, a pointer to its field data.
  2. The field data, or .fdt file. This contains the stored fields of each document.

Term Dictionary (.tis, .tii)
Term Dictionary.timThe term dictionary, stores term info
Term Index.tipThe index into the Term Dictionary
Frequencies.docContains the list of docs which contain each term along with frequency
Positions.posStores position information about where a term occurs in the index
Payloads.payStores additional per-position metadata information such as character offsets and user payloads
The Term Dictionary stores how to navigate the various other files for each term. At a simple, high level, the Term Dictionary will tell you where to look in the frq and prx files for further information related to that term. Terms are stored in alphabetical order for fast lookup. Further, another data structure, the Term Info Index, is designed to be read entirely into memory and used to provide random access to the “tis” file. Other book keeping (skip lists, index intervals) is also tracked with the Term Dictionary.

Term Infos (.tis)

Term Infos file (.tis) contains all terms in a segment ordered by field name then value within field. It also contain the document frequency for each term (Inverted index so how many document contain that term).

Term Info Index (.tii)

This contains every IndexIntervalth entry from the .tis file, along with its location in the “tis” file. This is designed to be read entirely into memory and used to provide random access to the “tis” file. The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.


The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document.

Position Location Data( .prx)
The .prx file contains the lists of positions that each term occurs at within documents.  If omitTf is set to true in a field, no entry for that term will be stored in this file, and if omitTf is set for all terms, the .prx files will not exist.

Read full article from Hacking Lucene - The Index format - Hacker Labs

No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts