All About Programming: Hacking Lucene - The Index format

Hacking Lucene - The Index format - Hacker Labs
Field Information (.fnm)

The field info file (suffix .fnm) records the index-time attributes and the field name of every field. Fields are numbered by their order in this file: field zero is the first field in the file, field one the next, and so on. Note that, like document numbers, field numbers are segment-relative.
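The numbering rule can be illustrated with a small toy model. This is only a sketch: the class `ToySegmentFieldInfos` and its methods are hypothetical names made up for this example, and it does not read or write the real .fnm byte layout. It only shows that a field's number is its position within one segment, so the same field name can have different numbers in different segments.

```python
class ToySegmentFieldInfos:
    """Hypothetical stand-in for one segment's field list (not the real .fnm layout)."""

    def __init__(self):
        self._names = []    # position in this list == field number
        self._numbers = {}  # field name -> field number

    def add(self, name):
        # A field receives the next free number the first time it is seen.
        if name not in self._numbers:
            self._numbers[name] = len(self._names)
            self._names.append(name)
        return self._numbers[name]

    def number_of(self, name):
        return self._numbers[name]

    def name_of(self, number):
        return self._names[number]


# Two independent segments: numbering restarts in each one.
seg_a = ToySegmentFieldInfos()
for field in ["title", "body", "title"]:
    seg_a.add(field)

seg_b = ToySegmentFieldInfos()
for field in ["body", "author"]:
    seg_b.add(field)
```

Here "body" is field one in `seg_a` but field zero in `seg_b`, which is what "segment-relative" means in practice.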

Stored Fields

`Field Index`	.fdx	Contains pointers to field data
`Field Data`	.fdt	The stored fields for documents

Stored fields are the original raw text values that were given to Lucene. They are not really part of the inverted index structure; they are simply a mapping from document IDs to stored field data. Stored fields are represented by two files:

The field index, or .fdx file. This contains, for each document, a pointer to its field data.
The field data, or .fdt file. This contains the stored fields of each document.
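The relationship between the two files can be sketched in a few lines. The byte layout below is an assumption chosen for readability (8-byte big-endian offsets in the index, length-prefixed UTF-8 strings in the data) and is not Lucene's actual encoding; the function names `write_stored_fields` and `read_stored_fields` are hypothetical. The point is only that a fixed-width index entry per document makes it possible to jump straight to that document's data.

```python
import io
import struct


def write_stored_fields(docs):
    """Return (fdx_bytes, fdt_bytes) for a list of {field name: value} dicts."""
    fdx, fdt = io.BytesIO(), io.BytesIO()
    for doc in docs:
        # Index entry: where this document's data starts in the data file.
        fdx.write(struct.pack(">Q", fdt.tell()))
        # Data entry: field count, then length-prefixed name and value.
        fdt.write(struct.pack(">I", len(doc)))
        for name, value in doc.items():
            for text in (name, value):
                raw = text.encode("utf-8")
                fdt.write(struct.pack(">I", len(raw)))
                fdt.write(raw)
    return fdx.getvalue(), fdt.getvalue()


def read_stored_fields(fdx, fdt, doc_id):
    """Load one document by seeking through the index, without scanning others."""
    (offset,) = struct.unpack_from(">Q", fdx, doc_id * 8)
    (count,) = struct.unpack_from(">I", fdt, offset)
    offset += 4
    fields = {}
    for _field in range(count):
        parts = []
        for _part in range(2):  # name, then value
            (length,) = struct.unpack_from(">I", fdt, offset)
            offset += 4
            parts.append(fdt[offset:offset + length].decode("utf-8"))
            offset += length
        fields[parts[0]] = parts[1]
    return fields


docs = [{"title": "Lucene", "body": "index format"}, {"title": "Héllo"}]
fdx_bytes, fdt_bytes = write_stored_fields(docs)
second_doc = read_stored_fields(fdx_bytes, fdt_bytes, 1)
```

Because every index entry has the same width, the entry for document `n` sits at byte `n * 8` of the toy index, so reading document 1 never touches the bytes of document 0.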

Term Dictionary (.tis, .tii)

`Term Dictionary`	.tim	The term dictionary, stores term info
`Term Index`	.tip	The index into the Term Dictionary
`Frequencies`	.doc	Contains the list of docs which contain each term along with frequency
`Positions`	.pos	Stores position information about where a term occurs in the index
`Payloads`	.pay	Stores additional per-position metadata information such as character offsets and user payloads

The Term Dictionary stores how to navigate the various other files for each term. At a high level, the Term Dictionary tells you where to look in the .frq and .prx files for further information related to that term. Terms are stored in sorted order for fast lookup. A second data structure, the Term Info Index, is designed to be read entirely into memory and used to provide random access to the .tis file. Other bookkeeping (skip lists, index intervals) is also tracked with the Term Dictionary. Note that the table above lists the file extensions used by newer Lucene versions (.tim, .tip, .doc, .pos, .pay), while the text here uses the older names (.tis, .tii, .frq, .prx) for the corresponding data.

Term Infos (.tis)

The Term Infos file (.tis) contains all terms in a segment, ordered by field name and then by value within the field. It also contains the document frequency of each term, that is, how many documents contain that term.

Term Info Index (.tii)

This file contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. It is designed to be read entirely into memory and used to provide random access to the .tis file. The structure of this file is very similar to that of the .tis file, with the addition of one item per record, the IndexDelta.
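The sampling idea can be sketched as follows. This is a simplified model, not the on-disk format: the sorted term list stands in for the .tis file, a list position stands in for the file pointer, and `INDEX_INTERVAL = 4` and the function names are assumptions made for the demo. A lookup does a binary search over the small in-memory sample and then scans at most one interval of the full list.

```python
import bisect

INDEX_INTERVAL = 4  # assumed value for this demo only


def build_term_index(sorted_terms, interval=INDEX_INTERVAL):
    """Keep every interval-th term together with its position in the full list."""
    return [(sorted_terms[i], i) for i in range(0, len(sorted_terms), interval)]


def lookup(term, sorted_terms, term_index, interval=INDEX_INTERVAL):
    """Return the position of term in sorted_terms, or None if it is absent."""
    keys = [indexed_term for indexed_term, _ in term_index]
    # Last sampled term that is <= the term we are looking for.
    slot = bisect.bisect_right(keys, term) - 1
    if slot < 0:
        return None  # term sorts before the first entry
    start = term_index[slot][1]
    # Scan only the block that could contain the term.
    for pos in range(start, min(start + interval, len(sorted_terms))):
        if sorted_terms[pos] == term:
            return pos
    return None


terms = ["apple", "banana", "cherry", "date", "elder",
         "fig", "grape", "kiwi", "lemon"]
term_index = build_term_index(terms)
grape_pos = lookup("grape", terms, term_index)
```

With nine terms and an interval of four, only "apple", "elder" and "lemon" are held in the sample; finding "grape" lands on the "elder" block and scans forward from there.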

Frequencies (.frq)

The .frq file contains, for each term, the list of documents that contain the term, along with the frequency of the term in each of those documents.
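As a rough illustration of what this data looks like logically, the sketch below builds an in-memory postings map from a few short texts. It assumes whitespace tokenization and lowercasing, which is far simpler than a real Lucene analyzer, and `build_postings` is a hypothetical helper name. Each term maps to a list of (document number, frequency) pairs, and the length of that list is the term's document frequency.

```python
from collections import Counter, defaultdict


def build_postings(docs):
    """Map each term to a list of (doc_id, term frequency) pairs, in doc order."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        counts = Counter(text.lower().split())
        for term in sorted(counts):
            postings[term].append((doc_id, counts[term]))
    return dict(postings)


docs = [
    "lucene index format",
    "lucene stores the index the index",
    "stored fields",
]
postings = build_postings(docs)
index_doc_freq = len(postings["index"])  # number of documents containing "index"
```

The term "index" appears once in document 0 and twice in document 1, so its postings list has two entries and its document frequency is 2.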

Position Location Data (.prx)

The .prx file contains the lists of positions at which each term occurs within documents. If omitTf is set to true for a field, no entries for the terms of that field are stored in this file, and if omitTf is set for all fields, the .prx file will not exist.
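Position data is what makes phrase-style matching possible, which a small sketch can show. The code assumes whitespace tokenization, and the names `build_positions`, `phrase_match` and the `omit_positions` flag are hypothetical; the flag only imitates the effect described above (no position data at all) and is not the Lucene option itself.

```python
from collections import defaultdict


def build_positions(docs, omit_positions=False):
    """Map term -> {doc_id: [positions]}; empty when positions are omitted."""
    if omit_positions:
        return {}
    positions = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            positions[term].setdefault(doc_id, []).append(pos)
    return dict(positions)


def phrase_match(positions, first, second, doc_id):
    """True if `second` occurs directly after `first` in the given document."""
    first_pos = positions.get(first, {}).get(doc_id, [])
    second_pos = set(positions.get(second, {}).get(doc_id, []))
    return any(p + 1 in second_pos for p in first_pos)


docs = ["the index format", "format of the index"]
positions = build_positions(docs)
adjacent_in_doc0 = phrase_match(positions, "index", "format", 0)
adjacent_in_doc1 = phrase_match(positions, "index", "format", 1)
```

Both documents contain "index" and "format", but only document 0 has them next to each other in that order. Without position data, as in `build_positions(docs, omit_positions=True)`, that distinction cannot be made.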

Read full article from Hacking Lucene - The Index format - Hacker Labs
