1.1 An example information retrieval problem
Structured data: for example, a relational database. The term "unstructured data" refers to data which does not have clear, semantically overt, easy-for-a-computer structure.

Incidence matrix: columns are documents and rows are words. Every row can be treated as a bit vector, so a Boolean query can be answered by applying bitwise AND, OR, and NOT to the rows of the query terms.

Boolean retrieval model: a query can be any Boolean expression over terms. The model views each document as just a set of words.

Some basic definitions:

Information need: the topic the user wants to know more about. It differs from the query, which is what the user actually submits to the system. A document is relevant if it has value with respect to the user's personal information need.

Two key statistics for evaluating an IR system:
1. Precision: the fraction of the returned results that are relevant to the information need.
2. Recall: the fraction of the relevant documents in the collection that were returned in the result.
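To make the bit-vector idea concrete, here is a minimal sketch of a term-document incidence matrix in Python, followed by a precision and recall calculation. The toy documents, the helper names (`build_incidence`, `decode`, `precision_recall`), and the relevance judgments are all made up for illustration; each matrix row is stored as a Python integer used as a bit vector.

```python
# Toy collection; the documents and their text are invented for this example.
docs = {
    1: "brutus killed caesar",
    2: "caesar praised brutus and calpurnia",
    3: "calpurnia warned caesar",
    4: "antony mourned caesar",
}
doc_ids = sorted(docs)


def build_incidence(docs, doc_ids):
    """Map each term to an int whose bit i is set if doc_ids[i] contains the term."""
    rows = {}
    for i, doc_id in enumerate(doc_ids):
        for term in set(docs[doc_id].split()):
            rows[term] = rows.get(term, 0) | (1 << i)
    return rows


def decode(bits, doc_ids):
    """Turn a bit vector back into the list of docIDs whose bit is set."""
    return [d for i, d in enumerate(doc_ids) if (bits >> i) & 1]


def precision_recall(returned, relevant):
    """Precision and recall of a returned set against a set of relevant docIDs."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


rows = build_incidence(docs, doc_ids)
all_docs = (1 << len(doc_ids)) - 1  # mask so that NOT stays inside the collection

# Query: brutus AND caesar AND NOT calpurnia
hits = rows["brutus"] & rows["caesar"] & (all_docs & ~rows["calpurnia"])
result = decode(hits, doc_ids)
print(result)  # [1]

# Assumed relevance judgments for this query (hypothetical): docs 1 and 4.
print(precision_recall(result, {1, 4}))  # (1.0, 0.5)
```

With these assumed judgments the single returned document is relevant, so precision is 1.0, but only one of the two relevant documents was found, so recall is 0.5.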
1.2 A first take at building an inverted index
A term-document matrix is usually sparse, so it is better to record only the nonzero entries. This is the motivation for the inverted index.

Inverted index: an index that maps back from terms to the parts of a document where they occur. It consists of a dictionary plus postings lists.

Tokens and normalized tokens are loosely equivalent to words (normalized tokens are also called terms).

Sorting: the list of (term, docID) pairs is sorted and grouped to form the inverted index. The dictionary entry for each term also records its document frequency, and a posting can hold other information such as term frequency (the number of times the term occurs in the document) and position.

Terms are sorted alphabetically and postings are sorted by docID.

The dictionary can be kept in memory, but postings lists are usually stored on disk. If part of the postings lists is in memory, a linked list or a variable-length array can be used.
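The sort-based construction described above can be sketched as follows. This is only an in-memory illustration under simple assumptions: tokenization is a naive lowercase whitespace split, the toy documents are invented, and the function names `build_inverted_index` and `intersect` are chosen for this example. Each posting is stored as `[docID, term_frequency]`, and the length of a postings list gives the term's document frequency.

```python
def build_inverted_index(docs):
    """Build {term: [[docID, term_frequency], ...]} by sorting (term, docID) pairs."""
    # Step 1: collect one (term, docID) pair per token occurrence.
    pairs = [
        (token, doc_id)
        for doc_id, text in docs.items()
        for token in text.lower().split()
    ]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: group pairs into postings lists, counting repeats as term frequency.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if postings and postings[-1][0] == doc_id:
            postings[-1][1] += 1
        else:
            postings.append([doc_id, 1])
    return index


def intersect(p1, p2):
    """Merge two docID-sorted postings lists and return the docIDs found in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i][0] == p2[j][0]:
            answer.append(p1[i][0])
            i += 1
            j += 1
        elif p1[i][0] < p2[j][0]:
            i += 1
        else:
            j += 1
    return answer


# Toy documents, invented for this example.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)

print(index["in"])                                # [[2, 1], [3, 2]]
print(len(index["home"]))                         # document frequency: 3
print(intersect(index["home"], index["july"]))    # [2, 3]
```

Because the postings are kept sorted by docID, the AND of two terms needs only a single linear pass over both lists, which is the main reason for that sort order.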
Read full article from 瞬之与容's notes on "Introduction to Information Retrieval" (23)