Notes by 瞬之与容 on "Introduction to Information Retrieval" (23)



  • 1.1 An example information retrieval problem
    The term "unstructured data" refers to data that does not have a clear, semantically overt, easy-for-a-computer structure.
    Structured data: the canonical example is a relational database.
    Incidence matrix: columns are documents and rows are words (terms); an entry is 1 if the word occurs in the document and 0 otherwise.
    Every row can be treated as a binary vector, so Boolean queries can be answered by applying bitwise AND, OR, and NOT to the rows of the query terms.
    Boolean retrieval model: any query can be posed as long as it is a Boolean expression of terms. The model views each document as just a set of words.
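    The bitwise approach can be sketched as follows. This is a minimal illustration, not the book's code: the four toy documents and the helper names `build_incidence` and `matching_docs` are made up here, and each matrix row is stored as a Python integer whose bit d is set when the term occurs in document d.

```python
# Toy collection; the documents and their contents are invented for illustration.
docs = {
    0: "brutus caesar calpurnia",
    1: "brutus caesar",
    2: "caesar calpurnia",
    3: "brutus",
}

def build_incidence(docs):
    """Return {term: int}; bit d of the int is 1 iff the term occurs in document d."""
    rows = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            rows[term] = rows.get(term, 0) | (1 << doc_id)
    return rows

def matching_docs(bits, n_docs):
    """Decode a bit vector back into a sorted list of document IDs."""
    return [d for d in range(n_docs) if (bits >> d) & 1]

rows = build_incidence(docs)
n = len(docs)
all_docs = (1 << n) - 1  # mask so that NOT stays inside the collection

# Query: brutus AND caesar AND NOT calpurnia
result = rows["brutus"] & rows["caesar"] & (all_docs & ~rows["calpurnia"])
print(matching_docs(result, n))
```

    The mask `all_docs` is needed because Python integers are unbounded, so a bare `~` would set bits for documents that do not exist.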
    Some basic definitions:
    Information need: the topic the user wants to know about. A query is what the user types to convey that need, and a document is relevant if the user perceives it as containing information of value with respect to the need.
    Two key statistics to evaluate an IR system:
    1. Precision: the fraction of the returned results that are relevant to the information need.
    2. Recall: the fraction of the relevant documents in the collection that were returned by the system.
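    As a small worked example of the two statistics above, the sketch below computes both from sets of document IDs. The function name `precision_recall` and the sample IDs are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document IDs.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    An empty denominator is reported as 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The system returned 4 documents; 5 documents in the collection are relevant;
# 2 of the returned ones (IDs 2 and 4) are among the relevant ones.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)
```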
    2011-12-03 10:01:56, 2 replies
  • 1.2 A first take at building an inverted index
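    A term-document matrix is usually very sparse, so it is better to record only the nonzero entries (the 1s). This is what an inverted index does.
    Inverted index: an index that maps each term back to the documents (or the parts of a document) in which it occurs.
    It has two parts: a dictionary of terms and, for each term, a postings list.
    Tokens and normalized tokens are loosely equivalent to words; the normalized tokens that get indexed are called terms.
    Sorting: the list of (term, docID) pairs is sorted and then grouped by term to form the inverted index. The dictionary entry for each term also records its document frequency, i.e. the length of its postings list.
    A posting can hold other information as well, such as the term frequency (the number of times the term occurs in the document) and the positions of the term in the document.
    The terms are sorted alphabetically and the postings in each list are sorted by docID.
    The dictionary is commonly kept in memory, while postings lists are usually kept on disk. For a postings list that is held in memory, two good choices are a singly linked list or a variable-length array.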
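    The sort-based construction described above can be sketched in a few lines. This is an illustrative toy, not a production indexer: the three documents, the whitespace tokenizer, and the function name `build_inverted_index` are assumptions, and everything is held in memory as Python lists (a variable-length array per postings list).

```python
from itertools import groupby

def build_inverted_index(docs):
    """Build {term: [(docID, term_frequency), ...]} by sorting (term, docID) pairs."""
    # Step 1: emit one (term, docID) pair per token occurrence, then sort
    # by term and, within a term, by docID.
    pairs = sorted(
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()  # naive tokenization and normalization
    )
    # Step 2: collapse runs of identical pairs into one posting with a count.
    index = {}
    for (term, doc_id), run in groupby(pairs):
        tf = sum(1 for _ in run)
        index.setdefault(term, []).append((doc_id, tf))
    return index

# Toy documents invented for illustration.
docs = {1: "new home sales", 2: "home sales rise home", 3: "new sales"}
index = build_inverted_index(docs)

for term, postings in index.items():
    # Document frequency is simply the length of the postings list.
    print(term, len(postings), postings)
```

    Because the pairs are sorted before grouping, the dictionary comes out in alphabetical order and every postings list is already ordered by docID, with no extra sorting pass.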


