All About Programming: Salmon Run: Computing Document Similarity using Lucene Term Vectors

Then, for each document you want to include in the similarity computation, extract its term vector. The term vector gives you two arrays: the terms that occur in the document, and the corresponding frequency of each term in that document. Using these three data structures, it is easy to construct a (sparse) document vector representing the document(s).
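To make the construction concrete, here is a minimal sketch in plain Python rather than against the Lucene API. It assumes the third data structure is a mapping from every term to a fixed position in the vector, which this excerpt does not spell out. The function names, the sample terms and frequencies, and the choice of cosine similarity as the measure are all illustrative assumptions, not the article's code.

```python
import math

def build_term_index(all_terms):
    """Map each distinct term to a fixed position in the document vector.
    (Assumed third data structure; hypothetical helper.)"""
    return {term: pos for pos, term in enumerate(sorted(set(all_terms)))}

def to_sparse_vector(terms, freqs, term_index):
    """Combine the parallel term/frequency arrays of one term vector
    into a sparse {position: weight} document vector."""
    return {term_index[t]: float(f) for t, f in zip(terms, freqs) if t in term_index}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(pos, 0.0) for pos, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Made-up term vectors standing in for what an index would return.
doc1_terms, doc1_freqs = ["lucene", "term", "vector"], [2, 1, 1]
doc2_terms, doc2_freqs = ["lucene", "index", "vector"], [1, 3, 1]

term_index = build_term_index(doc1_terms + doc2_terms)
v1 = to_sparse_vector(doc1_terms, doc1_freqs, term_index)
v2 = to_sparse_vector(doc2_terms, doc2_freqs, term_index)
similarity = cosine_similarity(v1, v2)
```

Storing only the non-zero positions keeps each vector proportional to the number of distinct terms in the document rather than to the size of the whole vocabulary, which is the point of calling the vector sparse.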

Using Lucene's term vectors to generate document vectors is useful not only for similarity calculations but also for other tasks that need document vectors, such as clustering and classification. Computing document vectors directly from raw data is typically more involved than loading the data into a Lucene index: it takes more development time and can run into scalability issues.

Read full article from Salmon Run: Computing Document Similarity using Lucene Term Vectors
