Salmon Run: Near Duplicate Detection using MinHashing and Solr

Recently there was a discussion on LinkedIn (login required) about detecting near-duplicate documents. Someone suggested using NGrams, which prompted me to suggest using More Like This (MLT) queries in Solr with shingles as terms (the general idea of querying with shingles is explained here).

Turns out this was a bit naive. For one thing, the Solr ShingleFilter works differently than you would expect. For example, for the phrase "lipstick on a pig", you would expect the 3-grams to be the set of tokens {"lipstick on a", "on a pig"}. With Solr's ShingleFilter, for minShingleSize=3, maxShingleSize=3 and outputUnigrams=true, you instead get {"lipstick|lipstick on a", "on|on a pig"}, i.e., synonym tokens with the unigram as the anchor and the n-gram(s) as synonyms. If you set outputUnigrams=false, no shingles are produced at all, because there is no anchor for the synonym term. Further, since MLT works by matching tokens found by analyzing the query document against tokens in the index documents, the only way to implement my original suggestion would be to write a custom ShingleFilter.
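Here is a minimal, standalone sketch of that synonym-style output, assuming a reasonably recent Lucene version on the classpath; the class and parameter names are Lucene's own, and the demo class itself is just for illustration:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class ShingleFilterDemo {
    public static void main(String[] args) throws Exception {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("lipstick on a pig"));

        // minShingleSize=3, maxShingleSize=3, outputUnigrams=true
        ShingleFilter shingles = new ShingleFilter(tokenizer, 3, 3);
        shingles.setOutputUnigrams(true);

        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncr =
            shingles.addAttribute(PositionIncrementAttribute.class);

        shingles.reset();
        while (shingles.incrementToken()) {
            // Shingles come out with a position increment of 0, i.e. stacked
            // as synonyms on top of the unigram that anchors them.
            System.out.println(posIncr.getPositionIncrement() + " " + term);
        }
        shingles.end();
        shingles.close();
    }
}
```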

While I've built custom Solr components in the past, in hindsight I think it's generally a bad idea, for two reasons. First, the Lucene and Solr projects move quite fast and APIs change frequently. Second, custom extension points are often regarded as "expert" features, and there is less concern for backward compatibility in these APIs than in the user-facing ones. I guess the expectation is that people doing customizations are responsible for staying on top of API changes as they occur. Maybe that's true in general, but for me it's a potential annoyance I have to deal with each time I upgrade.

In any case, I figured out a user-centric approach to do this. Instead of having the analyzer produce shingles (n-grams), we decompose the content into shingles ourselves at index time and store them in a multi-valued string field (no tokenization beyond the shingling). When trying to find near duplicates, we search with a boolean query built from the shingles of the query document. This returns a ranked list of documents in which the topmost document shares the most shingles with the query document, the next one fewer, and so on; a rough sketch of the idea follows.
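Here is a small sketch of that approach, assuming the shingles are indexed into a hypothetical multi-valued string field called shingles_ss; the field name, the whitespace tokenization, and the helper names are my own illustration, not from the original article:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleQueryBuilder {

    // Build word-level n-gram shingles from whitespace-split text. The same
    // normalization (lowercasing, splitting) must be applied at index time,
    // since string fields are stored verbatim and not analyzed.
    static List<String> shingles(String text, int n) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return out;
    }

    // OR the shingles together against the (hypothetical) multi-valued
    // string field "shingles_ss"; each quoted clause is an exact match
    // against one stored shingle value.
    static String toBooleanQuery(List<String> shingles) {
        List<String> clauses = new ArrayList<>();
        for (String s : shingles) {
            clauses.add("shingles_ss:\"" + s + "\"");
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        List<String> s = shingles("lipstick on a pig is still a pig", 3);
        System.out.println(s);
        System.out.println(toBooleanQuery(s));
    }
}
```

Documents that share more shingles with the query document match more of the OR clauses and therefore score higher, which produces the ranked list described above.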

Read full article from Salmon Run: Near Duplicate Detection using MinHashing and Solr

