Identifying Duplicate Records with Fuzzy Matching | Mawazo



I was prompted to write this post  in response to a recent discussion thread in linkedin Hadoop Users Group regarding fuzzy string matching for duplicate record identification with Hadoop. As part of my open source Hadoop based recommendation engine project sifarish, I have a MapReduce class for fuzzy matching between  entities with multiple attributes. It’s used for content based recommendation. My initial reaction was that the  same MapReduce class could be used for duplicate detection, by modelling all the fields of a record as text field and calculating  Jaccard similarity between corresponding text fields.

On further thought, I realized that the Jaccard similarity may not be right choice and Levenshtein Distance is more appropriate. Accordingly, I added support for Levenshtein Distance based text matching to sifarish. In this post, I will show how it’s been used to detect duplicate records.

Fuzzy Entity Matching

The details of the matching algorithms can be found from my earlier posts. For an entity, various attribute types are supported including integer, double, categorical, text, time, location etc. Essentially for any pair of entities, distance is calculated between corresponding attributes. Attribute wise distances are aggregated over all the attributes of an entity to find the distance between two entities.

Why is Jaccard distance not appropriate for this use case? Given two blocks of text, Jaccard similarity depends on the number of  words common to both and the union of words in the the two text blocks. Two words will not match whether there is difference in one character or multiple characters. Levenshtein Distance gives us more fine grained text matching algorithm.


Read full article from Identifying Duplicate Records with Fuzzy Matching | Mawazo


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts