I was prompted to write this post in response to a recent discussion thread in the LinkedIn Hadoop Users Group regarding fuzzy string matching for duplicate record identification with Hadoop. As part of my open source Hadoop based recommendation engine project sifarish, I have a MapReduce class for fuzzy matching between entities with multiple attributes. It's used for content based recommendation. My initial reaction was that the same MapReduce class could be used for duplicate detection, by modelling all the fields of a record as text fields and calculating Jaccard similarity between corresponding text fields.
On further thought, I realized that Jaccard similarity may not be the right choice and that Levenshtein distance is more appropriate. Accordingly, I added support for Levenshtein distance based text matching to sifarish. In this post, I will show how it's been used to detect duplicate records.
Fuzzy Entity Matching
The details of the matching algorithms can be found in my earlier posts. For an entity, various attribute types are supported, including integer, double, categorical, text, time, and location. Essentially, for any pair of entities, a distance is calculated between each pair of corresponding attributes. The attribute-wise distances are then aggregated over all the attributes of an entity to find the distance between the two entities.
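To make the idea concrete, here is a minimal sketch of attribute-wise distance aggregation. This is an illustration of the general scheme, not sifarish's actual Java implementation; the function names, the equal-weight averaging, and the normalization by a value range are all assumptions made for the example.

```python
def numeric_distance(a, b, value_range):
    """Normalized distance between two numeric attribute values (assumed range known)."""
    return abs(a - b) / value_range

def categorical_distance(a, b):
    """0 if the categorical values match, 1 otherwise."""
    return 0.0 if a == b else 1.0

def entity_distance(e1, e2, attr_specs):
    """Average the per-attribute distances to get an entity-level distance.

    attr_specs is a list of (attribute name, attribute kind, parameter) tuples;
    the parameter is the value range for numeric attributes.
    """
    distances = []
    for name, kind, param in attr_specs:
        if kind == "numeric":
            distances.append(numeric_distance(e1[name], e2[name], param))
        elif kind == "categorical":
            distances.append(categorical_distance(e1[name], e2[name]))
    return sum(distances) / len(distances)

# Two records differing only slightly in age end up a small distance apart.
specs = [("age", "numeric", 100.0), ("city", "categorical", None)]
r1 = {"age": 30, "city": "NYC"}
r2 = {"age": 40, "city": "NYC"}
print(entity_distance(r1, r2, specs))  # 0.05
```

In practice each attribute can also carry a weight, so that fields more indicative of identity (e.g. name) count more heavily than others.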
Why is Jaccard distance not appropriate for this use case? Given two blocks of text, Jaccard similarity depends on the number of words common to both and the union of words in the two text blocks. Two words either match exactly or not at all, whether they differ by one character or by many. Levenshtein distance gives us a more fine-grained text matching algorithm, since it counts the character edits needed to turn one string into the other.
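The difference is easy to see on a typical duplicate-record pair. The sketch below (a standalone illustration, not sifarish code) compares word-level Jaccard similarity with a classic dynamic-programming Levenshtein distance: under Jaccard, "John" and "Jon" are simply different words, while Levenshtein sees them as one edit apart.

```python
def jaccard_similarity(a, b):
    """Word-level Jaccard: |intersection| / |union| of the two token sets."""
    s1, s2 = set(a.split()), set(b.split())
    return len(s1 & s2) / len(s1 | s2)

def levenshtein(s, t):
    """Edit distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

# "John Smith" vs "Jon Smith": only "Smith" is shared, so Jaccard is 1/3,
# even though the records almost certainly refer to the same person.
print(jaccard_similarity("John Smith", "Jon Smith"))  # 0.333...
print(levenshtein("John", "Jon"))                     # 1
```

A Levenshtein distance of 1 between "John" and "Jon" can be normalized by the longer string's length to yield a per-field similarity, which then feeds into the attribute-wise aggregation described above.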
Read full article from Identifying Duplicate Records with Fuzzy Matching | Mawazo