All About Programming: MinHash

MinHash | mymagnadata

here are quite a few posts on the web explaining min hash specially this and this. Both of these posts are excellent and clearly explains min hash. My objective of this post is to take an example and explain step-by-step min-hash algorithm that does not permute the input but uses n-hash functions to get similarity score.

Lets suppose there are two sets of data S1 {"THIS","IS","ME"} and S2 {"THAT","IS","ME"}. Lets create a bit map index for this set of data with rows as union of Set S1 and Set S2 and column values as bits (1,0) representing presence or absence of data row in the set

THIS 1 0
THAT 0 1
IS 1 1
ME 1 1

Bit "1″ represent presence in the data set. For example, "THIS" 1 0 means "THIS" is present in the data set S1 but not in data set S2, "IS" is present in the both data sets S1 and S2 etc. etc.

Pick n-hash functions (n is random number), n number of hash functions could be any +ve integer >= S1.size()+S2.size(). For this example lets says n = S1.size() + S2.size().

[0][1][2][3][4][5] => {2222,12332,45432,45426,2124,8656}

For each column, hash function keep a slot. With our example we have 2 columns and 6 hash functions

minHashSlots => [0][0] [0][1] [0][2] [0][3] [0][4] [0][5], [1][0] [1][1] [1][2] [1][3] [1][4] [1][5]

Follow this algorithm:

For each row, each column with value 1, for each hash function if the hash value h[row index] < minHashSlot[column][row index] then replace the minHashSlot[column][row index] = h[rowIndex]. Here is the formal alogrithm

1. prepare the bitMap
2. initialize the h[n] => with n hash functions
3. initialize minHashSlots[column count][n] with Int.maxValue
4. rowIndex = 1
5. for every row in the bit map
5.1 for every column with value 1
5.1.1 for every hash
5.1.1.1. hash[rowIndex] < minHashSlot[column][rowIndex] then set minHashSlot[column][rowIndex]=hash[rowIndex]
5.1.1.2 rowIndex++
6 minHash = count of similar hash / total hashes

I am not sure if this is correct explanation of min-hash but whatever I heard this is quite close to being correct.

Read full article from MinHash | mymagnadata

MinHash | mymagnadata

No comments:

Post a Comment

Labels

Popular Posts