Playing with the Mahout recommendation engine on a Hadoop cluster | Chimpler

The GroupLens MovieLens dataset provides movie ratings in this format. You can download it: MovieLens 100k.
  • u.data: contains tuples (user_id, movie_id, rating, timestamp), one per line
  • hadoop jar <MAHOUT DIRECTORY>/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input u.data --output output
  • With the argument “-s SIMILARITY_COOCCURRENCE”, we tell the recommender which item similarity measure to use. With SIMILARITY_COOCCURRENCE, two items (movies) are considered very similar if they often appear together in users’ ratings. So to find the movies to recommend to a user, we need to find the 10 movies most similar to the movies the user has rated. Said differently, if a user A gives a good rating to movie X and other users give good ratings to both movie X and movie Y, then we can recommend movie Y to user A.
    Mahout computes the recommendations by running several Hadoop mapreduce jobs.
    After 30-50 minutes, the jobs are finished and each user has the 10 movies that they are most likely to enjoy, based on the co-occurrence of movies in users’ ratings.
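To make the co-occurrence idea concrete, here is a minimal sketch (plain Python on a toy dataset of our own, not Mahout) of counting how often two movies are rated by the same user:

```python
from collections import defaultdict
from itertools import combinations

# Toy ratings: each user mapped to the set of movies they rated highly
ratings = {
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"X", "Y"},
    "D": {"Y", "Z"},
}

# Count how often each pair of movies appears in the same user's ratings
cooccurrence = defaultdict(int)
for movies in ratings.values():
    for m1, m2 in combinations(sorted(movies), 2):
        cooccurrence[(m1, m2)] += 1

# Y co-occurs with X twice, so Y is a good recommendation for user A
print(sorted(cooccurrence.items(), key=lambda kv: -kv[1]))
# → [(('X', 'Y'), 2), (('Y', 'Z'), 1)]
```

Mahout computes essentially this kind of item-item co-occurrence, but distributed over Hadoop mapreduce jobs so it scales to millions of ratings.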
To copy and merge the files from HDFS to your local filesystem, type:
hadoop fs -getmerge output output.txt
1       [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0] 
Each line represents the recommendations for one user. The first number is the user id and the 10 number pairs each represent a movie id and a score.
Looking at the first line, for example, it means that for user 1 the 10 best recommendations are the movies 845, 550, 546, 25, 531, 529, 527, 31, 515, 514.
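Each output line can be decoded with a couple of string splits; here is a minimal sketch (the sample line is a truncated version of the output above):

```python
# One line of Mahout's output: user_id<TAB>[movie_id:score,...]
line = "1\t[845:5.0,550:5.0,546:5.0]"

user_id, rest = line.split("\t")
pairs = rest.strip("[]\n").split(",")
# Split each "movie_id:score" pair and convert the score to a float
recommendations = [(mid, float(score))
                   for mid, score in (p.split(":") for p in pairs)]
print(user_id, recommendations)
# → 1 [('845', 5.0), ('550', 5.0), ('546', 5.0)]
```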
It’s not easy to see what those recommendations mean, so we wrote a small Python program that shows, for a given user, the movies they have rated and the movies we recommend to them.
The Python program uses the file u.data for the list of rated movies, the file u.item to get the movie titles, and output.txt to get the list of recommended movies for the user.

import sys

if len(sys.argv) != 5:
    print("Arguments: userId userDataFilename movieFilename recommendationFilename")
    sys.exit(1)

userId, userDataFilename, movieFilename, recommendationFilename = sys.argv[1:]

print("Reading Movie Descriptions")
# u.item is pipe-delimited (movie_id|title|release_date|...) and the
# MovieLens 100k files are encoded in ISO-8859-1
movieById = {}
with open(movieFilename, encoding="iso-8859-1") as movieFile:
    for line in movieFile:
        tokens = line.split("|")
        movieById[tokens[0]] = tokens[1:]

print("Reading Rated Movies")
# u.data is tab-delimited: user_id, movie_id, rating, timestamp
ratedMovieIds = []
with open(userDataFilename) as userDataFile:
    for line in userDataFile:
        tokens = line.split("\t")
        if tokens[0] == userId:
            ratedMovieIds.append((tokens[1], tokens[2]))

print("Reading Recommendations")
# each line of output.txt looks like: user_id<TAB>[movie_id:score,...]
recommendations = []
with open(recommendationFilename) as recommendationFile:
    for line in recommendationFile:
        tokens = line.split("\t")
        if tokens[0] == userId:
            movieIdAndScores = tokens[1].strip("[]\n").split(",")
            recommendations = [movieIdAndScore.split(":")
                               for movieIdAndScore in movieIdAndScores]
            break

print("Rated Movies")
print("------------------------")
for movieId, rating in ratedMovieIds:
    print("%s, rating=%s" % (movieById[movieId][0], rating))
print("------------------------")

print("Recommended Movies")
print("------------------------")
for movieId, score in recommendations:
    print("%s, score=%s" % (movieById[movieId][0], score))
print("------------------------")
Read full article from Playing with the Mahout recommendation engine on a Hadoop cluster | Chimpler
