Playing with the Mahout recommendation engine on a Hadoop cluster | Chimpler
The GroupLens MovieLens data set provides movie ratings in this format. You can download it: MovieLens 100k.
- u.data: contains tab-separated tuples (user_id, movie_id, rating, timestamp), one per line
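To make the tuple layout concrete, here is a minimal sketch that parses one line of u.data into named fields. The `Rating` type and `parse_rating_line` helper are hypothetical names introduced for illustration, the sample values are made up, and the sketch assumes the fields are tab-separated as in the viewer script later in this post.

```python
from collections import namedtuple

# Hypothetical record type mirroring the (user_id, movie_id, rating, timestamp) tuple.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])

def parse_rating_line(line):
    """Split one tab-separated u.data line into typed fields."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

# Made-up sample line, for illustration only.
sample = "7\t42\t4\t880000000\n"
parsed = parse_rating_line(sample)
print(parsed)
```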
hadoop jar <MAHOUT DIRECTORY>/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input u.data --output output
- With the argument “-s SIMILARITY_COOCCURRENCE”, we tell the recommender which item similarity formula to use. With SIMILARITY_COOCCURRENCE, two items (movies) are considered very similar if they often appear together in users’ ratings. So to find the movies to recommend to a user, we look for the 10 movies most similar to the movies the user has rated. Said differently, if user A gives a good rating to movie X, and other users give good ratings to both movie X and movie Y, then we can recommend movie Y to user A.
Mahout computes the recommendations by running several Hadoop MapReduce jobs.
After 30-50 minutes, the jobs finish and each user has the 10 movies that she is most likely to enjoy, based on the co-occurrence of each movie in users’ ratings.
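The co-occurrence idea can be sketched in a few lines of plain Python. This is a toy, in-memory illustration and not Mahout's implementation: it ignores the rating values, simply counts how many users rated both items of a pair, and scores unseen items by summing those counts. The function names and the toy data are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(ratings):
    """Count, for each pair of items, how many users rated both."""
    items_by_user = defaultdict(set)
    for user, item in ratings:
        items_by_user[user].add(item)
    counts = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return items_by_user, counts

def recommend(user, items_by_user, counts, top_n=10):
    """Score items the user has not rated by summed co-occurrence with rated items."""
    seen = items_by_user[user]
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Toy data (illustrative only): A rated X; B rated X and Y; C rated X, Y and Z.
ratings = [("A", "X"), ("B", "X"), ("B", "Y"), ("C", "X"), ("C", "Y"), ("C", "Z")]
items_by_user, counts = cooccurrence_counts(ratings)
result = recommend("A", items_by_user, counts)
print(result)
```

Here movie Y ranks first for user A because two other users rated it together with X, while Z co-occurs with X only once.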
To copy and merge the files from HDFS to your local filesystem, type:
hadoop fs -getmerge output output.txt
1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
Each line represents the recommendations for one user. The first number is the user id, and each of the 10 pairs that follow is a movie id and a score.
Looking at the first line, for example, the 10 best recommendations for user 1 are the movies 845, 550, 546, 25, 531, 529, 527, 31, 515, and 514.
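As a rough sketch of how one such line can be read programmatically, the hypothetical helper below splits it into a user id and a list of (movie id, score) pairs. It assumes the user id and the bracketed list are separated by a tab, which is how the viewer script in this post splits the file.

```python
def parse_recommendation_line(line):
    """Return (user_id, [(movie_id, score), ...]) for one output line."""
    user_id, payload = line.rstrip("\n").split("\t")
    pairs = []
    for entry in payload.strip("[]").split(","):
        if entry:  # skip the empty string produced by an empty list "[]"
            movie_id, score = entry.split(":")
            pairs.append((int(movie_id), float(score)))
    return int(user_id), pairs

# The sample line shown above, with a tab between the user id and the list.
line = "1\t[845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]\n"
user_id, pairs = parse_recommendation_line(line)
print(user_id, [movie_id for movie_id, _ in pairs])
```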
It’s not easy to see what those recommendations mean, so we wrote a small Python program that shows, for a given user, the movies they have rated and the movies we recommend to them.
The Python program uses the file u.data for the list of rated movies, the file u.item to get the movie titles, and output.txt to get the list of recommended movies for the user.

```python
import sys

if len(sys.argv) != 5:
    print("Arguments: userId userDataFilename movieFilename recommendationFilename")
    sys.exit(1)

userId, userDataFilename, movieFilename, recommendationFilename = sys.argv[1:]

# u.item is "|"-separated: movie id, title, then other fields.
# The MovieLens 100k u.item file is Latin-1 encoded, hence the explicit encoding.
print("Reading Movies Descriptions")
movieById = {}
with open(movieFilename, encoding="latin-1") as movieFile:
    for line in movieFile:
        tokens = line.split("|")
        movieById[tokens[0]] = tokens[1:]

# u.data is tab-separated: user id, movie id, rating, timestamp.
print("Reading Rated Movies")
ratedMovieIds = []
with open(userDataFilename) as userDataFile:
    for line in userDataFile:
        tokens = line.split("\t")
        if tokens[0] == userId:
            ratedMovieIds.append((tokens[1], tokens[2]))

# Each recommendation line looks like: userId<TAB>[movieId:score,...]
print("Reading Recommendations")
recommendations = []
with open(recommendationFilename) as recommendationFile:
    for line in recommendationFile:
        tokens = line.split("\t")
        if tokens[0] == userId:
            movieIdAndScores = tokens[1].strip("[]\n").split(",")
            recommendations = [movieIdAndScore.split(":") for movieIdAndScore in movieIdAndScores]
            break

print("Rated Movies")
print("------------------------")
for movieId, rating in ratedMovieIds:
    print("%s, rating=%s" % (movieById[movieId][0], rating))
print("------------------------")

print("Recommended Movies")
print("------------------------")
for movieId, score in recommendations:
    print("%s, score=%s" % (movieById[movieId][0], score))
print("------------------------")
```