Machine Learning with Apache Mahout: The Lay of the Land | Dr Dobb's
Read full article from Machine Learning with Apache Mahout: The Lay of the Land | Dr Dobb's
Recommender engines try to infer tastes and preferences for a user based on his or her past actions and similarities to other users. In addition, recommender engines try to identify unknown items that might be of interest to users.
People follow patterns in what they like and dislike. For example, people tend to like things that are similar to other things they like, and they tend to like things that similar people like. Recommendation algorithms use these patterns to predict likes and dislikes. It is possible to generate recommendations based on either users or items.
Apache Mahout provides a wide range of machine-learning and data-mining algorithms. However, Mahout has a specific focus on collaborative filtering (recommender engines), clustering, and classification.
"Collaborative filtering" recommenders, such as the one I'm going to look at, require you to specify a relationship between the users and the items. The collaborative filtering recommender engine doesn't need to know details about the properties for each item to produce a recommendation. Mahout provides a collaborative filtering framework that enables you to use a simple input, and generate recommendations based on this input. In addition, you can build a domain-specific content-based recommender that considers the specific attributes of either the items or the users on top of the framework that Mahout provides.
Each user has one or more scores that indicate their preference for each item ID. The score is a value from 1 to 10. The item IDs start with a 9 prefix to easily differentiate them from the user IDs. Figure 1 shows the six users (blue circles) with relationships to the different items (orange circles) and the score values represented by lines with different colors according to the following ranges:
- Score value from 1 to 4: the user dislikes the item (red solid line).
- Score value from 5 to 7: the user likes the item, but isn't excited about it and has some criticisms (red dashed line).
- Score value from 8 to 10: the user really likes the item (green line).
Figure 1: Six users (1001-1006) and how much they like items (9001-9015).
#userId, itemId, score
1001,9001,10
1001,9002,1
1001,9003,9
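To make the file layout and the score ranges concrete, here is a minimal sketch in plain Java that reads lines in the `userId,itemId,score` format and maps each score to the three ranges listed above. It does not use Mahout; the class name `PreferenceParser`, the `Preference` holder, and the `describe` labels are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: not part of Mahout. All names here are hypothetical.
final class PreferenceParser {

    /** One row of the CSV file: a user, an item, and a score from 1 to 10. */
    static final class Preference {
        final long userId;
        final long itemId;
        final int score;

        Preference(long userId, long itemId, int score) {
            this.userId = userId;
            this.itemId = itemId;
            this.score = score;
        }
    }

    /** Parses "userId,itemId,score" lines, skipping blank lines and '#' header lines. */
    static List<Preference> parse(List<String> lines) {
        List<Preference> result = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] fields = trimmed.split(",");
            result.add(new Preference(
                    Long.parseLong(fields[0].trim()),
                    Long.parseLong(fields[1].trim()),
                    Integer.parseInt(fields[2].trim())));
        }
        return result;
    }

    /** Maps a score to the three ranges used in Figure 1. */
    static String describe(int score) {
        if (score < 1 || score > 10) {
            throw new IllegalArgumentException("score must be between 1 and 10: " + score);
        }
        if (score <= 4) {
            return "dislikes";
        }
        if (score <= 7) {
            return "likes with reservations";
        }
        return "really likes";
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "#userId, itemId, score", "1001,9001,10", "1001,9002,1", "1001,9003,9");
        for (Preference p : parse(lines)) {
            System.out.println("User " + p.userId + " " + describe(p.score) + " item " + p.itemId);
        }
    }
}
```

The sketch only mirrors the data layout; in the article's example, Mahout's FileDataModel does the actual loading.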
Working with a Generic-User-Based Recommender
The org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender class implements a recommender that uses a DataModel and a UserNeighborhood to produce recommendations. The org.apache.mahout.cf.taste.model.DataModel implementations represent a repository of information about users and their associated preferences for items. I will use the CSV file created above as the DataModel. The org.apache.mahout.cf.taste.neighborhood.UserNeighborhood implementations compute a neighborhood of users similar to a given user, and the recommender engine can use this neighborhood to compute recommendations.
Mahout greatly simplifies extracting recommendations and relationships from input datasets.
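import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// The enclosing class name is illustrative; the article shows only the main method.
public class UserBasedRecommenderExample {
    public static void main(String[] args) throws Exception {
        // Create a data source from the CSV file
        File userPreferencesFile = new File("data/dataset1.csv");
        DataModel dataModel = new FileDataModel(userPreferencesFile);

        // Measure how similar two users are with the Pearson correlation
        UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
        // Define a neighborhood made of the 2 users nearest to a given user
        UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(2, userSimilarity, dataModel);

        // Create a generic user based recommender with the dataModel, the userNeighborhood and the userSimilarity
        Recommender genericRecommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);

        // Generate a list of 3 recommended items for user 1001
        List<RecommendedItem> itemRecommendations = genericRecommender.recommend(1001, 3);

        // Display the item recommendations generated by the recommendation engine
        for (RecommendedItem recommendedItem : itemRecommendations) {
            System.out.println(recommendedItem);
        }
    }
}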
How the Calculation Works
The org.apache.mahout.cf.taste.impl.model.file.FileDataModel.FileDataModel constructor receives the File instance containing the preferences data.
Then, the code uses the FileDataModel instance to create an instance of the org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity class. This class provides an implementation of the Pearson correlation. For example, for two users, named user1 and user2, PearsonCorrelationSimilarity calculates the following values:
- sumSquareUser1: Sum of the square of all the preference values for user1.
- sumSquareUser2: Sum of the square of all the preference values for user2.
- sumUser1XUser2: Sum of the product of the preference values for user1 and user2, for all the items that include preferences from both users.
Then, PearsonCorrelationSimilarity calculates the correlation with the following formula: sumUser1XUser2 / sqrt(sumSquareUser1 * sumSquareUser2). The user preference values are shifted beforehand so that each user's mean equals 0, which makes this correlation equivalent to the cosine similarity of the centered values. You can interpret it as the cosine of the angle between the two vectors generated with the centered user preference values.
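The calculation described above can be sketched in plain Java as follows. This is not Mahout's code: the class name `PearsonSketch` and the choice to compute each user's mean over the co-rated items only are assumptions made for this illustration, and the second user in `main` is hypothetical.

```java
import java.util.Map;

// Illustrative sketch only: a standalone Pearson correlation on centered values.
final class PearsonSketch {

    /** Returns the Pearson correlation over items rated by both users, or NaN if it is undefined. */
    static double correlation(Map<Long, Double> user1, Map<Long, Double> user2) {
        // First pass: mean of each user's scores over the items both users rated (an assumption).
        int n = 0;
        double sum1 = 0.0;
        double sum2 = 0.0;
        for (Map.Entry<Long, Double> entry : user1.entrySet()) {
            Double other = user2.get(entry.getKey());
            if (other != null) {
                sum1 += entry.getValue();
                sum2 += other;
                n++;
            }
        }
        if (n == 0) {
            return Double.NaN; // no items in common
        }
        double mean1 = sum1 / n;
        double mean2 = sum2 / n;

        // Second pass: the three sums from the text, computed on the centered values.
        double sumSquareUser1 = 0.0;
        double sumSquareUser2 = 0.0;
        double sumUser1XUser2 = 0.0;
        for (Map.Entry<Long, Double> entry : user1.entrySet()) {
            Double other = user2.get(entry.getKey());
            if (other != null) {
                double x = entry.getValue() - mean1;
                double y = other - mean2;
                sumSquareUser1 += x * x;
                sumSquareUser2 += y * y;
                sumUser1XUser2 += x * y;
            }
        }
        double denominator = Math.sqrt(sumSquareUser1) * Math.sqrt(sumSquareUser2);
        if (denominator == 0.0) {
            return Double.NaN; // at least one user gave identical scores to every shared item
        }
        return sumUser1XUser2 / denominator;
    }

    public static void main(String[] args) {
        // Scores for user 1001 come from the sample rows; the second user is hypothetical.
        Map<Long, Double> user1001 = Map.of(9001L, 10.0, 9002L, 1.0, 9003L, 9.0);
        Map<Long, Double> hypotheticalUser = Map.of(9001L, 8.0, 9002L, 3.0, 9003L, 9.0);
        System.out.println(correlation(user1001, hypotheticalUser));
    }
}
```

A result close to 1 means the two users rank the shared items similarly, a result close to -1 means they rank them in opposite order, and NaN means the measure cannot say anything about the pair.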
Next, the code uses the FileDataModel and the PearsonCorrelationSimilarity instances to create an instance of the org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood.NearestNUserNeighborhood class. This class computes a neighborhood consisting of the two nearest users to a given user because the n argument that defines the neighborhood size is set to 2. Other constructors for this class allow you to specify values for additional arguments.
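The idea of a nearest-n neighborhood can be illustrated with a short plain-Java sketch: given the similarity between one user and every other user, keep the n users with the highest similarity. This is a simplified illustration, not Mahout's implementation; the class name `NearestNSketch`, the method `nearest`, and the similarity values in `main` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: picks the n most similar users from precomputed similarities.
final class NearestNSketch {

    /** Returns up to n user IDs ordered from most to least similar, ignoring undefined (NaN) similarities. */
    static List<Long> nearest(int n, Map<Long, Double> similarityToOthers) {
        List<Map.Entry<Long, Double>> candidates = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : similarityToOthers.entrySet()) {
            if (!Double.isNaN(entry.getValue())) {
                candidates.add(entry);
            }
        }
        // Highest similarity first.
        candidates.sort(Map.Entry.<Long, Double>comparingByValue().reversed());

        List<Long> neighborhood = new ArrayList<>();
        for (int i = 0; i < Math.min(n, candidates.size()); i++) {
            neighborhood.add(candidates.get(i).getKey());
        }
        return neighborhood;
    }

    public static void main(String[] args) {
        // Hypothetical similarities between user 1001 and the other users.
        Map<Long, Double> similarities = Map.of(
                1002L, 0.9, 1003L, Double.NaN, 1004L, -0.2, 1005L, 0.5);
        System.out.println(nearest(2, similarities));
    }
}
```

With n set to 2, as in the article's example, only the two best-matching users contribute to the recommendations.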
The code creates a generic-user-based recommender (org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender.GenericUserBasedRecommender) instance with the FileDataModel, the NearestNUserNeighborhood, and the PearsonCorrelationSimilarity instances. Then, it is simply necessary to call the recommend method for the new GenericUserBasedRecommender instance with the user ID and the desired number of recommendations to generate. This method returns a List<org.apache.mahout.cf.taste.recommender.RecommendedItem>. Each RecommendedItem instance encapsulates a recommended item and includes the item ID (recommendedItem.getItemID()) and a float value (recommendedItem.getValue()) that expresses the strength of the preference. A simple for loop displays each RecommendedItem in the console.
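To give an intuition for where the strength-of-preference value can come from, the sketch below shows one common approach for user-based recommenders: estimate a user's score for an unseen item as the similarity-weighted average of the neighbors' scores for that item. This is an assumption-laden illustration rather than a description of Mahout's internals; the class name `WeightedEstimateSketch`, the decision to ignore neighbors with non-positive similarity, and the numbers in `main` are all hypothetical.

```java
import java.util.Map;

// Illustrative sketch only: similarity-weighted average of the neighbors' scores for one item.
final class WeightedEstimateSketch {

    /**
     * Estimates a score for one item.
     * similarityByNeighbor maps each neighbor's user ID to its similarity with the target user;
     * neighborScoreForItem maps a neighbor's user ID to that neighbor's score for the item.
     * Returns NaN when no usable neighbor has scored the item.
     */
    static double estimate(Map<Long, Double> similarityByNeighbor, Map<Long, Double> neighborScoreForItem) {
        double weightedSum = 0.0;
        double totalSimilarity = 0.0;
        for (Map.Entry<Long, Double> entry : similarityByNeighbor.entrySet()) {
            double similarity = entry.getValue();
            Double score = neighborScoreForItem.get(entry.getKey());
            // Simplification for this sketch: skip neighbors without a score,
            // with undefined similarity, or with non-positive similarity.
            if (score == null || Double.isNaN(similarity) || similarity <= 0.0) {
                continue;
            }
            weightedSum += similarity * score;
            totalSimilarity += similarity;
        }
        return totalSimilarity == 0.0 ? Double.NaN : weightedSum / totalSimilarity;
    }

    public static void main(String[] args) {
        // Hypothetical neighborhood of two users and their scores for one unseen item.
        Map<Long, Double> similarityByNeighbor = Map.of(1002L, 0.75, 1005L, 0.25);
        Map<Long, Double> neighborScoreForItem = Map.of(1002L, 10.0, 1005L, 6.0);
        System.out.println(estimate(similarityByNeighbor, neighborScoreForItem));
    }
}
```

Under this scheme, a close neighbor's opinion counts for more than a distant neighbor's, and ranking the unseen items by their estimates yields the top recommendations.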
This shows how you can use one of the Mahout recommender engines with just a few lines of code. In my example, the code uses a simple CSV file as the data source, but it is just as easy to work with larger and more complex data sources. In addition, several Mahout features run on top of Apache Hadoop and take advantage of its great scalability. In the next article, I'll discuss more-advanced machine learning algorithms included in Apache Mahout, which you can also use with just a few lines of code.
double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
    if (n == 0) {
        return Double.NaN;
    }
    // Note that sum of X and sum of Y don't appear here since they are assumed to be 0;
    // the data is assumed to be centered.
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
        // One or both parties has -all- the same ratings;
        // can't really say much similarity under this measure
        return Double.NaN;
    }
    return sumXY / denominator;
}