Apache Mahout: Scalable machine learning and data mining
Mahout's recommenders expect interactions between users and items as input. The easiest way to supply such data to Mahout is as a text file, where every line has the format userID,itemID,value. Here userID and itemID refer to a particular user and a particular item, and value denotes the strength of their interaction (e.g. the rating given to a movie).
1,10,1.0
1,11,2.0
1,12,5.0
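Each such line decomposes into a long user ID, a long item ID, and a numeric preference value, which is what FileDataModel parses internally. A minimal plain-Java sketch of that decomposition (no Mahout required; the array names are illustrative only):

```java
// The three sample lines from above, in userID,itemID,value format
String[] lines = {"1,10,1.0", "1,11,2.0", "1,12,5.0"};
long[] userIDs = new long[lines.length];
long[] itemIDs = new long[lines.length];
float[] values = new float[lines.length];
for (int i = 0; i < lines.length; i++) {
    String[] parts = lines[i].split(",");
    userIDs[i] = Long.parseLong(parts[0]);   // e.g. 1
    itemIDs[i] = Long.parseLong(parts[1]);   // e.g. 10
    values[i]  = Float.parseFloat(parts[2]); // e.g. 1.0f
}
```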
A complete version of this example is available at:
https://github.com/jpatanooga/Caduceus/blob/master/src/tv/floe/caduceus/mahout/cf/taste/samples/SampleRecommender.java
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public static void main(String[] args) throws IOException, TasteException {
    DataModel model = new FileDataModel(new File("data/mahout/cf/sample_recommender_data.csv")); // load interaction data
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); // 2 nearest neighbors
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> recommendations = recommender.recommend(1, 1); // 1 recommendation for userID 1
    for (RecommendedItem recommendation : recommendations) {
        System.out.println(recommendation);
    }
}

Creating a user-based recommender

Create a class called SampleRecommender with a main method.
The first thing we have to do is load the data from the file. Mahout's recommenders use an interface called DataModel to handle interaction data. You can load our made-up interactions like this:
DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
In this example, we want to create a user-based recommender. The idea behind this approach is that when we want to compute recommendations for a particular user, we look for other users with a similar taste and pick the recommendations from their items. To find similar users, we have to compare their interactions. There are several methods for doing this; one popular method is to compute the correlation coefficient between their interactions. In Mahout, you use this method as follows:
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
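To see what this similarity computes, here is a plain-Java sketch of the Pearson correlation coefficient over two users' ratings (toy values; Mahout's own implementation works only over the items both users have rated):

```java
// Hypothetical ratings of two users for the same three items
double[] a = {5.0, 3.0, 4.0};
double[] b = {4.0, 2.0, 5.0};
int n = a.length;
double meanA = 0.0, meanB = 0.0;
for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
meanA /= n;
meanB /= n;
double num = 0.0, denA = 0.0, denB = 0.0;
for (int i = 0; i < n; i++) {
    num  += (a[i] - meanA) * (b[i] - meanB);
    denA += (a[i] - meanA) * (a[i] - meanA);
    denB += (b[i] - meanB) * (b[i] - meanB);
}
// correlation lies in [-1, 1]; here it is about 0.65
double correlation = num / Math.sqrt(denA * denB);
```

A value near 1 means the two users rate items very similarly, a value near -1 means they rate them in opposite ways.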
The next thing we have to do is define which similar users we want to leverage for the recommender. For the sake of simplicity, we'll use all users with a similarity greater than 0.1. This is implemented via a ThresholdUserNeighborhood:
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
Now we have all the pieces to create our recommender:
UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
We can easily ask the recommender for recommendations now. If we wanted three items recommended for the user with userID 2, we would do it like this:
List<RecommendedItem> recommendations = recommender.recommend(2, 3);
for (RecommendedItem recommendation : recommendations) {
  System.out.println(recommendation);
}

Evaluation

You might ask yourself how to make sure that your recommender returns good results. Unfortunately, the only way to be really sure about the quality is to run an A/B test with real users in a live system.
We can, however, get a feel for the quality through a statistical offline evaluation.
One way to check whether the recommender returns good results is a hold-out test. We partition our dataset into two sets: a training set consisting of 90% of the data and a test set consisting of the remaining 10%. Then we train our recommender on the training set and check how well it predicts the held-out interactions in the test set.
To test our recommender, we create a class called EvaluateRecommender with a main method and add an inner class called MyRecommenderBuilder that implements the RecommenderBuilder interface. We implement the buildRecommender method and make it set up our user-based recommender:
@Override
public Recommender buildRecommender(DataModel dataModel) throws TasteException {
    UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
    UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, dataModel);
    return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
}
Now we have to create the code for the test. We'll check how much the recommender misses the real interaction strength on average. We employ an AverageAbsoluteDifferenceRecommenderEvaluator for this. The following code shows how to put the pieces together and run a hold-out test:
DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder builder = new MyRecommenderBuilder();
double result = evaluator.evaluate(builder, null, model, 0.9, 1.0); // train on 90% of the data, evaluate on the rest; include 100% of the users
System.out.println(result);
Note: if you run this test multiple times, you will get different results, because the split into training set and test set is done randomly.
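The score printed by the AverageAbsoluteDifferenceRecommenderEvaluator is the mean absolute difference between predicted and actual preference values, so lower is better. A plain-Java illustration of the metric itself (hypothetical held-out ratings and predictions, not the Mahout API):

```java
// Actual held-out ratings and the recommender's predictions for them
double[] actual    = {5.0, 2.0, 4.0};
double[] predicted = {4.5, 2.5, 5.0};
double sum = 0.0;
for (int i = 0; i < actual.length; i++) {
    sum += Math.abs(predicted[i] - actual[i]);
}
// (0.5 + 0.5 + 1.0) / 3, roughly 0.67
double mae = sum / actual.length;
```

A result of 0.0 would mean the recommender reproduces every held-out rating exactly.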
