Machine Learning with Apache Mahout: The Lay of the Land | Dr Dobb's



Recommender engines infer a user's tastes and preferences from his or her past actions and from similarities to other users, and try to identify unknown items that might be of interest to that user.



People's likes and dislikes follow patterns: people usually tend to like things that are similar to other things they like, and they usually tend to like things that similar people like. Recommendation algorithms use these patterns to predict likes and dislikes, and recommendations can be generated based on either users or items.

Apache Mahout provides a wide range of machine-learning and data-mining algorithms. However, Mahout has a specific focus on collaborative filtering (recommender engines), clustering, and classification.

"Collaborative filtering" recommenders, such as the one I'm going to look at, require you to specify a relationship between the users and the items. The collaborative filtering recommender engine doesn't need to know details about the properties for each item to produce a recommendation. Mahout provides a collaborative filtering framework that enables you to use a simple input, and generate recommendations based on this input. In addition, you can build a domain-specific content-based recommender that considers the specific attributes of either the items or the users on top of the framework that Mahout provides.

Each user has one or more scores, each indicating that user's preference for a particular item ID. The score is a value from 1 to 10. The item IDs start with a 9 prefix to make them easy to distinguish from the user IDs. Figure 1 shows the six users (blue circles), their relationships to the different items (orange circles), and the score values, represented by lines whose colors correspond to the following ranges:
  • Score value from 1 to 4: the user dislikes the item (red solid line).
  • Score value from 5 to 7: the user likes the item, but isn't excited about it and has some criticisms (red dashed line).
  • Score value from 8 to 10: the user really likes the item (green line).
Figure 1: Six users (1001-1006) and how much they like items (9001-9015).
The user preferences are stored in a CSV file (data/dataset1.csv in the code below), with one userId,itemId,score triple per line. The first few lines look like this:
#userId, itemId, score
1001,9001,10
1001,9002,1
1001,9003,9
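
The article treats this file as already created. A minimal sketch that writes the rows shown above to data/dataset1.csv could look like the following (only the three example rows appear here; the complete file would contain one line per score in Figure 1, and the class name is arbitrary):
  import java.io.File;
  import java.io.FileWriter;
  import java.io.IOException;

  public class CreateDataset {
      public static void main(String[] args) throws IOException {
          File dataDir = new File("data");
          dataDir.mkdirs();  // make sure the data directory exists
          try (FileWriter writer = new FileWriter(new File(dataDir, "dataset1.csv"))) {
              // Only the example rows from the article; the real file would list
              // every userId,itemId,score triple shown in Figure 1.
              writer.write("1001,9001,10\n");
              writer.write("1001,9002,1\n");
              writer.write("1001,9003,9\n");
          }
      }
  }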

Working with a Generic-User-Based Recommender

The org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender class implements a recommender that uses a DataModel and a UserNeighborhood to produce recommendations. The org.apache.mahout.cf.taste.model.DataModel implementations represent a repository of information about users and their associated preferences for items; I will use the CSV file shown above as the DataModel. The org.apache.mahout.cf.taste.neighborhood.UserNeighborhood implementations compute a neighborhood of users similar to a given user, and the recommender engine can use this neighborhood to compute recommendations.

Mahout greatly simplifies extracting recommendations and relationships from input datasets. The following listing builds the user-based recommender from the CSV file; the wrapping class name is arbitrary and is added only so the example compiles as-is.
  import java.io.File;
  import java.util.List;

  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.recommender.Recommender;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class UserRecommenderExample {
      public static void main(String[] args) throws Exception {
          // Create a data source from the CSV file
          File userPreferencesFile = new File("data/dataset1.csv");
          DataModel dataModel = new FileDataModel(userPreferencesFile);

          // Pearson correlation as the user-to-user similarity measure
          UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
          // Neighborhood of the 2 users most similar to the target user
          UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(2, userSimilarity, dataModel);

          // Create a generic user-based recommender with the dataModel, the userNeighborhood and the userSimilarity
          Recommender genericRecommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);

          // Generate a list of 3 recommended items for user 1001
          List<RecommendedItem> itemRecommendations = genericRecommender.recommend(1001, 3);

          // Display the item recommendations generated by the recommendation engine
          for (RecommendedItem recommendedItem : itemRecommendations) {
              System.out.println(recommendedItem);
          }
      }
  }
How the Calculation Works
The org.apache.mahout.cf.taste.impl.model.file.FileDataModel constructor receives the File instance containing the preference data.
Then, the code uses the FileDataModel instance to create an instance of the org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity class, which provides an implementation of the Pearson correlation. For two users, named user1 and user2, PearsonCorrelationSimilarity calculates the following values over the items that both users have rated:
  • sumSquareUser1: Sum of the squares of user1's preference values.
  • sumSquareUser2: Sum of the squares of user2's preference values.
  • sumUser1XUser2: Sum of the products of user1's and user2's preference values.
Then, PearsonCorrelationSimilarity calculates the correlation with the following formula: sumUser1XUser2 / sqrt(sumSquareUser1 * sumSquareUser2). Because the implementation shifts the user preference values so that each user's mean is 0 (the data is centered), this correlation is equivalent to the cosine similarity: you can interpret it as the cosine of the angle between the two vectors formed by the users' preference values.
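To make the arithmetic concrete, here is a small standalone sketch (not Mahout code; the names and structure are illustrative only) that accumulates those three sums over the items two users have in common and applies the formula:
  import java.util.Map;

  public class PearsonSketch {
      // Computes sumUser1XUser2 / sqrt(sumSquareUser1 * sumSquareUser2) over the items
      // rated by both users. The preference values are assumed to be already centered
      // (mean 0 per user), as described above.
      static double similarity(Map<Long, Double> user1Prefs, Map<Long, Double> user2Prefs) {
          double sumSquareUser1 = 0.0;
          double sumSquareUser2 = 0.0;
          double sumUser1XUser2 = 0.0;
          for (Map.Entry<Long, Double> entry : user1Prefs.entrySet()) {
              Double pref2 = user2Prefs.get(entry.getKey());
              if (pref2 == null) {
                  continue;  // only items rated by both users contribute
              }
              double pref1 = entry.getValue();
              sumSquareUser1 += pref1 * pref1;
              sumSquareUser2 += pref2 * pref2;
              sumUser1XUser2 += pref1 * pref2;
          }
          double denominator = Math.sqrt(sumSquareUser1 * sumSquareUser2);
          return denominator == 0.0 ? Double.NaN : sumUser1XUser2 / denominator;
      }
  }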
Next, the code uses the FileDataModel and the PearsonCorrelationSimilarity instances to create an instance of theorg.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood.NearestNUserNeighborhoodclass. This class computes a neighborhood consisting of the two nearest users to a given user because the n argument that defines the neighborhood size is set to 2. There are many other constructors for this class that allow you to specify values for additional arguments.
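For example, assuming your Mahout version exposes the overload that also takes a minimum similarity threshold (check the NearestNUserNeighborhood Javadoc for the exact constructors in your release), the neighborhood could be tuned like this:
  // Keep at most 5 neighbors and ignore users whose similarity to the target
  // user falls below 0.1. (Constructor overload assumed; verify it exists in
  // the Mahout version you are using.)
  UserNeighborhood tunedNeighborhood =
      new NearestNUserNeighborhood(5, 0.1, userSimilarity, dataModel);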
The code creates a generic-user-based recommender (org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender.GenericUserBasedRecommender) instance with the FileDataModel, the NearestNUserNeighborhood, and thePearsonCorrelationSimilarity instances. Then, it is simply necessary to call the recommend method for the new GenericUserBasedRecommender instance with the user ID and the desired number of recommendations to generate. This method returns aList<org.apache.mahout.cf.taste.recommender.RecommendedItem>. Each RecommendedIteminstance encapsulates a recommended item and includes the item ID (recommendedItem.getItemID()) and a float value (recommendedItem.getValue()) that expresses the strength of the preference. A simple for loop displays each RecommendedItem in the console.
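A minimal variation on the display loop from the listing above, using the two getters just mentioned (the variable names are the same as in the earlier example):
  // Print the item ID and the estimated preference strength for each recommendation
  for (RecommendedItem recommendedItem : itemRecommendations) {
      System.out.println("item " + recommendedItem.getItemID()
          + " -> estimated preference " + recommendedItem.getValue());
  }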
For reference, the relevant computation inside PearsonCorrelationSimilarity is the computeResult method shown below, which receives the precomputed sums and returns the correlation:
  double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
    if (n == 0) {
      return Double.NaN;
    }
    // Note that sum of X and sum of Y don't appear here since they are assumed to be 0;
    // the data is assumed to be centered.
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      // One or both parties has -all- the same ratings;
      // can't really say much similarity under this measure
      return Double.NaN;
    }
    return sumXY / denominator;
  }

This shows how you can use one of the Mahout recommender engines with just a few lines of code. In my example, the code uses a simple CSV file as the data source, but it is just as easy to work with larger and more complex data sources. In addition, several Mahout features run on top of Apache Hadoop and take advantage of its great scalability. In the next article, I'll discuss more-advanced machine learning algorithms included in Apache Mahout, which you can also use with just a few lines of code.

Read full article from Machine Learning with Apache Mahout: The Lay of the Land | Dr Dobb's
