Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages | Chimpler



Classification algorithms can be used to automatically classify documents and images, to implement spam filters, and in many other domains.

In this tutorial we are going to use Mahout to classify tweets using the Naive Bayes Classifier. The algorithm works with a training set, which is a set of documents already associated with a category. Using this set, the classifier determines for each word the probability that it makes a document belong to each of the considered categories. To compute the probability that a document belongs to a category, it multiplies together the individual probabilities of each of its words for this category. The category with the highest probability is the one the document is most likely to belong to.
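To make the idea concrete, here is a toy sketch, independent of Mahout and with made-up word probabilities and categories, that scores a document against two categories by summing the log of each word's probability (equivalent to multiplying the probabilities, but numerically safer) and keeps the best one:

import java.util.HashMap;
import java.util.Map;

public class ToyNaiveBayesScoring {
    public static void main(String[] args) {
        // made-up per-category word probabilities, as if learned from a training set
        Map<String, Map<String, Double>> wordProbByCategory = new HashMap<String, Map<String, Double>>();
        Map<String, Double> tech = new HashMap<String, Double>();
        tech.put("laptop", 0.05); tech.put("deal", 0.02); tech.put("spa", 0.001);
        Map<String, Double> health = new HashMap<String, Double>();
        health.put("laptop", 0.001); health.put("deal", 0.02); health.put("spa", 0.04);
        wordProbByCategory.put("tech", tech);
        wordProbByCategory.put("health", health);

        String[] documentWords = { "laptop", "deal" };

        // multiply the per-word probabilities (sum of logs) and keep the best category
        String bestCategory = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> entry : wordProbByCategory.entrySet()) {
            double score = 0;
            for (String word : documentWords) {
                Double p = entry.getValue().get(word);
                score += Math.log(p == null ? 1e-6 : p); // small default for unseen words
            }
            if (score > bestScore) {
                bestScore = score;
                bestCategory = entry.getKey();
            }
        }
        System.out.println("Most likely category: " + bestCategory); // prints "tech"
    }
}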

In our tutorial, we will limit the tweets to deals by getting the tweets containing the hashtags #deal, #deals and #discount (one way to collect them is sketched after the list below). We will classify them into the following categories:
  • apparel (clothes, shoes, watches, …)
  • art (Book, DVD, Music, …)
  • camera
  • event (travel, concert, …)
  • health (beauty, spa, …)
  • home (kitchen, furniture, garden, …)
  • tech (computer, laptop, tablet, …)
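Such tweets can be collected with the Twitter streaming API. Below is a minimal sketch using the Twitter4J library; the class name, listener and output format are illustrative assumptions (not necessarily what the original project uses), and OAuth credentials are expected in twitter4j.properties. It prints one tweet per line as the tweet id, a tab, and the tweet text:

import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class TweetCollector {
    public static void main(String[] args) {
        TwitterStream twitterStream = new TwitterStreamFactory().getInstance();
        twitterStream.addListener(new StatusListener() {
            public void onStatus(Status status) {
                // one tweet per line: id <tab> text (the category column is added by hand later)
                System.out.println(status.getId() + "\t" + status.getText().replace('\n', ' '));
            }
            public void onDeletionNotice(StatusDeletionNotice notice) {}
            public void onTrackLimitationNotice(int numberOfLimitedStatuses) {}
            public void onScrubGeo(long userId, long upToStatusId) {}
            public void onStallWarning(StallWarning warning) {}
            public void onException(Exception ex) { ex.printStackTrace(); }
        });
        // only keep tweets containing the deal-related hashtags
        twitterStream.filter(new FilterQuery().track(new String[] { "#deal", "#deals", "#discount" }));
    }
}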
Build the example project with Maven; it produces the jar used in the commands below:
mvn clean package assembly:single
To transform this into a training set, you can use your favorite editor and add the category of the tweet at the beginning of each line, followed by a tab character:

tech    308215054011194110      Limited 3-Box $20 BOGO, Supreme $9 BOGO, PTC Basketball $10 BOGO, Sterling Baseball $20 BOGO, Bowman Chrome $7 http://t.co/WMdbNFLvVZ #deals
Make sure to use a tab between the category and the tweet id, and between the tweet id and the tweet message.
For the classifier to work properly, this set must have at least 50 tweet messages in each category.

Training the model with Mahout

Convert the tweet TSV file into a Hadoop sequence file:
$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TweetTSVToSeq data/tweets-train.tsv tweets-seq
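The TweetTSVToSeq source is not reproduced here; the sketch below shows roughly what such a conversion can look like (class and variable names are illustrative). Each tweet becomes one SequenceFile entry whose key has the form /category/tweetId, which lets mahout trainnb -el extract the label later, and whose value is the tweet message:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetTSVToSeqSketch {
    public static void main(String[] args) throws Exception {
        String inputFileName = args[0];  // e.g. data/tweets-train.tsv
        String outputDirName = args[1];  // e.g. tweets-seq

        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(configuration);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, configuration,
                new Path(outputDirName + "/chunk-0"), Text.class, Text.class);

        BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] tokens = line.split("\t", 3);
            if (tokens.length != 3) {
                System.out.println("Skip line: " + line);
                continue;
            }
            String category = tokens[0];
            String id = tokens[1];
            String message = tokens[2];
            // key "/category/tweetId" so that "mahout trainnb -el" can extract the label from the key
            writer.append(new Text("/" + category + "/" + id), new Text(message));
        }
        reader.close();
        writer.close();
    }
}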
Then copy the sequence file to HDFS:
hadoop fs -put tweets-seq tweets-seq
We can run Mahout to transform the training set into vectors using tf-idf weights (term frequency x inverse document frequency):
mahout seq2sparse -i tweets-seq -o tweets-vectors
It will generate the following files in HDFS in the directory tweets-vectors:
  • df-count: sequence file with association word id => number of documents containing this word
  • dictionary.file-0: sequence file with association word => word id
  • frequency.file-0: sequence file with association word id => word count
  • tf-vectors: sequence file with the term frequency for each document
  • tfidf-vectors: sequence file with association document id => tfidf weight for each word in the document
  • tokenized-documents: sequence file with association document id => list of words
  • wordcount: sequence file with association word => word count
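As a rough illustration of how these weights behave (the exact formula and smoothing used by Mahout's TFIDF class may differ), the weight of a word in a document grows with the word's count in the document and shrinks as the word appears in more of the training documents (the df-count output above):

public class TfIdfExample {
    // tf      : number of times the word occurs in the document
    // df      : number of documents containing the word (df-count)
    // numDocs : total number of documents in the training set
    static double tfIdf(int tf, long df, long numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // a word occurring 3 times in a document, present in 10 of 1000 training documents
        System.out.println(tfIdf(3, 10, 1000)); // ~13.8
    }
}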
In order to train the classifier and check that the classification works well, we split the vectors into two sets: a training set and a testing set (here 40% of the vectors go to the testing set):
mahout split -i tweets-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
We use the training set to train the classifier:
$ mahout trainnb -i train-vectors -el -li labelindex -o model -ow -c
It creates the model (a matrix word id x label id) and a label index (mapping between a label and its id).
To test that the classifier is working properly, first on the training set and then on the testing set:
mahout testnb -i train-vectors -m model -l labelindex -ow -o tweets-testing -c
mahout testnb -i test-vectors -m model -l labelindex -ow -o tweets-testing -c
If the percentage of correctly classified instances is too low, you might need to improve your training set: add more tweets, merge categories that are too similar, or remove categories that are rarely used. After making these changes, restart the training process.

To use the classifier to classify new documents, we would need to copy several files from HDFS:
  • model (matrix word id x label id)
  • labelindex (mapping between a label and its id)
  • dictionary.file-0 (mapping between a word and its id)
  • df-count (document frequency: number of documents each word is appearing in)
$ hadoop fs -get labelindex labelindex
$ hadoop fs -get model model
$ hadoop fs -get tweets-vectors/dictionary.file-0 dictionary.file-0
$ hadoop fs -getmerge tweets-vectors/df-count df-count
Then run the classifier on the new tweets:
$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.Classifier model labelindex dictionary.file-0 df-count data/tweets-to-classify.tsv
public class Classifier {
 
    public static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<String, Integer> dictionnary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
        }
        return dictionnary;
    }
 
    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }
 
    public static void main(String[] args) throws Exception {
        if (args.length < 5) {
            System.out.println("Arguments: [model] [label index] [dictionary] [document frequency] [tweet file]");
            return;
        }
        String modelPath = args[0];
        String labelIndexPath = args[1];
        String dictionaryPath = args[2];
        String documentFrequencyPath = args[3];
        String tweetsPath = args[4];

        Configuration configuration = new Configuration();

        // model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
 
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
 
        // labels is a map labelId => label name
        Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
        Map<String, Integer> dictionary = readDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));
 
        // analyzer used to extract word from tweet
        Analyzer analyzer = new DefaultAnalyzer();
 
        int labelCount = labels.size();
        int documentCount = documentFrequency.get(-1).intValue();
 
        System.out.println("Number of labels: " + labelCount);
        System.out.println("Number of documents in training set: " + documentCount);
        BufferedReader reader = new BufferedReader(new FileReader(tweetsPath));
        while(true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }
 
            String[] tokens = line.split("\t", 2);
            String tweetId = tokens[0];
            String tweet = tokens[1];
 
            System.out.println("Tweet: " + tweetId + "\t" + tweet);
 
            Multiset<String> words = ConcurrentHashMultiset.create();
 
            // extract words from tweet
            TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(tweet));
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            int wordCount = 0;
            while (ts.incrementToken()) {
                if (termAtt.length() > 0) {
                    String word = ts.getAttribute(CharTermAttribute.class).toString();
                    Integer wordId = dictionary.get(word);
                    // if the word is not in the dictionary, skip it
                    if (wordId != null) {
                        words.add(word);
                        wordCount++;
                    }
                }
            }
 
            // create vector wordId => weight using tfidf
            Vector vector = new RandomAccessSparseVector(10000);
            TFIDF tfidf = new TFIDF();
            for (Multiset.Entry<String> entry : words.entrySet()) {
                String word = entry.getElement();
                int count = entry.getCount();
                Integer wordId = dictionary.get(word);
                Long freq = documentFrequency.get(wordId);
                double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
                vector.setQuick(wordId, tfIdfValue);
            }
            // With the classifier, we get one score for each label
            // The label with the highest score is the one the tweet is more likely to
            // be associated to
            Vector resultVector = classifier.classifyFull(vector);
            double bestScore = -Double.MAX_VALUE;
            int bestCategoryId = -1;
            for(Element element: resultVector) {
                int categoryId = element.index();
                double score = element.get();
                if (score > bestScore) {
                    bestScore = score;
                    bestCategoryId = categoryId;
                }
                System.out.print("  " + labels.get(categoryId) + ": " + score);
            }
            System.out.println(" => " + labels.get(bestCategoryId));
        }
    }
}
In this post, we only studied one Mahout classifier among many others: SGD, SVM, Neural Network, Random Forests, etc.

Show the content of a sequence file

To inspect a sequence file stored in HDFS, you can use:
$ hadoop fs -text [FILE_NAME]
If Mahout writable classes are needed to read the file, add the Mahout jars to the Hadoop classpath:
export HADOOP_CLASSPATH=[MAHOUT_DIR]/mahout-math-0.7.jar:[MAHOUT_DIR]/mahout-examples-0.7-job.jar
You can also use Mahout's seqdumper:
mahout seqdumper -i [FILE_NAME]

View the words that are most representative of each category

You can use the class TopCategoryWords, which shows the top 10 words of each category.
public class TopCategoryWords {
 
    public static Map<Integer, String> readInverseDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<Integer, String> inverseDictionnary = new HashMap<Integer, String>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            inverseDictionnary.put(pair.getSecond().get(), pair.getFirst().toString());
        }
        return inverseDictionnary;
    }
 
    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }
 
    public static class WordWeight implements Comparable<WordWeight> {
        private int wordId;
        private double weight;
 
        public WordWeight(int wordId, double weight) {
            this.wordId = wordId;
            this.weight = weight;
        }
 
        public int getWordId() {
            return wordId;
        }
 
        public Double getWeight() {
            return weight;
        }
 
        @Override
        public int compareTo(WordWeight w) {
            return -getWeight().compareTo(w.getWeight());
        }
    }
 
    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.out.println("Arguments: [model] [label index] [dictionary] [document frequency]");
            return;
        }
        String modelPath = args[0];
        String labelIndexPath = args[1];
        String dictionaryPath = args[2];
        String documentFrequencyPath = args[3];

        Configuration configuration = new Configuration();

        // model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
 
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
 
        // labels is a map labelId => label name
        Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
        Map<Integer, String> inverseDictionary = readInverseDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));
 
        int labelCount = labels.size();
        int documentCount = documentFrequency.get(-1).intValue();
 
        System.out.println("Number of labels: " + labelCount);
        System.out.println("Number of documents in training set: " + documentCount);
 
        for(int labelId = 0 ; labelId < model.numLabels() ; labelId++) {
            SortedSet<WordWeight> wordWeights = new TreeSet<WordWeight>();
            for (int wordId = 0; wordId < model.numFeatures(); wordId++) {
                WordWeight w = new WordWeight(wordId, model.weight(labelId, wordId));
                wordWeights.add(w);
            }

            System.out.println("Top 10 words for label " + labels.get(labelId));
            int i = 0;
            for (WordWeight w : wordWeights) {
                System.out.println(" - " + inverseDictionary.get(w.getWordId())
                        + ": " + w.getWeight());
                i++;
                if (i >= 10) {
                    break;
                }
            }
        }
    }
}
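It reads the same files that were copied from HDFS earlier and can be run like the Classifier (assuming TopCategoryWords lives in the same package as the Classifier above; adjust the class name to your project):

java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TopCategoryWords model labelindex dictionary.file-0 df-count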
To remove the generated files from your HDFS home directory and start over:
hadoop fs -rmr \*

Errors

When running the script to convert the tweet TSV file, I got the following errors:
Skip line: tech 309167277155168257      Easy web hosting. $4.95 -  http://t.co/0oUGS6Oj0e  - Review/Coupon- http://t.co/zdgH4kv5sv  #wordpress #deal #bluehost #blue host
Skip line: art 309167270989541376      Beautiful Jan Royce Conant Drawing of Jamaica - 1982 - Rare CT Artist - Animals #CPTV #EBAY #FineArt #Deals http://t.co/MUZf5aixMz
Make sure that the category and the tweet id are followed by a tab character and not spaces.
Read full article from Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages | Chimpler
