Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages | Chimpler
Classification algorithms can be used to automatically classify documents or images, to implement spam filters, and in many other domains.
In this tutorial we are going to use Mahout to classify tweets with a Naive Bayes classifier. The algorithm works from a training set: a set of documents that have already been associated with a category. Using this set, the classifier determines, for each word, the probability that the word makes a document belong to each of the considered categories. To compute the probability that a document belongs to a category, it multiplies together the individual probabilities of each of the document's words in that category. The category with the highest probability is the one the document most likely belongs to.
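In formula terms, this is the standard multinomial Naive Bayes decision rule (a sketch of the general idea, not Mahout's exact smoothed implementation; in practice the product is computed as a sum of logarithms, which is why the scores printed by the classifier later in this post look like negative log-likelihoods):

\[
\hat{c} \;=\; \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(w_i \mid c)
       \;=\; \arg\max_{c}\;\Big(\log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c)\Big)
\]

where w_1, ..., w_n are the words of the document and P(w_i | c) is estimated from the training set.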
In our tutorial, we will limit the tweets to deals by getting the tweets containing the hashtags #deal, #deals and #discount. We will classify them into the following categories:
- apparel (clothes, shoes, watches, …)
- art (books, DVDs, music, …)
- camera
- event (travel, concert, …)
- health (beauty, spa, …)
- home (kitchen, furniture, garden, …)
Build the example project with Maven (this produces the jar with dependencies used by the commands below):

mvn clean package assembly:single
To transform the collected tweets into a training set, you can use your favorite editor and add the category of each tweet at the beginning of its line, followed by a tab character:
tech 308215054011194110 Limited 3-Box $20 BOGO, Supreme $9 BOGO, PTC Basketball $10 BOGO, Sterling Baseball $20 BOGO, Bowman Chrome $7 http://t.co/WMdbNFLvVZ #deals
Make sure to use a tab character between the category and the tweet id, and between the tweet id and the tweet message.
For the classifier to work properly, this set must have at least 50 tweets in each category.
Training the model with Mahout
$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TweetTSVToSeq data/tweets-train.tsv tweets-seq
The TweetTSVToSeq class converts the tweet TSV file into a Hadoop sequence file.
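The original post does not reproduce its source here, so below is a minimal sketch of what it plausibly does, inferred from the Mahout conventions used later in this post (mahout trainnb -el extracts the label from a key of the form /category/docId). Each TSV line becomes one sequence file entry whose key is "/category/tweetId" and whose value is the tweet text; the chunk-0 file name and the exact structure are assumptions, and the real implementation in the chimpler repository may differ:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetTSVToSeq {
    public static void main(String[] args) throws Exception {
        String inputFileName = args[0];   // e.g. data/tweets-train.tsv
        String outputDirName = args[1];   // e.g. tweets-seq
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(configuration);
        // one sequence file with Text keys ("/category/tweetId") and Text values (tweet message)
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, configuration,
                new Path(outputDirName + "/chunk-0"), Text.class, Text.class);
        BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] tokens = line.split("\t", 3);
            if (tokens.length != 3) {
                // malformed line (e.g. spaces instead of tabs): skip it
                System.out.println("Skip line: " + line);
                continue;
            }
            // key format "/category/tweetId" lets mahout trainnb -el extract the label later
            writer.append(new Text("/" + tokens[0] + "/" + tokens[1]), new Text(tokens[2]));
        }
        reader.close();
        writer.close();
    }
}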
Copy the generated sequence file directory to HDFS:

hadoop fs -put tweets-seq tweets-seq
We can now run Mahout to transform the training set into vectors using tf-idf weights (term frequency × inverse document frequency):
mahout seq2sparse -i tweets-seq -o tweets-vectors
It will generate the following files in HDFS in the directory tweets-vectors:
- df-count: sequence file mapping each word id => number of documents containing that word
- dictionary.file-0: sequence file mapping each word => word id
- frequency.file-0: sequence file mapping each word id => word count
- tf-vectors: sequence file with the term-frequency vector of each document
- tfidf-vectors: sequence file mapping each document id => tf-idf weight of each word in the document
- tokenized-documents: sequence file mapping each document id => list of its words
- wordcount: sequence file mapping each word => word count
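As a reference for what these weights mean, the classic tf-idf formula is shown below. Mahout's seq2sparse uses its own variant (with sub-linear term-frequency scaling and smoothing), so take this as the general idea rather than the exact computation:

\[
\mathrm{tfidf}(t,d) \;=\; \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)}
\]

where tf(t, d) is the number of occurrences of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t (the value stored in df-count above).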
In order to train the classifier and then check that the classification works well, Mahout splits the set into two parts: a training set and a testing set:

mahout split -i tweets-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

Here --randomSelectionPct 40 randomly assigns 40% of the vectors to the testing set, leaving the remaining 60% for training.
We use the training set to train the classifier:
$ mahout trainnb -i train-vectors -el -li labelindex -o model -ow -c
It creates the model (a matrix word id × label id) and a label index (the mapping between each label and its id).
To check that the classifier works properly, run it on the training set:

mahout testnb -i train-vectors -m model -l labelindex -ow -o tweets-testing -c

and on the testing set:

mahout testnb -i test-vectors -m model -l labelindex -ow -o tweets-testing -c

If the percentage of correctly classified instances is too low, you might need to improve your training set: add more tweets, merge categories that are too similar, or remove categories that are rarely used. After you are done with your changes, restart the training process.
To use the classifier to classify new documents, we would need to copy several files from HDFS:
- model (matrix word id x label id)
- labelindex (mapping between a label and its id)
- dictionary.file-0 (mapping between a word and its id)
- df-count (document frequency: number of documents each word is appearing in)
$ hadoop fs -get labelindex labelindex
$ hadoop fs -get model model
$ hadoop fs -get tweets-vectors/dictionary.file-0 dictionary.file-0
$ hadoop fs -getmerge tweets-vectors/df-count df-count
java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.Classifier model labelindex dictionary.file-0 df-count data/tweets-to-classify.tsv
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.TFIDF;

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;

public class Classifier {

    // reads the dictionary generated by seq2sparse: word => word id
    public static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<String, Integer> dictionnary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
        }
        return dictionnary;
    }

    // reads the document frequencies generated by seq2sparse: word id => document count
    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 5) {
            System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency] [tweet file]");
            return;
        }
        String modelPath = args[0];
        String labelIndexPath = args[1];
        String dictionaryPath = args[2];
        String documentFrequencyPath = args[3];
        String tweetsPath = args[4];

        Configuration configuration = new Configuration();

        // model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

        // labels is a map label id => label
        Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
        Map<String, Integer> dictionary = readDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

        // analyzer used to extract the words from a tweet
        Analyzer analyzer = new DefaultAnalyzer();

        int labelCount = labels.size();
        // the entry with key -1 holds the total number of documents
        int documentCount = documentFrequency.get(-1).intValue();

        System.out.println("Number of labels: " + labelCount);
        System.out.println("Number of documents in training set: " + documentCount);

        BufferedReader reader = new BufferedReader(new FileReader(tweetsPath));
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }

            String[] tokens = line.split("\t", 2);
            String tweetId = tokens[0];
            String tweet = tokens[1];

            System.out.println("Tweet: " + tweetId + "\t" + tweet);

            Multiset<String> words = ConcurrentHashMultiset.create();

            // extract the words from the tweet
            TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(tweet));
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            int wordCount = 0;
            while (ts.incrementToken()) {
                if (termAtt.length() > 0) {
                    String word = ts.getAttribute(CharTermAttribute.class).toString();
                    Integer wordId = dictionary.get(word);
                    // if the word is not in the dictionary, skip it
                    if (wordId != null) {
                        words.add(word);
                        wordCount++;
                    }
                }
            }

            // create vector wordId => weight using tfidf
            Vector vector = new RandomAccessSparseVector(10000);
            TFIDF tfidf = new TFIDF();
            for (Multiset.Entry<String> entry : words.entrySet()) {
                String word = entry.getElement();
                int count = entry.getCount();
                Integer wordId = dictionary.get(word);
                Long freq = documentFrequency.get(wordId);
                double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
                vector.setQuick(wordId, tfIdfValue);
            }

            // With the classifier, we get one score for each label.
            // The label with the highest score is the one the tweet
            // is most likely to be associated to.
            Vector resultVector = classifier.classifyFull(vector);
            double bestScore = -Double.MAX_VALUE;
            int bestCategoryId = -1;
            for (Element element : resultVector) {
                int categoryId = element.index();
                double score = element.get();
                if (score > bestScore) {
                    bestScore = score;
                    bestCategoryId = categoryId;
                }
                System.out.print(" " + labels.get(categoryId) + ": " + score);
            }
            System.out.println(" => " + labels.get(bestCategoryId));
        }
        reader.close();
    }
}
To inspect the content of a sequence file in HDFS:

$ hadoop fs -text [FILE_NAME]

Some of the files use Mahout writable classes, so you might need to add the Mahout jars to the Hadoop classpath first:

export HADOOP_CLASSPATH=[MAHOUT_DIR]/mahout-math-0.7.jar:[MAHOUT_DIR]/mahout-examples-0.7-job.jar

Alternatively, you can use the Mahout seqdumper command:

mahout seqdumper -i [FILE_NAME]
Viewing the words that are most representative of each category
You can use the class TopCategoryWords, which shows the top 10 words of each category.
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;

public class TopCategoryWords {

    // reads the dictionary generated by seq2sparse, inverted: word id => word
    public static Map<Integer, String> readInverseDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<Integer, String> inverseDictionnary = new HashMap<Integer, String>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            inverseDictionnary.put(pair.getSecond().get(), pair.getFirst().toString());
        }
        return inverseDictionnary;
    }

    // reads the document frequencies generated by seq2sparse: word id => document count
    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }

    public static class WordWeight implements Comparable<WordWeight> {
        private int wordId;
        private double weight;

        public WordWeight(int wordId, double weight) {
            this.wordId = wordId;
            this.weight = weight;
        }

        public int getWordId() {
            return wordId;
        }

        public Double getWeight() {
            return weight;
        }

        @Override
        public int compareTo(WordWeight w) {
            // sort by descending weight
            return -getWeight().compareTo(w.getWeight());
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency]");
            return;
        }
        String modelPath = args[0];
        String labelIndexPath = args[1];
        String dictionaryPath = args[2];
        String documentFrequencyPath = args[3];

        Configuration configuration = new Configuration();

        // model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

        // labels is a map label id => label
        Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
        Map<Integer, String> inverseDictionary = readInverseDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

        int labelCount = labels.size();
        int documentCount = documentFrequency.get(-1).intValue();

        System.out.println("Number of labels: " + labelCount);
        System.out.println("Number of documents in training set: " + documentCount);

        for (int labelId = 0; labelId < model.numLabels(); labelId++) {
            // sort all the words of this label by descending weight
            SortedSet<WordWeight> wordWeights = new TreeSet<WordWeight>();
            for (int wordId = 0; wordId < model.numFeatures(); wordId++) {
                WordWeight w = new WordWeight(wordId, model.weight(labelId, wordId));
                wordWeights.add(w);
            }
            System.out.println("Top 10 words for label " + labels.get(labelId));
            int i = 0;
            for (WordWeight w : wordWeights) {
                System.out.println(" - " + inverseDictionary.get(w.getWordId()) + ": " + w.getWeight());
                i++;
                if (i >= 10) {
                    break;
                }
            }
        }
    }
}
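To run it against the files copied from HDFS earlier, the invocation looks like the one for the Classifier above. Note that the fully qualified class name below assumes TopCategoryWords lives in the same com.chimpler.example.bayes package, which the post does not state explicitly:

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TopCategoryWords model labelindex dictionary.file-0 df-count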
If you want to rerun the whole process from scratch, you can remove the generated files from HDFS (be careful: this removes everything in your HDFS home directory):

hadoop fs -rmr \*
Errors
When running the script to convert the tweet TSV file, I got the following errors:
Skip line: tech 309167277155168257 Easy web hosting. $4.95 - http://t.co/0oUGS6Oj0e - Review/Coupon- http://t.co/zdgH4kv5sv #wordpress #deal #bluehost #blue host
Skip line: art 309167270989541376 Beautiful Jan Royce Conant Drawing of Jamaica - 1982 - Rare CT Artist - Animals #CPTV #EBAY #FineArt #Deals http://t.co/MUZf5aixMz

Make sure that the category and the tweet id are followed by a tab character and not spaces.
In this post, we only studied one Mahout classifier among many others: SGD, SVM, Neural Network, Random Forests, …