Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages | Chimpler
Classification algorithms can be used to automatically classify documents or images, to implement spam filters, and in many other domains.
In this tutorial we are going to use Mahout to classify tweets with a Naive Bayes classifier. The algorithm works from a training set: a set of documents that have already been associated with a category. Using this set, the classifier determines, for each word, the probability that the word makes a document belong to each of the considered categories. To compute the probability that a document belongs to a category, it multiplies together the individual probabilities of each of the document's words in that category. The category with the highest probability is the one the document most likely belongs to.
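In formula terms, this is the standard multinomial Naive Bayes decision rule (a sketch of the general idea, not Mahout's exact smoothed implementation; in practice the product is computed as a sum of logarithms, which is why the scores printed by the classifier later in this post look like negative log-likelihoods):

\[
\hat{c} \;=\; \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(w_i \mid c)
       \;=\; \arg\max_{c}\;\Big(\log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c)\Big)
\]

where w_1, ..., w_n are the words of the document and P(w_i | c) is estimated from the training set.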
In our tutorial, we will limit the tweets to deals by getting the tweets containing the hashtags #deal, #deals and #discount. We will classify them into the following categories:
- apparel (clothes, shoes, watches, …)
- art (books, DVDs, music, …)
- camera
- event (travel, concert, …)
- health (beauty, spa, …)
- home (kitchen, furniture, garden, …)
Build the example project with Maven (this produces the jar with dependencies used by the commands below):

mvn clean package assembly:single
To transform the collected tweets into a training set, you can use your favorite editor and add the category of each tweet at the beginning of its line, followed by a tab character:
tech 308215054011194110 Limited 3-Box $20 BOGO, Supreme $9 BOGO, PTC Basketball $10 BOGO, Sterling Baseball $20 BOGO, Bowman Chrome $7 http://t.co/WMdbNFLvVZ #deals
Make sure to use a tab character between the category and the tweet id, and between the tweet id and the tweet message.
For the classifier to work properly, this set must have at least 50 tweets in each category.
Training the model with Mahout
$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TweetTSVToSeq data/tweets-train.tsv tweets-seq
The TweetTSVToSeq class converts the tweet TSV file into a Hadoop sequence file.
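The original post does not reproduce its source here, so below is a minimal sketch of what it plausibly does, inferred from the Mahout conventions used later in this post (mahout trainnb -el extracts the label from a key of the form /category/docId). Each TSV line becomes one sequence file entry whose key is "/category/tweetId" and whose value is the tweet text; the chunk-0 file name and the exact structure are assumptions, and the real implementation in the chimpler repository may differ:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TweetTSVToSeq {
    public static void main(String[] args) throws Exception {
        String inputFileName = args[0];   // e.g. data/tweets-train.tsv
        String outputDirName = args[1];   // e.g. tweets-seq
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(configuration);
        // one sequence file with Text keys ("/category/tweetId") and Text values (tweet message)
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, configuration,
                new Path(outputDirName + "/chunk-0"), Text.class, Text.class);
        BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] tokens = line.split("\t", 3);
            if (tokens.length != 3) {
                // malformed line (e.g. spaces instead of tabs): skip it
                System.out.println("Skip line: " + line);
                continue;
            }
            // key format "/category/tweetId" lets mahout trainnb -el extract the label later
            writer.append(new Text("/" + tokens[0] + "/" + tokens[1]), new Text(tokens[2]));
        }
        reader.close();
        writer.close();
    }
}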
Copy the generated sequence file directory to HDFS:

hadoop fs -put tweets-seq tweets-seq
We can now run Mahout to transform the training set into vectors using tf-idf weights (term frequency × inverse document frequency):
mahout seq2sparse -i tweets-seq -o tweets-vectors
It will generate the following files in HDFS in the directory tweets-vectors:
- df-count: sequence file mapping each word id => number of documents containing that word
- dictionary.file-0: sequence file mapping each word => word id
- frequency.file-0: sequence file mapping each word id => word count
- tf-vectors: sequence file with the term-frequency vector of each document
- tfidf-vectors: sequence file mapping each document id => tf-idf weight of each word in the document
- tokenized-documents: sequence file mapping each document id => list of its words
- wordcount: sequence file mapping each word => word count
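As a reference for what these weights mean, the classic tf-idf formula is shown below. Mahout's seq2sparse uses its own variant (with sub-linear term-frequency scaling and smoothing), so take this as the general idea rather than the exact computation:

\[
\mathrm{tfidf}(t,d) \;=\; \mathrm{tf}(t,d) \times \log\frac{N}{\mathrm{df}(t)}
\]

where tf(t, d) is the number of occurrences of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t (the value stored in df-count above).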
In order to train the classifier and then check that the classification works well, Mahout splits the set into two parts: a training set and a testing set:

mahout split -i tweets-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

Here --randomSelectionPct 40 randomly assigns 40% of the vectors to the testing set, leaving the remaining 60% for training.
We use the training set to train the classifier:
$ mahout trainnb -i train-vectors -el -li labelindex -o model -ow -c
It creates the model (a matrix word id × label id) and a label index (the mapping between each label and its id).
To check that the classifier works properly, run it on the training set:

mahout testnb -i train-vectors -m model -l labelindex -ow -o tweets-testing -c

and on the testing set:

mahout testnb -i test-vectors -m model -l labelindex -ow -o tweets-testing -c

If the percentage of correctly classified instances is too low, you might need to improve your training set: add more tweets, merge categories that are too similar, or remove categories that are rarely used. After you are done with your changes, restart the training process.
To use the classifier to classify new documents, we would need to copy several files from HDFS:
- model (matrix word id x label id)
- labelindex (mapping between a label and its id)
- dictionary.file-0 (mapping between a word and its id)
- df-count (document frequency: number of documents each word is appearing in)
$ hadoop fs -get labelindex labelindex
$ hadoop fs -get model model
$ hadoop fs -get tweets-vectors/dictionary.file-0 dictionary.file-0
$ hadoop fs -getmerge tweets-vectors/df-count df-count
java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.Classifier model labelindex dictionary.file-0 df-count data/tweets-to-classify.tsv
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.DefaultAnalyzer;
import org.apache.mahout.vectorizer.TFIDF;

import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;

public class Classifier {

    // reads the dictionary generated by seq2sparse: word => word id
    public static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<String, Integer> dictionnary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
        }
        return dictionnary;
    }

    // reads the document frequencies generated by seq2sparse: word id => document count
    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 5) {
            System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency] [tweet file]");
            return;
        }
        String modelPath = args[0];
        String labelIndexPath = args[1];
        String dictionaryPath = args[2];
        String documentFrequencyPath = args[3];
        String tweetsPath = args[4];

        Configuration configuration = new Configuration();

        // model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

        // labels is a map label id => label
        Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
        Map<String, Integer> dictionary = readDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

        // analyzer used to extract the words from a tweet
        Analyzer analyzer = new DefaultAnalyzer();

        int labelCount = labels.size();
        // the entry with key -1 holds the total number of documents
        int documentCount = documentFrequency.get(-1).intValue();

        System.out.println("Number of labels: " + labelCount);
        System.out.println("Number of documents in training set: " + documentCount);

        BufferedReader reader = new BufferedReader(new FileReader(tweetsPath));
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }

            String[] tokens = line.split("\t", 2);
            String tweetId = tokens[0];
            String tweet = tokens[1];

            System.out.println("Tweet: " + tweetId + "\t" + tweet);

            Multiset<String> words = ConcurrentHashMultiset.create();

            // extract the words from the tweet
            TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(tweet));
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            int wordCount = 0;
            while (ts.incrementToken()) {
                if (termAtt.length() > 0) {
                    String word = ts.getAttribute(CharTermAttribute.class).toString();
                    Integer wordId = dictionary.get(word);
                    // if the word is not in the dictionary, skip it
                    if (wordId != null) {
                        words.add(word);
                        wordCount++;
                    }
                }
            }

            // create vector wordId => weight using tfidf
            Vector vector = new RandomAccessSparseVector(10000);
            TFIDF tfidf = new TFIDF();
            for (Multiset.Entry<String> entry : words.entrySet()) {
                String word = entry.getElement();
                int count = entry.getCount();
                Integer wordId = dictionary.get(word);
                Long freq = documentFrequency.get(wordId);
                double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
                vector.setQuick(wordId, tfIdfValue);
            }

            // With the classifier, we get one score for each label.
            // The label with the highest score is the one the tweet
            // is most likely to be associated to.
            Vector resultVector = classifier.classifyFull(vector);
            double bestScore = -Double.MAX_VALUE;
            int bestCategoryId = -1;
            for (Element element : resultVector) {
                int categoryId = element.index();
                double score = element.get();
                if (score > bestScore) {
                    bestScore = score;
                    bestCategoryId = categoryId;
                }
                System.out.print(" " + labels.get(categoryId) + ": " + score);
            }
            System.out.println(" => " + labels.get(bestCategoryId));
        }
        reader.close();
    }
}
To inspect the content of a sequence file in HDFS:

$ hadoop fs -text [FILE_NAME]

Some of the files use Mahout writable classes, so you might need to add the Mahout jars to the Hadoop classpath first:

export HADOOP_CLASSPATH=[MAHOUT_DIR]/mahout-math-0.7.jar:[MAHOUT_DIR]/mahout-examples-0.7-job.jar

Alternatively, you can use the Mahout seqdumper command:

mahout seqdumper -i [FILE_NAME]
Viewing the words that are most representative of each category
You can use the class TopCategoryWords, which shows the top 10 words of each category.
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;

public class TopCategoryWords {

    // reads the dictionary generated by seq2sparse, inverted: word id => word
    public static Map<Integer, String> readInverseDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<Integer, String> inverseDictionnary = new HashMap<Integer, String>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            inverseDictionnary.put(pair.getSecond().get(), pair.getFirst().toString());
        }
        return inverseDictionnary;
    }

    // reads the document frequencies generated by seq2sparse: word id => document count
    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }

    public static class WordWeight implements Comparable<WordWeight> {
        private int wordId;
        private double weight;

        public WordWeight(int wordId, double weight) {
            this.wordId = wordId;
            this.weight = weight;
        }

        public int getWordId() {
            return wordId;
        }

        public Double getWeight() {
            return weight;
        }

        @Override
        public int compareTo(WordWeight w) {
            // sort by descending weight
            return -getWeight().compareTo(w.getWeight());
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency]");
            return;
        }
        String modelPath = args[0];
        String labelIndexPath = args[1];
        String dictionaryPath = args[2];
        String documentFrequencyPath = args[3];

        Configuration configuration = new Configuration();

        // model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

        // labels is a map label id => label
        Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
        Map<Integer, String> inverseDictionary = readInverseDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

        int labelCount = labels.size();
        int documentCount = documentFrequency.get(-1).intValue();

        System.out.println("Number of labels: " + labelCount);
        System.out.println("Number of documents in training set: " + documentCount);

        for (int labelId = 0; labelId < model.numLabels(); labelId++) {
            // sort all the words of this label by descending weight
            SortedSet<WordWeight> wordWeights = new TreeSet<WordWeight>();
            for (int wordId = 0; wordId < model.numFeatures(); wordId++) {
                WordWeight w = new WordWeight(wordId, model.weight(labelId, wordId));
                wordWeights.add(w);
            }
            System.out.println("Top 10 words for label " + labels.get(labelId));
            int i = 0;
            for (WordWeight w : wordWeights) {
                System.out.println(" - " + inverseDictionary.get(w.getWordId()) + ": " + w.getWeight());
                i++;
                if (i >= 10) {
                    break;
                }
            }
        }
    }
}
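To run it against the files copied from HDFS earlier, the invocation looks like the one for the Classifier above. Note that the fully qualified class name below assumes TopCategoryWords lives in the same com.chimpler.example.bayes package, which the post does not state explicitly:

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TopCategoryWords model labelindex dictionary.file-0 df-count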
If you want to rerun the whole process from scratch, you can remove the generated files from HDFS (be careful: this removes everything in your HDFS home directory):

hadoop fs -rmr \*
Errors
When running the script to convert the tweet TSV file, I got the following errors:
Skip line: tech 309167277155168257 Easy web hosting. $4.95 - http://t.co/0oUGS6Oj0e - Review/Coupon- http://t.co/zdgH4kv5sv #wordpress #deal #bluehost #blue host
Skip line: art 309167270989541376 Beautiful Jan Royce Conant Drawing of Jamaica - 1982 - Rare CT Artist - Animals #CPTV #EBAY #FineArt #Deals http://t.co/MUZf5aixMz

Make sure that the category and the tweet id are followed by a tab character and not spaces.
In this post, we only studied one Mahout classifier among many others: SGD, SVM, Neural Network, Random Forests, …