Text categorization with Lucene and Solr

Let the algorithm assign one or more labels (classes) to some item, given some previous knowledge:
- Spam filter
- Tagging system
- Digit recognition system
- Text categorization

- Lucene already has a lot of features for common information retrieval needs
  - Postings
  - Term vectors
  - Statistics
  - Positions
  - TF / IDF
  - maybe Payloads
  - etc.
- We may avoid bringing in new components to do classification, just leveraging what we get for free from Lucene

- Lucene already stores so many features you can take advantage of for free
- Therefore writing the classification algorithm is relatively simple
- In many cases you're just not adding anything to the architecture
  - Your Lucene index was already there for searching
- The Lucene index is, to some extent, already a model which we just need to “query” with the proper algorithm
- And it is fast enough

Classifier API
- Training
  - void train(atomicReader, contentField, classField, analyzer) throws IOException
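
For reference, the Classifier interface shipped in Lucene's classification module (Lucene 4.x era) looks roughly like the sketch below; parameter names follow the slide, and T is the type used for class labels (typically BytesRef).

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.index.AtomicReader;

    // Approximate shape of org.apache.lucene.classification.Classifier (Lucene 4.x)
    public interface Classifier<T> {

      // Build the model from an existing index: contentField holds the text,
      // classField holds the already known class labels.
      void train(AtomicReader atomicReader, String contentField, String classField,
                 Analyzer analyzer) throws IOException;

      // Assign a class (with a score) to a new, unseen piece of text.
      ClassificationResult<T> assignClass(String text) throws IOException;
    }

Both the k-nearest-neighbor and the Naïve Bayes classifiers described next implement this interface.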


K Nearest neighbor classifier
- Fairly simple classification algorithm
- Given some new unseen item
  - I search my knowledge base for the k items which are nearest to the new one
  - I get the k classes assigned to the k nearest items
  - I assign to the new item the class that is most frequent among the k returned items
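
As a tiny, Lucene-independent illustration of that voting step (the class and method names here are mine):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public final class MajorityVote {

      // Given the classes of the k nearest neighbours, return the most frequent one;
      // that class would then be assigned with a score of classFreq / k.
      public static String mostFrequentClass(List<String> kNearestClasses) {
        Map<String, Integer> counts = new HashMap<>();
        for (String clazz : kNearestClasses) {
          Integer current = counts.get(clazz);
          counts.put(clazz, current == null ? 1 : current + 1);
        }
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
      }
    }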

K Nearest neighbor classifier
- How can we do this in Lucene?
  - We have the VSM for representing documents as vectors and eventually finding distances
  - The Lucene MoreLikeThis module can do a lot of the work for us
- Given a new document
  - It's represented as a MoreLikeThisQuery, which filters out too frequent words and helps keep only the tokens relevant for finding the neighbors
  - The query is executed, returning only the first k results
  - The results are then browsed in order to find the most frequent class, which is then assigned with a score of classFreq / k
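
This is essentially what the KNearestNeighborClassifier in Lucene's classification module does (it builds a MoreLikeThis query internally). A minimal sketch, assuming Lucene 4.x, an existing index whose "body" field holds the text and whose "category" field holds the known labels, and a made-up index path:

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.classification.Classifier;
    import org.apache.lucene.classification.KNearestNeighborClassifier;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.SlowCompositeReaderWrapper;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.Version;

    public class KnnCategorizationExample {

      public static void main(String[] args) throws IOException {
        // Open the existing search index and view it as a single AtomicReader
        AtomicReader reader = SlowCompositeReaderWrapper.wrap(
            DirectoryReader.open(FSDirectory.open(new File("/path/to/index"))));

        // k = 10 nearest neighbors, found via a MoreLikeThis-style query
        Classifier<BytesRef> knn = new KNearestNeighborClassifier(10);
        knn.train(reader, "body", "category", new StandardAnalyzer(Version.LUCENE_44));

        // Classify a new, unseen document
        ClassificationResult<BytesRef> result = knn.assignClass("text of the new document");
        System.out.println(result.getAssignedClass().utf8ToString()
            + " (score = " + result.getScore() + ")"); // score is classFreq / k
      }
    }

In the k-NN case train() essentially records the reader, fields and analyzer; the actual neighbor search happens at assignClass() time, which is why retraining on a fresh index is cheap.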

Naïve Bayes classifier
- Slightly more complicated
- Based on probabilities
  - C = argmax( P(d|c) * P(c) )
  - P(d|c) : likelihood
  - P(c) : prior
- With some assumptions:
  - bag of words assumption: positions don't matter
  - conditional independence: the feature probabilities are independent given the class
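
Under these two assumptions the likelihood P(d|c) factorizes over the document's terms w1 … wn, giving the usual multinomial Naïve Bayes decision rule (this expansion is mine; logarithms are used in practice to avoid floating point underflow):

    C = \arg\max_{c}\; P(d \mid c)\, P(c)
      = \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
      = \arg\max_{c}\; \Big( \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big)

In the Lucene classification module this is what SimpleNaiveBayesClassifier implements, estimating the prior and the per-term likelihoods from the index statistics.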


Things to consider - bootstrapping
- How are your first documents classified?
  - Manually
    - Categories are already there in the documents
    - Someone is explicitly charged to do that (e.g. article authors) at some point in time
  - (Semi-)automatically
    - Using some existing service / library
    - With or without human supervision
- In either case the classifier needs something to be fed with in order to be effective

As specific search services
- A classification-based MoreLikeThis handler
While indexing
- For automatic text categorization
Automatic text categorization
- Once a doc reaches Solr
  - We can use the Lucene classifiers to automatically assign the document's category
  - We can leverage existing Solr facilities for enhancing the indexing pipeline
    - An UpdateChain can be decorated with one or more UpdateRequestProcessors

CategorizationUpdateRequestProcessorFactory
CategorizationUpdateRequestProcessor
- void processAdd(AddUpdateCommand cmd) throws IOException
  - String text = solrInputDocument.getFieldValue("text");
  - String category = classifier.assignClass(text);
  - solrInputDocument.addField("cat", category);
- Every now and then the classifier needs to be retrained to pick up the latest documents in the index, but that can be done in the background without affecting performance
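
A sketch of what such a processor might look like, following the class names on the slide; the field names ("text", "cat") match the pseudocode above, while the classifier wiring is an assumption (in practice the factory would create and train it, and the processor would be registered in the UpdateChain in solrconfig.xml):

    import java.io.IOException;

    import org.apache.lucene.classification.ClassificationResult;
    import org.apache.lucene.classification.Classifier;
    import org.apache.lucene.util.BytesRef;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class CategorizationUpdateRequestProcessor extends UpdateRequestProcessor {

      private final Classifier<BytesRef> classifier;

      public CategorizationUpdateRequestProcessor(Classifier<BytesRef> classifier,
                                                  UpdateRequestProcessor next) {
        super(next);
        this.classifier = classifier; // already trained, e.g. by the factory
      }

      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object text = doc.getFieldValue("text");
        if (text != null) {
          ClassificationResult<BytesRef> result = classifier.assignClass(text.toString());
          if (result != null) {
            doc.addField("cat", result.getAssignedClass().utf8ToString());
          }
        }
        super.processAdd(cmd); // hand the document over to the rest of the update chain
      }
    }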

CategorizationUpdateRequestProcessor
- Finer grained control
  - Use automatic text categorization only if a value does not exist for the "cat" field
  - Add the classifier's output class to the "cat" field only if it's above a certain score
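
That finer-grained behavior only changes processAdd in the processor sketched earlier; the 0.5 threshold below is just a placeholder:

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      // Classify only when no category was supplied explicitly
      if (doc.getFieldValue("cat") == null && doc.getFieldValue("text") != null) {
        ClassificationResult<BytesRef> result =
            classifier.assignClass(doc.getFieldValue("text").toString());
        // Keep the predicted class only if its score is above a chosen threshold
        if (result != null && result.getScore() > 0.5) {
          doc.addField("cat", result.getAssignedClass().utf8ToString());
        }
      }
      super.processAdd(cmd);
    }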

Implement a MaxEnt Lucene-based classifier
- which takes word correlations into account

Please read the full article: Text categorization with Lucene and Solr
