Text categorization with Lucene and Solr
Let the algorithm assign one or more labels (classes) to an item, given some prior knowledge:
- Spam filters
- Tagging systems
- Digit recognition systems
- Text categorization
- Lucene already has a lot of features for common information retrieval needs:
  - Postings
  - Term vectors
  - Statistics
  - Positions
  - TF/IDF
  - Payloads (possibly)
  - etc.
- We can avoid bringing in new components to do classification by just leveraging what we get for free from Lucene
- Lucene already stores many features you can take advantage of for free
- Writing the classification algorithm is therefore relatively simple
- In many cases you're not adding anything to the architecture: your Lucene index was already there for searching
- A Lucene index is, to some extent, already a model which we just need to "query" with the proper algorithm
- And it is fast enough
Classifier API
- Training
  - void train(AtomicReader atomicReader, String textFieldName, String classFieldName, Analyzer analyzer) throws IOException
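As a rough sketch of how this API might be used (Lucene 4.x-era classification module assumed; the field names "text" and "cat", the analyzer version, and k = 10 are illustrative choices, not prescribed by the slides):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.KNearestNeighborClassifier;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TrainAndClassify {
  // Train on documents already in the index ("text" = content field,
  // "cat" = class field), then classify a new piece of text.
  public static String classify(AtomicReader reader, String unseenText) throws IOException {
    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(10); // k = 10
    classifier.train(reader, "text", "cat", new StandardAnalyzer(Version.LUCENE_44));
    ClassificationResult<BytesRef> result = classifier.assignClass(unseenText);
    return result.getAssignedClass().utf8ToString();
  }
}
```

Note that training here is cheap: the index itself is the model, so train() mostly records which reader and fields to query later.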
K Nearest neighbor classifier
- A fairly simple classification algorithm
- Given some new, unseen item:
  - Search the knowledge base for the k items nearest to the new one
  - Get the k classes assigned to those k nearest items
  - Assign to the new item the class that is most frequent among the k returned items
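The last two steps (counting the classes of the k results and picking the winner) reduce to a majority vote; a minimal, index-free sketch in plain Java:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KnnVote {
  // Given the classes of the k nearest neighbors, return the most frequent
  // class together with its score classFreq / k.
  public static Map.Entry<String, Double> vote(List<String> neighborClasses) {
    Map<String, Integer> freq = new HashMap<>();
    for (String c : neighborClasses) {
      freq.merge(c, 1, Integer::sum);
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : freq.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return Map.entry(best, (double) bestCount / neighborClasses.size());
  }
}
```

The retrieval of the k neighbors themselves is left to the search layer; this is only the voting rule.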
K Nearest neighbor classifier
- How can we do this in Lucene?
- We have the vector space model (VSM) for representing documents as vectors and eventually finding distances
- The Lucene MoreLikeThis module can do a lot of this for us
- Given a new document:
  - It is represented as a MoreLikeThisQuery, which filters out too-frequent words and helps keep only the tokens relevant for finding the neighbors
  - The query is executed, returning only the first k results
  - The results are then scanned to find the most frequent class, which is assigned with a score of classFreq / k
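The flow above might look roughly like this (a hypothetical sketch, not the actual MoreLikeThisQuery internals; Lucene 4.x-era API assumed, with "text"/"cat" as example field names and `indexReader`, `indexSearcher`, `analyzer`, `k` provided by the caller):

```java
// Build a MoreLikeThis query from the unseen document's text.
MoreLikeThis mlt = new MoreLikeThis(indexReader);
mlt.setAnalyzer(analyzer);
mlt.setFieldNames(new String[] { "text" });
Query query = mlt.like(new StringReader(unseenText), "text");

// Execute it, keeping only the first k results.
TopDocs topDocs = indexSearcher.search(query, k);

// Count the classes of the neighbors; the most frequent one wins
// with score classFreq / k.
Map<String, Integer> classCounts = new HashMap<>();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
  String assignedClass = indexSearcher.doc(scoreDoc.doc).get("cat");
  classCounts.merge(assignedClass, 1, Integer::sum);
}
```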
Naïve Bayes classifier
- Slightly more complicated
- Based on probabilities
- C = argmax_c ( P(d|c) * P(c) )
  - P(d|c): the likelihood
  - P(c): the prior
- With some assumptions:
  - Bag-of-words assumption: word positions don't matter
  - Conditional independence: the feature probabilities are independent given a class
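A toy, index-free illustration of the formula under exactly these assumptions (add-one smoothing is my addition to avoid zero probabilities; a real implementation would read the counts from the Lucene index):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ToyNaiveBayes {
  private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
  private final Map<String, Integer> docCounts = new HashMap<>();
  private final Set<String> vocabulary = new HashSet<>();
  private int totalDocs = 0;

  // Record one training document (a bag of words) for a class.
  public void train(String cls, List<String> words) {
    totalDocs++;
    docCounts.merge(cls, 1, Integer::sum);
    Map<String, Integer> counts = wordCounts.computeIfAbsent(cls, k -> new HashMap<>());
    for (String w : words) {
      counts.merge(w, 1, Integer::sum);
      vocabulary.add(w);
    }
  }

  // C = argmax_c ( log P(c) + sum_w log P(w|c) ), with add-one smoothing.
  public String classify(List<String> words) {
    String best = null;
    double bestLogProb = Double.NEGATIVE_INFINITY;
    for (String cls : docCounts.keySet()) {
      double logProb = Math.log((double) docCounts.get(cls) / totalDocs); // prior P(c)
      Map<String, Integer> counts = wordCounts.get(cls);
      int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
      for (String w : words) { // conditional independence: one factor per word
        int c = counts.getOrDefault(w, 0);
        logProb += Math.log((c + 1.0) / (totalWords + vocabulary.size()));
      }
      if (logProb > bestLogProb) {
        bestLogProb = logProb;
        best = cls;
      }
    }
    return best;
  }
}
```

Working in log space turns the product into a sum and avoids floating-point underflow on long documents.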
Things to consider - bootstrapping
- How do your first documents get classified?
- Manually:
  - The categories are already there in the documents
  - Someone is explicitly charged with doing it (e.g. the article authors) at some point in time
- (Semi-)automatically:
  - Using some existing service / library
  - With or without human supervision
- In either case, the classifier needs something to be fed with in order to be effective
As specific search services
- A classification-based MoreLikeThis
- While indexing:
  - For automatic text categorization
Automatic text categorization
- Once a doc reaches Solr:
  - We can use the Lucene classifiers to automate assigning the document's category
  - We can leverage existing Solr facilities for enhancing the indexing pipeline
  - An UpdateChain can be decorated with one or more UpdateRequestProcessors
CategorizationUpdateRequestProcessorFactory
CategorizationUpdateRequestProcessor
- void processAdd(AddUpdateCommand cmd) throws IOException
  - String text = (String) solrInputDocument.getFieldValue("text");
  - String clazz = classifier.assignClass(text); // `class` is a reserved word in Java
  - solrInputDocument.addField("cat", clazz);
- Every now and then the classifier needs to be retrained to pick up the latest documents in the index, but that can be done in the background without affecting performance
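Putting the pieces together, such a processor might look roughly like this (a sketch assuming Solr 4.x-era APIs; the field names and the way the trained classifier reaches the processor are assumptions, typically handled by the corresponding factory):

```java
public class CategorizationUpdateRequestProcessor extends UpdateRequestProcessor {

  // A trained Lucene classifier, injected by the factory.
  private final Classifier<BytesRef> classifier;

  public CategorizationUpdateRequestProcessor(Classifier<BytesRef> classifier,
                                              UpdateRequestProcessor next) {
    super(next);
    this.classifier = classifier;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    String text = (String) doc.getFieldValue("text");
    if (text != null) {
      ClassificationResult<BytesRef> result = classifier.assignClass(text);
      doc.addField("cat", result.getAssignedClass().utf8ToString());
    }
    super.processAdd(cmd); // hand the document to the next processor in the chain
  }
}
```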
CategorizationUpdateRequestProcessor
- Finer-grained control:
  - Use automatic text categorization only if a value does not already exist for the "cat" field
  - Add the classifier's output class to the "cat" field only if its score is above a certain threshold
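Inside such a processor, both checks could be combined along these lines (`minScore` is a hypothetical threshold, not something the slides specify):

```java
// Classify only when "cat" is not already set, and keep the result
// only if the classifier's score clears the threshold.
if (doc.getFieldValue("cat") == null) {
  ClassificationResult<BytesRef> result = classifier.assignClass(text);
  if (result.getScore() >= minScore) {
    doc.addField("cat", result.getAssignedClass().utf8ToString());
  }
}
```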
Implement a MaxEnt (maximum entropy) Lucene-based classifier
- one which takes word correlations into account
Please read full article from Text categorization with Lucene and Solr