Classifying Content with Apache SOLR | Zaizi



Apache Solr is a popular, scalable, fault-tolerant, open source enterprise search platform built on Apache Lucene. Enterprise content management systems such as Alfresco and Drupal use Apache Solr to provide search capabilities to the end user.

Lucene's classification module provides two classification algorithms, K-Nearest Neighbour (KNN) and Naive Bayes, which classify text using a document's content and its associated metadata.

The K-nearest neighbour algorithm uses the Apache Solr More Like This (MLT) feature to find existing documents that resemble a new text document, comparing content and available metadata, and then classifies the new document based on the categories of those similar documents. The Naive Bayes algorithm uses the exact term frequency information that is already available to classify new documents with a probabilistic approach (calculating conditional probabilities, as in Bayesian statistics).
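To make the difference between the two algorithms concrete, here is a minimal Python sketch of both ideas. It is not the Lucene implementation: the toy corpus, the function names, the Jaccard term overlap (standing in for More Like This similarity) and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy labelled corpus; documents and categories are made up for illustration.
TRAINING = [
    ("the batsman scored a century in the cricket match", "sport"),
    ("the bowler took five wickets in the test match", "sport"),
    ("the new processor doubles graphics performance", "technology"),
    ("the laptop ships with a faster graphics card", "technology"),
]


def tokens(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def knn_classify(text, training, k=3):
    """Majority vote among the k training documents most similar to `text`."""
    query = set(tokens(text))

    def similarity(pair):
        terms = set(tokens(pair[0]))
        return len(query & terms) / len(query | terms)  # Jaccard overlap

    nearest = sorted(training, key=similarity, reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def train_naive_bayes(training):
    """Collect per-category document counts and term frequencies."""
    doc_counts, term_counts, vocabulary = Counter(), {}, set()
    for text, label in training:
        doc_counts[label] += 1
        counts = term_counts.setdefault(label, Counter())
        for term in tokens(text):
            counts[term] += 1
            vocabulary.add(term)
    return doc_counts, term_counts, vocabulary


def naive_bayes_classify(text, model):
    """Pick the category with the highest log P(category) + sum log P(term | category)."""
    doc_counts, term_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for term in tokens(text):
            # Add-one smoothing so unseen terms do not zero out the probability.
            score += math.log(
                (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Note that `knn_classify` needs no training step and compares the new document against the stored documents at query time, whereas `naive_bayes_classify` works from aggregated term counts.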

There is also a proposal to add a MaxEnt-based classifier, which takes word correlation into account, as a future enhancement to Lucene classification.

Pros and cons

The advantage of Lucene document classification is that all the available documents, including the most recent, are considered when assigning a category to a new document. The simplicity of the implementation is another benefit.

Moreover, this approach can improve the semantic consistency of how documents are organised. For example, for articles about the sport cricket, users might choose different terms as categories (e.g., sport, Sports, games, game). This approach suggests categories that already exist for new documents, which reduces this inconsistency.

However, since most of the processing required to train the system on existing documents takes place at inference time, the computation time for classifying a new document can be significantly higher than with a model trained in advance.

If the Apache Solr classification module classifies a document incorrectly and a user doesn't correct it, that error will reduce the accuracy of categorising new documents, because both human-generated and machine-generated categories are taken into consideration when classifying new documents. To overcome this disadvantage, we can present the machine-generated category to the end user as a suggestion rather than as a set category. This is a way of incorporating user feedback into a machine learning application. If the end user feels that the category is incorrect, they can reject the suggested category for the added document.
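The suggestion workflow described above can be sketched as follows. This is only an illustration under assumed names: the `Document` class, its `category` and `suggested_category` fields, and the helper functions are hypothetical and not part of Solr or Lucene.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    text: str
    category: Optional[str] = None            # confirmed by a user; used for training
    suggested_category: Optional[str] = None  # machine suggestion awaiting review


def suggest(doc, classify):
    """Store the classifier output as a suggestion only, never as the category."""
    doc.suggested_category = classify(doc.text)
    return doc


def review(doc, accepted, correction=None):
    """Apply the user's decision: accept the suggestion, correct it, or drop it."""
    if accepted:
        doc.category = doc.suggested_category
    elif correction is not None:
        doc.category = correction
    doc.suggested_category = None
    return doc


def training_set(docs):
    """Only user-confirmed categories feed the classification of later documents."""
    return [(d.text, d.category) for d in docs if d.category is not None]
```

Keeping the suggestion in a separate field means a rejected or unreviewed suggestion never becomes part of the data used to classify later documents.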

We tested Lucene classification using the 20 Newsgroups dataset, which consists of a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups.

During the evaluation, we found that the KNN algorithm is much faster than the Naive Bayes algorithm at inferring categories for new documents. As the input size increases, the computation time of the Naive Bayes algorithm also increases considerably (by approximately 2X). By comparison, the computation time for KNN increases only slightly with input size.

Further, when it comes to accuracy, KNN is more accurate than the Naive Bayes algorithm. The accuracy of the KNN algorithm is around 80% for the Newsgroups dataset, whereas the accuracy of the Naive Bayes algorithm is around 70%. For KNN, K-fold cross validation can also be performed for different K values to find the K value that gives the best accuracy. As a rule of thumb, in KNN, K is set to the square root of the number of training patterns/samples.
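Both ways of choosing K can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure used in the evaluation: the function names are hypothetical, the toy `knn_predict` ranks neighbours by shared terms, and `folds` (the number of cross-validation folds) is a separate parameter from the neighbour count `k`.

```python
import math
from collections import Counter


def default_k(n_samples):
    """Rule of thumb: K is about the square root of the number of training samples."""
    return max(1, round(math.sqrt(n_samples)))


def knn_predict(query, training, k):
    """Toy KNN: rank training texts by shared terms, then take a majority vote."""
    query_terms = set(query.split())

    def overlap(pair):
        return len(query_terms & set(pair[0].split()))

    nearest = sorted(training, key=overlap, reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_validated_accuracy(samples, k, folds=4):
    """Hold out each fold in turn and score predictions made from the remaining folds."""
    correct = 0
    for fold in range(folds):
        held_out = samples[fold::folds]
        training = [s for i, s in enumerate(samples) if i % folds != fold]
        correct += sum(
            knn_predict(text, training, k) == label for text, label in held_out
        )
    return correct / len(samples)


def best_k(samples, candidates, folds=4):
    """Return the candidate K with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cross_validated_accuracy(samples, k, folds))
```

For a dataset of roughly 20,000 documents, `default_k(20000)` gives 141, which can serve as a starting point for the candidate list passed to `best_k`.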

Accordingly, we decided that the KNN algorithm is more suitable for our solution than the Naive Bayes algorithm.

Implementation

A custom Apache Solr update request processor was implemented to enable document classification. Every document added to Apache Solr goes through the Apache Solr update request processors, so we developed a custom update request processor (a Solr plug-in) to classify each document.

Once the custom Apache Solr update request processor receives a document, the Lucene classifier classifies it using either the Naive Bayes or the KNN algorithm, and the suggested category value is written to a new Apache Solr field for classification.
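The flow through the processor can be sketched as below. This is a language-neutral illustration in Python, not the plug-in itself: a real processor would be written in Java against Solr's update request processor API, and the class name, the `content` and `suggested_category` field names, and the `classify` callable are all assumptions.

```python
class ClassificationProcessor:
    """Hypothetical stand-in for a custom update request processor."""

    def __init__(self, classify, source_field="content",
                 target_field="suggested_category", next_processor=None):
        self.classify = classify            # e.g. a KNN or Naive Bayes classifier
        self.source_field = source_field    # assumed name of the text field
        self.target_field = target_field    # assumed name of the new category field
        self.next_processor = next_processor

    def process_add(self, doc):
        """Classify the incoming document, then pass it along the chain."""
        text = doc.get(self.source_field)
        # Leave the document alone if it has no text or already has a value.
        if text and self.target_field not in doc:
            doc[self.target_field] = self.classify(text)
        if self.next_processor is not None:
            return self.next_processor.process_add(doc)
        return doc
```

Writing the result to a separate field, rather than overwriting an existing category field, keeps the machine output available as a suggestion that a user can accept or reject.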

