All About Programming: Text Analytics in Enterprise Search

Text Analytics in Enterprise Search - Daniel Ling
Document Categorization

 Assign a label to a document, piece of content, or data record.
 A label can denote a category or a sentiment.
 A threshold value sets how well a document must match a category before it is labeled.
 Statistics and “knowledge” from previously labeled examples can be used.

 Mallet: the process of setup and training:
 Train the component, Mallet (Machine Learning for Language Toolkit).
• Alternative components include a Lucene (TF-IDF) index (MoreLikeThis), OpenNLP, Textcat, and Classifier4j.
 Run new documents against the model or index built from the training documents.
 Train from an interface, ad hoc, or by indexing pre-categorized documents.
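
The train-then-match flow with a threshold can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a plain TF-IDF nearest-centroid classifier written from scratch, not Mallet or any of the components listed above, and the names `CentroidCategorizer`, `train`, `categorize`, and the default threshold of 0.2 are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


class CentroidCategorizer:
    """Hypothetical TF-IDF nearest-centroid categorizer (not Mallet's API)."""

    def __init__(self, threshold=0.2):
        self.threshold = threshold  # minimum cosine similarity needed to assign a label
        self.idf = {}
        self.centroids = {}

    def train(self, labeled_docs):
        """labeled_docs: list of (text, label) pairs, i.e. pre-categorized examples."""
        tokenized = [(tokenize(text), label) for text, label in labeled_docs]
        n_docs = len(tokenized)
        doc_freq = Counter()
        for tokens, _ in tokenized:
            doc_freq.update(set(tokens))
        # Smoothed IDF: terms found in every document keep a small positive weight.
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        sums = {}
        for tokens, label in tokenized:
            centroid = sums.setdefault(label, Counter())
            for term, weight in self._vectorize(tokens).items():
                centroid[term] += weight
        self.centroids = {label: self._normalize(vec) for label, vec in sums.items()}

    def _vectorize(self, tokens):
        # Terms never seen during training are ignored.
        tf = Counter(tokens)
        vec = {t: count * self.idf[t] for t, count in tf.items() if t in self.idf}
        return self._normalize(vec)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def categorize(self, text):
        """Return (label, score); label is None when no category reaches the threshold."""
        vec = self._vectorize(tokenize(text))
        best_label, best_score = None, 0.0
        for label, centroid in self.centroids.items():
            score = sum(w * centroid.get(t, 0.0) for t, w in vec.items())
            if score > best_score:
                best_label, best_score = label, score
        if best_score < self.threshold:
            return None, best_score
        return best_label, best_score


if __name__ == "__main__":
    categorizer = CentroidCategorizer(threshold=0.2)
    categorizer.train([
        ("the stock market fell as investors sold shares", "finance"),
        ("the team won the football match", "sports"),
    ])
    print(categorizer.categorize("investors sold their shares"))
```

Returning `None` below the threshold leaves a document unlabeled instead of forcing it into the closest category, which is the role the threshold plays in the list above.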

Document Summarization

 Summarize a document at index time or on demand.
 Leverage the knowledge and term statistics of both the document and the index.
 Pick the “most important” sentences based on those statistics and display them.
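
One way to read "most important" is to score each sentence by the weight of its terms, combining term frequency in the document with how rare each term is in the index. The sketch below assumes exactly that; the slides do not specify the scoring formula, and the function `summarize` and its parameters `doc_freq` and `n_docs` are hypothetical stand-ins for statistics an index would supply.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, doc_freq, n_docs, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order.

    doc_freq: term -> number of indexed documents containing it (index statistics)
    n_docs:   total number of documents in the index
    """
    term_freq = Counter(tokenize(text))  # document-level statistics

    def weight(term):
        # Frequent in this document and rare in the index means a high weight.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        return term_freq[term] * idf

    scored = []
    for position, sentence in enumerate(split_sentences(text)):
        distinct = set(tokenize(sentence))
        if not distinct:
            continue
        # Average over distinct terms so length alone does not win.
        score = sum(weight(t) for t in distinct) / len(distinct)
        scored.append((score, position, sentence))

    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

Computing this at index time would store the result alongside the document, while the on-demand variant would run the same scoring per request; the sketch itself is agnostic about which is used.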

Example Solution: Document Summarization

 A custom RequestHandler receives the document ID and the field to summarize.
 A custom Search Component selects the top sentences.
 The selected subset of sentences is sent back in a field.
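
The division of labor described above (a handler that takes a document ID and field name, and a component that chooses the sentences) can be mimicked outside any search server. The following is a language-neutral sketch in Python, not Solr's Java API: `handle_summary_request`, the `store` dictionary, the `sentences` parameter, and the `<field>_summary` response key are all hypothetical, and `first_sentences` is a placeholder selector that simply returns the leading sentences.

```python
import re


def first_sentences(text, max_sentences):
    """Placeholder selector: leading sentences, not the statistically best ones."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return sentences[:max_sentences]


def handle_summary_request(store, params, select=first_sentences):
    """Hypothetical handler: read one stored field, return selected sentences in a new field."""
    doc = store.get(params["id"])
    if doc is None:
        return {"status": 404, "error": "unknown document id"}
    field = params["field"]
    if field not in doc:
        return {"status": 400, "error": "document has no field " + field}
    # The selection step is pluggable, mirroring the separate search component.
    sentences = select(doc[field], int(params.get("sentences", 2)))
    return {"status": 200, "id": params["id"], field + "_summary": sentences}


if __name__ == "__main__":
    store = {"42": {"body": "First point. Second point. Third point."}}
    print(handle_summary_request(store, {"id": "42", "field": "body"}))
```

Passing the selector as an argument keeps the request plumbing independent of the scoring method, so a statistics-based selector could be swapped in without changing the handler.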

Please read the full article: Text Analytics in Enterprise Search - Daniel Ling
