Text Analytics in Enterprise Search - Daniel Ling



Document Categorization

• Assign a label to a document (or to other content or data).
• Labels can describe a category or a sentiment.
• A threshold value must be met for a category match before the label is applied.
• Statistics and “knowledge” from previous examples can be used.


• Setting up and training Mallet:
  • Train the component, Mallet (Machine Learning for Language Toolkit).
  • Alternative components include a Lucene (TF-IDF) index (MoreLikeThis), OpenNLP, Textcat, and Classifier4j.
• Run new documents against the model/index of trained documents.
• Train from an interface, ad hoc, or from pre-categorized documents in the index.
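The train-then-classify flow with a matching threshold can be sketched in a few lines. This is a minimal illustration in plain Python, not Mallet's API: the `TinyNaiveBayes` class, the crude tokenizer, the training sentences, and the 0.7 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.term_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def probabilities(self, text):
        """Return a dict of label -> posterior probability for the text."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.doc_counts:
            total_terms = sum(self.term_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                count = self.term_counts[label][token] + 1  # add-one smoothing
                score += math.log(count / (total_terms + len(self.vocabulary)))
            log_scores[label] = score
        # Normalize log scores into probabilities (subtract the peak for stability).
        peak = max(log_scores.values())
        exp_scores = {l: math.exp(s - peak) for l, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {l: v / norm for l, v in exp_scores.items()}

    def classify(self, text, threshold=0.7):
        """Return the best label, or None if it does not clear the threshold."""
        probs = self.probabilities(text)
        label, confidence = max(probs.items(), key=lambda item: item[1])
        return label if confidence >= threshold else None


# Train on a few pre-categorized examples, then run new documents against the model.
model = TinyNaiveBayes()
model.train("stock market shares rise on earnings", "finance")
model.train("bank interest rates and market outlook", "finance")
model.train("team wins match with late goal", "sports")
model.train("player scores goal in final match", "sports")

confident = model.classify("market shares and interest rates")
uncertain = model.classify("the weather today")
```

Here `confident` is `"finance"`, while `uncertain` is `None`: none of its terms were seen in training, so both categories score 0.5 and neither clears the threshold. Leaving such documents unlabeled is the point of the threshold.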

Document Summarization

Summarize a document, at index time or on demand.
• Leverage the knowledge and term statistics of the document and the index.
• Pick the “most important” sentences based on those statistics and display them.
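One way to read "most important sentences based on the statistics" is to weight each term by its frequency in the document and its rarity across the index, then keep the sentences with the highest average weight. The sketch below assumes that TF-IDF-style scoring; the slides do not say which statistics are actually used, and `summarize`, `split_sentences`, and the list standing in for the index are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(document, index_documents, max_sentences=2):
    """Return the highest-scoring sentences of `document`, in original order."""
    # Document frequency of each term across the (hypothetical) index.
    doc_freq = Counter()
    for other in index_documents:
        doc_freq.update(set(tokenize(other)))
    num_docs = len(index_documents)

    # Term frequency within the document being summarized.
    term_freq = Counter(tokenize(document))

    def weight(term):
        # Smoothed inverse document frequency times in-document frequency.
        idf = math.log((1 + num_docs) / (1 + doc_freq[term])) + 1
        return term_freq[term] * idf

    scored = []
    for position, sentence in enumerate(split_sentences(document)):
        tokens = tokenize(sentence)
        if not tokens:
            continue
        # Average weight, so long sentences are not favored automatically.
        score = sum(weight(t) for t in tokens) / len(tokens)
        scored.append((score, position, sentence))

    top = sorted(scored, key=lambda item: item[0], reverse=True)[:max_sentences]
    # Display the selected sentences in the order they appear in the document.
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

Calling `summarize(text, all_indexed_texts, max_sentences=2)` returns at most two sentences from `text`. The same function could run at index time (storing the result) or on demand.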

Example Solution: Document Summarization

• A custom RequestHandler receives the document ID and the field to summarize.
• A custom Search Component selects the top sentences.
• The selected subset of sentences is sent back in a field.
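The request flow above can be mimicked outside Solr to show its shape. The sketch below is plain Python, not Solr's RequestHandler or SearchComponent API. The in-memory `DOCUMENT_STORE`, the parameter names `id`, `field`, and `rows`, the response field `summary`, and the simple term-frequency ranking are all assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for the index: document ID -> stored fields.
DOCUMENT_STORE = {
    "doc-1": {
        "title": "Search notes",
        "body": "Solr stores fields. Solr can summarize stored fields on demand. "
                "Lunch was good.",
    },
}


def select_top_sentences(text, max_sentences):
    """Stand-in for the search component: rank sentences by mean term frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    term_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(term_freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return [sentences[i] for i in keep]


def handle_summary_request(params):
    """Stand-in for the request handler: takes `id` and `field`, returns a dict."""
    doc = DOCUMENT_STORE.get(params["id"])
    if doc is None or params["field"] not in doc:
        return {"status": 404, "error": "unknown document or field"}
    rows = int(params.get("rows", 1))
    summary = select_top_sentences(doc[params["field"]], rows)
    # The selected sentences travel back in a single response field.
    return {"status": 200, "id": params["id"], "summary": summary}


response = handle_summary_request({"id": "doc-1", "field": "body", "rows": 1})
missing = handle_summary_request({"id": "doc-9", "field": "body"})
```

With the sample store, `response["summary"]` is `["Solr stores fields."]`, and `missing` carries status 404. Keeping the sentence selection in its own function mirrors the handler/component split described above, so the ranking can be swapped without touching the request handling.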

Please read full article from Text Analytics in Enterprise Search - Daniel Ling
