[SOLR-3975] Document Summarization toolkit, using LSA techniques - ASF JIRA



This package analyzes how words are used across the sentences of a document and ranks the most important sentences and words. The general task is called "document summarization" and is a popular research topic in text analysis.

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example instance.
2) Download the Reuters-21578 article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script, passing the directory from step 3 and the URL of the Solr core:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look at the large gray box marked 'Document Summary'. This has a table of statistics about the analysis, the three most important sentences, and several of the most important words in the documents. The sentences have the important words in italics.

The code is packaged as a search component and as an analysis handler. The /browse demo uses the search component, and you can also post raw text to http://localhost:8983/solr/collection1/analysis/summary. Here is a sample command:

curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" --data-binary @$FILE -H 'Content-type:application/xml'  
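
The same request can also be sent from a script. The following is only a minimal Python sketch: it assumes the handler URL and the parameters shown in the curl command above, and the helper names build_summary_request and summarize_file are hypothetical, not part of the patch.

```python
import urllib.parse
import urllib.request

# Assumed base URL, copied from the curl example above.
SOLR_SUMMARY_URL = "http://localhost:8983/solr/analysis/summary"


def build_summary_request(path, data, base_url=SOLR_SUMMARY_URL):
    """Build a POST request mirroring the curl command (hypothetical helper)."""
    params = urllib.parse.urlencode({
        "indent": "true",
        "echoParams": "all",
        "file": path,
        "wt": "xml",
    })
    return urllib.request.Request(
        base_url + "?" + params,
        data=data,
        headers={"Content-type": "application/xml"},
        method="POST",
    )


def summarize_file(path, base_url=SOLR_SUMMARY_URL):
    """Post the raw file contents to the handler and return the response text."""
    with open(path, "rb") as f:
        request = build_summary_request(path, f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling summarize_file("article.txt") against a running solr/example instance should return the handler's XML response as a string; the file name here is only a placeholder.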

This is an implementation of LSA-based document summarization. A short explanation and a long evaluation are on my blog, Uncle Lance's Ultra Whiz Bang, starting here: http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html
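
To give a rough idea of what LSA-based ranking involves, here is a small Python sketch. It is not the code in the patch. It assumes a raw term-count matrix, a naive sentence splitter, and only the single strongest topic, which it approximates with power iteration in place of a full singular value decomposition. All function names are made up for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def term_sentence_matrix(sentences):
    """Rows are terms, columns are sentences, entries are raw term counts."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = [[0.0] * len(sentences) for _ in vocab]
    for j, tokens in enumerate(tokenized):
        for t in tokens:
            matrix[index[t]][j] += 1.0
    return vocab, matrix


def dominant_topic(matrix, iterations=100):
    """Power iteration on A^T A, approximating the first singular vector pair."""
    n_terms, n_sents = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_sents)] * n_sents
    for _ in range(iterations):
        # u = A v (term weights), then w = A^T u (sentence weights)
        u = [sum(row[j] * v[j] for j in range(n_sents)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_terms))
             for j in range(n_sents)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_sents)) for row in matrix]
    return u, v


def summarize(text, n_sentences=1, n_words=3):
    """Return the top sentences (in document order) and the top words."""
    sentences = split_sentences(text)
    if not sentences:
        return [], []
    vocab, matrix = term_sentence_matrix(sentences)
    if not vocab:
        return [], []
    term_weights, sentence_weights = dominant_topic(matrix)
    top_s = sorted(range(len(sentences)),
                   key=lambda j: -abs(sentence_weights[j]))[:n_sentences]
    top_t = sorted(range(len(vocab)),
                   key=lambda i: -abs(term_weights[i]))[:n_words]
    return [sentences[j] for j in sorted(top_s)], [vocab[i] for i in top_t]
```

A fuller treatment would normally weight the counts (for example with tf-idf) and use several singular vectors, so treat this as an illustration of the idea only.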



