Lucene 4 is Super Convenient for Developing NLP Tools



The Lucene 4.0 classes that I used to develop this system are as follows:
  • IndexSearcher, TermQuery, TopDocs
    This system calculates a similarity for each synonym candidate; the candidates are nouns extracted from keywords and their descriptions. If the similarity exceeds a threshold value, the system judges the candidate to be a synonym of the keyword and outputs it to a CSV file.
    But how do I calculate the similarity between a keyword and its synonym candidate? The system compares the keyword's description Aa with the set of dictionary entry descriptions {Ab} that are written using the synonym candidate, and takes the similarity of the two as the result.
    Thus, I first have to find {Ab}. This is where I used classes such as IndexSearcher, TermQuery, and TopDocs to search the description field with the synonym candidate.
  • PriorityQueue
    Next, I have to pick out “feature words” from Aa and {Ab} to calculate the similarity of the two. To do so, I select the N most important words to build a feature vector, using the TF*IDF of each word as its degree of importance. See the above SlideShare for the details. I use PriorityQueue to select the “N most important words”.
  • DocsEnum, TotalHitCountCollector
    I used TF*IDF as the weight for extracting the feature words above, and used DocsEnum.freq() to obtain TF. docFreq (the number of articles that include the synonym candidate), which is required to compute IDF, is calculated by passing a TotalHitCountCollector to the search() method of IndexSearcher.
  • Terms, TermsEnum
    I use these classes to search the “description” field for synonym candidates.
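The similarity steps described in the list above (TF*IDF weighting, picking the N most important words, and comparing the two feature vectors) can be sketched without an index. This is only a minimal illustration under several assumptions: it uses java.util.PriorityQueue as a stand-in for Lucene's own PriorityQueue, plain maps instead of values read through DocsEnum and TotalHitCountCollector, a log-scaled IDF, and cosine similarity as the comparison measure. The article does not state which IDF form or similarity measure the system uses, and the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

final class SynonymSimilaritySketch {

    /** TF*IDF weight of one word. The log-scaled IDF is an assumption, not taken from the article. */
    static double tfIdf(int tf, int docFreq, int numDocs) {
        return tf * Math.log((double) numDocs / docFreq);
    }

    /** Keeps the n highest-weighted words by maintaining a min-heap of at most n entries. */
    static Map<String, Double> topN(Map<String, Double> weights, int n) {
        PriorityQueue<Map.Entry<String, Double>> heap =
                new PriorityQueue<>((a, b) -> Double.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // drop the least important word seen so far
            }
        }
        Map<String, Double> top = new HashMap<>();
        for (Map.Entry<String, Double> e : heap) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    /** Cosine similarity between two sparse feature vectors (assumed measure). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** The candidate counts as a synonym when the similarity exceeds the threshold. */
    static boolean isSynonym(Map<String, Double> aa, Map<String, Double> ab,
                             int n, double threshold) {
        return cosine(topN(aa, n), topN(ab, n)) > threshold;
    }
}
```

Here the two maps passed to isSynonym would hold the TF*IDF weights of the words in Aa and in {Ab}; in the real system those numbers come from the Lucene index rather than from hand-built maps.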
These are examples of how this system uses Lucene 4.0. I believe Lucene will be a great help for NLP tool developers as well. For a lexical knowledge acquisition task using bootstrapping, for example, I can use a cycle (1: pattern extraction, 2: pattern selection, 3: instance extraction, 4: instance selection) to obtain knowledge from a small number of seed instances. I believe that you can replace pattern extraction and instance extraction with simple search tasks if you use Lucene for them.
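As a rough illustration of that four-step cycle, the sketch below runs one iteration over an in-memory list of tokenized sentences. Everything in it is an assumption made for the example: a "pattern" is simply the two tokens to the left of an instance, pattern selection is a minimum count of distinct seeds, and instance selection just excludes the seeds themselves. The two loops over the sentences (steps 1 and 3) are the parts that would become searches against a Lucene index; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class BootstrapSketch {

    /** Runs one bootstrapping cycle and returns the newly found instances. */
    static Set<String> iterate(List<String[]> sentences, Set<String> seeds, int minSupport) {
        // 1. Pattern extraction: collect the left context of every seed occurrence.
        Map<String, Set<String>> patternToSeeds = new HashMap<>();
        for (String[] s : sentences) {
            for (int i = 2; i < s.length; i++) {
                if (seeds.contains(s[i])) {
                    String pattern = s[i - 2] + " " + s[i - 1];
                    patternToSeeds.computeIfAbsent(pattern, k -> new HashSet<>()).add(s[i]);
                }
            }
        }

        // 2. Pattern selection: keep patterns seen with at least minSupport distinct seeds.
        Set<String> patterns = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : patternToSeeds.entrySet()) {
            if (e.getValue().size() >= minSupport) {
                patterns.add(e.getKey());
            }
        }

        // 3. Instance extraction: a token that follows a selected pattern is a candidate.
        // 4. Instance selection: simplified here to "not already a seed".
        Set<String> found = new TreeSet<>();
        for (String[] s : sentences) {
            for (int i = 2; i < s.length; i++) {
                if (patterns.contains(s[i - 2] + " " + s[i - 1]) && !seeds.contains(s[i])) {
                    found.add(s[i]);
                }
            }
        }
        return found;
    }
}
```

Feeding the returned instances back in as additional seeds would start the next cycle.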
Please read full article from Lucene 4 is Super Convenient for Developing NLP Tools
