Lucene 4 is Super Convenient for Developing NLP Tools
The Lucene 4.0 classes that I used to develop this system are as follows:
- IndexSearcher, TermQuery, TopDocs
This system calculates the similarity between a keyword and each of its synonym candidates, which are nouns extracted from keywords and their descriptions. If the similarity exceeds a threshold value, the system treats the candidate as a synonym of the keyword and outputs it to a CSV file.
But how do I calculate the similarity between a keyword and its synonym candidate? The system compares the keyword's description Aa with {Ab}, the set of dictionary entry descriptions that are written using the synonym candidate, and uses the similarity of the two as the similarity of the keyword and the candidate.
Thus, I first have to find {Ab}. To do so, I used IndexSearcher, TermQuery, and TopDocs to search the description field for the synonym candidate.
- PriorityQueue
Next, I have to pick out “feature words” from Aa and {Ab} to calculate the similarity of the two. To do so, I select the N most important words to build a feature vector, using the TF*IDF of each word as its degree of importance. See the above SlideShare for details. I use PriorityQueue to select the “N most important words”.
- DocsEnum, TotalHitCountCollector
I used TF*IDF as the weight for extracting the feature words above, and used DocsEnum.freq() to obtain TF. docFreq (the number of articles that include the synonym candidate), which is required to obtain IDF, is calculated by passing a TotalHitCountCollector to the search() method of IndexSearcher.
- Terms, TermsEnum
I use these classes to search the “description” field for synonym candidates.
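The scoring steps described above (TF*IDF weighting, picking the N most important words, and comparing two feature vectors) can be sketched in plain Java without an index. This is only a rough illustration, not the system's actual code. The class and method names are hypothetical, java.util.PriorityQueue stands in for Lucene's own PriorityQueue, the IDF form log(numDocs / docFreq) is one common variant picked as an assumption, and cosine similarity is assumed because the text does not name the similarity measure. In the real system, tf would come from DocsEnum.freq() and docFreq from TotalHitCountCollector.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class FeatureVectorSketch {

    /** TF*IDF weight of one word. The IDF form log(numDocs / docFreq) is an assumption. */
    static double tfIdf(int tf, int docFreq, int numDocs) {
        if (tf <= 0 || docFreq <= 0) {
            return 0.0;
        }
        return tf * Math.log((double) numDocs / docFreq);
    }

    /** Keeps the n highest-weighted words by using a min-heap bounded to n entries. */
    static Map<String, Double> topN(Map<String, Double> weights, int n) {
        Comparator<Map.Entry<String, Double>> byWeight =
                (a, b) -> Double.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(byWeight);
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // drop the entry with the lowest weight so far
            }
        }
        Map<String, Double> top = new HashMap<>();
        for (Map.Entry<String, Double> e : heap) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    /** Euclidean length of a sparse feature vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity of two sparse feature vectors (assumed measure). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical index size

        // Hypothetical (tf, docFreq) pairs for words in description Aa.
        Map<String, Double> aa = new HashMap<>();
        aa.put("index", tfIdf(3, 50, numDocs));
        aa.put("search", tfIdf(2, 80, numDocs));
        aa.put("the", tfIdf(9, 990, numDocs));

        // Hypothetical weights for words in the description set {Ab}.
        Map<String, Double> ab = new HashMap<>();
        ab.put("index", tfIdf(2, 50, numDocs));
        ab.put("query", tfIdf(1, 40, numDocs));
        ab.put("the", tfIdf(7, 990, numDocs));

        double threshold = 0.5; // hypothetical threshold value
        double similarity = cosine(topN(aa, 2), topN(ab, 2));
        System.out.println("similarity = " + similarity);
        System.out.println("synonym? " + (similarity > threshold));
    }
}
```

The bounded min-heap keeps at most N entries at a time, so selecting the top N words does not require sorting every word in the description.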
These are the ways this system uses Lucene 4.0. I believe Lucene will be a great help to other NLP tool developers as well. In a lexical knowledge acquisition task using bootstrapping, for example, you run a cycle (1: pattern extraction, 2: pattern selection, 3: instance extraction, 4: instance selection) to acquire knowledge from a small number of seed instances. I believe that you can replace pattern extraction and instance extraction with simple search tasks if you use Lucene.
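To make the idea of "extraction as search" concrete, here is a toy sketch of one bootstrapping round. Everything in it is an assumption made for illustration: the class and method names are hypothetical, a substring scan over an in-memory list stands in for a Lucene query, a pattern is simply the two words preceding an instance, and the two selection steps (2 and 4) are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class BootstrapSketch {

    /** Stand-in for a Lucene search: returns the sentences that contain the text. */
    static List<String> search(List<String> corpus, String text) {
        List<String> hits = new ArrayList<>();
        for (String sentence : corpus) {
            if (sentence.contains(text)) {
                hits.add(sentence);
            }
        }
        return hits;
    }

    /** Step 1: search for each instance and take the two preceding words as a pattern. */
    static Set<String> extractPatterns(List<String> corpus, Set<String> instances) {
        Set<String> patterns = new LinkedHashSet<>();
        for (String instance : instances) {
            for (String sentence : search(corpus, instance)) {
                String[] w = sentence.split(" ");
                for (int i = 2; i < w.length; i++) {
                    if (w[i].equals(instance)) {
                        patterns.add(w[i - 2] + " " + w[i - 1]);
                    }
                }
            }
        }
        return patterns;
    }

    /** Step 3: search for each pattern and take the word that follows it as an instance. */
    static Set<String> extractInstances(List<String> corpus, Set<String> patterns) {
        Set<String> instances = new LinkedHashSet<>();
        for (String pattern : patterns) {
            for (String sentence : search(corpus, pattern + " ")) {
                String[] w = sentence.split(" ");
                for (int i = 2; i < w.length; i++) {
                    if ((w[i - 2] + " " + w[i - 1]).equals(pattern)) {
                        instances.add(w[i]);
                    }
                }
            }
        }
        return instances;
    }

    public static void main(String[] args) {
        // Tiny hypothetical corpus; a real system would query an index instead.
        List<String> corpus = Arrays.asList(
                "fruits such as apple",
                "fruits such as banana",
                "i ate an apple");
        Set<String> seeds = new LinkedHashSet<>(Arrays.asList("apple"));

        Set<String> patterns = extractPatterns(corpus, seeds);
        Set<String> instances = extractInstances(corpus, patterns);
        System.out.println("patterns:  " + patterns);
        System.out.println("instances: " + instances);
    }
}
```

In a Lucene-backed version, each call to the hypothetical search() helper would become a query against the index, and the selection steps would score the patterns and instances before the next round.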