Lucene 同义词



Lucene 同义词
在Lucene4.6中通过SynonymFilterFactory实现中文同义词非常方便,只需几行代码和一个同义词词典。这个词典还能在Lucene中实现一定程度的拼写纠错,提升搜索体验。在下面这个例子中我们从磁盘载入一个同义词词典,并且对“其实hankcs似好人”这句话进行stream化以供索引,同时还对其中的拼写错误“似->是”做出纠正。

首先是位于./data/synonyms.txt路径下的同义词词典:

1
2
3
我,俺,hankcs
似,is,are => 是
好人,好心人,热心人
可以看出上面有两种词典格式:

通过,分割的可拓展同义词

比如“我,俺,hankcs”代表着这三个词是同义词,并且任何一个词可以被expand(拓展)为其他三个。如果expand设为false的话,则这三个词都会被统一替换为第一个词,也就是“我”。

通过=>收缩的不可拓展同义词

比如“似,is,are => 是”代表这三个词同义,并且无视expand参数,统一会被替换为“是”
    private static void displayTokens(TokenStream ts) throws IOException
    {
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = ts.addAttribute(OffsetAttribute.class);
        ts.reset();
        while (ts.incrementToken())
        {
            String token = termAttr.toString();
            System.out.print(offsetAttribute.startOffset() + "-" + offsetAttribute.endOffset() + "[" + token + "] ");
        }
        System.out.println();
        ts.end();
        ts.close();
    }
 
    public static void main(String[] args) throws Exception
    {
        String testInput = "其实 hankcs 似 好人";
        Version ver = Version.LUCENE_46;
        Map<String, String> filterArgs = new HashMap<String, String>();
        filterArgs.put("luceneMatchVersion", ver.toString());
        filterArgs.put("synonyms""./data/synonyms.txt");
        filterArgs.put("expand""true");
        SynonymFilterFactory factory = new SynonymFilterFactory(filterArgs);
        factory.inform(new FilesystemResourceLoader());
        WhitespaceAnalyzer whitespaceAnalyzer = new WhitespaceAnalyzer(ver);
        TokenStream ts = factory.create(whitespaceAnalyzer.tokenStream("someField", testInput));
        displayTokens(ts);
    }
输出:
0-2[其实] 3-9[我] 3-9[俺] 3-9[hankcs] 10-11[是] 12-14[好人] 12-14[好心人] 12-14[热心人]
由于 我 俺 hankcs 三个词是同一个意思,所以它们被视为同一个term,并且它们的偏移相同,都是3->9,这个长度取决于原来的词 hankcs 的长度。
Please read full article from Lucene 同义词

No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts