Lucene4.3进阶开发之纯阳无极(十九) - 三劫散仙 - ITeye技术网站



Lucene4.3进阶开发之纯阳无极(十九) - 三劫散仙 - ITeye技术网站

那么首先,探讨下分词器的词形还原和词干提取的对搜索的意义?在这之前,先看下两者的概念: (stemming)是抽取词的词干或词根形式(不一定能够表达完整语义)。词形还原和词干提取是词形规范化的两类 句子: i have two cats 分词器如果什么都没有做: 本篇,散仙,会参考源码分析一下,关于德语分词中中如何做的词干提取,先看下德语的分词声明: List list=new ArrayList(); list.add("player");//这里面的词,不会被做词干抽取,词形还原 CharArraySet ar=new CharArraySet(Version.LUCENE_43,list , true); //分词器的第二个参数是禁用词参数,第三个参数是排除不做词形转换,或单复数的词 GermanAnalyzer sa=new GermanAnalyzer(Version.LUCENE_43,null,ar); 接着,我们具体看下,在德语的分词器中,都经过了哪几部分的过滤处理: OK,我们从源码中得知,在Lucene4.x中对德语的分词也做了向前和向后兼容,现在我们主要关注在lucene4.x之后的版本如何的词形转换,下面分别看下      result = new GermanNormalizationFilter(result);       result = new GermanLightStemFilter(result); 这两个类的功能: package org.apache.lucene.analysis.de; /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License");

Read full article from Lucene4.3进阶开发之纯阳无极(十九) - 三劫散仙 - ITeye技术网站


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts