All About Programming: Lucene 4.3 Advanced Development: 纯阳无极 (Part 19) - 三劫散仙


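So first, what do lemmatization and stemming in an analyzer mean for search? Before answering, look at the two concepts: stemming extracts the stem or root form of a word (the result does not necessarily carry the full meaning). Lemmatization and stemming are two kinds of word-form normalization.

Sentence: i have two cats

If the analyzer does nothing:

In this post, I (散仙) will refer to the source code to analyze how stemming is done in the German analyzer. First, here is how the German analyzer is declared:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.de.GermanAnalyzer;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.util.Version;

    List<String> list = new ArrayList<String>();
    list.add("player"); // words in this list are not stemmed or lemmatized
    CharArraySet ar = new CharArraySet(Version.LUCENE_43, list, true);
    // The analyzer's second argument is the stop-word set; the third is the set of words excluded from word-form conversion and singular/plural handling
    GermanAnalyzer sa = new GermanAnalyzer(Version.LUCENE_43, null, ar);

Next, let's look at which filtering steps the German analyzer applies:

OK, from the source we learn that in Lucene 4.x the German analyzer also maintains forward and backward compatibility. Here we focus on how word forms are converted in versions from Lucene 4.x onward. Let's look at what each of these two classes does:

    result = new GermanNormalizationFilter(result);
    result = new GermanLightStemFilter(result);

    package org.apache.lucene.analysis.de;
    /*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements. See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License");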
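To make the idea of a stem exclusion set concrete, here is a minimal, dependency-free sketch of a normalize-then-stem pipeline. It is a toy and not Lucene's implementation: the class `ToyStemDemo`, its methods, and the suffix list are all hypothetical, and the rules are far cruder than what `GermanNormalizationFilter` and `GermanLightStemFilter` actually do. It also assumes that excluded words skip only the stemming step, which is a simplification chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: NOT Lucene's GermanNormalizationFilter or GermanLightStemFilter.
final class ToyStemDemo {

    // Hypothetical rule: fold a-umlaut, o-umlaut, u-umlaut to a, o, u and sharp s to "ss".
    static String normalize(String token) {
        return token.replace('\u00e4', 'a')
                    .replace('\u00f6', 'o')
                    .replace('\u00fc', 'u')
                    .replace("\u00df", "ss");
    }

    // Hypothetical rule: strip the first matching suffix, keeping at least 3 characters.
    static String lightStem(String token) {
        String[] suffixes = {"ern", "en", "er", "es", "e", "n", "s"};
        for (String suffix : suffixes) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Lowercase, normalize, then stem every whitespace-separated token
    // unless the lowercased token is in the exclusion set.
    static List<String> analyze(String text, Set<String> stemExclusions) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT);
            String normalized = normalize(token);
            out.add(stemExclusions.contains(token) ? normalized : lightStem(normalized));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> exclusions = new HashSet<>(Arrays.asList("player"));
        // "H\u00e4user" is "Hauser" written with a-umlaut.
        System.out.println(analyze("player H\u00e4user Katzen", exclusions));
        // prints [player, haus, katz]
    }
}
```

Without the exclusion set, the toy stemmer would reduce "player" to "play". Keeping it in the set preserves the original token, which is the effect the third constructor argument of `GermanAnalyzer` is described as having above.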

Read full article from Lucene 4.3 Advanced Development: 纯阳无极 (Part 19) - 三劫散仙 - ITeye技术网站
