Lucene Enhancements: Applying Payloads
A payload lets us attach, to a given term, weight information about each of its occurrences. For example, the term egg appears in both document 1 and document 2, but measured by how strongly each document belongs to the foods category, document 1 is more relevant. Before indexing, we can preprocess the text and attach a payload to egg, like so:

- Document 1: egg|0.984 tomato potato bread
- Document 2: egg|0.356 book potato bread
We then index this data as usual, and Lucene's PayloadTermQuery can distinguish the two occurrences of the term egg. Internally, Lucene multiplies the stored payload value (the number after the "|" above) into tf before computing the weight.
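Under this model the payload simply scales the term's tf-based contribution. A back-of-the-envelope sketch (the sqrt(tf) term follows DefaultSimilarity's tf formula; idf and norms are omitted since they are identical for both documents here):

```java
public class PayloadScoreSketch {
    // Simplified score contribution: sqrt(tf) * payload.
    // idf and length norms are left out because they are the same
    // for both documents in this example.
    public static float contribution(int tf, float payload) {
        return (float) Math.sqrt(tf) * payload;
    }

    public static void main(String[] args) {
        float doc1 = contribution(1, 0.984f); // "egg" in document 1
        float doc2 = contribution(1, 0.356f); // "egg" in document 2
        System.out.println(doc1 > doc2);      // document 1 ranks higher
    }
}
```

Without the payload factor both contributions would be equal, which is exactly the tie the payload is meant to break.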
Next we look at a variant: the payload data is stored in an additional Field, so the source documents themselves need no modification. For instance, we can preprocess the documents before indexing, classify them, and compute a payload value expressing how strongly each document belongs to each category.
To make use of the stored payload data in the example above, we proceed as follows:
Step 1: Prepare the data for indexing
For example, add a category Field holding the category information and a content Field holding the text above:
```java
// Document 1
new Field("category", "foods|0.984 shopping|0.503", Field.Store.YES, Field.Index.ANALYZED)
new Field("content", "egg tomato potato bread", Field.Store.YES, Field.Index.ANALYZED)

// Document 2
new Field("category", "foods|0.356 shopping|0.791", Field.Store.YES, Field.Index.ANALYZED)
new Field("content", "egg book potato bread", Field.Store.YES, Field.Index.ANALYZED)
```
Step 2: Implement an Analyzer that parses the payload data
Since the payload information lives in the category Field, with categories separated by spaces and each category separated from its weight by "|", our Analyzer must parse that format. Lucene provides DelimitedPayloadTokenFilter for exactly this delimiter-based case. Our implementation:
```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;

public class PayloadAnalyzer extends Analyzer {
    private PayloadEncoder encoder;

    PayloadAnalyzer(PayloadEncoder encoder) {
        this.encoder = encoder;
    }

    @SuppressWarnings("deprecation")
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader); // split the space-separated categories
        result = new DelimitedPayloadTokenFilter(result, '|', encoder); // then parse the payload after '|'
        return result;
    }
}
```
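Conceptually, the combination of WhitespaceTokenizer and DelimitedPayloadTokenFilter splits the field value on whitespace, then splits each token at the "|" into a term and its payload value. A self-contained sketch of just that parsing step (not Lucene code; the class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DelimitedSplitSketch {
    // Model of the two-stage tokenization: split on whitespace,
    // then split each token at the delimiter into (term, payload value).
    public static Map<String, Float> parse(String field, char delimiter) {
        Map<String, Float> result = new LinkedHashMap<>();
        for (String token : field.split("\\s+")) {
            int pos = token.indexOf(delimiter);
            if (pos >= 0) {
                result.put(token.substring(0, pos),
                           Float.parseFloat(token.substring(pos + 1)));
            } else {
                result.put(token, null); // token carries no payload
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("foods|0.984 shopping|0.503", '|'));
        // {foods=0.984, shopping=0.503}
    }
}
```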
Step 3: Implement a Similarity that scores the payload
Lucene's Similarity class provides a scorePayload method for turning a payload into a score contribution for the document. We override it as follows:
```java
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {
    private static final long serialVersionUID = 1L;

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        return PayloadHelper.decodeFloat(payload, offset); // use the stored float as the payload score
    }
}
```
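FloatEncoder writes the float's IEEE-754 bit pattern as four big-endian bytes, and PayloadHelper.decodeFloat reads them back. A self-contained sketch of that round trip (the class and method names here are illustrative, not Lucene's):

```java
public class PayloadCodec {
    // Encode a float as 4 big-endian bytes (the layout FloatEncoder uses).
    public static byte[] encodeFloat(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8),  (byte) bits
        };
    }

    // Decode 4 big-endian bytes back into a float, mirroring PayloadHelper.decodeFloat.
    public static float decodeFloat(byte[] bytes, int offset) {
        int bits = ((bytes[offset] & 0xFF) << 24)
                 | ((bytes[offset + 1] & 0xFF) << 16)
                 | ((bytes[offset + 2] & 0xFF) << 8)
                 |  (bytes[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] payload = encodeFloat(0.984f);
        System.out.println(decodeFloat(payload, 0)); // prints 0.984
    }
}
```

The round trip is exact because the four bytes carry the float's full bit pattern; no precision is lost in the payload.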
Step 4: Build the index
When building the index we plug in the Analyzer and Similarity implemented above:
```java
public class PayloadIndexing {

    private IndexWriter indexWriter = null;
    private final Analyzer analyzer = new PayloadAnalyzer(new FloatEncoder()); // PayloadAnalyzer with a float PayloadEncoder
    private final Similarity similarity = new PayloadSimilarity(); // our custom PayloadSimilarity
    private IndexWriterConfig config = null;

    public PayloadIndexing(String indexPath) throws CorruptIndexException, LockObtainFailedException, IOException {
        File indexFile = new File(indexPath);
        config = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        config.setOpenMode(OpenMode.CREATE).setSimilarity(similarity); // score with our Similarity
        indexWriter = new IndexWriter(FSDirectory.open(indexFile), config);
    }
}
```
Step 5: Query
At query time we construct PayloadTermQuery instances to search:
```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.NIOFSDirectory;

public class PayloadSearching {

    private IndexReader indexReader;
    private IndexSearcher searcher;

    public PayloadSearching(String indexPath) throws CorruptIndexException, IOException {
        indexReader = IndexReader.open(NIOFSDirectory.open(new File(indexPath)), true);
        searcher = new IndexSearcher(indexReader);
        searcher.setSimilarity(new PayloadSimilarity()); // use our custom PayloadSimilarity
    }

    public ScoreDoc[] search(String qsr) throws ParseException, IOException {
        int hitsPerPage = 10;
        BooleanQuery bq = new BooleanQuery();
        for (String q : qsr.split(" ")) {
            bq.add(createPayloadTermQuery(q), Occur.MUST);
        }
        TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, true);
        searcher.search(bq, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc; // document id
            Explanation explanation = searcher.explain(bq, docId);
            System.out.println(explanation.toString());
        }
        return hits;
    }

    public void display(ScoreDoc[] hits, int start, int end) throws CorruptIndexException, IOException {
        end = Math.min(hits.length, end);
        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(hits[i].doc);
            int docId = hits[i].doc;     // document id
            float score = hits[i].score; // document score
            System.out.println(docId + "\t" + score + "\t" + doc + "\t");
        }
    }

    public void close() throws IOException {
        searcher.close();
        indexReader.close();
    }

    // Build a PayloadTermQuery from "field:term" or "field:term^boost"
    private PayloadTermQuery createPayloadTermQuery(String item) {
        PayloadTermQuery ptq = null;
        if (item.indexOf("^") != -1) {
            String[] a = item.split("\\^");
            String field = a[0].split(":")[0];
            String token = a[0].split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
            ptq.setBoost(Float.parseFloat(a[1].trim()));
        } else {
            String field = item.split(":")[0];
            String token = item.split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
        }
        return ptq;
    }

    public static void main(String[] args) throws ParseException, IOException {
        int start = 0, end = 10;
        // String queries = "category:foods^123.0 content:bread^987.0";
        String queries = "category:foods content:egg";
        PayloadSearching payloadSearcher = new PayloadSearching("E:\\index");
        payloadSearcher.display(payloadSearcher.search(queries), start, end);
        payloadSearcher.close();
    }
}
```
In the explain output we can see that, apart from the payload value being multiplied into tf when the category weight is computed, everything else is identical. In other words, the payload contributed extra score to the expected document (ID=0), moving it up in the ranking. Without payloads the two documents would receive identical scores (you can verify this by setting their payload values equal and re-running the search).