Lucene Enhancements: Applying Payloads
A payload lets us attach, to a given term, weight information about each of its occurrences. For example, the term egg appears in both document 1 and document 2, but measured by how strongly each document belongs to the foods category, document 1 is more relevant. Before indexing, we can preprocess the text and attach a payload to egg, like so:

- Document 1: egg|0.984 tomato potato bread
- Document 2: egg|0.356 book potato bread
We then index this data as usual, and Lucene's PayloadTermQuery can distinguish the two occurrences of the term egg. Internally, Lucene multiplies the stored payload value (the number after the "|" above) into tf before computing the weight.
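Under this model the payload simply scales the term's tf-based contribution. A back-of-the-envelope sketch (the sqrt(tf) term follows DefaultSimilarity's tf formula; idf and norms are omitted since they are identical for both documents here):

```java
public class PayloadScoreSketch {
    // Simplified score contribution: sqrt(tf) * payload.
    // idf and length norms are left out because they are the same
    // for both documents in this example.
    public static float contribution(int tf, float payload) {
        return (float) Math.sqrt(tf) * payload;
    }

    public static void main(String[] args) {
        float doc1 = contribution(1, 0.984f); // "egg" in document 1
        float doc2 = contribution(1, 0.356f); // "egg" in document 2
        System.out.println(doc1 > doc2);      // document 1 ranks higher
    }
}
```

Without the payload factor both contributions would be equal, which is exactly the tie the payload is meant to break.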
Next we look at a variant: the payload data is stored in an additional Field, so the source documents themselves need no modification. For instance, we can preprocess the documents before indexing, classify them, and compute a payload value expressing how strongly each document belongs to each category.
To make use of the stored payload data in the example above, we proceed as follows:
Step 1: Prepare the data for indexing
For example, add a category Field holding the category information and a content Field holding the text above:
```java
// Document 1
new Field("category", "foods|0.984 shopping|0.503", Field.Store.YES, Field.Index.ANALYZED)
new Field("content", "egg tomato potato bread", Field.Store.YES, Field.Index.ANALYZED)

// Document 2
new Field("category", "foods|0.356 shopping|0.791", Field.Store.YES, Field.Index.ANALYZED)
new Field("content", "egg book potato bread", Field.Store.YES, Field.Index.ANALYZED)
```
Step 2: Implement an Analyzer that parses the payload data
Since the payload information lives in the category Field, with categories separated by spaces and each category separated from its weight by "|", our Analyzer must parse that format. Lucene provides DelimitedPayloadTokenFilter for exactly this delimiter-based case. Our implementation:
```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;

public class PayloadAnalyzer extends Analyzer {
    private PayloadEncoder encoder;

    PayloadAnalyzer(PayloadEncoder encoder) {
        this.encoder = encoder;
    }

    @SuppressWarnings("deprecation")
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader); // split the space-separated categories
        result = new DelimitedPayloadTokenFilter(result, '|', encoder); // then parse the payload after '|'
        return result;
    }
}
```
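Conceptually, the combination of WhitespaceTokenizer and DelimitedPayloadTokenFilter splits the field value on whitespace, then splits each token at the "|" into a term and its payload value. A self-contained sketch of just that parsing step (not Lucene code; the class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DelimitedSplitSketch {
    // Model of the two-stage tokenization: split on whitespace,
    // then split each token at the delimiter into (term, payload value).
    public static Map<String, Float> parse(String field, char delimiter) {
        Map<String, Float> result = new LinkedHashMap<>();
        for (String token : field.split("\\s+")) {
            int pos = token.indexOf(delimiter);
            if (pos >= 0) {
                result.put(token.substring(0, pos),
                           Float.parseFloat(token.substring(pos + 1)));
            } else {
                result.put(token, null); // token carries no payload
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("foods|0.984 shopping|0.503", '|'));
        // {foods=0.984, shopping=0.503}
    }
}
```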
Step 3: Implement a Similarity that scores the payload
Lucene's Similarity class provides a scorePayload method for turning a payload into a score contribution for the document. We override it as follows:
```java
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {
    private static final long serialVersionUID = 1L;

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        return PayloadHelper.decodeFloat(payload, offset); // use the stored float as the payload score
    }
}
```
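FloatEncoder writes the float's IEEE-754 bit pattern as four big-endian bytes, and PayloadHelper.decodeFloat reads them back. A self-contained sketch of that round trip (the class and method names here are illustrative, not Lucene's):

```java
public class PayloadCodec {
    // Encode a float as 4 big-endian bytes (the layout FloatEncoder uses).
    public static byte[] encodeFloat(float value) {
        int bits = Float.floatToIntBits(value);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8),  (byte) bits
        };
    }

    // Decode 4 big-endian bytes back into a float, mirroring PayloadHelper.decodeFloat.
    public static float decodeFloat(byte[] bytes, int offset) {
        int bits = ((bytes[offset] & 0xFF) << 24)
                 | ((bytes[offset + 1] & 0xFF) << 16)
                 | ((bytes[offset + 2] & 0xFF) << 8)
                 |  (bytes[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        byte[] payload = encodeFloat(0.984f);
        System.out.println(decodeFloat(payload, 0)); // prints 0.984
    }
}
```

The round trip is exact because the four bytes carry the float's full bit pattern; no precision is lost in the payload.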
Step 4: Build the index
When building the index we plug in the Analyzer and Similarity implemented above:
```java
public class PayloadIndexing {

    private IndexWriter indexWriter = null;
    private final Analyzer analyzer = new PayloadAnalyzer(new FloatEncoder()); // PayloadAnalyzer with a float PayloadEncoder
    private final Similarity similarity = new PayloadSimilarity(); // our custom PayloadSimilarity
    private IndexWriterConfig config = null;

    public PayloadIndexing(String indexPath) throws CorruptIndexException, LockObtainFailedException, IOException {
        File indexFile = new File(indexPath);
        config = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        config.setOpenMode(OpenMode.CREATE).setSimilarity(similarity); // score with our Similarity
        indexWriter = new IndexWriter(FSDirectory.open(indexFile), config);
    }
}
```
Step 5: Query
At query time we construct PayloadTermQuery instances to search:
```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.NIOFSDirectory;

public class PayloadSearching {

    private IndexReader indexReader;
    private IndexSearcher searcher;

    public PayloadSearching(String indexPath) throws CorruptIndexException, IOException {
        indexReader = IndexReader.open(NIOFSDirectory.open(new File(indexPath)), true);
        searcher = new IndexSearcher(indexReader);
        searcher.setSimilarity(new PayloadSimilarity()); // use our custom PayloadSimilarity
    }

    public ScoreDoc[] search(String qsr) throws ParseException, IOException {
        int hitsPerPage = 10;
        BooleanQuery bq = new BooleanQuery();
        for (String q : qsr.split(" ")) {
            bq.add(createPayloadTermQuery(q), Occur.MUST);
        }
        TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, true);
        searcher.search(bq, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc; // document id
            Explanation explanation = searcher.explain(bq, docId);
            System.out.println(explanation.toString());
        }
        return hits;
    }

    public void display(ScoreDoc[] hits, int start, int end) throws CorruptIndexException, IOException {
        end = Math.min(hits.length, end);
        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(hits[i].doc);
            int docId = hits[i].doc;     // document id
            float score = hits[i].score; // document score
            System.out.println(docId + "\t" + score + "\t" + doc + "\t");
        }
    }

    public void close() throws IOException {
        searcher.close();
        indexReader.close();
    }

    // Build a PayloadTermQuery from "field:term" or "field:term^boost"
    private PayloadTermQuery createPayloadTermQuery(String item) {
        PayloadTermQuery ptq = null;
        if (item.indexOf("^") != -1) {
            String[] a = item.split("\\^");
            String field = a[0].split(":")[0];
            String token = a[0].split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
            ptq.setBoost(Float.parseFloat(a[1].trim()));
        } else {
            String field = item.split(":")[0];
            String token = item.split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
        }
        return ptq;
    }

    public static void main(String[] args) throws ParseException, IOException {
        int start = 0, end = 10;
        // String queries = "category:foods^123.0 content:bread^987.0";
        String queries = "category:foods content:egg";
        PayloadSearching payloadSearcher = new PayloadSearching("E:\\index");
        payloadSearcher.display(payloadSearcher.search(queries), start, end);
        payloadSearcher.close();
    }
}
```
In the explain output we can see that, apart from the payload value being multiplied into tf when the category weight is computed, everything else is identical. In other words, the payload contributed extra score to the expected document (ID=0), moving it up in the ranking. Without payloads the two documents would receive identical scores (you can verify this by setting their payload values equal and re-running the search).