All About Programming: The SpanQuery

SpanQuerys allow for nested, positional restrictions when matching documents in Lucene. SpanQuerys are much like PhraseQuerys or MultiPhraseQuerys in that they all restrict term matches by position, but SpanQuerys can be much more expressive.

The basic SpanQuery units are the SpanTermQuery and the SpanNearQuery.

A SpanTermQuery is the most basic SpanQuery. It simply lets you specify a field, term, and boost by passing in a Term, just like a TermQuery. SpanTermQuery is the basic building block for the combining SpanQuery classes, such as SpanNearQuery.

A SpanNearQuery looks for a number of SpanQuerys within a given distance of each other. You can specify that the spans must appear in the order given, or that order should not be considered. The clauses can be any number of SpanTermQuerys, other SpanNearQuerys, or any of the other SpanQuerys mentioned below. You can nest arbitrarily, e.g. SpanNearQuerys can contain other SpanNearQuerys that contain still other SpanNearQuerys, and so on.

Say you want to find lucene within 5 positions of doug, with doug following lucene (order matters). You could use the following SpanQuery:

new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

The SpanNearQuery constructor takes an array of SpanQuerys, the distance allowed between spans, and a boolean indicating whether order (as indicated by the order of the SpanQuery array) is required.
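To make the slop and order arguments concrete, here is a minimal sketch in plain Java that does not use Lucene at all. It models a document as a list of tokens and assumes the distance is the number of positions between the end of the first term and the start of the second, so adjacent terms have a distance of 0. The class and method names (OrderedNearSketch, orderedNear) are hypothetical and exist only for illustration; this is not how Lucene implements the query.

```java
import java.util.Arrays;
import java.util.List;

class OrderedNearSketch {
    // True if `first` occurs before `second` with at most `slop` positions between them.
    static boolean orderedNear(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) continue;
            for (int j = i + 1; j < tokens.size(); j++) {
                // A term at position i covers [i, i + 1), so the gap to a
                // term at position j is j - (i + 1).
                int gap = j - (i + 1);
                if (gap > slop) break;
                if (tokens.get(j).equals(second)) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("lucene", "was", "created", "by", "doug", "cutting");
        System.out.println(orderedNear(doc, "lucene", "doug", 5)); // true (gap is 3)
        System.out.println(orderedNear(doc, "doug", "lucene", 5)); // false (wrong order)
    }
}
```

Passing false for the order flag in the real query would relax the second case; the sketch only models the ordered form.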

Nesting works the same way. This time we are looking for doug within 5 positions after lucene, and then hadoop within 4 positions after the lucene -> doug span.

SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
  new SpanTermQuery(new Term(FIELD, "lucene")),
  new SpanTermQuery(new Term(FIELD, "doug"))},
  5,
  true);

new SpanNearQuery(new SpanQuery[] {
  spanNear,
  new SpanTermQuery(new Term(FIELD, "hadoop"))},
  4,
  true);
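The nested case is easier to follow once a span is treated as a (start, end) pair. The sketch below is a simplified plain-Java model, not Lucene code: it assumes a term at position p covers [p, p + 1), that a combined span runs from the start of its first clause to the end of its last, and that clauses may not overlap. It also enumerates every pairing, which Lucene itself does not do. The names NestedNearSketch, term, and orderedNear are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NestedNearSketch {
    // Spans for a single term: each occurrence at position p covers [p, p + 1).
    static List<int[]> term(List<String> tokens, String term) {
        List<int[]> spans = new ArrayList<>();
        for (int p = 0; p < tokens.size(); p++) {
            if (tokens.get(p).equals(term)) spans.add(new int[] {p, p + 1});
        }
        return spans;
    }

    // Ordered near: pair each left span with every right span that starts at or
    // after its end with a gap of at most `slop`. The result covers both spans.
    static List<int[]> orderedNear(List<int[]> left, List<int[]> right, int slop) {
        List<int[]> out = new ArrayList<>();
        for (int[] a : left) {
            for (int[] b : right) {
                int gap = b[0] - a[1];
                if (gap >= 0 && gap <= slop) out.add(new int[] {a[0], b[1]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
            "lucene", "was", "written", "by", "doug", "before", "he", "started", "hadoop");
        // Inner span: lucene -> doug within 5. Outer span: inner -> hadoop within 4.
        List<int[]> inner = orderedNear(term(doc, "lucene"), term(doc, "doug"), 5);
        List<int[]> outer = orderedNear(inner, term(doc, "hadoop"), 4);
        for (int[] s : outer) {
            System.out.println(s[0] + ".." + s[1]); // prints 0..9
        }
    }
}
```

Note that the outer distance is measured from the end of the whole lucene -> doug span (position 5 here), not from lucene itself.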

SpanFirstQuery

The SpanFirstQuery lets you specify that a matching span's end position must come before a given position passed to the SpanFirstQuery. In other words, it allows you to search for spans that start and end within the first n positions of the document.
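In Lucene the query wraps another SpanQuery together with the end position, along the lines of new SpanFirstQuery(new SpanTermQuery(new Term(FIELD, "lucene")), 3). The following plain-Java sketch models only the single-term case and does not use Lucene. It assumes positions are 0-based and that a term at position p has an exclusive end position of p + 1, which must not exceed the limit. SpanFirstSketch and spanFirst are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class SpanFirstSketch {
    // True if `term` occurs with its span ending at or before position `end`.
    // A term at position p covers [p, p + 1), so it needs p + 1 <= end.
    static boolean spanFirst(List<String> tokens, String term, int end) {
        for (int p = 0; p < tokens.size() && p + 1 <= end; p++) {
            if (tokens.get(p).equals(term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("apache", "lucene", "is", "a", "search", "library");
        System.out.println(spanFirst(doc, "lucene", 2)); // true: lucene is in the first 2 positions
        System.out.println(spanFirst(doc, "search", 3)); // false: search is at position 4
    }
}
```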

In a SpanNearQuery, distance is measured from the end of span1 to the start of span2, but the order restriction only means that the start of span2 must come after the start of span1.

This is because Lucene defines Spans as non-overlapping. This means every Span must start at least one position after the previous Span started.

SpanQuerys do not do exhaustive matching – but if there is at least one match, they will find it.

Read full article from The SpanQuery – Lucidworks
