The anatomy of a Lucene Tokenizer
A term is the unit of search in Lucene. A Lucene document comprises a set of terms. Tokenization is the process of splitting a string into tokens, or terms.
The only method you need to implement is public boolean incrementToken(), which returns false at EOF and true otherwise.
Tokenizers generally take a Reader in their constructor; this Reader is the source of the text to be tokenized.
With each invocation of incrementToken(), the Tokenizer is expected to produce the next token by setting the values of its attributes, such as the CharTermAttribute. Attributes are registered with the superclass via addAttribute(), usually as fields of the Tokenizer, e.g.
In the example below, a CharTermAttribute is added to the superclass; a CharTermAttribute stores the term text.
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MyCustomTokenizer extends Tokenizer {

    // CharTermAttribute holds the text of the current token.
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final StringBuilder sb = new StringBuilder();
    private boolean done = false;

    public MyCustomTokenizer(Reader input) {
        super(input); // the Reader to be tokenized; available to subclasses as "input"
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        if (done) return false; // EOF: the single token has already been emitted
        done = true;
        char[] buffer = new char[512];
        while (true) {
            // input is the Reader set in the ctor
            final int length = input.read(buffer, 0, buffer.length);
            if (length == -1) break;
            sb.append(buffer, 0, length); // append only the characters actually read
        }
        termAtt.append(sb.toString()); // emit the entire input as one token
        return true;
    }
}
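For completeness, here is a minimal sketch of how such a tokenizer might be driven, assuming the MyCustomTokenizer class above and an older Lucene release in which Tokenizer accepts a Reader in its constructor. The demo class name and input string are made up for illustration; the reset()/incrementToken()/end()/close() sequence is the standard TokenStream consumption pattern, though the exact requirements vary slightly between Lucene versions.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class MyCustomTokenizerDemo {
    public static void main(String[] args) throws IOException {
        MyCustomTokenizer tokenizer = new MyCustomTokenizer(new StringReader("hello lucene"));
        // Retrieve the same CharTermAttribute instance the tokenizer registered via addAttribute().
        CharTermAttribute termAtt = tokenizer.getAttribute(CharTermAttribute.class);
        tokenizer.reset();                   // prepare the stream before consuming it
        while (tokenizer.incrementToken()) { // false signals EOF
            System.out.println(termAtt.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}

Because the incrementToken() above emits the entire input as a single token, the loop body runs exactly once; a tokenizer that splits on whitespace or punctuation would simply return true multiple times, once per term.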