Testing Solr schema, analyzers and tokenization - Pathbreak Developer Notebook
Using tests to tune the accuracy of search results is critical: accuracy depends to a great extent on the analyzers, tokenizers and filters configured in the Solr schema.

Testing and refining their behaviour on a standalone Solr server is unproductive and time consuming, involving repeated cycles of deleting documents, stopping the server, changing the schema, restarting the server, and reindexing documents.
It would be far more productive if analyzer tweaks could be tested quickly on small fragments of text to see how they will be tokenized and searched, before modifying the Solr schema.

Unit testing Solr tokenizers, token filters and char filters
The snippet below builds a whitespace tokenizer → lowercase filter → Snowball stemmer chain directly from Solr's factory classes and prints the resulting tokens:
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.SnowballPorterFilterFactory;
import org.apache.solr.analysis.TokenFilterFactory;
import org.apache.solr.analysis.TokenizerFactory;
import org.apache.solr.analysis.WhitespaceTokenizerFactory;

public static void main(String[] args) {
    try {
        StringReader inputText = new StringReader("RUNNING runnable");

        // Split the input on whitespace.
        Map<String, String> tkargs = new HashMap<String, String>();
        tkargs.put("luceneMatchVersion", "LUCENE_33");
        TokenizerFactory tkf = new WhitespaceTokenizerFactory();
        tkf.init(tkargs);
        Tokenizer tkz = tkf.create(inputText);

        // Lowercase each token.
        LowerCaseFilterFactory lcf = new LowerCaseFilterFactory();
        lcf.init(tkargs);
        TokenStream lcts = lcf.create(tkz);

        // Stem each token with the Snowball English stemmer.
        TokenFilterFactory fcf = new SnowballPorterFilterFactory();
        Map<String, String> params = new HashMap<String, String>();
        params.put("language", "English");
        fcf.init(params);
        TokenStream ts = fcf.create(lcts);

        // Walk the final token stream and print each term.
        CharTermAttribute termAttrib = (CharTermAttribute) ts.getAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(termAttrib.toString());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
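Conceptually, the first two stages of that chain (tokenize on whitespace, then lowercase) can be approximated in plain Java, with no Solr or Lucene dependencies. The sketch below is only an illustration of what the chain does to the input, not the Solr implementation; the Snowball stemming stage is omitted since Solr delegates it to Lucene's stemmer:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class AnalysisChainSketch {

    // Plain-Java approximation of the whitespace-tokenize + lowercase steps.
    // (Stemming is omitted; a real Solr chain would stem "running" further.)
    static List<String> analyze(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("RUNNING runnable")); // [running, runnable]
    }
}
```

Writing tiny helpers like this next to the factory-based test makes it easy to pin down which stage of the chain produced an unexpected token.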
Functional testing of Solr schema.xml
For functional tests, it is more useful to test the actual Solr search model itself, rather than individual tokenizer chains.
The snippet below shows how schema.xml can be loaded and an analysis run on a piece of input text against a dummy field, to examine the resulting index-time and query-time tokens:

import java.io.FileReader;
import java.io.StringReader;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.IndexSchema;
import org.xml.sax.InputSource;

public static void main(String[] args) {
    try {
        // Load solrconfig.xml and schema.xml straight from the conf directory.
        InputSource solrCfgIs = new InputSource(new FileReader("solr/conf/solrconfig.xml"));
        SolrConfig solrConfig = new SolrConfig(null, solrCfgIs);
        InputSource solrSchemaIs = new InputSource(new FileReader("solr/conf/schema.xml"));
        IndexSchema solrSchema = new IndexSchema(solrConfig, null, solrSchemaIs);

        // List every field type defined in the schema along with its analyzer.
        Map<String, FieldType> fieldTypes = solrSchema.getFieldTypes();
        for (Entry<String, FieldType> entry : fieldTypes.entrySet()) {
            FieldType fldType = entry.getValue();
            Analyzer analyzer = fldType.getAnalyzer();
            System.out.println(entry.getKey() + ":" + analyzer.toString());
        }

        String inputText = "Proof of the pudding lies in its eating";
        FieldType fieldTypeText = fieldTypes.get("text_en");

        // Index-time analysis: the tokens as they would be written to the index.
        System.out.println("Indexing analysis:");
        Analyzer analyzer = fieldTypeText.getAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("dummyfield", new StringReader(inputText));
        CharTermAttribute termAttr = (CharTermAttribute) tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) {
            System.out.println(termAttr.toString());
        }

        // Query-time analysis: how a query on this field type would be tokenized.
        System.out.println("\n\nQuerying analysis:");
        Analyzer qryAnalyzer = fieldTypeText.getQueryAnalyzer();
        TokenStream qrytokenStream = qryAnalyzer.tokenStream("dummyfield", new StringReader(inputText));
        CharTermAttribute termAttr2 = (CharTermAttribute) qrytokenStream.getAttribute(CharTermAttribute.class);
        while (qrytokenStream.incrementToken()) {
            System.out.println(termAttr2.toString());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
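The payoff of printing both analyses is that a query term can only match a document if its query-time tokens also appear among the index-time tokens for the same text. That containment check can be expressed as a tiny helper; the token lists below are illustrative placeholders, not actual output of any particular schema:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class TokenOverlapCheck {

    // A query can only hit a document if every analyzed query token
    // also appears among the index-time tokens for the same text.
    static boolean queryMatchesIndex(List<String> indexTokens, List<String> queryTokens) {
        return new HashSet<String>(indexTokens).containsAll(queryTokens);
    }

    public static void main(String[] args) {
        // Illustrative: suppose index-time analysis stemmed "eating" to "eat".
        List<String> indexTokens = Arrays.asList("proof", "pudding", "lie", "eat");

        // A query analyzer that also stems produces "eat" and matches...
        System.out.println(queryMatchesIndex(indexTokens, Arrays.asList("eat")));    // true

        // ...while a query analyzer that skips stemming produces "eating" and misses.
        System.out.println(queryMatchesIndex(indexTokens, Arrays.asList("eating"))); // false
    }
}
```

Asserting this containment between the two token streams in a test catches index/query analyzer mismatches before they reach a live server.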
