Anyway, I decided to get familiar with the UIMA API by solving a toy problem. Assume a website that allows searching for names of people and organizations, with optional (and partial) addresses to narrow the search. Behind the scenes, assume an index that stores city, state and zipcode as separate indexed fields. The query string is parsed using a UIMA aggregate analysis engine (AE) composed of a pipeline of three primitive AEs, which parse the zipcode, state and city respectively. The end result of the analysis is the term with token offset information for each of these entities. I haven't gone as far as the query parser (a CAS Consumer in UIMA), so in this post I show the various descriptors and annotator code that parse the query string and extract the entities from it.
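To make the goal concrete, here is a minimal plain-Java sketch of the three-stage extraction described above. It does not use the UIMA API at all: the class name QuerySpanSketch, the Span type, the regular expressions and the tiny state and city lists are all hypothetical stand-ins, and it reports character offsets rather than token offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a plain-Java stand-in for the pipeline, not UIMA code.
final class QuerySpanSketch {

    /** One extracted entity: its kind plus character offsets into the query. */
    static final class Span {
        final String type;
        final int begin; // inclusive
        final int end;   // exclusive

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + ")";
        }
    }

    // Hypothetical lookup data; a real system would need complete lists.
    private static final List<String> STATES = Arrays.asList("CA", "NY", "TX");
    private static final List<String> CITIES =
        Arrays.asList("San Francisco", "New York", "Austin");

    private static final Pattern ZIP = Pattern.compile("\\b\\d{5}(?:-\\d{4})?\\b");
    private static final Pattern TWO_LETTERS = Pattern.compile("\\b[A-Z]{2}\\b");

    /** Runs the zipcode, state and city stages in that order. */
    static List<Span> annotate(String query) {
        List<Span> spans = new ArrayList<>();

        // Stage 1: first 5-digit (or ZIP+4) number in the query.
        Matcher zip = ZIP.matcher(query);
        if (zip.find()) {
            spans.add(new Span("zipcode", zip.start(), zip.end()));
        }

        // Stage 2: first two-letter uppercase word that is a known state code.
        Matcher state = TWO_LETTERS.matcher(query);
        while (state.find()) {
            if (STATES.contains(state.group())) {
                spans.add(new Span("state", state.start(), state.end()));
                break;
            }
        }

        // Stage 3: first known city name, matched case-insensitively.
        String lower = query.toLowerCase(Locale.ROOT);
        for (String city : CITIES) {
            int at = lower.indexOf(city.toLowerCase(Locale.ROOT));
            if (at >= 0) {
                spans.add(new Span("city", at, at + city.length()));
                break;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String query = "Acme Corp San Francisco CA 94105";
        for (Span s : annotate(query)) {
            System.out.println(s + " " + query.substring(s.begin, s.end));
        }
    }
}
```

Whatever text is not covered by a span is what remains to be searched as the name, which is the part a downstream query parser would consume along with the three fields.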
UIMA Background
For those not familiar with UIMA, it's a framework developed by IBM and donated to Apache. UIMA is currently in the Apache incubator. For details, you should refer to the UIMA Tutorial and Developer's Guide, but if you want a really quick (and possibly incomplete) tour, here it is. The basic building block is a primitive Analysis Engine (AE). Each primitive AE needs an annotation type and an annotator. The type is defined in an XML file, and a tool called JCasGen is used to generate the POJO representing the type and annotation. The annotator is written next, and an XML descriptor is created for it. The framework instantiates the annotator using the AE's XML descriptor. Aggregate AEs are also defined as XML files, and they define chains of primitive AEs.
UIMA comes with an Eclipse plug-in, which provides tools to build the XML using fill-in forms. It's probably advisable to use that because the XML is quite complex, at least initially.
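The relationship between annotation types, annotators, and primitive and aggregate AEs can be mimicked in a few lines of plain Java. This is only an analogy under my own assumptions, not the UIMA API: the names AggregateSketch, Document, Annotation, Annotator, RegexAnnotator and Aggregate are made up, the chain is wired in code rather than in XML descriptors, and the Annotation class is written by hand rather than generated by JCasGen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: hypothetical classes that mirror the concepts, not UIMA.
final class AggregateSketch {

    /** Stand-in for a generated annotation type: a label plus character offsets. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Stand-in for the shared analysis state: the text plus annotations so far. */
    static final class Document {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Document(String text) {
            this.text = text;
        }
    }

    /** Stand-in for the annotator inside a primitive AE. */
    interface Annotator {
        void process(Document doc);
    }

    /** A primitive stage that annotates every match of one regular expression. */
    static final class RegexAnnotator implements Annotator {
        private final String type;
        private final Pattern pattern;

        RegexAnnotator(String type, String regex) {
            this.type = type;
            this.pattern = Pattern.compile(regex);
        }

        @Override
        public void process(Document doc) {
            Matcher m = pattern.matcher(doc.text);
            while (m.find()) {
                doc.annotations.add(new Annotation(type, m.start(), m.end()));
            }
        }
    }

    /** Stand-in for an aggregate AE: runs its delegates in a fixed order. */
    static final class Aggregate implements Annotator {
        private final List<Annotator> delegates;

        Aggregate(List<Annotator> delegates) {
            this.delegates = new ArrayList<>(delegates);
        }

        @Override
        public void process(Document doc) {
            for (Annotator delegate : delegates) {
                delegate.process(doc);
            }
        }
    }

    /** Builds a two-stage chain (zipcode, then state) and runs it over the text. */
    static Document run(String text) {
        Annotator pipeline = new Aggregate(Arrays.<Annotator>asList(
            new RegexAnnotator("zipcode", "\\b\\d{5}\\b"),
            new RegexAnnotator("state", "\\b(?:CA|NY|TX)\\b")));
        Document doc = new Document(text);
        pipeline.process(doc);
        return doc;
    }
}
```

Because Aggregate itself implements Annotator, a chain can be nested inside another chain, which is roughly the role an aggregate descriptor plays when it refers to other descriptors.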
Read full article from Salmon Run: Smart Query Parsing with UIMA