Salmon Run: Smart Query Parsing with UIMA



Anyway, I decided to get familiar with the UIMA API by solving a toy problem. Assume a website which allows searching for names of people and organizations with optional (and partial) addresses to narrow the search. Behind the scenes, asume an index which stores city, state and zipcode as separate indexed fields. The query string is parsed using a UIMA aggregate analysis engine (AE) composed of a pipeline of three primitive AEs, for parsing the zipcode, state and city respectively. The end result of the analysis is the term with token offset information for each of these entities. I haven't gone as far as the query parser (a CAS Consumer in UIMA), so in this post I show the various descriptors and annotator code that parse the query string and extract the entities from it.

UIMA Background

For those not familiar with UIMA, its a framework developed by IBM and donated to Apache. UIMA is currently in the Apache incubator. For details, you should refer to the UIMA Tutorial and Developer's Guide, but if you want a really quick (and possibly incomplete) tour, here it is. The basic building block that you build is a primitive Analysis Engine (AE). Each primitive AE needs to have an annotation type and an annotator. The type is defined as an XML file and a tool called JCasGen used to generate the POJO representing the type and annotation. The annotator is written next, and an XML descriptor created. The framework instantiates the annotator using the AE XML descriptor. Aggregate AEs are defined as XML files, and define chains of primitive AEs.

UIMA comes with an Eclipse plug in, which provides tools to build the XML using fill-in forms. Its probably advisable to use that because the XML is quite complex, at least initially.

Read full article from Salmon Run: Smart Query Parsing with UIMA


No comments:

Post a Comment

Labels

Algorithm (219) Lucene (130) LeetCode (97) Database (36) Data Structure (33) text mining (28) Solr (27) java (27) Mathematical Algorithm (26) Difficult Algorithm (25) Logic Thinking (23) Puzzles (23) Bit Algorithms (22) Math (21) List (20) Dynamic Programming (19) Linux (19) Tree (18) Machine Learning (15) EPI (11) Queue (11) Smart Algorithm (11) Operating System (9) Java Basic (8) Recursive Algorithm (8) Stack (8) Eclipse (7) Scala (7) Tika (7) J2EE (6) Monitoring (6) Trie (6) Concurrency (5) Geometry Algorithm (5) Greedy Algorithm (5) Mahout (5) MySQL (5) xpost (5) C (4) Interview (4) Vi (4) regular expression (4) to-do (4) C++ (3) Chrome (3) Divide and Conquer (3) Graph Algorithm (3) Permutation (3) Powershell (3) Random (3) Segment Tree (3) UIMA (3) Union-Find (3) Video (3) Virtualization (3) Windows (3) XML (3) Advanced Data Structure (2) Android (2) Bash (2) Classic Algorithm (2) Debugging (2) Design Pattern (2) Google (2) Hadoop (2) Java Collections (2) Markov Chains (2) Probabilities (2) Shell (2) Site (2) Web Development (2) Workplace (2) angularjs (2) .Net (1) Amazon Interview (1) Android Studio (1) Array (1) Boilerpipe (1) Book Notes (1) ChromeOS (1) Chromebook (1) Codility (1) Desgin (1) Design (1) Divide and Conqure (1) GAE (1) Google Interview (1) Great Stuff (1) Hash (1) High Tech Companies (1) Improving (1) LifeTips (1) Maven (1) Network (1) Performance (1) Programming (1) Resources (1) Sampling (1) Sed (1) Smart Thinking (1) Sort (1) Spark (1) Stanford NLP (1) System Design (1) Trove (1) VIP (1) tools (1)

Popular Posts