All About Programming: UIMA Tutorial and Developers' Guides

UIMA Tutorial and Developers' Guides

In Eclipse, expand the uimaj-examples project in the Package Explorer view, and browse to the file descriptors/tutorial/ex1/TutorialTypeSystem.xml. Right-click on the file in the navigator and select Open With → Component Descriptor Editor.

Descriptions can be included with types and features. In this example, there is a description associated with the building feature. To see it, hover the mouse over the feature.

The built-in Annotation type declares three fields (called Features in CAS terminology). The features begin and end store the character offsets of the span of text to which the annotation refers. The feature sofa (Subject of Analysis) indicates which document the begin and end offsets point into.

Generating Java Source Files for CAS Types

When you save a descriptor that you have modified, the Component Descriptor Editor will automatically generate Java classes corresponding to the types that are defined in that descriptor (unless this has been disabled), using a utility called JCasGen. These Java classes will have the same name (including package) as the CAS types, and will have get and set methods for each of the features that you have defined.

This feature is enabled or disabled using the UIMA menu pulldown (or the Eclipse Preferences → UIMA). If JCasGen is not running automatically, make sure the option is checked.

If you are not using the Component Descriptor Editor, you will need to generate these Java classes by using the JCasGen tool. JCasGen reads a Type System Descriptor XML file and generates the corresponding Java classes that you can then use in your annotator code. To launch JCasGen, run the jcasgen shell script located in the /bin directory of the UIMA SDK installation.

Developing Your Annotator Code

Annotator implementations all implement a standard interface (AnalysisComponent), having several methods, the most important of which are:

initialize,
process, and
destroy.

initialize is called by the framework once when it first creates an instance of the annotator class. process is called once per item being processed. destroy may be called by the application when it is done using your annotator. There is a default implementation of this interface for annotators using the JCas, called JCasAnnotator_ImplBase, which has implementations of all required methods except for the process method.

Our annotator class extends the JCasAnnotator_ImplBase; most annotators that use the JCas will extend from this class, so they only have to implement the process method.

The only method that we are required to implement is process. This method is typically called once for each document that is being analyzed. This method takes one argument, which is a JCas instance; this holds the document to be analyzed and all of the analysis results.

public void process(JCas aJCas) {
  // get document text
  String docText = aJCas.getDocumentText();
  // search for Yorktown room numbers
  Matcher matcher = mYorktownPattern.matcher(docText);
  int pos = 0;
  while (matcher.find(pos)) {
    // found one - create annotation
    RoomNumber annotation = new RoomNumber(aJCas);
    annotation.setBegin(matcher.start());
    annotation.setEnd(matcher.end());
    annotation.setBuilding("Yorktown");
    annotation.addToIndexes();
    pos = matcher.end();
  }
}

Finally, we call annotation.addToIndexes() to add the new annotation to the indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps an index of all annotations in their order from beginning to end of the document. Subsequent annotators or applications use the indexes to iterate over the annotations.

If you don't add the instance to the indexes, it cannot be retrieved by down-stream annotators, using the indexes.

You can also call addToIndexes() on Feature Structures that are not subtypes of uima.tcas.Annotation, but these will not be sorted in any particular way. If you want to specify a sort order, you can define your own custom indexes in the CAS: see UIMA References Chapter 4, CAS Reference and Section 2.4.1.5, “Index Definition” for details.
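
The matching loop in process can be tried outside UIMA. The sketch below is plain Java with no UIMA classes: the class name RoomNumberScan, the method findSpans, and the regular expression are assumptions made for illustration (the value of mYorktownPattern is not shown above). Each int[] pair stands in for the begin and end offsets that the annotator stores with setBegin and setEnd.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RoomNumberScan {
    // Assumed pattern, for illustration only: two digits, a dash, three digits.
    private static final Pattern ROOM = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");

    /** Returns {begin, end} character offsets for every match, in document order. */
    static List<int[]> findSpans(String docText) {
        List<int[]> spans = new ArrayList<>();
        Matcher matcher = ROOM.matcher(docText);
        int pos = 0;
        // find(pos) resets the matcher and searches from pos, as in the process method
        while (matcher.find(pos)) {
            spans.add(new int[] {matcher.start(), matcher.end()});
            pos = matcher.end(); // continue after the match just found
        }
        return spans;
    }

    public static void main(String[] args) {
        for (int[] span : findSpans("Meet in 20-001 or 31-206.")) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

The end offset is exclusive, so docText.substring(begin, end) returns exactly the matched room number.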

Creating the XML Descriptor

The UIMA architecture requires that descriptive information about an annotator be represented in an XML file and provided along with the annotator class file(s) to the UIMA framework at run time. This XML file is called an Analysis Engine Descriptor. The descriptor includes:

Name, description, version, and vendor
The annotator's inputs and outputs, defined in terms of the types in a Type System Descriptor
Declaration of the configuration parameters that the annotator accepts

The Component Descriptor Editor plugin, which we previously used to edit the Type System descriptor, can also be used to edit Analysis Engine Descriptors.

The other two pages that need to be filled out are the Type System page and the Capabilities page.

To specify this, we add this type system to the Analysis Engine's list of Imported Type Systems, using the Type System page's right side panel.

Although capabilities come in sets, having multiple sets is deprecated; here we're just using one set. The RoomNumberAnnotator is very simple. It requires no input types, as it operates directly on the document text -- which is supplied as a part of the CAS initialization (and which is always assumed to be present). It produces only one output type (RoomNumber), and it sets the value of the building feature on that type. This is all represented on the Capabilities page.

The Sofas section allows you to specify the names of additional subjects of analysis. This capability and the Sofa Mappings at the bottom are advanced topics, described in Chapter 5, Annotations, Artifacts, and Sofas.

Testing Your Annotator

The UIMA SDK includes a tool called the Document Analyzer that lets us test the annotator. To run the Document Analyzer, execute the documentAnalyzer shell script that is in the bin directory of your UIMA SDK installation, or, if you are using the example Eclipse project, execute the “UIMA Document Analyzer” run configuration supplied with that project. (To do this, click on the menu bar Run → Run ... → and under Java Applications in the left box, click on UIMA Document Analyzer.)

Configuration Parameters

UIMA allows annotators to declare configuration parameters in their descriptors. The descriptor also specifies default values for the parameters, though these can be overridden at runtime.

Accessing Parameter Values from the Annotator Code
The class org.apache.uima.tutorial.ex2.RoomNumberAnnotator has overridden the initialize method. The initialize method is called by the UIMA framework when the annotator is instantiated, so it is a good place to read configuration parameter values.

public void initialize(UimaContext aContext) 
        throws ResourceInitializationException {
  super.initialize(aContext);
  
  // Get config. parameter values  
  String[] patternStrings = 
        (String[]) aContext.getConfigParameterValue("Patterns");
  mLocations = 
        (String[]) aContext.getConfigParameterValue("Locations");

  // compile regular expressions
  mPatterns = new Pattern[patternStrings.length];
  for (int i = 0; i < patternStrings.length; i++) {
    mPatterns[i] = Pattern.compile(patternStrings[i]);
  }
}

Configuration parameter values are accessed through the UimaContext. As you will see in subsequent sections of this chapter, the UimaContext is the annotator's access point for all of the facilities provided by the UIMA framework – for example logging and external resource access.

The UimaContext's getConfigParameterValue method takes the name of the parameter as an argument; this must match one of the parameters declared in the descriptor. The return value of this method is a Java Object, whose type corresponds to the declared type of the parameter. It is up to the annotator to cast it to the appropriate type, String[] in this case.

If there is a problem retrieving the parameter values, the framework throws an exception. Generally annotators don't handle these, and just let them propagate up.

Supporting Reconfiguration

If you take a look at the Javadocs (located in the docs/api directory) for org.apache.uima.analysis_component.AnalysisComponent (which our annotator implements indirectly through JCasAnnotator_ImplBase), you will see that there is a reconfigure() method, which is called by the containing application through the UIMA framework if the configuration parameter values are changed.

Overriding Configuration Parameter Settings

There are two ways that the value assigned to a configuration parameter can be overridden. An aggregate may declare a parameter that overrides one or more of the parameters in one or more of its delegates. The aggregate must also define a value for the parameter, unless the parameter is itself overridden by a setting in the parent aggregate.

An alternative method that avoids these strict hierarchical override constraints is to associate an external global name with a parameter and to assign values to these external names in an external properties file. With this approach a particular parameter setting can be easily shared by multiple descriptors, even across different applications. For applications with many levels of descriptor nesting it avoids the need to edit aggregate override definitions when the location of an annotator in the hierarchy is changed.

The UIMA framework supports this convention using the UimaContext object. If you access a logger instance using getContext().getLogger() within an Annotator, the logger name will be the fully qualified name of the Annotator implementation class.

getContext().getLogger().log(Level.FINEST,"Found: " + annotation);

If no logging configuration file is provided (see next section), the Java Virtual Machine defaults would be used, which typically set the level to INFO and higher messages, and direct output to the console.
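
The effect of an INFO level on a FINEST message can be seen with java.util.logging alone. This is a minimal sketch, not UIMA code: LogLevelDemo, CollectingHandler, and the logger names are made up for illustration, and the handler collects messages in memory instead of writing to the console.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class LogLevelDemo {
    /** Handler that remembers every record it is asked to publish. */
    static final class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Logs one FINEST and one INFO message and returns what got through. */
    static List<String> logAt(Level loggerLevel) {
        // Hypothetical logger name; a separate logger per level keeps runs independent
        Logger logger = Logger.getLogger("demo.RoomNumberAnnotator." + loggerLevel.getName());
        logger.setUseParentHandlers(false); // keep the demo output off the console
        logger.setLevel(loggerLevel);
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.ALL); // let the logger level do the filtering
        logger.addHandler(handler);
        logger.log(Level.FINEST, "Found: RoomNumber");
        logger.log(Level.INFO, "Annotator initialized");
        logger.removeHandler(handler);
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println(logAt(Level.INFO));
        System.out.println(logAt(Level.FINEST));
    }
}
```

With the logger at INFO, the FINEST message is discarded before it reaches any handler; lowering the level to FINEST lets both messages through.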

If you specify the standard UIMA SDK Logger.properties, the output will be directed to a file named uima.log, in the current working directory (often the “project” directory when running from Eclipse, for instance).

Also, you can set the Eclipse preferences for the workspace to automatically refresh (Window → Preferences → General → Workspace, then click the “refresh automatically” checkbox).

Specifying the Logging Configuration

The standard UIMA logger uses the underlying Java 1.4 logging mechanism. You can use the APIs that come with that to configure the logging. In addition, the standard Java 1.4 logging initialization mechanisms will look for a Java System Property named java.util.logging.config.file and if found, will use the value of this property as the name of a standard “properties” file, for setting the logging level.

java "-Djava.util.logging.config.file=C:/Program Files/apache-uima/config/Logger.properties"

If you are using Eclipse to launch your application, you can set this property in the VM arguments section of the Arguments tab of the run configuration screen. If you've set an environment variable UIMA_HOME, you could, for example, use the string: "-Djava.util.logging.config.file=${env_var:UIMA_HOME}/config/Logger.properties".

If you are running the .bat or .sh files in the UIMA SDK's bin directory, you can specify the location of your logger configuration file by setting the UIMA_LOGGER_CONFIG_FILE environment variable prior to running the script, for example (on Windows):

set UIMA_LOGGER_CONFIG_FILE=C:/myapp/MyLogger.properties

Building Aggregate Analysis Engines
Combining Annotators

The UIMA SDK makes it very easy to combine any sequence of Analysis Engines to form an Aggregate Analysis Engine. This is done through an XML descriptor; no Java code is required!

We are going to combine the TutorialDateTime annotator with the RoomNumberAnnotator to create an aggregate Analysis Engine.

To add a component, you can click the “Add” button and browse to its descriptor. You can also click the “Find AE” button and search for an Analysis Engine in your Eclipse workspace.

The “AddRemote” button is used for adding components which run remotely (for example, on another machine using a remote networking connection).

The order of the components in the left pane does not imply an order of execution. The order of execution, or “flow”, is determined in the “Component Engine Flow” section on the right. UIMA supports different types of algorithms (including user-definable) for determining the flow. Here we pick the simplest: FixedFlow.

If you look at the “Type System” page of the Component Descriptor Editor, you will see that it displays the type system but is not editable. The Type System of an Aggregate Analysis Engine is automatically computed by merging the Type Systems of all of its components.

Note that it is not automatically assumed that all outputs of each component Analysis Engine (AE) are passed through as outputs of the aggregate AE. If, for example, the TutorialDateTime annotator also produced Word and Sentence annotations, but those were not of interest as output in this case, we can exclude them from the list of outputs.

AAEs can also contain CAS Consumers

In addition to aggregating Analysis Engines, Aggregates can also contain CAS Consumers (see Chapter 2, Collection Processing Engine Developer's Guide), or even a mixture of these components with regular Analysis Engines. The UIMA examples include an Aggregate that contains both an analysis engine and a CAS consumer, in examples/descriptors/MixedAggregate.xml.

Analysis Engines support the collectionProcessComplete method, which is particularly important for many CAS Consumers. If an application (or a Collection Processing Engine) calls collectionProcessComplete on an aggregate, the framework will deliver that call to all of the components of the aggregate. If you use one of the built-in flow types (fixedFlow or capabilityLanguageFlow), then the order specified in that flow will be the same order in which the collectionProcessComplete calls are made to the components. If a custom flow is used, then the calls will be made in arbitrary order.

Reading the Results of Previous Annotators
So far, we have been looking at annotators that look directly at the document text. However, annotators can also use the results of other annotators. One useful thing we can do at this point is look for the co-occurrence of a Date, a RoomNumber, and two Times – and annotate that as a Meeting.

The CAS maintains indexes of annotations, and from an index you can obtain an iterator that allows you to step through all annotations of a particular type, for example all of the TimeAnnot annotations in the JCas.
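
The co-occurrence check itself can be sketched in plain Java. Nothing below is UIMA API: Span is a hypothetical stand-in for an annotation's begin and end offsets, the three lists stand in for what iterating over the RoomNumber, Date, and Time indexes would yield, and the window parameter (maximum character distance covered by one meeting) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class MeetingFinder {
    /** Hypothetical stand-in for an annotation: a character span in the document. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Returns one covering span per (room, date, time, time) combination that fits in the window. */
    static List<Span> findMeetings(List<Span> rooms, List<Span> dates,
                                   List<Span> times, int window) {
        List<Span> meetings = new ArrayList<>();
        for (Span room : rooms) {
            for (Span date : dates) {
                // every unordered pair of distinct times
                for (int i = 0; i < times.size(); i++) {
                    for (int j = i + 1; j < times.size(); j++) {
                        Span t1 = times.get(i);
                        Span t2 = times.get(j);
                        int begin = Math.min(Math.min(room.begin, date.begin),
                                             Math.min(t1.begin, t2.begin));
                        int end = Math.max(Math.max(room.end, date.end),
                                           Math.max(t1.end, t2.end));
                        if (end - begin <= window) {
                            meetings.add(new Span(begin, end));
                        }
                    }
                }
            }
        }
        return meetings;
    }
}
```

In an annotator, each covering span found this way would become a Meeting annotation that is then added to the indexes.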

The UIMA SDK includes several other examples you may find interesting, including:

SimpleTokenAndSentenceAnnotator – a simple tokenizer and sentence annotator.
XmlDetagger – A multi-sofa annotator that does XML detagging. Multiple Sofas (Subjects of Analysis) are described later; see Chapter 6, Multiple CAS Views of an Artifact. It reads XML data from the input Sofa (named "xmlDocument"); this data can be stored in the CAS as a string or array, or it can be a URI to a remote file. The XML is parsed using the JVM's default parser, and the plain-text content is written to a new sofa called "plainTextDocument".
PersonTitleDBWriterCasConsumer – a sample CAS Consumer which populates a relational database with some annotations. It uses JDBC and in this example, hooks up with the Open Source Apache Derby database.

Contract: Annotator Methods Called by the Framework

The UIMA framework ensures that an Annotator instance is called by only one thread at a time. An instance never has to worry about running some method on one thread, and then asynchronously being called using another thread. This approach simplifies the design of annotators – they do not have to be designed to support multi-threading. When multithreading is wanted for performance, multiple instances of the Annotator are created, each one running on just one thread.
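
The one-instance-per-thread idea can be illustrated without UIMA. In this sketch RoomCounter is a hypothetical stand-in for an annotator with unsynchronized mutable state, and the thread pool plays the role of the framework; the class names and the regular expression are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OneInstancePerThreadDemo {
    /** Hypothetical stand-in for an annotator: mutable state, no synchronization. */
    static final class RoomCounter {
        private final Pattern pattern = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
        private int found; // plain field: safe only because one thread owns this instance

        void process(String docText) {
            Matcher matcher = pattern.matcher(docText);
            while (matcher.find()) {
                found++;
            }
        }

        int found() {
            return found;
        }
    }

    /** Processes each batch on its own thread with its own RoomCounter and sums the counts. */
    static int countAll(List<List<String>> batches) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, batches.size()));
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> batch : batches) {
                results.add(pool.submit(() -> {
                    RoomCounter counter = new RoomCounter(); // never shared between tasks
                    for (String doc : batch) {
                        counter.process(doc);
                    }
                    return counter.found();
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because no RoomCounter is ever visible to more than one thread, its counter needs no locking, which is the same simplification the framework's contract gives annotator authors.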

Throwing Exceptions from Annotators

try {
  mPatterns[i] = Pattern.compile(patternStrings[i]);
} 
catch (PatternSyntaxException e) {
  throw new ResourceInitializationException(
     MESSAGE_DIGEST, "regex_syntax_error",
     new Object[]{patternStrings[i]}, e);
}
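
The same wrap-and-rethrow pattern can be exercised without UIMA on the classpath. In this sketch InitFailure is a hypothetical checked exception standing in for ResourceInitializationException, and a plain message string replaces the message-digest key and arguments; both substitutions are assumptions made so the example is self-contained.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class PatternInitSketch {
    /** Hypothetical checked exception standing in for ResourceInitializationException. */
    static final class InitFailure extends Exception {
        InitFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Compiles every pattern, reporting the first invalid one as an InitFailure. */
    static Pattern[] compileAll(String[] patternStrings) throws InitFailure {
        Pattern[] patterns = new Pattern[patternStrings.length];
        for (int i = 0; i < patternStrings.length; i++) {
            try {
                patterns[i] = Pattern.compile(patternStrings[i]);
            } catch (PatternSyntaxException e) {
                // Name the offending value and keep the original exception as the cause
                throw new InitFailure("Invalid regular expression: " + patternStrings[i], e);
            }
        }
        return patterns;
    }

    /** Returns null when every pattern compiles, otherwise the failure message. */
    static String describeFailure(String[] patternStrings) {
        try {
            compileAll(patternStrings);
            return null;
        } catch (InitFailure e) {
            return e.getMessage();
        }
    }
}
```

Passing the caught exception along as the cause preserves the regex parser's own diagnostic for whoever reads the stack trace.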

Accessing External Resources
A better way to create these sharable Java objects and initialize them via external disk or URL sources is through the ResourceManager component.

<externalResourceDependency>
  <key>AcronymTable</key> 
  <description>Table of acronyms and their expanded forms.</description> 
  <interfaceName>
    org.apache.uima.tutorial.ex6.StringMapResource
  </interfaceName> 
</externalResourceDependency>

The interface name (org.apache.uima.tutorial.ex6.StringMapResource) is the Java interface through which the annotator accesses the data. Specifying an interface name is optional. If you do not specify an interface name, annotators will instead get an interface which can provide direct access to the data resource (file or URL) that is associated with this external resource.

Accessing the Resource from the UimaContext

StringMapResource mMap = 
  (StringMapResource)getContext().getResourceObject("AcronymTable");

The object returned from the getResourceObject method will implement the interface declared in the <interfaceName> section of the descriptor, StringMapResource in this case. The annotator code does not need to know the location of external data that may be used to initialize this object, nor the Java class that might be used to read the data and implement the StringMapResource interface.

Note that if we did not specify a Java interface in our descriptor, our annotator could directly access the resource data as follows:

InputStream stream = getContext().getResourceAsStream("AcronymTable");

If necessary, the annotator could also determine the location of the resource file, by calling:

URI uri = getContext().getResourceURI("AcronymTable");

These last two options are only available in the case where the descriptor does not declare a Java interface.

The methods for getting access to resources include getResourceURL. That method returns a URL, which may contain spaces encoded as %20. url.getPath() would return the path without decoding these %20 into spaces. getResourceURI on the other hand, returns a URI, and the uri.getPath() does do the conversion of %20 into spaces. See also getResourceFilePath, which does a getResourceURI followed by uri.getPath().
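
The difference between the two getPath() results comes from java.net.URL and java.net.URI themselves, so it can be checked without UIMA. This is a small sketch; the class name PathDecodingDemo and the file location are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;

final class PathDecodingDemo {
    /** Path as reported by java.net.URL: percent-escapes are left as-is. */
    static String urlPath(String location) {
        try {
            return URI.create(location).toURL().getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Path as reported by java.net.URI: percent-escapes are decoded. */
    static String uriPath(String location) {
        return URI.create(location).getPath();
    }

    public static void main(String[] args) {
        String location = "file:/C:/Program%20Files/apache-uima/config/Logger.properties";
        System.out.println(urlPath(location)); // still contains %20
        System.out.println(uriPath(location)); // contains a real space
    }
}
```

Code that hands the result to java.io.File or a similar file API generally wants the decoded form.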

Declaring Resources and Bindings

Refer back to the top window in the Resources page of the Component Descriptor Editor. This is where we specify the location of the resource data, and the Java class used to read the data.

<resourceManagerConfiguration>
  <externalResources>
    <externalResource>
      <name>UimaAcronymTableFile</name> 
      <description>
         A table containing UIMA acronyms and their expanded forms.
      </description> 
      <fileResourceSpecifier>
        <fileUrl>file:org/apache/uima/tutorial/ex6/uimaAcronyms.txt
        </fileUrl> 
      </fileResourceSpecifier>
      <implementationName>
         org.apache.uima.tutorial.ex6.StringMapResource_impl
      </implementationName> 
    </externalResource>
  </externalResources>

  <externalResourceBindings>
    <externalResourceBinding>
      <key>AcronymTable</key>    
      <resourceName>UimaAcronymTableFile</resourceName> 
    </externalResourceBinding>
  </externalResourceBindings>
</resourceManagerConfiguration>

Better is a relative URL, which will be looked up within the classpath (and/or datapath), as used in this example. In this case, the file org/apache/uima/tutorial/ex6/uimaAcronyms.txt is located in uimaj-examples.jar, which is in the classpath. If you look in this file you will see the definitions of several UIMA acronyms.

The second section of the XML declares an externalResourceBinding, which connects the key AcronymTable, declared in the annotator's external resource dependency, to the actual resource name UimaAcronymTableFile.

When the Analysis Engine is initialized, it creates a single instance of StringMapResource_impl and loads it with the contents of the data file. To do this, the framework calls the instance's load method, passing it an instance of DataResource, from which you can obtain a stream or URI/URL of the external resource that was declared in the external resource. For resources where loading does not make sense, you can implement a load method that ignores its argument and just returns, or performs whatever initialization is appropriate at startup time.
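
What a load method might do with the stream can be sketched in plain Java. Everything here is an assumption made for illustration: AcronymTableSketch is not the tutorial's StringMapResource_impl, load takes an InputStream directly rather than a DataResource, and the file format (one tab-separated key and value per line) is a guess, not taken from uimaAcronyms.txt.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class AcronymTableSketch {
    private final Map<String, String> map = new HashMap<>();

    /** Reads "key<TAB>value" lines from the stream; lines without a tab are skipped. */
    void load(InputStream in) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    map.put(line.substring(0, tab), line.substring(tab + 1));
                }
            }
        }
    }

    /** Returns the expanded form, or null if the acronym is unknown. */
    String get(String key) {
        return map.get(key);
    }

    /** Convenience for trying the loader on in-memory data. */
    static AcronymTableSketch fromString(String data) {
        AcronymTableSketch table = new AcronymTableSketch();
        try {
            table.load(new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }
}
```

Since the framework creates and loads a single instance, the parsing cost is paid once and every annotator bound to the resource reads from the same map.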

The Sofa Feature Structure
Information about a Sofa is contained in a special built-in Feature Structure of type uima.cas.Sofa.

Features of the Sofa type include

SofaID: Every Sofa in a CAS has a unique SofaID. SofaIDs are the primary handle for access. This ID is often the same as the name string given to the Sofa by the developer, but it can be mapped to a different name (see Section 6.4, “Sofa Name Mapping”).
Mime type: This string feature can be used to describe the type of the data represented by a Sofa. It is not used by the framework; the framework provides APIs to set and get its value.
Sofa Data: The Sofa data itself. This data can be resident in the CAS or it can be a reference to data outside the CAS.
