All About Programming: Tika Official Docs

void parse(
    InputStream stream, ContentHandler handler, Metadata metadata,
    ParseContext context) throws IOException, SAXException, TikaException;

The parse method takes the document to be parsed and related metadata as input and outputs the results as XHTML SAX events and extra metadata. The parse context argument is used to specify context information (like the current locale) that is not related to any individual document.

Input metadata

A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.

The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events. XHTML is used to express structured content of the document and SAX events enable streamed processing.

Dealing with the raw SAX events can be a bit complex, so Apache Tika comes with a number of utility classes that can be used to process and convert the event stream to other representations.

For example, the BodyContentHandler class can be used to extract just the body part of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string.
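To make the idea of body extraction from SAX events concrete, here is a minimal sketch that uses only the JDK. It is not Tika's BodyContentHandler: the class name BodyTextCollector and its extract method are hypothetical, and the JDK SAX parser reading a small XHTML string stands in for the events a Tika parser would emit.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): keeps only character data found inside <body>.
class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than zero while inside the <body> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body> and track nesting of everything below it.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text in <head> (for example the title) is skipped because bodyDepth is zero there.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }

    // Assumption for illustration: the XHTML arrives as a string and is replayed as SAX events.
    static String extract(String xhtml) {
        BodyTextCollector collector = new BodyTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XHTML events", e);
        }
        return collector.text();
    }
}
```

Because the handler only reacts to events, it never needs the whole document in memory, which is the streaming property that SAX output is meant to preserve.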

Another useful class is ParsingReader, which uses a background thread to parse the document and returns the extracted text content as a character stream:

InputStream stream = ...; // the document to be parsed
Reader reader = new ParsingReader(parser, stream, ...);
try {
    ...;                  // read the document text using the reader
} finally {
    reader.close();       // the document stream is closed automatically
}

Parse context

The final argument to the parse method is used to inject context-specific information into the parsing process. This is useful, for example, when dealing with locale-specific date and number formats in Microsoft Excel spreadsheets. Another important use of the parse context is passing in the delegate parser instance to be used by two-phase parsers like the PackageParser subclasses. Some parser classes allow customization of the parsing process through strategy objects in the parse context.
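The following JDK-only sketch illustrates the general pattern of injecting per-call context that is looked up by type. It is not Tika's ParseContext implementation: TypedContext and formatNumber are hypothetical names, and the number formatting merely stands in for the locale-specific spreadsheet handling mentioned above.

```java
import java.text.NumberFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a parse context: at most one value per type, looked up by class.
class TypedContext {
    private final Map<Class<?>, Object> values = new HashMap<>();

    // Registers the object that code asking for this type should receive.
    <T> void set(Class<T> key, T value) {
        values.put(key, value);
    }

    // Returns the registered object, or the given default when nothing was registered.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = values.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }

    // Parser-side code: formats a cell value with whatever locale the caller injected.
    static String formatNumber(double number, TypedContext context) {
        Locale locale = context.get(Locale.class, Locale.ROOT);
        return NumberFormat.getNumberInstance(locale).format(number);
    }
}
```

A caller would register a value once, for example context.set(Locale.class, Locale.GERMANY), and pass the same context to every parse call. Parsers that do not care about the locale simply never ask for it, so the method signature stays the same as new kinds of context are added.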

The goal of Tika is to reuse existing parser libraries like PDFBox or Apache POI as much as possible, and so most of the parser classes in Tika are adapters to such external libraries.

Tika also contains some general-purpose parser implementations that are not targeted at any specific document format. The most notable of these is the AutoDetectParser class, which encapsulates all Tika functionality into a single parser that can handle any type of document. This parser automatically determines the type of the incoming document based on various heuristics and then parses the document accordingly.
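As a rough illustration of what heuristic type detection can look like, the sketch below checks the first bytes of a document for well-known signatures and falls back to the file name supplied as input metadata. It is a heavily simplified, hypothetical example (SimpleTypeDetector is not a Tika class), and the handful of types it knows about were chosen only for demonstration.

```java
import java.util.Locale;

// Hypothetical, heavily simplified type detection; not Tika's actual detector.
class SimpleTypeDetector {
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    // prefix: the first bytes of the document; fileName: optional hint, may be null.
    static String detect(byte[] prefix, String fileName) {
        // Heuristic 1: signature bytes at the very start of the stream.
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, ZIP_MAGIC)) {
            return "application/zip";
        }
        // Heuristic 2: fall back to the file name extension when one was supplied.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".txt")) {
                return "text/plain";
            }
            if (lower.endsWith(".html")) {
                return "text/html";
            }
        }
        // Nothing matched: report the generic binary type.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Once a type has been chosen, an auto-detecting parser can hand the stream to whichever format-specific parser is registered for that type, which is why callers only need to deal with a single parser instance.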

http://tika.apache.org/1.6/formats.html

Please read full article from Tika Official Docs