RecursiveMetadata - Tika Wiki




If you parse an archive (zip, tar, etc.), the parsed document contains other documents, and any of those could itself be an archive containing further documents, and so on. The example on this page shows you how to do the following:
  • Set up the parse context so nested documents will be parsed.
  • Wrap the AutoDetectParser so you can get the text and metadata for each nested document.
This example writes the metadata and body text for each nested document to standard output.
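import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParserDecorator;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// The class name is illustrative; the original listing shows only the members.
public class RecursiveMetadataExample {

    public static void main(String[] args) throws Exception {
        // Register the decorating parser so nested documents are parsed with it too.
        Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // This handler is a no-op; RecursiveMetadataParser supplies its own.
        ContentHandler handler = new DefaultHandler();
        Metadata metadata = new Metadata();

        InputStream stream = TikaInputStream.get(new File(args[0]));
        try {
            parser.parse(stream, handler, metadata, context);
        } finally {
            stream.close();
        }
    }

    private static class RecursiveMetadataParser extends ParserDecorator {

        public RecursiveMetadataParser(Parser parser) {
            super(parser);
        }

        @Override
        public void parse(
                InputStream stream, ContentHandler ignore,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            // A fresh handler per document keeps each document's text separate.
            ContentHandler content = new BodyContentHandler();
            super.parse(stream, content, metadata, context);

            System.out.println("----");
            System.out.println(metadata);
            System.out.println("----");
            System.out.println(content.toString());
        }
    }
}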

Setting up Recursive Parsing

   public static void main(String[] args) throws Exception {
       Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
       ParseContext context = new ParseContext();
       context.set(Parser.class, parser);
The example starts by setting up recursive parsing. If you are parsing text files, Word documents, etc., you'll never notice whether recursive parsing is enabled or not. If you are parsing containers like zip files and tar.gz files, the only way to get the text for the files inside the containers is to enable recursive parsing.

The way to enable recursive parsing is to create a ParseContext and add a parser to it as shown on the line context.set(Parser.class, parser). This is the parser that will be used to parse any nested documents.

Parsing a File

       ContentHandler handler = new DefaultHandler();
       Metadata metadata = new Metadata();

       InputStream stream = TikaInputStream.get(new File(args[0]));
       try {
           parser.parse(stream, handler, metadata, context);
       } finally {
           stream.close();
       }
The rest of the main function parses a file. The parser used to parse the root document is the same parser that was added to the ParseContext as the parser to use for nested documents.

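Looking at the Tika API (http://tika.apache.org/0.7/api/), I don't see a DefaultHandler class or a TikaInputStream. DefaultHandler is not a Tika class: it is org.xml.sax.helpers.DefaultHandler, which ships with the JDK, so it will not appear in the Tika API docs. If your Tika release has no TikaInputStream, you could use FileInputStream in its place. You could also pass a BodyContentHandler instead of DefaultHandler, although RecursiveMetadataParser ignores the handler it is given either way.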
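To make the point about DefaultHandler concrete, here is a minimal sketch that uses only JDK classes. It shows that a plain DefaultHandler silently discards SAX events, while a subclass can keep the character data. CollectingHandler and emit are hypothetical names made up for this illustration; they are not part of Tika, and CollectingHandler only loosely resembles what BodyContentHandler does.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerSketch {

    /** Hypothetical handler that keeps character data (not a Tika class). */
    static class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Sends a tiny stream of SAX events to the given handler. */
    static void emit(ContentHandler handler, String body) throws SAXException {
        handler.startDocument();
        char[] chars = body.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        // DefaultHandler implements every callback as a no-op, so this text is dropped.
        emit(new DefaultHandler(), "discarded");

        // The subclass overrides characters(), so this text is retained.
        CollectingHandler collecting = new CollectingHandler();
        emit(collecting, "kept");
        System.out.println(collecting.text); // prints "kept"
    }
}
```

This is why passing a DefaultHandler from main is harmless in the example: nothing reads from it.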

RecursiveMetadataParser parse

       public void parse(
               InputStream stream, ContentHandler ignore,
               Metadata metadata, ParseContext context)
               throws IOException, SAXException, TikaException {
           ContentHandler content = new BodyContentHandler();
           super.parse(stream, content, metadata, context);
           System.out.println("----");
           System.out.println(metadata);
           System.out.println("----");
           System.out.println(content.toString());
       }
   }
The parse method is where you get access to the metadata and the body text. When the parser set in ParseContext is used to parse a nested document, a new Metadata object is created and passed to the parse method. Since the example put a RecursiveMetadataParser in the ParseContext, RecursiveMetadataParser's parse method is called. Before calling super.parse, the metadata object is largely empty (a container parser may already have set a few fields, such as the entry's name). After super.parse returns, the metadata object contains all of the metadata the decorated parser found, and System.out.println(metadata) prints it to standard output.

By creating a new BodyContentHandler and passing that to super.parse, the text for each document is captured without mixing it with text from other documents.
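The control flow described above can be modelled without Tika at all. The following is a toy sketch, not Tika code: Doc, ToyParser, BaseParser and CapturingParser are hypothetical stand-ins for a document, the Parser interface, the decorated parser and RecursiveMetadataParser respectively, and a StringBuilder stands in for the content handler. It assumes, as in the example, that nested documents are handed to whichever parser the caller registered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerDocumentBufferSketch {

    /** Toy document: a name, its own text, and any nested documents. */
    static final class Doc {
        final String name;
        final String text;
        final List<Doc> children;

        Doc(String name, String text, Doc... children) {
            this.name = name;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    interface ToyParser {
        void parse(Doc doc, StringBuilder out, ToyParser nestedParser);
    }

    /** Stand-in for the decorated parser: writes text, delegates children. */
    static final class BaseParser implements ToyParser {
        public void parse(Doc doc, StringBuilder out, ToyParser nestedParser) {
            out.append(doc.text);
            for (Doc child : doc.children) {
                // The child is offered the parent's buffer, as a nested parse would be.
                nestedParser.parse(child, out, nestedParser);
            }
        }
    }

    /** Stand-in for RecursiveMetadataParser: ignores the caller's buffer. */
    static final class CapturingParser implements ToyParser {
        private final ToyParser delegate;
        final List<String> captured = new ArrayList<>();

        CapturingParser(ToyParser delegate) {
            this.delegate = delegate;
        }

        public void parse(Doc doc, StringBuilder ignore, ToyParser nestedParser) {
            StringBuilder own = new StringBuilder(); // fresh buffer per document
            delegate.parse(doc, own, nestedParser);
            captured.add(doc.name + ": " + own);
        }
    }

    static List<String> demo() {
        Doc archive = new Doc("archive.zip", "",
                new Doc("a.txt", "alpha"), new Doc("b.txt", "beta"));
        CapturingParser parser = new CapturingParser(new BaseParser());
        parser.parse(archive, new StringBuilder(), parser);
        return parser.captured;
    }

    public static void main(String[] args) {
        // prints [a.txt: alpha, b.txt: beta, archive.zip: ]
        System.out.println(demo());
    }
}
```

Two things show up in the output of this model: each document's text stays separate because every call allocates its own buffer, and nested documents are reported before their container because the reporting happens after the delegate returns.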


The great thing about AutoDetectParser is that it can parse and extract text from almost anything. In particular, it can parse zip, tar, tar.bz2, and other archives that contain documents. If you have a zip file with 100 text files in it, using Jukka's example code you can get the text and metadata for each file nested inside of the zip file. What you might not expect is that you also get metadata and body text for the zip file itself.


If you aren't interested in seeing text and metadata for the zip file itself, check metadata.get(Metadata.CONTENT_TYPE) for each file Tika parses so you can skip the archives themselves. For a zip file, the content type is "application/zip".
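A content-type check like this could be factored into a small helper. The sketch below is plain Java with no Tika dependency; ArchiveTypeFilter and isArchive are hypothetical names, and only "application/zip" is taken from the text above. The other media types in the set are assumptions, so check them against the values Tika actually reports for your files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ArchiveTypeFilter {

    // Only "application/zip" comes from the article; the remaining entries are
    // assumed values and should be verified against real Tika output.
    private static final Set<String> ARCHIVE_TYPES = new HashSet<>(Arrays.asList(
            "application/zip",
            "application/x-tar",
            "application/gzip",
            "application/x-bzip2"));

    /** Returns true if the content type names one of the listed archive types. */
    static boolean isArchive(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop any parameters, e.g. "text/plain; charset=UTF-8" -> "text/plain".
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return ARCHIVE_TYPES.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isArchive("application/zip"));           // true
        System.out.println(isArchive("text/plain; charset=UTF-8")); // false
    }
}
```

Inside RecursiveMetadataParser's parse method, one possible use is to call isArchive(metadata.get(Metadata.CONTENT_TYPE)) after super.parse returns and skip the println calls when it returns true.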
