All About Programming: Tika--Extracting Distinct Items from a Compound Document

When Tika encounters an embedded document, it checks the ParseContext for a parser registered under Parser.class, which serves as the recursing parser. If you have supplied one, Tika uses it to process every embedded resource. If you haven't, the embedded resources are skipped.
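To make that lookup concrete, here is a toy model using only the JDK. It is a sketch of the idea, not Tika's code: ToyParser, ToyContext and EmbeddedLookupSketch are hypothetical names invented for this illustration, and the embedded items are plain strings rather than real streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types mimic the lookup described above and are NOT Tika classes.
interface ToyParser {
    void parse(String embeddedName);
}

// Simplified stand-in for a type-keyed context: one value per class key.
class ToyContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        // Class.cast passes null through, so a missing key yields null.
        return key.cast(entries.get(key));
    }
}

class EmbeddedLookupSketch {
    // Hands each embedded item to the parser registered in the context, if any.
    // Returns the number of items that were handled.
    static int processEmbedded(List<String> embeddedNames, ToyContext context) {
        ToyParser recursing = context.get(ToyParser.class);
        if (recursing == null) {
            return 0; // nothing registered: embedded items are skipped
        }
        for (String name : embeddedNames) {
            recursing.parse(name);
        }
        return embeddedNames.size();
    }

    public static void main(String[] args) {
        List<String> embedded = Arrays.asList("report.pdf", "archive.zip");
        ToyContext context = new ToyContext();

        // No parser registered yet, so nothing is handled.
        System.out.println("handled without parser: " + processEmbedded(embedded, context));

        // Register a collector and run again.
        List<String> found = new ArrayList<>();
        ToyParser collector = name -> found.add(name);
        context.set(ToyParser.class, collector);
        System.out.println("handled with parser: " + processEmbedded(embedded, context));
        System.out.println("collected: " + found);
    }
}
```

The point of the model is the null check: whether embedded resources get processed depends entirely on what the caller put into the context before parsing started.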

So, what you probably want to do is something like:

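    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.apache.commons.io.IOUtils; // assumed: the Commons IO variant of IOUtils
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AbstractParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.xml.sax.ContentHandler;

    public static class HandleEmbeddedParser extends AbstractParser {
        // Temp files written for each embedded document seen so far
        public List<File> found = new ArrayList<File>();

        @Override
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            // Return the types you want to handle
            Set<MediaType> types = new HashSet<MediaType>();
            types.add(MediaType.application("pdf"));
            types.add(MediaType.application("zip"));
            return types;
        }

        @Override
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context
        ) throws IOException {
            // Do something with the child document, e.g. save it to disk
            File f = File.createTempFile("tika", "tmp");
            found.add(f);

            // try-with-resources closes the file even if the copy fails
            try (FileOutputStream fout = new FileOutputStream(f)) {
                IOUtils.copy(stream, fout);
            }
        }
    }

    // Register the handler as the recursing parser, then parse as usual
    ParseContext context = new ParseContext();
    context.set(Parser.class, new HandleEmbeddedParser());
    parser.parse(....);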
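If you would rather not depend on IOUtils for the "save to disk" step inside parse, the copy can be done with the JDK alone. The following is a minimal sketch under that assumption; TempFileSaver and saveToTempFile are hypothetical names chosen for this illustration and are not part of the original answer or of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TempFileSaver {
    // Copies everything remaining in the stream into a fresh temp file and returns its path.
    static Path saveToTempFile(InputStream stream) throws IOException {
        Path target = Files.createTempFile("tika", "tmp");
        // createTempFile already created the file, so the copy must be allowed to replace it.
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "embedded document bytes".getBytes(StandardCharsets.UTF_8);
        Path saved = saveToTempFile(new ByteArrayInputStream(payload));
        System.out.println("saved " + Files.size(saved) + " bytes to " + saved);
        Files.delete(saved); // clean up the demo file
    }
}
```

Inside a parser like the one above, the body of parse could call such a helper and record the returned path instead of managing a FileOutputStream by hand.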

Read full article from Tika--Extracting Distinct Items from a Compound Document - Stack Overflow
