When Tika hits an embedded document, it checks the ParseContext to see whether you have supplied a parser to recurse with. If you have, Tika uses that parser to process any embedded resources. If you haven't, the embedded resources are skipped.
So, what you probably want to do is something like:
    public static class HandleEmbeddedParser extends AbstractParser {
        public List<File> found = new ArrayList<File>();

        @Override
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            // Return what you want to handle
            HashSet<MediaType> types = new HashSet<MediaType>();
            types.add(MediaType.application("pdf"));
            types.add(MediaType.application("zip"));
            return types;
        }

        @Override
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context
        ) throws IOException {
            // Do something with the child documents
            // eg save to disk
            File f = File.createTempFile("tika", "tmp");
            found.add(f);
            // IOUtils here is assumed to be org.apache.commons.io.IOUtils;
            // try-with-resources closes the file even if the copy fails
            try (FileOutputStream fout = new FileOutputStream(f)) {
                IOUtils.copy(stream, fout);
            }
        }
    }

    // Register the parser under Parser.class so Tika finds it for embedded documents
    ParseContext context = new ParseContext();
    context.set(Parser.class, new HandleEmbeddedParser());
    parser.parse(....);
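If it helps to see the lookup-or-skip behaviour in isolation, here is a minimal sketch that models it with plain Java and no Tika on the classpath. Everything in it (EmbeddedLookupSketch, Context, EmbeddedHandler, processEmbedded) is a hypothetical stand-in invented for illustration, not Tika's real classes or internals. It only shows the idea of a type-keyed context that is consulted when embedded resources turn up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "look in the context, else skip". Not Tika code.
class EmbeddedLookupSketch {

    /** Stand-in for whatever should receive each embedded resource. */
    interface EmbeddedHandler {
        void handle(String resourceName);
    }

    /** Stand-in for a type-keyed context: values are stored under a Class key. */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast passes null through, so a missing entry yields null
            return key.cast(entries.get(key));
        }
    }

    /** Hands each embedded resource to the registered handler; returns how many were handled. */
    static int processEmbedded(List<String> embedded, Context context) {
        EmbeddedHandler handler = context.get(EmbeddedHandler.class);
        if (handler == null) {
            return 0; // nothing registered, so embedded resources are skipped
        }
        for (String name : embedded) {
            handler.handle(name);
        }
        return embedded.size();
    }

    public static void main(String[] args) {
        List<String> embedded = Arrays.asList("report.pdf", "archive.zip");

        // No handler registered: everything is skipped
        System.out.println(processEmbedded(embedded, new Context())); // prints 0

        // Handler registered: every embedded resource reaches it
        List<String> found = new ArrayList<>();
        Context context = new Context();
        context.set(EmbeddedHandler.class, found::add);
        System.out.println(processEmbedded(embedded, context)); // prints 2
        System.out.println(found); // prints [report.pdf, archive.zip]
    }
}
```

The point of the sketch is the null check: with nothing registered under the key, the embedded items are silently dropped, which is why the answer above sets a parser on the ParseContext before calling parse.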
Read full article from Tika--Extracting Distinct Items from a Compound Document - Stack Overflow