You need to define your own class that implements Parser and set it on the ParseContext you supply when parsing the outer document. Your Parser will then be called for each embedded resource, allowing you to save them out if you want to.
The best example I can think of is in the Tika CLI, behind the -z (extract) flag. In the TikaCLI source code, the class to look at is FileEmbeddedDocumentExtractor.
The simplest code would be something like:
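    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Set;
    import org.apache.commons.io.IOUtils;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AbstractParser;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class ExtractEmbedded {
        static final AutoDetectParser parser = new AutoDetectParser();

        public static class ExtractParser extends AbstractParser {
            private int att = 0;

            public Set<MediaType> getSupportedTypes(ParseContext context) {
                // Everything AutoDetectParser does
                return parser.getSupportedTypes(context);
            }

            public void parse(InputStream stream, ContentHandler handler,
                              Metadata metadata, ParseContext context)
                    throws IOException, SAXException, TikaException {
                // Stream each embedded resource to a new file
                File f = new File("out-" + (++att) + ".bin");
                try (FileOutputStream fout = new FileOutputStream(f)) {
                    IOUtils.copy(stream, fout);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Handler for the outer document's text; DefaultHandler just discards it
            ContentHandler handler = new DefaultHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            // Register our parser so it is called for the embedded resources
            context.set(Parser.class, new ExtractParser());
            try (InputStream input = new FileInputStream(new File("1.docx"))) {
                parser.parse(input, handler, metadata, context);
            }
        }
    }

You can also use the EmbeddedDocumentExtractor interface if you'd rather. Whether that or using Parser directly is the better fit depends on what you want to do.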
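The parse method above writes every resource as out-N.bin, which loses the original file name. If you want to keep the name, something like the following could be called from parse instead. This is only a sketch that uses nothing but the JDK: EmbeddedResourceSaver, safeName and save are names I made up for illustration, not part of Tika. It assumes you pass in whatever name you have for the embedded resource (I believe Tika puts it in the Metadata under the resource name key, but check that for your version), and it falls back to the counter-based name when there is none.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not a Tika class): saves embedded resources under safe names. */
final class EmbeddedResourceSaver {
    private final Path outputDir;
    private int count = 0;

    EmbeddedResourceSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Builds a file name from the counter and the (possibly null) resource name. */
    static String safeName(int index, String resourceName) {
        if (resourceName == null || resourceName.isEmpty()) {
            return "out-" + index + ".bin";
        }
        // Keep only the last path segment, so "../../x" cannot escape outputDir.
        String base = resourceName.replace('\\', '/');
        base = base.substring(base.lastIndexOf('/') + 1);
        // Replace anything outside a conservative character set.
        base = base.replaceAll("[^A-Za-z0-9._-]", "_");
        if (base.isEmpty() || base.equals(".") || base.equals("..")) {
            return "out-" + index + ".bin";
        }
        return "out-" + index + "-" + base;
    }

    /** Copies the stream to outputDir; the caller keeps ownership of the stream. */
    Path save(InputStream stream, String resourceName) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(safeName(++count, resourceName));
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

The counter prefix keeps names unique within one run even if two embedded resources share a name, and the stream is deliberately not closed here since the caller of parse owns it.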
Read full article from java - get embedded resourses in doc files using apache tika - Stack Overflow