All About Programming: java - get embedded resourses in doc files using apache tika

You need to define your own class that implements Parser and attach it to the ParseContext you supply when parsing the outer document. Your Parser will then be called for every embedded resource, allowing you to save each one out if you want to.

The best example I can think of is the Tika CLI, specifically its -z (extract) flag. In the source code for TikaCLI, the class to look at is FileEmbeddedDocumentExtractor.

The simplest code would be something like:

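    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Set;

    import org.apache.commons.io.IOUtils;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AbstractParser;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public class ExtractEmbedded {
        // Parses the outer document and reports which types it supports
        private static final AutoDetectParser parser = new AutoDetectParser();

        public static class ExtractParser extends AbstractParser {
            private int att = 0;

            @Override
            public Set<MediaType> getSupportedTypes(ParseContext context) {
                // Everything AutoDetectParser does
                return parser.getSupportedTypes(context);
            }

            @Override
            public void parse(
                    InputStream stream, ContentHandler handler,
                    Metadata metadata, ParseContext context)
                    throws IOException, SAXException, TikaException {
                // Stream each embedded resource to a new, numbered file
                File f = new File("out-" + (++att) + ".bin");
                try (FileOutputStream fout = new FileOutputStream(f)) {
                    IOUtils.copy(stream, fout);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // The original snippet left "handler" undefined; a BodyContentHandler
            // with no write limit (-1) is assumed here
            ContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            // Register the extracting parser so it is called for embedded resources
            context.set(Parser.class, new ExtractParser());
            try (InputStream input = new FileInputStream(new File("1.docx"))) {
                parser.parse(input, handler, metadata, context);
            }
        }
    }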

You can also use the EmbeddedDocumentExtractor interface if you'd rather. Whether that or using Parser directly is the better choice depends on what you want to do.
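For comparison, here is a minimal sketch of the EmbeddedDocumentExtractor route. The class name SavingExtractor, the getSaved() helper, and the "embedded-N.bin" naming scheme are made up for illustration; only the two interface methods come from Tika. It assumes tika-core is on the classpath and simply writes every embedded stream to a numbered file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical example class: saves every embedded resource to outputDir.
public class SavingExtractor implements EmbeddedDocumentExtractor {
    private final Path outputDir;
    private final List<Path> saved = new ArrayList<>();

    public SavingExtractor(Path outputDir) {
        this.outputDir = outputDir;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        // Accept every embedded resource; filter on metadata here if needed
        return true;
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        Files.createDirectories(outputDir);
        // Numbered names avoid trusting file names stored inside the document
        Path target = outputDir.resolve("embedded-" + (saved.size() + 1) + ".bin");
        // The stream is owned by the calling parser, so it is not closed here
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        saved.add(target);
    }

    // Hypothetical helper: paths written so far, in extraction order
    public List<Path> getSaved() {
        return saved;
    }
}
```

To use it, register it on the context in place of the Parser, e.g. context.set(EmbeddedDocumentExtractor.class, new SavingExtractor(Paths.get("out"))). Note that this sketch only saves the raw bytes and does not recurse into the embedded documents themselves.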

Read full article from java - get embedded resourses in doc files using apache tika - Stack Overflow
