java - get embedded resources in doc files using Apache Tika - Stack Overflow



You need to define your own class that implements Parser and attach it to the ParseContext you supply when parsing the outer document. Your Parser will then be called for each embedded resource, allowing you to save them out if you want to.

The best example I can think of for this is in the Tika CLI, where it is used by the -z (extract) flag. If you look in the source code for TikaCLI, the class you want as your example is FileEmbeddedDocumentExtractor.
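As a rough illustration of the "save them out" step on its own, independent of Tika, the sketch below writes each incoming stream to its own file under an output directory. The class name `EmbeddedFileSaver`, the `save` method, and the `embedded-N.bin` fallback name are assumptions made up for this example; they are not part of Tika or TikaCLI. The `nameHint` argument stands in for whatever resource name the embedded document's metadata happens to carry.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: saves each embedded stream to its own file. */
class EmbeddedFileSaver {
    private final Path outputDir;
    private int count = 0;

    EmbeddedFileSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /**
     * Copies the stream into outputDir and returns the file written.
     * nameHint may be null or contain directories; only its last path
     * segment is kept, so the file always lands directly in outputDir.
     */
    Path save(InputStream stream, String nameHint) throws IOException {
        count++;
        String name = "embedded-" + count + ".bin"; // fallback name
        if (nameHint != null) {
            // Strip any directory part, treating both separators alike
            String leaf = nameHint.replace('\\', '/');
            leaf = leaf.substring(leaf.lastIndexOf('/') + 1);
            if (!leaf.isEmpty() && !leaf.equals(".") && !leaf.equals("..")) {
                // Prefix the counter so repeated names do not collide
                name = count + "-" + leaf;
            }
        }
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(name);
        // Files.copy leaves the input stream open; the caller owns it
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        EmbeddedFileSaver saver =
                new EmbeddedFileSaver(Files.createTempDirectory("extract"));
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(saver.save(new ByteArrayInputStream(data), "../image1.png"));
        System.out.println(saver.save(new ByteArrayInputStream(data), null));
    }
}
```

Keeping only the last path segment is a deliberate choice here: names stored inside a document are untrusted input, so the sketch never lets them decide which directory gets written to.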

The simplest code would be something like:

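    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Set;

    import org.apache.commons.io.IOUtils;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AbstractParser;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // Fragment: place the declarations below inside your own class/method
    final AutoDetectParser parser = new AutoDetectParser();

    public class ExtractParser extends AbstractParser {
        private int att = 0;

        public Set<MediaType> getSupportedTypes(ParseContext context) {
            // Everything AutoDetectParser does
            return parser.getSupportedTypes(context);
        }

        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            // Stream each embedded resource to a new file
            File f = new File("out-" + (++att) + ".bin");
            try (FileOutputStream fout = new FileOutputStream(f)) {
                IOUtils.copy(stream, fout);
            }
        }
    }

    // The original snippet left extractParser and handler undefined;
    // this is one plausible way to set them up
    ExtractParser extractParser = new ExtractParser();
    ContentHandler handler = new BodyContentHandler();

    InputStream input = new FileInputStream(new File("1.docx"));
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    // Tika calls this parser for every embedded resource
    context.set(Parser.class, extractParser);
    parser.parse(input, handler, metadata, context);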

You can also implement the EmbeddedDocumentExtractor interface instead, if you'd rather. Whether that or using Parser directly is better depends on what you want to do.
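For comparison, here is a minimal sketch of the EmbeddedDocumentExtractor route. It assumes tika-core and the Tika parsers are on the classpath; the class name `SavingExtractor`, the `savedCount` helper, the `extracted` output directory, and the `out-N.bin` naming are illustrative choices, not anything Tika prescribes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

/** Hypothetical extractor: writes every embedded resource to disk. */
class SavingExtractor implements EmbeddedDocumentExtractor {
    private final Path outputDir;
    private int count = 0;

    SavingExtractor(Path outputDir) {
        this.outputDir = outputDir;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        // Accept everything; filter on the metadata here if needed
        return true;
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve("out-" + (++count) + ".bin");
        // Copy without closing the stream; the calling parser owns it
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
    }

    int savedCount() {
        return count;
    }

    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Register the extractor under the interface type
        context.set(EmbeddedDocumentExtractor.class,
                new SavingExtractor(Paths.get("extracted")));
        try (InputStream input = Files.newInputStream(Paths.get("1.docx"))) {
            // -1 disables BodyContentHandler's default write limit
            parser.parse(input, new BodyContentHandler(-1), new Metadata(), context);
        }
    }
}
```

One practical difference from the Parser approach: `shouldParseEmbedded` gives you a place to skip resources based on their metadata before any bytes are read.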



