Tika--Extracting Distinct Items from a Compound Document - Stack Overflow



When Tika hits an embedded document, it checks the ParseContext to see whether you have supplied a recursing parser. If you have, Tika uses that parser to process any embedded resources. If you haven't, it skips them.

So, what you probably want to do is something like:

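import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.io.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Declared as a nested class inside your own class
public static class HandleEmbeddedParser extends AbstractParser {
    // Temp files written for each embedded document seen so far
    public List<File> found = new ArrayList<File>();

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        // Return what you want to handle
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.application("pdf"));
        types.add(MediaType.application("zip"));
        return types;
    }

    @Override
    public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context
    ) throws IOException, SAXException, TikaException {
        // Do something with the child documents
        // e.g. save to disk
        File f = File.createTempFile("tika", "tmp");
        found.add(f);

        // try-with-resources closes the file even if the copy fails
        try (FileOutputStream fout = new FileOutputStream(f)) {
            IOUtils.copy(stream, fout);
        }
    }
}

// Register the handler so Tika uses it for embedded resources
ParseContext context = new ParseContext();
context.set(Parser.class, new HandleEmbeddedParser());
parser.parse(....);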
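
To make the lookup described above more concrete, here is a small self-contained sketch of the idea using only the JDK. It is a toy model, not Tika's actual code: the names EmbeddedLookupSketch, Context, EmbeddedHandler and processContainer are all made up for illustration, and the only thing borrowed from Tika is the general shape of registering an object in a context under a class key and looking it up later. If nothing is registered, the embedded resources are skipped; if a handler is registered, each resource is handed to it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedLookupSketch {

    /** Hypothetical stand-in for a parser that receives embedded resources. */
    interface EmbeddedHandler {
        void handle(String resourceName);
    }

    /** Hypothetical type-keyed context (an assumption, not Tika's ParseContext). */
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast returns null when nothing was registered under the key
            return key.cast(entries.get(key));
        }
    }

    /** Hands each embedded resource to the registered handler; returns how many were handled. */
    static int processContainer(List<String> embedded, Context context) {
        EmbeddedHandler handler = context.get(EmbeddedHandler.class);
        if (handler == null) {
            return 0; // no handler supplied: embedded resources are skipped
        }
        for (String name : embedded) {
            handler.handle(name);
        }
        return embedded.size();
    }

    public static void main(String[] args) {
        List<String> embedded = Arrays.asList("report.pdf", "archive.zip");

        // Case 1: nothing registered in the context
        Context empty = new Context();
        System.out.println("handled without handler: " + processContainer(embedded, empty));

        // Case 2: a handler is registered and collects what it is given
        List<String> found = new ArrayList<>();
        Context withHandler = new Context();
        withHandler.set(EmbeddedHandler.class, found::add);
        System.out.println("handled with handler: " + processContainer(embedded, withHandler));
        System.out.println("collected: " + found);
    }
}
```

In the real answer above, HandleEmbeddedParser plays the role of the handler and context.set(Parser.class, ...) plays the role of the registration step.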


