java - Can I use Tika for content extraction on Google App Engine? - Stack Overflow



Short Answer:

No, not out of the box, at least not for much of its functionality.
Too much of the code relies on File objects and on creating temporary files, neither of which is available in GAE.

Long Answer:

Tika is open source, so you can hack at the code: change the methods that take File objects to take InputStream objects instead, and then you can process content that lives in the Blobstore or GCS.
Here is an example I am hacking at now:
    @NotNull
    public static Metadata readMetadata(@NotNull File file) throws JpegProcessingException, IOException
    {
        JpegSegmentReader segmentReader = new JpegSegmentReader(file);
        return extractMetadataFromJpegSegmentReader(segmentReader.getSegmentData());
    }
The same class already has a perfectly good overload that isn't tied to the File object:
    @NotNull
    public static Metadata readMetadata(@NotNull InputStream inputStream, final boolean waitForBytes) throws JpegProcessingException
    {
        JpegSegmentReader segmentReader = new JpegSegmentReader(inputStream, waitForBytes);
        return extractMetadataFromJpegSegmentReader(segmentReader.getSegmentData());
    }
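
The general shape of that change can be sketched with the standard library alone. This is only an illustration of the "stream first, convenience overloads on top" pattern, not code from Tika or its dependencies: the class name JpegSniffer and both methods are hypothetical, and the only format fact used is that a JPEG file begins with the bytes 0xFF 0xD8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper: the core logic only ever sees an InputStream,
// so it never needs a File or a temporary file.
final class JpegSniffer {
    private JpegSniffer() {}

    /** Returns true if the stream starts with the JPEG start-of-image marker (0xFF 0xD8). */
    static boolean looksLikeJpeg(InputStream in) throws IOException {
        int first = in.read();   // 0..255, or -1 at end of stream
        int second = in.read();
        return first == 0xFF && second == 0xD8;
    }

    /** Convenience overload for data already in memory, e.g. bytes fetched from the Blobstore or GCS. */
    static boolean looksLikeJpeg(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return looksLikeJpeg(in);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for convenience.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] pngHeader = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(looksLikeJpeg(jpegHeader));
        System.out.println(looksLikeJpeg(pngHeader));
    }
}
```

Any File-based entry point can then be a thin wrapper that opens a stream and delegates, which makes it easy to leave out on a platform without a writable file system.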

Some parse() methods will create temp files if you pass in a File directly or if you create the TikaInputStream from a file. You can also trigger temp-file creation yourself by calling getFile() or getFileChannel() on a TikaInputStream. So you may be able to avoid it by creating the TikaInputStream yourself and never using a File object in the process (i.e., by loading the content into memory first or streaming it somehow). However, if the parser implementation calls getFile() or getFileChannel() internally, you're out of luck, short of implementing the parser yourself.
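
A minimal sketch of the "load it into memory first" step, using only the standard library. The class InMemoryBuffer, the readFully method and its maxBytes cap are hypothetical names chosen for this example. The resulting bytes can be wrapped in a ByteArrayInputStream and, presumably, handed on via TikaInputStream.get(InputStream); that last step is not exercised here, and it does not help if the parser itself asks for a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: buffers a stream in memory so that nothing
// downstream needs to spool it to a temporary file.
final class InMemoryBuffer {
    private InMemoryBuffer() {}

    /** Reads the whole stream into a byte array, failing if it exceeds maxBytes. */
    static byte[] readFully(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            if (out.size() + n > maxBytes) {
                // Guard against exhausting the instance's memory on large uploads.
                throw new IOException("Input exceeds " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a stream opened on a Blobstore or GCS object.
        InputStream source = new ByteArrayInputStream("hello".getBytes("UTF-8"));
        byte[] data = readFully(source, 1024 * 1024);

        // A fresh in-memory stream that can be passed to stream-based APIs.
        InputStream inMemory = new ByteArrayInputStream(data);
        System.out.println(inMemory.available());
    }
}
```

The size cap is a design choice rather than a requirement: buffering trades file-system access for heap, so it only suits content small enough to fit comfortably in memory.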
