Content mining with Apache Tika



To extract metadata or content by running Tika from the command line, use the prepackaged jar file. For example, this command outputs the contents of the file test.doc to standard output in text format:
java -jar tika-app-1.4.jar --text test.doc
If you just want the file's metadata, again in text format, try:
java -jar tika-app-1.4.jar --metadata test.doc
Short forms of these options are also available (for example, -t for --text and -m for --metadata); run java -jar tika-app-1.4.jar --help to get the full list. If you prefer, you can output the content as HTML (replace --text with --html) or XHTML (replace --text with --xml). Likewise, you can output the metadata as JSON (replace --metadata with --json) or XMP (replace --metadata with --xmp).
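To keep the format options straight in a script, a small lookup function can help. The sketch below maps a format name to the corresponding long option mentioned above; the function name tika_flag and the format names it accepts are hypothetical, chosen only for illustration, and are not part of Tika itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika): map an output format name
# to the long command-line option described in the text above.
tika_flag() {
    case "$1" in
        text)     echo "--text" ;;
        html)     echo "--html" ;;
        xhtml)    echo "--xml" ;;
        metadata) echo "--metadata" ;;
        json)     echo "--json" ;;
        xmp)      echo "--xmp" ;;
        *)        echo "unsupported format: $1" >&2; return 1 ;;
    esac
}

# Usage (assumes tika-app-1.4.jar is in the current directory):
#   java -jar tika-app-1.4.jar "$(tika_flag json)" test.doc
```

An unknown format name prints a message to standard error and returns a non-zero status, so a calling script can stop before invoking Tika with a bad option.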
You can also hook Tika into a standard Unix pipeline, as with any other Unix-style command: when no file name is given, Tika parses whatever arrives on standard input. For example, you can use cURL to fetch a file, parse its content into HTML using Tika, and then send that HTML output to a file:
curl http://example.com/test.doc | java -jar tika-app-1.4.jar --html > test.html
In addition to working with metadata and content, Tika can also detect the file type and even the language that a file is written in. This can be useful if metadata is lacking:
$ java -jar tika-app-1.4.jar --detect test.doc 
application/rtf
$ java -jar tika-app-1.4.jar --language test_french.doc 

You could also use the detected file type to route a file into another pipeline or another part of a Java app. Tika can also extract metadata from files that contain EXIF information.
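As a rough sketch of that routing idea, the shell function below branches on the MIME type string that --detect prints. The function name route_by_type and the pipeline labels it echoes are hypothetical placeholders; in a real script each branch would call whatever tool should handle that type.

```shell
#!/bin/sh
# Hypothetical helper: pick a next step from a MIME type string, such as
# the one printed by `java -jar tika-app-1.4.jar --detect FILE`.
# The labels echoed here are placeholders for real handlers.
route_by_type() {
    case "$1" in
        application/pdf)                    echo "pdf-pipeline" ;;
        application/rtf|application/msword) echo "office-pipeline" ;;
        image/*)                            echo "image-pipeline" ;;
        text/*)                             echo "text-pipeline" ;;
        *)                                  echo "unknown" ;;
    esac
}

# Usage (assumes tika-app-1.4.jar is in the current directory):
#   route_by_type "$(java -jar tika-app-1.4.jar --detect test.doc)"
```

Keeping the decision in a function that takes the type as an argument means the routing logic can be checked without running Tika at all.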
Please read full article from Content mining with Apache Tika
