All About Programming: Content mining with Apache Tika

To extract metadata or content by running Tika from the command line, use the prepackaged jar file. For example, this command outputs the contents of the file test.doc to standard output in text format:

java -jar tika-app-1.4.jar --text test.doc

If you just want the file's metadata, again in text format, try:

java -jar tika-app-1.4.jar --metadata test.doc

Short forms of these options are also available (for example, -t for --text and -m for --metadata); run java -jar tika-app-1.4.jar --help to get the full list of available options. If you prefer, you can output the content as HTML (replace --text with --html) or XHTML (replace --text with --xml), and the metadata as JSON (replace --metadata with --json) or XMP (replace --metadata with --xmp).
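As a rough sketch of how these output options combine, the following shell function writes one file per format for a single document. The function name tika_export_all, the TIKA_JAR variable, and the output file naming are assumptions made for illustration; only the flags themselves come from the options described above.

```shell
# Path to the Tika app jar; adjust to match your installation (assumption).
TIKA_JAR="${TIKA_JAR:-tika-app-1.4.jar}"

# Hypothetical helper: export one document in every format mentioned above.
# Example: tika_export_all test.doc  writes test.doc.txt, test.doc.html, ...
tika_export_all() {
  doc="$1"
  # Each pair is "<tika long option>:<output file extension>".
  for pair in text:txt html:html xml:xhtml metadata:meta.txt json:json xmp:xmp; do
    flag="${pair%%:*}"   # part before the first colon
    ext="${pair#*:}"     # part after the first colon
    java -jar "$TIKA_JAR" "--$flag" "$doc" > "$doc.$ext" || return 1
  done
}
```

Each format starts a separate JVM, so this is convenient for a handful of files rather than for bulk processing.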

Like any other Unix-style command, Tika can be hooked into a standard pipeline: when no file name is given, it parses whatever arrives on standard input. For example, you can use cURL to fetch a file, parse its content into HTML using Tika, and then send that HTML output to a file:

curl http://example.com/test.doc | java -jar tika-app-1.4.jar --html > test.html

In addition to working with metadata and content, Tika can also detect a file's type and even the language it is written in. This can be useful if metadata is lacking. The --detect option prints the detected MIME type, and --language prints a language code:

$ java -jar tika-app-1.4.jar --detect test.doc
application/rtf
$ java -jar tika-app-1.4.jar --language test_french.doc

You could also use the file type detection output to hook a file into another pipeline or another part of a Java app. Tika can even extract metadata from files that contain EXIF information, such as digital photos.
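One way to act on the detection output is sketched below: detect the type first, then choose which Tika option to run. The helper names tika_flag_for_type and tika_extract, the TIKA_JAR variable, and the rule "images get metadata, everything else gets text" are all assumptions made for illustration, not part of Tika itself.

```shell
# Path to the Tika app jar; adjust to match your installation (assumption).
TIKA_JAR="${TIKA_JAR:-tika-app-1.4.jar}"

# Hypothetical helper: map a MIME type (as printed by --detect) to the
# Tika option we want to use for that kind of file.
tika_flag_for_type() {
  case "$1" in
    image/*) echo "--metadata" ;;  # e.g. photos: EXIF and other metadata
    *)       echo "--text" ;;      # everything else: plain-text content
  esac
}

# Hypothetical helper: detect the type, then run Tika with the chosen option.
tika_extract() {
  mime=$(java -jar "$TIKA_JAR" --detect "$1") || return 1
  java -jar "$TIKA_JAR" "$(tika_flag_for_type "$mime")" "$1"
}
```

With these definitions loaded, tika_extract test.doc would print the document's text, while the same call on an image file would print its metadata instead.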

Please read full article from Content mining with Apache Tika
