Content mining with Apache Tika
To extract metadata or content by running Tika from the command line, use the prepackaged jar file. For example, this command outputs the contents of the file test.doc to standard output in text format:
java -jar tika-app-1.4.jar --text test.doc
If you just want the file's metadata, again in text format, try:
java -jar tika-app-1.4.jar --metadata test.doc
Short forms of these options are also available; run
java -jar tika-app-1.4.jar --help
to get the full list of available options. You can output the content in HTML (replace --text with --html) or XHTML (replace --text with --xml) if you prefer. You can output the metadata as JSON (replace --metadata with --json) or XMP (replace --metadata with --xmp).
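If you switch between these output formats often, a small wrapper can keep the flag names in one place. The following is only a sketch: tika_flag is a hypothetical helper (not part of Tika), and the format names on the left are my own labels for the flags listed above.

```shell
# Hypothetical helper: map a friendly format name to the Tika flag
# described in the text. The names on the left are assumptions;
# only the flags on the right come from Tika itself.
tika_flag() {
  case "$1" in
    text)     printf '%s\n' '--text' ;;      # plain-text content
    html)     printf '%s\n' '--html' ;;      # content as HTML
    xhtml)    printf '%s\n' '--xml' ;;       # content as XHTML
    metadata) printf '%s\n' '--metadata' ;;  # metadata as text
    json)     printf '%s\n' '--json' ;;      # metadata as JSON
    xmp)      printf '%s\n' '--xmp' ;;       # metadata as XMP
    *)        printf 'unknown format: %s\n' "$1" >&2; return 1 ;;
  esac
}
```

With that function defined, java -jar tika-app-1.4.jar "$(tika_flag json)" test.doc would be equivalent to passing --json directly.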
You can also hook Tika into a standard Unix pipeline, as with any other Unix-style command. For example, you can use cURL to fetch a file, parse its content into HTML using Tika, and then send that HTML output to a file:
curl http://example.com/test.doc | java -jar tika-app-1.4.jar --html > test.html
In addition to working with metadata and content, Tika can also detect the file type and even the language that a file is written in. This can be useful if metadata is lacking:
$ java -jar tika-app-1.4.jar --detect test.doc
application/rtf
$ java -jar tika-app-1.4.jar --language test_french.doc
You could also use the filetype detection output to hook a file into another pipeline or another part of a Java app. Tika can even handle metadata from files that contain EXIF information.
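As a rough illustration of hooking the detection output into another pipeline, the sketch below routes a file based on the MIME type that --detect prints. The function names route_by_type and detect_and_route, the TIKA_JAR variable, and the pipeline labels (pdf, word, text, image) are all hypothetical; substitute whatever downstream steps you actually have.

```shell
# Path to the Tika app jar; the default matches the commands above.
TIKA_JAR="${TIKA_JAR:-tika-app-1.4.jar}"

# Hypothetical router: choose a downstream pipeline label from a MIME type.
route_by_type() {
  case "$1" in
    application/pdf)                    echo "pdf" ;;
    application/msword|application/rtf) echo "word" ;;
    text/*)                             echo "text" ;;
    image/*)                            echo "image" ;;
    *)                                  echo "unknown" ;;
  esac
}

# Ask Tika for the file type, then hand the result to the router.
# Fails (returns non-zero) if the java command itself fails.
detect_and_route() {
  mime=$(java -jar "$TIKA_JAR" --detect "$1") || return 1
  route_by_type "$mime"
}
```

Calling detect_and_route test.doc would then print the label for whichever type Tika reports, which a calling script can use to decide what to run next.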