All About Programming: Tika JAX-RS Server

http://wiki.apache.org/tika/TikaJAXRS

tika-server is Tika's JSR 311 (JAX-RS) network server. It uses the Apache CXF framework, which provides a JAX-RS implementation for Java, and builds to a standalone package within Tika. To build and start it:

mvn install
cd ./tika-server/target/
java -jar tika-server-x.x.jar

To listen on a specific host and port:

java -jar tika-server-x.x.jar --host=intranet.local --port=12345

All services that take files use HTTP "PUT" requests. The original file must be sent in the request body without any additional encoding (do not use multipart/form-data or other containers).

Information services (e.g. defined MIME types, defined parsers) work with HTTP "GET" requests.

You may optionally specify the content type in the "Content-Type" header. If you do not specify a MIME type, Tika will use its detectors to guess it.
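
The request shape described above can be sketched in Python with only the standard library. This is a minimal sketch, not part of tika-server: the helper names `put_headers` and `put_file` are hypothetical, and the host and port defaults simply mirror the curl examples on this page. `http.client` is used because it sends no "Content-Type" header unless one is supplied, which leaves detection to Tika.

```python
import http.client


def put_headers(content_type=None):
    """Headers for a raw-body PUT; Content-Type is included only when known."""
    return {"Content-Type": content_type} if content_type else {}


def put_file(path, body, content_type=None, host="localhost", port=9998):
    """PUT raw bytes (no multipart wrapper) and return (status, response bytes).

    host and port are assumptions that match the curl examples in this post.
    """
    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        conn.request("PUT", path, body=body, headers=put_headers(content_type))
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


# Usage (requires a running server):
#   with open("zipcode.csv", "rb") as f:
#       status, payload = put_file("/meta", f.read(), "text/csv")
```

Passing the file bytes directly as the body keeps the upload free of any container encoding, which is what the services expect.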

You may append an additional identifier to the URL after the resource name, like "/tika/my-file-i-sent-to-tika-resource" for the "/tika" resource. tika-server uses this name only for logging, so you may put a file name, UUID or any other identifier there (do not forget to URL-encode any special characters).
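
As a rough illustration of the URL-encoding step, the following Python sketch builds a resource path with an encoded identifier. The function name `resource_path` and the sample file name are hypothetical; only `urllib.parse.quote` comes from the standard library.

```python
from urllib.parse import quote


def resource_path(resource, identifier=None):
    """Return '/resource' or '/resource/<percent-encoded identifier>'."""
    path = "/" + resource.strip("/")
    if identifier is None:
        return path
    # safe="" also encodes '/', so the identifier stays a single path segment.
    return path + "/" + quote(identifier, safe="")


print(resource_path("tika", "Q3 report (final).pdf"))
# prints /tika/Q3%20report%20%28final%29.pdf
```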

PUT a document to the /meta service to get back its metadata as CSV:

$ curl -X PUT --data-binary @zipcode.csv http://localhost:9998/meta --header "Content-Type: text/csv"
$ curl -T price.xls http://localhost:9998/meta

Returns:

"Content-Encoding","ISO-8859-2"
"Content-Type","text/plain"
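
A client will usually want that CSV as a lookup table. The sketch below parses the two-column response shown above into a dictionary; `parse_meta_csv` is a hypothetical helper, and treating any extra columns as additional values of the same key is an assumption rather than documented behaviour.

```python
import csv
import io


def parse_meta_csv(text):
    """Turn name,value rows (as in the sample response) into a dict."""
    metadata = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        name, values = row[0], row[1:]
        # Assumption: extra columns are further values for the same name.
        metadata[name] = values[0] if len(values) == 1 else values
    return metadata


sample = '"Content-Encoding","ISO-8859-2"\n"Content-Type","text/plain"\n'
print(parse_meta_csv(sample))
# prints {'Content-Encoding': 'ISO-8859-2', 'Content-Type': 'text/plain'}
```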

PUT a document to the /tika service to get back the extracted text. A GET request to /tika returns a greeting stating that the server is up.

curl -X GET http://localhost:9998/tika

$ curl -X PUT --data-binary @GeoSPARQL.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/html"
$ curl -T price.xls http://localhost:9998/tika --header "Accept: text/plain"

PUT a document to /detect/stream and Tika uses its Default Detector to identify the MIME/media type. Providing a filename hint can improve the quality of detection.

$ curl -X PUT --data-binary @TODO.rtf http://localhost:9998/detect/stream

PUT a CSV file without a filename hint and get back text/plain:

$ curl -X PUT --upload-file foo.csv http://localhost:9998/detect/stream

PUT a CSV file with a filename hint and get back text/csv:

$ curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream
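
The same filename hint can be sent from Python. This is a sketch under stated assumptions: `detect_headers` and `detect` are hypothetical names, the header value copies the unquoted form used in the curl example above, and the host and port defaults are taken from those examples.

```python
import http.client


def detect_headers(filename=None):
    """Build the optional Content-Disposition hint used by the curl example."""
    if filename is None:
        return {}
    return {"Content-Disposition": "attachment; filename=" + filename}


def detect(body, filename=None, host="localhost", port=9998):
    """PUT bytes to /detect/stream and return the detected media type as text."""
    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        conn.request("PUT", "/detect/stream", body=body,
                     headers=detect_headers(filename))
        return conn.getresponse().read().decode("utf-8").strip()
    finally:
        conn.close()


# Usage (requires a running server):
#   data = open("foo.csv", "rb").read()
#   detect(data)              # no hint
#   detect(data, "foo.csv")   # with filename hint
```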

PUT a zip file to /unpacker and get back its embedded files as a zip archive:

$ curl -X PUT --data-binary @foo.zip http://localhost:9998/unpacker --header "Content-type: application/zip"

PUT a doc file to /unpacker and get back its embedded files as a tar archive:

$ curl -T Doc1_ole.doc -H "Accept: application/x-tar" http://localhost:9998/unpacker > /var/tmp/x.tar

"All" resource

Get text, metadata and attachments in one request.

$ curl -T Doc1_ole.doc http://localhost:9998/all > /var/tmp/x.zip
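
Once the archive has been saved, its entries can be inspected without unpacking it to disk. The sketch below lists entry names and sizes for any zip held in memory; `list_zip_entries` is a hypothetical helper, and the entry name in the demo is a stand-in, since the names the server actually uses are not described here.

```python
import io
import zipfile


def list_zip_entries(archive_bytes):
    """Return {entry name: uncompressed size} for a zip archive in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {info.filename: info.file_size for info in archive.infolist()}


# Demo with a stand-in archive built locally (not real server output).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("example.txt", b"hello")
print(list_zip_entries(buffer.getvalue()))
# prints {'example.txt': 5}
```

For the file written by the curl command above, the call would be `list_zip_entries(open("/var/tmp/x.zip", "rb").read())`.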

/mime-types

/detectors

Extracting A Document From A URL

It is possible to use a remote file with TikaJAXRS by downloading it via its URL first then piping it to the appropriate service:

$ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/meta
$ curl -s "http://url/to/my.file" | curl -X PUT -T - http://localhost:9998/tika

The caveat with the above is that it fetches the entire file, so large files such as video can take some time to download. Therefore, you may wish to use curl to get preliminary information (content type, name and size) about the file before you proceed:

$ curl -I http://url/to/my.file

If the file should be parsed (e.g. you only want to get information about mp3s, mp4s and PDFs), send it on to TikaJAXRS.
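
That decision can be scripted. The following Python sketch issues a HEAD request and checks the reported headers before anything is downloaded. The names `head` and `should_parse`, the `WANTED_TYPES` set (audio/mpeg, video/mp4 and application/pdf as the usual media types for mp3, mp4 and PDF) and the optional size limit are all assumptions made for illustration. The remote server's reported type may also be wrong, so this is only a pre-filter.

```python
import urllib.request

# Assumed media types for mp3, mp4 and PDF files.
WANTED_TYPES = {"audio/mpeg", "video/mp4", "application/pdf"}


def head(url):
    """Fetch response headers only, like `curl -I`."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return dict(response.headers.items())


def should_parse(headers, wanted=WANTED_TYPES, max_bytes=None):
    """Decide from HEAD headers whether the file is worth downloading."""
    lowered = {name.lower(): value for name, value in headers.items()}
    # Drop parameters such as "; charset=..." before comparing.
    media_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    if media_type not in wanted:
        return False
    length = lowered.get("content-length")
    if max_bytes is not None and length is not None and length.isdigit():
        return int(length) <= max_bytes
    return True


# Usage (requires network access):
#   if should_parse(head("http://url/to/my.file"), max_bytes=50_000_000):
#       ...download the file and PUT it to /meta or /tika...
```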

Please read full article from http://wiki.apache.org/tika/TikaJAXRS
