All About Programming: Metadata extraction with Apache Tika

Metadata extraction with Apache Tika

Tika defines a standard API and uses existing libraries such as POI and PDFBox for its content extraction. At the time of writing, the current release of Tika is version 0.6, and the following file formats are already supported:

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format

I want to see what kind of EXIF information can be retrieved from an image by using Tika.
The most important part of the above code example is using the JpegParser to parse the .JPG file and the creation of the Metadata object with the appropriate information.
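The parsing step described above can be sketched roughly as follows. This is a minimal sketch and not the original test case: the class name JpegMetadataExample, the command-line file path and the "Model" metadata key are assumptions made for illustration. The import for JpegParser assumes the org.apache.tika.parser.jpeg package used by older Tika releases (newer releases, 2.x as far as I can tell, keep it in org.apache.tika.parser.image). It also assumes the four-argument parse method that takes a ParseContext.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
// Assumption: package name used by older Tika releases; adjust for your version.
import org.apache.tika.parser.jpeg.JpegParser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of Tika or of the original post.
public class JpegMetadataExample {

    // Parses the JPEG file at the given path and returns the metadata Tika collected.
    static Metadata extractMetadata(String path)
            throws IOException, SAXException, TikaException {
        Metadata metadata = new Metadata();
        Parser parser = new JpegParser();
        try (InputStream stream = new FileInputStream(path)) {
            // Only the metadata is of interest, so a no-op SAX handler
            // is enough to receive (and ignore) the content events.
            parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = extractMetadata(args[0]);
        // Assumption: the camera model is stored under the key "Model".
        // The key differs between Tika versions; get() returns null if it is absent.
        System.out.println("Camera model: " + metadata.get("Model"));
    }
}
```

If the lookup prints null, listing all available field names is the quickest way to find the key your Tika version actually uses.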

Of course, the above test case only checks the camera model, but the Metadata object holds much more information than just that. All the fields found in the image's metadata can be listed quite easily, for instance with the following method.

// Prints every metadata field name together with its value.
private void listAvailableMetaDataFields(final Metadata metadata) {
    // Call names() once instead of on every loop iteration.
    for (final String name : metadata.names()) {
        System.out.println(name + " : " + metadata.get(name));
    }
}

Read full article from Metadata extraction with Apache Tika
