Package it.unimi.di.mg4j.document.tika

This package contains classes that expose Tika parsers as MG4J factories.

Class Summary
AbstractSimpleTikaDocumentFactory An abstract document factory that provides an implementation for AbstractSimpleTikaDocumentFactory.getDocument(InputStream, Reference2ObjectMap) and AbstractSimpleTikaDocumentFactory.fields().
AbstractTikaDocumentFactory An abstract document factory that provides the mapping from field names to field indices.
AutoDetectDocumentFactory A document factory that automatically detects the type of the document content.
EPUBDocumentFactory A document factory for the EPUB format.
GreedyTikaField The set of all Tika metadata represented as a single field inside MG4J.
HtmlDocumentFactory A document factory for the HTML format.
MSOfficeDocumentFactory A document factory for the Microsoft Office format.
OOXMLDocumentFactory A document factory for the OOXML format.
OpenDocumentDocumentFactory A document factory for the Open Document format.
PdfDocumentFactory A document factory for the PDF format.
RTFDocumentFactory A document factory for the RTF format.
TextDocumentFactory A document factory for the text format; the character set will be autodetected.
TikaField A Tika field represented inside MG4J.
XMLDocumentFactory A document factory for XML.
 

Package it.unimi.di.mg4j.document.tika Description

This package contains classes that expose Tika parsers as MG4J factories. Each type of Tika metadata is mapped, when possible, to an MG4J field. However, when using an AutoDetectDocumentFactory, or any other factory in which metadata fields are user-definable or otherwise variable, it is impossible to provide a static listing of all available fields, because they depend on the actual factory used to parse the document. In this case, a GreedyTikaField instance returns some useful data to the caller by (essentially) concatenating the string representations of all metadata fields.
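
The "greedy" concatenation described above can be illustrated with a minimal, self-contained sketch. This is not the GreedyTikaField implementation and it uses neither the MG4J nor the Tika API: the class name GreedyMetadataSketch, the method concatenate, and the plain Map used as a stand-in for parser metadata are all hypothetical, chosen only to show the idea of collapsing a variable set of metadata fields into one searchable string.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration only: not part of MG4J or Tika.
public class GreedyMetadataSketch {

    /**
     * Joins every non-empty metadata value into a single space-separated
     * string, following the iteration order of the map. The metadata names
     * themselves are ignored, so the result does not depend on knowing the
     * set of fields in advance.
     */
    public static String concatenate(Map<String, List<String>> metadata) {
        StringJoiner joiner = new StringJoiner(" ");
        for (List<String> values : metadata.values()) {
            for (String value : values) {
                if (value != null && !value.isEmpty()) {
                    joiner.add(value);
                }
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Assumed sample metadata; a real parser decides which names appear.
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("title", Arrays.asList("Annual report"));
        metadata.put("creator", Arrays.asList("Alice", "Bob"));
        System.out.println(concatenate(metadata));
    }
}
```

Because the names are discarded, a caller can index the resulting string as a single field without a static listing of the metadata that a given document happens to carry.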