Package it.unimi.di.big.mg4j.document.tika
This package contains classes that expose Tika
parsers as MG4J factories.
Each type of Tika metadata is mapped, when possible, to an MG4J field.
However, when using an
AutoDetectDocumentFactory
or any other factory in which
metadata fields are user-definable or otherwise variable, it is impossible to
provide a static listing of all available fields, as they depend on the
actual factory used to parse the document. In this case, an instance of
a GreedyTikaField
is used to return some useful data to the caller
by (essentially) concatenating the string representations of all metadata fields.
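
The idea of folding all metadata into a single field can be illustrated with a minimal, self-contained sketch. It is not the actual GreedyTikaField implementation: it assumes the metadata are already available as an ordered map from names to string values, and the names GreedyConcatSketch and concatenate are hypothetical, introduced only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical illustration; not part of MG4J or Tika. */
public class GreedyConcatSketch {

    /**
     * Joins the string representations of all metadata values with a single
     * space, skipping null and empty values.
     * Assumption: metadata are exposed as a name-to-value map.
     */
    public static String concatenate(Map<String, String> metadata) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String value = entry.getValue();
            if (value != null && !value.isEmpty()) {
                joiner.add(value);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the result is deterministic.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "A Sample Document");
        metadata.put("author", "Jane Doe");
        metadata.put("subject", ""); // empty values are skipped
        System.out.println(concatenate(metadata)); // prints "A Sample Document Jane Doe"
    }
}
```

A single concatenated field loses the distinction between individual metadata names, but it lets the caller index whatever the parser found without knowing the field set in advance.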

Class Summary

Class                               Description
AbstractSimpleTikaDocumentFactory   An abstract document factory that provides an implementation for AbstractSimpleTikaDocumentFactory.getDocument(InputStream, Reference2ObjectMap) and AbstractSimpleTikaDocumentFactory.fields().
AbstractTikaDocumentFactory         An abstract document factory that provides the mapping from field names to field indices.
AutoDetectDocumentFactory           A document factory that automatically detects the type of the document content.
EPUBDocumentFactory                 A document factory for the epub format.
GreedyTikaField                     The set of all Tika metadata represented as a single field inside MG4J.
HtmlDocumentFactory                 A document factory for the HTML format.
MSOfficeDocumentFactory             A document factory for the Microsoft Office format.
OOXMLDocumentFactory                A document factory for the OOXML format.
OpenDocumentDocumentFactory         A document factory for the Open Document format.
PdfDocumentFactory                  A document factory for the PDF format.
RTFDocumentFactory                  A document factory for the RTF format.
TextDocumentFactory                 A document factory for the text format; the character set will be autodetected.
TikaField                           A Tika field represented inside MG4J.
XMLDocumentFactory                  A document factory for XML.