Class AbstractSimpleTikaDocumentFactory
- java.lang.Object
- it.unimi.di.big.mg4j.document.AbstractDocumentFactory
- it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
- it.unimi.di.big.mg4j.document.tika.AbstractTikaDocumentFactory
- it.unimi.di.big.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
- All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
- Direct Known Subclasses:
AutoDetectDocumentFactory, EPUBDocumentFactory, HtmlDocumentFactory, MSOfficeDocumentFactory, OOXMLDocumentFactory, OpenDocumentDocumentFactory, PdfDocumentFactory, RTFDocumentFactory, TextDocumentFactory, XMLDocumentFactory
public abstract class AbstractSimpleTikaDocumentFactory extends AbstractTikaDocumentFactory
An abstract document factory that provides an implementation for getDocument(InputStream, Reference2ObjectMap) and fields(). Moreover, it gets a WordReader object using the PropertyBasedDocumentFactory.MetadataKeys.WORDREADER property.
Concrete subclasses must provide a getParser() method and may optionally override metadataFields() (which currently returns the empty list) to return the list of Tika fields provided by this factory. Note that getParser() should always return the same instance, as Tika parsers are immutable and thread-safe.
- Author:
- Salvatore Insalaco
- See Also:
- Serialized Form
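The subclass contract described above can be sketched as follows. This is a minimal, self-contained illustration and not MG4J code: `Parser`, `SimpleFactory` and `PlainTextFactory` are hypothetical stand-ins for `org.apache.tika.parser.Parser`, this class and a concrete subclass, and fields are modelled as plain strings rather than `TikaField` objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SubclassContractSketch {
    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    interface Parser {
        String parse(String raw);
    }

    /** Hypothetical stand-in for the abstract factory documented here. */
    static abstract class SimpleFactory {
        /** Subclasses must supply the parser. */
        protected abstract Parser getParser();

        /** Defaults to the empty list; subclasses may override it. */
        protected List<String> metadataFields() {
            return Collections.emptyList();
        }
    }

    /** Hypothetical concrete subclass. */
    static class PlainTextFactory extends SimpleFactory {
        // A single stateless parser shared by every instance of this factory.
        private static final Parser PARSER = String::trim;

        @Override
        protected Parser getParser() {
            return PARSER; // always the same instance
        }

        @Override
        protected List<String> metadataFields() {
            return Arrays.asList("title", "author"); // illustrative field names
        }
    }

    public static void main(String[] args) {
        PlainTextFactory a = new PlainTextFactory();
        PlainTextFactory b = new PlainTextFactory();
        System.out.println(a.getParser() == b.getParser());
        System.out.println(a.metadataFields());
    }
}
```

Holding the parser in a static final field is one simple way to guarantee that every call, and every copy of the factory, sees the same instance.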
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
PropertyBasedDocumentFactory.MetadataKeys
-
Nested classes/interfaces inherited from interface it.unimi.di.big.mg4j.document.DocumentFactory
DocumentFactory.FieldType
-
-
Field Summary
-
Fields inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
-
-
Constructor Summary
Constructors
Constructor Description
AbstractSimpleTikaDocumentFactory()
AbstractSimpleTikaDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
AbstractSimpleTikaDocumentFactory(Properties properties)
AbstractSimpleTikaDocumentFactory(String[] property)
-
Method Summary
Modifier and Type Method Description
DocumentFactory
copy()
protected List<TikaField>
fields()
Returns the list of Tika fields (they will be mapped to MG4J fields whose index is their index in the list).
Document
getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
Returns the document obtained by parsing the given byte stream.
protected abstract org.apache.tika.parser.Parser
getParser()
The parser to be used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
protected List<? extends TikaField>
metadataFields()
The list of Tika fields (apart from content) that this factory provides; it returns the empty list, so most subclasses may want to override this method.
protected boolean
parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata)
Parses a property with given key and value, adding it to the given map.
-
Methods inherited from class it.unimi.di.big.mg4j.document.tika.AbstractTikaDocumentFactory
fieldIndex, fieldName, fieldType, numberOfFields
-
Methods inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey, toString
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentFactory
ensureFieldIndex
-
-
-
-
Constructor Detail
-
AbstractSimpleTikaDocumentFactory
public AbstractSimpleTikaDocumentFactory()
-
AbstractSimpleTikaDocumentFactory
public AbstractSimpleTikaDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
-
AbstractSimpleTikaDocumentFactory
public AbstractSimpleTikaDocumentFactory(Properties properties) throws org.apache.commons.configuration.ConfigurationException
- Throws:
org.apache.commons.configuration.ConfigurationException
-
AbstractSimpleTikaDocumentFactory
public AbstractSimpleTikaDocumentFactory(String[] property) throws org.apache.commons.configuration.ConfigurationException
- Throws:
org.apache.commons.configuration.ConfigurationException
-
-
Method Detail
-
parseProperty
protected boolean parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata) throws org.apache.commons.configuration.ConfigurationException
Description copied from class: PropertyBasedDocumentFactory
Parses a property with given key and value, adding it to the given map.
Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.
Subclasses should do their own parsing, returning true in case of success and returning super.parseProperty() otherwise.
- Overrides:
parseProperty in class PropertyBasedDocumentFactory
- Parameters:
key - the property key.
values - the property value; this is an array, because properties may have a list of comma-separated values.
metadata - the metadata map.
- Returns:
- true if the property was parsed correctly, false if it was ignored.
- Throws:
org.apache.commons.configuration.ConfigurationException
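The delegation rule stated above (handle your own keys, otherwise defer to super.parseProperty()) can be sketched as below. This is a self-contained illustration under stated assumptions, not the MG4J implementation: `BaseFactory` and `MyFactory` are hypothetical stand-ins, the key `maxlength` is invented, a plain `Map<String, Object>` replaces `Reference2ObjectMap<Enum<?>,Object>`, and the checked `ConfigurationException` is left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParsePropertySketch {
    /** Hypothetical stand-in for the parent factory: it only understands "locale". */
    static class BaseFactory {
        protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
            if (key.equalsIgnoreCase("locale")) {
                metadata.put("locale", Locale.forLanguageTag(values[0]));
                return true;
            }
            return false; // unknown keys are ignored
        }
    }

    /** Hypothetical subclass that adds one key of its own. */
    static class MyFactory extends BaseFactory {
        @Override
        protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
            if (key.equalsIgnoreCase("maxlength")) { // invented key, for illustration only
                metadata.put("maxlength", Integer.valueOf(values[0]));
                return true;
            }
            // Not one of ours: let the parent class try.
            return super.parseProperty(key, values, metadata);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        MyFactory factory = new MyFactory();
        System.out.println(factory.parseProperty("maxlength", new String[] { "100" }, metadata));
        System.out.println(factory.parseProperty("locale", new String[] { "en-US" }, metadata));
        System.out.println(factory.parseProperty("unknown", new String[] { "x" }, metadata));
        System.out.println(metadata.keySet());
    }
}
```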
-
fields
protected List<TikaField> fields()
Description copied from class: AbstractTikaDocumentFactory
Returns the list of Tika fields (they will be mapped to MG4J fields whose index is their index in the list).
- Specified by:
fields in class AbstractTikaDocumentFactory
- Returns:
- the list of Tika fields.
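To make the "index in the list" mapping concrete, here is a small self-contained sketch. It models fields as plain strings instead of `TikaField` objects, and the helper `fieldIndex` below is a hypothetical stand-in written for this example, not the inherited MG4J method; the field names are illustrative.

```java
import java.util.Arrays;
import java.util.List;

public class FieldIndexSketch {
    /** Returns the position of the named field in the list, or -1 if it is absent. */
    static int fieldIndex(List<String> fields, String name) {
        return fields.indexOf(name);
    }

    public static void main(String[] args) {
        // Illustrative field list: the position in the list is the field index.
        List<String> fields = Arrays.asList("text", "title", "author");
        for (String name : fields) {
            System.out.println(name + " -> " + fieldIndex(fields, name));
        }
    }
}
```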
-
getDocument
public Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata) throws IOException
Description copied from interface: DocumentFactory
Returns the document obtained by parsing the given byte stream.
The metadata parameter makes up for the lack of a simple keyword-based parameter-passing system in Java. This method might take several different types of “suggestions” that have been collected by the collection: typically, the document title, a URI representing the document, its MIME type, its encoding and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory). Implementations of this method must consult the metadata provided by the collection, possibly complete them with default factory metadata, and proceed to the document construction.
- Parameters:
rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
- Returns:
- the document obtained by parsing the given byte stream.
- Throws:
IOException
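The "consult the provided metadata, then complete it with factory defaults" step can be sketched as follows. This is a minimal illustration under stated assumptions: the `Key` enum and the `lookup` helper are hypothetical names introduced here, and `java.util.EnumMap` stands in for the `Reference2ObjectMap<Enum<?>,Object>` used by the real signature.

```java
import java.util.EnumMap;
import java.util.Map;

public class MetadataResolutionSketch {
    /** Hypothetical metadata keys, for illustration only. */
    enum Key { TITLE, ENCODING }

    /** Prefers the per-document metadata and falls back to the factory defaults. */
    static Object lookup(Key key, Map<Key, Object> metadata, Map<Key, Object> defaults) {
        Object value = metadata.get(key);
        return value != null ? value : defaults.get(key);
    }

    public static void main(String[] args) {
        Map<Key, Object> defaults = new EnumMap<>(Key.class);
        defaults.put(Key.ENCODING, "UTF-8");

        // "Suggestions" passed along with one particular document.
        Map<Key, Object> metadata = new EnumMap<>(Key.class);
        metadata.put(Key.TITLE, "Sample document");

        System.out.println(lookup(Key.TITLE, metadata, defaults));
        System.out.println(lookup(Key.ENCODING, metadata, defaults));
    }
}
```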
-
metadataFields
protected List<? extends TikaField> metadataFields()
The list of Tika fields (apart from content) that this factory provides; it returns the empty list, so most subclasses may want to override this method.
- Returns:
- the list of Tika fields that this factory provides.
-
getParser
protected abstract org.apache.tika.parser.Parser getParser()
The parser to be used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
- Returns:
- the parser to be used to parse this kind of documents.
-
copy
public DocumentFactory copy()
-
-