it.unimi.di.mg4j.document.tika
Class AbstractSimpleTikaDocumentFactory

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
          extended by it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
              extended by it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
Direct Known Subclasses:
AutoDetectDocumentFactory, EPUBDocumentFactory, HtmlDocumentFactory, MSOfficeDocumentFactory, OOXMLDocumentFactory, OpenDocumentDocumentFactory, PdfDocumentFactory, RTFDocumentFactory, TextDocumentFactory, XMLDocumentFactory

public abstract class AbstractSimpleTikaDocumentFactory
extends AbstractTikaDocumentFactory

An abstract document factory that provides an implementation of getDocument(InputStream, Reference2ObjectMap) and fields(). It also obtains a WordReader object using the PropertyBasedDocumentFactory.MetadataKeys.WORDREADER property.

Concrete subclasses must provide a getParser() method and may optionally override metadataFields() (which currently returns the empty list) to return the list of Tika fields provided by this factory. Note that getParser() should always return the same instance, as Tika parsers are immutable and thread-safe.
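
The subclassing contract above can be illustrated with a minimal, library-free sketch. Everything in it is a hypothetical stand-in: SketchParser takes the place of org.apache.tika.parser.Parser, SketchSimpleFactory the place of this class, and plain strings the place of TikaField and Document. It only shows the shape of the pattern (one shared parser instance, an overridden metadataFields()), not the actual MG4J implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class SharedParserDemo {
    public static void main(String[] args) {
        SketchTextFactory first = new SketchTextFactory();
        SketchTextFactory second = new SketchTextFactory();
        // Both factories hand out the very same parser object.
        System.out.println(first.getParser() == second.getParser());
        System.out.println(first.getDocument("  some text  "));
        System.out.println(first.metadataFields());
    }
}

/** Hypothetical stand-in for org.apache.tika.parser.Parser (not the real interface). */
interface SketchParser {
    String parse(String rawContent);
}

/** Hypothetical stand-in for the abstract factory described on this page. */
abstract class SketchSimpleFactory {
    /** Subclasses must supply the parser. */
    protected abstract SketchParser getParser();

    /** Default: no metadata fields, mirroring the documented empty-list default. */
    protected List<String> metadataFields() {
        return Collections.emptyList();
    }

    /** Parses the content with whatever parser the subclass supplies. */
    public String getDocument(String rawContent) {
        return getParser().parse(rawContent);
    }
}

/** A concrete subclass that shares one parser instance across all factories. */
final class SketchTextFactory extends SketchSimpleFactory {
    // Created once; sharing is safe only because the parser is stateless.
    private static final SketchParser PARSER = String::trim;

    @Override
    protected SketchParser getParser() {
        return PARSER; // always the same instance, never a fresh one per call
    }

    @Override
    protected List<String> metadataFields() {
        return Arrays.asList("title", "author"); // illustrative field names
    }
}
```

Holding the parser in a static final field is one way to guarantee that every call, and every copy of the factory, sees the same instance; an instance field initialized once would serve a similar purpose.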

Author:
Salvatore Insalaco
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
PropertyBasedDocumentFactory.MetadataKeys
 
Nested classes/interfaces inherited from interface it.unimi.di.mg4j.document.DocumentFactory
DocumentFactory.FieldType
 
Field Summary
 
Fields inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
AbstractSimpleTikaDocumentFactory()
           
AbstractSimpleTikaDocumentFactory(Properties properties)
           
AbstractSimpleTikaDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
           
AbstractSimpleTikaDocumentFactory(String[] property)
           
 
Method Summary
 DocumentFactory copy()
           
protected  List<TikaField> fields()
          Returns the list of Tika fields (they will be mapped to MG4J fields whose index is their index in the list).
 Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
          Returns the document obtained by parsing the given byte stream.
protected abstract  org.apache.tika.parser.Parser getParser()
          The parser used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
protected  List<? extends TikaField> metadataFields()
          The list of Tika fields (apart from the content) that this factory provides; this implementation returns the empty list, so most subclasses will want to override this method.
protected  boolean parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata)
          Parses a property with given key and value, adding it to the given map.
 
Methods inherited from class it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
fieldIndex, fieldName, fieldType, numberOfFields
 
Methods inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

AbstractSimpleTikaDocumentFactory

public AbstractSimpleTikaDocumentFactory()

AbstractSimpleTikaDocumentFactory

public AbstractSimpleTikaDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)

AbstractSimpleTikaDocumentFactory

public AbstractSimpleTikaDocumentFactory(Properties properties)
                                  throws ConfigurationException
Throws:
ConfigurationException

AbstractSimpleTikaDocumentFactory

public AbstractSimpleTikaDocumentFactory(String[] property)
                                  throws ConfigurationException
Throws:
ConfigurationException
Method Detail

parseProperty

protected boolean parseProperty(String key,
                                String[] values,
                                Reference2ObjectMap<Enum<?>,Object> metadata)
                         throws ConfigurationException
Description copied from class: PropertyBasedDocumentFactory
Parses a property with given key and value, adding it to the given map.

Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.

Subclasses should do their own parsing, returning true in case of success and returning super.parseProperty() otherwise.

Overrides:
parseProperty in class PropertyBasedDocumentFactory
Parameters:
key - the property key.
values - the property value; this is an array, because properties may have a list of comma-separated values.
metadata - the metadata map.
Returns:
true if the property was parsed correctly, false if it was ignored.
Throws:
ConfigurationException
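
The override convention described for parseProperty (handle your own keys, return true on success, otherwise defer to the superclass) might look like the following sketch. It is self-contained and does not use MG4J: SketchBaseFactory, SketchSubFactory, SketchKeys, the "maxlength" key and the requireSingle helper are all hypothetical, java.util.Map stands in for Reference2ObjectMap, and IllegalArgumentException stands in for ConfigurationException.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ParsePropertyDemo {
    public static void main(String[] args) {
        Map<Enum<?>, Object> metadata = new HashMap<>();
        SketchSubFactory factory = new SketchSubFactory();
        // Handled by the subclass itself.
        System.out.println(factory.parseProperty("maxlength", new String[] { "128" }, metadata));
        // Not known to the subclass, so it is delegated to the base class.
        System.out.println(factory.parseProperty("locale", new String[] { "it" }, metadata));
        // Known to nobody: reported as ignored.
        System.out.println(factory.parseProperty("unknown", new String[] { "x" }, metadata));
    }
}

/** Hypothetical metadata keys used only by this sketch. */
enum SketchKeys { LOCALE, MAX_LENGTH }

/** Hypothetical base class that understands only the locale property. */
class SketchBaseFactory {
    protected boolean parseProperty(String key, String[] values, Map<Enum<?>, Object> metadata) {
        if (key.equalsIgnoreCase("locale")) {
            metadata.put(SketchKeys.LOCALE, Locale.forLanguageTag(requireSingle(key, values)));
            return true;
        }
        return false; // property ignored
    }

    /** Hypothetical helper: rejects properties that do not carry exactly one value. */
    protected static String requireSingle(String key, String[] values) {
        if (values.length != 1) {
            throw new IllegalArgumentException("Property " + key + " needs exactly one value");
        }
        return values[0];
    }
}

/** Hypothetical subclass that adds one property and defers everything else. */
final class SketchSubFactory extends SketchBaseFactory {
    @Override
    protected boolean parseProperty(String key, String[] values, Map<Enum<?>, Object> metadata) {
        if (key.equalsIgnoreCase("maxlength")) {
            metadata.put(SketchKeys.MAX_LENGTH, Integer.valueOf(requireSingle(key, values)));
            return true;
        }
        return super.parseProperty(key, values, metadata);
    }
}
```

Delegating last keeps the chain open: each level recognizes its own keys, and a false result from the top of the hierarchy means no class claimed the property.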

fields

protected List<TikaField> fields()
Description copied from class: AbstractTikaDocumentFactory
Returns the list of Tika fields (they will be mapped to MG4J fields whose index is their index in the list).

Specified by:
fields in class AbstractTikaDocumentFactory
Returns:
the list of Tika fields.
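
As a small illustration of the index mapping mentioned above, the sketch below treats the position of a field in the list as its field index. Field names are plain strings here instead of TikaField objects, and placing a content field named "text" before the metadata fields is an assumption made for the example, not something this page states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FieldIndexDemo {
    /** Builds the full field list: content first (an assumption), then metadata fields. */
    static List<String> fields(List<String> metadataFields) {
        List<String> all = new ArrayList<>();
        all.add("text"); // hypothetical name for the content field
        all.addAll(metadataFields);
        return all;
    }

    /** Returns the position of the field in the list, or -1 if it is absent. */
    static int fieldIndex(List<String> fields, String name) {
        return fields.indexOf(name);
    }

    public static void main(String[] args) {
        List<String> fields = fields(Arrays.asList("title", "author"));
        for (int i = 0; i < fields.size(); i++) {
            System.out.println(i + " -> " + fields.get(i));
        }
    }
}
```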

getDocument

public Document getDocument(InputStream rawContent,
                            Reference2ObjectMap<Enum<?>,Object> metadata)
                     throws IOException
Description copied from interface: DocumentFactory
Returns the document obtained by parsing the given byte stream.

The metadata parameter makes up for the lack of a simple keyword-based parameter-passing system in Java. This method may receive several different types of “suggestions” gathered by the collection: typically, the document title, a URI representing the document, its MIME type, its encoding, and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory). Implementations of this method must consult the metadata provided by the collection, possibly complete it with default factory metadata, and then proceed to construct the document.

Parameters:
rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
Returns:
the document obtained by parsing the given byte stream.
Throws:
IOException
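
The metadata handling described for getDocument (consult the per-call map first, then fall back to the factory defaults, and leave the stream open) can be sketched as follows. This is not the MG4J implementation: ResolveMetadataDemo, its Keys enum, and the string it returns in place of a Document are hypothetical, and java.util.Map stands in for Reference2ObjectMap.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class ResolveMetadataDemo {
    /** Hypothetical metadata keys used only by this sketch. */
    enum Keys { ENCODING, TITLE }

    /** Factory-wide defaults, consulted when the collection supplies no value. */
    private final Map<Enum<?>, Object> defaultMetadata = new HashMap<>();

    ResolveMetadataDemo() {
        defaultMetadata.put(Keys.ENCODING, StandardCharsets.UTF_8);
    }

    /** Per-call metadata wins; otherwise the factory default (possibly null) is used. */
    Object resolve(Enum<?> key, Map<Enum<?>, Object> metadata) {
        Object value = metadata.get(key);
        return value != null ? value : defaultMetadata.get(key);
    }

    /** Reads the stream to its end without closing it; the caller owns the stream. */
    String getDocument(InputStream rawContent, Map<Enum<?>, Object> metadata) throws IOException {
        Charset charset = (Charset) resolve(Keys.ENCODING, metadata);
        Object title = resolve(Keys.TITLE, metadata);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = rawContent.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        String text = new String(buffer.toByteArray(), charset);
        return (title == null ? "untitled" : title) + ": " + text;
    }

    public static void main(String[] args) throws IOException {
        Map<Enum<?>, Object> metadata = new HashMap<>();
        metadata.put(Keys.TITLE, "Greeting"); // supplied per document; encoding comes from defaults
        InputStream in = new ByteArrayInputStream("ciao".getBytes(StandardCharsets.UTF_8));
        System.out.println(new ResolveMetadataDemo().getDocument(in, metadata));
    }
}
```

In this sketch the title comes from the per-call map while the encoding falls through to the factory default, which is the kind of completion the description above asks implementations to perform.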

metadataFields

protected List<? extends TikaField> metadataFields()
The list of Tika fields (apart from the content) that this factory provides; this implementation returns the empty list, so most subclasses will want to override this method.

Returns:
the list of Tika fields that this factory provides.

getParser

protected abstract org.apache.tika.parser.Parser getParser()
The parser used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.

Returns:
the parser used to parse this kind of document.

copy

public DocumentFactory copy()