it.unimi.di.mg4j.document.tika
Class AutoDetectDocumentFactory

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
          extended by it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
              extended by it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
                  extended by it.unimi.di.mg4j.document.tika.AutoDetectDocumentFactory
All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable

public class AutoDetectDocumentFactory
extends AbstractSimpleTikaDocumentFactory

A document factory that automatically detects the type of the document content.

The factory attempts to parse two metadata fields, Metadata.TITLE and GreedyTikaField.NAME; the latter contains all Tika fields, each converted with Object.toString() and concatenated.
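
As a rough illustration of what "converted with Object.toString() and concatenated" means, the following self-contained sketch joins the string forms of a set of metadata values. It uses neither MG4J nor Tika: the class name GreedyConcatenationSketch, the method concatenate, the sample keys, the skipping of null values and the single-space separator are all assumptions made for illustration, and the actual factory may order or separate values differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GreedyConcatenationSketch {

    /**
     * Joins the toString() form of every non-null value in iteration order.
     * The single-space separator is an assumption, not MG4J's documented behavior.
     */
    public static String concatenate(Map<String, ?> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Object value : metadata.values()) {
            if (value == null) continue;          // skip missing values
            if (sb.length() > 0) sb.append(' ');  // separate consecutive values
            sb.append(value.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical metadata; a LinkedHashMap keeps insertion order.
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("title", "Annual report");
        metadata.put("pages", 12);
        metadata.put("author", "J. Doe");
        System.out.println(concatenate(metadata)); // prints: Annual report 12 J. Doe
    }
}
```

Indexing such a concatenated field makes every metadata value searchable through a single field, at the cost of losing the distinction between the individual Tika fields.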

Author:
Salvatore Insalaco
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
PropertyBasedDocumentFactory.MetadataKeys
 
Nested classes/interfaces inherited from interface it.unimi.di.mg4j.document.DocumentFactory
DocumentFactory.FieldType
 
Field Summary
 
Fields inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
AutoDetectDocumentFactory()
           
AutoDetectDocumentFactory(Properties properties)
           
AutoDetectDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
           
AutoDetectDocumentFactory(String[] property)
           
 
Method Summary
protected  org.apache.tika.parser.Parser getParser()
          The parser to be used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
protected  List<? extends TikaField> metadataFields()
          The list of Tika fields (apart from content) that this factory provides; it returns the empty list, so most subclasses may want to override this method.
 
Methods inherited from class it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
copy, fields, getDocument, parseProperty
 
Methods inherited from class it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
fieldIndex, fieldName, fieldType, numberOfFields
 
Methods inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

AutoDetectDocumentFactory

public AutoDetectDocumentFactory()

AutoDetectDocumentFactory

public AutoDetectDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)

AutoDetectDocumentFactory

public AutoDetectDocumentFactory(Properties properties)
                          throws ConfigurationException
Throws:
ConfigurationException

AutoDetectDocumentFactory

public AutoDetectDocumentFactory(String[] property)
                          throws ConfigurationException
Throws:
ConfigurationException
Method Detail

getParser

protected org.apache.tika.parser.Parser getParser()
Description copied from class: AbstractSimpleTikaDocumentFactory
The parser to be used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.

Specified by:
getParser in class AbstractSimpleTikaDocumentFactory
Returns:
the parser to be used to parse this kind of document.
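
The advice to always return the same instance can be followed by keeping the parser in a static final field. The sketch below shows only that pattern and does not depend on MG4J or Tika: StandInParser and StandInFactory are hypothetical stand-ins for org.apache.tika.parser.Parser and AbstractSimpleTikaDocumentFactory, and SharedParserFactory is an invented subclass name.

```java
public class SharedParserSketch {

    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    interface StandInParser {}

    /** Hypothetical stand-in for AbstractSimpleTikaDocumentFactory. */
    abstract static class StandInFactory {
        protected abstract StandInParser getParser();
    }

    /** Invented subclass that shares one parser among all its instances. */
    static class SharedParserFactory extends StandInFactory {
        // Created once when the class is initialized; safe to share only
        // because the parser is assumed immutable and thread-safe.
        private static final StandInParser PARSER = new StandInParser() {};

        @Override
        protected StandInParser getParser() {
            return PARSER; // always the same instance
        }
    }

    /** Returns true if two distinct factories hand out the same parser object. */
    public static boolean sameParser() {
        return new SharedParserFactory().getParser() == new SharedParserFactory().getParser();
    }

    public static void main(String[] args) {
        System.out.println(sameParser()); // prints: true
    }
}
```

Sharing a single instance avoids constructing a new parser for every document or every copy of the factory.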

metadataFields

protected List<? extends TikaField> metadataFields()
Description copied from class: AbstractSimpleTikaDocumentFactory
The list of Tika fields (apart from content) that this factory provides; it returns the empty list, so most subclasses may want to override this method.

Overrides:
metadataFields in class AbstractSimpleTikaDocumentFactory
Returns:
the list of Tika fields that this factory provides.