it.unimi.di.mg4j.document.tika
Class MSOfficeDocumentFactory

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
          extended by it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
              extended by it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
                  extended by it.unimi.di.mg4j.document.tika.MSOfficeDocumentFactory
All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable

public class MSOfficeDocumentFactory
extends AbstractSimpleTikaDocumentFactory

A document factory for Microsoft Office documents.

The only metadata that will be parsed is GreedyTikaField.NAME.

Author:
Salvatore Insalaco
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
PropertyBasedDocumentFactory.MetadataKeys
 
Nested classes/interfaces inherited from interface it.unimi.di.mg4j.document.DocumentFactory
DocumentFactory.FieldType
 
Field Summary
 
Fields inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
MSOfficeDocumentFactory()
           
MSOfficeDocumentFactory(Properties properties)
           
MSOfficeDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
           
MSOfficeDocumentFactory(String[] property)
           
 
Method Summary
protected  org.apache.tika.parser.Parser getParser()
          The parser to be used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
protected  List<? extends TikaField> metadataFields()
          The list of Tika fields (apart from content) that this factory provides; the base implementation returns the empty list, so most subclasses may want to override this method.
 
Methods inherited from class it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
copy, fields, getDocument, parseProperty
 
Methods inherited from class it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
fieldIndex, fieldName, fieldType, numberOfFields
 
Methods inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

MSOfficeDocumentFactory

public MSOfficeDocumentFactory()

MSOfficeDocumentFactory

public MSOfficeDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)

MSOfficeDocumentFactory

public MSOfficeDocumentFactory(Properties properties)
                        throws ConfigurationException
Throws:
ConfigurationException

MSOfficeDocumentFactory

public MSOfficeDocumentFactory(String[] property)
                        throws ConfigurationException
Throws:
ConfigurationException

Method Detail

metadataFields

protected List<? extends TikaField> metadataFields()
Description copied from class: AbstractSimpleTikaDocumentFactory
The list of Tika fields (apart from content) that this factory provides; the base implementation returns the empty list, so most subclasses may want to override this method.

Overrides:
metadataFields in class AbstractSimpleTikaDocumentFactory
Returns:
the list of Tika fields that this factory provides.
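
The override described above can be sketched without MG4J or Tika on the classpath. Every type in this sketch is a hypothetical stand-in, not a library type: FieldSketch stands in for TikaField, BaseFactorySketch for AbstractSimpleTikaDocumentFactory, and OfficeFactorySketch for this class. It only illustrates a base method that returns the empty list and a subclass that overrides it to expose a single field, assuming that field is a name-like metadata entry.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a Tika field descriptor; not the MG4J TikaField type.
enum FieldSketch { NAME }

// Hypothetical stand-in for the abstract base factory.
abstract class BaseFactorySketch {
    // Base behaviour: no metadata fields besides content.
    protected List<? extends FieldSketch> metadataFields() {
        return Collections.emptyList();
    }
}

// Hypothetical stand-in for a concrete factory exposing exactly one metadata field.
final class OfficeFactorySketch extends BaseFactorySketch {
    // Immutable single-element list, built once and shared by all instances.
    private static final List<FieldSketch> FIELDS =
            Collections.singletonList(FieldSketch.NAME);

    @Override
    protected List<? extends FieldSketch> metadataFields() {
        return FIELDS;
    }
}
```

Returning a shared immutable list avoids allocating a new list on every call, which may matter if the method is invoked once per document.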

getParser

protected org.apache.tika.parser.Parser getParser()
Description copied from class: AbstractSimpleTikaDocumentFactory
The parser to be used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.

Specified by:
getParser in class AbstractSimpleTikaDocumentFactory
Returns:
the parser to be used to parse this kind of document.
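
The "always return the same instance" advice can be illustrated with a minimal sketch. ParserSketch, StubOfficeParser, ParserFactorySketch, and SharedParserFactorySketch are hypothetical stand-ins introduced here so the example compiles without Tika; they are not the library's types, and the sketch does not show how this class actually obtains its parser.

```java
// Hypothetical stand-in for org.apache.tika.parser.Parser.
interface ParserSketch {}

// Hypothetical stateless parser implementation.
final class StubOfficeParser implements ParserSketch {}

// Hypothetical stand-in for the abstract base factory.
abstract class ParserFactorySketch {
    protected abstract ParserSketch getParser();
}

final class SharedParserFactorySketch extends ParserFactorySketch {
    // Created once per class; sharing is safe only because the parser
    // is assumed to be immutable and thread-safe.
    private static final ParserSketch PARSER = new StubOfficeParser();

    @Override
    protected ParserSketch getParser() {
        return PARSER; // every factory instance hands out the same parser
    }
}
```

Holding the parser in a static final field means that copies of the factory (for example, those made for different threads) reuse one parser rather than constructing their own.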