it.unimi.di.mg4j.document.tika
Class HtmlDocumentFactory

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocumentFactory
      extended by it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
          extended by it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
              extended by it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
                  extended by it.unimi.di.mg4j.document.tika.HtmlDocumentFactory
All Implemented Interfaces:
DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable

public class HtmlDocumentFactory
extends AbstractSimpleTikaDocumentFactory

A document factory for the HTML format.

In addition to the document content, this factory attempts to extract the Metadata.TITLE metadata field.

Author:
Salvatore Insalaco
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
PropertyBasedDocumentFactory.MetadataKeys
 
Nested classes/interfaces inherited from interface it.unimi.di.mg4j.document.DocumentFactory
DocumentFactory.FieldType
 
Field Summary
 
Fields inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata
 
Constructor Summary
HtmlDocumentFactory()
           
HtmlDocumentFactory(Properties properties)
           
HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
           
HtmlDocumentFactory(String[] property)
           
 
Method Summary
protected  org.apache.tika.parser.Parser getParser()
          The parser used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
protected  List<TikaField> metadataFields()
          The list of Tika fields (apart from content) that this factory provides; the inherited implementation returns the empty list, so most subclasses will want to override this method.
 
Methods inherited from class it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
copy, fields, getDocument, parseProperty
 
Methods inherited from class it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
fieldIndex, fieldName, fieldType, numberOfFields
 
Methods inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentFactory
ensureFieldIndex, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

HtmlDocumentFactory

public HtmlDocumentFactory()

HtmlDocumentFactory

public HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)

HtmlDocumentFactory

public HtmlDocumentFactory(Properties properties)
                    throws ConfigurationException
Throws:
ConfigurationException

HtmlDocumentFactory

public HtmlDocumentFactory(String[] property)
                    throws ConfigurationException
Throws:
ConfigurationException

Method Detail

getParser

protected org.apache.tika.parser.Parser getParser()
Description copied from class: AbstractSimpleTikaDocumentFactory
The parser used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.

Specified by:
getParser in class AbstractSimpleTikaDocumentFactory
Returns:
the parser used to parse this kind of document.
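
The "always return the same instance" contract can be illustrated with a minimal sketch. It does not use the MG4J or Tika classes: ParserStandIn, SimpleFactorySketch and HtmlFactorySketch are hypothetical stand-ins introduced only for this example, and the tag-stripping logic is a placeholder rather than real HTML parsing. The point is only that the parser lives in a static final field, so every factory instance hands out the same immutable object.

```java
// Hypothetical stand-in for org.apache.tika.parser.Parser, so the sketch
// compiles without Tika on the classpath.
interface ParserStandIn {
    String parse(String input);
}

// Hypothetical stand-in for AbstractSimpleTikaDocumentFactory; only the
// getParser() hook is modelled.
abstract class SimpleFactorySketch {
    protected abstract ParserStandIn getParser();

    // Delegates to whatever parser the subclass provides.
    String extract(String raw) {
        return getParser().parse(raw);
    }
}

final class HtmlFactorySketch extends SimpleFactorySketch {
    // One stateless parser shared by all factory instances and threads.
    // The regex is a placeholder that merely removes anything tag-like.
    private static final ParserStandIn HTML_PARSER =
            input -> input.replaceAll("<[^>]*>", "").trim();

    @Override
    protected ParserStandIn getParser() {
        return HTML_PARSER; // never allocate a new parser per call
    }
}

final class SharedParserDemo {
    public static void main(String[] args) {
        HtmlFactorySketch a = new HtmlFactorySketch();
        HtmlFactorySketch b = new HtmlFactorySketch();
        System.out.println(a.getParser() == b.getParser()); // true
        System.out.println(a.extract("<p>Hello</p>"));      // Hello
    }
}
```

Because the shared object holds no mutable state, copies of the factory can use it concurrently without synchronization.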

metadataFields

protected List<TikaField> metadataFields()
Description copied from class: AbstractSimpleTikaDocumentFactory
The list of Tika fields (apart from content) that this factory provides; the inherited implementation returns the empty list, so most subclasses will want to override this method.

Overrides:
metadataFields in class AbstractSimpleTikaDocumentFactory
Returns:
the list of Tika fields that this factory provides.
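
The override pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the MG4J source: FieldSketch, BaseFieldsSketch, HtmlFieldsSketch and PlainFieldsSketch are hypothetical stand-ins for TikaField and the factory classes, and the field names "text" and "title" as well as the content-first ordering are choices made for this example only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for TikaField; only a field name is modelled.
final class FieldSketch {
    final String name;

    FieldSketch(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for the abstract base factory.
abstract class BaseFieldsSketch {
    // Default behaviour: no metadata fields besides the content.
    protected List<FieldSketch> metadataFields() {
        return Collections.emptyList();
    }

    // Lists the content field first (an assumption of this sketch),
    // followed by whatever metadata fields the subclass declares.
    final List<String> fieldNames() {
        List<String> names = new ArrayList<>();
        names.add("text");
        for (FieldSketch field : metadataFields()) {
            names.add(field.name);
        }
        return names;
    }
}

// Overrides the hook to expose a single title field, mirroring the
// documented behaviour of extracting only Metadata.TITLE.
final class HtmlFieldsSketch extends BaseFieldsSketch {
    private static final List<FieldSketch> FIELDS =
            Collections.singletonList(new FieldSketch("title"));

    @Override
    protected List<FieldSketch> metadataFields() {
        return FIELDS;
    }
}

// Keeps the inherited default, so only the content field is exposed.
final class PlainFieldsSketch extends BaseFieldsSketch {
}
```

With these stand-ins, new HtmlFieldsSketch().fieldNames() yields the content field followed by the title field, while new PlainFieldsSketch().fieldNames() yields the content field alone.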