it.unimi.di.mg4j.document.tika
Class HtmlDocumentFactory
java.lang.Object
it.unimi.di.mg4j.document.AbstractDocumentFactory
it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
it.unimi.di.mg4j.document.tika.HtmlDocumentFactory
- All Implemented Interfaces:
- DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
public class HtmlDocumentFactory
- extends AbstractSimpleTikaDocumentFactory
A document factory for the HTML format.
The only metadata that this factory attempts to parse is Metadata.TITLE.
- Author:
- Salvatore Insalaco
- See Also:
- Serialized Form
Method Summary
protected org.apache.tika.parser.Parser getParser()
- The parser used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
protected List<TikaField> metadataFields()
- The list of Tika fields (apart from content) that this factory provides; the base implementation returns the empty list, so most subclasses will want to override this method.
Methods inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
- ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey
HtmlDocumentFactory
public HtmlDocumentFactory()
HtmlDocumentFactory
public HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
HtmlDocumentFactory
public HtmlDocumentFactory(Properties properties)
throws ConfigurationException
- Throws:
ConfigurationException
HtmlDocumentFactory
public HtmlDocumentFactory(String[] property)
throws ConfigurationException
- Throws:
ConfigurationException
getParser
protected org.apache.tika.parser.Parser getParser()
- Description copied from class:
AbstractSimpleTikaDocumentFactory
- The parser used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
- Specified by:
getParser
in class AbstractSimpleTikaDocumentFactory
- Returns:
- the parser used to parse this kind of document.
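The "always return the same instance" contract can be illustrated with a minimal sketch. It does not use the MG4J or Tika classes: SketchParser, SketchHtmlParser, SketchFactory, and SketchHtmlFactory are hypothetical stand-ins introduced only for illustration, and whether HtmlDocumentFactory itself holds its parser in a static field is an assumption, not something this page states.

```java
// Hypothetical stand-in for org.apache.tika.parser.Parser; NOT the Tika API.
interface SketchParser {
    String parse(String input);
}

// Hypothetical stand-in for an immutable, thread-safe parser: it has no fields,
// so sharing one instance across factories and threads is safe.
final class SketchHtmlParser implements SketchParser {
    @Override
    public String parse(String input) {
        // Crude tag stripping, only to give the sketch observable behavior.
        return input.replaceAll("<[^>]*>", "").trim();
    }
}

// Hypothetical stand-in for AbstractSimpleTikaDocumentFactory.
abstract class SketchFactory {
    // Subclasses say which parser handles their format.
    protected abstract SketchParser getParser();

    public String extractText(String html) {
        return getParser().parse(html);
    }
}

// Hypothetical stand-in for a concrete factory such as HtmlDocumentFactory.
final class SketchHtmlFactory extends SketchFactory {
    // Assumption: a single parser shared by every factory instance.
    private static final SketchParser PARSER = new SketchHtmlParser();

    @Override
    protected SketchParser getParser() {
        // Always the same instance; no parser is allocated per call.
        return PARSER;
    }
}
```

With this layout, copies of the factory (for example, one per indexing thread) all reuse the same parser object, which is only safe because the parser carries no mutable state.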
metadataFields
protected List<TikaField> metadataFields()
- Description copied from class:
AbstractSimpleTikaDocumentFactory
- The list of Tika fields (apart from content) that this factory provides; the base implementation returns the empty list, so most subclasses will want to override this method.
- Overrides:
metadataFields
in class AbstractSimpleTikaDocumentFactory
- Returns:
- the list of Tika fields that this factory provides.
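The override relationship described here (an empty list in the base class, extended by a subclass that exposes a title field) can be sketched as follows. All types are hypothetical stand-ins rather than the MG4J classes: SketchField plays the role of TikaField, and the numberOfFields() helper, which counts the content field plus the metadata fields, is an assumption made only to give the sketch something observable.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for TikaField; only carries a field name.
final class SketchField {
    private final String name;

    SketchField(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }
}

// Hypothetical stand-in for AbstractSimpleTikaDocumentFactory.
abstract class SketchBaseFactory {
    // Base behavior: no metadata fields beyond the content.
    protected List<SketchField> metadataFields() {
        return Collections.emptyList();
    }

    // Assumption for this sketch: one content field plus the metadata fields.
    public int numberOfFields() {
        return 1 + metadataFields().size();
    }
}

// A factory that keeps the base behavior and exposes content only.
final class SketchContentOnlyFactory extends SketchBaseFactory {
}

// A factory that, like the HTML factory described above, adds a title field.
final class SketchTitleFactory extends SketchBaseFactory {
    // Immutable list built once and shared by all instances.
    private static final List<SketchField> FIELDS =
            Collections.singletonList(new SketchField("title"));

    @Override
    protected List<SketchField> metadataFields() {
        return FIELDS;
    }
}
```

Returning an immutable, preallocated list keeps the method cheap to call and prevents callers from altering the set of fields the factory advertises.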