java.lang.Object
  it.unimi.di.mg4j.document.AbstractDocumentFactory
    it.unimi.di.mg4j.document.PropertyBasedDocumentFactory
      it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory
        it.unimi.di.mg4j.document.tika.AbstractSimpleTikaDocumentFactory
public abstract class AbstractSimpleTikaDocumentFactory
An abstract document factory that provides an implementation for getDocument(InputStream, Reference2ObjectMap) and fields(). Moreover, it gets a WordReader object using the PropertyBasedDocumentFactory.MetadataKeys.WORDREADER property.
Concrete subclasses must provide a getParser() method and may optionally override metadataFields() (which currently returns the empty list) to return the list of Tika fields provided by this factory. Note that getParser() should always return the same instance, as Tika parsers are immutable and thread-safe.
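The subclassing contract above can be sketched in isolation. The following is a minimal illustration, not MG4J or Tika code: Parser and SimpleFactory are hypothetical stand-ins for org.apache.tika.parser.Parser and this class, the field names are invented, and plain strings replace TikaField. It only shows one way to make getParser() always return the same instance while overriding metadataFields().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SharedParserSketch {
    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    interface Parser {}

    /** Hypothetical stand-in for the abstract factory: only the two hooks discussed above. */
    static abstract class SimpleFactory {
        protected abstract Parser getParser();

        /** Mirrors the documented default: the empty list. */
        protected List<String> metadataFields() {
            return Collections.emptyList();
        }
    }

    /** A concrete factory that shares a single parser across all of its instances. */
    static final class HtmlLikeFactory extends SimpleFactory {
        // Created once; safe to share only because the parser is assumed immutable and thread-safe.
        private static final Parser PARSER = new Parser() {};

        @Override
        protected Parser getParser() {
            return PARSER;
        }

        @Override
        protected List<String> metadataFields() {
            // Illustrative field names, not taken from MG4J.
            return Arrays.asList("title", "author");
        }
    }

    public static void main(String[] args) {
        HtmlLikeFactory first = new HtmlLikeFactory();
        HtmlLikeFactory second = new HtmlLikeFactory();
        // Both factories hand out the very same parser object.
        System.out.println(first.getParser() == second.getParser());
        System.out.println(first.metadataFields());
    }
}
```

Holding the parser in a static final field is one simple way to satisfy the "same instance" requirement; a real subclass would return an actual Tika parser and a list of TikaField objects instead.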
Nested Class Summary |
---|
Nested classes/interfaces inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory |
---|
PropertyBasedDocumentFactory.MetadataKeys |
Nested classes/interfaces inherited from interface it.unimi.di.mg4j.document.DocumentFactory |
---|
DocumentFactory.FieldType |
Field Summary |
---|
Fields inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory |
---|
defaultMetadata |
Constructor Summary |
---|
AbstractSimpleTikaDocumentFactory() |
AbstractSimpleTikaDocumentFactory(Properties properties) |
AbstractSimpleTikaDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata) |
AbstractSimpleTikaDocumentFactory(String[] property) |
Method Summary | |
---|---|
DocumentFactory | copy() |
protected List<TikaField> | fields(): Returns the list of Tika fields (they will be mapped to MG4J fields whose index is their index in the list). |
Document | getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata): Returns the document obtained by parsing the given byte stream. |
protected abstract org.apache.tika.parser.Parser | getParser(): The parser to be used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe. |
protected List<? extends TikaField> | metadataFields(): The list of Tika fields (apart from content) that this factory provides; it returns the empty list, so most subclasses may want to override this method. |
protected boolean | parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata): Parses a property with the given key and values, adding it to the given map. |
Methods inherited from class it.unimi.di.mg4j.document.tika.AbstractTikaDocumentFactory |
---|
fieldIndex, fieldName, fieldType, numberOfFields |
Methods inherited from class it.unimi.di.mg4j.document.PropertyBasedDocumentFactory |
---|
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey |
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocumentFactory |
---|
ensureFieldIndex, toString |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
public AbstractSimpleTikaDocumentFactory()
public AbstractSimpleTikaDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
public AbstractSimpleTikaDocumentFactory(Properties properties) throws ConfigurationException
Throws:
ConfigurationException
public AbstractSimpleTikaDocumentFactory(String[] property) throws ConfigurationException
Throws:
ConfigurationException
Method Detail |
---|
protected boolean parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata) throws ConfigurationException
PropertyBasedDocumentFactory
Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.
Subclasses should do their own parsing, returning true in case of success and returning super.parseProperty() otherwise.
Overrides:
parseProperty in class PropertyBasedDocumentFactory
Parameters:
key - the property key.
values - the property value; this is an array, because properties may have a list of comma-separated values.
metadata - the metadata map.
Throws:
ConfigurationException
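The delegation pattern described for parseProperty() can be sketched as follows. This is an illustration under stated assumptions rather than MG4J code: BaseFactory and TitledFactory are hypothetical stand-ins, a plain Map<String, Object> replaces Reference2ObjectMap<Enum<?>,Object>, string keys replace the MetadataKeys enum, and the maxtitlelength property is invented for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParsePropertySketch {
    /** Hypothetical stand-in for the parent factory: it understands only a locale property. */
    static class BaseFactory {
        protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
            if ("locale".equalsIgnoreCase(key) && values.length == 1) {
                metadata.put("locale", Locale.forLanguageTag(values[0]));
                return true;
            }
            return false; // Unknown property.
        }
    }

    /** A subclass that parses its own (invented) property and delegates everything else. */
    static class TitledFactory extends BaseFactory {
        @Override
        protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
            if ("maxtitlelength".equalsIgnoreCase(key) && values.length == 1) {
                metadata.put("maxtitlelength", Integer.valueOf(values[0]));
                return true; // Handled here.
            }
            // Not ours: let the superclass try.
            return super.parseProperty(key, values, metadata);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        TitledFactory factory = new TitledFactory();
        System.out.println(factory.parseProperty("maxtitlelength", new String[] { "80" }, metadata));
        System.out.println(factory.parseProperty("locale", new String[] { "en-US" }, metadata));
        System.out.println(factory.parseProperty("unknown", new String[] { "x" }, metadata));
        System.out.println(metadata);
    }
}
```

Checking the subclass's own keys first and falling through to super.parseProperty() keeps every level of the hierarchy responsible only for the properties it defines.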
protected List<TikaField> fields()
AbstractTikaDocumentFactory
fields
in class AbstractTikaDocumentFactory
public Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata) throws IOException
DocumentFactory
The parameter metadata compensates for the lack of a simple keyword-based parameter-passing system in Java. This method might take several different types of “suggestions” which have been collected by the collection: typically, the document title, a URI representing the document, its MIME type, its encoding and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory).
Implementations of this method must consult the metadata provided by the collection, possibly complete it with default factory metadata, and proceed to the document construction.
Parameters:
rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
Throws:
IOException
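The "consult the collection's metadata, then complete it with factory defaults" step can be sketched as a simple two-level lookup. This is only an illustration: the Key enum, the resolve helper as written here, and the EnumMap are hypothetical stand-ins for the MetadataKeys enum, the inherited resolve methods, and Reference2ObjectMap, whose exact behavior is not shown in this page.

```java
import java.util.EnumMap;
import java.util.Map;

public class MetadataFallbackSketch {
    /** Hypothetical metadata keys, standing in for PropertyBasedDocumentFactory.MetadataKeys. */
    enum Key { TITLE, ENCODING, MIMETYPE }

    /** Returns the per-document value if present, otherwise the factory default (possibly null). */
    static Object resolve(Key key, Map<Key, Object> metadata, Map<Key, Object> defaults) {
        Object value = metadata.get(key);
        return value != null ? value : defaults.get(key);
    }

    public static void main(String[] args) {
        // Defaults configured on the factory.
        Map<Key, Object> defaults = new EnumMap<>(Key.class);
        defaults.put(Key.ENCODING, "UTF-8");
        defaults.put(Key.MIMETYPE, "text/html");

        // "Suggestions" supplied by the collection for one document.
        Map<Key, Object> metadata = new EnumMap<>(Key.class);
        metadata.put(Key.TITLE, "A sample page");
        metadata.put(Key.ENCODING, "ISO-8859-1");

        System.out.println(resolve(Key.TITLE, metadata, defaults));    // From the collection.
        System.out.println(resolve(Key.ENCODING, metadata, defaults)); // Collection overrides default.
        System.out.println(resolve(Key.MIMETYPE, metadata, defaults)); // Falls back to the default.
    }
}
```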
protected List<? extends TikaField> metadataFields()
protected abstract org.apache.tika.parser.Parser getParser()
public DocumentFactory copy()