Class AbstractSimpleTikaDocumentFactory

    • Constructor Detail

      • AbstractSimpleTikaDocumentFactory

        public AbstractSimpleTikaDocumentFactory()
      • AbstractSimpleTikaDocumentFactory

        public AbstractSimpleTikaDocumentFactory(Properties properties)
                                          throws org.apache.commons.configuration.ConfigurationException
        Throws:
        org.apache.commons.configuration.ConfigurationException
      • AbstractSimpleTikaDocumentFactory

        public AbstractSimpleTikaDocumentFactory(String[] property)
                                          throws org.apache.commons.configuration.ConfigurationException
        Throws:
        org.apache.commons.configuration.ConfigurationException
    • Method Detail

      • parseProperty

        protected boolean parseProperty(String key,
                                        String[] values,
                                        Reference2ObjectMap<Enum<?>,Object> metadata)
                                 throws org.apache.commons.configuration.ConfigurationException
        Description copied from class: PropertyBasedDocumentFactory
        Parses a property with given key and value, adding it to the given map.

        Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.

        Subclasses should do their own parsing, returning true in case of success and returning super.parseProperty() otherwise.

        Overrides:
        parseProperty in class PropertyBasedDocumentFactory
        Parameters:
        key - the property key.
        values - the property value; this is an array, because properties may have a list of comma-separated values.
        metadata - the metadata map.
        Returns:
        true if the property was parsed correctly, false if it was ignored.
        Throws:
        org.apache.commons.configuration.ConfigurationException
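
        The delegation pattern described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's code: BaseFactory, MyFactory, the Keys enum and the "maxlength" key are hypothetical stand-ins, a plain java.util.Map replaces Reference2ObjectMap, and the ConfigurationException clause is omitted.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ParsePropertySketch {
    /** Hypothetical metadata keys; the real ones live in the factory classes. */
    enum Keys { LOCALE, MAX_LENGTH }

    /** Stand-in for the superclass: it only understands the "locale" key. */
    static class BaseFactory {
        protected boolean parseProperty(String key, String[] values, Map<Enum<?>, Object> metadata) {
            if ("locale".equalsIgnoreCase(key)) {
                metadata.put(Keys.LOCALE, Locale.forLanguageTag(values[0]));
                return true;
            }
            return false; // unknown key: the property is ignored
        }
    }

    /** Stand-in for a subclass that adds one key of its own. */
    static class MyFactory extends BaseFactory {
        @Override
        protected boolean parseProperty(String key, String[] values, Map<Enum<?>, Object> metadata) {
            if ("maxlength".equalsIgnoreCase(key)) {
                // Parsed here: record the value and report success.
                metadata.put(Keys.MAX_LENGTH, Integer.valueOf(values[0]));
                return true;
            }
            // Not ours: let the superclass try.
            return super.parseProperty(key, values, metadata);
        }
    }
}
```

        With this arrangement each class handles only the keys it knows, and a false result reaches the caller only when no class in the chain recognized the key.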
      • getDocument

        public Document getDocument(InputStream rawContent,
                                    Reference2ObjectMap<Enum<?>,Object> metadata)
                             throws IOException
        Description copied from interface: DocumentFactory
        Returns the document obtained by parsing the given byte stream.

        The metadata parameter compensates for the lack of a simple keyword-based parameter-passing system in Java. This method might take several different types of “suggestions” which have been gathered by the collection: typically, the document title, a URI representing the document, its MIME type, its encoding and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory). Implementations of this method must consult the metadata provided by the collection, possibly complete it with default factory metadata, and then proceed to construct the document.

        Parameters:
        rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
        metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
        Returns:
        the document obtained by parsing the given byte stream.
        Throws:
        IOException
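
        The "consult the collection metadata, then fall back to factory defaults" step can be sketched as below. This is an illustrative stand-in under stated assumptions, not the actual implementation: the Keys enum, the resolve helper and the String return type are hypothetical, and a plain java.util.Map replaces Reference2ObjectMap. Note that the stream is read but never closed, since closing it is the caller's responsibility.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

class GetDocumentSketch {
    /** Hypothetical metadata keys. */
    enum Keys { TITLE, ENCODING }

    /** Factory-level defaults, used when the collection supplies nothing. */
    private final Map<Enum<?>, Object> defaults = new HashMap<>();

    GetDocumentSketch() {
        defaults.put(Keys.ENCODING, "UTF-8");
    }

    /** Per-call metadata wins; otherwise fall back to the factory default (possibly null). */
    Object resolve(Enum<?> key, Map<Enum<?>, Object> metadata) {
        Object value = metadata.get(key);
        return value != null ? value : defaults.get(key);
    }

    /** Builds a toy "document" (a string) from the raw bytes; does not close the stream. */
    String getDocument(InputStream rawContent, Map<Enum<?>, Object> metadata) throws IOException {
        Charset charset = Charset.forName((String) resolve(Keys.ENCODING, metadata));
        String text = new String(rawContent.readAllBytes(), charset); // readAllBytes needs Java 9+
        Object title = resolve(Keys.TITLE, metadata);
        return (title == null ? "(untitled)" : title) + ": " + text;
    }
}
```

        Here a caller that passes only a title still gets UTF-8 decoding from the factory default, while a caller that passes an explicit encoding overrides it for that one document.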
      • metadataFields

        protected List<? extends TikaField> metadataFields()
        The list of Tika fields (apart from content) that this factory provides; it returns the empty list, so most subclasses may want to override this method.
        Returns:
        the list of Tika fields that this factory provides.
      • getParser

        protected abstract org.apache.tika.parser.Parser getParser()
        The parser to be used to parse this kind of document; subclasses should always return the same instance, as Tika parsers are immutable and thread-safe.
        Returns:
        the parser to be used to parse this kind of document.
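
        One way a subclass might honor the "always the same instance" advice is to keep the parser in a static final field. The sketch below only illustrates that pattern: Parser, PlainParser, AbstractFactory and PlainFactory are hypothetical stand-ins (in a real subclass the field would hold an org.apache.tika.parser.Parser).

```java
class GetParserSketch {
    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    interface Parser {}

    /** Hypothetical stand-in for a concrete, stateless parser. */
    static final class PlainParser implements Parser {}

    /** Stand-in for the abstract factory declaring getParser(). */
    abstract static class AbstractFactory {
        protected abstract Parser getParser();
    }

    static class PlainFactory extends AbstractFactory {
        // Created once at class initialization and shared by every factory instance and thread.
        private static final Parser PARSER = new PlainParser();

        @Override
        protected Parser getParser() {
            return PARSER;
        }
    }
}
```

        Sharing one instance is safe only because the parser is assumed to be immutable and thread-safe, as stated above; a parser holding per-document state would need a different arrangement.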