Class HtmlDocumentFactory

    • Field Detail

      • parser

        protected transient BulletParser parser
        A parser that will be used to extract text from HTML documents.
      • textExtractor

        protected transient TextExtractor textExtractor
        The callback recording text.
      • anchorExtractor

        protected transient AnchorExtractor anchorExtractor
        The callback for anchors.
      • wordReader

        protected transient WordReader wordReader
        The word reader used for all documents.
      • maxPreAnchor

        protected int maxPreAnchor
        The maximum number of characters before an anchor.
      • maxAnchor

        protected int maxAnchor
        The maximum number of characters in an anchor.
      • maxPostAnchor

        protected int maxPostAnchor
        The maximum number of characters after an anchor.
      • delimiter

        protected String delimiter
        A token that will be inserted to delimit the anchor text, or null for no delimiter.
      • text

        protected transient char[] text
        The buffer holding text.
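
        The three anchor limits and the delimiter can be read as a clipping rule applied around each anchor. The sketch below is one plausible interpretation, not the actual extraction logic: the class AnchorWindowSketch, its window method, and the start/end offsets are hypothetical names introduced only for illustration.

```java
// Hypothetical illustration of how maxPreAnchor, maxAnchor, maxPostAnchor and
// delimiter might interact. This is an assumption, not the MG4J implementation.
class AnchorWindowSketch {
    /**
     * Builds a context string for the anchor occupying text[start, end).
     * Assumes 0 <= start <= end <= text.length() and non-negative limits.
     */
    static String window(String text, int start, int end,
                         int maxPreAnchor, int maxAnchor, int maxPostAnchor,
                         String delimiter) {
        // Keep at most maxPreAnchor characters before the anchor.
        int preStart = Math.max(0, start - maxPreAnchor);
        // Keep at most maxAnchor characters of the anchor itself.
        int anchorEnd = Math.min(end, start + maxAnchor);
        // Keep at most maxPostAnchor characters after the anchor.
        int postEnd = Math.min(text.length(), end + maxPostAnchor);

        StringBuilder result = new StringBuilder();
        result.append(text, preStart, start);
        if (delimiter != null) result.append(delimiter); // null means no delimiter
        result.append(text, start, anchorEnd);
        if (delimiter != null) result.append(delimiter);
        result.append(text, end, postEnd);
        return result.toString();
    }
}
```

        Under this reading, a null delimiter simply concatenates the three clipped pieces.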
    • Constructor Detail

      • HtmlDocumentFactory

        public HtmlDocumentFactory(Properties properties)
                            throws org.apache.commons.configuration.ConfigurationException
        Throws:
        org.apache.commons.configuration.ConfigurationException
      • HtmlDocumentFactory

        public HtmlDocumentFactory(String[] property)
                            throws org.apache.commons.configuration.ConfigurationException
        Throws:
        org.apache.commons.configuration.ConfigurationException
      • HtmlDocumentFactory

        public HtmlDocumentFactory()
    • Method Detail

      • parseProperty

        protected boolean parseProperty(String key,
                                        String[] values,
                                        Reference2ObjectMap<Enum<?>,Object> metadata)
                                 throws org.apache.commons.configuration.ConfigurationException
        Description copied from class: PropertyBasedDocumentFactory
        Parses a property with given key and value, adding it to the given map.

        Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.

        Subclasses should do their own parsing, returning true in case of success and returning super.parseProperty() otherwise.

        Overrides:
        parseProperty in class PropertyBasedDocumentFactory
        Parameters:
        key - the property key.
        values - the property value; this is an array, because properties may have a list of comma-separated values.
        metadata - the metadata map.
        Returns:
        true if the property was parsed correctly, false if it was ignored.
        Throws:
        org.apache.commons.configuration.ConfigurationException
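
        The delegation pattern described above (handle your own keys, otherwise defer to the superclass) can be sketched as follows. This is a minimal, self-contained illustration: BaseFactorySketch, HtmlFactorySketch, and both key enums are hypothetical stand-ins, and a plain java.util.Map replaces Reference2ObjectMap so that the example runs without external libraries.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the superclass; it only understands LOCALE.
class BaseFactorySketch {
    enum Keys { LOCALE }

    protected boolean parseProperty(String key, String[] values, Map<Enum<?>, Object> metadata) {
        if (Keys.LOCALE.name().equalsIgnoreCase(key)) {
            metadata.put(Keys.LOCALE, Locale.forLanguageTag(values[0]));
            return true;
        }
        return false; // the property was ignored
    }
}

// Hypothetical subclass: parses its own key first, then delegates upward.
class HtmlFactorySketch extends BaseFactorySketch {
    enum HtmlKeys { MAXANCHOR } // illustrative key name, an assumption

    @Override
    protected boolean parseProperty(String key, String[] values, Map<Enum<?>, Object> metadata) {
        if (HtmlKeys.MAXANCHOR.name().equalsIgnoreCase(key)) {
            metadata.put(HtmlKeys.MAXANCHOR, Integer.valueOf(values[0]));
            return true;
        }
        // Not one of ours: let the superclass try.
        return super.parseProperty(key, values, metadata);
    }
}
```

        Because each level returns false for keys it does not recognize, a caller can tell an ignored property from a parsed one without knowing which class in the hierarchy handled it.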
      • init

        protected void init()
      • initVars

        protected void initVars()
      • copy

        public HtmlDocumentFactory copy()
        Returns a copy of this document factory. A new parser is allocated for the copy.
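
        The point of allocating a new parser is that the copy shares configuration but not mutable working state, so each copy can be used by a different thread. A minimal sketch of that idea, using hypothetical stand-in classes (ParserSketch and CopyableFactorySketch are not MG4J types):

```java
// Hypothetical stand-in for a parser with mutable scratch state.
class ParserSketch {
    final StringBuilder scratch = new StringBuilder();
}

// Hypothetical factory showing the copy-with-fresh-parser pattern.
class CopyableFactorySketch {
    protected final int maxAnchor;           // configuration, copied by value
    protected transient ParserSketch parser; // working state, one per instance

    CopyableFactorySketch(int maxAnchor) {
        this.maxAnchor = maxAnchor;
        init();
    }

    // Allocates the per-instance working state.
    protected void init() {
        parser = new ParserSketch();
    }

    // Same configuration, freshly allocated parser.
    public CopyableFactorySketch copy() {
        return new CopyableFactorySketch(maxAnchor);
    }
}
```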
      • numberOfFields

        public int numberOfFields()
        Description copied from interface: DocumentFactory
        Returns the number of fields present in the documents produced by this factory.
        Returns:
        the number of fields present in the documents produced by this factory.
      • fieldName

        public String fieldName(int field)
        Description copied from interface: DocumentFactory
        Returns the symbolic name of a field.
        Parameters:
        field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
        Returns:
        the symbolic name of the field-th field.
      • fieldIndex

        public int fieldIndex(String fieldName)
        Description copied from interface: DocumentFactory
        Returns the index of a field, given its symbolic name.
        Parameters:
        fieldName - the name of a field of this factory.
        Returns:
        the corresponding index, or -1 if there is no field with name fieldName.
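
        Taken together, numberOfFields(), fieldName() and fieldIndex() describe a two-way mapping between indices and symbolic names. The following sketch shows that contract in isolation; FieldTableSketch is a hypothetical helper, and the field names used with it are illustrative rather than the names this factory actually exposes.

```java
// Hypothetical helper illustrating the index <-> name contract.
class FieldTableSketch {
    private final String[] names;

    FieldTableSketch(String... names) {
        this.names = names.clone();
    }

    int numberOfFields() {
        return names.length;
    }

    // Valid for 0 <= field < numberOfFields().
    String fieldName(int field) {
        if (field < 0 || field >= names.length) {
            throw new IndexOutOfBoundsException("No such field: " + field);
        }
        return names[field];
    }

    // Returns -1 when no field has the given name.
    int fieldIndex(String fieldName) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(fieldName)) return i;
        }
        return -1;
    }
}
```

        For every valid index i, fieldIndex(fieldName(i)) returns i, which is the round-trip property callers typically rely on.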
      • getDocument

        public Document getDocument(InputStream rawContent,
                                    Reference2ObjectMap<Enum<?>,Object> metadata)
                             throws IOException
        Description copied from interface: DocumentFactory
        Returns the document obtained by parsing the given byte stream.

        The metadata parameter compensates for the lack of a simple keyword-based parameter-passing system in Java. This method might receive several different types of “suggestions” gathered by the collection: typically the document title, a URI representing the document, its MIME type, its encoding, and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory). Implementations of this method must consult the metadata provided by the collection, possibly complete them with default factory metadata, and then proceed to construct the document.

        Parameters:
        rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
        metadata - a map from enums (e.g., keys taken from PropertyBasedDocumentFactory) to various kinds of objects.
        Returns:
        the document obtained by parsing the given byte stream.
        Throws:
        IOException
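
        The lookup order described above (per-document metadata first, factory defaults second) and the rule that rawContent must not be closed can be sketched as follows. This is an illustration under stated assumptions: MetadataSketch, its Keys enum, and the resolve and readText helpers are hypothetical, a plain java.util.Map stands in for Reference2ObjectMap, and the UTF-8 fallback is an arbitrary choice made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Map;

class MetadataSketch {
    enum Keys { TITLE, ENCODING } // hypothetical keys

    // Per-document metadata wins; factory defaults fill the gaps.
    static Object resolve(Enum<?> key, Map<Enum<?>, Object> metadata, Map<Enum<?>, Object> defaults) {
        Object value = metadata.get(key);
        return value != null ? value : defaults.get(key);
    }

    // Decodes the stream with the resolved encoding and leaves it open,
    // since closing it is the caller's responsibility.
    static String readText(InputStream rawContent, Map<Enum<?>, Object> metadata,
                           Map<Enum<?>, Object> defaults) throws IOException {
        Object encoding = resolve(Keys.ENCODING, metadata, defaults);
        // Assumption: fall back to UTF-8 when no encoding is known.
        Charset charset = Charset.forName(encoding != null ? encoding.toString() : "UTF-8");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = rawContent.read(buffer)) != -1) {
            bytes.write(buffer, 0, n);
        }
        return new String(bytes.toByteArray(), charset);
    }
}
```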