java.lang.Object
- it.unimi.di.big.mg4j.document.AbstractDocumentFactory
  - it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
    - it.unimi.di.big.mg4j.document.HtmlDocumentFactory

All Implemented Interfaces:

DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
```
public class HtmlDocumentFactory
extends PropertyBasedDocumentFactory
```
A factory that provides fields for the body and title of HTML documents. Internally, it uses a BulletParser. A default encoding can be provided using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
By default, the WordReader provided by this factory is just a FastBufferedReader, but you can specify an alternative word reader using the property PropertyBasedDocumentFactory.MetadataKeys.WORDREADER.
Additional keys make it possible to tune the underlying AnchorExtractor.

See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`protected class`	`HtmlDocumentFactory.HtmlDocument`	An HTML document.
`static class`	`HtmlDocumentFactory.MetadataKeys`

Nested classes/interfaces inherited from interface it.unimi.di.big.mg4j.document.DocumentFactory
DocumentFactory.FieldType

Field Summary

Fields
Modifier and Type	Field	Description
`protected AnchorExtractor`	`anchorExtractor`	The callback for anchors.
`protected static int`	`DEFAULT_BUFFER_SIZE`
`static int`	`DEFAULT_MAXANCHOR`	Default maximum number of characters in an anchor (property `HtmlDocumentFactory.MetadataKeys.MAXANCHOR`).
`static int`	`DEFAULT_MAXPOSTANCHOR`	Default maximum number of characters after an anchor (property `HtmlDocumentFactory.MetadataKeys.MAXPOSTANCHOR`).
`static int`	`DEFAULT_MAXPREANCHOR`	Default maximum number of characters before an anchor (property `HtmlDocumentFactory.MetadataKeys.MAXPREANCHOR`).
`protected String`	`delimiter`	A token that will be inserted to delimit the anchor text, or `null` for no delimiter.
`protected int`	`maxAnchor`	The maximum number of characters in an anchor.
`protected int`	`maxPostAnchor`	The maximum number of characters after an anchor.
`protected int`	`maxPreAnchor`	The maximum number of characters before an anchor.
`protected BulletParser`	`parser`	A parser that will be used to extract text from HTML documents.
`protected char[]`	`text`	The buffer holding text.
`protected TextExtractor`	`textExtractor`	The callback recording text.
`protected WordReader`	`wordReader`	The word reader used for all documents.

Fields inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
defaultMetadata

Constructor Summary

Constructors
Constructor	Description
`HtmlDocumentFactory()`
`HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)`
`HtmlDocumentFactory(Properties properties)`
`HtmlDocumentFactory(String[] property)`

Method Summary

Modifier and Type	Method	Description
`HtmlDocumentFactory`	`copy()`	Returns a copy of this document factory.
`int`	`fieldIndex(String fieldName)`	Returns the index of a field, given its symbolic name.
`String`	`fieldName(int field)`	Returns the symbolic name of a field.
`DocumentFactory.FieldType`	`fieldType(int field)`	Returns the type of a field.
`Document`	`getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)`	Returns the document obtained by parsing the given byte stream.
`protected void`	`init()`
`protected void`	`initVars()`
`int`	`numberOfFields()`	Returns the number of fields present in the documents produced by this factory.
`protected boolean`	`parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata)`	Parses a property with given key and value, adding it to the given map.

Methods inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey, toString

Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentFactory
ensureFieldIndex

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - DEFAULT_MAXPREANCHOR
```
public static final int DEFAULT_MAXPREANCHOR
```
    Default maximum number of characters before an anchor (property HtmlDocumentFactory.MetadataKeys.MAXPREANCHOR).
    
    See Also:
    
    Constant Field Values
  - DEFAULT_MAXANCHOR
```
public static final int DEFAULT_MAXANCHOR
```
    Default maximum number of characters in an anchor (property HtmlDocumentFactory.MetadataKeys.MAXANCHOR).
    
    See Also:
    
    Constant Field Values
  - DEFAULT_MAXPOSTANCHOR
```
public static final int DEFAULT_MAXPOSTANCHOR
```
    Default maximum number of characters after an anchor (property HtmlDocumentFactory.MetadataKeys.MAXPOSTANCHOR).
    
    See Also:
    
    Constant Field Values
  - DEFAULT_BUFFER_SIZE
```
protected static final int DEFAULT_BUFFER_SIZE
```
    See Also:
    
    Constant Field Values
  - parser
```
protected transient BulletParser parser
```
    A parser that will be used to extract text from HTML documents.
  - textExtractor
```
protected transient TextExtractor textExtractor
```
    The callback recording text.
  - anchorExtractor
```
protected transient AnchorExtractor anchorExtractor
```
    The callback for anchors.
  - wordReader
```
protected transient WordReader wordReader
```
    The word reader used for all documents.
  - maxPreAnchor
```
protected int maxPreAnchor
```
    The maximum number of characters before an anchor.
  - maxAnchor
```
protected int maxAnchor
```
    The maximum number of characters in an anchor.
  - maxPostAnchor
```
protected int maxPostAnchor
```
    The maximum number of characters after an anchor.
  - delimiter
```
protected String delimiter
```
    A token that will be inserted to delimit the anchor text, or null for no delimiter.
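
    The interplay of maxPreAnchor, maxAnchor, maxPostAnchor and delimiter can be pictured with a small standalone sketch. The class AnchorWindowSketch and its clip method are hypothetical and are not part of MG4J; the sketch assumes that the limits keep the characters closest to the anchor and that the delimiter is placed on both sides of the anchor text, which may differ from what AnchorExtractor actually does.

```java
/** Hypothetical helper, not part of MG4J: clips the text surrounding an anchor. */
final class AnchorWindowSketch {

    /**
     * Builds "pre + delimiter + anchor + delimiter + post", truncating each part.
     * Assumption: a null delimiter means that nothing is inserted.
     */
    static String clip(String before, String anchor, String after,
                       int maxPreAnchor, int maxAnchor, int maxPostAnchor,
                       String delimiter) {
        // Keep only the last maxPreAnchor characters preceding the anchor.
        String pre = before.substring(Math.max(0, before.length() - maxPreAnchor));
        // Keep only the first maxAnchor characters of the anchor text.
        String in = anchor.substring(0, Math.min(anchor.length(), maxAnchor));
        // Keep only the first maxPostAnchor characters following the anchor.
        String post = after.substring(0, Math.min(after.length(), maxPostAnchor));
        String sep = delimiter == null ? "" : delimiter;
        return pre + sep + in + sep + post;
    }
}
```

    With all three limits set to 4 and "|" as delimiter, clipping "see the ", "MG4J home page" and " for details" yields "the |MG4J| for" under these assumptions.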
  - text
```
protected transient char[] text
```
    The buffer holding text.
- Constructor Detail
  - HtmlDocumentFactory
```
public HtmlDocumentFactory(Properties properties)
                    throws org.apache.commons.configuration.ConfigurationException
```
    Throws:
    
    org.apache.commons.configuration.ConfigurationException
  - HtmlDocumentFactory
```
public HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
```
  - HtmlDocumentFactory
```
public HtmlDocumentFactory(String[] property)
                    throws org.apache.commons.configuration.ConfigurationException
```
    Throws:
    
    org.apache.commons.configuration.ConfigurationException
  - HtmlDocumentFactory
```
public HtmlDocumentFactory()
```
- Method Detail
  - parseProperty
```
protected boolean parseProperty(String key,
                                String[] values,
                                Reference2ObjectMap<Enum<?>,Object> metadata)
                         throws org.apache.commons.configuration.ConfigurationException
```
    Description copied from class: PropertyBasedDocumentFactory
    
    Parses a property with given key and value, adding it to the given map.
    Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.
    Subclasses should do their own parsing, returning true in case of success and returning super.parseProperty() otherwise.
    
    Overrides:
    
    parseProperty in class PropertyBasedDocumentFactory
    
    Parameters:
    
    key - the property key.
    
    values - the property value; this is an array, because properties may have a list of comma-separated values.
    
    metadata - the metadata map.
    
    Returns:
    
    true if the property was parsed correctly, false if it was ignored.
    
    Throws:
    
    org.apache.commons.configuration.ConfigurationException
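
    The delegation pattern described above (handle your own keys, otherwise defer to the superclass) might look roughly as follows. This is a simplified sketch that does not use the MG4J classes: BaseFactorySketch, HtmlFactorySketch, the string keys "locale" and "maxanchor", and the plain Map are all stand-ins chosen for illustration.

```java
import java.util.Map;

/** Hypothetical stand-in for a base factory that only knows the "locale" key. */
class BaseFactorySketch {
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if ("locale".equals(key)) {
            metadata.put(key, values[0]);
            return true;
        }
        return false; // Unknown key: the property is ignored.
    }
}

/** Hypothetical subclass that parses its own key and delegates everything else. */
class HtmlFactorySketch extends BaseFactorySketch {
    @Override
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if ("maxanchor".equals(key)) {
            // Parse the value into the type this key is expected to carry.
            metadata.put(key, Integer.valueOf(values[0]));
            return true;
        }
        // Not one of ours: let the superclass try.
        return super.parseProperty(key, values, metadata);
    }
}
```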
  - init
```
protected void init()
```
  - initVars
```
protected void initVars()
```
  - copy
```
public HtmlDocumentFactory copy()
```
    Returns a copy of this document factory. A new parser is allocated for the copy.
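
    As an illustration of the copy semantics, the following standalone sketch shares the configuration between the original and the copy but gives the copy its own mutable state. CopyableFactorySketch is hypothetical and much simpler than the real class; a char[] buffer stands in for the parser and other per-instance working state.

```java
/** Hypothetical sketch of a factory whose copies do not share mutable state. */
final class CopyableFactorySketch {
    final int maxAnchor; // Configuration: an immutable value, safe to carry over.
    final char[] text;   // Working buffer: must not be shared between copies.

    CopyableFactorySketch(int maxAnchor, int bufferSize) {
        this.maxAnchor = maxAnchor;
        this.text = new char[bufferSize];
    }

    /** Returns a copy with the same configuration and a freshly allocated buffer. */
    CopyableFactorySketch copy() {
        return new CopyableFactorySketch(maxAnchor, text.length);
    }
}
```

    Under this design each thread can work on its own copy without synchronizing on the buffer.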
  - numberOfFields
```
public int numberOfFields()
```
    Description copied from interface: DocumentFactory
    
    Returns the number of fields present in the documents produced by this factory.
    
    Returns:
    
    the number of fields present in the documents produced by this factory.
  - fieldName
```
public String fieldName(int field)
```
    Description copied from interface: DocumentFactory
    
    Returns the symbolic name of a field.
    
    Parameters:
    
    field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
    
    Returns:
    
    the symbolic name of the field-th field.
  - fieldIndex
```
public int fieldIndex(String fieldName)
```
    Description copied from interface: DocumentFactory
    
    Returns the index of a field, given its symbolic name.
    
    Parameters:
    
    fieldName - the name of a field of this factory.
    
    Returns:
    
    the corresponding index, or -1 if there is no field with name fieldName.
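
    The contract linking numberOfFields(), fieldName() and fieldIndex() can be sketched with a plain array of names. FieldTableSketch is hypothetical, and the names "text" and "title" are illustrative only; the names actually exposed by this factory should be read from fieldName().

```java
/** Hypothetical sketch of the index/name contract; not the MG4J implementation. */
final class FieldTableSketch {
    // Illustrative field names, assumed for this example.
    private static final String[] FIELD_NAMES = { "text", "title" };

    int numberOfFields() {
        return FIELD_NAMES.length;
    }

    String fieldName(int field) {
        // Valid indices run from 0 inclusive to numberOfFields() exclusive.
        if (field < 0 || field >= FIELD_NAMES.length)
            throw new IndexOutOfBoundsException("No such field: " + field);
        return FIELD_NAMES[field];
    }

    int fieldIndex(String fieldName) {
        for (int i = 0; i < FIELD_NAMES.length; i++)
            if (FIELD_NAMES[i].equals(fieldName)) return i;
        return -1; // No field with this name.
    }
}
```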
  - fieldType
```
public DocumentFactory.FieldType fieldType(int field)
```
    Description copied from interface: DocumentFactory
    
    Returns the type of a field.
    The possible types are defined in DocumentFactory.FieldType.
    
    Parameters:
    
    field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
    
    Returns:
    
    the type of the field-th field.
  - getDocument
```
public Document getDocument(InputStream rawContent,
                            Reference2ObjectMap<Enum<?>,Object> metadata)
                     throws IOException
```
    Description copied from interface: DocumentFactory
    
    Returns the document obtained by parsing the given byte stream.
    The metadata parameter makes up for the lack of a simple keyword-based parameter-passing system in Java. This method might take several different types of “suggestions” that have been gathered by the collection: typically, the document title, a URI representing the document, its MIME type, its encoding and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory). Implementations of this method must consult the metadata provided by the collection, possibly complete them with default factory metadata, and proceed to the document construction.
    
    Parameters:
    
    rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
    
    metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
    
    Returns:
    
    the document obtained by parsing the given byte stream.
    
    Throws:
    
    IOException
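
    The lookup order described above (metadata supplied by the collection first, factory defaults second) can be sketched in a few lines. MetadataResolveSketch, its Key enum and the use of a plain java.util.Map are assumptions made to keep the example self-contained; the real method takes a Reference2ObjectMap and the factory inherits its own resolve methods.

```java
import java.util.Map;

/** Hypothetical sketch of resolving a metadata key against defaults. */
final class MetadataResolveSketch {
    /** Illustrative keys; the real ones live in the MetadataKeys enums. */
    enum Key { ENCODING, TITLE }

    /**
     * Returns the value supplied by the collection if present,
     * otherwise the factory default (possibly null).
     */
    static Object resolve(Enum<?> key, Map<Enum<?>, Object> metadata, Map<Enum<?>, Object> defaults) {
        Object value = metadata == null ? null : metadata.get(key);
        return value != null ? value : defaults.get(key);
    }
}
```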
