Class HtmlDocumentFactory
- java.lang.Object
  - it.unimi.di.big.mg4j.document.AbstractDocumentFactory
    - it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
      - it.unimi.di.big.mg4j.document.HtmlDocumentFactory
- All Implemented Interfaces: DocumentFactory, FlyweightPrototype<DocumentFactory>, Serializable
public class HtmlDocumentFactory extends PropertyBasedDocumentFactory
A factory that provides fields for the body and title of HTML documents. Internally it uses a BulletParser. A default encoding can be provided using the property PropertyBasedDocumentFactory.MetadataKeys.ENCODING.
By default, the WordReader provided by this factory is just a FastBufferedReader, but you can specify an alternative word reader using the property PropertyBasedDocumentFactory.MetadataKeys.WORDREADER.
Additional keys make it possible to tune the underlying AnchorExtractor.
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
- protected class HtmlDocumentFactory.HtmlDocument: An HTML document.
- static class HtmlDocumentFactory.MetadataKeys
Nested classes/interfaces inherited from interface it.unimi.di.big.mg4j.document.DocumentFactory
- DocumentFactory.FieldType
-
-
Field Summary
Fields
- protected AnchorExtractor anchorExtractor: The callback for anchors.
- protected static int DEFAULT_BUFFER_SIZE
- static int DEFAULT_MAXANCHOR: Default maximum number of characters in an anchor (property HtmlDocumentFactory.MetadataKeys.MAXANCHOR).
- static int DEFAULT_MAXPOSTANCHOR: Default maximum number of characters after an anchor (property HtmlDocumentFactory.MetadataKeys.MAXPOSTANCHOR).
- static int DEFAULT_MAXPREANCHOR: Default maximum number of characters before an anchor (property HtmlDocumentFactory.MetadataKeys.MAXPREANCHOR).
- protected String delimiter: A token that will be inserted to delimit the anchor text, or null for no delimiter.
- protected int maxAnchor: The maximum number of characters in an anchor.
- protected int maxPostAnchor: The maximum number of characters after an anchor.
- protected int maxPreAnchor: The maximum number of characters before an anchor.
- protected BulletParser parser: A parser that will be used to extract text from HTML documents.
- protected char[] text: The buffer holding text.
- protected TextExtractor textExtractor: The callback recording text.
- protected WordReader wordReader: The word reader used for all documents.
Fields inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
- defaultMetadata
-
-
Constructor Summary
Constructors
- HtmlDocumentFactory()
- HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
- HtmlDocumentFactory(Properties properties)
- HtmlDocumentFactory(String[] property)
-
Method Summary
- HtmlDocumentFactory copy(): Returns a copy of this document factory.
- int fieldIndex(String fieldName): Returns the index of a field, given its symbolic name.
- String fieldName(int field): Returns the symbolic name of a field.
- DocumentFactory.FieldType fieldType(int field): Returns the type of a field.
- Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata): Returns the document obtained by parsing the given byte stream.
- protected void init()
- protected void initVars()
- int numberOfFields(): Returns the number of fields present in the documents produced by this factory.
- protected boolean parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata): Parses a property with given key and value, adding it to the given map.
Methods inherited from class it.unimi.di.big.mg4j.document.PropertyBasedDocumentFactory
- ensureJustOne, getInstance, getInstance, getInstance, getInstance, parseProperties, parseProperties, resolve, resolve, resolveNotNull, sameKey, toString
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocumentFactory
- ensureFieldIndex
-
-
-
-
Field Detail
-
DEFAULT_MAXPREANCHOR
public static final int DEFAULT_MAXPREANCHOR
Default maximum number of characters before an anchor (property HtmlDocumentFactory.MetadataKeys.MAXPREANCHOR).
- See Also:
- Constant Field Values
-
DEFAULT_MAXANCHOR
public static final int DEFAULT_MAXANCHOR
Default maximum number of characters in an anchor (property HtmlDocumentFactory.MetadataKeys.MAXANCHOR).
- See Also:
- Constant Field Values
-
DEFAULT_MAXPOSTANCHOR
public static final int DEFAULT_MAXPOSTANCHOR
Default maximum number of characters after an anchor (property HtmlDocumentFactory.MetadataKeys.MAXPOSTANCHOR).
- See Also:
- Constant Field Values
-
DEFAULT_BUFFER_SIZE
protected static final int DEFAULT_BUFFER_SIZE
- See Also:
- Constant Field Values
-
parser
protected transient BulletParser parser
A parser that will be used to extract text from HTML documents.
-
textExtractor
protected transient TextExtractor textExtractor
The callback recording text.
-
anchorExtractor
protected transient AnchorExtractor anchorExtractor
The callback for anchors.
-
wordReader
protected transient WordReader wordReader
The word reader used for all documents.
-
maxPreAnchor
protected int maxPreAnchor
The maximum number of characters before an anchor.
-
maxAnchor
protected int maxAnchor
The maximum number of characters in an anchor.
-
maxPostAnchor
protected int maxPostAnchor
The maximum number of characters after an anchor.
-
delimiter
protected String delimiter
A token that will be inserted to delimit the anchor text, or null for no delimiter.
-
text
protected transient char[] text
The buffer holding text.
-
-
Constructor Detail
-
HtmlDocumentFactory
public HtmlDocumentFactory(Properties properties) throws org.apache.commons.configuration.ConfigurationException
- Throws:
org.apache.commons.configuration.ConfigurationException
-
HtmlDocumentFactory
public HtmlDocumentFactory(Reference2ObjectMap<Enum<?>,Object> defaultMetadata)
-
HtmlDocumentFactory
public HtmlDocumentFactory(String[] property) throws org.apache.commons.configuration.ConfigurationException
- Throws:
org.apache.commons.configuration.ConfigurationException
-
HtmlDocumentFactory
public HtmlDocumentFactory()
-
-
Method Detail
-
parseProperty
protected boolean parseProperty(String key, String[] values, Reference2ObjectMap<Enum<?>,Object> metadata) throws org.apache.commons.configuration.ConfigurationException
Description copied from class: PropertyBasedDocumentFactory
Parses a property with given key and value, adding it to the given map.
Currently this implementation just parses the PropertyBasedDocumentFactory.MetadataKeys.LOCALE property.
Subclasses should do their own parsing, returning true in case of success and returning super.parseProperty() otherwise.
- Overrides:
- parseProperty in class PropertyBasedDocumentFactory
- Parameters:
- key - the property key.
- values - the property value; this is an array, because properties may have a list of comma-separated values.
- metadata - the metadata map.
- Returns:
- true if the property was parsed correctly, false if it was ignored.
- Throws:
- org.apache.commons.configuration.ConfigurationException
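The delegation convention described for parseProperty (handle the keys you know, otherwise defer to the superclass) can be sketched without MG4J on the classpath. Everything in this sketch is hypothetical: BaseFactorySketch and HtmlFactorySketch stand in for the real classes, the key names "locale" and "maxanchor" are illustrative strings, and a plain Map<String,Object> replaces the Reference2ObjectMap<Enum<?>,Object> of the real signature.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the superclass: it only understands the "locale" key.
class BaseFactorySketch {
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if (key.equalsIgnoreCase("locale")) {
            metadata.put("locale", Locale.forLanguageTag(values[0]));
            return true;
        }
        return false; // unknown key: the property is ignored
    }
}

// Hypothetical subclass: parses its own key first, then defers to the superclass.
class HtmlFactorySketch extends BaseFactorySketch {
    @Override
    protected boolean parseProperty(String key, String[] values, Map<String, Object> metadata) {
        if (key.equalsIgnoreCase("maxanchor")) {
            metadata.put("maxanchor", Integer.valueOf(values[0]));
            return true;
        }
        return super.parseProperty(key, values, metadata);
    }
}

class ParsePropertySketch {
    public static void main(String[] args) {
        Map<String, Object> metadata = new HashMap<>();
        HtmlFactorySketch factory = new HtmlFactorySketch();
        // Handled by the subclass itself.
        System.out.println(factory.parseProperty("maxanchor", new String[] { "128" }, metadata));
        // Not known to the subclass, so it falls through to the superclass.
        System.out.println(factory.parseProperty("locale", new String[] { "en-US" }, metadata));
        // Known to neither class, so the call reports that the property was ignored.
        System.out.println(factory.parseProperty("unknown", new String[] { "x" }, metadata));
        System.out.println(metadata);
    }
}
```

The point of returning the superclass result instead of false is that every level of the hierarchy gets a chance to recognize the key before it is reported as ignored.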
-
init
protected void init()
-
initVars
protected void initVars()
-
copy
public HtmlDocumentFactory copy()
Returns a copy of this document factory. A new parser is allocated for the copy.
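Since copy() allocates a new parser for the copy, a plausible reading is that configuration is carried over while mutable working state is never shared. The sketch below illustrates that idea only; FlyweightCopySketch is a hypothetical class, not the MG4J implementation, and its two fields merely echo the maxAnchor and text fields listed above.

```java
// Hypothetical class showing a copy that keeps configuration but not working state.
class FlyweightCopySketch {
    final int maxAnchor; // configuration: carried over to the copy
    final char[] text;   // mutable working buffer: each instance owns its own

    FlyweightCopySketch(int maxAnchor, int bufferSize) {
        this.maxAnchor = maxAnchor;
        this.text = new char[bufferSize];
    }

    // The copy gets the same settings and a freshly allocated buffer.
    FlyweightCopySketch copy() {
        return new FlyweightCopySketch(maxAnchor, text.length);
    }

    public static void main(String[] args) {
        FlyweightCopySketch prototype = new FlyweightCopySketch(22, 1024);
        FlyweightCopySketch worker = prototype.copy();
        System.out.println(worker.maxAnchor == prototype.maxAnchor);
        System.out.println(worker.text != prototype.text);
    }
}
```

Under this assumption, a typical use is to keep one configured prototype and hand each worker thread its own copy, so that no two threads touch the same buffer.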
-
numberOfFields
public int numberOfFields()
Description copied from interface: DocumentFactory
Returns the number of fields present in the documents produced by this factory.
- Returns:
- the number of fields present in the documents produced by this factory.
-
fieldName
public String fieldName(int field)
Description copied from interface: DocumentFactory
Returns the symbolic name of a field.
- Parameters:
- field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
- Returns:
- the symbolic name of the field-th field.
-
fieldIndex
public int fieldIndex(String fieldName)
Description copied from interface: DocumentFactory
Returns the index of a field, given its symbolic name.
- Parameters:
- fieldName - the name of a field of this factory.
- Returns:
- the corresponding index, or -1 if there is no field with name fieldName.
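The relationship between numberOfFields(), fieldName() and fieldIndex() can be sketched as a small self-contained class. The class FieldContractSketch and the field names "text" and "title" are assumptions made for illustration; this page does not list the actual field names, and the exception thrown for an out-of-range index is likewise a guess.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical factory-like class; field names are illustrative only.
class FieldContractSketch {
    private final List<String> names;

    FieldContractSketch(String... names) {
        this.names = Arrays.asList(names);
    }

    int numberOfFields() {
        return names.size();
    }

    // Valid indices run from 0 inclusive to numberOfFields() exclusive.
    String fieldName(int field) {
        if (field < 0 || field >= numberOfFields()) {
            throw new IndexOutOfBoundsException("No such field: " + field);
        }
        return names.get(field);
    }

    // Returns -1 when no field has the given name.
    int fieldIndex(String fieldName) {
        return names.indexOf(fieldName);
    }

    public static void main(String[] args) {
        FieldContractSketch factory = new FieldContractSketch("text", "title");
        for (int i = 0; i < factory.numberOfFields(); i++) {
            // fieldIndex is the inverse of fieldName on valid indices.
            System.out.println(i + " -> " + factory.fieldName(i)
                    + " -> " + factory.fieldIndex(factory.fieldName(i)));
        }
        System.out.println(factory.fieldIndex("author")); // prints -1
    }
}
```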
-
fieldType
public DocumentFactory.FieldType fieldType(int field)
Description copied from interface: DocumentFactory
Returns the type of a field.
The possible types are defined in DocumentFactory.FieldType.
- Parameters:
- field - the index of a field (between 0 inclusive and DocumentFactory.numberOfFields() exclusive).
- Returns:
- the type of the field-th field.
-
getDocument
public Document getDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata) throws IOException
Description copied from interface: DocumentFactory
Returns the document obtained by parsing the given byte stream.
The parameter metadata makes up for the lack of a simple keyword-based parameter-passing system in Java. This method might take several different types of “suggestions” which have been collected by the collection: typically, the document title, a URI representing the document, its MIME type, its encoding and so on. Some of this information might be set by default (as happens, for instance, in a PropertyBasedDocumentFactory). Implementations of this method must consult the metadata provided by the collection, possibly completing them with default factory metadata, and then proceed to construct the document.
- Parameters:
- rawContent - the raw content from which the document should be extracted; it must not be closed, as resource management is a responsibility of the DocumentCollection.
- metadata - a map from enums (e.g., keys taken in PropertyBasedDocumentFactory) to various kinds of objects.
- Returns:
- the document obtained by parsing the given byte stream.
- Throws:
- IOException
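The rule that per-document metadata is consulted first and completed with factory defaults can be sketched as a two-level lookup. This is a minimal illustration under stated assumptions: MetadataResolutionSketch, its Keys enum and its resolve helper are hypothetical, and plain java.util maps replace the Reference2ObjectMap used by the real signature.

```java
import java.util.HashMap;
import java.util.Map;

class MetadataResolutionSketch {
    // Hypothetical keys, standing in for the enums used as metadata keys.
    enum Keys { TITLE, ENCODING }

    // Prefer the value supplied by the collection; otherwise use the factory default.
    static Object resolve(Enum<?> key, Map<Enum<?>, Object> metadata, Map<Enum<?>, Object> defaults) {
        Object value = metadata.get(key);
        return value != null ? value : defaults.get(key);
    }

    public static void main(String[] args) {
        Map<Enum<?>, Object> defaults = new HashMap<>();
        defaults.put(Keys.ENCODING, "UTF-8");

        Map<Enum<?>, Object> metadata = new HashMap<>();
        metadata.put(Keys.TITLE, "Example page");

        // ENCODING is missing from the per-document map, so the default is used.
        System.out.println(resolve(Keys.ENCODING, metadata, defaults));
        // TITLE comes straight from the per-document map.
        System.out.println(resolve(Keys.TITLE, metadata, defaults));
    }
}
```

A lookup of this shape lets a collection override only the settings it knows about (for instance a per-document encoding) while everything else falls back to what the factory was configured with.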
-
-