Package it.unimi.di.big.mg4j.document
Class HtmlDocumentFactory.HtmlDocument
- java.lang.Object
  - it.unimi.di.big.mg4j.document.AbstractDocument
    - it.unimi.di.big.mg4j.document.HtmlDocumentFactory.HtmlDocument
- All Implemented Interfaces:
Document, SafelyCloseable, Closeable, AutoCloseable
- Enclosing class:
- HtmlDocumentFactory
protected class HtmlDocumentFactory.HtmlDocument extends AbstractDocument
An HTML document. If a TITLE element is available, it will be used for title() instead of the default value.
Parsing is delayed until it is actually necessary, so operations like getting the document URI do not require parsing.
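The deferred parsing described above can be illustrated with a minimal, self-contained sketch. It is not the MG4J implementation: the class name LazyHtmlDocumentSketch, the of(...) factory, the isParsed() accessor, and the regex-based TITLE extraction are all assumptions made for illustration. Only the idea of a parsed flag guarded by an ensureParsed() step comes from this page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; this is NOT the MG4J class. */
class LazyHtmlDocumentSketch {
    // Naive TITLE matcher, used only to keep the sketch short (assumption).
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final InputStream rawContent; // cached raw content, read at most once
    private final String uri;             // known up front, e.g. from metadata
    private final String defaultTitle;    // used when no TITLE element is found
    private boolean parsed;               // whether we already parsed the document
    private String parsedTitle;           // null until parsed, or if TITLE is absent

    LazyHtmlDocumentSketch(InputStream rawContent, String uri, String defaultTitle) {
        this.rawContent = rawContent;
        this.uri = uri;
        this.defaultTitle = defaultTitle;
    }

    /** Convenience factory (hypothetical) that wraps a string as UTF-8 bytes. */
    static LazyHtmlDocumentSketch of(String html, String uri, String defaultTitle) {
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        return new LazyHtmlDocumentSketch(in, uri, defaultTitle);
    }

    boolean isParsed() {
        return parsed;
    }

    /** Answered without touching rawContent, so no parsing is triggered. */
    CharSequence uri() {
        return uri;
    }

    /** Triggers parsing on first use; falls back to the default title. */
    CharSequence title() {
        try {
            ensureParsed();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parsedTitle != null ? parsedTitle : defaultTitle;
    }

    /** Parses the raw content once; later calls return immediately. */
    private void ensureParsed() throws IOException {
        if (parsed) return;
        String html = new String(rawContent.readAllBytes(), StandardCharsets.UTF_8);
        Matcher m = TITLE.matcher(html);
        if (m.find()) parsedTitle = m.group(1).trim();
        parsed = true;
    }
}
```

In this sketch, calling uri() leaves isParsed() false, while the first call to title() reads the stream and flips the flag. Wrapping IOException in UncheckedIOException is a choice of the sketch so that title() can keep a signature without checked exceptions.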
-
Field Summary
Fields
| Modifier and Type | Field | Description |
|---|---|---|
| protected Reference2ObjectMap<Enum<?>,Object> | metadata | |
| protected boolean | parsed | Whether we already parsed the document. |
| protected InputStream | rawContent | The cached raw content. |
-
Constructor Summary
Constructors
| Modifier | Constructor | Description |
|---|---|---|
| protected | HtmlDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata) | |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| Object | content(int field) | Returns the content of the given field. |
| protected void | ensureParsed() | |
| CharSequence | title() | The title of this document. |
| String | toString() | |
| CharSequence | uri() | A URI that is associated with this document. |
| WordReader | wordReader(int field) | Returns a word reader for the given DocumentFactory.FieldType.TEXT field. |
-
Methods inherited from class it.unimi.di.big.mg4j.document.AbstractDocument
close, finalize
-
Field Detail
-
metadata
protected final Reference2ObjectMap<Enum<?>,Object> metadata
-
parsed
protected boolean parsed
Whether we already parsed the document.
-
rawContent
protected final InputStream rawContent
The cached raw content.
-
Constructor Detail
-
HtmlDocument
protected HtmlDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
-
Method Detail
-
ensureParsed
protected void ensureParsed() throws IOException
- Throws:
IOException
-
title
public CharSequence title()
Description copied from interface: Document
The title of this document.
- Returns:
the title to be used to refer to this document.
-
toString
public String toString()
- Overrides:
toString in class AbstractDocument
-
uri
public CharSequence uri()
Description copied from interface: Document
A URI that is associated with this document.
- Returns:
the URI associated with this document, or null.
-
content
public Object content(int field) throws IOException
Description copied from interface: Document
Returns the content of the given field.
- Parameters:
field - the field index.
- Returns:
the field content; the actual type depends on the field type, as specified by the DocumentFactory that built this document. For example, the returned object is going to be a Reader if the field type is DocumentFactory.FieldType.TEXT.
- Throws:
IOException
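Because content(int) is declared to return Object, a caller has to check the runtime type before using the result. The following is a minimal sketch of that check using only the JDK. FieldContentSketch and asText are hypothetical names, and the fallback to String.valueOf for non-Reader content is an assumption of the sketch, not documented MG4J behavior.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical helper for illustration; not part of MG4J. */
class FieldContentSketch {
    /**
     * Turns a field's content into text: drains it if it is a Reader
     * (the case described for TEXT fields), otherwise uses its string form.
     */
    static String asText(Object content) {
        if (content instanceof Reader) {
            Reader reader = (Reader) content;
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            try {
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    text.append(buffer, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return text.toString();
        }
        // Assumption: other field types are simply rendered via toString().
        return String.valueOf(content);
    }
}
```

With a real document, the argument would be the value returned by content(field). The sketch does not close the reader, since ownership of that reader is not specified on this page.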
-
wordReader
public WordReader wordReader(int field)
Description copied from interface: Document
Returns a word reader for the given DocumentFactory.FieldType.TEXT field.
- Parameters:
field - the field index.
- Returns:
a word reader object that should be used to break the given field into words.
-