it.unimi.di.mg4j.document
Class HtmlDocumentFactory.HtmlDocument

java.lang.Object
  extended by it.unimi.di.mg4j.document.AbstractDocument
      extended by it.unimi.di.mg4j.document.HtmlDocumentFactory.HtmlDocument
All Implemented Interfaces:
Document, SafelyCloseable, Closeable
Enclosing class:
HtmlDocumentFactory

protected class HtmlDocumentFactory.HtmlDocument
extends AbstractDocument

An HTML document. If a TITLE element is available, it will be used for title() instead of the default value.

Parsing is delayed until it is actually needed, so operations such as getting the document URI do not require parsing.
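
The deferred-parsing behaviour described above can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the MG4J implementation: the name LazyHtmlDocumentSketch, its three-argument constructor, the regular-expression title extraction and the parseCount field are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in (not the MG4J class): reads its raw content only on first need. */
class LazyHtmlDocumentSketch {
    // Assumption: a regular expression is enough to find a TITLE element in this sketch.
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final InputStream rawContent;
    private final String uri;          // supplied up front, so returning it needs no parsing
    private final String defaultTitle; // used when no TITLE element is found
    private String parsedTitle;
    private boolean parsed;
    int parseCount;                    // exposed only to show how often parsing ran

    LazyHtmlDocumentSketch(InputStream rawContent, String uri, String defaultTitle) {
        this.rawContent = rawContent;
        this.uri = uri;
        this.defaultTitle = defaultTitle;
    }

    /** Reads and scans the raw content at most once. */
    private void ensureParsed() {
        if (parsed) return;
        try {
            String html = new String(rawContent.readAllBytes(), StandardCharsets.UTF_8);
            Matcher m = TITLE.matcher(html);
            if (m.find()) parsedTitle = m.group(1).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        parsed = true;
        parseCount++;
    }

    /** Does not touch the raw content. */
    CharSequence uri() {
        return uri;
    }

    /** Triggers parsing on first call; falls back to the default title. */
    CharSequence title() {
        ensureParsed();
        return parsedTitle != null ? parsedTitle : defaultTitle;
    }
}
```

In this sketch, calling uri() leaves parseCount at zero, while the first call to title() reads the stream and later calls reuse the cached result.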


Constructor Summary
protected HtmlDocumentFactory.HtmlDocument(InputStream rawContent, Reference2ObjectMap<Enum<?>,Object> metadata)
           
 
Method Summary
 Object content(int field)
          Returns the content of the given field.
 CharSequence title()
          The title of this document.
 String toString()
           
 CharSequence uri()
          A URI that is associated with this document.
 WordReader wordReader(int field)
          Returns a word reader for the given DocumentFactory.FieldType.TEXT field.
 
Methods inherited from class it.unimi.di.mg4j.document.AbstractDocument
close, finalize
 
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

HtmlDocumentFactory.HtmlDocument

protected HtmlDocumentFactory.HtmlDocument(InputStream rawContent,
                                           Reference2ObjectMap<Enum<?>,Object> metadata)

Method Detail

title

public CharSequence title()
Description copied from interface: Document
The title of this document.

Returns:
the title to be used to refer to this document, or null.

toString

public String toString()
Overrides:
toString in class AbstractDocument

uri

public CharSequence uri()
Description copied from interface: Document
A URI that is associated with this document.

Returns:
the URI associated with this document, or null.

content

public Object content(int field)
               throws IOException
Description copied from interface: Document
Returns the content of the given field.

Parameters:
field - the field index.
Returns:
the field content; the actual type depends on the field type, as specified by the DocumentFactory that built this document. For example, the returned object is a Reader if the field type is DocumentFactory.FieldType.TEXT.
Throws:
IOException
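
Because content(int) returns Object, a caller has to cast according to the field type declared by the factory. The following is a minimal sketch of that dispatch using only the standard library. SketchFieldType and ContentSketch are hypothetical stand-ins, and the field layout (field 0 is text, field 1 is an integer held as a Number) is an assumption for illustration, not something this page specifies.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a subset of field types; not the MG4J enum. */
enum SketchFieldType { TEXT, INT }

class ContentSketch {
    /** Hypothetical document: field 0 is TEXT, field 1 is INT. */
    static Object content(int field) {
        switch (field) {
            case 0: return new StringReader("lazy parsing of HTML");
            case 1: return Long.valueOf(42);
            default: throw new IndexOutOfBoundsException("No such field: " + field);
        }
    }

    /** Consumes a field's content according to the type declared for that field. */
    static String describe(SketchFieldType type, Object content) {
        switch (type) {
            case TEXT: {
                // For a text field the content is a Reader, so cast and drain it.
                StringBuilder text = new StringBuilder();
                try (Reader reader = (Reader) content) {
                    char[] buffer = new char[1024];
                    int n;
                    while ((n = reader.read(buffer)) != -1) text.append(buffer, 0, n);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return text.toString();
            }
            case INT:
                // Assumption of this sketch: integer fields are returned as a Number.
                return String.valueOf(((Number) content).longValue());
            default:
                throw new IllegalArgumentException("Unhandled field type: " + type);
        }
    }
}
```

The cast is checked only at run time, so a mismatch between the declared field type and the returned object shows up as a ClassCastException in the caller.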

wordReader

public WordReader wordReader(int field)
Description copied from interface: Document
Returns a word reader for the given DocumentFactory.FieldType.TEXT field.

Parameters:
field - the field index.
Returns:
a word reader that should be used to break the given field into words.
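
As a rough illustration of what breaking a text field means, the sketch below splits a Reader into alternating runs of word and non-word characters. SketchWordReader is a hypothetical stand-in written against the standard library, not the WordReader interface returned by this method. Its method names and the rule that words are runs of letters and digits are assumptions made for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a word reader; not the MG4J/dsiutils interface. */
class SketchWordReader {
    private Reader reader;
    private int pending = -2; // -2: nothing buffered, -1: end of input, otherwise a buffered char

    SketchWordReader setReader(Reader reader) {
        this.reader = reader;
        this.pending = -2;
        return this;
    }

    /** Returns the next character without consuming it, or -1 at end of input. */
    private int peek() throws IOException {
        if (pending == -2) pending = reader.read();
        return pending;
    }

    /** Consumes and returns the next character, or -1 at end of input. */
    private int take() throws IOException {
        int c = peek();
        if (c != -1) pending = -2;
        return c;
    }

    /**
     * Fills word with the next run of letters and digits (possibly empty) and
     * nonWord with the run of other characters that follows it.
     * Returns false once the input is exhausted.
     */
    boolean next(StringBuilder word, StringBuilder nonWord) throws IOException {
        word.setLength(0);
        nonWord.setLength(0);
        if (peek() == -1) return false;
        while (peek() != -1 && Character.isLetterOrDigit(peek())) word.append((char) take());
        while (peek() != -1 && !Character.isLetterOrDigit(peek())) nonWord.append((char) take());
        return true;
    }

    /** Collects the non-empty words of a text field's content. */
    static List<String> words(Reader content) {
        List<String> result = new ArrayList<>();
        SketchWordReader wordReader = new SketchWordReader().setReader(content);
        StringBuilder word = new StringBuilder();
        StringBuilder nonWord = new StringBuilder();
        try {
            while (wordReader.next(word, nonWord)) {
                if (word.length() > 0) result.add(word.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Here the Reader passed to words would be the object returned by content(field) for a text field. Pairing each word with the non-word run after it keeps punctuation and whitespace available to callers that need them.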