it.unimi.di.mg4j.document
Interface DocumentCollectionBuilder

All Known Implementing Classes:
SimpleCompressedDocumentCollectionBuilder, ZipDocumentCollectionBuilder

public interface DocumentCollectionBuilder

An interface for classes that can build collections during the indexing process.

A builder is usually based on a basename. Many different collections can be built using the same builder, using open(CharSequence) to specify a suffix that will be added to the basename. Creating several collections is a simple way to make collection construction scalable: for instance, Scan creates several collections, one per batch, and then puts them together using a ConcatenatedDocumentCollection.
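
As a rough illustration of the one-collection-per-batch idea, the sketch below computes the names that would result from appending a per-batch suffix to a basename. The class BatchNamingSketch, the method collectionNames and the "-N" suffix scheme are assumptions made for this example; the scheme Scan actually uses is not described here.

```java
import java.util.ArrayList;
import java.util.List;

class BatchNamingSketch {
    /** Returns one name (basename + suffix) per batch of at most batchSize documents. */
    static List<String> collectionNames(String basename, int documents, int batchSize) {
        List<String> names = new ArrayList<>();
        int batches = (documents + batchSize - 1) / batchSize; // ceiling division
        for (int b = 0; b < batches; b++) {
            String suffix = "-" + b; // hypothetical suffix, the kind of value passed to open(suffix)
            names.add(basename + suffix);
        }
        return names;
    }

    public static void main(String[] args) {
        // 10 documents in batches of 4 give three collections.
        System.out.println(collectionNames("corpus", 10, 4));
    }
}
```

Each resulting collection can then be built independently, which is what makes the final concatenation step possible.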

After creating an instance of a class implementing this interface and opening a new collection, it is possible to add new documents incrementally. Each document must be started with startDocument(CharSequence, CharSequence) and ended with endDocument(); inside each document, each non-text field must be written by passing an object to nonTextField(Object), whereas each text field must be started with startTextField() and ended with endTextField(): in between, a call to add(MutableString, MutableString) must be made for each word/nonword pair retrieved from the original collection. At the end, a call to close() terminates the construction of the collection.
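
The call sequence described above can be sketched as follows. To stay self-contained, the example does not use MG4J itself: MiniBuilder is a hypothetical stand-in that mirrors a subset of this interface (with CharSequence in place of MutableString), and RecordingBuilder is a toy implementation that only logs the calls it receives. The field order and the sample values are likewise assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class BuilderProtocolSketch {
    /** Hypothetical stand-in for a subset of DocumentCollectionBuilder. */
    interface MiniBuilder {
        void open(CharSequence suffix) throws IOException;
        void startDocument(CharSequence title, CharSequence uri) throws IOException;
        void startTextField();
        void add(CharSequence word, CharSequence nonWord) throws IOException;
        void endTextField() throws IOException;
        void nonTextField(Object o) throws IOException;
        void endDocument() throws IOException;
        void close() throws IOException;
    }

    /** Toy implementation that records each call instead of writing a collection. */
    static class RecordingBuilder implements MiniBuilder {
        final List<String> log = new ArrayList<>();
        private boolean inTextField;

        public void open(CharSequence suffix) { log.add("open(" + suffix + ")"); }
        public void startDocument(CharSequence title, CharSequence uri) { log.add("startDocument(" + title + ")"); }
        public void startTextField() { inTextField = true; log.add("startTextField"); }
        public void add(CharSequence word, CharSequence nonWord) {
            if (!inTextField) return; // outside a text field the call is ignored
            log.add("add(" + word + ")");
        }
        public void endTextField() { inTextField = false; log.add("endTextField"); }
        public void nonTextField(Object o) { log.add("nonTextField(" + o + ")"); }
        public void endDocument() { log.add("endDocument"); }
        public void close() { log.add("close"); }
    }

    /** Writes one document made of a text field followed by a non-text field. */
    static void writeDocument(MiniBuilder builder, String title, String uri,
                              String[][] pairs, Object nonText) throws IOException {
        builder.startDocument(title, uri);
        builder.startTextField();
        for (String[] pair : pairs) builder.add(pair[0], pair[1]); // one call per word/nonword pair
        builder.endTextField();
        builder.nonTextField(nonText);
        builder.endDocument();
    }

    /** Runs the whole sequence: open, one document, close. */
    static List<String> run() {
        RecordingBuilder builder = new RecordingBuilder();
        try {
            builder.open("-0"); // hypothetical suffix
            writeDocument(builder, "Hello", "http://example.com/hello",
                    new String[][] { { "hello", " " }, { "world", "\n" } }, 2024);
            builder.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return builder.log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

With a real implementation, the word/nonword pairs would typically come from a WordReader rather than from a literal array.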

Several collections (e.g., SimpleCompressedDocumentCollection, ZipDocumentCollection) can be exact or approximated: in the latter case, nonwords are not recorded to decrease space usage.


Method Summary
 void add(MutableString word, MutableString nonWord)
          Adds a word and a nonword to the current text field, provided that a text field has started but not yet ended; otherwise, doesn't do anything.
 String basename()
          Returns the basename of this builder.
 void close()
          Terminates the construction of the collection.
 void endDocument()
          Ends a document entry.
 void endTextField()
          Ends the current text field.
 void nonTextField(Object o)
          Adds a non-text field.
 void open(CharSequence suffix)
          Opens a new collection.
 void startDocument(CharSequence title, CharSequence uri)
          Starts a document entry.
 void startTextField()
          Starts a new text field.
 void virtualField(ObjectList<Scan.VirtualDocumentFragment> fragments)
          Adds a virtual field.
 

Method Detail

basename

String basename()
Returns the basename of this builder.

Returns:
the basename

open

void open(CharSequence suffix)
          throws IOException
Opens a new collection.

Parameters:
suffix - a suffix that will be added to the basename provided at construction time.
Throws:
IOException

startDocument

void startDocument(CharSequence title,
                   CharSequence uri)
                   throws IOException
Starts a document entry.

Parameters:
title - the document title (usually, the result of Document.title()).
uri - the document uri (usually, the result of Document.uri()).
Throws:
IOException

endDocument

void endDocument()
                 throws IOException
Ends a document entry.

Throws:
IOException

startTextField

void startTextField()
Starts a new text field.


endTextField

void endTextField()
                  throws IOException
Ends the current text field.

Throws:
IOException

nonTextField

void nonTextField(Object o)
                  throws IOException
Adds a non-text field.

Parameters:
o - the content of the non-text field.
Throws:
IOException

virtualField

void virtualField(ObjectList<Scan.VirtualDocumentFragment> fragments)
                  throws IOException
Adds a virtual field.

Parameters:
fragments - the virtual fragments to be added.
Throws:
IOException

add

void add(MutableString word,
         MutableString nonWord)
         throws IOException
Adds a word and a nonword to the current text field, provided that a text field has started but not yet ended; otherwise, doesn't do anything.

Usually, word and nonWord are just the result of a call to WordReader.next(MutableString, MutableString).

Parameters:
word - a word.
nonWord - a nonword.
Throws:
IOException

close

void close()
           throws IOException
Terminates the construction of the collection.

Throws:
IOException