Interface DocumentCollectionBuilder
-
- All Known Implementing Classes:
SimpleCompressedDocumentCollectionBuilder, ZipDocumentCollectionBuilder
public interface DocumentCollectionBuilder
An interface for classes that can build collections during the indexing process.
A builder is usually based on a basename. Many different collections can be built using the same builder, using open(CharSequence) to specify a suffix that will be added to the basename. Creating several collections is a simple way to make collection construction scalable: for instance, Scan creates several collections, one per batch, and then puts them together using a ConcatenatedDocumentCollection.
After creating an instance of this class and opening a new collection, new documents can be added incrementally. Each document must be started with startDocument(CharSequence, CharSequence) and ended with endDocument(); inside each document, each non-text field must be written by passing an object to nonTextField(Object), whereas each text field must be started with startTextField() and ended with endTextField(): in between, a call to add(MutableString, MutableString) must be made for each word/nonword pair retrieved from the original collection. At the end, close() returns a ZipDocumentCollection that must be serialised.
Several collections (e.g., SimpleCompressedDocumentCollection, ZipDocumentCollection) can be exact or approximated: in the latter case, nonwords are not recorded to decrease space usage.
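
The call order described above can be sketched as follows. This is only an illustration, not MG4J code: SketchBuilder and RecordingBuilder are hypothetical stand-ins that mirror the method names of this interface, CharSequence replaces MutableString so that no external library is needed, the checked IOException is omitted, and the suffix, title and URI are made-up values.

```java
import java.util.ArrayList;
import java.util.List;

final class BuilderLifecycleSketch {
    /** Hypothetical stand-in mirroring the method names of DocumentCollectionBuilder. */
    interface SketchBuilder {
        void open(CharSequence suffix);
        void startDocument(CharSequence title, CharSequence uri);
        void startTextField();
        void add(CharSequence word, CharSequence nonWord);
        void endTextField();
        void nonTextField(Object o);
        void endDocument();
        void close();
    }

    /** Hypothetical implementation that only records the calls it receives. */
    static final class RecordingBuilder implements SketchBuilder {
        final List<String> calls = new ArrayList<>();
        public void open(CharSequence suffix) { calls.add("open(" + suffix + ")"); }
        public void startDocument(CharSequence title, CharSequence uri) { calls.add("startDocument(" + title + ", " + uri + ")"); }
        public void startTextField() { calls.add("startTextField()"); }
        public void add(CharSequence word, CharSequence nonWord) { calls.add("add(" + word + ", " + nonWord + ")"); }
        public void endTextField() { calls.add("endTextField()"); }
        public void nonTextField(Object o) { calls.add("nonTextField(" + o + ")"); }
        public void endDocument() { calls.add("endDocument()"); }
        public void close() { calls.add("close()"); }
    }

    /** Builds a collection with one document: one text field, then one non-text field. */
    static List<String> buildOneDocument() {
        RecordingBuilder builder = new RecordingBuilder();
        builder.open("-0");                                        // made-up suffix
        builder.startDocument("A title", "http://example.com/a");  // made-up title and URI
        builder.startTextField();
        builder.add("hello", ", ");                                // one call per word/nonword pair
        builder.add("world", "!");
        builder.endTextField();
        builder.nonTextField(Integer.valueOf(2024));               // any object can be passed
        builder.endDocument();
        builder.close();
        return builder.calls;
    }

    public static void main(String[] args) {
        buildOneDocument().forEach(System.out::println);
    }
}
```

With a real implementation, the same sequence of calls would be made on a SimpleCompressedDocumentCollectionBuilder or ZipDocumentCollectionBuilder, handling IOException where the methods declare it.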
-
-
Method Summary
Modifier and Type   Method                                                       Description
void                add(MutableString word, MutableString nonWord)               Adds a word and a nonword to the current text field.
String              basename()                                                   Returns the basename of this builder.
void                close()                                                      Terminates the construction of the collection.
void                endDocument()                                                Ends a document entry.
void                endTextField()                                               Ends the current text field.
void                nonTextField(Object o)                                       Adds a non-text field.
void                open(CharSequence suffix)                                    Opens a new collection.
void                startDocument(CharSequence title, CharSequence uri)          Starts a document entry.
void                startTextField()                                             Starts a new text field.
void                virtualField(List<Scan.VirtualDocumentFragment> fragments)   Adds a virtual field.
-
-
-
Method Detail
-
basename
String basename()
Returns the basename of this builder.
- Returns:
the basename
-
open
void open(CharSequence suffix) throws IOException
Opens a new collection.
- Parameters:
suffix - a suffix that will be added to the basename provided at construction time.
- Throws:
IOException
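
Reusing one builder for several batches might look like the sketch below. It is illustrative only: BatchSuffixSketch is a hypothetical class, the "-" plus batch index suffix scheme is made up, and the assumption that the collection name is simply the basename followed by the suffix is not confirmed by this page.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSuffixSketch {
    private final String basename;
    private final List<String> opened = new ArrayList<>();

    BatchSuffixSketch(String basename) {
        this.basename = basename;
    }

    String basename() {
        return basename;
    }

    /** Assumption: the collection name is the basename followed directly by the suffix. */
    void open(CharSequence suffix) {
        opened.add(basename + suffix);
    }

    List<String> opened() {
        return opened;
    }

    /** Opens one collection per batch on the same builder, using a made-up suffix scheme. */
    static List<String> openBatches(String basename, int batches) {
        BatchSuffixSketch builder = new BatchSuffixSketch(basename);
        for (int i = 0; i < batches; i++) {
            builder.open("-" + i);
        }
        return builder.opened();
    }

    public static void main(String[] args) {
        System.out.println(openBatches("corpus", 3));
    }
}
```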
-
startDocument
void startDocument(CharSequence title, CharSequence uri) throws IOException
Starts a document entry.
- Parameters:
title - the document title (usually, the result of Document.title()).
uri - the document URI (usually, the result of Document.uri()).
- Throws:
IOException
-
endDocument
void endDocument() throws IOException
Ends a document entry.
- Throws:
IOException
-
startTextField
void startTextField()
Starts a new text field.
-
endTextField
void endTextField() throws IOException
Ends the current text field.
- Throws:
IOException
-
nonTextField
void nonTextField(Object o) throws IOException
Adds a non-text field.
- Parameters:
o - the content of the non-text field.
- Throws:
IOException
-
virtualField
void virtualField(List<Scan.VirtualDocumentFragment> fragments) throws IOException
Adds a virtual field.
- Parameters:
fragments - the virtual fragments to be added.
- Throws:
IOException
-
add
void add(MutableString word, MutableString nonWord) throws IOException
Adds a word and a nonword to the current text field, provided that a text field has started but not yet ended; otherwise, does nothing.
Usually, word and nonWord are just the result of a call to WordReader.next(MutableString, MutableString).
- Parameters:
word - a word.
nonWord - a nonword.
- Throws:
IOException
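
The contract of add() can be illustrated with a small sketch. It is not how the MG4J builders are implemented: AddContractSketch is a hypothetical class, CharSequence replaces MutableString, and the exact flag is an assumed way of modelling the exact/approximated distinction described in the interface overview.

```java
import java.util.ArrayList;
import java.util.List;

final class AddContractSketch {
    private final boolean exact;          // assumed flag: false models an approximated collection
    private boolean inTextField;          // true between startTextField() and endTextField()
    private final List<String> words = new ArrayList<>();
    private final List<String> nonWords = new ArrayList<>();

    AddContractSketch(boolean exact) {
        this.exact = exact;
    }

    void startTextField() {
        inTextField = true;
    }

    void endTextField() {
        inTextField = false;
    }

    /** Records the pair only while a text field is open; otherwise does nothing. */
    void add(CharSequence word, CharSequence nonWord) {
        if (!inTextField) return;
        words.add(word.toString());
        // An approximated collection does not record nonwords, to save space.
        if (exact) nonWords.add(nonWord.toString());
    }

    List<String> words() {
        return words;
    }

    List<String> nonWords() {
        return nonWords;
    }

    public static void main(String[] args) {
        AddContractSketch sketch = new AddContractSketch(true);
        sketch.add("ignored", " ");       // no text field is open: the call has no effect
        sketch.startTextField();
        sketch.add("hello", ", ");
        sketch.endTextField();
        System.out.println(sketch.words() + " " + sketch.nonWords());
    }
}
```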
-
close
void close() throws IOException
Terminates the construction of the collection.
- Throws:
IOException
-
-