Package it.unimi.di.big.mg4j.document
Class ZipDocumentCollectionBuilder
- java.lang.Object
  - it.unimi.di.big.mg4j.document.ZipDocumentCollectionBuilder
- All Implemented Interfaces:
DocumentCollectionBuilder
public class ZipDocumentCollectionBuilder extends Object implements DocumentCollectionBuilder
A builder for zipped document collections.
Constructor Summary
Constructor: ZipDocumentCollectionBuilder(String basename, DocumentFactory factory, boolean exact)
Description: Creates a new zipped collection builder.
-
Method Summary
Modifier and Type, Method, Description:
void add(MutableString word, MutableString nonWord)
String basename()
    Returns the basename of this builder.
void build(DocumentSequence inputSequence)
void close()
    Terminates the construction of the collection.
void endDocument()
    Ends a document entry.
void endTextField()
    Ends the current text field.
static void main(String[] arg)
void nonTextField(Object o)
    Adds a non-text field.
void open(CharSequence suffix)
    Opens a new collection.
void startDocument(CharSequence title, CharSequence uri)
    Starts a document entry.
void startTextField()
    Starts a new text field.
void virtualField(List<Scan.VirtualDocumentFragment> fragments)
    Adds a virtual field.
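The method names above suggest a call order: open, then for each document startDocument, one or more fields, endDocument, and finally close. The following is a minimal sketch of that order, not MG4J code: CallOrderSketch is a hypothetical stand-in that only records the calls it receives, it uses CharSequence where the real builder takes MutableString, and the title and URI are made-up values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that only records calls; it is NOT the MG4J class. */
class CallOrderSketch {
    /** Names of the calls received, in the order they arrived. */
    final List<String> calls = new ArrayList<>();

    void open(CharSequence suffix) { calls.add("open"); }
    void startDocument(CharSequence title, CharSequence uri) { calls.add("startDocument"); }
    void startTextField() { calls.add("startTextField"); }
    void add(CharSequence word, CharSequence nonWord) { calls.add("add"); }
    void endTextField() { calls.add("endTextField"); }
    void endDocument() { calls.add("endDocument"); }
    void close() { calls.add("close"); }

    /** Feeds one document with a single text field made of the given word/nonword pairs. */
    static List<String> buildOneDocument(String[][] pairs) {
        CallOrderSketch builder = new CallOrderSketch();
        builder.open("");                                          // empty suffix: assumption
        builder.startDocument("Title", "http://example.com/doc");  // made-up title and URI
        builder.startTextField();
        for (String[] pair : pairs) {
            builder.add(pair[0], pair[1]);                         // one word, one nonword
        }
        builder.endTextField();
        builder.endDocument();
        builder.close();
        return builder.calls;
    }

    public static void main(String[] args) {
        String[][] pairs = { { "hello", " " }, { "world", "." } };
        System.out.println(buildOneDocument(pairs));
    }
}
```

With the real class, the same sequence would be driven by a DocumentSequence, which is presumably what build(DocumentSequence) does on the caller's behalf.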
Constructor Detail
-
ZipDocumentCollectionBuilder
public ZipDocumentCollectionBuilder(String basename, DocumentFactory factory, boolean exact)
Creates a new zipped collection builder.
- Parameters:
factory - the factory of the base document sequence.
exact - true iff non-words should also be preserved.
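To illustrate what the exact flag could mean in practice, here is a minimal sketch under a stated assumption: in exact mode both words and non-words are kept, and otherwise only the words are. ExactFlagSketch and tokensToStore are hypothetical names, and the class does not reproduce how the zipped collection actually stores its text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the exact flag; it is NOT the MG4J implementation. */
class ExactFlagSketch {
    /**
     * Returns the tokens that would be kept for the given word/nonword pairs.
     * Assumption: inexact mode simply drops every nonword.
     */
    static List<String> tokensToStore(boolean exact, String[][] pairs) {
        List<String> kept = new ArrayList<>();
        for (String[] pair : pairs) {
            kept.add(pair[0]);          // the word is always kept
            if (exact) {
                kept.add(pair[1]);      // the nonword is kept only in exact mode
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[][] pairs = { { "hello", ", " }, { "world", "!" } };
        System.out.println(tokensToStore(true, pairs));
        System.out.println(tokensToStore(false, pairs));
    }
}
```

Under this assumption, only an exact collection could reproduce the original text character by character.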
Method Detail
-
open
public void open(CharSequence suffix) throws FileNotFoundException
Description copied from interface: DocumentCollectionBuilder
Opens a new collection.
- Specified by:
open in interface DocumentCollectionBuilder
- Parameters:
suffix - a suffix that will be added to the basename provided at construction time.
- Throws:
FileNotFoundException
-
basename
public String basename()
Description copied from interface: DocumentCollectionBuilder
Returns the basename of this builder.
- Specified by:
basename in interface DocumentCollectionBuilder
- Returns:
the basename
-
startDocument
public void startDocument(CharSequence title, CharSequence uri) throws IOException
Description copied from interface: DocumentCollectionBuilder
Starts a document entry.
- Specified by:
startDocument in interface DocumentCollectionBuilder
- Parameters:
title - the document title (usually, the result of Document.title()).
uri - the document URI (usually, the result of Document.uri()).
- Throws:
IOException
-
endDocument
public void endDocument() throws IOException
Description copied from interface: DocumentCollectionBuilder
Ends a document entry.
- Specified by:
endDocument in interface DocumentCollectionBuilder
- Throws:
IOException
-
startTextField
public void startTextField()
Description copied from interface: DocumentCollectionBuilder
Starts a new text field.
- Specified by:
startTextField in interface DocumentCollectionBuilder
-
nonTextField
public void nonTextField(Object o) throws IOException
Description copied from interface: DocumentCollectionBuilder
Adds a non-text field.
- Specified by:
nonTextField in interface DocumentCollectionBuilder
- Parameters:
o - the content of the non-text field.
- Throws:
IOException
-
virtualField
public void virtualField(List<Scan.VirtualDocumentFragment> fragments) throws IOException
Description copied from interface: DocumentCollectionBuilder
Adds a virtual field.
- Specified by:
virtualField in interface DocumentCollectionBuilder
- Parameters:
fragments - the virtual fragments to be added.
- Throws:
IOException
-
endTextField
public void endTextField() throws IOException
Description copied from interface: DocumentCollectionBuilder
Ends the current text field.
- Specified by:
endTextField in interface DocumentCollectionBuilder
- Throws:
IOException
-
add
public void add(MutableString word, MutableString nonWord) throws IOException
Description copied from interface: DocumentCollectionBuilder
Adds a word and a nonword to the current text field, provided that a text field has started but not yet ended; otherwise, does nothing.
Usually, word and nonWord are just the result of a call to WordReader.next(MutableString, MutableString).
- Specified by:
add in interface DocumentCollectionBuilder
- Parameters:
word - a word.
nonWord - a nonword.
- Throws:
IOException
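The rule that add has an effect only between startTextField and endTextField can be pictured with a simple flag. This is a minimal sketch of that guard, not the MG4J implementation: TextFieldGuardSketch and its stored list are hypothetical, and CharSequence stands in for MutableString.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the "only inside a text field" rule; NOT the MG4J class. */
class TextFieldGuardSketch {
    /** True between startTextField() and endTextField(). */
    private boolean inTextField = false;

    /** Word/nonword pairs accepted so far, each stored as one concatenated string. */
    final List<String> stored = new ArrayList<>();

    void startTextField() { inTextField = true; }

    void endTextField() { inTextField = false; }

    void add(CharSequence word, CharSequence nonWord) {
        if (!inTextField) {
            return;                              // outside a text field: silently ignored
        }
        stored.add(word.toString() + nonWord);   // inside a text field: keep the pair
    }

    public static void main(String[] args) {
        TextFieldGuardSketch sketch = new TextFieldGuardSketch();
        sketch.add("ignored", " ");              // no field open yet, so nothing is stored
        sketch.startTextField();
        sketch.add("hello", " ");
        sketch.endTextField();
        System.out.println(sketch.stored);
    }
}
```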
-
close
public void close() throws IOException
Description copied from interface: DocumentCollectionBuilder
Terminates the construction of the collection.
- Specified by:
close in interface DocumentCollectionBuilder
- Throws:
IOException
-
build
public void build(DocumentSequence inputSequence) throws IOException
- Throws:
IOException
-
main
public static void main(String[] arg) throws com.martiansoftware.jsap.JSAPException, IOException, ClassNotFoundException, InvocationTargetException, NoSuchMethodException, IllegalAccessException, InstantiationException, IllegalArgumentException, SecurityException
- Throws:
com.martiansoftware.jsap.JSAPException
IOException
ClassNotFoundException
InvocationTargetException
NoSuchMethodException
IllegalAccessException
InstantiationException
IllegalArgumentException
SecurityException