Class ZipDocumentCollection

  • All Implemented Interfaces:
    DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable

    public class ZipDocumentCollection
    extends AbstractDocumentCollection
    implements Serializable
    A document collection stored in a zip file.

    Each instance of this class has an associated zip file. Each zip entry corresponds to a document: the title is recorded in the comment field, whereas the URI is written with MutableString.writeSelfDelimUTF8(java.io.OutputStream) directly to the zipped output stream. When building an exact ZipDocumentCollection, subsequent word/non-word pairs are written in the same way and delimited by two empty strings. If the collection is not exact, only words are written, delimited by an empty string. Non-text fields are written directly to the zipped output stream as serialised objects.
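
    The entry layout described above can be sketched with the JDK alone. This is a minimal illustration, not the class's actual code: DataOutputStream.writeUTF is used as a stand-in for MutableString.writeSelfDelimUTF8 (the two byte formats differ), only the non-exact layout is shown (words followed by one empty string), and the class name, method names, and the entry name "0" are hypothetical.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: one zip entry per document, title in the entry comment.
class ZipEntryLayoutSketch {

    // Writes title (as comment), then URI, then words, then an empty-string delimiter.
    static void writeDocument(ZipOutputStream zip, String entryName, String title,
                              String uri, List<String> words) throws IOException {
        ZipEntry entry = new ZipEntry(entryName);
        entry.setComment(title);
        zip.putNextEntry(entry);
        DataOutputStream out = new DataOutputStream(zip);
        out.writeUTF(uri);              // stand-in for writeSelfDelimUTF8
        for (String word : words) {
            out.writeUTF(word);         // words are assumed to be non-empty
        }
        out.writeUTF("");               // empty string marks the end of the field
        out.flush();
        zip.closeEntry();
    }

    // Returns [title, uri, word1, word2, ...] for the given entry.
    static List<String> readDocument(Path zipPath, String entryName) throws IOException {
        List<String> result = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            result.add(entry.getComment());
            try (DataInputStream in = new DataInputStream(zip.getInputStream(entry))) {
                result.add(in.readUTF());
                for (String word = in.readUTF(); !word.isEmpty(); word = in.readUTF()) {
                    result.add(word);
                }
            }
        }
        return result;
    }

    // Writes one document to a temporary zip file and reads it back.
    static List<String> roundTrip(String title, String uri, List<String> words) {
        try {
            Path zipPath = Files.createTempFile("zip-layout-sketch", ".zip");
            try {
                try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipPath))) {
                    writeDocument(zip, "0", title, uri, words);
                }
                return readDocument(zipPath, "0");
            } finally {
                Files.deleteIfExists(zipPath);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("A title", "http://example.com/a",
                Arrays.asList("hello", "world")));
    }
}
```

    The sketch reads the title through ZipFile rather than ZipInputStream because entry comments live in the zip central directory, which a streaming reader does not see.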

    The collection will produce the same documents as the original sequence whence it was produced, in the following sense:

    • the resulting collection has as many documents as the original sequence, in the same order, with the same titles and URIs;
    • every document has the same number of fields, with the same names and types;
    • non-textual non-virtual fields will be written out as objects, so they need to be serializable;
    • virtual fields will be written as a sequence of self-delimiting UTF-8 mutable strings starting with the number of fragments (converted into a string with String.valueOf(int)), followed by a pair of strings for each fragment (the first string being the document specifier, and the second being the associated text);
    • textual fields will be written out in such a way that, when reading them, the same sequence of words and non-words will be produced; alternatively, one may produce a collection that only copies words (non-words are not copied).
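
    The virtual-field layout in the list above (a fragment count converted with String.valueOf(int), then one specifier/text pair per fragment) might look as follows. This is only a sketch under stated assumptions: DataOutputStream.writeUTF again stands in for the self-delimiting UTF-8 format, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the virtual-field layout: count, then (specifier, text) pairs.
class VirtualFieldLayoutSketch {

    // Each fragment is a pair: key = document specifier, value = associated text.
    static byte[] encode(List<Map.Entry<String, String>> fragments) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            // The count is written as a string, not as a binary integer.
            out.writeUTF(String.valueOf(fragments.size()));
            for (Map.Entry<String, String> fragment : fragments) {
                out.writeUTF(fragment.getKey());
                out.writeUTF(fragment.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static List<Map.Entry<String, String>> decode(byte[] data) {
        List<Map.Entry<String, String>> fragments = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = Integer.parseInt(in.readUTF());
            for (int i = 0; i < count; i++) {
                String specifier = in.readUTF();
                String text = in.readUTF();
                fragments.add(new AbstractMap.SimpleEntry<>(specifier, text));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fragments;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> fragments = Arrays.asList(
                new AbstractMap.SimpleEntry<>("http://example.com/b", "anchor text"));
        System.out.println(decode(encode(fragments)));
    }
}
```

    Writing the count first lets a reader know how many pairs to consume without needing an extra terminator.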

    Like any other collection, this collection is serialized to a file, but it refers to a separate zip file that contains the documents themselves. Please use AbstractDocumentSequence.load(CharSequence) to load instances of this collection.

    Note that the zip format is not designed for a large number of files. This class is mainly a useful example, and a handy way to quickly build a collection containing all fields at indexing time. For a more efficient kind of collection, see SimpleCompressedDocumentCollection.

    Warning: the Reader returned by Document.content(int) for documents produced by this factory is just obtained as the concatenation of words and non-words returned by the word reader for that field. If the collection is not exact, non-words are replaced by a space.
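
    The warning above can be illustrated with a small sketch of how the content text would be rebuilt in the two cases. The class and method names are hypothetical, and the sketch assumes that each missing non-word becomes exactly one space, including after the last word; the actual class may differ in such details.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: rebuilding field content from stored words and non-words.
class ContentReconstructionSketch {

    // Exact collection: interleave each word with the non-word that followed it.
    static String exactContent(List<String> words, List<String> nonWords) {
        StringBuilder content = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            content.append(words.get(i)).append(nonWords.get(i));
        }
        return content.toString();
    }

    // Non-exact collection: non-words were not stored, so a space takes their place.
    static String inexactContent(List<String> words) {
        StringBuilder content = new StringBuilder();
        for (String word : words) {
            content.append(word).append(' ');
        }
        return content.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Hello", "world");
        System.out.println(exactContent(words, Arrays.asList(", ", "!")));
        System.out.println(inexactContent(words));
    }
}
```

    In the non-exact case the original punctuation and spacing cannot be recovered, so the reader's output only approximates the source text.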

    See Also:
    Serialized Form