Class SimpleCompressedDocumentCollection

  • All Implemented Interfaces:
    DocumentCollection, DocumentSequence, SafelyCloseable, FlyweightPrototype<DocumentCollection>, Closeable, Serializable, AutoCloseable

    public class SimpleCompressedDocumentCollection
    extends AbstractDocumentCollection
    implements Serializable
    A basic, compressed document collection that can be easily built at indexing time.

    Instances of this class record virtual and non-text fields just like ZipDocumentCollection, that is, in a zip file. Text fields, however, are recorded in a simple but highly efficient format. Terms (and nonterms) are assigned increasing global numbers in the order in which they are first met. While we scan each document, we keep track of frequencies for a limited number of terms: a term is encoded by its frequency rank if we know its statistics, or by a special code derived from its global number if we have no statistics about it. Every number involved is written in delta code.
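
    To illustrate what writing a number "in delta code" involves, here is a minimal sketch of the textbook Elias delta code for positive integers. The class name EliasDeltaSketch and its methods are hypothetical and are not part of this class; bits are kept in a String purely for readability, and the bit layout actually written by this collection (for instance, how zero is handled) may differ.

```java
// Hypothetical illustration: textbook Elias delta coding on bit strings.
// This is NOT the code used by SimpleCompressedDocumentCollection.
class EliasDeltaSketch {

    /** Elias gamma code of n >= 1: (bit length - 1) zeros, then n in binary. */
    static String gamma(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        String binary = Integer.toBinaryString(n);
        StringBuilder code = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) code.append('0');
        return code.append(binary).toString();
    }

    /** Elias delta code of n >= 1: gamma code of the bit length, then n without its leading 1. */
    static String delta(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        String binary = Integer.toBinaryString(n);
        return gamma(binary.length()) + binary.substring(1);
    }

    /** Decodes a single delta-coded integer from the start of the given bit string. */
    static int decodeDelta(String bits) {
        // Count the leading zeros of the gamma-coded bit length.
        int zeros = 0;
        while (bits.charAt(zeros) == '0') zeros++;
        // The next zeros + 1 bits hold the bit length of the encoded number.
        int length = Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
        int start = 2 * zeros + 1;
        // Restore the implicit leading 1 and parse the remaining length - 1 bits.
        return Integer.parseInt("1" + bits.substring(start, start + length - 1), 2);
    }
}
```

    Small numbers get short codes (for example, 1 is a single bit), which is why assigning low frequency ranks to the most common terms pays off.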

    A collection can be exact or approximated: in the latter case, nonwords will not be recorded, and will be turned into spaces when decompressing.
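
    The effect of an approximated collection on a text field can be pictured with the following sketch. It assumes, for illustration only, that a word is a maximal run of letters or digits; the class ApproximateRoundTripSketch is hypothetical, and unlike the real collection it also drops nonwords at the very start and end of the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of what survives in an approximated collection.
class ApproximateRoundTripSketch {

    /** Splits text into words, assuming a word is a maximal run of letters or digits. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A nonword character ends the current word; the nonword itself is discarded.
                result.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) result.add(current.toString());
        return result;
    }

    /** Simulates storing only the words and rebuilding the text with one space per nonword. */
    static String approximate(String text) {
        return String.join(" ", words(text));
    }
}
```

    Under these assumptions, approximate("Hello, world! (again)") yields "Hello world again": the words survive, while punctuation and the exact spacing are lost.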

    An instance of this collection will be, like any other collection, serialised to a file, but it will refer to several other files whose names are derived from the instance basename. Please use AbstractDocumentSequence.load(CharSequence) to load instances of this collection.

    This class suffers from the same scalability problems as ZipDocumentCollection if you compress non-text or virtual fields. Text compression, on the other hand, is extremely efficient and scalable.

    Author:
    Sebastiano Vigna
    See Also:
    Serialized Form