Class BitStreamIndexWriter

  • All Implemented Interfaces:
    IndexWriter
    Direct Known Subclasses:
    SkipBitStreamIndexWriter

    public class BitStreamIndexWriter
    extends AbstractBitStreamIndexWriter
    Writes a bitstream-based interleaved index.

    Indices written by this class are somewhat classical. Each inverted list contains the frequency, followed by gap-encoded pointers optionally interleaved with counts and gap-encoded positions. The compression technique used for each component can be chosen using a compression flag.
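The gap encoding of pointers mentioned above can be illustrated with a minimal sketch. It assumes a common convention in which each pointer is stored as its difference from the previous pointer minus one, with the previous pointer starting at -1; the class name and methods are hypothetical and not part of this API, and the exact convention used by this writer is not confirmed here.

```java
import java.util.Arrays;

public class GapEncodingSketch {
    /** Turns strictly increasing document pointers into gaps (difference minus one). */
    public static long[] toGaps(long[] pointers) {
        long[] gaps = new long[pointers.length];
        long last = -1; // assumed starting value, so the first gap equals the first pointer
        for (int i = 0; i < pointers.length; i++) {
            gaps[i] = pointers[i] - last - 1;
            last = pointers[i];
        }
        return gaps;
    }

    /** Rebuilds the original pointers from the gaps produced by toGaps(). */
    public static long[] fromGaps(long[] gaps) {
        long[] pointers = new long[gaps.length];
        long last = -1;
        for (int i = 0; i < gaps.length; i++) {
            last += gaps[i] + 1;
            pointers[i] = last;
        }
        return pointers;
    }

    public static void main(String[] args) {
        long[] pointers = {3, 4, 10, 25};
        long[] gaps = toGaps(pointers);
        System.out.println(Arrays.toString(gaps));           // [3, 0, 5, 14]
        System.out.println(Arrays.toString(fromGaps(gaps))); // [3, 4, 10, 25]
    }
}
```

Small gaps are what make the instantaneous codes selected by the compression flags effective: dense lists produce many small integers, which such codes represent in few bits.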

    Interleaved indices of this kind are essentially unusable, as each posting list must be read in its entirety (no skipping is possible). One possible exception is disjunctive queries that use all the information in the index (e.g., with proximity scoring). Another possible use is to compare the compression power of different codes, as essentially all classical compression techniques are available. Most importantly, however, the Scan tool generates interleaved indices as batches (albeit not using this class).

    These are the files that form an interleaved index:

    basename.properties
    A Java property file containing information about the index.
    basename.terms
    For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely, the i-th line of the file (starting from 0) contains the literal string corresponding to term index i.
    basename.frequencies
    For each term, the number of documents in which the term appears, in γ coding. More precisely, the i-th integer of the file (starting from 0) is the number of documents in which the term of index i appears. This information also appears at the start of each posting list in the index, but it is stored in this file as well for convenience.
    basename.sizes (not generated for payload-based indices)
    For each indexed document, the corresponding size (i.e., the number of words), in γ coding. More precisely, the i-th integer of the file (starting from 0) is the size in words of the document of index i.
    basename.index
    The inverted index.
    basename.offsets
    For each term, the bit offset in basename.index at which the corresponding inverted list starts. More precisely, the first integer is the offset for term 0 in γ coding, and then the i-th integer is the difference between the i-th and the (i−1)-th offset in γ coding. If T terms were indexed, this file will contain T+1 integers, the last being the difference (in bits) between the length of the entire inverted index and the offset of the last inverted list. Thus, in practice, the file is formed by the number zero (the offset of the first list) followed by the length in bits of each inverted list.
    basename.occurrencies
    For each term, its occurrency, that is, the number of its occurrences throughout the whole document collection, in γ coding. More precisely, the i-th integer of the file (starting from 0) is the occurrency of the term of index i.
    basename.posnumbits
    For each term, the number of bits used to store its positions, in γ coding (used just for quantum-optimisation purposes).
    basename.sumsmaxpos
    For each term, the sum of the maximum positions at which the term appears, in δ coding (necessary to build a QuasiSuccinctIndex).
    basename.stats
    Miscellaneous detailed statistics about the index.
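The layout of basename.offsets described above (a zero followed by the length in bits of each inverted list, all in γ coding) can be sketched as follows. This is an illustration only: bits are kept in a string for readability, the γ code follows the usual Elias definition shifted so that zero is representable, and the class and method names are hypothetical. The exact bit layout produced by the library is an assumption here.

```java
import java.util.Arrays;

public class OffsetsSketch {
    /** Appends the γ code of x >= 0: the length of (x + 1) in unary, then its lower bits. */
    public static void writeGamma(StringBuilder bits, long x) {
        long v = x + 1;
        int msb = 63 - Long.numberOfLeadingZeros(v);
        for (int i = 0; i < msb; i++) bits.append('0');
        bits.append('1');
        for (int i = msb - 1; i >= 0; i--) bits.append(((v >>> i) & 1) == 1 ? '1' : '0');
    }

    /** Reads one γ-coded integer starting at pos[0] and advances pos[0] past it. */
    public static long readGamma(CharSequence bits, int[] pos) {
        int msb = 0;
        while (bits.charAt(pos[0]++) == '0') msb++;
        long v = 1;
        for (int i = 0; i < msb; i++) v = (v << 1) | (bits.charAt(pos[0]++) - '0');
        return v - 1;
    }

    /** Encodes T list lengths as T + 1 integers: zero, then each length in bits. */
    public static String encodeOffsets(long[] listLengthsInBits) {
        StringBuilder bits = new StringBuilder();
        writeGamma(bits, 0);
        for (long length : listLengthsInBits) writeGamma(bits, length);
        return bits.toString();
    }

    /** Rebuilds absolute bit offsets by prefix sums; the last entry is the index length. */
    public static long[] decodeOffsets(CharSequence bits, int numberOfTerms) {
        long[] offsets = new long[numberOfTerms + 1];
        int[] pos = {0};
        long sum = 0;
        for (int i = 0; i <= numberOfTerms; i++) {
            sum += readGamma(bits, pos);
            offsets[i] = sum;
        }
        return offsets;
    }

    public static void main(String[] args) {
        long[] lengths = {17, 5, 40}; // hypothetical lengths in bits of three inverted lists
        long[] offsets = decodeOffsets(encodeOffsets(lengths), lengths.length);
        System.out.println(Arrays.toString(offsets)); // [0, 17, 22, 62]
    }
}
```

Storing differences rather than absolute offsets keeps the integers small, so the γ-coded file stays compact even for very large indices.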
    Since:
    0.6
    Author:
    Paolo Boldi, Sebastiano Vigna
    • Field Detail

      • BEFORE_PAYLOAD

        protected static final int BEFORE_PAYLOAD
        This value of state can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.
        See Also:
        Constant Field Values
      • BEFORE_COUNT

        protected static final int BEFORE_COUNT
        This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.
        See Also:
        Constant Field Values
      • BEFORE_POSITIONS

        protected static final int BEFORE_POSITIONS
        This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.
        See Also:
        Constant Field Values
      • FIRST_UNUSED_STATE

        protected static final int FIRST_UNUSED_STATE
        This is the first unused state. Subclasses may start from this value to define new states.
        See Also:
        Constant Field Values
      • state

        protected int state
        The current state of the writer.
      • frequency

        protected long frequency
        The number of document records that the current inverted list will contain.
      • writtenDocuments

        protected long writtenDocuments
        The number of document records already written for the current inverted list.
      • currentDocument

        protected long currentDocument
        The current document pointer.
      • lastDocument

        protected long lastDocument
        The last document pointer in the current list.
      • lastInvertedListPos

        protected long lastInvertedListPos
        The position (in bytes) where the last inverted list started.
      • maxCount

        public int maxCount
        The maximum number of positions in a document record so far.
      • b

        protected int b
        The parameter b for Golomb coding of pointers.
      • log2b

        protected int log2b
        The parameter log2b for Golomb coding of pointers; it is the index of the most significant bit of b, that is, ⌊log₂ b⌋.
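The relationship between b and log2b can be illustrated with a self-contained sketch of textbook Golomb coding: the quotient is written in unary and the remainder in truncated (minimal) binary, which is where log2b is needed. The class and method names are hypothetical, bits are kept in a string for readability, and neither the unary convention nor the way this writer chooses b is confirmed by this page.

```java
public class GolombSketch {
    /** Index of the most significant bit of b > 0, that is, floor(log2(b)). */
    public static int mostSignificantBit(int b) {
        return 31 - Integer.numberOfLeadingZeros(b);
    }

    /** Golomb code of x >= 0 with modulus b > 0, returned as a string of bits. */
    public static String golomb(long x, int b) {
        int log2b = mostSignificantBit(b);
        StringBuilder bits = new StringBuilder();
        // Quotient in unary: q zeros followed by a one (assumed convention).
        for (long q = x / b; q > 0; q--) bits.append('0');
        bits.append('1');
        // Remainder in truncated binary: small remainders use log2b bits,
        // the others use log2b + 1 bits.
        long r = x % b;
        long threshold = (1L << (log2b + 1)) - b;
        if (r < threshold) appendBits(bits, r, log2b);
        else appendBits(bits, r + threshold, log2b + 1);
        return bits.toString();
    }

    /** Appends the lowest width bits of value, most significant first. */
    private static void appendBits(StringBuilder bits, long value, int width) {
        for (int i = width - 1; i >= 0; i--) bits.append((value >>> i) & 1);
    }

    public static void main(String[] args) {
        System.out.println(mostSignificantBit(5)); // 2
        System.out.println(golomb(7, 5));          // 0110
        System.out.println(golomb(9, 5));          // 01111
    }
}
```

Caching log2b alongside b avoids recomputing the most significant bit for every pointer written in the current inverted list.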
    • Constructor Detail

      • BitStreamIndexWriter

        public BitStreamIndexWriter​(IOFactory ioFactory,
                                    CharSequence basename,
                                    long numberOfDocuments,
                                    boolean writeOffsets,
                                    Map<CompressionFlags.Component,​CompressionFlags.Coding> flags)
                             throws IOException
        Creates a new index writer with the specified basename. The index will be written to a file named basename.index. If writeOffsets is true, an offset file named basename.offsets will also be produced. When close() is called, the property file basename.properties will also be produced, or enriched if it already exists.
        Parameters:
        ioFactory - the factory that will be used to perform I/O.
        basename - the basename.
        numberOfDocuments - the number of documents in the collection to be indexed.
        writeOffsets - if true, the offset file will also be produced.
        flags - a flag map setting the coding techniques to be used (see CompressionFlags).
        Throws:
        IOException