Class BitStreamIndex

  • All Implemented Interfaces:
    Serializable
    Direct Known Subclasses:
    BitStreamHPIndex, FileIndex, InMemoryIndex, MemoryMappedIndex

    public abstract class BitStreamIndex
    extends Index
    A bitstream-based index. Instances of this class contains additional index data related to compression, such as the codes used for each part of the index.

    Implementing subclasses must provide access to the index bitstream both at byte and bit level. A bitstream-based index usually exposes the offset list.

    Wired implementations

    The standard readers associated with an instance of this class are of type BitStreamIndexReader. Nonetheless, it is possible to generate automatically sources for wired classes that work only for a particular set of codings and flags. The wired classes will be fetched automagically by reflection, if available. Please read the section about performance in the MG4J manual.

    Since:
    1.1
    Author:
    Sebastiano Vigna
    See Also:
    Serialized Form
    • Field Detail

      • DEFAULT_HEIGHT

        public static final int DEFAULT_HEIGHT
        The default height (fairly low, due to memory consumption).
        See Also:
        Constant Field Values
      • DEFAULT_QUANTUM

        public static final int DEFAULT_QUANTUM
        The default variable quantum (1% of index size).
        See Also:
        Constant Field Values
      • DEFAULT_FIXED_QUANTUM

        public static final int DEFAULT_FIXED_QUANTUM
        The default fixed quantum (each 64 postings).
        See Also:
        Constant Field Values
      • DEFAULT_BUFFER_SIZE

        public static final int DEFAULT_BUFFER_SIZE
        The default buffer size.
        See Also:
        Constant Field Values
      • offsets

        public final LongBigList offsets
        The offset of each term, if offsets were loaded or specified at creation time, or null.
      • height

        public final int height
        The parameter h (the maximum height of a skip tower), or -1 if this index has no skips.
      • quantum

        public final int quantum
        The quantum, or -1 if this index has no skips, or 0 if this is a BitStreamHPIndex and quanta are variable.
      • bufferSize

        public final int bufferSize
        The size of the buffer used to read the bit stream.
      • readerConstructor

        public transient Constructor<? extends IndexReader> readerConstructor
        The constructor that will be used to create new index readers.
      • FIXED_POINT_BITS

        public static final int FIXED_POINT_BITS
        Fixed number of fractional binary digits used in fixed-point computation of Golomb moduli.
        See Also:
        Constant Field Values
    • Method Detail

      • getInputBitStream

        public abstract InputBitStream getInputBitStream​(int bufferSize)
                                                  throws IOException
        Returns an input bit stream over the index.
        Parameters:
        bufferSize - a suggested buffer size.
        Returns:
        an input bit stream over the index.
        Throws:
        IOException
      • getInputStream

        public abstract InputStream getInputStream()
                                            throws IOException
        Returns an input stream over the index.
        Returns:
        an input stream over the index.
        Throws:
        IOException
      • getReader

        public IndexReader getReader​(int bufferSize)
                              throws IOException
        Description copied from class: Index
        Creates and returns a new IndexReader based on this index. After that, you can use the reader to read this index.
        Specified by:
        getReader in class Index
        Parameters:
        bufferSize - the size of the buffer to be used accessing the reader, or -1 for a default buffer size.
        Returns:
        a new IndexReader to read this index.
        Throws:
        IOException
      • golombModulus

        public static int golombModulus​(long p,
                                        long q)
        Computes the Golomb modulus for a given fraction using fixed-point arithmetic and a precomputed table for small values. This gives results that are extremely close to ⌈ log( 2 - p/q ) / log( 1 - p/q ) ⌉, but the computation is orders of magnitude quicker.
        Parameters:
        p - the numerator.
        q - the denominator (larger than or equal to p).
        Returns:
        the Golomb modulus for p/q.
      • gaussianGolombModulus

        public static int gaussianGolombModulus​(long quantumSigma,
                                                int shift)
        Computes the Gaussian Golomb modulus for a given standard deviation and shift using fixed-point arithmetic.

        The Golomb modulus for (positive and negative) integers normally distributed with standard deviation σ can be computed using the formula ⌈ 2 sqrt( 2 / π ) ln(2) σ ⌉.

        The resulting Golomb modulus is near to optimal for coding such integers after they have been passed through Fast.int2nat(int). Note, however, that Golomb coding is not optimal for a normal distribution.

        This function is used to compute the correct Golomb modulus for skip towers.

        Parameters:
        quantumSigma - the standard deviation of a quantum as returned by quantumSigma(long, long, long).
        shift - a shift parameter.
        Returns:
        the Golomb modulus for the standard deviation obtained multiplying quantumSigma by the square root of 2shift-1.
      • quantumSigma

        public static long quantumSigma​(long frequency,
                                        long numberOfDocuments,
                                        long quantum)
        Computes the standard deviation associated with a given quantum and document frequency.
        Parameters:
        frequency - the document frequency.
        numberOfDocuments - the overall number of documents.
        quantum - the quantum.
        Returns:
        a long representing in fixed-point arithmetic the value Math.sqrt( quantum * ( 1 - p ) ) / p, where p is the relative frequency.