it.unimi.di.mg4j.index
Class BitStreamIndex

java.lang.Object
  extended by it.unimi.di.mg4j.index.Index
      extended by it.unimi.di.mg4j.index.BitStreamIndex
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
BitStreamHPIndex, FileIndex, InMemoryIndex, MemoryMappedIndex, RemoteBitStreamIndex

public abstract class BitStreamIndex
extends Index

A bitstream-based index. Instances of this class contains additional index data related to compression, such as the codes used for each part of the index.

Implementing subclasses must provide access to the index bitstream both at byte and bit level. A bitstream-based index usually exposes the offset list.

Wired implementations

The standard readers associated with an instance of this class are of type BitStreamIndexReader. Nonetheless, it is possible to generate automatically sources for wired classes that work only for a particular set of codings and flags. The wired classes will be fetched automagically by reflection, if available. Please read the section about performance in the MG4J manual.

Since:
1.1
Author:
Sebastiano Vigna
See Also:
Serialized Form

Nested Class Summary
static class BitStreamIndex.PropertyKeys
          Symbolic names for additional properties of a BitStreamIndex.
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.index.Index
Index.EmptyIndexIterator, Index.UriKeys
 
Field Summary
 int bufferSize
          The size of the buffer used to read the bit stream.
 CompressionFlags.Coding countCoding
          The coding for counts.
static int DEFAULT_BUFFER_SIZE
          The default buffer size.
static int DEFAULT_FIXED_QUANTUM
          The default fixed quantum (each 64 postings).
static int DEFAULT_HEIGHT
          The default height (fairly low, due to memory consumption).
static int DEFAULT_QUANTUM
          The default variable quantum (1% of index size).
static int FIXED_POINT_BITS
          Fixed number of fractional binary digits used in fixed-point computation of Golomb moduli.
static long FIXED_POINT_MULTIPLIER
          1L << FIXED_POINT_BITS.
 CompressionFlags.Coding frequencyCoding
          The coding for frequencies.
 int height
          The parameter h (the maximum height of a skip tower), or -1 if this index has no skips.
 LongList offsets
          The offset of each term, if offsets were loaded or specified at creation time, or null.
 CompressionFlags.Coding pointerCoding
          The coding for pointers.
 CompressionFlags.Coding positionCoding
          The coding for positions.
 int quantum
          The quantum, or -1 if this index has no skips, or 0 if this is a BitStreamHPIndex and quanta are variable.
 Constructor<? extends IndexReader> readerConstructor
          The constructor that will be used to create new index readers.
 
Fields inherited from class it.unimi.di.mg4j.index.Index
field, hasCounts, hasPayloads, hasPositions, keyIndex, maxCount, numberOfDocuments, numberOfOccurrences, numberOfPostings, numberOfTerms, payload, prefixMap, properties, singletonSet, sizes, termMap, termProcessor
 
Constructor Summary
BitStreamIndex(int numberOfDocuments, int numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, CompressionFlags.Coding frequencyCoding, CompressionFlags.Coding pointerCoding, CompressionFlags.Coding countCoding, CompressionFlags.Coding positionCoding, int quantum, int height, int bufferSize, TermProcessor termProcessor, String field, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntList sizes, LongList offsets)
           
 
Method Summary
protected static String featureName(CompressionFlags.Coding coding)
           
static int gaussianGolombModulus(long quantumSigma, int shift)
          Computes the Gaussian Golomb modulus for a given standard deviation and shift using fixed-point arithmetic.
protected  Constructor<? extends IndexReader> getConstructor()
           
abstract  InputBitStream getInputBitStream(int bufferSize)
          Returns an input bit stream over the index.
abstract  InputStream getInputStream()
          Returns an input stream over the index.
 IndexReader getReader(int bufferSize)
          Creates and returns a new IndexReader based on this index.
static int golombModulus(int p, int q)
          Computes the Golomb modulus for a given fraction using fixed-point arithmetic and a precomputed table for small values.
static long quantumSigma(int frequency, int numberOfDocuments, int quantum)
          Computes the standard deviation associated with a given quantum and document frequency.
 String toString()
           
 
Methods inherited from class it.unimi.di.mg4j.index.Index
documents, documents, documents, getEmptyIndexIterator, getEmptyIndexIterator, getEmptyIndexIterator, getEmptyIndexIterator, getInstance, getInstance, getInstance, getInstance, getInstance, getReader, getTermProcessor, keyIndex
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

DEFAULT_HEIGHT

public static final int DEFAULT_HEIGHT
The default height (fairly low, due to memory consumption).

See Also:
Constant Field Values

DEFAULT_QUANTUM

public static final int DEFAULT_QUANTUM
The default variable quantum (1% of index size).

See Also:
Constant Field Values

DEFAULT_FIXED_QUANTUM

public static final int DEFAULT_FIXED_QUANTUM
The default fixed quantum (each 64 postings).

See Also:
Constant Field Values

DEFAULT_BUFFER_SIZE

public static final int DEFAULT_BUFFER_SIZE
The default buffer size.

See Also:
Constant Field Values

frequencyCoding

public final CompressionFlags.Coding frequencyCoding
The coding for frequencies. See CompressionFlags.


pointerCoding

public final CompressionFlags.Coding pointerCoding
The coding for pointers. See CompressionFlags.


countCoding

public final CompressionFlags.Coding countCoding
The coding for counts. See CompressionFlags.


positionCoding

public final CompressionFlags.Coding positionCoding
The coding for positions. See CompressionFlags.


offsets

public final LongList offsets
The offset of each term, if offsets were loaded or specified at creation time, or null.


height

public final int height
The parameter h (the maximum height of a skip tower), or -1 if this index has no skips.


quantum

public final int quantum
The quantum, or -1 if this index has no skips, or 0 if this is a BitStreamHPIndex and quanta are variable.


bufferSize

public final int bufferSize
The size of the buffer used to read the bit stream.


readerConstructor

public transient Constructor<? extends IndexReader> readerConstructor
The constructor that will be used to create new index readers.


FIXED_POINT_BITS

public static final int FIXED_POINT_BITS
Fixed number of fractional binary digits used in fixed-point computation of Golomb moduli.

See Also:
Constant Field Values

FIXED_POINT_MULTIPLIER

public static final long FIXED_POINT_MULTIPLIER
1L << FIXED_POINT_BITS.

See Also:
Constant Field Values
Constructor Detail

BitStreamIndex

public BitStreamIndex(int numberOfDocuments,
                      int numberOfTerms,
                      long numberOfPostings,
                      long numberOfOccurrences,
                      int maxCount,
                      Payload payload,
                      CompressionFlags.Coding frequencyCoding,
                      CompressionFlags.Coding pointerCoding,
                      CompressionFlags.Coding countCoding,
                      CompressionFlags.Coding positionCoding,
                      int quantum,
                      int height,
                      int bufferSize,
                      TermProcessor termProcessor,
                      String field,
                      Properties properties,
                      StringMap<? extends CharSequence> termMap,
                      PrefixMap<? extends CharSequence> prefixMap,
                      IntList sizes,
                      LongList offsets)
Method Detail

getConstructor

protected Constructor<? extends IndexReader> getConstructor()

featureName

protected static String featureName(CompressionFlags.Coding coding)

getInputBitStream

public abstract InputBitStream getInputBitStream(int bufferSize)
                                          throws IOException
Returns an input bit stream over the index.

Parameters:
bufferSize - a suggested buffer size.
Returns:
an input bit stream over the index.
Throws:
IOException

getInputStream

public abstract InputStream getInputStream()
                                    throws IOException
Returns an input stream over the index.

Returns:
an input stream over the index.
Throws:
IOException

getReader

public IndexReader getReader(int bufferSize)
                      throws IOException
Description copied from class: Index
Creates and returns a new IndexReader based on this index. After that, you can use the reader to read this index.

Specified by:
getReader in class Index
Parameters:
bufferSize - the size of the buffer to be used accessing the reader, or -1 for a default buffer size.
Returns:
a new IndexReader to read this index.
Throws:
IOException

golombModulus

public static int golombModulus(int p,
                                int q)
Computes the Golomb modulus for a given fraction using fixed-point arithmetic and a precomputed table for small values. This gives results that are extremely close to ⌈ log( 2 - p/q ) / log( 1 - p/q ) ⌉, but the computation is orders of magnitude quicker.

Parameters:
p - the numerator.
q - the denominator (larger than or equal to p).
Returns:
the Golomb modulus for p/q.

gaussianGolombModulus

public static int gaussianGolombModulus(long quantumSigma,
                                        int shift)
Computes the Gaussian Golomb modulus for a given standard deviation and shift using fixed-point arithmetic.

The Golomb modulus for (positive and negative) integers normally distributed with standard deviation σ can be computed using the formula ⌈ 2 sqrt( 2 / π ) ln(2) σ ⌉.

The resulting Golomb modulus is near to optimal for coding such integers after they have been passed through Fast.int2nat(int). Note, however, that Golomb coding is not optimal for a normal distribution.

This function is used to compute the correct Golomb modulus for skip towers.

Parameters:
quantumSigma - the standard deviation of a quantum as returned by quantumSigma(int, int, int).
shift - a shift parameter.
Returns:
the Golomb modulus for the standard deviation obtained multiplying quantumSigma by the square root of 2shift-1.

quantumSigma

public static long quantumSigma(int frequency,
                                int numberOfDocuments,
                                int quantum)
Computes the standard deviation associated with a given quantum and document frequency.

Parameters:
frequency - the document frequency.
numberOfDocuments - the overall number of documents.
quantum - the quantum.
Returns:
a long representing in fixed-point arithmetic the value Math.sqrt( quantum * ( 1 - p ) ) / p, where p is the relative frequency.

toString

public String toString()
Overrides:
toString in class Object