Class BitStreamIndex
- java.lang.Object
-
- it.unimi.di.big.mg4j.index.Index
-
- it.unimi.di.big.mg4j.index.BitStreamIndex
-
- All Implemented Interfaces:
Serializable
- Direct Known Subclasses:
BitStreamHPIndex
,FileIndex
,InMemoryIndex
,MemoryMappedIndex
public abstract class BitStreamIndex extends Index
A bitstream-based index. Instances of this class contains additional index data related to compression, such as the codes used for each part of the index.Implementing subclasses must provide access to the index bitstream both at byte and bit level. A bitstream-based index usually exposes the offset list.
Wired implementations
The standard readers associated with an instance of this class are of type
BitStreamIndexReader
. Nonetheless, it is possible to generate automatically sources for wired classes that work only for a particular set of codings and flags. The wired classes will be fetched automagically by reflection, if available. Please read the section about performance in the MG4J manual.- Since:
- 1.1
- Author:
- Sebastiano Vigna
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
BitStreamIndex.PropertyKeys
Symbolic names for additional properties of aBitStreamIndex
.-
Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.index.Index
Index.EmptyIndexIterator, Index.UriKeys
-
-
Field Summary
Fields Modifier and Type Field Description int
bufferSize
The size of the buffer used to read the bit stream.CompressionFlags.Coding
countCoding
The coding for counts.static int
DEFAULT_BUFFER_SIZE
The default buffer size.static int
DEFAULT_FIXED_QUANTUM
The default fixed quantum (each 64 postings).static int
DEFAULT_HEIGHT
The default height (fairly low, due to memory consumption).static int
DEFAULT_QUANTUM
The default variable quantum (1% of index size).static int
FIXED_POINT_BITS
Fixed number of fractional binary digits used in fixed-point computation of Golomb moduli.static long
FIXED_POINT_MULTIPLIER
1L <<
.FIXED_POINT_BITS
CompressionFlags.Coding
frequencyCoding
The coding for frequencies.int
height
The parameterh
(the maximum height of a skip tower), or -1 if this index has no skips.LongBigList
offsets
The offset of each term, if offsets were loaded or specified at creation time, ornull
.CompressionFlags.Coding
pointerCoding
The coding for pointers.CompressionFlags.Coding
positionCoding
The coding for positions.int
quantum
The quantum, or -1 if this index has no skips, or 0 if this is aBitStreamHPIndex
and quanta are variable.Constructor<? extends IndexReader>
readerConstructor
The constructor that will be used to create new index readers.-
Fields inherited from class it.unimi.di.big.mg4j.index.Index
field, hasCounts, hasPayloads, hasPositions, keyIndex, maxCount, numberOfDocuments, numberOfOccurrences, numberOfPostings, numberOfTerms, payload, prefixMap, properties, singletonSet, sizes, termMap, termProcessor
-
-
Constructor Summary
Constructors Constructor Description BitStreamIndex(long numberOfDocuments, long numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, CompressionFlags.Coding frequencyCoding, CompressionFlags.Coding pointerCoding, CompressionFlags.Coding countCoding, CompressionFlags.Coding positionCoding, int quantum, int height, int bufferSize, TermProcessor termProcessor, String field, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntBigList sizes, LongBigList offsets)
-
Method Summary
Modifier and Type Method Description protected static String
featureName(CompressionFlags.Coding coding)
static int
gaussianGolombModulus(long quantumSigma, int shift)
Computes the Gaussian Golomb modulus for a given standard deviation and shift using fixed-point arithmetic.protected Constructor<? extends IndexReader>
getConstructor()
abstract InputBitStream
getInputBitStream(int bufferSize)
Returns an input bit stream over the index.abstract InputStream
getInputStream()
Returns an input stream over the index.IndexReader
getReader(int bufferSize)
Creates and returns a newIndexReader
based on this index.static int
golombModulus(long p, long q)
Computes the Golomb modulus for a given fraction using fixed-point arithmetic and a precomputed table for small values.static long
quantumSigma(long frequency, long numberOfDocuments, long quantum)
Computes the standard deviation associated with a given quantum and document frequency.String
toString()
-
Methods inherited from class it.unimi.di.big.mg4j.index.Index
documents, documents, documents, getEmptyIndexIterator, getEmptyIndexIterator, getEmptyIndexIterator, getEmptyIndexIterator, getInstance, getInstance, getInstance, getInstance, getInstance, getReader, getTermProcessor, keyIndex
-
-
-
-
Field Detail
-
DEFAULT_HEIGHT
public static final int DEFAULT_HEIGHT
The default height (fairly low, due to memory consumption).- See Also:
- Constant Field Values
-
DEFAULT_QUANTUM
public static final int DEFAULT_QUANTUM
The default variable quantum (1% of index size).- See Also:
- Constant Field Values
-
DEFAULT_FIXED_QUANTUM
public static final int DEFAULT_FIXED_QUANTUM
The default fixed quantum (each 64 postings).- See Also:
- Constant Field Values
-
DEFAULT_BUFFER_SIZE
public static final int DEFAULT_BUFFER_SIZE
The default buffer size.- See Also:
- Constant Field Values
-
frequencyCoding
public final CompressionFlags.Coding frequencyCoding
The coding for frequencies. SeeCompressionFlags
.
-
pointerCoding
public final CompressionFlags.Coding pointerCoding
The coding for pointers. SeeCompressionFlags
.
-
countCoding
public final CompressionFlags.Coding countCoding
The coding for counts. SeeCompressionFlags
.
-
positionCoding
public final CompressionFlags.Coding positionCoding
The coding for positions. SeeCompressionFlags
.
-
offsets
public final LongBigList offsets
The offset of each term, if offsets were loaded or specified at creation time, ornull
.
-
height
public final int height
The parameterh
(the maximum height of a skip tower), or -1 if this index has no skips.
-
quantum
public final int quantum
The quantum, or -1 if this index has no skips, or 0 if this is aBitStreamHPIndex
and quanta are variable.
-
bufferSize
public final int bufferSize
The size of the buffer used to read the bit stream.
-
readerConstructor
public transient Constructor<? extends IndexReader> readerConstructor
The constructor that will be used to create new index readers.
-
FIXED_POINT_BITS
public static final int FIXED_POINT_BITS
Fixed number of fractional binary digits used in fixed-point computation of Golomb moduli.- See Also:
- Constant Field Values
-
FIXED_POINT_MULTIPLIER
public static final long FIXED_POINT_MULTIPLIER
1L <<
.FIXED_POINT_BITS
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
BitStreamIndex
public BitStreamIndex(long numberOfDocuments, long numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, CompressionFlags.Coding frequencyCoding, CompressionFlags.Coding pointerCoding, CompressionFlags.Coding countCoding, CompressionFlags.Coding positionCoding, int quantum, int height, int bufferSize, TermProcessor termProcessor, String field, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntBigList sizes, LongBigList offsets)
-
-
Method Detail
-
getConstructor
protected Constructor<? extends IndexReader> getConstructor()
-
featureName
protected static String featureName(CompressionFlags.Coding coding)
-
getInputBitStream
public abstract InputBitStream getInputBitStream(int bufferSize) throws IOException
Returns an input bit stream over the index.- Parameters:
bufferSize
- a suggested buffer size.- Returns:
- an input bit stream over the index.
- Throws:
IOException
-
getInputStream
public abstract InputStream getInputStream() throws IOException
Returns an input stream over the index.- Returns:
- an input stream over the index.
- Throws:
IOException
-
getReader
public IndexReader getReader(int bufferSize) throws IOException
Description copied from class:Index
Creates and returns a newIndexReader
based on this index. After that, you can use the reader to read this index.- Specified by:
getReader
in classIndex
- Parameters:
bufferSize
- the size of the buffer to be used accessing the reader, or -1 for a default buffer size.- Returns:
- a new
IndexReader
to read this index. - Throws:
IOException
-
golombModulus
public static int golombModulus(long p, long q)
Computes the Golomb modulus for a given fraction using fixed-point arithmetic and a precomputed table for small values. This gives results that are extremely close to ⌈ log( 2 -p
/q
) / log( 1 -p
/q
) ⌉, but the computation is orders of magnitude quicker.- Parameters:
p
- the numerator.q
- the denominator (larger than or equal top
).- Returns:
- the Golomb modulus for
p
/q
.
-
gaussianGolombModulus
public static int gaussianGolombModulus(long quantumSigma, int shift)
Computes the Gaussian Golomb modulus for a given standard deviation and shift using fixed-point arithmetic.The Golomb modulus for (positive and negative) integers normally distributed with standard deviation σ can be computed using the formula ⌈ 2 sqrt( 2 / π ) ln(2) σ ⌉.
The resulting Golomb modulus is near to optimal for coding such integers after they have been passed through
Fast.int2nat(int)
. Note, however, that Golomb coding is not optimal for a normal distribution.This function is used to compute the correct Golomb modulus for skip towers.
- Parameters:
quantumSigma
- the standard deviation of a quantum as returned byquantumSigma(long, long, long)
.shift
- a shift parameter.- Returns:
- the Golomb modulus for the standard deviation obtained multiplying
quantumSigma
by the square root of 2shift
-1.
-
quantumSigma
public static long quantumSigma(long frequency, long numberOfDocuments, long quantum)
Computes the standard deviation associated with a given quantum and document frequency.- Parameters:
frequency
- the document frequency.numberOfDocuments
- the overall number of documents.quantum
- the quantum.- Returns:
- a long representing in fixed-point arithmetic the value
Math.sqrt( quantum * ( 1 - p ) ) / p
, wherep
is the relative frequency.
-
-