Class SkipBitStreamIndexWriter
- java.lang.Object
-
- it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter
-
- it.unimi.di.big.mg4j.index.BitStreamIndexWriter
-
- it.unimi.di.big.mg4j.index.SkipBitStreamIndexWriter
-
- All Implemented Interfaces:
IndexWriter
,VariableQuantumIndexWriter
public class SkipBitStreamIndexWriter extends BitStreamIndexWriter implements VariableQuantumIndexWriter
Writes a bitstream-based interleaved index with skips.These indices are managed by MG4J mainly for historical reasons, as quasi-succinct indices are just better under every respect.
An interleaved inverted index with skips makes it possible to skip ahead quickly while reading inverted lists. More specifically, when reading the inverted list relative to a certain term, one may want to decide to skip all document records that concern documents with pointer less than a given integer. In a normal inverted index this is impossible: one would have to read all document records sequentially.
The skipping structure used by this class is new, and has been described by Paolo Boldi and Sebastiano Vigna in “Compressed perfect embedded skip lists for quick inverted-index lookups”, Proc. SPIRE 2005, volume 3772 of Lecture Notes in Computer Science, pages 25−28. Springer, 2005.
- Since:
- 0.6
- Author:
- Paolo Boldi, Sebastiano Vigna
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
SkipBitStreamIndexWriter.TowerData
A structure maintaining statistical data about tower construction.
-
Field Summary
Fields Modifier and Type Field Description long
bitsForEntryBitLengths
The number of bits written for entry lenghts.long
bitsForQuantumBitLengths
The number of bits written for quantum lengths.long
bitsForVariableQuanta
The number of bits written for variable quanta.static int
DEFAULT_TEMP_BUFFER_SIZE
The size of the buffer for the temporary file used to build an inverted list.long
numberOfBlocks
The number of written blocks.int
prevEntryBitLength
An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.int
prevQuantumBitLength
An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.SkipBitStreamIndexWriter.TowerData
towerData
The sum of all tower data computed so far.-
Fields inherited from class it.unimi.di.big.mg4j.index.BitStreamIndexWriter
b, BEFORE_COUNT, BEFORE_DOCUMENT_RECORD, BEFORE_FREQUENCY, BEFORE_INVERTED_LIST, BEFORE_PAYLOAD, BEFORE_POINTER, BEFORE_POSITIONS, currentDocument, FIRST_UNUSED_STATE, frequency, lastDocument, lastInvertedListPos, log2b, maxCount, obs, state, writtenDocuments
-
Fields inherited from class it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding
-
-
Constructor Summary
Constructors Constructor Description SkipBitStreamIndexWriter(IOFactory ioFactory, CharSequence basename, long numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int quantum, int height)
Creates a new skip index writer with the specified basename.
-
Method Summary
Modifier and Type Method Description void
close()
Closes this index writer, completing the index creation process and releasing all resources.static int
log2Quantum(long predictedFrequency, long numberOfDocuments, double fraction, long predictedSize, long predictedPositionsSize)
Suggests a quantum size using frequency and bit size data.OutputBitStream
newDocumentRecord()
Starts a new document record.long
newInvertedList()
Starts a new inverted list.long
newInvertedList(long predictedFrequency, double fraction, long predictedSize, long predictedPositionsSize)
Starts a new inverted list.void
printStats(PrintStream stats)
Writes to the given print stream statistical information about the index just built.Properties
properties()
Returns properties of the index generated by this index writer.void
writeDocumentPointer(OutputBitStream out, long pointer)
Writes a document pointer.void
writeFrequency(long frequency)
Writes the frequency.long
writtenBits()
Returns the overall number of bits written onto the underlying stream(s).-
Methods inherited from class it.unimi.di.big.mg4j.index.BitStreamIndexWriter
writeDocumentPositions, writePayload, writePositionCount
-
-
-
-
Field Detail
-
DEFAULT_TEMP_BUFFER_SIZE
public static final int DEFAULT_TEMP_BUFFER_SIZE
The size of the buffer for the temporary file used to build an inverted list. Inverted lists shorter than this number of bytes will be directly rebuilt from the buffer, and never flushed to disk.- See Also:
- Constant Field Values
-
towerData
public final SkipBitStreamIndexWriter.TowerData towerData
The sum of all tower data computed so far.
-
bitsForVariableQuanta
public long bitsForVariableQuanta
The number of bits written for variable quanta.
-
bitsForQuantumBitLengths
public long bitsForQuantumBitLengths
The number of bits written for quantum lengths.
-
bitsForEntryBitLengths
public long bitsForEntryBitLengths
The number of bits written for entry lenghts.
-
numberOfBlocks
public long numberOfBlocks
The number of written blocks.
-
prevEntryBitLength
public int prevEntryBitLength
An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.
-
prevQuantumBitLength
public int prevQuantumBitLength
An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.
-
-
Constructor Detail
-
SkipBitStreamIndexWriter
public SkipBitStreamIndexWriter(IOFactory ioFactory, CharSequence basename, long numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int quantum, int height) throws IOException
Creates a new skip index writer with the specified basename. The index will be written on a file (stemmed with .index). IfwriteOffsets
, also an offset file will be produced (stemmed with .offsets).- Parameters:
ioFactory
- the factory that will be used to perform I/O.basename
- the basename.numberOfDocuments
- the number of documents in the collection to be indexed.writeOffsets
- iftrue
, the offset file will also be produced.tempBufferSize
- the size in bytes of the internal temporary buffer (inverted lists shorter than this size will never be flushed to disk).flags
- a flag map setting the coding techniques to be used (seeCompressionFlags
).quantum
- the quantum; it must be zero, or a power of two; if it is zero, a variable-quantum index is assumed.height
- the maximum height of a skip tower; the cache will contain at most 2h document records.- Throws:
IOException
-
-
Method Detail
-
log2Quantum
public static int log2Quantum(long predictedFrequency, long numberOfDocuments, double fraction, long predictedSize, long predictedPositionsSize)
Suggests a quantum size using frequency and bit size data.- Parameters:
predictedFrequency
- a prediction of the frequency of the inverted list.numberOfDocuments
- the number of documents in the collection.fraction
- the fraction of space to be used for skip lists.predictedSize
- a prediction of the size of the inverted list for terms and counts.predictedPositionsSize
- a prediction of the size of the inverted list for positions (might be zero).- Returns:
- -1, if this list should not have towers because the suggested quantum size is larger than or equal to
predictedFrequency
; the logarithm of the suggested quantum size, otherwise.
-
newInvertedList
public long newInvertedList(long predictedFrequency, double fraction, long predictedSize, long predictedPositionsSize) throws IOException
Description copied from interface:VariableQuantumIndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.This method provides additional information that will be used to compute the correct quantum for the skip structure of the inverted list.
- Specified by:
newInvertedList
in interfaceVariableQuantumIndexWriter
- Parameters:
predictedFrequency
- the predicted frequency of the inverted list; this might just be an approximation.fraction
- the fraction of the inverted list that will be dedicated to skipping structures.predictedSize
- the predicted size of the part of the inverted list that stores terms and counts.predictedPositionsSize
- the predicted size of the part of the inverted list that stores positions.- Returns:
- the position (in bits) of the underlying bit stream where the new inverted list starts.
- Throws:
IOException
- See Also:
IndexWriter.newInvertedList()
-
newInvertedList
public long newInvertedList() throws IOException
Description copied from interface:IndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.- Specified by:
newInvertedList
in interfaceIndexWriter
- Overrides:
newInvertedList
in classBitStreamIndexWriter
- Returns:
- the position (in bits) of the underlying bit stream where the new inverted list starts.
- Throws:
IOException
-
writeFrequency
public void writeFrequency(long frequency) throws IOException
Description copied from interface:IndexWriter
Writes the frequency.- Specified by:
writeFrequency
in interfaceIndexWriter
- Overrides:
writeFrequency
in classBitStreamIndexWriter
- Parameters:
frequency
- the (positive) number of document records that this inverted list will contain.- Throws:
IOException
-
newDocumentRecord
public OutputBitStream newDocumentRecord() throws IOException
Description copied from interface:IndexWriter
Starts a new document record.This method must be called exactly exactly f times, where f is the frequency specified with
IndexWriter.writeFrequency(long)
.- Specified by:
newDocumentRecord
in interfaceIndexWriter
- Overrides:
newDocumentRecord
in classBitStreamIndexWriter
- Returns:
- the output bit stream where the next document record data should be written, if necessary, or
null
, ifIndexWriter.writeDocumentPointer(OutputBitStream, long)
ignores its first argument. - Throws:
IOException
-
writeDocumentPointer
public void writeDocumentPointer(OutputBitStream out, long pointer) throws IOException
Description copied from interface:IndexWriter
Writes a document pointer.This method must be called immediately after
IndexWriter.newDocumentRecord()
.- Specified by:
writeDocumentPointer
in interfaceIndexWriter
- Overrides:
writeDocumentPointer
in classBitStreamIndexWriter
- Parameters:
out
- the output bit stream where the pointer will be written.pointer
- the document pointer.- Throws:
IOException
-
close
public void close() throws IOException
Description copied from interface:IndexWriter
Closes this index writer, completing the index creation process and releasing all resources.- Specified by:
close
in interfaceIndexWriter
- Overrides:
close
in classBitStreamIndexWriter
- Throws:
IOException
-
writtenBits
public long writtenBits()
Description copied from interface:IndexWriter
Returns the overall number of bits written onto the underlying stream(s).- Specified by:
writtenBits
in interfaceIndexWriter
- Overrides:
writtenBits
in classBitStreamIndexWriter
- Returns:
- the number of bits written, according to the variables keeping statistical records.
-
properties
public Properties properties()
Description copied from interface:IndexWriter
Returns properties of the index generated by this index writer.This method should only be called after
IndexWriter.close()
. It returns a new property object containing values for (whenever appropriate)Index.PropertyKeys.DOCUMENTS
,Index.PropertyKeys.TERMS
,Index.PropertyKeys.POSTINGS
,Index.PropertyKeys.MAXCOUNT
,Index.PropertyKeys.INDEXCLASS
,Index.PropertyKeys.CODING
,Index.PropertyKeys.PAYLOADCLASS
,BitStreamIndex.PropertyKeys.SKIPQUANTUM
, andBitStreamIndex.PropertyKeys.SKIPHEIGHT
.- Specified by:
properties
in interfaceIndexWriter
- Overrides:
properties
in classBitStreamIndexWriter
- Returns:
- properties a new set of properties for the just created index.
-
printStats
public void printStats(PrintStream stats)
Description copied from interface:IndexWriter
Writes to the given print stream statistical information about the index just built. This method must be called afterIndexWriter.close()
.- Specified by:
printStats
in interfaceIndexWriter
- Overrides:
printStats
in classAbstractBitStreamIndexWriter
- Parameters:
stats
- a print stream where statistical information will be written.
-
-