Class BitStreamHPIndexWriter
- java.lang.Object
-
- it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter
-
- it.unimi.di.big.mg4j.index.BitStreamHPIndexWriter
-
- All Implemented Interfaces:
IndexWriter
,VariableQuantumIndexWriter
public class BitStreamHPIndexWriter extends AbstractBitStreamIndexWriter implements VariableQuantumIndexWriter
Writes a bitstream-based high-performance index.These indices are managed by MG4J mainly for historical reasons, as quasi-succinct indices are just better under every respect. In particular, they do not support I/O factories.
The difference between indices generated by this class and those generated by
SkipBitStreamIndexWriter
lie in the level of interleaving. Indices generated by this class store positions in a separate stream with extension .positions (similarly to Lucene), and have a mandatory skip structure (an extension of that used by aSkipBitStreamIndexWriter
) that indexes both the main index file and the positions file. This can result in major performance improvement in the resolution of position-based operators (e.g., phrases) and in the evaluation of proximity-based scorers.Presently, indices generated by this class cannot carry payloads: you must use a
BitStreamIndexWriter
in that case. Moreover, only nonparametric indices can be used for positions (this limitation rules outCompressionFlags.Coding.GOLOMB
andCompressionFlags.Coding.INTERPOLATIVE
).- Since:
- 1.2
- Author:
- Sebastiano Vigna
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
BitStreamHPIndexWriter.TowerData
A structure maintaining statistical data about tower construction.
-
Field Summary
Fields Modifier and Type Field Description protected int
b
The parameterb
for Golomb coding of pointers.protected static int
BEFORE_COUNT
This value ofstate
can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.protected static int
BEFORE_DOCUMENT_RECORD
This value ofstate
means that we are ready to callnewDocumentRecord()
.protected static int
BEFORE_FREQUENCY
This value ofstate
means that we are positioned at the start of an inverted list, and we should callwriteFrequency(long)
.protected static int
BEFORE_INVERTED_LIST
This value ofstate
means that we should callnewInvertedList()
.protected static int
BEFORE_PAYLOAD
This value ofstate
can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.protected static int
BEFORE_POINTER
This value ofstate
means that we just started a new document record, and we should callwriteDocumentPointer(OutputBitStream, long)
.protected static int
BEFORE_POSITIONS
This value ofstate
can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.long
bitsForEntryBitLengths
The number of bits written for entry lenghts.long
bitsForPositionsOffsets
The number of bits written for offsets in the file of positions.long
bitsForPositionsQuantumBitLengths
The number of bits written for quantum lengths in the positions stream.long
bitsForQuantumBitLengths
The number of bits written for quantum lengths.long
bitsForVariableQuanta
The number of bits written for variable quanta.protected long
currentDocument
The current document pointer.static int
DEFAULT_TEMP_BUFFER_SIZE
The size of the buffer for the temporary file used to build an inverted list.protected static int
FIRST_UNUSED_STATE
This is the first unused state.protected long
frequency
The number of document records that the current inverted list will contain.protected long
lastDocument
The last document pointer in the current list.protected int
log2b
The parameterlog2b
for Golomb coding of pointers; it is the most significant bit ofb
.int
maxCount
The maximum number of positions in a document record so far.long
numberOfBlocks
The number of written blocks.protected OutputBitStream
obs
The underlying indexOutputBitStream
.protected OutputBitStream
positions
The underlying positionsOutputBitStream
.int
prevEntryBitLength
An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.int
prevPositionsQuantumBitLength
An estimate on the number of bits occupied per quantum in the positions stream in the last written cache, or -1 if no cache has been written for the current inverted list.int
prevQuantumBitLength
An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.protected int
state
The current state of the writer.BitStreamHPIndexWriter.TowerData
towerData
The sum of all tower data computed so far.protected long
writtenDocuments
The number of document records already written for the current inverted list.-
Fields inherited from class it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding
-
-
Constructor Summary
Constructors Constructor Description BitStreamHPIndexWriter(CharSequence basename, long numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int quantum, int height)
Creates a new index writer, with the specified basename.
-
Method Summary
Modifier and Type Method Description void
close()
Closes this index writer, completing the index creation process and releasing all resources.OutputBitStream
newDocumentRecord()
Starts a new document record.long
newInvertedList()
Starts a new inverted list.long
newInvertedList(long predictedFrequency, double fraction, long predictedSize, long predictedPositionsSize)
Starts a new inverted list.void
printStats(PrintStream stats)
Writes to the given print stream statistical information about the index just built.Properties
properties()
Returns properties of the index generated by this index writer.void
writeDocumentPointer(OutputBitStream unused, long pointer)
Writes a document pointer.void
writeDocumentPositions(OutputBitStream unused, int[] position, int offset, int count, int docSize)
Writes the positions of the occurrences of the current term in the current document to the givenOutputBitStream
.void
writeFrequency(long frequency)
Writes the frequency.void
writePayload(OutputBitStream out, Payload payload)
Writes the payload for the current document.void
writePositionCount(OutputBitStream out, int count)
Writes the count of the occurrences of the current term in the current document to the givenOutputBitStream
.long
writtenBits()
Returns the overall number of bits written onto the underlying stream(s).
-
-
-
Field Detail
-
DEFAULT_TEMP_BUFFER_SIZE
public static final int DEFAULT_TEMP_BUFFER_SIZE
The size of the buffer for the temporary file used to build an inverted list. Inverted lists shorter than this number of bytes will be directly rebuilt from the buffer, and never flushed to disk.- See Also:
- Constant Field Values
-
BEFORE_INVERTED_LIST
protected static final int BEFORE_INVERTED_LIST
This value ofstate
means that we should callnewInvertedList()
.- See Also:
- Constant Field Values
-
BEFORE_FREQUENCY
protected static final int BEFORE_FREQUENCY
This value ofstate
means that we are positioned at the start of an inverted list, and we should callwriteFrequency(long)
.- See Also:
- Constant Field Values
-
BEFORE_DOCUMENT_RECORD
protected static final int BEFORE_DOCUMENT_RECORD
This value ofstate
means that we are ready to callnewDocumentRecord()
.- See Also:
- Constant Field Values
-
BEFORE_POINTER
protected static final int BEFORE_POINTER
This value ofstate
means that we just started a new document record, and we should callwriteDocumentPointer(OutputBitStream, long)
.- See Also:
- Constant Field Values
-
BEFORE_PAYLOAD
protected static final int BEFORE_PAYLOAD
This value ofstate
can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.- See Also:
- Constant Field Values
-
BEFORE_COUNT
protected static final int BEFORE_COUNT
This value ofstate
can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.- See Also:
- Constant Field Values
-
BEFORE_POSITIONS
protected static final int BEFORE_POSITIONS
This value ofstate
can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.- See Also:
- Constant Field Values
-
FIRST_UNUSED_STATE
protected static final int FIRST_UNUSED_STATE
This is the first unused state. Subclasses may start from this value to define new states.- See Also:
- Constant Field Values
-
obs
protected OutputBitStream obs
The underlying indexOutputBitStream
.
-
positions
protected OutputBitStream positions
The underlying positionsOutputBitStream
.
-
state
protected int state
The current state of the writer.
-
frequency
protected long frequency
The number of document records that the current inverted list will contain.
-
writtenDocuments
protected long writtenDocuments
The number of document records already written for the current inverted list.
-
currentDocument
protected long currentDocument
The current document pointer.
-
lastDocument
protected long lastDocument
The last document pointer in the current list.
-
b
protected int b
The parameterb
for Golomb coding of pointers.
-
log2b
protected int log2b
The parameterlog2b
for Golomb coding of pointers; it is the most significant bit ofb
.
-
maxCount
public int maxCount
The maximum number of positions in a document record so far.
-
bitsForPositionsOffsets
public long bitsForPositionsOffsets
The number of bits written for offsets in the file of positions.
-
towerData
public final BitStreamHPIndexWriter.TowerData towerData
The sum of all tower data computed so far.
-
bitsForVariableQuanta
public long bitsForVariableQuanta
The number of bits written for variable quanta.
-
bitsForQuantumBitLengths
public long bitsForQuantumBitLengths
The number of bits written for quantum lengths.
-
bitsForPositionsQuantumBitLengths
public long bitsForPositionsQuantumBitLengths
The number of bits written for quantum lengths in the positions stream.
-
bitsForEntryBitLengths
public long bitsForEntryBitLengths
The number of bits written for entry lenghts.
-
numberOfBlocks
public long numberOfBlocks
The number of written blocks.
-
prevEntryBitLength
public int prevEntryBitLength
An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.
-
prevQuantumBitLength
public int prevQuantumBitLength
An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.
-
prevPositionsQuantumBitLength
public int prevPositionsQuantumBitLength
An estimate on the number of bits occupied per quantum in the positions stream in the last written cache, or -1 if no cache has been written for the current inverted list.
-
-
Constructor Detail
-
BitStreamHPIndexWriter
public BitStreamHPIndexWriter(CharSequence basename, long numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int quantum, int height) throws IOException
Creates a new index writer, with the specified basename. The index will be written on a file (stemmed with .index). IfwriteOffsets
, also an offset file will be produced (stemmed with .offsets).- Parameters:
basename
- the basename.numberOfDocuments
- the number of documents in the collection to be indexed.writeOffsets
- iftrue
, the offset file will also be produced.tempBufferSize
- the size of the write buffer of the cache.flags
- a flag map setting the coding techniques to be used (seeCompressionFlags
).quantum
- the quantum; it must be zero, or a power of two; if it is zero, a variable-quantum index is assumed.height
- the maximum height of a skip tower; the cache will contain at most 2h document records.- Throws:
IOException
-
-
Method Detail
-
newInvertedList
public long newInvertedList() throws IOException
Description copied from interface:IndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.- Specified by:
newInvertedList
in interfaceIndexWriter
- Returns:
- the position (in bits) of the underlying bit stream where the new inverted list starts.
- Throws:
IOException
-
newInvertedList
public long newInvertedList(long predictedFrequency, double fraction, long predictedSize, long predictedPositionsSize) throws IOException
Description copied from interface:VariableQuantumIndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.This method provides additional information that will be used to compute the correct quantum for the skip structure of the inverted list.
- Specified by:
newInvertedList
in interfaceVariableQuantumIndexWriter
- Parameters:
predictedFrequency
- the predicted frequency of the inverted list; this might just be an approximation.fraction
- the fraction of the inverted list that will be dedicated to skipping structures.predictedSize
- the predicted size of the part of the inverted list that stores terms and counts.predictedPositionsSize
- the predicted size of the part of the inverted list that stores positions.- Returns:
- the position (in bits) of the underlying bit stream where the new inverted list starts.
- Throws:
IOException
- See Also:
IndexWriter.newInvertedList()
-
writeFrequency
public void writeFrequency(long frequency) throws IOException
Description copied from interface:IndexWriter
Writes the frequency.- Specified by:
writeFrequency
in interfaceIndexWriter
- Parameters:
frequency
- the (positive) number of document records that this inverted list will contain.- Throws:
IOException
-
newDocumentRecord
public OutputBitStream newDocumentRecord() throws IOException
Description copied from interface:IndexWriter
Starts a new document record.This method must be called exactly exactly f times, where f is the frequency specified with
IndexWriter.writeFrequency(long)
.- Specified by:
newDocumentRecord
in interfaceIndexWriter
- Returns:
- the output bit stream where the next document record data should be written, if necessary, or
null
, ifIndexWriter.writeDocumentPointer(OutputBitStream, long)
ignores its first argument. - Throws:
IOException
-
writeDocumentPointer
public void writeDocumentPointer(OutputBitStream unused, long pointer) throws IOException
Description copied from interface:IndexWriter
Writes a document pointer.This method must be called immediately after
IndexWriter.newDocumentRecord()
.- Specified by:
writeDocumentPointer
in interfaceIndexWriter
- Parameters:
unused
- the output bit stream where the pointer will be written.pointer
- the document pointer.- Throws:
IOException
-
writePayload
public void writePayload(OutputBitStream out, Payload payload) throws IOException
Description copied from interface:IndexWriter
Writes the payload for the current document.This method must be called immediately after
IndexWriter.writeDocumentPointer(OutputBitStream, long)
.- Specified by:
writePayload
in interfaceIndexWriter
- Parameters:
out
- the output bit stream where the payload will be written.payload
- the payload.- Throws:
IOException
-
writePositionCount
public void writePositionCount(OutputBitStream out, int count) throws IOException
Description copied from interface:IndexWriter
Writes the count of the occurrences of the current term in the current document to the givenOutputBitStream
.- Specified by:
writePositionCount
in interfaceIndexWriter
- Parameters:
out
- the output stream where the occurrences should be written.count
- the count.- Throws:
IOException
-
writeDocumentPositions
public void writeDocumentPositions(OutputBitStream unused, int[] position, int offset, int count, int docSize) throws IOException
Description copied from interface:IndexWriter
Writes the positions of the occurrences of the current term in the current document to the givenOutputBitStream
.- Specified by:
writeDocumentPositions
in interfaceIndexWriter
- Parameters:
unused
- the output stream where the occurrences should be written.position
- the position vector (a sequence of strictly increasing natural numbers).offset
- the first valid entry inposition
.count
- the number of valid entries inposition
starting fromoffset
.docSize
- the size of the current document (only for Golomb and interpolative coding; you can safely pass -1 otherwise).- Throws:
IOException
-
close
public void close() throws IOException
Description copied from interface:IndexWriter
Closes this index writer, completing the index creation process and releasing all resources.- Specified by:
close
in interfaceIndexWriter
- Throws:
IOException
-
writtenBits
public long writtenBits()
Description copied from interface:IndexWriter
Returns the overall number of bits written onto the underlying stream(s).- Specified by:
writtenBits
in interfaceIndexWriter
- Returns:
- the number of bits written, according to the variables keeping statistical records.
-
properties
public Properties properties()
Description copied from interface:IndexWriter
Returns properties of the index generated by this index writer.This method should only be called after
IndexWriter.close()
. It returns a new property object containing values for (whenever appropriate)Index.PropertyKeys.DOCUMENTS
,Index.PropertyKeys.TERMS
,Index.PropertyKeys.POSTINGS
,Index.PropertyKeys.MAXCOUNT
,Index.PropertyKeys.INDEXCLASS
,Index.PropertyKeys.CODING
,Index.PropertyKeys.PAYLOADCLASS
,BitStreamIndex.PropertyKeys.SKIPQUANTUM
, andBitStreamIndex.PropertyKeys.SKIPHEIGHT
.- Specified by:
properties
in interfaceIndexWriter
- Returns:
- properties a new set of properties for the just created index.
-
printStats
public void printStats(PrintStream stats)
Description copied from interface:IndexWriter
Writes to the given print stream statistical information about the index just built. This method must be called afterIndexWriter.close()
.- Specified by:
printStats
in interfaceIndexWriter
- Overrides:
printStats
in classAbstractBitStreamIndexWriter
- Parameters:
stats
- a print stream where statistical information will be written.
-
-