it.unimi.di.mg4j.index
Class BitStreamHPIndexWriter

java.lang.Object
  extended by it.unimi.di.mg4j.index.AbstractBitStreamIndexWriter
      extended by it.unimi.di.mg4j.index.BitStreamHPIndexWriter
All Implemented Interfaces:
IndexWriter, VariableQuantumIndexWriter

public class BitStreamHPIndexWriter
extends AbstractBitStreamIndexWriter
implements VariableQuantumIndexWriter

Writes a bitstream-based high-performance index.

These indices are managed by MG4J mainly for historical reasons, as quasi-succinct indices are just better under every respect. In particular, they do not support I/O factories.

The difference between indices generated by this class and those generated by SkipBitStreamIndexWriter lie in the level of interleaving. Indices generated by this class store positions in a separate stream with extension .positions (similarly to Lucene), and have a mandatory skip structure (an extension of that used by a SkipBitStreamIndexWriter) that indexes both the main index file and the positions file. This can result in major performance improvement in the resolution of position-based operators (e.g., phrases) and in the evaluation of proximity-based scorers.

Presently, indices generated by this class cannot carry payloads: you must use a BitStreamIndexWriter in that case. Moreover, only nonparametric indices can be used for positions (this limitation rules out CompressionFlags.Coding.GOLOMB and CompressionFlags.Coding.INTERPOLATIVE).

Since:
1.2
Author:
Sebastiano Vigna

Nested Class Summary
static class BitStreamHPIndexWriter.TowerData
          A structure maintaining statistical data about tower construction.
 
Field Summary
protected  int b
          The parameter b for Golomb coding of pointers.
protected static int BEFORE_COUNT
          This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.
protected static int BEFORE_DOCUMENT_RECORD
          This value of state means that we are ready to call newDocumentRecord().
protected static int BEFORE_FREQUENCY
          This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(int).
protected static int BEFORE_INVERTED_LIST
          This value of state means that we should call newInvertedList().
protected static int BEFORE_PAYLOAD
          This value of state can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.
protected static int BEFORE_POINTER
          This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, int).
protected static int BEFORE_POSITIONS
          This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.
 long bitsForEntryBitLengths
          The number of bits written for entry lenghts.
 long bitsForPositionsOffsets
          The number of bits written for offsets in the file of positions.
 long bitsForPositionsQuantumBitLengths
          The number of bits written for quantum lengths in the positions stream.
 long bitsForQuantumBitLengths
          The number of bits written for quantum lengths.
 long bitsForVariableQuanta
          The number of bits written for variable quanta.
protected  int currentDocument
          The current document pointer.
static int DEFAULT_TEMP_BUFFER_SIZE
          The size of the buffer for the temporary file used to build an inverted list.
protected static int FIRST_UNUSED_STATE
          This is the first unused state.
protected  int frequency
          The number of document records that the current inverted list will contain.
protected  int lastDocument
          The last document pointer in the current list.
protected  int log2b
          The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.
 int maxCount
          The maximum number of positions in a document record so far.
 long numberOfBlocks
          The number of written blocks.
protected  OutputBitStream obs
          The underlying index OutputBitStream.
protected  OutputBitStream positions
          The underlying positions OutputBitStream.
 int prevEntryBitLength
          An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.
 int prevPositionsQuantumBitLength
          An estimate on the number of bits occupied per quantum in the positions stream in the last written cache, or -1 if no cache has been written for the current inverted list.
 int prevQuantumBitLength
          An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.
protected  int state
          The current state of the writer.
 BitStreamHPIndexWriter.TowerData towerData
          The sum of all tower data computed so far.
protected  int writtenDocuments
          The number of document records already written for the current inverted list.
 
Fields inherited from class it.unimi.di.mg4j.index.AbstractBitStreamIndexWriter
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding
 
Constructor Summary
BitStreamHPIndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int quantum, int height)
          Creates a new index writer, with the specified basename.
 
Method Summary
 void close()
          Closes this index writer, completing the index creation process and releasing all resources.
 OutputBitStream newDocumentRecord()
          Starts a new document record.
 long newInvertedList()
          Starts a new inverted list.
 long newInvertedList(long predictedFrequency, double fraction, long predictedSize, long predictedPositionsSize)
          Starts a new inverted list.
 void printStats(PrintStream stats)
          Writes to the given print stream statistical information about the index just built.
 Properties properties()
          Returns properties of the index generated by this index writer.
 void writeDocumentPointer(OutputBitStream unused, int pointer)
          Writes a document pointer.
 void writeDocumentPositions(OutputBitStream unused, int[] position, int offset, int count, int docSize)
          Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.
 void writeFrequency(int frequency)
          Writes the frequency.
 void writePayload(OutputBitStream out, Payload payload)
          Writes the payload for the current document.
 void writePositionCount(OutputBitStream out, int count)
          Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.
 long writtenBits()
          Returns the overall number of bits written onto the underlying stream(s).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_TEMP_BUFFER_SIZE

public static final int DEFAULT_TEMP_BUFFER_SIZE
The size of the buffer for the temporary file used to build an inverted list. Inverted lists shorter than this number of bytes will be directly rebuilt from the buffer, and never flushed to disk.

See Also:
Constant Field Values

BEFORE_INVERTED_LIST

protected static final int BEFORE_INVERTED_LIST
This value of state means that we should call newInvertedList().

See Also:
Constant Field Values

BEFORE_FREQUENCY

protected static final int BEFORE_FREQUENCY
This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(int).

See Also:
Constant Field Values

BEFORE_DOCUMENT_RECORD

protected static final int BEFORE_DOCUMENT_RECORD
This value of state means that we are ready to call newDocumentRecord().

See Also:
Constant Field Values

BEFORE_POINTER

protected static final int BEFORE_POINTER
This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, int).

See Also:
Constant Field Values

BEFORE_PAYLOAD

protected static final int BEFORE_PAYLOAD
This value of state can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.

See Also:
Constant Field Values

BEFORE_COUNT

protected static final int BEFORE_COUNT
This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.

See Also:
Constant Field Values

BEFORE_POSITIONS

protected static final int BEFORE_POSITIONS
This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.

See Also:
Constant Field Values

FIRST_UNUSED_STATE

protected static final int FIRST_UNUSED_STATE
This is the first unused state. Subclasses may start from this value to define new states.

See Also:
Constant Field Values

obs

protected OutputBitStream obs
The underlying index OutputBitStream.


positions

protected OutputBitStream positions
The underlying positions OutputBitStream.


state

protected int state
The current state of the writer.


frequency

protected int frequency
The number of document records that the current inverted list will contain.


writtenDocuments

protected int writtenDocuments
The number of document records already written for the current inverted list.


currentDocument

protected int currentDocument
The current document pointer.


lastDocument

protected int lastDocument
The last document pointer in the current list.


b

protected int b
The parameter b for Golomb coding of pointers.


log2b

protected int log2b
The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.


maxCount

public int maxCount
The maximum number of positions in a document record so far.


bitsForPositionsOffsets

public long bitsForPositionsOffsets
The number of bits written for offsets in the file of positions.


towerData

public final BitStreamHPIndexWriter.TowerData towerData
The sum of all tower data computed so far.


bitsForVariableQuanta

public long bitsForVariableQuanta
The number of bits written for variable quanta.


bitsForQuantumBitLengths

public long bitsForQuantumBitLengths
The number of bits written for quantum lengths.


bitsForPositionsQuantumBitLengths

public long bitsForPositionsQuantumBitLengths
The number of bits written for quantum lengths in the positions stream.


bitsForEntryBitLengths

public long bitsForEntryBitLengths
The number of bits written for entry lenghts.


numberOfBlocks

public long numberOfBlocks
The number of written blocks.


prevEntryBitLength

public int prevEntryBitLength
An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.


prevQuantumBitLength

public int prevQuantumBitLength
An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.


prevPositionsQuantumBitLength

public int prevPositionsQuantumBitLength
An estimate on the number of bits occupied per quantum in the positions stream in the last written cache, or -1 if no cache has been written for the current inverted list.

Constructor Detail

BitStreamHPIndexWriter

public BitStreamHPIndexWriter(CharSequence basename,
                              int numberOfDocuments,
                              boolean writeOffsets,
                              int tempBufferSize,
                              Map<CompressionFlags.Component,CompressionFlags.Coding> flags,
                              int quantum,
                              int height)
                       throws IOException
Creates a new index writer, with the specified basename. The index will be written on a file (stemmed with .index). If writeOffsets, also an offset file will be produced (stemmed with .offsets).

Parameters:
basename - the basename.
numberOfDocuments - the number of documents in the collection to be indexed.
writeOffsets - if true, the offset file will also be produced.
tempBufferSize - the size of the write buffer of the cache.
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
quantum - the quantum; it must be zero, or a power of two; if it is zero, a variable-quantum index is assumed.
height - the maximum height of a skip tower; the cache will contain at most 2h document records.
Throws:
IOException
Method Detail

newInvertedList

public long newInvertedList()
                     throws IOException
Description copied from interface: IndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.

Specified by:
newInvertedList in interface IndexWriter
Returns:
the position (in bits) of the underlying bit stream where the new inverted list starts.
Throws:
IOException

newInvertedList

public long newInvertedList(long predictedFrequency,
                            double fraction,
                            long predictedSize,
                            long predictedPositionsSize)
                     throws IOException
Description copied from interface: VariableQuantumIndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.

This method provides additional information that will be used to compute the correct quantum for the skip structure of the inverted list.

Specified by:
newInvertedList in interface VariableQuantumIndexWriter
Parameters:
predictedFrequency - the predicted frequency of the inverted list; this might just be an approximation.
fraction - the fraction of the inverted list that will be dedicated to skipping structures.
predictedSize - the predicted size of the part of the inverted list that stores terms and counts.
predictedPositionsSize - the predicted size of the part of the inverted list that stores positions.
Returns:
the position (in bits) of the underlying bit stream where the new inverted list starts.
Throws:
IOException
See Also:
IndexWriter.newInvertedList()

writeFrequency

public void writeFrequency(int frequency)
                    throws IOException
Description copied from interface: IndexWriter
Writes the frequency.

Specified by:
writeFrequency in interface IndexWriter
Parameters:
frequency - the (positive) number of document records that this inverted list will contain.
Throws:
IOException

newDocumentRecord

public OutputBitStream newDocumentRecord()
                                  throws IOException
Description copied from interface: IndexWriter
Starts a new document record.

This method must be called exactly exactly f times, where f is the frequency specified with IndexWriter.writeFrequency(int).

Specified by:
newDocumentRecord in interface IndexWriter
Returns:
the output bit stream where the next document record data should be written, if necessary, or null, if IndexWriter.writeDocumentPointer(OutputBitStream, int) ignores its first argument.
Throws:
IOException

writeDocumentPointer

public void writeDocumentPointer(OutputBitStream unused,
                                 int pointer)
                          throws IOException
Description copied from interface: IndexWriter
Writes a document pointer.

This method must be called immediately after IndexWriter.newDocumentRecord().

Specified by:
writeDocumentPointer in interface IndexWriter
Parameters:
unused - the output bit stream where the pointer will be written.
pointer - the document pointer.
Throws:
IOException

writePayload

public void writePayload(OutputBitStream out,
                         Payload payload)
                  throws IOException
Description copied from interface: IndexWriter
Writes the payload for the current document.

This method must be called immediately after IndexWriter.writeDocumentPointer(OutputBitStream, int).

Specified by:
writePayload in interface IndexWriter
Parameters:
out - the output bit stream where the payload will be written.
payload - the payload.
Throws:
IOException

writePositionCount

public void writePositionCount(OutputBitStream out,
                               int count)
                        throws IOException
Description copied from interface: IndexWriter
Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.

Specified by:
writePositionCount in interface IndexWriter
Parameters:
out - the output stream where the occurrences should be written.
count - the count.
Throws:
IOException

writeDocumentPositions

public void writeDocumentPositions(OutputBitStream unused,
                                   int[] position,
                                   int offset,
                                   int count,
                                   int docSize)
                            throws IOException
Description copied from interface: IndexWriter
Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.

Specified by:
writeDocumentPositions in interface IndexWriter
Parameters:
unused - the output stream where the occurrences should be written.
position - the position vector (a sequence of strictly increasing natural numbers).
offset - the first valid entry in position.
count - the number of valid entries in position starting from offset.
docSize - the size of the current document (only for Golomb and interpolative coding; you can safely pass -1 otherwise).
Throws:
IOException

close

public void close()
           throws IOException
Description copied from interface: IndexWriter
Closes this index writer, completing the index creation process and releasing all resources.

Specified by:
close in interface IndexWriter
Throws:
IOException

writtenBits

public long writtenBits()
Description copied from interface: IndexWriter
Returns the overall number of bits written onto the underlying stream(s).

Specified by:
writtenBits in interface IndexWriter
Returns:
the number of bits written, according to the variables keeping statistical records.

properties

public Properties properties()
Description copied from interface: IndexWriter
Returns properties of the index generated by this index writer.

This method should only be called after IndexWriter.close(). It returns a new property object containing values for (whenever appropriate) Index.PropertyKeys.DOCUMENTS, Index.PropertyKeys.TERMS, Index.PropertyKeys.POSTINGS, Index.PropertyKeys.MAXCOUNT, Index.PropertyKeys.INDEXCLASS, Index.PropertyKeys.CODING, Index.PropertyKeys.PAYLOADCLASS, BitStreamIndex.PropertyKeys.SKIPQUANTUM, and BitStreamIndex.PropertyKeys.SKIPHEIGHT.

Specified by:
properties in interface IndexWriter
Returns:
properties a new set of properties for the just created index.

printStats

public void printStats(PrintStream stats)
Description copied from interface: IndexWriter
Writes to the given print stream statistical information about the index just built. This method must be called after IndexWriter.close().

Specified by:
printStats in interface IndexWriter
Overrides:
printStats in class AbstractBitStreamIndexWriter
Parameters:
stats - a print stream where statistical information will be written.