|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectit.unimi.di.mg4j.index.AbstractBitStreamIndexWriter
it.unimi.di.mg4j.index.BitStreamHPIndexWriter
public class BitStreamHPIndexWriter
Writes a bitstream-based high-performance index.
These indices are managed by MG4J mainly for historical reasons, as quasi-succinct indices are just better under every respect. In particular, they do not support I/O factories.
The difference between indices generated by this class and those generated
by SkipBitStreamIndexWriter
lie in the level
of interleaving. Indices generated by this class store positions in a separate stream with extension .positions (similarly to Lucene), and
have a mandatory skip structure (an extension of that used by a SkipBitStreamIndexWriter
)
that indexes both the main index file and the positions file. This can result in major performance
improvement in the resolution of position-based operators (e.g., phrases) and in the evaluation
of proximity-based scorers.
Presently, indices generated by this class cannot carry payloads: you must use a BitStreamIndexWriter
in that case. Moreover, only nonparametric indices can be used for positions
(this limitation rules out CompressionFlags.Coding.GOLOMB
and CompressionFlags.Coding.INTERPOLATIVE
).
Nested Class Summary | |
---|---|
static class |
BitStreamHPIndexWriter.TowerData
A structure maintaining statistical data about tower construction. |
Field Summary | |
---|---|
protected int |
b
The parameter b for Golomb coding of pointers. |
protected static int |
BEFORE_COUNT
This value of state can be assumed only in indices that contain counts; it
means that we are positioned just before the count for the current document record. |
protected static int |
BEFORE_DOCUMENT_RECORD
This value of state means that we are ready to call newDocumentRecord() . |
protected static int |
BEFORE_FREQUENCY
This value of state means that we are positioned at the start of an inverted list,
and we should call writeFrequency(int) . |
protected static int |
BEFORE_INVERTED_LIST
This value of state means that we should call newInvertedList() . |
protected static int |
BEFORE_PAYLOAD
This value of state can be assumed only in indices that contain payloads; it
means that we are positioned just before the payload for the current document record. |
protected static int |
BEFORE_POINTER
This value of state means that we just started a new document record, and we
should call writeDocumentPointer(OutputBitStream, int) . |
protected static int |
BEFORE_POSITIONS
This value of state can be assumed only in indices that contain document positions;
it means that we are positioned just before the position list of the current document record. |
long |
bitsForEntryBitLengths
The number of bits written for entry lenghts. |
long |
bitsForPositionsOffsets
The number of bits written for offsets in the file of positions. |
long |
bitsForPositionsQuantumBitLengths
The number of bits written for quantum lengths in the positions stream. |
long |
bitsForQuantumBitLengths
The number of bits written for quantum lengths. |
long |
bitsForVariableQuanta
The number of bits written for variable quanta. |
protected int |
currentDocument
The current document pointer. |
static int |
DEFAULT_TEMP_BUFFER_SIZE
The size of the buffer for the temporary file used to build an inverted list. |
protected static int |
FIRST_UNUSED_STATE
This is the first unused state. |
protected int |
frequency
The number of document records that the current inverted list will contain. |
protected int |
lastDocument
The last document pointer in the current list. |
protected int |
log2b
The parameter log2b for Golomb coding of pointers; it is the most significant bit of b . |
int |
maxCount
The maximum number of positions in a document record so far. |
long |
numberOfBlocks
The number of written blocks. |
protected OutputBitStream |
obs
The underlying index OutputBitStream . |
protected OutputBitStream |
positions
The underlying positions OutputBitStream . |
int |
prevEntryBitLength
An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list. |
int |
prevPositionsQuantumBitLength
An estimate on the number of bits occupied per quantum in the positions stream in the last written cache, or -1 if no cache has been written for the current inverted list. |
int |
prevQuantumBitLength
An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list. |
protected int |
state
The current state of the writer. |
BitStreamHPIndexWriter.TowerData |
towerData
The sum of all tower data computed so far. |
protected int |
writtenDocuments
The number of document records already written for the current inverted list. |
Fields inherited from class it.unimi.di.mg4j.index.AbstractBitStreamIndexWriter |
---|
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding |
Constructor Summary | |
---|---|
BitStreamHPIndexWriter(CharSequence basename,
int numberOfDocuments,
boolean writeOffsets,
int tempBufferSize,
Map<CompressionFlags.Component,CompressionFlags.Coding> flags,
int quantum,
int height)
Creates a new index writer, with the specified basename. |
Method Summary | |
---|---|
void |
close()
Closes this index writer, completing the index creation process and releasing all resources. |
OutputBitStream |
newDocumentRecord()
Starts a new document record. |
long |
newInvertedList()
Starts a new inverted list. |
long |
newInvertedList(long predictedFrequency,
double fraction,
long predictedSize,
long predictedPositionsSize)
Starts a new inverted list. |
void |
printStats(PrintStream stats)
Writes to the given print stream statistical information about the index just built. |
Properties |
properties()
Returns properties of the index generated by this index writer. |
void |
writeDocumentPointer(OutputBitStream unused,
int pointer)
Writes a document pointer. |
void |
writeDocumentPositions(OutputBitStream unused,
int[] position,
int offset,
int count,
int docSize)
Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream . |
void |
writeFrequency(int frequency)
Writes the frequency. |
void |
writePayload(OutputBitStream out,
Payload payload)
Writes the payload for the current document. |
void |
writePositionCount(OutputBitStream out,
int count)
Writes the count of the occurrences of the current term in the current document to the given OutputBitStream . |
long |
writtenBits()
Returns the overall number of bits written onto the underlying stream(s). |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_TEMP_BUFFER_SIZE
protected static final int BEFORE_INVERTED_LIST
state
means that we should call newInvertedList()
.
protected static final int BEFORE_FREQUENCY
state
means that we are positioned at the start of an inverted list,
and we should call writeFrequency(int)
.
protected static final int BEFORE_DOCUMENT_RECORD
state
means that we are ready to call newDocumentRecord()
.
protected static final int BEFORE_POINTER
state
means that we just started a new document record, and we
should call writeDocumentPointer(OutputBitStream, int)
.
protected static final int BEFORE_PAYLOAD
state
can be assumed only in indices that contain payloads; it
means that we are positioned just before the payload for the current document record.
protected static final int BEFORE_COUNT
state
can be assumed only in indices that contain counts; it
means that we are positioned just before the count for the current document record.
protected static final int BEFORE_POSITIONS
state
can be assumed only in indices that contain document positions;
it means that we are positioned just before the position list of the current document record.
protected static final int FIRST_UNUSED_STATE
protected OutputBitStream obs
OutputBitStream
.
protected OutputBitStream positions
OutputBitStream
.
protected int state
protected int frequency
protected int writtenDocuments
protected int currentDocument
protected int lastDocument
protected int b
b
for Golomb coding of pointers.
protected int log2b
log2b
for Golomb coding of pointers; it is the most significant bit of b
.
public int maxCount
public long bitsForPositionsOffsets
public final BitStreamHPIndexWriter.TowerData towerData
public long bitsForVariableQuanta
public long bitsForQuantumBitLengths
public long bitsForPositionsQuantumBitLengths
public long bitsForEntryBitLengths
public long numberOfBlocks
public int prevEntryBitLength
public int prevQuantumBitLength
public int prevPositionsQuantumBitLength
Constructor Detail |
---|
public BitStreamHPIndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int quantum, int height) throws IOException
writeOffsets
, also an offset file will be produced (stemmed with .offsets).
basename
- the basename.numberOfDocuments
- the number of documents in the collection to be indexed.writeOffsets
- if true
, the offset file will also be produced.tempBufferSize
- the size of the write buffer of the cache.flags
- a flag map setting the coding techniques to be used (see CompressionFlags
).quantum
- the quantum; it must be zero, or a power of two; if it is zero, a variable-quantum index is assumed.height
- the maximum height of a skip tower; the cache will contain at most 2h document records.
IOException
Method Detail |
---|
public long newInvertedList() throws IOException
IndexWriter
newInvertedList
in interface IndexWriter
IOException
public long newInvertedList(long predictedFrequency, double fraction, long predictedSize, long predictedPositionsSize) throws IOException
VariableQuantumIndexWriter
This method provides additional information that will be used to compute the correct quantum for the skip structure of the inverted list.
newInvertedList
in interface VariableQuantumIndexWriter
predictedFrequency
- the predicted frequency of the inverted list; this might
just be an approximation.fraction
- the fraction of the inverted list that will be dedicated to
skipping structures.predictedSize
- the predicted size of the part of the inverted list that stores
terms and counts.predictedPositionsSize
- the predicted size of the part of the inverted list that
stores positions.
IOException
IndexWriter.newInvertedList()
public void writeFrequency(int frequency) throws IOException
IndexWriter
writeFrequency
in interface IndexWriter
frequency
- the (positive) number of document records that this inverted list will contain.
IOException
public OutputBitStream newDocumentRecord() throws IOException
IndexWriter
This method must be called exactly exactly f times, where f is the frequency specified with
IndexWriter.writeFrequency(int)
.
newDocumentRecord
in interface IndexWriter
null
,
if IndexWriter.writeDocumentPointer(OutputBitStream, int)
ignores its first argument.
IOException
public void writeDocumentPointer(OutputBitStream unused, int pointer) throws IOException
IndexWriter
This method must be called immediately after IndexWriter.newDocumentRecord()
.
writeDocumentPointer
in interface IndexWriter
unused
- the output bit stream where the pointer will be written.pointer
- the document pointer.
IOException
public void writePayload(OutputBitStream out, Payload payload) throws IOException
IndexWriter
This method must be called immediately after IndexWriter.writeDocumentPointer(OutputBitStream, int)
.
writePayload
in interface IndexWriter
out
- the output bit stream where the payload will be written.payload
- the payload.
IOException
public void writePositionCount(OutputBitStream out, int count) throws IOException
IndexWriter
OutputBitStream
.
writePositionCount
in interface IndexWriter
out
- the output stream where the occurrences should be written.count
- the count.
IOException
public void writeDocumentPositions(OutputBitStream unused, int[] position, int offset, int count, int docSize) throws IOException
IndexWriter
OutputBitStream
.
writeDocumentPositions
in interface IndexWriter
unused
- the output stream where the occurrences should be written.position
- the position vector (a sequence of strictly increasing natural numbers).offset
- the first valid entry in position
.count
- the number of valid entries in position
starting from offset
.docSize
- the size of the current document (only for Golomb and interpolative coding; you can safely pass -1 otherwise).
IOException
public void close() throws IOException
IndexWriter
close
in interface IndexWriter
IOException
public long writtenBits()
IndexWriter
writtenBits
in interface IndexWriter
public Properties properties()
IndexWriter
This method should only be called after IndexWriter.close()
.
It returns a new property object
containing values for (whenever appropriate)
Index.PropertyKeys.DOCUMENTS
, Index.PropertyKeys.TERMS
,
Index.PropertyKeys.POSTINGS
, Index.PropertyKeys.MAXCOUNT
,
Index.PropertyKeys.INDEXCLASS
, Index.PropertyKeys.CODING
, Index.PropertyKeys.PAYLOADCLASS
,
BitStreamIndex.PropertyKeys.SKIPQUANTUM
, and BitStreamIndex.PropertyKeys.SKIPHEIGHT
.
properties
in interface IndexWriter
public void printStats(PrintStream stats)
IndexWriter
IndexWriter.close()
.
printStats
in interface IndexWriter
printStats
in class AbstractBitStreamIndexWriter
stats
- a print stream where statistical information will be written.
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |