java.lang.Object
it.unimi.di.mg4j.index.AbstractBitStreamIndexWriter
it.unimi.di.mg4j.index.BitStreamIndexWriter
it.unimi.di.mg4j.index.SkipBitStreamIndexWriter
public class SkipBitStreamIndexWriter
Writes a bitstream-based interleaved index with skips.
These indices are maintained in MG4J mainly for historical reasons, as quasi-succinct indices are better in every respect.
An interleaved inverted index with skips makes it possible to skip ahead quickly while reading inverted lists. More specifically, when reading the inverted list relative to a certain term, one may want to decide to skip all document records that concern documents with pointer less than a given integer. In a normal inverted index this is impossible: one would have to read all document records sequentially.
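As an illustration of the idea described above, the following self-contained sketch contrasts a sequential scan with a skip-assisted lookup over a sorted array of document pointers. It does not use MG4J or its bitstream format: the class `SkipSketch`, the one-entry-per-quantum layout, and all method names are assumptions made purely for illustration, and the real skip towers of this class are more elaborate.

```java
import java.util.Arrays;

class SkipSketch {
    /** Sequential scan: index of the first pointer >= target, or pointers.length if none. */
    static int scan(int[] pointers, int from, int target) {
        int i = from;
        while (i < pointers.length && pointers[i] < target) i++;
        return i;
    }

    /** Builds one skip entry per quantum: the pointers at positions 0, q, 2q, ... */
    static int[] buildSkips(int[] pointers, int quantum) {
        int[] skips = new int[(pointers.length + quantum - 1) / quantum];
        for (int k = 0; k < skips.length; k++) skips[k] = pointers[k * quantum];
        return skips;
    }

    /** Skip-assisted lookup: jump over whole quanta, then scan inside a single quantum. */
    static int skipTo(int[] pointers, int[] skips, int quantum, int target) {
        int k = 0;
        // Advance while the next quantum still starts strictly below the target.
        while (k + 1 < skips.length && skips[k + 1] < target) k++;
        return scan(pointers, k * quantum, target);
    }

    public static void main(String[] args) {
        int[] pointers = {2, 5, 9, 14, 20, 21, 30, 42, 57, 63};
        int quantum = 4;
        int[] skips = buildSkips(pointers, quantum);
        System.out.println(Arrays.toString(skips));               // prints [2, 20, 57]
        System.out.println(skipTo(pointers, skips, quantum, 31)); // prints 7
    }
}
```

In this toy layout the lookup touches at most one skip entry per quantum plus fewer than `quantum` document pointers, whereas the plain scan reads every pointer preceding the target.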
The skipping structure used by this class is new, and has been described by Paolo Boldi and Sebastiano Vigna in “Compressed perfect embedded skip lists for quick inverted-index lookups”, Proc. SPIRE 2005, volume 3772 of Lecture Notes in Computer Science, pages 25–28. Springer, 2005.
Nested Class Summary | |
---|---|
static class | SkipBitStreamIndexWriter.TowerData - A structure maintaining statistical data about tower construction. |
Field Summary | |
---|---|
long | bitsForEntryBitLengths - The number of bits written for entry lengths. |
long | bitsForQuantumBitLengths - The number of bits written for quantum lengths. |
long | bitsForVariableQuanta - The number of bits written for variable quanta. |
static int | DEFAULT_TEMP_BUFFER_SIZE - The size of the buffer for the temporary file used to build an inverted list. |
long | numberOfBlocks - The number of written blocks. |
int | prevEntryBitLength - An estimate of the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list. |
int | prevQuantumBitLength - An estimate of the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list. |
SkipBitStreamIndexWriter.TowerData | towerData - The sum of all tower data computed so far. |
Fields inherited from class it.unimi.di.mg4j.index.BitStreamIndexWriter |
---|
b, BEFORE_COUNT, BEFORE_DOCUMENT_RECORD, BEFORE_FREQUENCY, BEFORE_INVERTED_LIST, BEFORE_PAYLOAD, BEFORE_POINTER, BEFORE_POSITIONS, currentDocument, FIRST_UNUSED_STATE, frequency, lastDocument, lastInvertedListPos, log2b, maxCount, obs, state, writtenDocuments |
Fields inherited from class it.unimi.di.mg4j.index.AbstractBitStreamIndexWriter |
---|
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding |
Constructor Summary | |
---|---|
SkipBitStreamIndexWriter(IOFactory ioFactory, CharSequence basename, int numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int quantum, int height) - Creates a new skip index writer with the specified basename. |
Method Summary | |
---|---|
void | close() - Closes this index writer, completing the index creation process and releasing all resources. |
static int | log2Quantum(long predictedFrequency, long numberOfDocuments, double fraction, long predictedSize, long predictedPositionsSize) - Suggests a quantum size using frequency and bit size data. |
OutputBitStream | newDocumentRecord() - Starts a new document record. |
long | newInvertedList() - Starts a new inverted list. |
long | newInvertedList(long predictedFrequency, double fraction, long predictedSize, long predictedPositionsSize) - Starts a new inverted list. |
void | printStats(PrintStream stats) - Writes to the given print stream statistical information about the index just built. |
Properties | properties() - Returns properties of the index generated by this index writer. |
void | writeDocumentPointer(OutputBitStream out, int pointer) - Writes a document pointer. |
void | writeFrequency(int frequency) - Writes the frequency. |
long | writtenBits() - Returns the overall number of bits written onto the underlying stream(s). |
Methods inherited from class it.unimi.di.mg4j.index.BitStreamIndexWriter |
---|
writeDocumentPositions, writePayload, writePositionCount |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_TEMP_BUFFER_SIZE
public final SkipBitStreamIndexWriter.TowerData towerData
public long bitsForVariableQuanta
public long bitsForQuantumBitLengths
public long bitsForEntryBitLengths
public long numberOfBlocks
public int prevEntryBitLength
public int prevQuantumBitLength
Constructor Detail |
---|
public SkipBitStreamIndexWriter(IOFactory ioFactory, CharSequence basename, int numberOfDocuments, boolean writeOffsets, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, int quantum, int height) throws IOException
If writeOffsets is true, an offset file will also be produced (stemmed with .offsets).
Parameters:
ioFactory - the factory that will be used to perform I/O.
basename - the basename.
numberOfDocuments - the number of documents in the collection to be indexed.
writeOffsets - if true, the offset file will also be produced.
tempBufferSize - the size in bytes of the internal temporary buffer (inverted lists shorter than this size will never be flushed to disk).
flags - a flag map setting the coding techniques to be used (see CompressionFlags).
quantum - the quantum; it must be zero or a power of two; if it is zero, a variable-quantum index is assumed.
height - the maximum height of a skip tower; the cache will contain at most 2^h document records.
Throws:
IOException
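The constraints on quantum and height stated above can be checked before a writer is constructed. The following is a minimal sketch, not part of MG4J: the class `SkipParameters` and both method names are hypothetical, and the cache bound is interpreted as 2 raised to the power `height`.

```java
class SkipParameters {
    /** Returns true if quantum is zero (variable quantum) or a positive power of two. */
    static boolean isValidQuantum(int quantum) {
        return quantum == 0 || (quantum > 0 && (quantum & (quantum - 1)) == 0);
    }

    /** Upper bound on the number of document records held in a cache, taken to be 2^height. */
    static long maxCacheRecords(int height) {
        // Assumption: heights outside 0..62 are rejected so that the shift cannot overflow a long.
        if (height < 0 || height > 62) throw new IllegalArgumentException("height out of range: " + height);
        return 1L << height;
    }

    public static void main(String[] args) {
        System.out.println(isValidQuantum(0));   // prints true
        System.out.println(isValidQuantum(16));  // prints true
        System.out.println(isValidQuantum(12));  // prints false
        System.out.println(maxCacheRecords(8));  // prints 256
    }
}
```

A check of this kind is only a convenience for callers; how the constructor itself reacts to an invalid quantum is not stated in this page.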
Method Detail |
---|
public static int log2Quantum(long predictedFrequency, long numberOfDocuments, double fraction, long predictedSize, long predictedPositionsSize)
Parameters:
predictedFrequency - a prediction of the frequency of the inverted list.
numberOfDocuments - the number of documents in the collection.
fraction - the fraction of space to be used for skip lists.
predictedSize - a prediction of the size of the inverted list for terms and counts.
predictedPositionsSize - a prediction of the size of the inverted list for positions (might be zero).
Returns:
predictedFrequency; the logarithm of the suggested quantum size, otherwise.
public long newInvertedList(long predictedFrequency, double fraction, long predictedSize, long predictedPositionsSize) throws IOException
Description copied from interface: VariableQuantumIndexWriter
This method provides additional information that will be used to compute the correct quantum for the skip structure of the inverted list.
Specified by:
newInvertedList in interface VariableQuantumIndexWriter
Parameters:
predictedFrequency - the predicted frequency of the inverted list; this might just be an approximation.
fraction - the fraction of the inverted list that will be dedicated to skipping structures.
predictedSize - the predicted size of the part of the inverted list that stores terms and counts.
predictedPositionsSize - the predicted size of the part of the inverted list that stores positions.
Throws:
IOException
See Also:
IndexWriter.newInvertedList()
public long newInvertedList() throws IOException
Description copied from interface: IndexWriter
Specified by:
newInvertedList in interface IndexWriter
Overrides:
newInvertedList in class BitStreamIndexWriter
Throws:
IOException
public void writeFrequency(int frequency) throws IOException
Description copied from interface: IndexWriter
Specified by:
writeFrequency in interface IndexWriter
Overrides:
writeFrequency in class BitStreamIndexWriter
Parameters:
frequency - the (positive) number of document records that this inverted list will contain.
Throws:
IOException
public OutputBitStream newDocumentRecord() throws IOException
Description copied from interface: IndexWriter
This method must be called exactly f times, where f is the frequency specified with IndexWriter.writeFrequency(int).
Specified by:
newDocumentRecord in interface IndexWriter
Overrides:
newDocumentRecord in class BitStreamIndexWriter
Returns:
null, if IndexWriter.writeDocumentPointer(OutputBitStream, int) ignores its first argument.
Throws:
IOException
public void writeDocumentPointer(OutputBitStream out, int pointer) throws IOException
Description copied from interface: IndexWriter
This method must be called immediately after IndexWriter.newDocumentRecord().
Specified by:
writeDocumentPointer in interface IndexWriter
Overrides:
writeDocumentPointer in class BitStreamIndexWriter
Parameters:
out - the output bit stream where the pointer will be written.
pointer - the document pointer.
Throws:
IOException
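The ordering constraints stated for writeFrequency, newDocumentRecord and writeDocumentPointer can be summarised as a small state machine. The sketch below is a toy stand-in, not the MG4J implementation: the class `CallOrderSketch` is hypothetical, it keeps pointers in a list instead of writing bits, and its writeDocumentPointer omits the OutputBitStream argument of the real method.

```java
import java.util.ArrayList;
import java.util.List;

class CallOrderSketch {
    private int expected = -1;               // frequency announced for the current list, -1 if none
    private int started = 0;                 // document records started so far
    private boolean awaitingPointer = false; // true between newDocumentRecord() and writeDocumentPointer()
    final List<Integer> pointers = new ArrayList<>();

    void newInvertedList() {
        checkComplete();
        expected = -1;
        started = 0;
        pointers.clear();
    }

    void writeFrequency(int frequency) {
        if (frequency <= 0) throw new IllegalArgumentException("frequency must be positive");
        if (expected != -1) throw new IllegalStateException("frequency already written");
        expected = frequency;
    }

    void newDocumentRecord() {
        if (expected == -1) throw new IllegalStateException("writeFrequency() not called");
        if (awaitingPointer) throw new IllegalStateException("pointer of previous record missing");
        if (started == expected) throw new IllegalStateException("too many document records");
        started++;
        awaitingPointer = true;
    }

    void writeDocumentPointer(int pointer) {
        if (!awaitingPointer) throw new IllegalStateException("newDocumentRecord() not called");
        pointers.add(pointer);
        awaitingPointer = false;
    }

    void close() {
        checkComplete();
    }

    /** Fails if the current list has fewer records than announced or a record lacks its pointer. */
    private void checkComplete() {
        if (awaitingPointer || (expected != -1 && started != expected))
            throw new IllegalStateException("inverted list incomplete");
    }

    public static void main(String[] args) {
        CallOrderSketch writer = new CallOrderSketch();
        writer.newInvertedList();
        writer.writeFrequency(2);           // the list will contain exactly two document records
        writer.newDocumentRecord();
        writer.writeDocumentPointer(3);     // immediately after newDocumentRecord()
        writer.newDocumentRecord();
        writer.writeDocumentPointer(8);
        writer.close();
        System.out.println(writer.pointers); // prints [3, 8]
    }
}
```

Whether the real writer detects violations of this order, and how, is not documented on this page; the exceptions thrown here are a design choice of the sketch.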
public void close() throws IOException
Description copied from interface: IndexWriter
Specified by:
close in interface IndexWriter
Overrides:
close in class BitStreamIndexWriter
Throws:
IOException
public long writtenBits()
Description copied from interface: IndexWriter
Specified by:
writtenBits in interface IndexWriter
Overrides:
writtenBits in class BitStreamIndexWriter
public Properties properties()
Description copied from interface: IndexWriter
This method should only be called after IndexWriter.close(). It returns a new property object containing values for (whenever appropriate) Index.PropertyKeys.DOCUMENTS, Index.PropertyKeys.TERMS, Index.PropertyKeys.POSTINGS, Index.PropertyKeys.MAXCOUNT, Index.PropertyKeys.INDEXCLASS, Index.PropertyKeys.CODING, Index.PropertyKeys.PAYLOADCLASS, BitStreamIndex.PropertyKeys.SKIPQUANTUM, and BitStreamIndex.PropertyKeys.SKIPHEIGHT.
Specified by:
properties in interface IndexWriter
Overrides:
properties in class BitStreamIndexWriter
public void printStats(PrintStream stats)
Description copied from interface: IndexWriter
IndexWriter.close().
Specified by:
printStats in interface IndexWriter
Overrides:
printStats in class AbstractBitStreamIndexWriter
Parameters:
stats - a print stream where statistical information will be written.