|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object it.unimi.di.mg4j.index.QuasiSuccinctIndexWriter
public class QuasiSuccinctIndexWriter
An index writer for quasi-succinct indices.
A quasi-succinct index does not use gap-compression to represent its various components, but rather the Elias–Fano representation of monotone sequences. The index was described in detail by Sebastiano Vigna in the paper “Quasi-Succinct Indices”. It is smaller than a γ/δ-code gap-compressed index, and significantly faster when computing conjunctive, phrasal or proximity operators, as it provides constant-time access on average to every piece of information in the index.
In a quasi-succinct index pointers, counters and positions are represented in three different
files, each of which has its own offset file. The file do not contain a byte-oriented
bitstream representation, but rather
arrays of 64-bit longwords with specified byte order (by default the native one
for performance reasons). The longwords are either loaded in memory as a LongBigArrayBigList
or mapped using a ByteBufferLongBigList
. The bit k of a file is the
bit k mod 64 of the longword of index ⌊k / 64⌋.
Nested Class Summary | |
---|---|
protected static class |
QuasiSuccinctIndexWriter.Accumulator
|
protected static class |
QuasiSuccinctIndexWriter.LongWordCache
|
static class |
QuasiSuccinctIndexWriter.LongWordOutputBitStream
|
Field Summary | |
---|---|
static int |
DEFAULT_CACHE_SIZE
The default size of the bit cache. |
Constructor Summary | |
---|---|
QuasiSuccinctIndexWriter(IOFactory ioFactory,
CharSequence basename,
int numberOfDocuments,
int log2Quantum,
int cacheSize,
Map<CompressionFlags.Component,CompressionFlags.Coding> flags,
ByteOrder byteOrder)
Creates a new index writer, with the specified basename. |
Method Summary | |
---|---|
void |
close()
Closes this index writer, completing the index creation process and releasing all resources. |
static int |
lowerBits(long length,
long upperBound,
boolean strict)
Returns the number of lower bits for the Elias–Fano encoding of a list of given length, upper bound and strictness. |
OutputBitStream |
newDocumentRecord()
Starts a new document record. |
long |
newInvertedList()
Starts a new inverted list. |
void |
newInvertedList(int frequency,
long occurrency,
long sumMaxPos)
Starts a new inverted list. |
static long |
numberOfPointers(long length,
long upperBound,
int log2Quantum,
boolean strict,
boolean indexZeroes)
Returns the number of forward or skip pointers to the Elias–Fano encoding of a list of given length, upper bound and strictness. |
static int |
pointerSize(long length,
long upperBound,
boolean strict,
boolean indexZeroes)
Returns the size in bits of forward or skip pointers to the Elias–Fano encoding of a list of given length, upper bound and strictness. |
void |
printStats(PrintStream stats)
Writes to the given print stream statistical information about the index just built. |
Properties |
properties()
Returns properties of the index generated by this index writer. |
void |
writeDocumentPointer(OutputBitStream out,
int pointer)
Writes a document pointer. |
void |
writeDocumentPositions(OutputBitStream unused,
int[] position,
int offset,
int count,
int docSize)
Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream . |
void |
writeFrequency(int frequency)
Writes the frequency. |
void |
writePayload(OutputBitStream unused,
Payload payload)
Writes the payload for the current document. |
void |
writePositionCount(OutputBitStream unused,
int count)
Writes the count of the occurrences of the current term in the current document to the given OutputBitStream . |
long |
writtenBits()
Returns the overall number of bits written onto the underlying stream(s). |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_CACHE_SIZE
Constructor Detail |
---|
public QuasiSuccinctIndexWriter(IOFactory ioFactory, CharSequence basename, int numberOfDocuments, int log2Quantum, int cacheSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, ByteOrder byteOrder) throws IOException
ioFactory
- the factory that will be used to perform I/O.basename
- the basename.numberOfDocuments
- the number of documents in the collection to be indexed.log2Quantum
- the logarithm of the quantum.cacheSize
- the size in byte of the bit caches.byteOrder
- the byte order of the index (if null
, ByteOrder.nativeOrder()
).
IOException
Method Detail |
---|
public static int lowerBits(long length, long upperBound, boolean strict)
length
- the number of elements of the list.upperBound
- an upper bound for the elements of the list.strict
- if true, the elements of the list are strictly increasing, and the
returned number of bits is for the strict representation (e.g., storing the
k-th element decreased by k).
public static int pointerSize(long length, long upperBound, boolean strict, boolean indexZeroes)
length
- the number of elements of the list.upperBound
- an upper bound for the elements of the list.strict
- if true, the elements of the list are strictly increasing, and the
returned number of bits is for the strict representation (e.g., storing the
k-th element decreased by k).indexZeroes
- if true, the number of bits for skip pointers is returned; otherwise,
the number of bits for forward pointers is returned.
public static long numberOfPointers(long length, long upperBound, int log2Quantum, boolean strict, boolean indexZeroes)
length
- the number of elements of the list.upperBound
- an upper bound for the elements of the list.log2Quantum
- the logarithm of the quantum size.strict
- if true, the elements of the list are strictly increasing, and the
returned number of bits is for the strict representation (e.g., storing the
k-th element decreased by k).indexZeroes
- if true, an upper bound on the number of skip pointers is returned; otherwise,
the (exact) number of forward pointers is returned.
public void newInvertedList(int frequency, long occurrency, long sumMaxPos) throws IOException
This method provides additional information which is necessary to build the posting list. The information can be omitted if only part of the index is being written (e.g., no positions or even no counts and positions).
frequency
- the frequency of the inverted list.occurrency
- the occurrency of the inverted list (use -1 if you are not writing counts).sumMaxPos
- the sum of the maximum position in each document (unused if positions are not indexed).
IllegalStateException
- if too few records were written for the previous inverted list.
IOException
IndexWriter.newInvertedList()
public long newInvertedList() throws IOException
IndexWriter
newInvertedList
in interface IndexWriter
IOException
public void writeFrequency(int frequency)
IndexWriter
writeFrequency
in interface IndexWriter
frequency
- the (positive) number of document records that this inverted list will contain.public OutputBitStream newDocumentRecord() throws IOException
IndexWriter
This method must be called exactly exactly f times, where f is the frequency specified with
IndexWriter.writeFrequency(int)
.
newDocumentRecord
in interface IndexWriter
null
,
if IndexWriter.writeDocumentPointer(OutputBitStream, int)
ignores its first argument.
IOException
public void writeDocumentPointer(OutputBitStream out, int pointer) throws IOException
IndexWriter
This method must be called immediately after IndexWriter.newDocumentRecord()
.
writeDocumentPointer
in interface IndexWriter
out
- the output bit stream where the pointer will be written.pointer
- the document pointer.
IOException
public void writePayload(OutputBitStream unused, Payload payload) throws IOException
IndexWriter
This method must be called immediately after IndexWriter.writeDocumentPointer(OutputBitStream, int)
.
writePayload
in interface IndexWriter
unused
- the output bit stream where the payload will be written.payload
- the payload.
IOException
public void writePositionCount(OutputBitStream unused, int count) throws IOException
IndexWriter
OutputBitStream
.
writePositionCount
in interface IndexWriter
unused
- the output stream where the occurrences should be written.count
- the count.
IOException
public void writeDocumentPositions(OutputBitStream unused, int[] position, int offset, int count, int docSize) throws IOException
IndexWriter
OutputBitStream
.
writeDocumentPositions
in interface IndexWriter
unused
- the output stream where the occurrences should be written.position
- the position vector (a sequence of strictly increasing natural numbers).offset
- the first valid entry in position
.count
- the number of valid entries in position
starting from offset
.docSize
- the size of the current document (only for Golomb and interpolative coding; you can safely pass -1 otherwise).
IOException
public long writtenBits()
IndexWriter
writtenBits
in interface IndexWriter
public Properties properties()
IndexWriter
This method should only be called after IndexWriter.close()
.
It returns a new property object
containing values for (whenever appropriate)
Index.PropertyKeys.DOCUMENTS
, Index.PropertyKeys.TERMS
,
Index.PropertyKeys.POSTINGS
, Index.PropertyKeys.MAXCOUNT
,
Index.PropertyKeys.INDEXCLASS
, Index.PropertyKeys.CODING
, Index.PropertyKeys.PAYLOADCLASS
,
BitStreamIndex.PropertyKeys.SKIPQUANTUM
, and BitStreamIndex.PropertyKeys.SKIPHEIGHT
.
properties
in interface IndexWriter
public void close() throws IOException
IndexWriter
close
in interface IndexWriter
IOException
public void printStats(PrintStream stats)
IndexWriter
IndexWriter.close()
.
printStats
in interface IndexWriter
stats
- a print stream where statistical information will be written.
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |