it.unimi.di.mg4j.index
Class QuasiSuccinctIndexWriter

java.lang.Object
  extended by it.unimi.di.mg4j.index.QuasiSuccinctIndexWriter
All Implemented Interfaces:
IndexWriter

public class QuasiSuccinctIndexWriter
extends Object
implements IndexWriter

An index writer for quasi-succinct indices.

A quasi-succinct index does not use gap-compression to represent its various components, but rather the Elias–Fano representation of monotone sequences. The index was described in detail by Sebastiano Vigna in the paper “Quasi-Succinct Indices”. It is smaller than a γ/δ-code gap-compressed index, and significantly faster when computing conjunctive, phrasal or proximity operators, as it provides constant-time access on average to every piece of information in the index.

In a quasi-succinct index pointers, counters and positions are represented in three different files, each of which has its own offset file. The file do not contain a byte-oriented bitstream representation, but rather arrays of 64-bit longwords with specified byte order (by default the native one for performance reasons). The longwords are either loaded in memory as a LongBigArrayBigList or mapped using a ByteBufferLongBigList. The bit k of a file is the bit k mod 64 of the longword of index ⌊k / 64⌋.

Author:
Sebastiano Vigna

Nested Class Summary
protected static class QuasiSuccinctIndexWriter.Accumulator
           
protected static class QuasiSuccinctIndexWriter.LongWordCache
           
static class QuasiSuccinctIndexWriter.LongWordOutputBitStream
           
 
Field Summary
static int DEFAULT_CACHE_SIZE
          The default size of the bit cache.
 
Constructor Summary
QuasiSuccinctIndexWriter(IOFactory ioFactory, CharSequence basename, int numberOfDocuments, int log2Quantum, int cacheSize, Map<CompressionFlags.Component,CompressionFlags.Coding> flags, ByteOrder byteOrder)
          Creates a new index writer, with the specified basename.
 
Method Summary
 void close()
          Closes this index writer, completing the index creation process and releasing all resources.
static int lowerBits(long length, long upperBound, boolean strict)
          Returns the number of lower bits for the Elias–Fano encoding of a list of given length, upper bound and strictness.
 OutputBitStream newDocumentRecord()
          Starts a new document record.
 long newInvertedList()
          Starts a new inverted list.
 void newInvertedList(int frequency, long occurrency, long sumMaxPos)
          Starts a new inverted list.
static long numberOfPointers(long length, long upperBound, int log2Quantum, boolean strict, boolean indexZeroes)
          Returns the number of forward or skip pointers to the Elias–Fano encoding of a list of given length, upper bound and strictness.
static int pointerSize(long length, long upperBound, boolean strict, boolean indexZeroes)
          Returns the size in bits of forward or skip pointers to the Elias–Fano encoding of a list of given length, upper bound and strictness.
 void printStats(PrintStream stats)
          Writes to the given print stream statistical information about the index just built.
 Properties properties()
          Returns properties of the index generated by this index writer.
 void writeDocumentPointer(OutputBitStream out, int pointer)
          Writes a document pointer.
 void writeDocumentPositions(OutputBitStream unused, int[] position, int offset, int count, int docSize)
          Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.
 void writeFrequency(int frequency)
          Writes the frequency.
 void writePayload(OutputBitStream unused, Payload payload)
          Writes the payload for the current document.
 void writePositionCount(OutputBitStream unused, int count)
          Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.
 long writtenBits()
          Returns the overall number of bits written onto the underlying stream(s).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_CACHE_SIZE

public static final int DEFAULT_CACHE_SIZE
The default size of the bit cache.

See Also:
Constant Field Values
Constructor Detail

QuasiSuccinctIndexWriter

public QuasiSuccinctIndexWriter(IOFactory ioFactory,
                                CharSequence basename,
                                int numberOfDocuments,
                                int log2Quantum,
                                int cacheSize,
                                Map<CompressionFlags.Component,CompressionFlags.Coding> flags,
                                ByteOrder byteOrder)
                         throws IOException
Creates a new index writer, with the specified basename.

Parameters:
ioFactory - the factory that will be used to perform I/O.
basename - the basename.
numberOfDocuments - the number of documents in the collection to be indexed.
log2Quantum - the logarithm of the quantum.
cacheSize - the size in byte of the bit caches.
byteOrder - the byte order of the index (if null, ByteOrder.nativeOrder()).
Throws:
IOException
Method Detail

lowerBits

public static int lowerBits(long length,
                            long upperBound,
                            boolean strict)
Returns the number of lower bits for the Elias–Fano encoding of a list of given length, upper bound and strictness.

Parameters:
length - the number of elements of the list.
upperBound - an upper bound for the elements of the list.
strict - if true, the elements of the list are strictly increasing, and the returned number of bits is for the strict representation (e.g., storing the k-th element decreased by k).
Returns:
the number of bits for the Elias–Fano encoding of a list with the specified parameters.

pointerSize

public static int pointerSize(long length,
                              long upperBound,
                              boolean strict,
                              boolean indexZeroes)
Returns the size in bits of forward or skip pointers to the Elias–Fano encoding of a list of given length, upper bound and strictness.

Parameters:
length - the number of elements of the list.
upperBound - an upper bound for the elements of the list.
strict - if true, the elements of the list are strictly increasing, and the returned number of bits is for the strict representation (e.g., storing the k-th element decreased by k).
indexZeroes - if true, the number of bits for skip pointers is returned; otherwise, the number of bits for forward pointers is returned.
Returns:
the size of bits of forward or skip pointers the Elias–Fano encoding of a list with the specified parameters.

numberOfPointers

public static long numberOfPointers(long length,
                                    long upperBound,
                                    int log2Quantum,
                                    boolean strict,
                                    boolean indexZeroes)
Returns the number of forward or skip pointers to the Elias–Fano encoding of a list of given length, upper bound and strictness.

Parameters:
length - the number of elements of the list.
upperBound - an upper bound for the elements of the list.
log2Quantum - the logarithm of the quantum size.
strict - if true, the elements of the list are strictly increasing, and the returned number of bits is for the strict representation (e.g., storing the k-th element decreased by k).
indexZeroes - if true, an upper bound on the number of skip pointers is returned; otherwise, the (exact) number of forward pointers is returned.
Returns:
an upper bound on the number of skip pointers or the (exact) number of forward pointers.

newInvertedList

public void newInvertedList(int frequency,
                            long occurrency,
                            long sumMaxPos)
                     throws IOException
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.

This method provides additional information which is necessary to build the posting list. The information can be omitted if only part of the index is being written (e.g., no positions or even no counts and positions).

Parameters:
frequency - the frequency of the inverted list.
occurrency - the occurrency of the inverted list (use -1 if you are not writing counts).
sumMaxPos - the sum of the maximum position in each document (unused if positions are not indexed).
Throws:
IllegalStateException - if too few records were written for the previous inverted list.
IOException
See Also:
IndexWriter.newInvertedList()

newInvertedList

public long newInvertedList()
                     throws IOException
Description copied from interface: IndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.

Specified by:
newInvertedList in interface IndexWriter
Returns:
the position (in bits) of the underlying bit stream where the new inverted list starts.
Throws:
IOException

writeFrequency

public void writeFrequency(int frequency)
Description copied from interface: IndexWriter
Writes the frequency.

Specified by:
writeFrequency in interface IndexWriter
Parameters:
frequency - the (positive) number of document records that this inverted list will contain.

newDocumentRecord

public OutputBitStream newDocumentRecord()
                                  throws IOException
Description copied from interface: IndexWriter
Starts a new document record.

This method must be called exactly exactly f times, where f is the frequency specified with IndexWriter.writeFrequency(int).

Specified by:
newDocumentRecord in interface IndexWriter
Returns:
the output bit stream where the next document record data should be written, if necessary, or null, if IndexWriter.writeDocumentPointer(OutputBitStream, int) ignores its first argument.
Throws:
IOException

writeDocumentPointer

public void writeDocumentPointer(OutputBitStream out,
                                 int pointer)
                          throws IOException
Description copied from interface: IndexWriter
Writes a document pointer.

This method must be called immediately after IndexWriter.newDocumentRecord().

Specified by:
writeDocumentPointer in interface IndexWriter
Parameters:
out - the output bit stream where the pointer will be written.
pointer - the document pointer.
Throws:
IOException

writePayload

public void writePayload(OutputBitStream unused,
                         Payload payload)
                  throws IOException
Description copied from interface: IndexWriter
Writes the payload for the current document.

This method must be called immediately after IndexWriter.writeDocumentPointer(OutputBitStream, int).

Specified by:
writePayload in interface IndexWriter
Parameters:
unused - the output bit stream where the payload will be written.
payload - the payload.
Throws:
IOException

writePositionCount

public void writePositionCount(OutputBitStream unused,
                               int count)
                        throws IOException
Description copied from interface: IndexWriter
Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.

Specified by:
writePositionCount in interface IndexWriter
Parameters:
unused - the output stream where the occurrences should be written.
count - the count.
Throws:
IOException

writeDocumentPositions

public void writeDocumentPositions(OutputBitStream unused,
                                   int[] position,
                                   int offset,
                                   int count,
                                   int docSize)
                            throws IOException
Description copied from interface: IndexWriter
Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.

Specified by:
writeDocumentPositions in interface IndexWriter
Parameters:
unused - the output stream where the occurrences should be written.
position - the position vector (a sequence of strictly increasing natural numbers).
offset - the first valid entry in position.
count - the number of valid entries in position starting from offset.
docSize - the size of the current document (only for Golomb and interpolative coding; you can safely pass -1 otherwise).
Throws:
IOException

writtenBits

public long writtenBits()
Description copied from interface: IndexWriter
Returns the overall number of bits written onto the underlying stream(s).

Specified by:
writtenBits in interface IndexWriter
Returns:
the number of bits written, according to the variables keeping statistical records.

properties

public Properties properties()
Description copied from interface: IndexWriter
Returns properties of the index generated by this index writer.

This method should only be called after IndexWriter.close(). It returns a new property object containing values for (whenever appropriate) Index.PropertyKeys.DOCUMENTS, Index.PropertyKeys.TERMS, Index.PropertyKeys.POSTINGS, Index.PropertyKeys.MAXCOUNT, Index.PropertyKeys.INDEXCLASS, Index.PropertyKeys.CODING, Index.PropertyKeys.PAYLOADCLASS, BitStreamIndex.PropertyKeys.SKIPQUANTUM, and BitStreamIndex.PropertyKeys.SKIPHEIGHT.

Specified by:
properties in interface IndexWriter
Returns:
properties a new set of properties for the just created index.

close

public void close()
           throws IOException
Description copied from interface: IndexWriter
Closes this index writer, completing the index creation process and releasing all resources.

Specified by:
close in interface IndexWriter
Throws:
IOException

printStats

public void printStats(PrintStream stats)
Description copied from interface: IndexWriter
Writes to the given print stream statistical information about the index just built. This method must be called after IndexWriter.close().

Specified by:
printStats in interface IndexWriter
Parameters:
stats - a print stream where statistical information will be written.