java.lang.Object
- it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter
- - it.unimi.di.big.mg4j.index.BitStreamIndexWriter

All Implemented Interfaces:

IndexWriter

Direct Known Subclasses:

SkipBitStreamIndexWriter
```
public class BitStreamIndexWriter
extends AbstractBitStreamIndexWriter
```
Writes a bitstream-based interleaved index.
Indices written by this class are somewhat classical. Each inverted list contains the frequency, followed by gap-encoded pointers optionally interleaved with counts and gap-encoded positions. The compression technique used for each component can be chosen using a compression flag.
Interleaved indices of this kind are essentially unusable, as all information in each posting list must be entirely read (no skipping is possible). One possible exception is disjunctive queries which use all the information in the index (e.g., with proximity scoring). Another possible usage is to test the compression power of different codes, as essentially all classical compression techniques are available. But, most importantly, the Scan tool generates interleaved indices as batches (albeit not using this class).
These are the files that form an interleaved index:

basename.properties
A Java property file containing information about the index.
basename.terms
For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely, the i-th line of the file (starting from 0) contains the literal string corresponding to term index i.
basename.frequencies
For each term, the number of documents in which the term appears in γ coding. More precisely, i-th integer of the file (starting from 0) is the number of documents in which the term of index i appears. This information appears also at the start of each posting list in the index, but it is also stored in this file for convenience.
basename.sizes (not generated for payload-based indices)
For each indexed document, the corresponding size (=number of words) in γ coding. More precisely, i-th integer of the file (starting from 0) is the size in words of the document of index i.
basename.index
The inverted index.
basename.offsets
For each term, the bit offset in basename.index at which the inverted lists start. More precisely, the first integer is the offset for term 0 in γ coding, and then the i-th integer is the difference between the i-th and the i−1-th offset in γ coding. If T terms were indexed, this file will contain T+1 integers, the last being the difference (in bits) between the length of the entire inverted index and the offset of the last inverted list. Thus, in practice, the file is formed by the number zero (the offset of the first list) followed by the length in bits of each inverted list.
basename.occurrencies
For each term, its occurrency, that is, the number of its occurrences throughout the whole document collection, in γ coding. More precisely, the i-th integer of the file (starting from 0) is the occurrency of the term of index i.
basename.posnumbits
For each term, the number of bits spent to store positions in γ code (used just for quantum-optimisation purposes).
basename.sumsmaxpos
For each term, the sum of the maximum positions in which the term appears (necessary to build a QuasiSuccinctIndex) in δ code.
basename.stats
Miscellaneous detailed statistics about the index.

Since:

0.6

Author:

Paolo Boldi, Sebastiano Vigna

Field Summary

Fields
Modifier and Type	Field	Description
`protected int`	`b`	The parameter `b` for Golomb coding of pointers.
`protected static int`	`BEFORE_COUNT`	This value of `state` can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.
`protected static int`	`BEFORE_DOCUMENT_RECORD`	This value of `state` means that we are ready to call `newDocumentRecord()`.
`protected static int`	`BEFORE_FREQUENCY`	This value of `state` means that we are positioned at the start of an inverted list, and we should call `writeFrequency(long)`.
`protected static int`	`BEFORE_INVERTED_LIST`	This value of `state` means that we should call `newInvertedList()`.
`protected static int`	`BEFORE_PAYLOAD`	This value of `state` can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.
`protected static int`	`BEFORE_POINTER`	This value of `state` means that we just started a new document record, and we should call `writeDocumentPointer(OutputBitStream, long)`.
`protected static int`	`BEFORE_POSITIONS`	This value of `state` can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.
`protected long`	`currentDocument`	The current document pointer.
`protected static int`	`FIRST_UNUSED_STATE`	This is the first unused state.
`protected long`	`frequency`	The number of document records that the current inverted list will contain.
`protected long`	`lastDocument`	The last document pointer in the current list.
`protected long`	`lastInvertedListPos`	The position (in bytes) where the last inverted list started.
`protected int`	`log2b`	The parameter `log2b` for Golomb coding of pointers; it is the most significant bit of `b`.
`int`	`maxCount`	The maximum number of positions in a document record so far.
`protected OutputBitStream`	`obs`	The underlying `OutputBitStream`.
`protected int`	`state`	The current state of the writer.
`protected long`	`writtenDocuments`	The number of document records already written for the current inverted list.

Fields inherited from class it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter
bitsForCounts, bitsForFrequencies, bitsForPayloads, bitsForPointers, bitsForPositions, countCoding, currentTerm, flags, frequencyCoding, hasCounts, hasPayloads, hasPositions, numberOfDocuments, numberOfOccurrences, numberOfPostings, pointerCoding, positionCoding

Constructor Summary

Constructors
Constructor	Description
`BitStreamIndexWriter(IOFactory ioFactory, CharSequence basename, long numberOfDocuments, boolean writeOffsets, Map<CompressionFlags.Component,CompressionFlags.Coding> flags)`	Creates a new index writer with the specified basename.

Method Summary

Modifier and Type	Method	Description
`void`	`close()`	Closes this index writer, completing the index creation process and releasing all resources.
`OutputBitStream`	`newDocumentRecord()`	Starts a new document record.
`long`	`newInvertedList()`	Starts a new inverted list.
`Properties`	`properties()`	Returns properties of the index generated by this index writer.
`void`	`writeDocumentPointer(OutputBitStream out, long pointer)`	Writes a document pointer.
`void`	`writeDocumentPositions(OutputBitStream out, int[] position, int offset, int count, int docSize)`	Writes the positions of the occurrences of the current term in the current document to the given `OutputBitStream`.
`void`	`writeFrequency(long frequency)`	Writes the frequency.
`void`	`writePayload(OutputBitStream out, Payload payload)`	Writes the payload for the current document.
`void`	`writePositionCount(OutputBitStream out, int count)`	Writes the count of the occurrences of the current term in the current document to the given `OutputBitStream`.
`long`	`writtenBits()`	Returns the overall number of bits written onto the underlying stream(s).

Methods inherited from class it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter
printStats

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - BEFORE_INVERTED_LIST
```
protected static final int BEFORE_INVERTED_LIST
```
    This value of state means that we should call newInvertedList().
    
    See Also:
    
    Constant Field Values
  - BEFORE_FREQUENCY
```
protected static final int BEFORE_FREQUENCY
```
    This value of state means that we are positioned at the start of an inverted list, and we should call writeFrequency(long).
    
    See Also:
    
    Constant Field Values
  - BEFORE_DOCUMENT_RECORD
```
protected static final int BEFORE_DOCUMENT_RECORD
```
    This value of state means that we are ready to call newDocumentRecord().
    
    See Also:
    
    Constant Field Values
  - BEFORE_POINTER
```
protected static final int BEFORE_POINTER
```
    This value of state means that we just started a new document record, and we should call writeDocumentPointer(OutputBitStream, long).
    
    See Also:
    
    Constant Field Values
  - BEFORE_PAYLOAD
```
protected static final int BEFORE_PAYLOAD
```
    This value of state can be assumed only in indices that contain payloads; it means that we are positioned just before the payload for the current document record.
    
    See Also:
    
    Constant Field Values
  - BEFORE_COUNT
```
protected static final int BEFORE_COUNT
```
    This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.
    
    See Also:
    
    Constant Field Values
  - BEFORE_POSITIONS
```
protected static final int BEFORE_POSITIONS
```
    This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.
    
    See Also:
    
    Constant Field Values
  - FIRST_UNUSED_STATE
```
protected static final int FIRST_UNUSED_STATE
```
    This is the first unused state. Subclasses may start from this value to define new states.
    
    See Also:
    
    Constant Field Values
  - obs
```
protected OutputBitStream obs
```
    The underlying OutputBitStream.
  - state
```
protected int state
```
    The current state of the writer.
  - frequency
```
protected long frequency
```
    The number of document records that the current inverted list will contain.
  - writtenDocuments
```
protected long writtenDocuments
```
    The number of document records already written for the current inverted list.
  - currentDocument
```
protected long currentDocument
```
    The current document pointer.
  - lastDocument
```
protected long lastDocument
```
    The last document pointer in the current list.
  - lastInvertedListPos
```
protected long lastInvertedListPos
```
    The position (in bytes) where the last inverted list started.
  - maxCount
```
public int maxCount
```
    The maximum number of positions in a document record so far.
  - b
```
protected int b
```
    The parameter b for Golomb coding of pointers.
  - log2b
```
protected int log2b
```
    The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.
- Constructor Detail
  - BitStreamIndexWriter
```
public BitStreamIndexWriter(IOFactory ioFactory,
                            CharSequence basename,
                            long numberOfDocuments,
                            boolean writeOffsets,
                            Map<CompressionFlags.Component,CompressionFlags.Coding> flags)
                     throws IOException
```
    Creates a new index writer with the specified basename. The index will be written on a file (stemmed with .index). If writeOffsets, also an offset file will be produced (stemmed with .offsets). When close() will be called, the property file will also be produced (stemmed with .properties), or enriched if it already exists.
    
    Parameters:
    
    ioFactory - the factory that will be used to perform I/O.
    
    basename - the basename.
    
    numberOfDocuments - the number of documents in the collection to be indexed.
    
    writeOffsets - if true, the offset file will also be produced.
    
    flags - a flag map setting the coding techniques to be used (see CompressionFlags).
    
    Throws:
    
    IOException
- Method Detail
  - newInvertedList
```
public long newInvertedList()
                     throws IOException
```
    Description copied from interface: IndexWriter
    
    Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream.
    
    Returns:
    
    the position (in bits) of the underlying bit stream where the new inverted list starts.
    
    Throws:
    
    IOException
  - writeFrequency
```
public void writeFrequency(long frequency)
                    throws IOException
```
    Description copied from interface: IndexWriter
    
    Writes the frequency.
    
    Parameters:
    
    frequency - the (positive) number of document records that this inverted list will contain.
    
    Throws:
    
    IOException
  - newDocumentRecord
```
public OutputBitStream newDocumentRecord()
                                  throws IOException
```
    Description copied from interface: IndexWriter
    
    Starts a new document record.
    This method must be called exactly exactly f times, where f is the frequency specified with IndexWriter.writeFrequency(long).
    
    Returns:
    
    the output bit stream where the next document record data should be written, if necessary, or null, if IndexWriter.writeDocumentPointer(OutputBitStream, long) ignores its first argument.
    
    Throws:
    
    IOException
  - writeDocumentPointer
```
public void writeDocumentPointer(OutputBitStream out,
                                 long pointer)
                          throws IOException
```
    Description copied from interface: IndexWriter
    
    Writes a document pointer.
    This method must be called immediately after IndexWriter.newDocumentRecord().
    
    Parameters:
    
    out - the output bit stream where the pointer will be written.
    
    pointer - the document pointer.
    
    Throws:
    
    IOException
  - writePayload
```
public void writePayload(OutputBitStream out,
                         Payload payload)
                  throws IOException
```
    Description copied from interface: IndexWriter
    
    Writes the payload for the current document.
    This method must be called immediately after IndexWriter.writeDocumentPointer(OutputBitStream, long).
    
    Parameters:
    
    out - the output bit stream where the payload will be written.
    
    payload - the payload.
    
    Throws:
    
    IOException
  - close
```
public void close()
           throws IOException
```
    Description copied from interface: IndexWriter
    
    Closes this index writer, completing the index creation process and releasing all resources.
    
    Throws:
    
    IOException
  - writePositionCount
```
public void writePositionCount(OutputBitStream out,
                               int count)
                        throws IOException
```
    Description copied from interface: IndexWriter
    
    Writes the count of the occurrences of the current term in the current document to the given OutputBitStream.
    
    Parameters:
    
    out - the output stream where the occurrences should be written.
    
    count - the count.
    
    Throws:
    
    IOException
  - writeDocumentPositions
```
public void writeDocumentPositions(OutputBitStream out,
                                   int[] position,
                                   int offset,
                                   int count,
                                   int docSize)
                            throws IOException
```
    Description copied from interface: IndexWriter
    
    Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream.
    
    Parameters:
    
    out - the output stream where the occurrences should be written.
    
    position - the position vector (a sequence of strictly increasing natural numbers).
    
    offset - the first valid entry in position.
    
    count - the number of valid entries in position starting from offset.
    
    docSize - the size of the current document (only for Golomb and interpolative coding; you can safely pass -1 otherwise).
    
    Throws:
    
    IOException
  - writtenBits
```
public long writtenBits()
```
    Description copied from interface: IndexWriter
    
    Returns the overall number of bits written onto the underlying stream(s).
    
    Returns:
    
    the number of bits written, according to the variables keeping statistical records.
  - properties
```
public Properties properties()
```
    Description copied from interface: IndexWriter
    
    Returns properties of the index generated by this index writer.
    This method should only be called after IndexWriter.close(). It returns a new property object containing values for (whenever appropriate) Index.PropertyKeys.DOCUMENTS, Index.PropertyKeys.TERMS, Index.PropertyKeys.POSTINGS, Index.PropertyKeys.MAXCOUNT, Index.PropertyKeys.INDEXCLASS, Index.PropertyKeys.CODING, Index.PropertyKeys.PAYLOADCLASS, BitStreamIndex.PropertyKeys.SKIPQUANTUM, and BitStreamIndex.PropertyKeys.SKIPHEIGHT.
    
    Returns:
    
    properties a new set of properties for the just created index.

Class BitStreamIndexWriter

Field Summary

Fields inherited from class it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter

Constructor Summary

Method Summary

Methods inherited from class it.unimi.di.big.mg4j.index.AbstractBitStreamIndexWriter

Methods inherited from class java.lang.Object

Field Detail

BEFORE_INVERTED_LIST

BEFORE_FREQUENCY

BEFORE_DOCUMENT_RECORD

BEFORE_POINTER

BEFORE_PAYLOAD

BEFORE_COUNT

BEFORE_POSITIONS

FIRST_UNUSED_STATE

obs

state

frequency

writtenDocuments

currentDocument

lastDocument

lastInvertedListPos

maxCount

b

log2b

Constructor Detail

BitStreamIndexWriter

Method Detail

newInvertedList

writeFrequency

newDocumentRecord

writeDocumentPointer

writePayload

close

writePositionCount

writeDocumentPositions

writtenBits

properties