it.unimi.di.big.mg4j.index (MG4J (big) 5.4.4)

Index generation and access.

This package contains the classes that handle index generation and access. The interval iterators defined in it.unimi.di.big.mg4j.search build upon the classes of this package to provide answer to queries using interval semantics, but it is also possible to access an index directly.

You can easily build indices using the tools in it.unimi.di.big.mg4j.tool. Once an index has been built, it can be opened using an Index object, which gathers metadata that is necessary to access the index. You do not create an Index with a constructor: rather, you use the static factory Index.getInstance(CharSequence) (or one of its variants) to create an instance. This is necessary so that different kind of indices can be treated transparently: for example, the factory may return a IndexCluster if the index is actually a cluster, but you do not need to know that.

From an Index, you can easily obtain either an IndexReader, which allows to scan sequentially or randomly the index. In turn from an IndexReader you can obtain a IndexIterator returning the documents containing a certain term and the position of the term within the document.

But there is more: an IndexIterator is a kind of DocumentIterator, and DocumentIterators can be combined in several ways using the classes of the package it.unimi.di.big.mg4j.search: for instance, you can combine document iterators using AND/OR. Note that you can combine document iterators on different indices, but of course the operation is meaningful only if the two indices contain different information about the same document collection (e.g., title and main text).

More importantly, if the index is full text (the default) for each document containing the term you can get interval iterators that return intervals representing extents of text satisfying the query: for instance, in case of an AND of two terms, the intervals will contain both terms.

Structure of an inverted index

An inverted index is made by a sequence of inverted lists (one inverted list for each term). Inverted lists are made by document records: each document record contains information about the occurrences of the term within a certain document.

The number of documents in which a term appear, that is, the length of the inverted list of the term, is called the frequency.

Each document record contains the (document) pointer, which identifies the document within the collection; optionally, the count, that is, the number of times the term appears in the given document (it is also common to call the count “within-document frequency”, but we find this usage confusing); finally, optionally we might have the positions in which the term occurs. Some indices store also a user-defined payload.

Traditional indices are based on gap encoding techniques: sequence of increasing integers, such as document pointers, are stored by writing the gaps between successive pointers using suitable instantaneous codes. MG4J supports this kind of indices (see BitStreamIndexWriter and BitStreamHPIndexWriter), but since version 5.0 the default index is a new kind of quasi-succinct index that has major performance improvements with respect to traditional gap-encoded indices. Gap-based interleaved indices (see BitStreamIndexWriter and SkipBitStreamIndexWriter) are used during index construction, for payload indices, and should be considered when indices are not to be directly queried (in which a quasi-succinct index is unbeatable) but rather further manipulated.

Interface Summary
Interface	Description
IndexIterator	An iterator over an inverted list.
IndexReader	Provides access to an inverted index.
IndexWriter	An interface for classes that generate indices.
TermProcessor	A term processor, implementing term/prefix transformation and possibly term/prefix filtering.
VariableQuantumIndexWriter	An index writer supporting variable quanta.

Class Summary
Class	Description
AbstractBitStreamIndexWriter	An abstract bitstream-based index writer, providing common variables and a basic `AbstractBitStreamIndexWriter.printStats(PrintStream)` implementation.
AbstractIndexIterator	A very basic abstract implementation of an index iterator, providing an obvious implementation of `IndexIterator.term()`, `IndexIterator.id()`, `DocumentIterator.weight()` and of the visiting methods.
AbstractIndexReader	An abstract, safely closeable implementation of an index reader.
BitStreamHPIndex	A high-performance bitstream-based index.
BitStreamHPIndexReader	A bitstream-based index reader for high-performance indices.
BitStreamHPIndexReader.BitStreamHPIndexReaderIndexIterator
BitStreamHPIndexWriter	Writes a bitstream-based high-performance index.
BitStreamHPIndexWriter.TowerData	A structure maintaining statistical data about tower construction.
BitStreamIndex	A bitstream-based index.
BitStreamIndexReader	A bitstream-based index reader.
BitStreamIndexReader.BitStreamIndexReaderIndexIterator
BitStreamIndexWriter	Writes a bitstream-based interleaved index.
CachingOutputBitStream	A special output bit stream with an additional method `CachingOutputBitStream.buffer()` that returns the internal buffer if the internal buffer contains all that has been written since the last call to `position(0)`.
CompressionFlags	A container for constants and enums related to index compression.
DiskBasedIndex	A static container providing facilities to load an index based on data stored on disk.
DowncaseTermProcessor	A term processor downcasing all characters.
FileHPIndex	A file-based high-performance index.
FileIndex	A file-based index.
Index	An abstract representation of an index.
IndexIntervalIterator	An interval iterator returning the positions of the current document as singleton intervals.
IndexIterators	A class providing static methods and objects that do useful things with index iterators.
IndexIterators.PositionsIterator
InMemoryHPIndex	A `BitStreamHPIndex` index loaded in memory.
InMemoryIndex	A local bitstream index loaded in memory.
MemoryMappedHPIndex	A memory-mapped `BitStreamHPIndex`.
MemoryMappedIndex	A local memory-mapped bitstream index.
MultiTermIndexIterator	A virtual index iterator that merges several component index iterators.
NullTermProcessor	A term processor that accepts all terms and does not do any processing.
QuasiSuccinctIndex	A quasi-succinct index.
QuasiSuccinctIndexReader	An index reader for quasi-succinct indices.
QuasiSuccinctIndexReader.AbstractQuasiSuccinctIndexIterator
QuasiSuccinctIndexReader.CountReader
QuasiSuccinctIndexReader.EliasFanoIndexIterator
QuasiSuccinctIndexReader.EliasFanoPointerReader
QuasiSuccinctIndexReader.LongWordBitReader
QuasiSuccinctIndexReader.PointerReader
QuasiSuccinctIndexReader.PositionReader
QuasiSuccinctIndexReader.RankedIndexIterator
QuasiSuccinctIndexReader.RankedPointerReader
QuasiSuccinctIndexWriter	An index writer for quasi-succinct indices.
QuasiSuccinctIndexWriter.Accumulator
QuasiSuccinctIndexWriter.LongWordCache
QuasiSuccinctIndexWriter.LongWordOutputBitStream
SkipBitStreamIndexWriter	Writes a bitstream-based interleaved index with skips.
SkipBitStreamIndexWriter.TowerData	A structure maintaining statistical data about tower construction.

Enum Summary
Enum	Description
BitStreamIndex.PropertyKeys	Symbolic names for additional properties of a `BitStreamIndex`.
CompressionFlags.Coding	A coding for an index component.
CompressionFlags.Component	A component of the index.
Index.PropertyKeys	Symbolic names for properties of a `Index`.
Index.UriKeys	Keys to be used (downcased) in specifiying additional parameters to a MG4J URI.
QuasiSuccinctIndex.PropertyKeys	Symbolic names for additional properties of a `QuasiSuccinctIndex`.

Exception Summary
Exception Description

TooManyTermsException
Thrown to indicate that a prefix query generated too many terms.

Package it.unimi.di.big.mg4j.index

Structure of an inverted index