Index generation and access.
This package contains the classes that handle index generation and
access. The interval iterators defined in
build upon the classes of this package to provide answer to queries using interval semantics,
but it is also possible to access an index directly.
You can easily build indices using the tools in
it.unimi.di.big.mg4j.tool. Once an index
has been built, it can be opened using an
Index object, which
gathers metadata that is necessary to access the index. You do not create an
with a constructor: rather, you use the static factory
Index.getInstance(CharSequence) (or one of its variants) to create an instance.
This is necessary so that different kind of indices can be treated transparently: for example, the factory
may return a
IndexCluster if the index is actually a cluster,
but you do not need to know that.
you can easily obtain either an
IndexReader, which allows to
scan sequentially or randomly the index. In turn from an
you can obtain a
returning the documents containing a certain term and the position of the term within the document.
But there is more: an
is a kind of
DocumentIterators can be combined in several ways
using the classes of the package
it.unimi.di.big.mg4j.search: for instance, you can combine
document iterators using AND/OR. Note that you can combine document iterators on different
indices, but of course the operation is meaningful only if the two indices contain different information
about the same document collection (e.g., title and main text).
More importantly, if the index is full text (the default) for each document containing the term you can get interval iterators that return intervals representing extents of text satisfying the query: for instance, in case of an AND of two terms, the intervals will contain both terms.
Structure of an inverted index
An inverted index is made by a sequence of inverted lists (one inverted list for each term). Inverted lists are made by document records: each document record contains information about the occurrences of the term within a certain document.
The number of documents in which a term appear, that is, the length of the inverted list of the term, is called the frequency.
Each document record contains the (document) pointer, which identifies the document within the collection; optionally, the count, that is, the number of times the term appears in the given document (it is also common to call the count “within-document frequency”, but we find this usage confusing); finally, optionally we might have the positions in which the term occurs. Some indices store also a user-defined payload.
Traditional indices are based on gap encoding techniques: sequence of increasing integers, such as document pointers,
are stored by writing the gaps between successive pointers using suitable instantaneous codes. MG4J supports this kind
of indices (see
but since version 5.0 the default index is a new kind of quasi-succinct index
that has major performance improvements with respect to traditional gap-encoded indices. Gap-based interleaved indices
SkipBitStreamIndexWriter) are used
during index construction, for payload indices, and should be considered when indices are not to be directly queried
(in which a quasi-succinct index is unbeatable) but rather
Interface Summary Interface Description IndexIteratorAn iterator over an inverted list. IndexReaderProvides access to an inverted index. IndexWriterAn interface for classes that generate indices. TermProcessorA term processor, implementing term/prefix transformation and possibly term/prefix filtering. VariableQuantumIndexWriterAn index writer supporting variable quanta.
Enum Summary Enum Description BitStreamIndex.PropertyKeysSymbolic names for additional properties of a
CompressionFlags.CodingA coding for an index component. CompressionFlags.ComponentA component of the index. Index.PropertyKeysSymbolic names for properties of a
Index.UriKeysKeys to be used (downcased) in specifiying additional parameters to a MG4J URI. QuasiSuccinctIndex.PropertyKeysSymbolic names for additional properties of a
Exception Summary Exception Description TooManyTermsExceptionThrown to indicate that a prefix query generated too many terms.