Since MG4J, the default indices uses in MG4J are not of the
classic gap-compressed kind. A new kind of index, called
quasi-succinct, has replaced them almost entirely
(Scan
's batches are still interleaved indices
without skips). The following discussion is thus mainly of historical
interest.
In MG4J, you can choose the codes used for compression. As a general rule, nonparametric codes are quicker than parametric codes. Thus, Golomb codes for document pointers have an excellent compression rate, but δ codes have very good compression, too, and can be decoded more quickly. The point here is that for unary, γ, shifted γ and δ codes MG4J uses precomputed decoding tables that speed up decompression by an order of magnitude. The default choice (γ for frequencies, δ for pointers, γ for counts, and δ for positions) is very reasonable. For maximum speed you could even try to use γ everywhere (as it is quicker to decode if the precomputed decoding tables fail).
Another important trick is that of discarding what you don't need.
The default MG4J index type is called
high-performance: it contains all information
(pointers, counts, positions) but it is only partially
interleaved―positions are kept in a separate file. This satisfies most
needs, but If you are just using BM25 or TF/IDF scoring, there is no
need to store positions in your index: you can force a standard,
interleaved index, and store just what you need (e.g.,
-cPOSITIONS:NONE
will eliminate positions from the
index).
By default, indices contain a skipping
structure that makes skipping index entries faster. Skipping
structures introduce a slight overhead when scanning sequentially a list
(so you should disable them using the --no-skips
option
if you don't need them), but in general they make query processing
significantly faster. Skipping structure are based on two parameters:
the quantum q
and the
height h
. The quantum dictates
how often the skipping structure should index positions in the inverted
list. The height dictates how far the skipping structure is able to jump
in one shot (an index is able to skip in one shot as far as
q
2h). However, as
h
grows the memory required to build the skip
structures grow exponentially: the rule of thumb is setting
h
as large as possible without incurring in
out-of-memory errors.
Sizing the quantum is a more complex issue, as it depends on the structure of the inverted list. Dense inverted lists require smaller quanta. Since version 3.0, MG4J makes it possible to just specify the percentage of the index size occupied by the skipping structure, and let some machinery compute the correct quantum. You can also specify a quantum explicitly, but it will be the same for all lists, which is usually not a good thing.