Setting up the index structure

Since MG4J, the default indices uses in MG4J are not of the classic gap-compressed kind. A new kind of index, called quasi-succinct, has replaced them almost entirely (Scan's batches are still interleaved indices without skips). The following discussion is thus mainly of historical interest.

In MG4J, you can choose the codes used for compression. As a general rule, nonparametric codes are quicker than parametric codes. Thus, Golomb codes for document pointers have an excellent compression rate, but δ codes have very good compression, too, and can be decoded more quickly. The point here is that for unary, γ, shifted γ and δ codes MG4J uses precomputed decoding tables that speed up decompression by an order of magnitude. The default choice (γ for frequencies, δ for pointers, γ for counts, and δ for positions) is very reasonable. For maximum speed you could even try to use γ everywhere (as it is quicker to decode if the precomputed decoding tables fail).

Another important trick is that of discarding what you don't need. The default MG4J index type is called high-performance: it contains all information (pointers, counts, positions) but it is only partially interleaved―positions are kept in a separate file. This satisfies most needs, but If you are just using BM25 or TF/IDF scoring, there is no need to store positions in your index: you can force a standard, interleaved index, and store just what you need (e.g., -cPOSITIONS:NONE will eliminate positions from the index).

By default, indices contain a skipping structure that makes skipping index entries faster. Skipping structures introduce a slight overhead when scanning sequentially a list (so you should disable them using the --no-skips option if you don't need them), but in general they make query processing significantly faster. Skipping structure are based on two parameters: the quantum q and the height h. The quantum dictates how often the skipping structure should index positions in the inverted list. The height dictates how far the skipping structure is able to jump in one shot (an index is able to skip in one shot as far as q2h). However, as h grows the memory required to build the skip structures grow exponentially: the rule of thumb is setting h as large as possible without incurring in out-of-memory errors.

Sizing the quantum is a more complex issue, as it depends on the structure of the inverted list. Dense inverted lists require smaller quanta. Since version 3.0, MG4J makes it possible to just specify the percentage of the index size occupied by the skipping structure, and let some machinery compute the correct quantum. You can also specify a quantum explicitly, but it will be the same for all lists, which is usually not a good thing.