MG4J provides great flexibility in index construction. For instance, you can choose among several different codes for the components of the index, and you can also decide to drop parts you are not going to use (e.g., positions). All these choices have a significant impact on performance. Building a collection during the indexing phase will of course slow down the whole process.
In general, building large batches is a good idea if you have a lot of memory; you can set the tentative batch size using the -s option. However, if your collection contains a large number of terms (e.g., if it contains many hapax legomena, that is, terms that occur just once in the collection), a very large number of objects will be generated. This can cause a massive amount of garbage collection if you're relatively tight on memory. For this reason, there is a limit on the number of terms indexed at once (see the -M option of IndexBuilder and Scan).
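The interplay between the two limits can be illustrated with a minimal sketch. This is not MG4J code: the class BatchPolicySketch, the method batchSizes and its parameters are hypothetical names invented for this illustration, and the sketch simply assumes that a batch is closed as soon as either the document count or the number of distinct terms reaches its limit.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of a two-limit batching policy; not part of MG4J. */
final class BatchPolicySketch {

    /**
     * Splits a sequence of documents (each given as an array of terms) into batches.
     * A batch is closed when it holds maxDocsPerBatch documents (the role played by
     * a batch size) or when it has accumulated maxTermsPerBatch distinct terms
     * (the role played by a term limit), whichever happens first.
     *
     * @return the number of documents in each batch, in order.
     */
    static List<Integer> batchSizes(List<String[]> docs, int maxDocsPerBatch, int maxTermsPerBatch) {
        List<Integer> sizes = new ArrayList<>();
        Set<String> termsInBatch = new HashSet<>(); // distinct terms seen in the current batch
        int docsInBatch = 0;

        for (String[] doc : docs) {
            for (String term : doc) {
                termsInBatch.add(term);
            }
            docsInBatch++;

            // Close the batch as soon as either limit is reached.
            if (docsInBatch >= maxDocsPerBatch || termsInBatch.size() >= maxTermsPerBatch) {
                sizes.add(docsInBatch);
                termsInBatch.clear();
                docsInBatch = 0;
            }
        }

        // Whatever is left over forms a final, smaller batch.
        if (docsInBatch > 0) {
            sizes.add(docsInBatch);
        }
        return sizes;
    }
}
```

In this sketch, a collection with many distinct terms yields more, smaller batches even when the document limit is generous, which is the effect the term limit is meant to have when memory is tight.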