Chapter 3. Performance

Table of Contents

Indexing Time
Setting up the index structure
Setup Time
Query Time

Indexing Time

MG4J provides great flexibility in index construction. For instance, you can choose among several different codes for the components of the index, and you can decide to drop parts you are not going to use (e.g., positions). All these choices have a significant impact on performance. Building a collection during the indexing phase will of course slow down the whole process.

In general, building large batches is a good idea if you have a lot of memory; you can set the tentative batch size using the -s option. However, if your collection contains a large number of terms (e.g., if it contains many hapax legomena, that is, terms that occur just once in the collection), a very large number of objects will be generated. This can cause a massive amount of garbage collection if you're relatively tight on memory. For this reason, there is a limit on the number of terms indexed at once (see the -M option of IndexBuilder and Scan).
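
To get a feeling for how term-heavy a collection is before choosing batch parameters, it can help to compare the number of occurrences with the number of distinct terms and of hapax legomena on a sample. The following is a minimal, self-contained sketch in plain Java; it does not use any MG4J API, and the class name HapaxCount and the sample tokens are purely illustrative assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of MG4J: measures how many distinct terms
// and hapax legomena appear in a sample of tokens.
public class HapaxCount {
    /** Counts how many times each term occurs in the token stream. */
    public static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) freq.merge(t, 1, Integer::sum);
        return freq;
    }

    /** Returns the number of terms that occur exactly once. */
    public static long countHapax(Map<String, Integer> freq) {
        return freq.values().stream().filter(c -> c == 1).count();
    }

    public static void main(String[] args) {
        // Illustrative sample; in practice this would be a slice of your collection.
        List<String> tokens = List.of("the", "cat", "saw", "the", "dog", "xq17z", "the", "cat");
        Map<String, Integer> freq = termFrequencies(tokens);
        System.out.println("occurrences: " + tokens.size());
        System.out.println("distinct terms: " + freq.size());
        System.out.println("hapax legomena: " + countHapax(freq));
    }
}
```

If a large share of the distinct terms on a representative sample turn out to be hapax legomena, the number of terms in a batch will grow almost as fast as the text itself, which is the situation in which the term limit mentioned above becomes relevant.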