Chapter 3. Performance

Table of Contents

Indexing Time
Setting up the index structure
Setup Time
Query Time

Indexing Time

MG4J provides great flexibility in index construction. For instance, you can drop the parts you are not going to use (e.g., positions), and for interleaved or high-performance indices you can choose among several different codes for the components of the index. All these choices have a significant impact on performance. Building a collection during the indexing phase will, of course, slow down the whole process.

In general, building large batches is a good idea if you have a lot of memory; you can set the tentative batch size using the -s option. However, if your collection contains a large number of terms (e.g., if it contains many hapax legomena, that is, terms that occur just once in the collection), a very large number of objects will be generated. This can cause a massive amount of garbage collection if you are relatively tight on memory. For this reason, there is a limit on the number of terms indexed at once (see the -M option of IndexBuilder and Scan).
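
The interplay between the two limits can be pictured with a small sketch. This is not MG4J code: the class BatchLimiter and its methods are hypothetical, and it assumes that the batch size is counted in documents and that a batch is closed as soon as either limit is reached. It only illustrates why a collection with many distinct terms produces smaller batches than the -s value suggests.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative only (hypothetical class, not part of MG4J).
 * Closes the current batch when either the document limit or the
 * distinct-term limit is reached.
 */
final class BatchLimiter {
    private final int maxDocsPerBatch;  // plays the role of the tentative batch size (-s)
    private final int maxTermsPerBatch; // plays the role of the term limit (-M)
    private final Set<String> terms = new HashSet<>();
    private int docsInBatch = 0;
    private int batchesFlushed = 0;

    BatchLimiter(int maxDocsPerBatch, int maxTermsPerBatch) {
        this.maxDocsPerBatch = maxDocsPerBatch;
        this.maxTermsPerBatch = maxTermsPerBatch;
    }

    /** Adds one document's terms; returns true if this document closed the batch. */
    boolean addDocument(String... docTerms) {
        for (String term : docTerms) {
            terms.add(term); // only distinct terms count towards the term limit
        }
        docsInBatch++;
        if (docsInBatch >= maxDocsPerBatch || terms.size() >= maxTermsPerBatch) {
            flush();
            return true;
        }
        return false;
    }

    /** A real indexer would write the batch to disk here; the sketch only resets its state. */
    private void flush() {
        terms.clear();
        docsInBatch = 0;
        batchesFlushed++;
    }

    int batchesFlushed() {
        return batchesFlushed;
    }
}
```

With new BatchLimiter(3, 4), two documents that together contain four distinct terms close the batch after the second document, whereas documents that keep repeating the same term close it only after the third.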

You can build indices using alternative file systems such as HDFS, and even write your own IOFactory implementation. To use a Hadoop file system, pass the --io-factory command-line option with a suitable object specification: for example, --io-factory 'it.unimi.di.mg4j.io.HadoopFileSystemIOFactory(hdfs://127.0.0.1:9000/)' will use a local HDFS file system at port 9000. More information can be found in the Javadoc documentation of HadoopFileSystemIOFactory.