MG4J provides great flexibility in index construction. For instance, you can drop the parts you are not going to use (e.g., positions), and for interleaved or high-performance indices you can choose among several different codes for the components of the index. All these choices have a significant impact on performance. Building a collection during the indexing phase will of course slow down the whole process.
In general, building large batches is a good idea if you have a lot of
memory; you can set the tentative batch size using the -s option. However,
if your collection contains a large number of terms (e.g., if it contains
many hapax legomena, that is, terms that occur just once in the collection),
a very large number of objects will be generated. This can cause a massive
amount of garbage collection if you are relatively tight on memory. For
this reason, there is a limit on the number of terms indexed at once
(see the -M option of IndexBuilder and Scan).
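The interplay between the two limits can be illustrated with a toy example. The sketch below is not MG4J code: BatchSketch and all its members are hypothetical names, and it assumes that the batch size is counted in documents. It only shows why a cap on distinct terms makes batches close early when a collection is rich in rare terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only; this is NOT how MG4J's Scan is implemented. */
final class BatchSketch {
    private final int batchSize; // tentative documents per batch (plays the role of -s)
    private final int maxTerms;  // cap on distinct terms held in memory (plays the role of -M)
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int docsInBatch = 0;
    private int flushedBatches = 0;

    BatchSketch(int batchSize, int maxTerms) {
        this.batchSize = batchSize;
        this.maxTerms = maxTerms;
    }

    /** Adds one document and closes the batch if either limit has been reached. */
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) list.add(docId);
        }
        docsInBatch++;
        if (docsInBatch >= batchSize || postings.size() >= maxTerms) flush();
    }

    /** Closes the current batch; a real indexer would write it to disk here. */
    void flush() {
        if (docsInBatch == 0) return;
        postings.clear(); // frees the per-term objects of this batch
        docsInBatch = 0;
        flushedBatches++;
    }

    int flushedBatches() {
        return flushedBatches;
    }

    public static void main(String[] args) {
        // Large batch size, small term cap: the term cap closes the batch first.
        BatchSketch sketch = new BatchSketch(100, 3);
        sketch.addDocument(0, "alpha beta gamma delta");
        sketch.addDocument(1, "alpha");
        sketch.flush();
        System.out.println("Batches written: " + sketch.flushedBatches());
    }
}
```

With a generous document limit and a small term cap, the first document alone already fills a batch, which is the situation described above for collections full of hapax legomena.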
You can build indices using alternative file systems such as HDFS, and even
write your own IOFactory implementation. To use a Hadoop file system, just
use the --io-factory command line option and specify a suitable object: for example,
--io-factory 'it.unimi.di.mg4j.io.HadoopFileSystemIOFactory(hdfs://127.0.0.1:9000/)'
will use a local HDFS file system at port 9000. More information can be
found in the Javadoc documentation of HadoopFileSystemIOFactory.
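The value passed to --io-factory has the shape of a class name followed by constructor arguments in parentheses; the single quotes only keep the shell from interpreting the parentheses. The sketch below shows how such a string decomposes. ObjectSpecSketch is a hypothetical helper written for this illustration, it assumes a flat, comma-separated argument list, and it is not the parser that MG4J itself uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: splits a specification such as "pkg.Class(arg1,arg2)". */
final class ObjectSpecSketch {
    final String className;
    final List<String> arguments;

    private ObjectSpecSketch(String className, List<String> arguments) {
        this.className = className;
        this.arguments = arguments;
    }

    static ObjectSpecSketch parse(String spec) {
        String s = spec.trim();
        int open = s.indexOf('(');
        // A bare class name means a constructor without arguments.
        if (open < 0) return new ObjectSpecSketch(s, new ArrayList<>());
        if (!s.endsWith(")")) {
            throw new IllegalArgumentException("Unbalanced parentheses: " + spec);
        }
        String name = s.substring(0, open).trim();
        String inner = s.substring(open + 1, s.length() - 1);
        List<String> args = new ArrayList<>();
        if (!inner.trim().isEmpty()) {
            for (String a : inner.split(",")) args.add(a.trim());
        }
        return new ObjectSpecSketch(name, args);
    }

    public static void main(String[] args) {
        ObjectSpecSketch spec = parse(
            "it.unimi.di.mg4j.io.HadoopFileSystemIOFactory(hdfs://127.0.0.1:9000/)");
        System.out.println("class:     " + spec.className);
        System.out.println("arguments: " + spec.arguments);
    }
}
```

In the example from the text, the only argument is the URI of the file system, hdfs://127.0.0.1:9000/.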