Building a compressed collection

During the indexing process, it is possible to build a compressed version of the collection used to build the index itself. There are several ways to do that (and you can program your own). The easiest is the -B option, which accepts a basename from which the names of the generated files are derived. By default, MG4J generates a SimpleCompressedDocumentCollection (but you can write your own kind of collection, provide a DocumentCollectionBuilder for it, and pass it to Scan). For instance,

java it.unimi.di.big.mg4j.tool.IndexBuilder \
    -B javacomp --downcase -S javadoc.collection javadoc

would generate, during the indexing process, a collection named javacomp.collection that you can pass to Query. The collection is actually a ConcatenatedDocumentCollection that exhibits a set of component instances of SimpleCompressedDocumentCollection, one per batch (this arrangement makes collection construction more scalable). This matters because, if you move javacomp.collection somewhere else, you will also need to move all files whose names start with javacomp@, which contain the component collections.
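
As a minimal sketch of the point above, the following shell function moves a compressed collection together with its component files. The name move_collection is a hypothetical helper, not part of MG4J, and the only assumption about the component files is the one stated in the text: their names start with the basename followed by @.

```shell
# move_collection BASENAME DEST
# Hypothetical helper (not part of MG4J): moves BASENAME.collection and
# every file whose name starts with BASENAME@ into the directory DEST.
move_collection() {
    base=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # The concatenated collection itself.
    mv -- "$base.collection" "$dest"/ || return 1
    # The component collections, one or more files per batch.
    for f in "$base"@*; do
        [ -e "$f" ] || continue   # the pattern matched nothing
        mv -- "$f" "$dest"/ || return 1
    done
}
```

For the example above, it would be invoked as move_collection javacomp followed by the target directory.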

Note that in this particular case there is no need to build another collection—the FileSetDocumentCollection used to build the index can be happily passed to Query. This is, however, not always the case, as MG4J builds indices out of sequences—objects that expose the data to be indexed in a sequential fashion. A typical example is the default, built-in InputStreamDocumentSequence. Assume you have a file documents.txt that contains one document per line. You can index it as follows:

java it.unimi.di.big.mg4j.tool.IndexBuilder \
    --downcase -p encoding=UTF-8 javadoc <documents.txt

Note the -p encoding=UTF-8 option, which sets the encoding of the text file. This command will create a single index with field name text (you can change the field name with another property—see the InputStreamDocumentSequence Javadoc). When you query the index, results will be displayed as numbers (positions in the original text), as Query has no access to a document collection. But if you specify the -B option, you can build on the fly a collection that can be used by Query to display snippets.
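
The following sketch puts the two pieces together: it creates a tiny sample documents.txt (the three lines are made up for illustration) and defines a function that indexes it while building a compressed collection, then starts Query on the result. The function name index_with_collection is hypothetical. The index name javadoc-text is an assumption that follows from the basename javadoc and the field name text, and the -c option of Query is assumed to specify the collection, as elsewhere in this tour. Nothing runs until you call the function with MG4J on the classpath.

```shell
# Sample input, one document per line (contents are illustrative only).
printf '%s\n' \
    'The quick brown fox' \
    'jumps over' \
    'the lazy dog' > documents.txt

# Hypothetical wrapper: call it once MG4J is on the classpath.
index_with_collection() {
    # Index standard input and, at the same time, build a compressed
    # collection with basename javacomp.
    java it.unimi.di.big.mg4j.tool.IndexBuilder \
        -B javacomp --downcase -p encoding=UTF-8 javadoc <documents.txt || return 1
    # Query the index; with the collection available, Query should be able
    # to display snippets instead of bare numbers.
    java it.unimi.di.big.mg4j.query.Query -c javacomp.collection javadoc-text
}
```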

The kind of collection that is created is customisable. The interface DocumentCollectionBuilder specifies what a collection builder must provide to be used at indexing time, and a builder can be specified with the --builder-class option. For instance, by specifying --builder-class ZipDocumentCollectionBuilder you will get back the behaviour of the obsolete -z option, that is, building a ZipDocumentCollection.
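
A command line using this option might look like the sketch below. It assumes that -B is still needed to supply the basename of the generated collection and that the short class name is accepted as written in the text; check the Scan and IndexBuilder Javadoc before relying on either. The function name build_zip_collection is hypothetical and only keeps the command from running until you call it.

```shell
# Hypothetical wrapper: call it once MG4J is on the classpath.
build_zip_collection() {
    # Same indexing run as the first example, but the collection is
    # built by ZipDocumentCollectionBuilder instead of the default builder.
    java it.unimi.di.big.mg4j.tool.IndexBuilder \
        -B javacomp --builder-class ZipDocumentCollectionBuilder \
        --downcase -S javadoc.collection javadoc
}
```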

There are many other collections you can play with: they are contained in the package it.unimi.di.big.mg4j.document. There are collections for reading from JDBC databases, comma-separated files, and so on (and, of course, you can write your own). Some collections let you play with other collections: ConcatenatedDocumentCollection exhibits a set of collections as a single collection that concatenates their content. SubDocumentCollection exhibits a contiguous subset of documents of a given collection as a new collection. Some of these classes have constructors that follow dsiutils' ObjectParser conventions, and thus can be constructed directly from the command line. One such class is SubDocumentCollection; the following command line uses the -o option to build such a collection on the fly:

java it.unimi.di.big.mg4j.tool.IndexBuilder \
    --downcase -oSubDocumentCollection\(javadoc.collection,0,10\) mini

The above command would just index the first ten documents of javadoc.collection (see the Javadoc of SubDocumentCollection for more details). You can then use the option -o to pass the same collection to Query, or build a compressed collection during the indexing phase.