During the indexing process, it is possible to build a compressed
version of the collection used to build the index itself. There are
several ways to do that (and you can program your own). The easy way is
to use the -B option, which accepts a basename from which various files
will be generated. By default, MG4J will generate a
SimpleCompressedDocumentCollection (but you can write your own kind of
collection, provide a DocumentCollectionBuilder for it, and just pass it
to Scan). For instance,
java it.unimi.di.big.mg4j.tool.IndexBuilder \
    -B javacomp --downcase -S javadoc.collection javadoc
would generate, during the indexing process, a collection named
javacomp.collection that you can pass to Query. The collection is
actually a ConcatenatedDocumentCollection that exhibits a set of
component instances of SimpleCompressedDocumentCollection, one per batch
(this arrangement makes collection construction more scalable). This is
an important fact to know, because if you move javacomp.collection
somewhere else you will also need to move all files whose names start
with javacomp@, which contain the component collections.
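As a minimal sketch of the point above, the following shell function moves a collection together with its component files. The function name move_collection and its two arguments are hypothetical; the only facts taken from the text are the .collection file name and the rule that component files start with the basename followed by @.

```shell
# move_collection BASENAME DESTDIR   (hypothetical helper, not part of MG4J)
# Moves BASENAME.collection and every file whose name starts with
# BASENAME@ (the component collections) into DESTDIR.
move_collection() {
    mkdir -p "$2" &&
    mv -- "$1.collection" "$1"@* "$2"/
}
```

For the example above, this would be invoked as: move_collection javacomp /some/other/directory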
Note that in this particular case there is no need to build another
collection: the FileSetDocumentCollection used to build the index can be
happily passed to Query. This is, however, not always the case, as MG4J
builds indices out of sequences, that is, objects that expose the data
to be indexed in a sequential fashion. A typical example is the default,
built-in InputStreamDocumentSequence. Assume you have a file
documents.txt that contains one document per line. You can index it as
follows:
java it.unimi.di.big.mg4j.tool.IndexBuilder \
    --downcase -p encoding=UTF-8 javadoc <documents.txt
Note the -p encoding=UTF-8 option, which sets the encoding of the text
file. This command will create a single index with field name text (you
can change the field name with another property; see the
InputStreamDocumentSequence Javadoc). When you query the index, results
will be displayed as numbers (positions in the original text), as Query
has no access to a document collection. But if you specify the -B
option, you can build on the fly a collection that can be used by Query
to display snippets.
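When no collection is available, a result number can still be mapped back to its document by hand, since each document is one line of documents.txt. The sketch below assumes that the numbers are zero-based document indices; the text only says they are positions in the original text, so treat the offset as an assumption. The function name show_document is hypothetical.

```shell
# show_document N FILE   (hypothetical helper, not part of MG4J)
# Prints document number N from a one-document-per-line file.
# Assumption: N is zero-based, so document N is line N+1 of FILE.
show_document() {
    sed -n "$(( $1 + 1 ))p" "$2"
}
```

For instance, show_document 3 documents.txt would print the fourth line of the file.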
The kind of collection that is created is customisable. The interface
DocumentCollectionBuilder specifies what a collection builder should
provide to be used at indexing time, and a builder can be specified with
the --builder-class option. For instance, by specifying
--builder-class ZipDocumentCollectionBuilder you will get back the
behaviour of the obsolete -z option, that is, building a
ZipDocumentCollection.
There are many other collections you can play with; they are contained
in the package it.unimi.di.big.mg4j.document. There are collections for
reading from JDBC databases, comma-separated files, and so on (and, of
course, you can write your own). Some collections let you play with
other collections: ConcatenatedDocumentCollection exhibits a set of
collections as a single collection that concatenates their content;
SubDocumentCollection exhibits a contiguous subset of documents of a
given collection as a new collection. Some of these classes have
constructors that follow dsiutils' ObjectParser conventions, and thus
can be constructed directly from the command line. One such class is
SubDocumentCollection; the following command line uses the -o option to
build such a collection on the fly:
java it.unimi.di.big.mg4j.tool.IndexBuilder \
    --downcase -oSubDocumentCollection\(javadoc.collection,0,10\) mini
The above command would just index the first ten documents of
javadoc.collection (see the Javadoc of SubDocumentCollection for more
details). You can then use the -o option to pass the same collection to
Query, or build a compressed collection during the indexing phase.
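The backslashes before the parentheses in the -o argument are there for the shell, not for MG4J: unquoted parentheses have a special meaning in most shells. The sketch below shows that single quotes give the same result; the variable names are only for illustration, and printf merely displays the string the shell would hand to the program.

```shell
# Two spellings of the same argument: backslash-escaped and single-quoted.
escaped=-oSubDocumentCollection\(javadoc.collection,0,10\)
quoted='-oSubDocumentCollection(javadoc.collection,0,10)'

# Both lines printed here are identical.
printf '%s\n' "$escaped" "$quoted"
```

With the quoted spelling, the option in the indexing command above could equally be written as '-oSubDocumentCollection(javadoc.collection,0,10)'.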