Once you have the batches, you must combine
them into a single index (in the IndexBuilder
example, combination was handled for you). Note that MG4J allows
you to combine any set of indices, which means, for
instance, that if your collection is split into several pieces you can
index the pieces separately and combine them later. MG4J distinguishes
three types of index combination:
Concatenation takes a list of indices and builds a new index as follows: the first document of the second index is renumbered to the number of documents of the first index, and the others follow; the first document of the third index is renumbered to the sum of the numbers of documents of the first and second indices, and so on. The resulting index is identical to the index that would be produced by indexing the concatenation of the document sequences that produced each index. This is the kind of combination that is applied to batches, unless documents were renumbered.
Merging assumes that each index contains a separate subset of documents, with non-overlapping document numbers, and merges the lists accordingly. If a document appears in two indices, the merge operation is stopped. Note that no renumbering is performed. This is the kind of combination that is applied to batches when documents have been renumbered, and each batch contains potentially non-consecutive document numbers.
Pasting further relaxes the assumptions of merging: each index is assumed to index a (possibly empty) part of a document. For each term and document, the positions of the term in the document are gathered (and possibly suitably renumbered). If the inputs that have been indexed are text files with newline as separator, the resulting index is identical to the one that would be obtained by applying the UN*X command paste to the text files. This is the kind of combination that is applied to virtual documents, described in the next section.
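The difference between the three kinds of combination can be illustrated with a toy model. The following is only a sketch and does not use the MG4J API: the class CombineSketch, its methods and the partLength parameter are hypothetical names introduced for illustration. The inverted list of a single term is modelled as a sorted map from document numbers to lists of positions, and for pasting we assume that positions coming from a later index are shifted by the length of the parts of the same document found in the earlier indices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model (not MG4J code): one inverted list = document number -> positions.
public class CombineSketch {

    // Concatenation: documents of the i-th list are shifted by the total
    // number of documents of the indices preceding it.
    public static TreeMap<Integer, List<Integer>> concatenate(
            List<TreeMap<Integer, List<Integer>>> lists, int[] numberOfDocuments) {
        TreeMap<Integer, List<Integer>> result = new TreeMap<>();
        int shift = 0;
        for (int i = 0; i < lists.size(); i++) {
            for (Map.Entry<Integer, List<Integer>> e : lists.get(i).entrySet())
                result.put(e.getKey() + shift, new ArrayList<>(e.getValue()));
            shift += numberOfDocuments[i];
        }
        return result;
    }

    // Merging: no renumbering; a document appearing in two lists stops the merge.
    public static TreeMap<Integer, List<Integer>> merge(
            List<TreeMap<Integer, List<Integer>>> lists) {
        TreeMap<Integer, List<Integer>> result = new TreeMap<>();
        for (TreeMap<Integer, List<Integer>> list : lists)
            for (Map.Entry<Integer, List<Integer>> e : list.entrySet())
                if (result.putIfAbsent(e.getKey(), new ArrayList<>(e.getValue())) != null)
                    throw new IllegalStateException(
                            "Document " + e.getKey() + " appears in two indices");
        return result;
    }

    // Pasting: positions of the same document are gathered across lists.
    // partLength[i][d] is the (assumed known) length of the part of document d
    // seen by the i-th index; it is used to renumber positions.
    public static TreeMap<Integer, List<Integer>> paste(
            List<TreeMap<Integer, List<Integer>>> lists, int[][] partLength) {
        TreeMap<Integer, List<Integer>> result = new TreeMap<>();
        int[] offset = new int[partLength[0].length];
        for (int i = 0; i < lists.size(); i++) {
            for (Map.Entry<Integer, List<Integer>> e : lists.get(i).entrySet()) {
                int document = e.getKey();
                List<Integer> positions =
                        result.computeIfAbsent(document, k -> new ArrayList<>());
                for (int p : e.getValue()) positions.add(p + offset[document]);
            }
            for (int d = 0; d < offset.length; d++) offset[d] += partLength[i][d];
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Integer, List<Integer>> a =
                new TreeMap<>(Map.of(0, List.of(3), 1, List.of(0, 5)));
        TreeMap<Integer, List<Integer>> b = new TreeMap<>(Map.of(0, List.of(2)));
        TreeMap<Integer, List<Integer>> c = new TreeMap<>(Map.of(2, List.of(7)));

        // a has two documents, b has one: document 0 of b becomes document 2.
        System.out.println(concatenate(List.of(a, b), new int[] { 2, 1 }));
        // prints {0=[3], 1=[0, 5], 2=[2]}

        // a and c have disjoint document numbers, so they can be merged as is.
        System.out.println(merge(List.of(a, c)));
        // prints {0=[3], 1=[0, 5], 2=[7]}

        // a and b index two parts of the same documents; the parts seen by a
        // are 10 words long, so position 2 in b becomes position 12.
        System.out.println(paste(List.of(a, b), new int[][] { { 10, 10 }, { 4, 0 } }));
        // prints {0=[3, 12], 1=[0, 5]}
    }
}
```

Note that merging a and b in this example would fail, because document 0 appears in both lists, whereas pasting them is legitimate.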
Please consult the Javadoc of the package
it.unimi.di.mg4j.document
and of the above classes
for more information.
Incidentally, since you can choose the type of index to be
generated (quasi-succinct, interleaved, high-performance, the type of
skipping, the type of codes, etc.) and
Concatenate
also works with a single input
index, you can use it to convert an index of a certain type into an
index (containing the same information) of any other type.