Combining batches

Once you have the batches, you must combine them into a single index (in the IndexBuilder example, combination was handled for you). Note that MG4J allows you to combine any set of indices, which means, for instance, that if your collection is split into several pieces you can index the pieces separately and combine them later. MG4J distinguishes three types of index combination:

  1. Concatenation takes a list of indices and builds a new index as follows: the first document of the second index is renumbered to the number of documents of the first index, and the others follow; the first document of the third index is renumbered to the sum of the numbers of documents of the first and second indices, and so on. The resulting index is identical to the index that would be produced by indexing the concatenation of the document sequences that produced each index. This is the kind of combination that is applied to batches, unless documents were renumbered.

  2. Merging assumes that each index contains a separate subset of documents, with non-overlapping document numbers, and merges the posting lists accordingly. If a document appears in two indices, the merge operation stops. Note that no renumbering is performed. This is the kind of combination that is applied to batches when documents have been renumbered, so that each batch may contain non-consecutive document numbers.

  3. Pasting further relaxes the assumptions of merging: each index is assumed to index a (possibly empty) part of a document. For each term and document, the positions of the term in the document are gathered (and possibly suitably renumbered). If the inputs that have been indexed are text files with newline as separator, the resulting index is identical to the one that would be obtained by indexing the output of the UN*X command paste applied to the text files. This is the kind of combination that is applied to virtual documents, described in the next section.
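
To make the difference between the three kinds of combination concrete, here is a minimal sketch on a toy in-memory index. It does not use the MG4J API: the names CombineSketch, ToyIndex, numberOfDocuments and partSizes are hypothetical and exist only for this illustration. It also assumes that pasted positions are shifted by the length of the parts that precede them, which is one plausible reading of "suitably renumbered".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the three kinds of index combination; this is NOT the MG4J API. */
final class CombineSketch {

    /** A toy index: term -> (document number -> positions of the term in that document). */
    static final class ToyIndex extends TreeMap<String, TreeMap<Integer, List<Integer>>> {
        /** Appends positions to the posting of a term in a document, creating it if needed. */
        ToyIndex add(String term, int document, Integer... positions) {
            computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(document, d -> new ArrayList<>())
                .addAll(Arrays.asList(positions));
            return this;
        }
    }

    /** Concatenation: documents of the i-th index are shifted by the sizes of the previous indices. */
    static ToyIndex concatenate(List<ToyIndex> indices, int[] numberOfDocuments) {
        ToyIndex result = new ToyIndex();
        int offset = 0;
        for (int i = 0; i < indices.size(); i++) {
            for (Map.Entry<String, TreeMap<Integer, List<Integer>>> term : indices.get(i).entrySet())
                for (Map.Entry<Integer, List<Integer>> posting : term.getValue().entrySet())
                    result.add(term.getKey(), posting.getKey() + offset,
                            posting.getValue().toArray(new Integer[0]));
            offset += numberOfDocuments[i];
        }
        return result;
    }

    /** Merging: no renumbering; stops if the same document shows up twice in a posting list. */
    static ToyIndex merge(List<ToyIndex> indices) {
        ToyIndex result = new ToyIndex();
        for (ToyIndex index : indices)
            for (Map.Entry<String, TreeMap<Integer, List<Integer>>> term : index.entrySet())
                for (Map.Entry<Integer, List<Integer>> posting : term.getValue().entrySet()) {
                    TreeMap<Integer, List<Integer>> seen = result.get(term.getKey());
                    if (seen != null && seen.containsKey(posting.getKey()))
                        throw new IllegalStateException(
                                "Document " + posting.getKey() + " appears in two indices");
                    result.add(term.getKey(), posting.getKey(),
                            posting.getValue().toArray(new Integer[0]));
                }
        return result;
    }

    /**
     * Pasting: positions are gathered per term and document. partSizes[i][d] is the
     * (assumed) length in words of the part of document d indexed by the i-th index.
     */
    static ToyIndex paste(List<ToyIndex> indices, int[][] partSizes) {
        ToyIndex result = new ToyIndex();
        int[] shift = new int[partSizes[0].length]; // words already seen in each document
        for (int i = 0; i < indices.size(); i++) {
            for (Map.Entry<String, TreeMap<Integer, List<Integer>>> term : indices.get(i).entrySet())
                for (Map.Entry<Integer, List<Integer>> posting : term.getValue().entrySet()) {
                    int d = posting.getKey();
                    Integer[] shifted = posting.getValue().stream()
                            .map(p -> p + shift[d]).toArray(Integer[]::new);
                    result.add(term.getKey(), d, shifted);
                }
            for (int d = 0; d < shift.length; d++) shift[d] += partSizes[i][d];
        }
        return result;
    }

    public static void main(String[] args) {
        ToyIndex first = new ToyIndex().add("a", 0, 0).add("a", 1, 1);  // 2 documents
        ToyIndex second = new ToyIndex().add("a", 0, 2).add("b", 2, 0); // 3 documents
        System.out.println(concatenate(Arrays.asList(first, second), new int[] { 2, 3 }));
        // prints {a={0=[0], 1=[1], 2=[2]}, b={4=[0]}}

        ToyIndex head = new ToyIndex().add("x", 0, 0).add("y", 0, 1); // document 0 starts "x y"
        ToyIndex tail = new ToyIndex().add("y", 0, 0).add("z", 0, 1); // and continues "y z"
        System.out.println(paste(Arrays.asList(head, tail), new int[][] { { 2 }, { 2 } }));
        // prints {x={0=[0]}, y={0=[1, 2]}, z={0=[3]}}
    }
}
```

In this sketch, calling merge on first and second throws, because document 0 occurs in the posting list of "a" in both indices, whereas concatenate accepts the same input because it renumbers the documents of second to 2, 3 and 4.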

Please consult the Javadoc of the package it.unimi.di.big.mg4j.document and of the above classes for more information.