Class Combine

  • Direct Known Subclasses:
    Concatenate, Merge, Paste

    public abstract class Combine
    extends Object
    Combines several indices.

    Indices may be combined in several different ways. This abstract class contains the code common to classes such as Merge or Concatenate: essentially, command-line parsing, index opening, and term list fusion are taken care of here. The template method combine(int, long) must then write the combined inverted list into indexWriter. If, however, metadataOnly is true, indexWriter is null and combine(int, long) must just compute the total frequency, occurrency, and sum of maximum positions.
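    The division of labour described above can be illustrated with a minimal sketch. It is not the actual implementation: the class names, the parameter names, the run driver, the currentFrequency field and the use of a list of strings as a stand-in for indexWriter are all assumptions made for illustration. Only the shape of combine(int, long) and the role of metadataOnly come from the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for the template-method structure. */
abstract class CombineSketch {
    /** If true, nothing is written and only statistics are computed. */
    protected final boolean metadataOnly;
    /** Stand-in for the real index writer (assumption); null when metadataOnly is true. */
    protected final List<String> indexWriter;
    /** Frequency of the current term in each input index (assumption). */
    protected long[] currentFrequency;

    CombineSketch(boolean metadataOnly) {
        this.metadataOnly = metadataOnly;
        this.indexWriter = metadataOnly ? null : new ArrayList<>();
    }

    /** Template method: subclasses decide how the lists of one term are combined. */
    protected abstract long combine(int numUsedIndices, long occurrency);

    /** Common driver: iterates over the terms and delegates to combine(). */
    final long[] run(long[][] frequency, long[] occurrency) {
        long[] combined = new long[frequency.length];
        for (int t = 0; t < frequency.length; t++) {
            currentFrequency = frequency[t];
            combined[t] = combine(frequency[t].length, occurrency[t]);
        }
        return combined;
    }
}

/** Hypothetical subclass that simply adds up the input frequencies. */
class SummingCombineSketch extends CombineSketch {
    SummingCombineSketch(boolean metadataOnly) {
        super(metadataOnly);
    }

    @Override
    protected long combine(int numUsedIndices, long occurrency) {
        long total = 0;
        for (int i = 0; i < numUsedIndices; i++) total += currentFrequency[i];
        // Output is produced only when a writer is available.
        if (!metadataOnly) indexWriter.add("frequency=" + total + ", occurrency=" + occurrency);
        return total;
    }
}
```

    With metadataOnly set to true the sketch still returns the per-term totals, but never touches the (null) writer.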

    Note that by combining a single index into a new one you can recompress an index with different compression parameters (which includes the possibility of eliminating positions or counts). It is also possible to build just the metadata associated with an index (term list, frequencies, occurrencies).

    The subclasses of this class must implement combine(int, long) so that indices with different sets of features are combined keeping the largest set of features that was requested by the user and is available in every input index. For instance, combining an index with positions and an index with counts, but no positions, should generate an index with counts but no positions.
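    One way to read this rule is as a set intersection: the output keeps a feature only if the user requested it and every input index provides it. The sketch below illustrates that reading; the Feature enum and the outputFeatures method are hypothetical names introduced for the example, not part of this class.

```java
import java.util.EnumSet;
import java.util.List;

class FeatureSketch {
    /** Hypothetical feature names, for illustration only. */
    enum Feature { COUNTS, POSITIONS }

    /** Returns the requested features that are present in every input index. */
    static EnumSet<Feature> outputFeatures(EnumSet<Feature> requested, List<EnumSet<Feature>> inputs) {
        EnumSet<Feature> result = EnumSet.copyOf(requested);
        // A feature survives only if each input index has it.
        for (EnumSet<Feature> input : inputs) result.retainAll(input);
        return result;
    }
}
```

    Under these assumptions, requesting counts and positions for an input with both and an input with counts only leaves just counts in the output.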

    Warning: a combination requires opening three files per input index, plus a few more files for the output index. If the combination process is interrupted by an exception claiming that there are too many open files, check how to increase the number of files you can open (usually, for instance on UN*X, there is a global and a per-process limit, so be sure to set both).

    Read-once indices, readers, and distributed index combination

    If the indices and bitstream index readers involved in the combination are read-once (i.e., opening an index and reading its contents once sequentially causes each file composing the index to be read exactly once), then Combine implementations should be read-once as well (Concatenate, Merge and Paste are).
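    To make the read-once property concrete, here is a minimal sketch of fusing several sorted term lists while consuming each input strictly forward, exactly once. It is an illustration under stated assumptions rather than the code used by this class: the class ReadOnceTermMerge and its methods are hypothetical, and each term list is modelled as a plain Iterator over sorted strings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class ReadOnceTermMerge {
    /** A forward-only term source together with the term read most recently. */
    private static final class Source implements Comparable<Source> {
        final Iterator<String> terms;
        String current;

        Source(Iterator<String> terms) {
            this.terms = terms;
        }

        /** Reads the next term; returns false when the source is exhausted. */
        boolean advance() {
            if (!terms.hasNext()) return false;
            current = terms.next();
            return true;
        }

        @Override
        public int compareTo(Source other) {
            return current.compareTo(other.current);
        }
    }

    /** Merges sorted term lists into one sorted list without duplicates. */
    static List<String> merge(List<Iterator<String>> inputs) {
        PriorityQueue<Source> queue = new PriorityQueue<>();
        for (Iterator<String> input : inputs) {
            Source source = new Source(input);
            if (source.advance()) queue.add(source);
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Source source = queue.poll();
            // Emit a term only the first time it is seen.
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(source.current)) {
                merged.add(source.current);
            }
            // Never rewind: each source is only ever advanced.
            if (source.advance()) queue.add(source);
        }
        return merged;
    }
}
```

    Because no source is ever rewound or reopened, the same logic works whether the terms come from a file or from a stream that cannot be read twice.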

    This means, in particular, that index combination can be performed from pipes, which in turn can be filled, for instance, with data coming from the network. In other words, although this class nominally operates on a number of indices stored on a local disk, those indices can be replaced with suitable pipes filled with remote data without affecting the combination process. For instance, the following bash code creates three sets of pipes for an interleaved index:

     for i in 0 1 2; do
       for e in frequencies occurrencies index offsets posnumbits sumsmaxpos properties sizes terms; do
         mkfifo pipe$i.$e
       done
     done

    Each pipe should then be filled with suitable data, for instance obtained from the net (assuming you have indices index0, index1 and index2 on the remote system):

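     # REMOTE_HOST is a placeholder for the name of the remote system; set it beforehand.
     for i in 0 1 2; do
       for e in frequencies occurrencies index offsets posnumbits sumsmaxpos properties sizes terms; do
         (ssh -x "$REMOTE_HOST" cat index$i.$e >pipe$i.$e &)
       done
     done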

    Now all pipes will be filled with data from the corresponding remote files, and combining the indices pipe0, pipe1 and pipe2 will give the same result as combining index0, index1 and index2 on the remote system.

    Author: Sebastiano Vigna