Class Paste

  • public final class Paste
    extends Combine
    Pastes several indices.

    Pasting is a very slow way of combining indices: we assume that not only documents, but also document occurrences might be scattered throughout several indices. When a document appears in several indices, its occurrences in a given index are combined. We have two possibilities:

    • standard pasting: position lists are simply concatenated, and it is the responsibility of the caller to guarantee that positions have been numbered in increasing order across the indices; the sizes of the last input index are the sizes of the pasted index;
    • incremental pasting: position lists are concatenated, but each list is renumbered by adding to all of its positions the sum of the sizes of the current document in all indices that precede the current one (this kind of pasting was the only one available before version 3.0).
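
    The difference between the two modes can be illustrated on the position lists of a single term in a single document. The following is a minimal sketch, not the code used by this class: the class name PastePositionsSketch, its methods, and the array-based representation of positions and sizes are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration; none of these names belong to the MG4J API.
final class PastePositionsSketch {

    // Standard pasting: the per-index position lists are concatenated as they are.
    // The caller must have numbered positions increasingly across indices.
    static int[] standardPaste(int[][] positionsPerIndex) {
        return Arrays.stream(positionsPerIndex).flatMapToInt(Arrays::stream).toArray();
    }

    // Incremental pasting: positions coming from index k are shifted by the sum
    // of the sizes of the current document in indices 0..k-1.
    static int[] incrementalPaste(int[][] positionsPerIndex, int[] sizePerIndex) {
        int total = 0;
        for (int[] positions : positionsPerIndex) total += positions.length;
        int[] result = new int[total];
        int offset = 0;
        int out = 0;
        for (int k = 0; k < positionsPerIndex.length; k++) {
            for (int p : positionsPerIndex[k]) result[out++] = p + offset;
            offset += sizePerIndex[k];
        }
        return result;
    }

    public static void main(String[] args) {
        // One term, one document, three indices; every index numbers positions from zero.
        int[][] positions = { { 0, 2 }, { 1 }, { 0, 3 } };
        int[] sizes = { 3, 2, 4 }; // size of the document in each index
        System.out.println(Arrays.toString(incrementalPaste(positions, sizes)));
    }
}
```

    In this toy setting the input lists all start from zero, so only the incremental variant yields an increasing position list; the standard variant returns the plain concatenation and relies on the caller's numbering.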

    Standard pasting is used, for instance, to paste the batches of a virtual field generated by Scan; the latter takes care of numbering positions correctly. If, however, you index parts of the same document collection on different machines using the same VirtualDocumentResolver, then the resulting indices for virtual fields will have all positions starting from zero, and they will need an incremental pasting to be combined correctly.

    Conceptually, this operation is equivalent to splitting a collection vertically: each document is divided into a fixed number n of consecutive segments (possibly of length 0), and a set of n indices is created using the k-th segment of all documents. Pasting the resulting indices will produce an index that is identical to the index generated by the original collection. The behaviour is analogous to that of the UN*X paste command if documents are single-line lists of words.
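
    The vertical-splitting view can be checked on a toy collection. The sketch below is an assumption-laden illustration, not this class's implementation: it models an index as a nested map from terms to documents to positions, cuts each document into two segments (one of them possibly empty), indexes each segment from position zero, and pastes the results incrementally. The class VerticalSplitSketch and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; none of these names belong to the MG4J API.
final class VerticalSplitSketch {

    // Toy positional index: term -> (document -> increasing positions).
    static Map<String, Map<Integer, List<Integer>>> index(String[][] docs) {
        Map<String, Map<Integer, List<Integer>>> idx = new TreeMap<>();
        for (int d = 0; d < docs.length; d++)
            for (int p = 0; p < docs[d].length; p++)
                idx.computeIfAbsent(docs[d][p], t -> new TreeMap<>())
                   .computeIfAbsent(d, x -> new ArrayList<>())
                   .add(p);
        return idx;
    }

    // Keeps words [from, to) of every document; short documents yield empty segments.
    static String[][] segment(String[][] docs, int from, int to) {
        String[][] out = new String[docs.length][];
        for (int d = 0; d < docs.length; d++) {
            int lo = Math.min(from, docs[d].length);
            int hi = Math.min(to, docs[d].length);
            out[d] = Arrays.copyOfRange(docs[d], lo, hi);
        }
        return out;
    }

    // Incremental paste: index each segment, shifting positions by the total
    // size of the same document in the preceding segments.
    static Map<String, Map<Integer, List<Integer>>> paste(List<String[][]> segments) {
        Map<String, Map<Integer, List<Integer>>> pasted = new TreeMap<>();
        int[] offset = new int[segments.get(0).length];
        for (String[][] seg : segments) {
            for (Map.Entry<String, Map<Integer, List<Integer>>> term : index(seg).entrySet())
                for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                    int d = doc.getKey();
                    List<Integer> target = pasted
                        .computeIfAbsent(term.getKey(), t -> new TreeMap<>())
                        .computeIfAbsent(d, x -> new ArrayList<>());
                    for (int p : doc.getValue()) target.add(p + offset[d]);
                }
            for (int d = 0; d < seg.length; d++) offset[d] += seg[d].length;
        }
        return pasted;
    }

    public static void main(String[] args) {
        String[][] docs = { { "a", "b", "a", "c" }, { "b" }, { "c", "a", "b" } };
        List<String[][]> segments =
            Arrays.asList(segment(docs, 0, 2), segment(docs, 2, Integer.MAX_VALUE));
        // Compares the pasted toy index with the index of the unsplit collection.
        System.out.println(paste(segments).equals(index(docs)));
    }
}
```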

    Note that if every document appears in at most one index, pasting is equivalent to merging. It is, however, significantly slower: since the same document may be present in several lists, the inverted lists to be pasted must be scanned completely to compute the frequency. To do so, an in-memory buffer is allocated. If an inverted list does not fit in the memory buffer, it is spilled to disk. Sizing the buffer correctly and choosing a fast file system for the temporary directory can significantly affect performance.
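
    The reason the frequency requires a full scan can be seen on the document pointers alone. The following sketch assumes a simplified representation (one array of document pointers per input index for a given term) and uses hypothetical names throughout. When documents are disjoint across indices, the frequency is just the sum of the input frequencies; when a document may appear in several lists, the lists must be read to count distinct documents.

```java
import java.util.Arrays;

// Hypothetical illustration; none of these names belong to the MG4J API.
final class PasteFrequencySketch {

    // Merging assumes disjoint documents, so frequencies simply add up
    // without reading the lists.
    static long mergedFrequency(int[][] docPointersPerIndex) {
        long sum = 0;
        for (int[] pointers : docPointersPerIndex) sum += pointers.length;
        return sum;
    }

    // Pasting must scan every list: a document present in several lists
    // counts only once in the pasted frequency.
    static long pastedFrequency(int[][] docPointersPerIndex) {
        return Arrays.stream(docPointersPerIndex)
                     .flatMapToInt(Arrays::stream)
                     .distinct()
                     .count();
    }

    public static void main(String[] args) {
        // Document 3 appears in both input lists.
        int[][] lists = { { 0, 3, 7 }, { 3, 9 } };
        System.out.println(mergedFrequency(lists) + " vs " + pastedFrequency(lists));
    }
}
```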

    Warning: incremental pasting is very memory-intensive, as a list of sizes must be loaded for each index. You can use the URI option succinctsizes=1 to load sizes in a succinct format, which will ease the problem.

    Sebastiano Vigna