java.lang.Object
- it.unimi.di.big.mg4j.tool.Combine
- - it.unimi.di.big.mg4j.tool.Paste

```
public final class Paste
extends Combine
```
Pastes several indices.
Pasting is a very slow way of combining indices: we assume that not only documents, but also document occurrences might be scattered throughout several indices. When a document appears in several indices, its occurrences in a given index are combined. We have two possibilities:
- standard pasting: position lists are simply concatenated, and it is the responsibility of the caller to guarantee that they have been numbered in an increasing fashion; the sizes of the last input index are the sizes of the pasted index;
- incremental pasting: position lists are concatenated, but each list is renumbered by adding to all positions the sum of the sizes of the current document for all indices that precede the current one (this kind of pasting was the only one available before version 3.0).
Standard pasting is used, for instance, to paste the batches of a virtual field generated by Scan; the latter takes care of numbering positions correctly. If, however, you index parts of the same document collection on different machines using the same VirtualDocumentResolver, then the resulting indices for virtual fields will have all positions starting from zero, and they will need incremental pasting to be combined correctly.
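The difference between the two kinds of pasting can be illustrated with a minimal sketch. It is not the code used by this class: the class name PastePositionsSketch, the method paste and the array-based representation are hypothetical, and the sketch handles the position lists of a single term in a single document only.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of MG4J.
final class PastePositionsSketch {
    /**
     * Pastes the position lists of one term in one document.
     * positions[i] holds the positions found in the i-th input index;
     * sizes[i] is the size of the document in the i-th input index.
     */
    static int[] paste(int[][] positions, int[] sizes, boolean incremental) {
        int total = 0;
        for (int[] p : positions) total += p.length;
        int[] result = new int[total];
        int k = 0;
        int offset = 0; // Sum of the sizes of the document in the preceding indices.
        for (int i = 0; i < positions.length; i++) {
            // Standard pasting copies positions; incremental pasting shifts them.
            for (int pos : positions[i]) result[k++] = incremental ? pos + offset : pos;
            offset += sizes[i];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] sizes = { 4, 3 };
        int[][] fromZero = { { 0, 2 }, { 1 } };    // Each index numbers positions from zero.
        int[][] prenumbered = { { 0, 2 }, { 5 } }; // The caller already numbered positions.
        System.out.println(Arrays.toString(paste(fromZero, sizes, true)));     // prints [0, 2, 5]
        System.out.println(Arrays.toString(paste(prenumbered, sizes, false))); // prints [0, 2, 5]
    }
}
```

In the first call the second list is shifted by 4, the size of the document in the first index; in the second call the lists are concatenated untouched, so the input must already be increasing.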
Conceptually, this operation is equivalent to splitting a collection vertically: each document is divided into a fixed number n of consecutive segments (possibly of length 0), and a set of n indices is created using the k-th segment of all documents. Pasting the resulting indices will produce an index that is identical to the index generated by the original collection. The behaviour is analogous to that of the UN*X paste command if documents are single-line lists of words.
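The vertical-splitting view can be checked on a toy document. The following sketch is only an illustration under simplifying assumptions: a document is an array of words, an "index" is just the list of positions of one term, and the names VerticalSplitSketch, positions and splitAndPaste are hypothetical. Each segment is indexed from position zero, so the segments are recombined as in incremental pasting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of MG4J.
final class VerticalSplitSketch {
    /** Returns the positions at which term occurs in words, numbered from zero. */
    static List<Integer> positions(String[] words, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < words.length; i++) if (words[i].equals(term)) result.add(i);
        return result;
    }

    /**
     * Splits words into consecutive segments at the given nondecreasing cut points,
     * indexes each segment on its own, and pastes the results incrementally.
     */
    static List<Integer> splitAndPaste(String[] words, int[] cuts, String term) {
        List<Integer> pasted = new ArrayList<>();
        int start = 0; // Sum of the lengths of the preceding segments.
        for (int s = 0; s <= cuts.length; s++) {
            int end = s < cuts.length ? cuts[s] : words.length;
            String[] segment = Arrays.copyOfRange(words, start, end); // Possibly of length 0.
            for (int p : positions(segment, term)) pasted.add(p + start);
            start = end;
        }
        return pasted;
    }

    public static void main(String[] args) {
        String[] words = "a b a c a".split(" ");
        // Three segments: [a, b], [] and [a, c, a].
        System.out.println(splitAndPaste(words, new int[] { 2, 2 }, "a")); // prints [0, 2, 4]
        System.out.println(positions(words, "a"));                         // prints [0, 2, 4]
    }
}
```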
Note that if every document appears in at most one index, pasting is equivalent to merging. It is, however, significantly slower, as the presence of the same document in several lists makes it necessary to scan the inverted lists to be pasted completely in order to compute the frequency. To do so, an in-memory buffer is allocated. If an inverted list does not fit in the memory buffer, it is spilled to disk. Sizing the buffer correctly and choosing a fast file system for the temporary directory can significantly affect performance.
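To see why the frequency cannot be known before scanning, consider the following minimal sketch. It is an illustration only, not the code of this class: the names PasteFrequencySketch and frequency are hypothetical, and inverted lists are reduced to sorted, duplicate-free arrays of document pointers. When the lists are disjoint the frequency is just the sum of the lengths; when a document appears in more than one list it must be counted once, which requires reading every pointer.

```java
// Hypothetical illustration; not part of MG4J.
final class PasteFrequencySketch {
    /** Counts the distinct document pointers in a set of sorted, duplicate-free lists. */
    static int frequency(long[][] lists) {
        int[] cursor = new int[lists.length]; // Next unread entry of each list.
        int frequency = 0;
        for (;;) {
            // Find the smallest pointer not yet consumed.
            boolean found = false;
            long min = 0;
            for (int i = 0; i < lists.length; i++)
                if (cursor[i] < lists[i].length && (!found || lists[i][cursor[i]] < min)) {
                    min = lists[i][cursor[i]];
                    found = true;
                }
            if (!found) return frequency;
            frequency++; // One document, however many lists contain it.
            for (int i = 0; i < lists.length; i++)
                if (cursor[i] < lists[i].length && lists[i][cursor[i]] == min) cursor[i]++;
        }
    }

    public static void main(String[] args) {
        long[][] overlapping = { { 0, 3, 7 }, { 3, 5 }, { 7, 9 } };
        long[][] disjoint = { { 0, 2 }, { 1, 3 } };
        System.out.println(frequency(overlapping)); // prints 5, although the lists hold 7 pointers
        System.out.println(frequency(disjoint));    // prints 4, the sum of the lengths
    }
}
```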
Warning: incremental pasting is very memory-intensive, as a list of sizes must be loaded for each index. You can use the URI option succinctsizes=1 to load sizes in a succinct format, which will ease the problem.
Since:

1.0

Author:

Sebastiano Vigna

Nested Class Summary
- Nested classes/interfaces inherited from class it.unimi.di.big.mg4j.tool.Combine
  Combine.GammaCodedIntIterator, Combine.IndexType

Field Summary

Fields
Modifier and Type	Field	Description
`static int`	`DEFAULT_MEMORY_BUFFER_SIZE`	The default size of the temporary bit stream buffer used while pasting.
`protected long[]`	`doc`	The reference array of the document queue.
`protected IntHeapPriorityQueue`	`documentQueue`	The queue containing document pointers (for remapped indices).

Fields inherited from class it.unimi.di.big.mg4j.tool.Combine
additionalProperties, bufferSize, DEFAULT_BUFFER_SIZE, frequency, hasCounts, hasPayloads, hasPositions, haveSumsMaxPos, index, indexIterator, indexReader, indexWriter, inputBasename, ioFactory, maxCount, metadataOnly, needsSizes, numberOfDocuments, numberOfOccurrences, numIndices, outputBasename, p, positionArray, predictedLengthNumBits, predictedSize, quasiSuccinctIndexWriter, size, sumsMaxPos, termQueue, usedIndex, variableQuantumIndexWriter

Constructor Summary

Constructors
Constructor	Description
`Paste(IOFactory ioFactory, String outputBasename, String[] inputBasename, boolean metadataOnly, boolean incremental, int bufferSize, File tempFileDir, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, Combine.IndexType indexType, boolean skips, int quantum, int height, int skipBufferOrCacheSize, long logInterval)`	Pastes several indices into one.
`Paste(IOFactory ioFactory, String outputBasename, String[] inputBasename, IntList delete, boolean metadataOnly, boolean incremental, int bufferSize, File tempFileDir, int tempBufferSize, Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags, Combine.IndexType indexType, boolean skips, int quantum, int height, int skipBufferSize, long logInterval)`	Pastes several indices into one.

Method Summary

Modifier and Type	Method	Description
`protected long`	`combine(int numUsedIndices, long occurrency)`	Combines several indices.
`protected long`	`combineNumberOfDocuments()`	Combines the number of documents.
`protected int`	`combineSizes(OutputBitStream sizesOutputBitStream)`	Combines size lists.
`static void`	`main(String[] arg)`
`void`	`run()`

Methods inherited from class it.unimi.di.big.mg4j.tool.Combine
main, sizes

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - DEFAULT_MEMORY_BUFFER_SIZE
```
public static final int DEFAULT_MEMORY_BUFFER_SIZE
```
    The default size of the temporary bit stream buffer used while pasting. Posting lists larger than this size will be precomputed on disk and then added to the index.
    
    See Also:
    
    Constant Field Values
  - doc
```
protected final long[] doc
```
    The reference array of the document queue.
  - documentQueue
```
protected final IntHeapPriorityQueue documentQueue
```
    The queue containing document pointers (for remapped indices).
- Constructor Detail
  - Paste
```
public Paste(IOFactory ioFactory,
             String outputBasename,
             String[] inputBasename,
             boolean metadataOnly,
             boolean incremental,
             int bufferSize,
             File tempFileDir,
             int tempBufferSize,
             Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags,
             Combine.IndexType indexType,
             boolean skips,
             int quantum,
             int height,
             int skipBufferOrCacheSize,
             long logInterval)
      throws IOException,
             org.apache.commons.configuration.ConfigurationException,
             URISyntaxException,
             ClassNotFoundException,
             SecurityException,
             InstantiationException,
             IllegalAccessException,
             InvocationTargetException,
             NoSuchMethodException
```
    Pastes several indices into one.
    
    Parameters:
    
    ioFactory - the factory that will be used to perform I/O.
    
    outputBasename - the basename of the combined index.
    
    inputBasename - the basenames of the input indices.
    
    metadataOnly - if true, we save only metadata (term list, frequencies, global counts).
    
    incremental - if true, we perform an incremental paste (needs sizes).
    
    bufferSize - the buffer size for index readers.
    
    tempFileDir - the directory of the temporary file used when pasting.
    
    tempBufferSize - the size of the in-memory buffer used when pasting.
    
    writerFlags - the flags for the index writer.
    
    indexType - the type of the index to build.
    
    skips - whether to insert skips when building an interleaved index (see indexType).
    
    quantum - the quantum of skipping structures; if negative, a percentage of space for variable-quantum indices (irrelevant if skips is false).
    
    height - the height of skipping towers (irrelevant if skips is false).
    
    skipBufferOrCacheSize - the size of the buffer used to hold temporarily inverted lists during the skipping structure construction, or the size of the bit cache used when building a quasi-succinct index.
    
    logInterval - how often we log.
    
    Throws:
    
    IOException
    
    org.apache.commons.configuration.ConfigurationException
    
    URISyntaxException
    
    ClassNotFoundException
    
    SecurityException
    
    InstantiationException
    
    IllegalAccessException
    
    InvocationTargetException
    
    NoSuchMethodException
  - Paste
```
public Paste(IOFactory ioFactory,
             String outputBasename,
             String[] inputBasename,
             IntList delete,
             boolean metadataOnly,
             boolean incremental,
             int bufferSize,
             File tempFileDir,
             int tempBufferSize,
             Map<CompressionFlags.Component,CompressionFlags.Coding> writerFlags,
             Combine.IndexType indexType,
             boolean skips,
             int quantum,
             int height,
             int skipBufferSize,
             long logInterval)
      throws IOException,
             org.apache.commons.configuration.ConfigurationException,
             URISyntaxException,
             ClassNotFoundException,
             SecurityException,
             InstantiationException,
             IllegalAccessException,
             InvocationTargetException,
             NoSuchMethodException
```
    Pastes several indices into one.
    
    Parameters:
    
    ioFactory - the factory that will be used to perform I/O.
    
    outputBasename - the basename of the combined index.
    
    inputBasename - the basenames of the input indices.
    
    delete - a monotonically increasing list of integers representing documents that will be deleted from the output index, or null.
    
    metadataOnly - if true, we save only metadata (term list, frequencies, global counts).
    
    incremental - if true, we perform an incremental paste (needs sizes).
    
    bufferSize - the buffer size for index readers.
    
    tempFileDir - the directory of the temporary file used when pasting.
    
    tempBufferSize - the size of the in-memory buffer used when pasting.
    
    writerFlags - the flags for the index writer.
    
    indexType - the type of the index to build.
    
    skips - whether to insert skips when building an interleaved index (see indexType).
    
    quantum - the quantum of skipping structures; if negative, a percentage of space for variable-quantum indices (irrelevant if skips is false).
    
    height - the height of skipping towers (irrelevant if skips is false).
    
    skipBufferSize - the size of the buffer used to hold temporarily inverted lists during the skipping structure construction.
    
    logInterval - how often we log.
    
    Throws:
    
    IOException
    
    org.apache.commons.configuration.ConfigurationException
    
    URISyntaxException
    
    ClassNotFoundException
    
    SecurityException
    
    InstantiationException
    
    IllegalAccessException
    
    InvocationTargetException
    
    NoSuchMethodException
- Method Detail
  - combineNumberOfDocuments
```
protected long combineNumberOfDocuments()
```
    Description copied from class: Combine
    
    Combines the number of documents.
    
    Specified by:
    
    combineNumberOfDocuments in class Combine
    
    Returns:
    
    the number of documents of the combined index.
  - combineSizes
```
protected int combineSizes(OutputBitStream sizesOutputBitStream)
                    throws IOException
```
    Description copied from class: Combine
    
    Combines size lists.
    
    Specified by:
    
    combineSizes in class Combine
    
    Returns:
    
    the maximum size of a document in the combined index.
    
    Throws:
    
    IOException
  - combine
```
protected long combine(int numUsedIndices,
                       long occurrency)
                throws IOException
```
    Description copied from class: Combine
    
    Combines several indices.
    When this method is called, exactly numUsedIndices entries of Combine.usedIndex contain, in increasing order, the indices containing inverted lists for the current term. Implementations of this method must combine the inverted lists and return the total frequency.
    
    Specified by:
    
    combine in class Combine
    
    Parameters:
    
    numUsedIndices - the number of valid entries in Combine.usedIndex.
    
    occurrency - the occurrency of the term (used only when building Combine.IndexType.QUASI_SUCCINCT indices).
    
    Returns:
    
    the total frequency.
    
    Throws:
    
    IOException
  - run
```
public void run()
         throws org.apache.commons.configuration.ConfigurationException,
                IOException
```
    Overrides:
    
    run in class Combine
    
    Throws:
    
    org.apache.commons.configuration.ConfigurationException
    
    IOException
  - main
```
public static void main(String[] arg)
                 throws org.apache.commons.configuration.ConfigurationException,
                        SecurityException,
                        com.martiansoftware.jsap.JSAPException,
                        IOException,
                        URISyntaxException,
                        ClassNotFoundException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException
```
    Throws:
    
    org.apache.commons.configuration.ConfigurationException
    
    SecurityException
    
    com.martiansoftware.jsap.JSAPException
    
    IOException
    
    URISyntaxException
    
    ClassNotFoundException
    
    InstantiationException
    
    IllegalAccessException
    
    InvocationTargetException
    
    NoSuchMethodException
