java.lang.Object
- it.unimi.di.big.mg4j.tool.PartitionLexically

```
public class PartitionLexically
extends Object
```
Partitions an index lexically.
A global index is partitioned lexically by providing a LexicalPartitioningStrategy that specifies, for each term, a destination local index and a local term number. The global index is read directly at the bit level, and the posting lists are divided among the local indices using the provided strategy. For instance, a ContiguousLexicalStrategy divides an index into contiguous blocks of terms specified by the given strategy.
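The mapping performed by a contiguous strategy can be illustrated with a minimal, self-contained sketch. The class below is hypothetical and is not the MG4J ContiguousLexicalStrategy; it assumes strictly increasing cut points, where each cut point is the first global term number of a block and the last one is the total number of terms.

```java
import java.util.Arrays;

/** Hypothetical illustration of contiguous lexical partitioning; not part of MG4J. */
class ContiguousPartitionSketch {
    /** cutPoints[i] is the first global term number of local index i;
     *  the last entry is the total number of terms (assumed strictly increasing). */
    private final long[] cutPoints;

    ContiguousPartitionSketch(long... cutPoints) {
        if (cutPoints.length < 2) throw new IllegalArgumentException("at least two cut points are required");
        this.cutPoints = cutPoints.clone();
    }

    /** Returns the local index that receives the given global term number. */
    int localIndex(long globalNumber) {
        if (globalNumber < cutPoints[0] || globalNumber >= cutPoints[cutPoints.length - 1])
            throw new IndexOutOfBoundsException("term number out of range: " + globalNumber);
        int pos = Arrays.binarySearch(cutPoints, globalNumber);
        // Exact hit: the term opens block pos. Otherwise binarySearch returns
        // -(insertionPoint) - 1, and the term belongs to block insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Returns the term number inside the destination local index. */
    long localNumber(long globalNumber) {
        return globalNumber - cutPoints[localIndex(globalNumber)];
    }

    public static void main(String[] args) {
        // Two local indices: terms 0..3 and terms 4..9.
        ContiguousPartitionSketch sketch = new ContiguousPartitionSketch(0, 4, 10);
        System.out.println("term 5 -> local index " + sketch.localIndex(5)
                + ", local number " + sketch.localNumber(5));
    }
}
```

Because the blocks are contiguous, a binary search over the cut points is enough to locate the destination index, and the local term number is just the offset from the start of the block.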
By choice, document pointers are not remapped. Thus, it may happen that one of the local indices contains no posting for a certain document. However, computing the subset of documents contained in each local index in order to remap them into a contiguous interval is rarely worthwhile, because the subset of documents appearing in the postings of each local index is usually large.
To speed up the search for the right local index of an infrequent term (in particular with a chained strategy), after partitioning an index you can create Bloom filters, which will be used to avoid querying indices that do not contain the term. The filters will be automatically loaded by Index.getInstance(CharSequence, boolean, boolean).
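The idea of using one Bloom filter per local index to skip indices that cannot contain a term can be sketched as follows. This is a minimal illustration under stated assumptions: the class, its hash function, and the `candidates` helper are hypothetical and do not reflect the filter implementation or file format that MG4J actually uses.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical minimal Bloom filter over terms; not the MG4J implementation. */
class TermBloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    TermBloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) throw new IllegalArgumentException("sizes must be positive");
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Seeded FNV-1a-style hash reduced to a bit position (an arbitrary choice for this sketch). */
    private int position(CharSequence term, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (int i = 0; i < term.length(); i++) {
            h ^= term.charAt(i);
            h *= 0x01000193;
        }
        return Math.floorMod(h, numBits);
    }

    void add(CharSequence term) {
        for (int k = 0; k < numHashes; k++) bits.set(position(term, k));
    }

    /** False means the term is certainly absent; true means it may be present. */
    boolean mightContain(CharSequence term) {
        for (int k = 0; k < numHashes; k++)
            if (!bits.get(position(term, k))) return false;
        return true;
    }

    /** Returns the positions of the local indices that are still worth querying for the term. */
    static List<Integer> candidates(TermBloomFilterSketch[] filters, CharSequence term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < filters.length; i++)
            if (filters[i].mightContain(term)) result.add(i);
        return result;
    }
}
```

A filter never reports a term it contains as absent, so skipping the indices that answer false is safe; a false positive only costs one unnecessary lookup.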
Note that the size file is the same for each local index and is not copied. Please use standard operating system features such as symbolic links to provide size files to local indices.
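Where a shell is not convenient, the same links can be created from Java. The following sketch assumes local indices named `basename-0`, `basename-1`, and so on, and a size file with the `.sizes` extension; the class and method names are hypothetical, and the naming scheme should be adapted to the basenames actually used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that links one shared size file into each local index. */
class SizeLinkSketch {
    /**
     * Creates localBasename-i.sizes in dir, for i in [0, numIndices), as symbolic
     * links to the given size file. Fails if a link target name already exists.
     */
    static void linkSizes(Path globalSizes, Path dir, String localBasename, int numIndices) throws IOException {
        for (int i = 0; i < numIndices; i++) {
            Path link = dir.resolve(localBasename + "-" + i + ".sizes");
            Files.createSymbolicLink(link, globalSizes.toAbsolutePath());
        }
    }
}
```

Symbolic links require a file system and operating system that support them; on platforms where they are unavailable, `Files.createSymbolicLink` throws an exception.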
If you plan to cluster the partitioned indices and you need document sizes (e.g., for BM25 scoring), you can use the index property Index.UriKeys.SIZES to load the original size file. If you plan on partitioning an index requiring document sizes, you should consider a custom index loading scheme that shares the size list among all local indices.
Important: this class just partitions the index. No auxiliary files (most notably, term maps or prefix maps) will be generated. To build them, please refer to a StringMap implementation (e.g., ShiftAddXorSignedStringMap or ImmutableExternalPrefixMap).
Write-once output and distributed index partitioning

The partitioning process writes each index file sequentially exactly once, so index partitioning can output its results to pipes, which in turn can forward their content elsewhere, for instance over the network. In other words, although this class nominally creates a number of local indices on disk, those files can be replaced with suitable pipes that create remote local indices, without affecting the partitioning process. For instance, the following bash code creates three sets of pipes:
```
# Create one named pipe per index file for each of the three local indices.
for i in 0 1 2; do
  for e in frequencies occurrencies index offsets properties sizes terms; do
    mkfifo pipe-$i.$e
  done
done
```
Each pipe must be emptied elsewhere, for instance (assuming you want local indices index-0, index-1 and index-2 on example.com):
```
# Drain each pipe in the background into the matching remote file.
for i in 0 1 2; do
  for e in frequencies occurrencies index offsets properties sizes terms; do
    (cat pipe-$i.$e | ssh -x example.com "cat >index-$i.$e" &)
  done
done
```
If we now start a partitioning process generating three local indices named pipe-0, pipe-1 and pipe-2, the process will write to all of the pipes, and the data will create the remote indices index-0, index-1 and index-2.
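The reason this works is that a producer that only ever appends bytes, and never seeks back, cannot tell a pipe from a regular file. The following sketch illustrates that property with the in-process `java.io` piped streams standing in for a named pipe; the class and method names are hypothetical, and nothing here is MG4J code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;

/** Hypothetical demonstration that write-once sequential output can flow through a pipe. */
class WriteOncePipeSketch {
    /** Reads the stream to its end, as the remote "cat" would do. */
    static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        for (int n; (n = in.read(buffer)) != -1; ) sink.write(buffer, 0, n);
        return sink.toByteArray();
    }

    /** Writes data sequentially, exactly once, into a pipe and returns what the consumer received. */
    static byte[] writeThroughPipe(byte[] data) throws IOException, InterruptedException {
        PipedInputStream in = new PipedInputStream();
        PipedOutputStream out = new PipedOutputStream(in);
        Thread writer = new Thread(() -> {
            // The producer only appends and then closes: no seeking, no rewriting.
            try (OutputStream o = out) {
                o.write(data);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        writer.start();
        byte[] received = drain(in); // The consumer runs concurrently with the producer.
        writer.join();
        return received;
    }
}
```

The producer and the consumer must run concurrently, just as each named pipe in the shell example needs a background process emptying it, because a pipe has a bounded buffer and blocks the writer when it is full.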
Since:

1.0.1

Author:

Sebastiano Vigna

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`protected static class`	`PartitionLexically.LongWordInputBitStream`

Field Summary

Fields
Modifier and Type	Field	Description
`static int`	`DEFAULT_BUFFER_SIZE`	The default buffer size for all involved indices.

Constructor Summary

Constructors
Constructor	Description
`PartitionLexically(String inputBasename, String outputBasename, LexicalPartitioningStrategy strategy, String strategyFilename, int bufferSize, long logInterval)`

Method Summary

Modifier and Type	Method	Description
`static void`	`main(String[] arg)`
`void`	`run()`
`void`	`runTermsOnly()`
- Methods inherited from class java.lang.Object
  clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail
- DEFAULT_BUFFER_SIZE
```
public static final int DEFAULT_BUFFER_SIZE
```
  The default buffer size for all involved indices.
  
  See Also:
  
  Constant Field Values

Constructor Detail

PartitionLexically

public PartitionLexically(String inputBasename,
                          String outputBasename,
                          LexicalPartitioningStrategy strategy,
                          String strategyFilename,
                          int bufferSize,
                          long logInterval)

Method Detail

runTermsOnly

public void runTermsOnly()
                  throws IOException

Throws: IOException

run

public void run()
         throws org.apache.commons.configuration.ConfigurationException,
                IOException,
                ClassNotFoundException

Throws: org.apache.commons.configuration.ConfigurationException; IOException; ClassNotFoundException

main

public static void main(String[] arg)
                 throws com.martiansoftware.jsap.JSAPException,
                        org.apache.commons.configuration.ConfigurationException,
                        IOException,
                        ClassNotFoundException,
                        SecurityException,
                        InstantiationException,
                        IllegalAccessException

Throws: com.martiansoftware.jsap.JSAPException; org.apache.commons.configuration.ConfigurationException; IOException; ClassNotFoundException; SecurityException; InstantiationException; IllegalAccessException

