Splitting indices

Combining indices has a counterpart: you can partition an index into several indices. There are many reasons to do so: you might want to split an index into several segments containing different groups of documents, so as to distribute the load across a multiserver system. Or you might want to keep in main memory the posting lists of the terms that appear most often and just memory-map the rest. MG4J has two tools that make it possible to partition an index: PartitionLexically and PartitionDocumentally. The first tool creates several indices containing distinct subsets of terms. The second tool creates indices containing distinct subsets of documents. To make the process as customizable as possible, both tools accept a partitioning strategy, that is, an object that specifies, for each term or document, where it should be stored. There are ready-to-use strategies, but you can also write your own.
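
To make the idea of a partitioning strategy concrete, here is a minimal, self-contained sketch of a documental strategy that assigns contiguous ranges of documents to local indices. It does not use MG4J: the class ContiguousDocumentPartition and its methods localIndex and localDocument are hypothetical names chosen for illustration, and the actual MG4J strategy interfaces may differ.

```java
import java.util.Arrays;

/** Illustrative sketch only; this is not an MG4J class. */
final class ContiguousDocumentPartition {
    /** cutPoints[i] is the first global document stored in local index i;
     *  the last entry is the total number of documents. */
    private final long[] cutPoints;

    ContiguousDocumentPartition(long... cutPoints) {
        if (cutPoints.length < 2 || cutPoints[0] != 0)
            throw new IllegalArgumentException("Cut points must start at 0 and end with the collection size");
        for (int i = 1; i < cutPoints.length; i++)
            if (cutPoints[i] <= cutPoints[i - 1])
                throw new IllegalArgumentException("Cut points must be strictly increasing");
        this.cutPoints = cutPoints.clone();
    }

    /** Number of local indices produced by this partition. */
    int numberOfLocalIndices() {
        return cutPoints.length - 1;
    }

    /** Which local index stores the given global document. */
    int localIndex(long globalDocument) {
        if (globalDocument < 0 || globalDocument >= cutPoints[cutPoints.length - 1])
            throw new IndexOutOfBoundsException("No such document: " + globalDocument);
        int pos = Arrays.binarySearch(cutPoints, globalDocument);
        // Exact hit: the document opens segment pos.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the segment is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Position of the given global document inside its local index. */
    long localDocument(long globalDocument) {
        return globalDocument - cutPoints[localIndex(globalDocument)];
    }

    public static void main(String[] args) {
        // Ten documents: 0..3 go to local index 0, 4..9 to local index 1.
        ContiguousDocumentPartition p = new ContiguousDocumentPartition(0, 4, 10);
        for (long d = 0; d < 10; d++)
            System.out.println("global " + d + " -> index " + p.localIndex(d) + ", local " + p.localDocument(d));
    }
}
```

A lexical strategy would have the same shape, mapping each term (rather than each document) to a local index, for instance by sending the most frequent terms to one index and all others to a second one.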

Once you have created several indices, you can see them again as a single index using an index cluster, a type of index that exposes a number of local indices as a single global index. A cluster uses a clustering strategy, which is often associated with a partitioning strategy. Moreover, you can always use Merge on the partitioned indices you created and get back exactly the original index.
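
The relationship between the two kinds of strategy can be sketched as a pair of inverse mappings: partitioning sends a global document to a (local index, local document) pair, and clustering sends that pair back to the global document. The following self-contained example assumes contiguous ranges of documents; ContiguousClusterSketch and all of its method names are hypothetical and are not part of MG4J.

```java
/** Illustrative sketch only; this is not an MG4J class. */
final class ContiguousClusterSketch {
    /** cutPoints[i] is the first global document of local index i;
     *  the last entry is the total number of documents. */
    private final long[] cutPoints;

    ContiguousClusterSketch(long... cutPoints) {
        if (cutPoints.length < 2 || cutPoints[0] != 0)
            throw new IllegalArgumentException("Cut points must start at 0 and end with the collection size");
        for (int i = 1; i < cutPoints.length; i++)
            if (cutPoints[i] <= cutPoints[i - 1])
                throw new IllegalArgumentException("Cut points must be strictly increasing");
        this.cutPoints = cutPoints.clone();
    }

    long numberOfDocuments() {
        return cutPoints[cutPoints.length - 1];
    }

    /** Partitioning direction: global document -> local index. */
    int localIndex(long globalDocument) {
        if (globalDocument < 0 || globalDocument >= numberOfDocuments())
            throw new IndexOutOfBoundsException("No such document: " + globalDocument);
        int i = 0;
        while (globalDocument >= cutPoints[i + 1]) i++; // linear scan, fine for a few segments
        return i;
    }

    /** Partitioning direction: global document -> position in its local index. */
    long localDocument(long globalDocument) {
        return globalDocument - cutPoints[localIndex(globalDocument)];
    }

    /** Clustering direction: (local index, local document) -> global document. */
    long globalDocument(int localIndex, long localDocument) {
        long size = cutPoints[localIndex + 1] - cutPoints[localIndex];
        if (localDocument < 0 || localDocument >= size)
            throw new IndexOutOfBoundsException("No such local document: " + localDocument);
        return cutPoints[localIndex] + localDocument;
    }

    /** Checks that clustering undoes partitioning for every document. */
    boolean roundTrips() {
        for (long d = 0; d < numberOfDocuments(); d++)
            if (globalDocument(localIndex(d), localDocument(d)) != d) return false;
        return true;
    }

    public static void main(String[] args) {
        ContiguousClusterSketch c = new ContiguousClusterSketch(0, 4, 10);
        System.out.println("Round trip holds: " + c.roundTrips());
    }
}
```

This is only the pointer arithmetic behind the global view; a real cluster also has to combine the posting lists of the local indices when answering queries.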

The documentation of the package it.unimi.di.mg4j.index.cluster and its classes is a good starting point for understanding partitioning and clusters.