it.unimi.di.mg4j.index.cluster
Class IndexCluster

java.lang.Object
  extended by it.unimi.di.mg4j.index.Index
      extended by it.unimi.di.mg4j.index.cluster.IndexCluster
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
DocumentalCluster, LexicalCluster

public abstract class IndexCluster
extends Index

An abstract index cluster. An index cluster is an index that transparently exposes a list of local indices as a single global index. A cluster is usually generated by partitioning an index lexically or documentally, but nothing prevents the creation of hand-made clusters.
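
To make the two partitioning styles concrete, here is a minimal sketch on a toy inverted index (a map from terms to document numbers). It is not MG4J code: the class PartitionSketch, its methods and the round-robin assignment rules are all assumptions chosen for illustration, and unlike a real documental partition the sketch keeps global document numbers instead of renumbering them locally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model, not MG4J code: an "index" here is just a sorted map from terms to document numbers.
class PartitionSketch {
    /** Lexical partitioning: each term, with its whole posting list, goes to exactly one local index. */
    static List<Map<String, List<Integer>>> lexical(Map<String, List<Integer>> global, int n) {
        List<Map<String, List<Integer>>> local = emptyIndices(n);
        int rank = 0;
        // Assumed rule: round-robin on the rank of the term in sorted order.
        for (Map.Entry<String, List<Integer>> e : new TreeMap<>(global).entrySet())
            local.get(rank++ % n).put(e.getKey(), new ArrayList<>(e.getValue()));
        return local;
    }

    /** Documental partitioning: each document goes to exactly one local index, so posting lists are split. */
    static List<Map<String, List<Integer>>> documental(Map<String, List<Integer>> global, int n) {
        List<Map<String, List<Integer>>> local = emptyIndices(n);
        // Assumed rule: document d is assigned to local index d mod n.
        for (Map.Entry<String, List<Integer>> e : global.entrySet())
            for (int doc : e.getValue())
                local.get(doc % n).computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(doc);
        return local;
    }

    private static List<Map<String, List<Integer>>> emptyIndices(int n) {
        List<Map<String, List<Integer>>> local = new ArrayList<>();
        for (int i = 0; i < n; i++) local.add(new TreeMap<>());
        return local;
    }

    /** A made-up global index with two terms. */
    static Map<String, List<Integer>> example() {
        Map<String, List<Integer>> global = new TreeMap<>();
        global.put("cat", Arrays.asList(0, 1, 3));
        global.put("dog", Arrays.asList(1, 2));
        return global;
    }

    public static void main(String[] args) {
        System.out.println("lexical:    " + lexical(example(), 2));
        System.out.println("documental: " + documental(example(), 2));
    }
}
```

In the lexical case a term lives in a single local index, while in the documental case every local index may hold part of the posting list of any term.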

Note that, upon creation of an instance, the main index key of all local indices is set to that instance.

An index cluster is defined by a property file. Only two properties are common to all index clusters: localindex, which can be specified multiple times (order is relevant) and contains the URIs of the local indices of the cluster, and strategy, which contains the filename of a serialised ClusteringStrategy. The local indices are loaded using Index.getInstance(CharSequence,boolean,boolean), so there is no restriction on the URIs that can be used (e.g., you can cluster a set of remote indices).
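
As a rough illustration of the layout just described, the sketch below reads a hypothetical property file in which localindex appears more than once and collects its values in file order. The file contents, the basenames, the strategy filename and the helper class ClusterPropertiesSketch are all made up for the example; MG4J reads property files through its own machinery, which this stdlib-only code does not reproduce.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of MG4J: collects repeated keys in order of appearance.
class ClusterPropertiesSketch {
    // Made-up file contents; the basenames and the strategy filename are placeholders.
    static final String EXAMPLE = String.join("\n",
        "# hypothetical cluster property file",
        "localindex = example-0",
        "localindex = example-1",
        "strategy = example-strategy-file");

    /** Returns the values of every "key = value" line matching the given key, in file order. */
    static List<String> values(String propertyText, String key) {
        List<String> result = new ArrayList<>();
        for (String line : propertyText.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue; // skip blanks and comments
            int eq = trimmed.indexOf('=');
            if (eq < 0) continue; // not a key/value line
            if (trimmed.substring(0, eq).trim().equals(key))
                result.add(trimmed.substring(eq + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // The position of each value is the position of the local index in the cluster.
        System.out.println(values(EXAMPLE, "localindex"));
        System.out.println(values(EXAMPLE, "strategy"));
    }
}
```

Because the order of the localindex entries is relevant, the sketch uses a list rather than a set or a plain key/value map, which would keep only one value per key.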

Alternatively, the property strategyclass can be used to specify a class name (the class will be loaded using MG4JClassParser, so you can omit the package if the class is in MG4J). The class must provide a constructor with a signature like that of ChainedLexicalClusteringStrategy.ChainedLexicalClusteringStrategy(Index[], BloomFilter[]).

If you plan to use global document sizes (e.g., for BM25 scoring) you will need to load them explicitly using the property Index.UriKeys.SIZES, which must specify a size file for the whole collection. If you are clustering a partitioned index, this is usually the original size file.

Optionally, an index cluster may provide Bloom filters, which avoid needless accesses to local indices that do not contain a term. The filters have the standard extension BLOOM_EXTENSION.
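
The following sketch shows the general idea of such a pre-check, using a toy Bloom filter built on java.util.BitSet. It is not the BloomFilter class used by MG4J: the class TermFilterSketch, its hashing scheme and the candidates method are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy Bloom filter, NOT MG4J's BloomFilter class: it only illustrates the filtering idea.
class TermFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    TermFilterSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derives the i-th bit position from the term's hash code (simple double hashing).
    private int position(CharSequence term, int i) {
        int h1 = term.toString().hashCode();
        int h2 = (h1 >>> 16) | 1; // second hash, forced to be odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(CharSequence term) {
        for (int i = 0; i < hashes; i++) bits.set(position(term, i));
    }

    /** False means "certainly absent"; true means "possibly present". */
    boolean mightContain(CharSequence term) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(position(term, i))) return false;
        return true;
    }

    /** Returns the local indices worth querying for a term; a null array disables filtering. */
    static List<Integer> candidates(TermFilterSketch[] termFilter, int numberOfLocalIndices, CharSequence term) {
        List<Integer> result = new ArrayList<>();
        for (int k = 0; k < numberOfLocalIndices; k++)
            if (termFilter == null || termFilter[k].mightContain(term)) result.add(k);
        return result;
    }
}
```

A filter can report a term as possibly present when it is not (a false positive), which costs one unnecessary access, but it never hides a term that was added, so skipping a local index on a negative answer is safe.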

This class exposes a static factory method that uses the indexclass property to load the appropriate implementing subclass; Bloom filters are loaded automatically.

See Also:
Serialized Form

Nested Class Summary
static class IndexCluster.PropertyKeys
          Symbolic names for properties of an IndexCluster.
 
Nested classes/interfaces inherited from class it.unimi.di.mg4j.index.Index
Index.EmptyIndexIterator, Index.UriKeys
 
Field Summary
static String BLOOM_EXTENSION
          The default extension for Bloom term filters.
protected  Index[] localIndex
          The local indices of this cluster.
static String STRATEGY_DEFAULT_EXTENSION
          The default extension of a strategy.
protected  BloomFilter[] termFilter
          An array of Bloom filters to reduce index access, or null.
 
Fields inherited from class it.unimi.di.mg4j.index.Index
field, hasCounts, hasPayloads, hasPositions, keyIndex, maxCount, numberOfDocuments, numberOfOccurrences, numberOfPostings, numberOfTerms, payload, prefixMap, properties, singletonSet, sizes, termMap, termProcessor
 
Constructor Summary
protected IndexCluster(Index[] localIndex, BloomFilter[] termFilter, int numberOfDocuments, int numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, IntList sizes, Properties properties)
           
 
Method Summary
static Index getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, EnumMap<Index.UriKeys,String> queryProperties)
          Returns a new index cluster.
 void keyIndex(Index newKeyIndex)
          Sets the index used as a key to retrieve intervals from iterators generated from this index.
 
Methods inherited from class it.unimi.di.mg4j.index.Index
documents, documents, documents, getEmptyIndexIterator, getEmptyIndexIterator, getEmptyIndexIterator, getEmptyIndexIterator, getInstance, getInstance, getInstance, getInstance, getInstance, getReader, getReader, getTermProcessor
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

STRATEGY_DEFAULT_EXTENSION

public static final String STRATEGY_DEFAULT_EXTENSION
The default extension of a strategy.

See Also:
Constant Field Values

BLOOM_EXTENSION

public static final String BLOOM_EXTENSION
The default extension for Bloom term filters.

See Also:
Constant Field Values

localIndex

protected final Index[] localIndex
The local indices of this cluster.


termFilter

protected final BloomFilter[] termFilter
An array of Bloom filters to reduce index access, or null.

Constructor Detail

IndexCluster

protected IndexCluster(Index[] localIndex,
                       BloomFilter[] termFilter,
                       int numberOfDocuments,
                       int numberOfTerms,
                       long numberOfPostings,
                       long numberOfOccurrences,
                       int maxCount,
                       Payload payload,
                       boolean hasCounts,
                       boolean hasPositions,
                       TermProcessor termProcessor,
                       String field,
                       IntList sizes,
                       Properties properties)
Method Detail

getInstance

public static Index getInstance(CharSequence basename,
                                boolean randomAccess,
                                boolean documentSizes,
                                EnumMap<Index.UriKeys,String> queryProperties)
                         throws ConfigurationException,
                                IOException,
                                ClassNotFoundException,
                                SecurityException,
                                URISyntaxException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException
Returns a new index cluster.

This method uses the LOCALINDEX property to locate the local indices, loads them (passing on randomAccess) and builds a new index cluster using the appropriate implementing subclass.

Note that documentSizes is just passed to the local indices. This can be useful in documental clusters, as it allows local scoring, but it is useless in lexical clusters, as scoring is necessarily centralised. In the latter case, the property Index.UriKeys.SIZES can be used to specify a global sizes file (which usually comes from an original global index).

Parameters:
basename - the basename.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
ConfigurationException
IOException
ClassNotFoundException
SecurityException
URISyntaxException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException

keyIndex

public void keyIndex(Index newKeyIndex)
Description copied from class: Index
Sets the index used as a key to retrieve intervals from iterators generated from this index.

This setter is a compromise between clarity of design and efficiency. Each index iterator is based on an index, and when that index is passed to DocumentIterator.intervalIterator(Index), intervals corresponding to the positions of the term in the current document are returned. Analogously, DocumentIterator.indices() returns a singleton set containing the index. However, when indices are composed into clusters, iterators generated by a local index must often act as if they really belonged to the global index. This method sets the index that is used as a key to return intervals, and that is contained in Index.singletonSet.

Note that setting this value will only influence index readers created afterwards.
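
The behaviour described above can be pictured with a toy model in which a reader captures the key index at creation time. This is only a sketch under stated assumptions: ToyIndex, ToyReader and ToyCluster are hypothetical stand-ins rather than MG4J classes, and the choice that an index starts out as its own key is an assumption of the model.

```java
// Toy model, not MG4J code: mimics how a key index is captured when a reader is created.
class KeyIndexSketch {
    static class ToyIndex {
        final String name;
        ToyIndex keyIndex = this; // assumption of this model: an index starts as its own key

        ToyIndex(String name) { this.name = name; }

        /** Sets the index that readers created from now on will report as their key. */
        void keyIndex(ToyIndex newKeyIndex) { keyIndex = newKeyIndex; }

        /** The reader remembers the key index that was current when it was created. */
        ToyReader getReader() { return new ToyReader(keyIndex); }
    }

    static class ToyReader {
        final ToyIndex key;
        ToyReader(ToyIndex key) { this.key = key; }
    }

    static class ToyCluster extends ToyIndex {
        final ToyIndex[] localIndex;

        ToyCluster(String name, ToyIndex... localIndex) {
            super(name);
            this.localIndex = localIndex;
            // Upon creation, the cluster becomes the key index of all its local indices.
            for (ToyIndex local : localIndex) local.keyIndex(this);
        }
    }

    public static void main(String[] args) {
        ToyIndex local = new ToyIndex("local-0");
        ToyReader before = local.getReader();          // created before clustering
        ToyCluster cluster = new ToyCluster("global", local);
        ToyReader after = local.getReader();           // created after clustering
        System.out.println(before.key.name + " " + after.key.name);
    }
}
```

In this model a reader obtained before the cluster is built keeps the local index as its key, whereas one obtained afterwards reports the cluster, which matches the note that only readers created afterwards are influenced.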

Overrides:
keyIndex in class Index
Parameters:
newKeyIndex - the new index to be used as a key for interval retrieval.