it.unimi.di.mg4j.index
Class Index

java.lang.Object
  extended by it.unimi.di.mg4j.index.Index
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
BitStreamIndex, IndexCluster, QuasiSuccinctIndex, RemoteIndex

public abstract class Index
extends Object
implements Serializable

An abstract representation of an index.

Concrete subclasses of this class represent abstract index access information: for instance, the basename or IP address/port, flags, etc. An index makes it easy to build index readers over it; index readers, in turn, provide document iterators.

This class contains just method declarations, and attributes for all data that is common to any form of index. Note that we use an abstract class, rather than an interface, because interfaces do not allow attributes to be declared.

We provide static factory methods (e.g., getInstance(CharSequence)) that return an index given a suitable URI string. If the scheme part is mg4j, then the URI is assumed to point at a remote index. Otherwise, it is assumed to be the basename of a local index. In both cases, a query part introduced by ? can specify additional parameters (key=value pairs separated by ;). For instance, the URI example?inmemory=1 will load the index with basename example, caching its content in core memory. Please have a look at constants in Index.UriKeys (and analogous enums in subclasses) for additional parameters.
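The URI convention described above can be illustrated with a small stand-alone sketch. This is not MG4J's own parser: the class IndexUriSketch and its fields are hypothetical, and downcasing the keys is an assumption made here to mirror the "downcased" wording used for Index.UriKeys.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: splits an index URI into its location and its key=value parameters. */
final class IndexUriSketch {
    /** True if the scheme part is mg4j, that is, the URI points at a remote index. */
    final boolean remote;
    /** The part before '?': a local basename, or the remote address. */
    final String location;
    /** The parameters of the query part, in order of appearance. */
    final Map<String, String> parameters = new LinkedHashMap<>();

    IndexUriSketch(String uri) {
        int q = uri.indexOf('?');
        location = q < 0 ? uri : uri.substring(0, q);
        remote = location.startsWith("mg4j:");
        if (q >= 0) {
            // Pairs are separated by ';' and each pair has the form key=value.
            for (String pair : uri.substring(q + 1).split(";")) {
                if (pair.isEmpty()) continue;
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                // Assumption of this sketch: keys are compared in lower case.
                parameters.put(key.toLowerCase(Locale.ROOT), value);
            }
        }
    }
}
```

With this sketch, example?inmemory=1 yields the location example and the single parameter inmemory=1, whereas a URI starting with mg4j: is flagged as remote.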

If the index is local, by convention this class will locate a property file with extension DiskBasedIndex.PROPERTIES_EXTENSION that is expected to contain a number of key/value pairs (which are quite informative and can be examined manually). In particular, the key Index.PropertyKeys.INDEXCLASS explains which kind of index class should be used to read the index. The file might contain additional keys depending on the value of Index.PropertyKeys.INDEXCLASS (e.g., QuasiSuccinctIndex.PropertyKeys.BYTEORDER). An index usually exposes term or prefix maps and the size list, but this is not compulsory (the latter, in particular, is necessary with certain codings).
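Since the property file is a plain list of key/value pairs, it can be inspected with the standard java.util.Properties class. The following is a minimal sketch under stated assumptions: the class IndexPropertiesSketch is hypothetical, the file content is passed as a string rather than read from disk, and the literal key name is supplied by the caller because the exact spelling used in real property files is not stated here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only: looks up the index class name in the content of a property file. */
final class IndexPropertiesSketch {
    /**
     * Returns the value stored under indexClassKey.
     *
     * @throws IllegalArgumentException if the key is missing.
     */
    static String indexClass(String propertyFileContent, String indexClassKey) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(propertyFileContent));
        } catch (IOException e) {
            // Cannot really happen with a StringReader; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        String name = properties.getProperty(indexClassKey);
        if (name == null) throw new IllegalArgumentException("Missing key: " + indexClassKey);
        return name;
    }
}
```

A factory method would then typically load the named class reflectively, which is consistent with the reflection-related exceptions declared by the getInstance() methods.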

Thread safety

Indices are a natural candidate for multithreaded access. An instance of this class must be thread safe as long as external data structures provided to its constructors are. For instance, the tool IndexBuilder generates a synchronized ImmutableExternalPrefixMap so that by default the resulting index is thread safe.

For instance, a DiskBasedIndex requires a list of term offsets, term maps, etc. As long as all these data structures are thread safe, the same is true of the index. Data structures created by static factory methods such as DiskBasedIndex.getInstance(CharSequence) are thread safe.

Note that IndexReaders returned by getReader() are not thread safe (even if the method getReader() is). The logic behind this arrangement is that you create as many readers as you need, and then Closeable.close() them. In a multithreaded environment, a pool of index readers can be created, and a custom QueryBuilderVisitor can be used to build DocumentIterators using the given pool of readers. In this case readers are not closed, but rather reused.
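The pooling arrangement described above might look as follows. This is a generic sketch, not code from MG4J: ReaderPoolSketch is a hypothetical class, and the type parameter R stands in for IndexReader. In real code the factory would wrap a call to getReader(), which throws IOException and would therefore need to be handled inside the supplier.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Illustrative only: a fixed-size pool of readers that are not thread safe. */
final class ReaderPoolSketch<R> {
    private final BlockingQueue<R> idle;

    ReaderPoolSketch(int size, Supplier<R> factory) {
        idle = new ArrayBlockingQueue<>(size);
        // All readers are created up front and never closed by the pool.
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    /** Borrows a reader, gives the task exclusive access to it, then puts it back for reuse. */
    <T> T withReader(Function<R, T> task) {
        final R reader;
        try {
            reader = idle.take(); // blocks while every reader is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a reader", e);
        }
        try {
            return task.apply(reader);
        } finally {
            idle.add(reader);
        }
    }

    /** The number of readers currently not borrowed. */
    int idleReaders() {
        return idle.size();
    }
}
```

Each thread sees a reader only while it holds it, so the readers themselves need no synchronization; only the queue is shared.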

Read-once load

Implementations of this class are strongly encouraged to offer read-once constructors and factory methods: property files and other data related to the index (but not to an IndexReader) should be read exactly once, and sequentially. This feature is very useful when combining indices.

Since:
0.9
Author:
Paolo Boldi, Sebastiano Vigna
See Also:
Serialized Form

Nested Class Summary
 class Index.EmptyIndexIterator
          An iterator returning no documents based on this index.
static class Index.PropertyKeys
          Symbolic names for properties of an Index.
static class Index.UriKeys
          Keys to be used (downcased) in specifying additional parameters to a MG4J URI.
 
Field Summary
 String field
          The field indexed by this index, or null.
 boolean hasCounts
          Whether this index contains counts.
 boolean hasPayloads
          Whether this index contains payloads; if true, payload is non-null.
 boolean hasPositions
          Whether this index contains positions.
 Index keyIndex
          The index used as a key to retrieve intervals.
 int maxCount
          The maximum number of positions in a position list, or possibly -1 if this index does not have positions.
 int numberOfDocuments
          The number of documents of the collection.
 long numberOfOccurrences
          The number of occurrences of the collection, or possibly -1 if it is unknown.
 long numberOfPostings
          The number of postings (pairs term/document) of the collection.
 int numberOfTerms
          The number of terms of the collection.
 Payload payload
          The payload for this index, or null.
 PrefixMap<? extends CharSequence> prefixMap
          The prefix map for this index, or null if the prefix map was not loaded.
 Properties properties
          The properties of this index.
 ReferenceSet<Index> singletonSet
          An immutable singleton set containing just keyIndex.
 IntList sizes
          The size of each document, or null if sizes are not necessary or not loaded in this index.
 StringMap<? extends CharSequence> termMap
          The term map for this index, or null if the term map was not loaded.
 TermProcessor termProcessor
          The term processor used to build this index.
 
Constructor Summary
protected Index(int numberOfDocuments, int numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntList sizes, Properties properties)
          Creates a new instance, initialising all fields.
 
Method Summary
 IndexIterator documents(CharSequence term)
          Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.
 IndexIterator documents(CharSequence prefix, int limit)
          Creates a number of instances of IndexReader for this index and uses them to return a MultiTermIndexIterator over the documents containing any term out of a set of terms defined by a prefix; the prefix is given explicitly, and unless the index has a prefix map, an UnsupportedOperationException will be thrown.
 IndexIterator documents(int term)
          Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term.
 IndexIterator getEmptyIndexIterator()
           
 IndexIterator getEmptyIndexIterator(CharSequence term)
           
 IndexIterator getEmptyIndexIterator(CharSequence term, int termNumber)
           
 IndexIterator getEmptyIndexIterator(int term)
           
static Index getInstance(CharSequence uri)
          Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading document sizes only if it is necessary.
static Index getInstance(CharSequence uri, boolean randomAccess)
          Returns a new index using the given URI, searching dynamically for term and prefix maps and loading document sizes only if it is necessary.
static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes)
          Returns a new index using the given URI, searching dynamically for term and prefix maps.
static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps)
          Returns a new index using the given URI and no IOFactory.
static Index getInstance(IOFactory ioFactory, CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps)
          Returns a new index using the given URI.
 IndexReader getReader()
          Creates and returns a new IndexReader based on this index, using the default buffer size.
abstract  IndexReader getReader(int bufferSize)
          Creates and returns a new IndexReader based on this index.
protected static TermProcessor getTermProcessor(Properties properties)
           
 void keyIndex(Index newKeyIndex)
          Sets the index used as a key to retrieve intervals from iterators generated from this index.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

field

public final String field
The field indexed by this index, or null.


properties

public final Properties properties
The properties of this index. It is stored here for convenience (for instance, if custom keys are added to the property file), but it may be null.


numberOfDocuments

public final int numberOfDocuments
The number of documents of the collection.


numberOfTerms

public final int numberOfTerms
The number of terms of the collection. This field might be set to -1 in some cases (for instance, in certain documental clusters).


numberOfOccurrences

public final long numberOfOccurrences
The number of occurrences of the collection, or possibly -1 if it is unknown.


numberOfPostings

public final long numberOfPostings
The number of postings (pairs term/document) of the collection.


maxCount

public final int maxCount
The maximum number of positions in a position list, or possibly -1 if this index does not have positions.


payload

public final Payload payload
The payload for this index, or null.


hasPayloads

public final boolean hasPayloads
Whether this index contains payloads; if true, payload is non-null.


hasCounts

public final boolean hasCounts
Whether this index contains counts.


hasPositions

public final boolean hasPositions
Whether this index contains positions.


termProcessor

public final TermProcessor termProcessor
The term processor used to build this index.


singletonSet

public ReferenceSet<Index> singletonSet
An immutable singleton set containing just keyIndex.


keyIndex

public Index keyIndex
The index used as a key to retrieve intervals. Usually equal to this, but it is settable.


termMap

public final StringMap<? extends CharSequence> termMap
The term map for this index, or null if the term map was not loaded.


prefixMap

public final PrefixMap<? extends CharSequence> prefixMap
The prefix map for this index, or null if the prefix map was not loaded.


sizes

public final IntList sizes
The size of each document, or null if sizes are not necessary or not loaded in this index.

Constructor Detail

Index

protected Index(int numberOfDocuments,
                int numberOfTerms,
                long numberOfPostings,
                long numberOfOccurrences,
                int maxCount,
                Payload payload,
                boolean hasCounts,
                boolean hasPositions,
                TermProcessor termProcessor,
                String field,
                StringMap<? extends CharSequence> termMap,
                PrefixMap<? extends CharSequence> prefixMap,
                IntList sizes,
                Properties properties)
Creates a new instance, initialising all fields.

Method Detail

getTermProcessor

protected static TermProcessor getTermProcessor(Properties properties)

getInstance

public static Index getInstance(IOFactory ioFactory,
                                CharSequence uri,
                                boolean randomAccess,
                                boolean documentSizes,
                                boolean maps)
                         throws IOException,
                                ConfigurationException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException
Returns a new index using the given URI.

If uri has scheme mg4j, the index is considered to be remote and index creation delegated to IndexServer.getIndex(String, int, boolean, boolean). Otherwise, we delegate to DiskBasedIndex.getInstance(CharSequence, boolean, boolean, boolean, EnumMap).

Parameters:
ioFactory - the factory that will be used to perform I/O, or null (implying the IOFactory.FILESYSTEM_FACTORY for disk-based indices).
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
Throws:
IOException
ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException

getInstance

public static Index getInstance(CharSequence uri,
                                boolean randomAccess,
                                boolean documentSizes,
                                boolean maps)
                         throws IOException,
                                ConfigurationException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException
Returns a new index using the given URI and no IOFactory.

If uri has scheme mg4j, the index is considered to be remote and index creation delegated to IndexServer.getIndex(String, int, boolean, boolean). Otherwise, we delegate to DiskBasedIndex.getInstance(CharSequence, boolean, boolean, boolean, EnumMap).

Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
Throws:
IOException
ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException

getInstance

public static Index getInstance(CharSequence uri,
                                boolean randomAccess,
                                boolean documentSizes)
                         throws IOException,
                                ConfigurationException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps.

Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
Throws:
IOException
ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
See Also:
getInstance(CharSequence, boolean, boolean, boolean)

getInstance

public static Index getInstance(CharSequence uri,
                                boolean randomAccess)
                         throws ConfigurationException,
                                IOException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps and loading document sizes only if it is necessary.

Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
Throws:
ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
See Also:
getInstance(CharSequence, boolean, boolean)

getInstance

public static Index getInstance(CharSequence uri)
                         throws ConfigurationException,
                                IOException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading document sizes only if it is necessary.

Parameters:
uri - the URI defining the index.
Throws:
ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
See Also:
getInstance(CharSequence, boolean)

getEmptyIndexIterator

public IndexIterator getEmptyIndexIterator()

getEmptyIndexIterator

public IndexIterator getEmptyIndexIterator(int term)

getEmptyIndexIterator

public IndexIterator getEmptyIndexIterator(CharSequence term)

getEmptyIndexIterator

public IndexIterator getEmptyIndexIterator(CharSequence term,
                                           int termNumber)

getReader

public IndexReader getReader()
                      throws IOException
Creates and returns a new IndexReader based on this index, using the default buffer size. After that, you can use the reader to read this index.

Returns:
a new IndexReader to read this index.
Throws:
IOException

getReader

public abstract IndexReader getReader(int bufferSize)
                               throws IOException
Creates and returns a new IndexReader based on this index. After that, you can use the reader to read this index.

Parameters:
bufferSize - the size of the buffer to be used accessing the reader, or -1 for a default buffer size.
Returns:
a new IndexReader to read this index.
Throws:
IOException

documents

public IndexIterator documents(int term)
                        throws IOException
Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term.

Since the reader is created from scratch, it is essential to dispose of the returned iterator after usage. See IndexReader.documents(int) for a method with the same semantics, but making reader reuse possible.

Parameters:
term - a term.
Throws:
IOException - if an exception occurred while accessing the index.
UnsupportedOperationException - if this index is not accessible by term number.
See Also:
IndexReader.documents(int)

documents

public IndexIterator documents(CharSequence term)
                        throws IOException
Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.

Since the reader is created from scratch, it is essential to dispose of the returned iterator after usage. See IndexReader.documents(CharSequence) for a method with the same semantics, but making reader reuse possible.

Unless the term processor of this index is null, words coming from a query will have to be processed before being used with this method.

Parameters:
term - a term.
Throws:
IOException - if an exception occurred while accessing the index.
UnsupportedOperationException - if the term map is not available for this index.
See Also:
IndexReader.documents(CharSequence)
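The remark about term processing can be made concrete with a stand-alone sketch. It does not use MG4J types: TermLookupSketch is hypothetical, a UnaryOperator stands in for the term processor, and a HashMap stands in for the term map. The point it illustrates is that a lookup by explicit term does no processing, so a word coming from a query must first go through the same processor that was used at indexing time.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Illustrative only: explicit term lookup versus lookup of a raw query word. */
final class TermLookupSketch {
    /** A downcasing processor, used here as an example. */
    static final UnaryOperator<String> DOWNCASE = s -> s.toLowerCase(Locale.ROOT);

    private final Map<String, Integer> termMap = new HashMap<>();
    private final UnaryOperator<String> termProcessor;

    /** Indexes the given words, processing each one first; a null processor means no processing. */
    TermLookupSketch(UnaryOperator<String> termProcessor, String... indexedWords) {
        this.termProcessor = termProcessor;
        for (String word : indexedWords) termMap.putIfAbsent(process(word), termMap.size());
    }

    private String process(String word) {
        return termProcessor == null ? word : termProcessor.apply(word);
    }

    /** Looks up a term exactly as given; returns its number, or -1 if it is not in the map. */
    int lookupExplicit(String term) {
        return termMap.getOrDefault(term, -1);
    }

    /** Processes a raw query word and then looks it up. */
    int lookupQueryWord(String word) {
        return lookupExplicit(process(word));
    }
}
```

With the downcasing processor and the indexed words Index and Reader, an explicit lookup of Index finds nothing, while the processed form index is found.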

documents

public IndexIterator documents(CharSequence prefix,
                               int limit)
                        throws IOException,
                               TooManyTermsException
Creates a number of instances of IndexReader for this index and uses them to return a MultiTermIndexIterator over the documents containing any term out of a set of terms defined by a prefix; the prefix is given explicitly, and unless the index has a prefix map, an UnsupportedOperationException will be thrown.

Parameters:
prefix - a prefix.
limit - a limit on the number of terms that will be used to resolve the prefix query; if the terms starting with prefix are more than limit, a TooManyTermsException will be thrown.
Throws:
UnsupportedOperationException - if this index cannot resolve prefixes.
TooManyTermsException - if there are more than limit terms starting with prefix.
IOException
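The role of limit can be sketched without any MG4J type. In the hypothetical class below, a sorted set of strings stands in for the prefix map, and the nested TooManyTerms exception is an unchecked stand-in for TooManyTermsException; how the real prefix map enumerates terms is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

/** Illustrative only: resolves a prefix against a sorted term set, subject to a limit. */
final class PrefixResolutionSketch {
    /** Hypothetical, unchecked stand-in for TooManyTermsException. */
    static final class TooManyTerms extends RuntimeException {
        TooManyTerms(String message) {
            super(message);
        }
    }

    /** Returns the terms starting with prefix, failing if there are more than limit of them. */
    static List<String> resolve(NavigableSet<String> terms, String prefix, int limit) {
        List<String> result = new ArrayList<>();
        // In lexicographic order the terms sharing a prefix are contiguous,
        // and the first candidate is the smallest term not less than the prefix.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) break;
            if (result.size() == limit)
                throw new TooManyTerms("More than " + limit + " terms start with " + prefix);
            result.add(term);
        }
        return result;
    }
}
```

Each resolved term would then get its own reader and iterator, which is why the method creates several IndexReader instances and why bounding their number is useful.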

keyIndex

public void keyIndex(Index newKeyIndex)
Sets the index used as a key to retrieve intervals from iterators generated from this index.

This setter is a compromise between clarity of design and efficiency. Each index iterator is based on an index, and when that index is passed to DocumentIterator.intervalIterator(Index), intervals corresponding to the positions of the term in the current document are returned. Analogously, DocumentIterator.indices() returns a singleton set containing the index. However, when composing indices into clusters, iterators generated by a local index must often act as if they really belonged to the global index. This method allows setting the index that is used as a key to return intervals, and that is contained in singletonSet.

Note that setting this value will only influence index readers created afterwards.

Parameters:
newKeyIndex - the new index to be used as a key for interval retrieval.
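The statement that only readers created afterwards are affected can be pictured with a toy model. The classes below are hypothetical and model nothing but the key: it is assumed, for illustration, that a reader captures the key index at creation time, which is one simple way to obtain the documented behaviour.

```java
/** Illustrative only: a reader captures the key index that is current when it is created. */
final class KeyIndexSketch {
    /** Stand-in for an index; only the key-index mechanics are modelled. */
    static final class SketchIndex {
        final String name;
        /** Usually equal to this, but settable. */
        SketchIndex keyIndex = this;

        SketchIndex(String name) {
            this.name = name;
        }

        void keyIndex(SketchIndex newKeyIndex) {
            keyIndex = newKeyIndex;
        }

        SketchReader getReader() {
            return new SketchReader(keyIndex);
        }
    }

    /** Stand-in for a reader; its iterators would report key as their index. */
    static final class SketchReader {
        final SketchIndex key;

        SketchReader(SketchIndex key) {
            this.key = key;
        }
    }
}
```

In a cluster, each local index would have its key set to the global index before any reader is requested, so that every iterator reports the global index.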