Index (MG4J 5.1)

it.unimi.di.mg4j.index
Class Index

java.lang.Object
  it.unimi.di.mg4j.index.Index

All Implemented Interfaces:: Serializable

Direct Known Subclasses:: BitStreamIndex, IndexCluster, QuasiSuccinctIndex, RemoteIndex

public abstract class Index
extends Object
implements Serializable

An abstract representation of an index.

Concrete subclasses of this class represent abstract index access information: for instance, the basename or IP address/port, flags, etc. This class makes it easy to build index readers over the index; in turn, index readers provide document iterators.

This class contains just method declarations, and attributes for all data that is common to any form of index. Note that we use an abstract class, rather than an interface, because interfaces do not allow attributes to be declared.

We provide static factory methods (e.g., getInstance(CharSequence)) that return an index given a suitable URI string. If the scheme part is mg4j, then the URI is assumed to point at a remote index. Otherwise, it is assumed to be the basename of a local index. In both cases, a query part introduced by ? can specify additional parameters (key=value pairs separated by ;). For instance, the URI example?inmemory=1 will load the index with basename example, caching its content in core memory. Please have a look at constants in Index.UriKeys (and analogous enums in subclasses) for additional parameters.
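The URI shape described above (an optional mg4j scheme, then a location, then ?key=value pairs separated by ;) can be illustrated with a small stand-alone sketch. This is not MG4J's own parser: the class IndexUriSketch and its fields are hypothetical names introduced only for illustration, and lowercasing the keys is an assumption based on the description of Index.UriKeys.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: NOT MG4J's parser, just the URI shape described in the text.
final class IndexUriSketch {
    final boolean remote;              // true if the string uses the mg4j scheme
    final String location;             // everything before '?': basename or remote address
    final Map<String, String> params;  // key=value pairs taken from the query part

    private IndexUriSketch(boolean remote, String location, Map<String, String> params) {
        this.remote = remote;
        this.location = location;
        this.params = params;
    }

    static IndexUriSketch parse(String uri) {
        int q = uri.indexOf('?');
        String location = q < 0 ? uri : uri.substring(0, q);
        Map<String, String> params = new LinkedHashMap<>();
        if (q >= 0) {
            // Pairs are separated by ';', keys and values by '='.
            for (String pair : uri.substring(q + 1).split(";")) {
                if (pair.isEmpty()) continue;
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                // Assumption: keys are matched in lowercase.
                params.put(key.toLowerCase(Locale.ROOT), value);
            }
        }
        return new IndexUriSketch(location.startsWith("mg4j:"), location, params);
    }

    public static void main(String[] args) {
        IndexUriSketch u = parse("example?inmemory=1");
        System.out.println(u.remote + " " + u.location + " " + u.params);
    }
}
```

In real code the string would simply be passed to getInstance(CharSequence); the sketch only shows which parts of the string carry which information.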

If the index is local, by convention this class will locate a property file with extension DiskBasedIndex.PROPERTIES_EXTENSION that is expected to contain a number of key/value pairs (which are quite informative and can be examined manually). In particular, the key Index.PropertyKeys.INDEXCLASS explains which kind of index class should be used to read the index. The file might contain additional keys depending on the value of Index.PropertyKeys.INDEXCLASS (e.g., QuasiSuccinctIndex.PropertyKeys.BYTEORDER). An index usually exposes term or prefix maps and the size list but this is not compulsory (the latter, in particular, is necessary with certain codings).
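As a rough illustration of the lookup described above, the following sketch reads key/value pairs and extracts the entry naming the index class. It rests on several assumptions: it uses java.util.Properties, which may not be the Properties type used by this class; the key spelling indexclass and the value example.SomeIndexClass are placeholders; and the class PropertyFileSketch is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: reads key/value pairs and picks out the index-class entry.
final class PropertyFileSketch {
    // Assumption: placeholder spelling for the key behind Index.PropertyKeys.INDEXCLASS.
    static final String INDEX_CLASS_KEY = "indexclass";

    static String indexClassOf(String propertyFileContent) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertyFileContent));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String indexClass = p.getProperty(INDEX_CLASS_KEY);
        if (indexClass == null) {
            throw new IllegalArgumentException("missing key: " + INDEX_CLASS_KEY);
        }
        return indexClass;
    }

    public static void main(String[] args) {
        // Placeholder content; a real property file has more keys.
        String content = "documents=1000\nindexclass=example.SomeIndexClass\n";
        System.out.println(indexClassOf(content));
    }
}
```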

Thread safety

Indices are a natural candidate for multithreaded access. An instance of this class must be thread safe as long as external data structures provided to its constructors are. For instance, the tool IndexBuilder generates a synchronized ImmutableExternalPrefixMap so that by default the resulting index is thread safe.

For instance, a DiskBasedIndex requires a list of term offsets, term maps, etc. As long as all these data structures are thread safe, the same is true of the index. Data structures created by static factory methods such as DiskBasedIndex.getInstance(CharSequence) are thread safe.

Note that IndexReaders returned by getReader() are not thread safe (even if the method getReader() is). The logic behind this arrangement is that you create as many readers as you need, and then Closeable.close() them. In a multithreaded environment, a pool of index readers can be created, and a custom QueryBuilderVisitor can be used to build DocumentIterators using the given pool of readers. In this case readers are not closed, but rather reused.
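The pooling arrangement mentioned above might look roughly like the following sketch. It is generic on purpose and does not use any MG4J type: ReaderPoolSketch, withReader and idleCount are hypothetical names, and in real code the type parameter would stand for IndexReader, with the factory wrapping getReader() and its IOException.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Illustrative only: a fixed-size pool that lends out non-thread-safe readers one at a time.
final class ReaderPoolSketch<R> {
    private final BlockingQueue<R> idle;

    ReaderPoolSketch(int size, Supplier<R> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    // Borrows a reader, runs the task with it, and always returns it to the pool.
    <T> T withReader(Function<R, T> task) {
        R reader;
        try {
            reader = idle.take(); // blocks while every reader is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a reader", e);
        }
        try {
            return task.apply(reader);
        } finally {
            idle.add(reader); // readers are reused, not closed
        }
    }

    int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) {
        // StringBuilder stands in for a reader: like a reader, it is not thread safe.
        ReaderPoolSketch<StringBuilder> pool = new ReaderPoolSketch<>(2, StringBuilder::new);
        Integer length = pool.withReader(r -> r.append("x").length());
        System.out.println(length + " " + pool.idleCount());
    }
}
```

Each thread only ever touches the reader it has borrowed, so the readers themselves need no synchronization; the queue is the only shared structure.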

Read-once load

Implementations of this class are strongly encouraged to offer read-once constructors and factory methods: property files and other data related to the index (but not to an IndexReader) should be read exactly once, and sequentially. This feature is very useful when combining indices.

Since:: 0.9
Author:: Paolo Boldi, Sebastiano Vigna
See Also:: Serialized Form

Nested Class Summary
`class`	`Index.EmptyIndexIterator` An iterator returning no documents based on this index.
`static class`	`Index.PropertyKeys` Symbolic names for properties of an `Index`.
`static class`	`Index.UriKeys` Keys to be used (downcased) in specifying additional parameters to an MG4J URI.

Field Summary
`String`	`field` The field indexed by this index, or `null`.
`boolean`	`hasCounts` Whether this index contains counts.
`boolean`	`hasPayloads` Whether this index contains payloads; if true, `payload` is non-`null`.
`boolean`	`hasPositions` Whether this index contains positions.
`Index`	`keyIndex` The index used as a key to retrieve intervals.
`int`	`maxCount` The maximum number of positions in a position list, or possibly -1 if this index does not have positions.
`int`	`numberOfDocuments` The number of documents of the collection.
`long`	`numberOfOccurrences` The number of occurrences of the collection, or possibly -1 if it is unknown.
`long`	`numberOfPostings` The number of postings (pairs term/document) of the collection.
`int`	`numberOfTerms` The number of terms of the collection.
`Payload`	`payload` The payload for this index, or `null`.
`PrefixMap<? extends CharSequence>`	`prefixMap` The prefix map for this index, or `null` if the prefix map was not loaded.
`Properties`	`properties` The properties of this index.
`ReferenceSet<Index>`	`singletonSet` An immutable singleton set containing just `keyIndex`.
`IntList`	`sizes` The size of each document, or `null` if sizes are not necessary or not loaded in this index.
`StringMap<? extends CharSequence>`	`termMap` The term map for this index, or `null` if the term map was not loaded.
`TermProcessor`	`termProcessor` The term processor used to build this index.

Constructor Summary
`protected`	`Index(int numberOfDocuments, int numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntList sizes, Properties properties)` Creates a new instance, initialising all fields.

Method Summary
`IndexIterator`	`documents(CharSequence term)` Creates a new `IndexReader` for this index and uses it to return an index iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.
`IndexIterator`	`documents(CharSequence prefix, int limit)` Creates a number of instances of `IndexReader` for this index and uses them to return a `MultiTermIndexIterator` over the documents containing any term out of a set of terms defined by a prefix; the prefix is given explicitly, and unless the index has a prefix map, an `UnsupportedOperationException` will be thrown.
`IndexIterator`	`documents(int term)` Creates a new `IndexReader` for this index and uses it to return an index iterator over the documents containing a term.
`IndexIterator`	`getEmptyIndexIterator()`
`IndexIterator`	`getEmptyIndexIterator(CharSequence term)`
`IndexIterator`	`getEmptyIndexIterator(CharSequence term, int termNumber)`
`IndexIterator`	`getEmptyIndexIterator(int term)`
`static Index`	`getInstance(CharSequence uri)` Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading document sizes only if it is necessary.
`static Index`	`getInstance(CharSequence uri, boolean randomAccess)` Returns a new index using the given URI, searching dynamically for term and prefix maps and loading document sizes only if it is necessary.
`static Index`	`getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes)` Returns a new index using the given URI, searching dynamically for term and prefix maps.
`static Index`	`getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps)` Returns a new index using the given URI and no `IOFactory`.
`static Index`	`getInstance(IOFactory ioFactory, CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps)` Returns a new index using the given URI.
`IndexReader`	`getReader()` Creates and returns a new `IndexReader` based on this index, using the default buffer size.
`abstract IndexReader`	`getReader(int bufferSize)` Creates and returns a new `IndexReader` based on this index.
`protected static TermProcessor`	`getTermProcessor(Properties properties)`
`void`	`keyIndex(Index newKeyIndex)` Sets the index used as a key to retrieve intervals from iterators generated from this index.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

field

public final String field

The field indexed by this index, or null.

properties

public final Properties properties

The properties of this index. It is stored here for convenience (for instance, if custom keys are added to the property file), but it may be null.

numberOfDocuments

public final int numberOfDocuments

The number of documents of the collection.

numberOfTerms

public final int numberOfTerms

The number of terms of the collection. This field might be set to -1 in some cases (for instance, in certain documental clusters).

numberOfOccurrences

public final long numberOfOccurrences

The number of occurrences of the collection, or possibly -1 if it is unknown.

numberOfPostings

public final long numberOfPostings

The number of postings (pairs term/document) of the collection.

maxCount

public final int maxCount

The maximum number of positions in a position list, or possibly -1 if this index does not have positions.

payload

public final Payload payload

The payload for this index, or null.

hasPayloads

public final boolean hasPayloads

Whether this index contains payloads; if true, payload is non-null.

hasCounts

public final boolean hasCounts

Whether this index contains counts.

hasPositions

public final boolean hasPositions

Whether this index contains positions.

termProcessor

public final TermProcessor termProcessor

The term processor used to build this index.

singletonSet

public ReferenceSet<Index> singletonSet

An immutable singleton set containing just keyIndex.

keyIndex

public Index keyIndex

The index used as a key to retrieve intervals. Usually equal to this, but it is settable.

termMap

public final StringMap<? extends CharSequence> termMap

The term map for this index, or null if the term map was not loaded.

prefixMap

public final PrefixMap<? extends CharSequence> prefixMap

The prefix map for this index, or null if the prefix map was not loaded.

sizes

public final IntList sizes

The size of each document, or null if sizes are not necessary or not loaded in this index.

Constructor Detail

Index

protected Index(int numberOfDocuments,
                int numberOfTerms,
                long numberOfPostings,
                long numberOfOccurrences,
                int maxCount,
                Payload payload,
                boolean hasCounts,
                boolean hasPositions,
                TermProcessor termProcessor,
                String field,
                StringMap<? extends CharSequence> termMap,
                PrefixMap<? extends CharSequence> prefixMap,
                IntList sizes,
                Properties properties)

Creates a new instance, initialising all fields.

Method Detail

getTermProcessor

protected static TermProcessor getTermProcessor(Properties properties)

getInstance

public static Index getInstance(IOFactory ioFactory,
                                CharSequence uri,
                                boolean randomAccess,
                                boolean documentSizes,
                                boolean maps)
                         throws IOException,
                                ConfigurationException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException

Returns a new index using the given URI.

If uri has scheme mg4j, the index is considered to be remote and index creation delegated to IndexServer.getIndex(String, int, boolean, boolean). Otherwise, we delegate to DiskBasedIndex.getInstance(CharSequence, boolean, boolean, boolean, EnumMap).

Parameters:: ioFactory - the factory that will be used to perform I/O, or null (implying the IOFactory.FILESYSTEM_FACTORY for disk-based indices).; uri - the URI defining the index.; randomAccess - whether the index should be accessible randomly.; documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).; maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
Throws:: IOException; ConfigurationException; URISyntaxException; ClassNotFoundException; SecurityException; InstantiationException; IllegalAccessException; InvocationTargetException; NoSuchMethodException

getInstance

public static Index getInstance(CharSequence uri,
                                boolean randomAccess,
                                boolean documentSizes,
                                boolean maps)
                         throws IOException,
                                ConfigurationException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException

Returns a new index using the given URI and no IOFactory.

Parameters:: uri - the URI defining the index.; randomAccess - whether the index should be accessible randomly.; documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).; maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
Throws:: IOException; ConfigurationException; URISyntaxException; ClassNotFoundException; SecurityException; InstantiationException; IllegalAccessException; InvocationTargetException; NoSuchMethodException

getInstance

public static Index getInstance(CharSequence uri,
                                boolean randomAccess,
                                boolean documentSizes)
                         throws IOException,
                                ConfigurationException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException

Returns a new index using the given URI, searching dynamically for term and prefix maps.

Parameters:: uri - the URI defining the index.; randomAccess - whether the index should be accessible randomly.; documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
Throws:: IOException; ConfigurationException; URISyntaxException; ClassNotFoundException; SecurityException; InstantiationException; IllegalAccessException; InvocationTargetException; NoSuchMethodException
See Also:: getInstance(CharSequence, boolean, boolean, boolean)

getInstance

public static Index getInstance(CharSequence uri,
                                boolean randomAccess)
                         throws ConfigurationException,
                                IOException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException

Returns a new index using the given URI, searching dynamically for term and prefix maps and loading document sizes only if it is necessary.

Parameters:: uri - the URI defining the index.; randomAccess - whether the index should be accessible randomly.
Throws:: ConfigurationException; IOException; URISyntaxException; ClassNotFoundException; SecurityException; InstantiationException; IllegalAccessException; InvocationTargetException; NoSuchMethodException
See Also:: getInstance(CharSequence, boolean, boolean)

getInstance

public static Index getInstance(CharSequence uri)
                         throws ConfigurationException,
                                IOException,
                                URISyntaxException,
                                ClassNotFoundException,
                                SecurityException,
                                InstantiationException,
                                IllegalAccessException,
                                InvocationTargetException,
                                NoSuchMethodException

Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading document sizes only if it is necessary.

Parameters:: uri - the URI defining the index.
Throws:: ConfigurationException; IOException; URISyntaxException; ClassNotFoundException; SecurityException; InstantiationException; IllegalAccessException; InvocationTargetException; NoSuchMethodException
See Also:: getInstance(CharSequence, boolean)

getEmptyIndexIterator

public IndexIterator getEmptyIndexIterator()

getEmptyIndexIterator

public IndexIterator getEmptyIndexIterator(int term)

getEmptyIndexIterator

public IndexIterator getEmptyIndexIterator(CharSequence term)

getEmptyIndexIterator

public IndexIterator getEmptyIndexIterator(CharSequence term,
                                           int termNumber)

getReader

public IndexReader getReader()
                      throws IOException

Creates and returns a new IndexReader based on this index, using the default buffer size. After that, you can use the reader to read this index.

Returns:: a new IndexReader to read this index.
Throws:: IOException

getReader

public abstract IndexReader getReader(int bufferSize)
                               throws IOException

Creates and returns a new IndexReader based on this index. After that, you can use the reader to read this index.

Parameters:: bufferSize - the size of the buffer to be used accessing the reader, or -1 for a default buffer size.
Returns:: a new IndexReader to read this index.
Throws:: IOException

documents

public IndexIterator documents(int term)
                        throws IOException

Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term.

Since the reader is created from scratch, it is essential to dispose of the returned iterator after usage. See IndexReader.documents(int) for a method with the same semantics, but making reader reuse possible.

Parameters:: term - a term.
Throws:: IOException - if an exception occurred while accessing the index.; UnsupportedOperationException - if this index is not accessible by term number.
See Also:: IndexReader.documents(int)

documents

public IndexIterator documents(CharSequence term)
                        throws IOException

Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.

Unless the term processor of this index is null, words coming from a query will have to be processed before being used with this method.

Parameters:: term - a term.
Throws:: IOException - if an exception occurred while accessing the index.; UnsupportedOperationException - if the term map is not available for this index.
See Also:: IndexReader.documents(CharSequence)

documents

public IndexIterator documents(CharSequence prefix,
                               int limit)
                        throws IOException,
                               TooManyTermsException

Creates a number of instances of IndexReader for this index and uses them to return a MultiTermIndexIterator over the documents containing any term out of a set of terms defined by a prefix; the prefix is given explicitly, and unless the index has a prefix map, an UnsupportedOperationException will be thrown.

Parameters:: prefix - a prefix.; limit - a limit on the number of terms that will be used to resolve the prefix query; if the terms starting with prefix are more than limit, a TooManyTermsException will be thrown.
Throws:: UnsupportedOperationException - if this index cannot resolve prefixes.; TooManyTermsException - if there are more than limit terms starting with prefix.; IOException
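The role of limit can be illustrated with a stand-alone sketch that resolves a prefix against a sorted set of terms and gives up once more than limit terms match. This is not how the prefix map is implemented: PrefixLimitSketch and TooManyTermsSketch are hypothetical names, a TreeSet stands in for the prefix map, and the sketch uses an unchecked exception where the real method declares TooManyTermsException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustrative only: prefix resolution with a cap on the number of matching terms.
final class PrefixLimitSketch {
    // Unchecked stand-in for TooManyTermsException.
    static final class TooManyTermsSketch extends RuntimeException {
        TooManyTermsSketch(String message) {
            super(message);
        }
    }

    static List<String> resolve(NavigableSet<String> terms, String prefix, int limit) {
        List<String> matches = new ArrayList<>();
        // In sorted order, the terms starting with the prefix form one contiguous run
        // that begins at the first term not smaller than the prefix.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) break;
            if (matches.size() == limit) {
                throw new TooManyTermsSketch("more than " + limit + " terms start with " + prefix);
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList("car", "card", "care", "cat", "dog"));
        System.out.println(resolve(terms, "car", 3));
    }
}
```

In the real method, each resolved term would then get its own reader, and the resulting iterators would be merged into a single MultiTermIndexIterator.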

keyIndex

public void keyIndex(Index newKeyIndex)

Sets the index used as a key to retrieve intervals from iterators generated from this index.

This setter is a compromise between clarity of design and efficiency. Each index iterator is based on an index, and when that index is passed to DocumentIterator.intervalIterator(Index), intervals corresponding to the positions of the term in the current document are returned. Analogously, DocumentIterator.indices() returns a singleton set containing the index. However, when composing indices into clusters, often iterators generated by a local index must act as if they really belong to the global index. This method allows setting the index that is used as a key to return intervals, and that is contained in singletonSet.

Note that setting this value will only influence index readers created afterwards.

Parameters:: newKeyIndex - the new index to be used as a key for interval retrieval.
