Class Index
- java.lang.Object
-
- it.unimi.di.big.mg4j.index.Index
-
- All Implemented Interfaces: Serializable
- Direct Known Subclasses: BitStreamIndex, IndexCluster, QuasiSuccinctIndex
public abstract class Index extends Object implements Serializable
An abstract representation of an index. Concrete subclasses of this class represent abstract index access information: for instance, the basename or IP address/port, flags, etc. The class makes it easy to build index readers over the index; in turn, index readers provide document iterators.
This class contains just method declarations, and attributes for all data that is common to any form of index. Note that we use an abstract class, rather than an interface, because interfaces do not allow attributes to be declared.
We provide static factory methods (e.g., getInstance(CharSequence)) that return an index given a suitable URI string. If the scheme part is mg4j, then the URI is assumed to point at a remote index. Otherwise, it is assumed to be the basename of a local index. In both cases, a query part introduced by ? can specify additional parameters (key=value pairs separated by ;). For instance, the URI example?inmemory=1 will load the index with basename example, caching its content in core memory. Please have a look at the constants in Index.UriKeys (and analogous enums in subclasses) for additional parameters.
If the index is local, by convention this class will locate a property file with extension DiskBasedIndex.PROPERTIES_EXTENSION that is expected to contain a number of key/value pairs (which are quite informative and can be examined manually). In particular, the key Index.PropertyKeys.INDEXCLASS explains which kind of index class should be used to read the index. The file might contain additional keys depending on the value of Index.PropertyKeys.INDEXCLASS (e.g., QuasiSuccinctIndex.PropertyKeys.BYTEORDER). An index usually exposes term or prefix maps and the size list, but this is not compulsory (the latter, in particular, is necessary with certain codings).
Thread safety
Indices are a natural candidate for multithreaded access. An instance of this class must be thread safe as long as external data structures provided to its constructors are. For instance, the tool IndexBuilder generates a synchronized ImmutableExternalPrefixMap so that by default the resulting index is thread safe.
For instance, a DiskBasedIndex requires a list of term offsets, term maps, etc. As long as all these data structures are thread safe, the same is true of the index. Data structures created by static factory methods such as DiskBasedIndex.getInstance(CharSequence) are thread safe.
Note that IndexReaders returned by getReader() are not thread safe (even if the method getReader() is). The logic behind this arrangement is that you create as many readers as you need, and then Closeable.close() them. In a multithreaded environment, a pool of index readers can be created, and a custom QueryBuilderVisitor can be used to build DocumentIterators using the given pool of readers. In this case readers are not closed, but rather reused.
Read-once load
Implementations of this class are strongly encouraged to offer read-once constructors and factory methods: property files and other data related to the index (but not to an IndexReader) should be read exactly once, and sequentially. This feature is very useful when combining indices.
- Since:
- 0.9
- Author:
- Paolo Boldi, Sebastiano Vigna
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes (modifier and type, class, description)
- class Index.EmptyIndexIterator: An iterator returning no documents based on this index.
- static class Index.PropertyKeys: Symbolic names for properties of an Index.
- static class Index.UriKeys: Keys to be used (downcased) in specifying additional parameters to an MG4J URI.
-
Field Summary
Fields (modifier and type, field, description)
- String field: The field indexed by this index, or null.
- boolean hasCounts: Whether this index contains counts.
- boolean hasPayloads: Whether this index contains payloads; if true, payload is non-null.
- boolean hasPositions: Whether this index contains positions.
- Index keyIndex: The index used as a key to retrieve intervals.
- int maxCount: The maximum number of positions in a position list, or possibly -1 if this index does not have positions.
- long numberOfDocuments: The number of documents of the collection.
- long numberOfOccurrences: The number of occurrences of the collection, or possibly -1 if it is unknown.
- long numberOfPostings: The number of postings (pairs term/document) of the collection.
- long numberOfTerms: The number of terms of the collection.
- Payload payload: The payload for this index, or null.
- PrefixMap<? extends CharSequence> prefixMap: The prefix map for this index, or null if the prefix map was not loaded.
- Properties properties: The properties of this index.
- ReferenceSet<Index> singletonSet: An immutable singleton set containing just keyIndex.
- IntBigList sizes: The size of each document, or null if sizes are not necessary or not loaded in this index.
- StringMap<? extends CharSequence> termMap: The term map for this index, or null if the term map was not loaded.
- TermProcessor termProcessor: The term processor used to build this index.
-
Constructor Summary
Constructors (modifier, constructor, description)
- protected Index(long numberOfDocuments, long numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntBigList sizes, Properties properties): Creates a new instance, initialising all fields.
-
Method Summary
Methods (modifier and type, method, description)
- IndexIterator documents(long term): Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term.
- IndexIterator documents(CharSequence term): Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.
- IndexIterator documents(CharSequence prefix, int limit): Creates a number of instances of IndexReader for this index and uses them to return a MultiTermIndexIterator over the documents containing any term out of a set of terms defined by a prefix; the prefix is given explicitly, and unless the index has a prefix map, an UnsupportedOperationException will be thrown.
- IndexIterator getEmptyIndexIterator()
- IndexIterator getEmptyIndexIterator(long term)
- IndexIterator getEmptyIndexIterator(CharSequence term)
- IndexIterator getEmptyIndexIterator(CharSequence term, long termNumber)
- static Index getInstance(IOFactory ioFactory, CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps): Returns a new index using the given URI.
- static Index getInstance(CharSequence uri): Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading document sizes only if it is necessary.
- static Index getInstance(CharSequence uri, boolean randomAccess): Returns a new index using the given URI, searching dynamically for term and prefix maps and loading document sizes only if it is necessary.
- static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes): Returns a new index using the given URI, searching dynamically for term and prefix maps.
- static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps): Returns a new index using the given URI and no IOFactory.
- IndexReader getReader(): Creates and returns a new IndexReader based on this index, using the default buffer size.
- abstract IndexReader getReader(int bufferSize): Creates and returns a new IndexReader based on this index.
- protected static TermProcessor getTermProcessor(Properties properties)
- void keyIndex(Index newKeyIndex): Sets the index used as a key to retrieve intervals from iterators generated from this index.
-
-
-
Field Detail
-
field
public final String field
The field indexed by this index, or null.
-
properties
public final Properties properties
The properties of this index. It is stored here for convenience (for instance, if custom keys are added to the property file), but it may be null.
-
numberOfDocuments
public final long numberOfDocuments
The number of documents of the collection.
-
numberOfTerms
public final long numberOfTerms
The number of terms of the collection. This field might be set to -1 in some cases (for instance, in certain documental clusters).
-
numberOfOccurrences
public final long numberOfOccurrences
The number of occurrences of the collection, or possibly -1 if it is unknown.
-
numberOfPostings
public final long numberOfPostings
The number of postings (pairs term/document) of the collection.
-
maxCount
public final int maxCount
The maximum number of positions in a position list, or possibly -1 if this index does not have positions.
-
payload
public final Payload payload
The payload for this index, or null.
-
hasPayloads
public final boolean hasPayloads
Whether this index contains payloads; if true, payload is non-null.
-
hasCounts
public final boolean hasCounts
Whether this index contains counts.
-
hasPositions
public final boolean hasPositions
Whether this index contains positions.
-
termProcessor
public final TermProcessor termProcessor
The term processor used to build this index.
-
singletonSet
public ReferenceSet<Index> singletonSet
An immutable singleton set containing just keyIndex.
-
keyIndex
public Index keyIndex
The index used as a key to retrieve intervals. Usually equal to this, but it is settable.
-
termMap
public final StringMap<? extends CharSequence> termMap
The term map for this index, or null if the term map was not loaded.
-
prefixMap
public final PrefixMap<? extends CharSequence> prefixMap
The prefix map for this index, or null if the prefix map was not loaded.
-
sizes
public final IntBigList sizes
The size of each document, or null if sizes are not necessary or not loaded in this index.
-
-
Constructor Detail
-
Index
protected Index(long numberOfDocuments, long numberOfTerms, long numberOfPostings, long numberOfOccurrences, int maxCount, Payload payload, boolean hasCounts, boolean hasPositions, TermProcessor termProcessor, String field, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, IntBigList sizes, Properties properties)
Creates a new instance, initialising all fields.
-
-
Method Detail
-
getTermProcessor
protected static TermProcessor getTermProcessor(Properties properties)
-
getInstance
public static Index getInstance(IOFactory ioFactory, CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps) throws IOException, org.apache.commons.configuration.ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI.
- Parameters:
ioFactory - the factory that will be used to perform I/O, or null (implying the IOFactory.FILESYSTEM_FACTORY for disk-based indices).
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
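For a local index, the class description says that a property file tells the factory which index class to use. The following is a minimal sketch of that lookup only, not MG4J's implementation: it uses java.util.Properties instead of the library's own property handling, and both the key spelling "indexclass" and the sample values are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch, not MG4J code: finds the index class named in a property file. */
final class IndexPropertiesSketch {
    // Assumed key spelling, standing in for Index.PropertyKeys.INDEXCLASS.
    static final String INDEXCLASS_KEY = "indexclass";

    /** Reads the properties once, sequentially, and returns the declared index class name. */
    static String indexClassName(Reader propertyFile) {
        Properties properties = new Properties();
        try {
            properties.load(propertyFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String className = properties.getProperty(INDEXCLASS_KEY);
        if (className == null) {
            throw new IllegalArgumentException("Missing key: " + INDEXCLASS_KEY);
        }
        return className.trim();
    }

    public static void main(String[] args) {
        // Illustrative file content only; a real property file holds more keys.
        String content = "indexclass = it.unimi.di.big.mg4j.index.QuasiSuccinctIndex\n"
                       + "byteorder = LITTLE_ENDIAN\n";
        System.out.println(indexClassName(new StringReader(content)));
    }
}
```

A real factory would then have to turn the class name into an index instance; that step is left out here because the source does not describe how it is done.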
-
getInstance
public static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes, boolean maps) throws IOException, org.apache.commons.configuration.ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI and no IOFactory.
- Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
-
getInstance
public static Index getInstance(CharSequence uri, boolean randomAccess, boolean documentSizes) throws IOException, org.apache.commons.configuration.ConfigurationException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps.
- Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
- Throws:
IOException
org.apache.commons.configuration.ConfigurationException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
- See Also:
getInstance(CharSequence, boolean, boolean, boolean)
-
getInstance
public static Index getInstance(CharSequence uri, boolean randomAccess) throws org.apache.commons.configuration.ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps and loading document sizes only if it is necessary.
- Parameters:
uri - the URI defining the index.
randomAccess - whether the index should be accessible randomly.
- Throws:
org.apache.commons.configuration.ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
- See Also:
getInstance(CharSequence, boolean, boolean)
-
getInstance
public static Index getInstance(CharSequence uri) throws org.apache.commons.configuration.ConfigurationException, IOException, URISyntaxException, ClassNotFoundException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException
Returns a new index using the given URI, searching dynamically for term and prefix maps, loading offsets but loading document sizes only if it is necessary.
- Parameters:
uri - the URI defining the index.
- Throws:
org.apache.commons.configuration.ConfigurationException
IOException
URISyntaxException
ClassNotFoundException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
- See Also:
getInstance(CharSequence, boolean)
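The URI convention accepted by these factory methods (an optional mg4j scheme, then a query part introduced by ? with key=value pairs separated by ;) can be sketched as follows. This is an illustrative stand-alone parser, not the one MG4J uses; the class name IndexUriSketch, its fields, and the handling of a key without a value are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of MG4J: splits an index URI string into its parts. */
final class IndexUriSketch {
    final boolean remote;               // true if the scheme part is "mg4j"
    final String location;              // everything before '?': a basename for local indices
    final Map<String, String> params;   // key=value pairs from the query part

    private IndexUriSketch(boolean remote, String location, Map<String, String> params) {
        this.remote = remote;
        this.location = location;
        this.params = params;
    }

    static IndexUriSketch parse(CharSequence uri) {
        String s = uri.toString();
        int q = s.indexOf('?');
        String location = q < 0 ? s : s.substring(0, q);
        Map<String, String> params = new LinkedHashMap<>();
        if (q >= 0) {
            for (String pair : s.substring(q + 1).split(";")) {
                if (pair.isEmpty()) continue;
                int eq = pair.indexOf('=');
                // Keys are lowercased; a key without '=' gets an empty value (an assumption).
                String key = (eq < 0 ? pair : pair.substring(0, eq)).toLowerCase(Locale.ROOT);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                params.put(key, value);
            }
        }
        return new IndexUriSketch(location.startsWith("mg4j:"), location, params);
    }

    public static void main(String[] args) {
        IndexUriSketch u = parse("example?inmemory=1");
        System.out.println(u.remote + " " + u.location + " " + u.params);
        // prints "false example {inmemory=1}"
    }
}
```

With the actual library, the same string would simply be passed to getInstance(CharSequence), which interprets the parameters itself.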
-
getEmptyIndexIterator
public IndexIterator getEmptyIndexIterator()
-
getEmptyIndexIterator
public IndexIterator getEmptyIndexIterator(long term)
-
getEmptyIndexIterator
public IndexIterator getEmptyIndexIterator(CharSequence term)
-
getEmptyIndexIterator
public IndexIterator getEmptyIndexIterator(CharSequence term, long termNumber)
-
getReader
public IndexReader getReader() throws IOException
Creates and returns a new IndexReader based on this index, using the default buffer size. After that, you can use the reader to read this index.
- Returns:
- a new IndexReader to read this index.
- Throws:
IOException
-
getReader
public abstract IndexReader getReader(int bufferSize) throws IOException
Creates and returns a new IndexReader based on this index. After that, you can use the reader to read this index.
- Parameters:
bufferSize - the size of the buffer to be used accessing the reader, or -1 for a default buffer size.
- Returns:
- a new IndexReader to read this index.
- Throws:
IOException
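Because readers are not thread safe, the class description suggests keeping a pool of them in a multithreaded environment and reusing them rather than closing them. A minimal sketch of such a pool, under stated assumptions: ReaderPoolSketch is a hypothetical generic class, and a StringBuilder stands in for the reader type so that the example runs without MG4J. In real code the factory would call getReader() on an index.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch: a fixed-size pool of non-thread-safe readers shared by several threads. */
final class ReaderPoolSketch<R> {
    private final BlockingQueue<R> idle;

    ReaderPoolSketch(int size, Supplier<R> readerFactory) {
        idle = new ArrayBlockingQueue<>(size);
        // All readers are created up front, one per pool slot.
        for (int i = 0; i < size; i++) idle.add(readerFactory.get());
    }

    /** Borrows a reader, runs the task on it, and puts the reader back instead of closing it. */
    <T> T withReader(Function<R, T> task) {
        R reader;
        try {
            reader = idle.take();   // blocks until some reader is free
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a reader", e);
        }
        try {
            return task.apply(reader);
        } finally {
            idle.add(reader);       // reuse: only one thread holds a given reader at a time
        }
    }

    int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) {
        // StringBuilder is only a stand-in: like a reader, it is not thread safe.
        ReaderPoolSketch<StringBuilder> pool = new ReaderPoolSketch<>(2, StringBuilder::new);
        int length = pool.withReader(r -> r.append("term").length());
        System.out.println(length + " " + pool.idleCount());
        // prints "4 2"
    }
}
```

The blocking queue guarantees that a reader is never handed to two threads at once, which is the property the pool needs; how the pool is wired into a QueryBuilderVisitor is not covered by this sketch.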
-
documents
public IndexIterator documents(long term) throws IOException
Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term.
Since the reader is created from scratch, it is essential to dispose of the returned iterator after usage. See IndexReader.documents(long) for a method with the same semantics, but making reader reuse possible.
- Parameters:
term - a term.
- Throws:
IOException - if an exception occurred while accessing the index.
UnsupportedOperationException - if this index is not accessible by term number.
- See Also:
IndexReader.documents(long)
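The requirement to dispose of the returned iterator can be illustrated with stand-in types. The sketch below does not use MG4J classes: DisposeSketch, OneShotIterator and the open-reader counter are hypothetical, and they only model the idea that each call opens a reader which is released when the iterator is disposed.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-ins (not MG4J types) showing why a one-shot iterator must be disposed. */
final class DisposeSketch {
    /** Counts readers that have been opened and not yet released. */
    static final AtomicInteger openReaders = new AtomicInteger();

    /** Stand-in for an iterator that owns the reader created on its behalf. */
    static final class OneShotIterator {
        private boolean disposed;

        OneShotIterator() {
            openReaders.incrementAndGet();      // a fresh reader is opened for this iterator
        }

        long nextDocument() {
            return -1;                          // placeholder: this stub has no documents
        }

        void dispose() {
            if (!disposed) {
                disposed = true;
                openReaders.decrementAndGet();  // releases the underlying reader
            }
        }
    }

    /** Mirrors the shape of documents(long): every call creates a reader from scratch. */
    static OneShotIterator documents(long term) {
        return new OneShotIterator();
    }

    public static void main(String[] args) {
        OneShotIterator it = documents(42);
        try {
            it.nextDocument();
        } finally {
            it.dispose();   // without this call the reader behind the iterator stays open
        }
        System.out.println(openReaders.get());
        // prints "0"
    }
}
```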
-
documents
public IndexIterator documents(CharSequence term) throws IOException
Creates a new IndexReader for this index and uses it to return an index iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.
Since the reader is created from scratch, it is essential to dispose of the returned iterator after usage. See IndexReader.documents(long) for a method with the same semantics, but making reader reuse possible.
Unless the term processor of this index is null, words coming from a query will have to be processed before being used with this method.
- Parameters:
term - a term.
- Throws:
IOException - if an exception occurred while accessing the index.
UnsupportedOperationException - if the term map is not available for this index.
- See Also:
IndexReader.documents(CharSequence)
-
documents
public IndexIterator documents(CharSequence prefix, int limit) throws IOException, TooManyTermsException
Creates a number of instances of IndexReader for this index and uses them to return a MultiTermIndexIterator over the documents containing any term out of a set of terms defined by a prefix; the prefix is given explicitly, and unless the index has a prefix map, an UnsupportedOperationException will be thrown.
- Parameters:
prefix - a prefix.
limit - a limit on the number of terms that will be used to resolve the prefix query; if the terms starting with prefix are more than limit, a TooManyTermsException will be thrown.
- Throws:
UnsupportedOperationException - if this index cannot resolve prefixes.
TooManyTermsException - if there are more than limit terms starting with prefix.
IOException
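The role of limit can be shown with a small stand-alone sketch. It is not how MG4J resolves prefixes: a sorted set stands in for the prefix map, and TooManyTerms is a hypothetical unchecked stand-in for TooManyTermsException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical sketch of prefix resolution with a term limit. */
final class PrefixLimitSketch {
    /** Stand-in for TooManyTermsException (unchecked here to keep the sketch short). */
    static final class TooManyTerms extends RuntimeException {
        TooManyTerms(String message) {
            super(message);
        }
    }

    /** Returns the terms starting with prefix, failing if there are more than limit of them. */
    static List<String> resolve(NavigableSet<String> terms, String prefix, int limit) {
        List<String> matches = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, so scan from the prefix onwards.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) break;
            if (matches.size() == limit) {
                throw new TooManyTerms("More than " + limit + " terms start with " + prefix);
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
            new TreeSet<>(Arrays.asList("index", "indexer", "indices", "inverted", "query"));
        System.out.println(resolve(terms, "ind", 3));
        // prints "[index, indexer, indices]"
    }
}
```

Calling resolve(terms, "ind", 2) on the same set throws, because three terms start with the prefix.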
-
keyIndex
public void keyIndex(Index newKeyIndex)
Sets the index used as a key to retrieve intervals from iterators generated from this index.
This setter is a compromise between clarity of design and efficiency. Each index iterator is based on an index, and when that index is passed to DocumentIterator.intervalIterator(Index), intervals corresponding to the positions of the term in the current document are returned. Analogously, DocumentIterator.indices() returns a singleton set containing the index. However, when composing indices into clusters, often iterators generated by a local index must act as if they really belong to the global index. This method sets the index that is used as a key to return intervals, and that is contained in singletonSet.
Note that setting this value will only influence index readers created afterwards.
- Parameters:
newKeyIndex - the new index to be used as a key for interval retrieval.
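The note that only readers created afterwards are affected can be pictured with stand-in types. The following sketch is hypothetical and does not use MG4J classes; in particular, having the reader capture the key index at creation time is an assumption chosen to reproduce the documented behaviour.

```java
/** Hypothetical stand-ins (not MG4J classes) for an index and the readers it creates. */
final class KeyIndexSketch {
    static final class StubIndex {
        final String name;
        StubIndex keyIndex = this;      // usually equal to this, but settable

        StubIndex(String name) {
            this.name = name;
        }

        void keyIndex(StubIndex newKeyIndex) {
            keyIndex = newKeyIndex;
        }

        StubReader getReader() {
            return new StubReader(keyIndex);    // the key is captured when the reader is created
        }
    }

    static final class StubReader {
        final StubIndex key;

        StubReader(StubIndex key) {
            this.key = key;
        }
    }

    public static void main(String[] args) {
        StubIndex global = new StubIndex("global");
        StubIndex local = new StubIndex("local");
        StubReader before = local.getReader();  // created before the key is changed
        local.keyIndex(global);                 // the local index now acts on behalf of the global one
        StubReader after = local.getReader();
        System.out.println(before.key.name + " " + after.key.name);
        // prints "local global"
    }
}
```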
-
-