java.lang.Object it.unimi.di.mg4j.index.DiskBasedIndex
public class DiskBasedIndex
A static container providing facilities to load an index based on data stored on disk.
This class contains several useful static methods, such as readOffsets(InputBitStream, int), readSizes(CharSequence, int) and loadLongBigList(CharSequence, ByteOrder), and static factory methods, such as getInstance(CharSequence, boolean, boolean, boolean, EnumMap), that take care of reading the properties associated with the index, identifying the correct Index implementation that should be used to load the index, and loading the necessary data into memory.
As an option, a disk-based index can be loaded into main memory (key: Index.UriKeys.INMEMORY), or mapped into main memory (key: Index.UriKeys.MAPPED) (the value assigned to the keys is irrelevant).
Note that quasi-succinct indices are memory-mapped by default, and for bitstream indices there is a limit of two gigabytes for in-memory indices.
By default the term-offset list is accessed using a SemiExternalOffsetList with a step of DEFAULT_OFFSET_STEP. This behaviour can be changed using the URI key Index.UriKeys.OFFSETSTEP.
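The key-based options described above can be pictured with a small, self-contained sketch. The enum below is a hypothetical stand-in for Index.UriKeys (only the three key names come from this page), describe is an illustrative helper that is not part of MG4J, and the default step is passed in as a parameter because its value is not stated here. Giving INMEMORY precedence over MAPPED is an arbitrary choice of the sketch.

```java
import java.util.EnumMap;

final class QueryPropertiesSketch {
    /** Hypothetical stand-in for Index.UriKeys; only the key names come from the documentation. */
    enum UriKeys { INMEMORY, MAPPED, OFFSETSTEP }

    /** Summarises how an index would be loaded for the given keys (illustrative only). */
    static String describe(EnumMap<UriKeys, String> queryProperties, int defaultOffsetStep) {
        String storage = "on disk";
        int offsetStep = defaultOffsetStep;
        if (queryProperties != null) {
            // For INMEMORY and MAPPED only the presence of the key matters, not its value.
            if (queryProperties.containsKey(UriKeys.INMEMORY)) storage = "in memory";
            else if (queryProperties.containsKey(UriKeys.MAPPED)) storage = "memory-mapped";
            // OFFSETSTEP overrides the default step when present.
            String step = queryProperties.get(UriKeys.OFFSETSTEP);
            if (step != null) offsetStep = Integer.parseInt(step);
        }
        return storage + ", offset step " + offsetStep;
    }

    public static void main(String[] args) {
        EnumMap<UriKeys, String> keys = new EnumMap<>(UriKeys.class);
        keys.put(UriKeys.MAPPED, "");       // the value is ignored
        keys.put(UriKeys.OFFSETSTEP, "8");
        // 64 is a placeholder default, not the real DEFAULT_OFFSET_STEP.
        System.out.println(describe(keys, 64));
    }
}
```

With the real library, a map of this shape would be passed as the queryProperties argument of getInstance.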
Disk-based indices are the workhorse of MG4J. All other indices (clustered, remote, etc.) ultimately rely on disk-based indices to provide results.
Note that not all data produced by Scan and by the other indexing utilities are actually necessary to run a disk-based index. Usually the property file and the index files are sufficient. If random access is needed, the offsets file must also be present; if the compression method requires document sizes, or if sizes are requested explicitly, the sizes file must also be present. A StringMap and possibly a PrefixMap will be fetched automatically by getInstance(CharSequence, boolean, boolean) using standard extensions.
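As a rough illustration of the rule above, the following sketch lists the files that must exist for a given basename. The extension strings are assumptions (this page names the constants but not their values), the helper requiredFiles is hypothetical, and index types that split their data over several bitstreams would need more files than shown.

```java
import java.util.ArrayList;
import java.util.List;

final class RequiredFilesSketch {
    // Assumed values for the constants documented on this page.
    static final String PROPERTIES_EXTENSION = ".properties";
    static final String INDEX_EXTENSION = ".index";
    static final String OFFSETS_EXTENSION = ".offsets";
    static final String SIZES_EXTENSION = ".sizes";

    /** Lists the files needed to open an index with the requested features. */
    static List<String> requiredFiles(String basename, boolean randomAccess, boolean documentSizes) {
        List<String> files = new ArrayList<>();
        // The property file and the index file are always needed.
        files.add(basename + PROPERTIES_EXTENSION);
        files.add(basename + INDEX_EXTENSION);
        // Offsets are needed only for random access.
        if (randomAccess) files.add(basename + OFFSETS_EXTENSION);
        // Sizes are needed when requested (the compression method may also force them).
        if (documentSizes) files.add(basename + SIZES_EXTENSION);
        return files;
    }

    public static void main(String[] args) {
        System.out.println(requiredFiles("corpus-text", true, false));
    }
}
```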
A disk-based index is thread safe as long as the offset list, the size list and the term/prefix map are. The static factory methods provided by this class load offsets and sizes using data structures that are thread safe. If you call a constructor directly, instead, it is your responsibility to pass thread-safe data structures.
Field Summary | |
---|---|
static int | BUFFER_SIZE |
static String | COUNTS_EXTENSION: The extension for the counts bitstream. |
static int | DEFAULT_OFFSET_STEP: The default value for the query parameter Index.UriKeys.OFFSETSTEP. |
static String | FREQUENCIES_EXTENSION: Standard extension for the file of frequencies. |
static String | INDEX_EXTENSION: Standard extension for the index bitstream. |
static String | OCCURRENCIES_EXTENSION: Standard extension for the file of global counts. |
static String | OFFSETS_EXTENSION: Standard extension for the file of offsets. |
static String | OFFSETS_POSTFIX: The postfix to be added to POINTERS_EXTENSIONS, COUNTS_EXTENSION and POSITIONS_EXTENSION for offsets. |
static String | POINTERS_EXTENSIONS: The extension for the pointers bitstream. |
static String | POSITIONS_EXTENSION: Standard extension for the positions bitstream of a high-performance index. |
static String | POSITIONS_NUMBER_OF_BITS_EXTENSION: Standard extension for the file of lengths of positions. |
static String | PREFIXMAP_EXTENSION: Standard extension for the prefix map. |
static String | PROPERTIES_EXTENSION: Standard extension for the index properties. |
static String | SIZES_EXTENSION: Standard extension for the file of sizes. |
static String | STATS_EXTENSION: Standard extension for the stats file. |
static String | SUMS_MAX_POSITION_EXTENSION: Standard extension for the file of lengths of positions. |
static String | TERMMAP_EXTENSION: Standard extension for the term map. |
static String | TERMS_EXTENSION: Standard extension for the file of terms. |
static String | UNSORTED_TERMS_EXTENSION: Standard extension for the file of terms, unsorted. |
Method Summary | |
---|---|
static ByteOrder | byteOrder(String s): Parses a ByteOrder value. |
static Index | getInstance(CharSequence basename): Returns a new local index, trying to guess reasonable term and prefix maps from the basename, loading offsets but loading document sizes only if it is necessary. |
static Index | getInstance(CharSequence basename, boolean randomAccess): Returns a new local index, trying to guess reasonable term and prefix maps from the basename, and loading document sizes only if it is necessary. |
static Index | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes): Returns a new disk-based index, guessing reasonable term and prefix maps from the basename. |
static Index | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps): Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename. |
static Index | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties): Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename. |
static Index | getInstance(CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties): Returns a new disk-based index, using preloaded Properties and possibly guessing reasonable term and prefix maps from the basename. |
static Index | getInstance(CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<Index.UriKeys,String> queryProperties): Returns a new disk-based index, loading exactly the specified parts and using preloaded Properties and the IOFactory.FILESYSTEM_FACTORY. |
static Index | getInstance(IOFactory ioFactory, CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties): Returns a new disk-based index, using preloaded Properties and possibly guessing reasonable term and prefix maps from the basename. |
static Index | getInstance(IOFactory ioFactory, CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<Index.UriKeys,String> queryProperties): Returns a new disk-based index, loading exactly the specified parts and using preloaded Properties. |
static LongBigArrayBigList | loadLongBigList(CharSequence filename, ByteOrder byteOrder): Convenience method for loading a big list of binary longs with specified endianness into a long big array using the IOFactory.FILESYSTEM_FACTORY. |
static LongBigArrayBigList | loadLongBigList(IOFactory ioFactory, CharSequence filename, ByteOrder byteOrder): Convenience method for loading a big list of binary longs with specified endianness into a long big array. |
static LongBigArrayBigList | loadLongBigList(ReadableByteChannel channel, long length, ByteOrder byteOrder): Convenience method for loading from a channel a big list of binary longs with specified endianness into a long big array. |
static PrefixMap<? extends CharSequence> | loadPrefixMap(IOFactory ioFactory, String filename): Utility static method that loads a prefix map. |
static PrefixMap<? extends CharSequence> | loadPrefixMap(String filename): Utility static method that loads a prefix map using the IOFactory.FILESYSTEM_FACTORY. |
static StringMap<? extends CharSequence> | loadStringMap(IOFactory ioFactory, String filename): Utility static method that loads a term map. |
static StringMap<? extends CharSequence> | loadStringMap(String filename): Utility static method that loads a term map using the IOFactory.FILESYSTEM_FACTORY. |
static LongList | offsets(IOFactory ioFactory, String filename, int numberOfTerms, int offsetStep): Returns the list of offsets. |
static LongList | offsets(String filename, int numberOfTerms, int offsetStep): Returns the list of offsets using the IOFactory.FILESYSTEM_FACTORY. |
static LongList | readOffsets(CharSequence filename, int T): Utility method to load a compressed offset file into a list using the IOFactory.FILESYSTEM_FACTORY. |
static LongList | readOffsets(InputBitStream in, int T): Utility method to load a compressed offset file into a list. |
static LongList | readOffsets(IOFactory ioFactory, CharSequence filename, int T): Utility method to load a compressed offset file into a list. |
static IntList | readSizes(CharSequence filename, int N): Utility method to load a compressed size file into a list using the IOFactory.FILESYSTEM_FACTORY. |
static IntList | readSizes(IOFactory ioFactory, CharSequence filename, int N): Utility method to load a compressed size file into a list. |
static IntList | readSizesSuccinct(CharSequence filename, int N): Deprecated. This method is an ancestral residue. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_OFFSET_STEP
The default value for the query parameter Index.UriKeys.OFFSETSTEP.
public static final String INDEX_EXTENSION
public static final String POSITIONS_EXTENSION
public static final String PROPERTIES_EXTENSION
public static final String SIZES_EXTENSION
public static final String OFFSETS_EXTENSION
public static final String POSITIONS_NUMBER_OF_BITS_EXTENSION
public static final String SUMS_MAX_POSITION_EXTENSION
public static final String OCCURRENCIES_EXTENSION
public static final String FREQUENCIES_EXTENSION
public static final String TERMS_EXTENSION
public static final String UNSORTED_TERMS_EXTENSION
public static final String TERMMAP_EXTENSION
public static final String PREFIXMAP_EXTENSION
public static final String STATS_EXTENSION
public static final String POINTERS_EXTENSIONS
public static final String COUNTS_EXTENSION
public static final String OFFSETS_POSTFIX
The postfix to be added to POINTERS_EXTENSIONS, COUNTS_EXTENSION and POSITIONS_EXTENSION for offsets.
public static final int BUFFER_SIZE
Method Detail |
---|
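public static LongList readOffsets(InputBitStream in, int T) throws IOException
Utility method to load a compressed offset file into a list.
Parameters:
in - the input bit stream providing the offsets (see BitStreamIndexWriter).
T - the number of terms indexed.
Returns:
a list of offsets with an additional final element of index T that gives the number of bytes of the index file.
Throws:
IOException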
public static LongList readOffsets(IOFactory ioFactory, CharSequence filename, int T) throws IOException
Utility method to load a compressed offset file into a list.
Parameters:
ioFactory - the factory that will be used to perform I/O.
filename - the file containing the offsets (see BitStreamIndexWriter).
T - the number of terms indexed.
Returns:
a list of offsets with an additional final element of index T that gives the number of bytes of the index file.
Throws:
IOException
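public static LongList readOffsets(CharSequence filename, int T) throws IOException
Utility method to load a compressed offset file into a list using the IOFactory.FILESYSTEM_FACTORY.
Parameters:
filename - the file containing the offsets (see BitStreamIndexWriter).
T - the number of terms indexed.
Returns:
a list of offsets with an additional final element of index T that gives the number of bytes of the index file.
Throws:
IOException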
public static IntList readSizes(IOFactory ioFactory, CharSequence filename, int N) throws IOException
Utility method to load a compressed size file into a list.
Parameters:
ioFactory - the factory that will be used to perform I/O.
filename - the file containing the γ-coded sizes (see BitStreamIndexWriter).
N - the number of documents.
Throws:
IOException
public static IntList readSizes(CharSequence filename, int N) throws IOException
Utility method to load a compressed size file into a list using the IOFactory.FILESYSTEM_FACTORY.
Parameters:
filename - the file containing the γ-coded sizes (see BitStreamIndexWriter).
N - the number of documents.
Throws:
IOException
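To make "γ-coded sizes" concrete, here is a minimal sketch of Elias γ coding over a string of '0' and '1' characters. It is not the MG4J implementation: the class and method names are hypothetical, a character string stands in for the bit stream, and the bit layout (x + 1 is coded, with the unary prefix written as zeros terminated by a one) is an assumption about the convention in use.

```java
final class GammaSizesSketch {
    /** Appends the γ code of a natural number x (x must be less than Integer.MAX_VALUE). */
    static void writeGamma(StringBuilder bits, int x) {
        int v = x + 1;                                   // γ codes positive integers, so shift by one
        int msb = 31 - Integer.numberOfLeadingZeros(v);  // position of the most significant bit
        for (int i = 0; i < msb; i++) bits.append('0');  // msb in unary: msb zeros ...
        bits.append('1');                                // ... followed by a one
        for (int i = msb - 1; i >= 0; i--)               // then the msb lower bits of v
            bits.append(((v >>> i) & 1) == 0 ? '0' : '1');
    }

    /** Decodes n consecutive γ-coded values, as one would for n document sizes. */
    static int[] readSizes(CharSequence bits, int n) {
        int[] sizes = new int[n];
        int pos = 0;
        for (int k = 0; k < n; k++) {
            int msb = 0;
            while (bits.charAt(pos++) == '0') msb++;     // read the unary prefix
            int v = 1;                                   // implicit leading one
            for (int i = 0; i < msb; i++) v = (v << 1) | (bits.charAt(pos++) - '0');
            sizes[k] = v - 1;
        }
        return sizes;
    }

    public static void main(String[] args) {
        StringBuilder bits = new StringBuilder();
        for (int size : new int[] { 0, 1, 2, 3, 1000 }) writeGamma(bits, size);
        System.out.println(java.util.Arrays.toString(readSizes(bits, 5)));
    }
}
```

Small sizes take very few bits under this code, which is why it suits lists of document lengths.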
@Deprecated public static IntList readSizesSuccinct(CharSequence filename, int N) throws IOException
Deprecated. This method is an ancestral residue.
Parameters:
filename - the filename containing the γ-coded sizes (see BitStreamIndexWriter).
N - the number of documents indexed.
Throws:
IllegalStateException - if ioFactory is not IOFactory.FILESYSTEM_FACTORY.
IOException
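public static LongBigArrayBigList loadLongBigList(IOFactory ioFactory, CharSequence filename, ByteOrder byteOrder) throws IOException
Convenience method for loading a big list of binary longs with specified endianness into a long big array.
Parameters:
ioFactory - the factory that will be used to perform I/O.
filename - the file containing the longs.
byteOrder - the endianness of the longs.
Returns:
a big list containing the longs read from file.
Throws:
IOException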
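public static LongBigArrayBigList loadLongBigList(CharSequence filename, ByteOrder byteOrder) throws IOException
Convenience method for loading a big list of binary longs with specified endianness into a long big array using the IOFactory.FILESYSTEM_FACTORY.
Parameters:
filename - the file containing the longs.
byteOrder - the endianness of the longs.
Returns:
a big list containing the longs read from file.
Throws:
IOException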
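public static LongBigArrayBigList loadLongBigList(ReadableByteChannel channel, long length, ByteOrder byteOrder) throws IOException
Convenience method for loading from a channel a big list of binary longs with specified endianness into a long big array.
Parameters:
channel - the channel.
byteOrder - the endianness of the longs.
Returns:
a big list containing the longs read from channel.
Throws:
IOException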
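The channel-based loading can be sketched with the standard library alone. This is an illustration under stated assumptions, not the MG4J code: loadLongs is a hypothetical helper, it returns a plain long[] rather than a big list, and it interprets length as the number of longs to read, which this page does not specify.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

final class LoadLongsSketch {
    /** Reads length binary longs with the given endianness from a channel. */
    static long[] loadLongs(ReadableByteChannel channel, int length, ByteOrder byteOrder) throws IOException {
        long[] result = new long[length];
        ByteBuffer buffer = ByteBuffer.allocate(8 * 1024).order(byteOrder);
        int filled = 0;
        while (filled < length) {
            buffer.clear();
            // Never read past the requested number of longs.
            buffer.limit((int) Math.min(buffer.capacity(), (long) (length - filled) * Long.BYTES));
            while (buffer.hasRemaining()) {
                if (channel.read(buffer) < 0) throw new EOFException("Channel ended after " + filled + " longs");
            }
            buffer.flip();
            // The buffer's byte order decides how each group of 8 bytes is interpreted.
            while (buffer.hasRemaining()) result[filled++] = buffer.getLong();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN).putLong(1L).putLong(-2L).array();
        long[] longs = loadLongs(Channels.newChannel(new ByteArrayInputStream(data)), 2, ByteOrder.LITTLE_ENDIAN);
        System.out.println(java.util.Arrays.toString(longs));
    }
}
```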
public static ByteOrder byteOrder(String s)
Parses a ByteOrder value.
Parameters:
s - a string (either BIG_ENDIAN or LITTLE_ENDIAN).
Returns:
the corresponding byte order (ByteOrder.BIG_ENDIAN or ByteOrder.LITTLE_ENDIAN).
public static StringMap<? extends CharSequence> loadStringMap(IOFactory ioFactory, String filename) throws IOException
Utility static method that loads a term map.
Parameters:
ioFactory - the factory that will be used to perform I/O.
filename - the name of the file containing the term map.
Returns:
the term map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.
public static StringMap<? extends CharSequence> loadStringMap(String filename) throws IOException
Utility static method that loads a term map using the IOFactory.FILESYSTEM_FACTORY.
Parameters:
filename - the name of the file containing the term map.
Returns:
the term map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.
public static PrefixMap<? extends CharSequence> loadPrefixMap(IOFactory ioFactory, String filename) throws IOException
Utility static method that loads a prefix map.
Parameters:
ioFactory - the factory that will be used to perform I/O.
filename - the name of the file containing the prefix map.
Returns:
the prefix map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.
public static PrefixMap<? extends CharSequence> loadPrefixMap(String filename) throws IOException
Utility static method that loads a prefix map using the IOFactory.FILESYSTEM_FACTORY.
Parameters:
filename - the name of the file containing the prefix map.
Returns:
the prefix map, or null if the file did not exist.
Throws:
IOException - if some IOException (other than FileNotFoundException) occurred.
public static LongList offsets(IOFactory ioFactory, String filename, int numberOfTerms, int offsetStep) throws FileNotFoundException, IOException
Returns the list of offsets.
Parameters:
ioFactory - the factory that will be used to perform I/O.
filename - the file containing the offsets.
numberOfTerms - the number of terms.
offsetStep - the offset step.
Returns:
if offsetStep is less than zero, a memory-mapped, synchronized SemiExternalOffsetList with offset step equal to -offsetStep; if it is zero, an in-memory list; if it is greater than zero, a synchronized SemiExternalOffsetList with offset step equal to offsetStep.
Throws:
FileNotFoundException
IOException
public static LongList offsets(String filename, int numberOfTerms, int offsetStep) throws FileNotFoundException, IOException
Returns the list of offsets using the IOFactory.FILESYSTEM_FACTORY.
Parameters:
filename - the file containing the offsets.
numberOfTerms - the number of terms.
offsetStep - the offset step.
Returns:
if offsetStep is less than zero, a memory-mapped, synchronized SemiExternalOffsetList with offset step equal to -offsetStep; if it is zero, an in-memory list; if it is greater than zero, a synchronized SemiExternalOffsetList with offset step equal to offsetStep.
Throws:
FileNotFoundException
IOException
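The role of the offset step can be illustrated with a simplified sampled list. This sketch is not SemiExternalOffsetList: the class name is hypothetical, an int array of gaps stands in for the compressed offsets kept outside main memory, and only every step-th absolute offset is held directly. A larger step keeps fewer samples in memory at the cost of more sequential decoding per access.

```java
final class SampledOffsetsSketch {
    private final int[] gaps;      // stands in for the compressed gap stream
    private final long[] samples;  // every step-th absolute offset, kept in memory
    private final int step;

    /** Builds the sampled representation from a nondecreasing list of offsets. */
    SampledOffsetsSketch(long[] offsets, int step) {
        if (step <= 0) throw new IllegalArgumentException("step must be positive");
        this.step = step;
        gaps = new int[offsets.length];
        samples = new long[(offsets.length + step - 1) / step];
        long previous = 0;
        for (int i = 0; i < offsets.length; i++) {
            gaps[i] = Math.toIntExact(offsets[i] - previous);
            previous = offsets[i];
            if (i % step == 0) samples[i / step] = offsets[i];
        }
    }

    /** Returns the offset at the given index, decoding at most step - 1 gaps. */
    long get(int index) {
        int block = index / step;
        long offset = samples[block];  // jump to the nearest preceding sample
        for (int i = block * step + 1; i <= index; i++) offset += gaps[i];
        return offset;
    }

    int size() {
        return gaps.length;
    }

    public static void main(String[] args) {
        SampledOffsetsSketch list = new SampledOffsetsSketch(new long[] { 0, 5, 9, 20, 21, 40, 100 }, 3);
        System.out.println(list.get(5));
    }
}
```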
public static Index getInstance(IOFactory ioFactory, CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<Index.UriKeys,String> queryProperties) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, loading exactly the specified parts and using preloaded Properties.
Parameters:
ioFactory - the factory that will be used to perform I/O.
basename - the basename of the index.
properties - the properties obtained from the given basename.
termMap - the term map for this index, or null for no term map.
prefixMap - the prefix map for this index, or null for no prefix map.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
public static Index getInstance(CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<Index.UriKeys,String> queryProperties) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, loading exactly the specified parts and using preloaded Properties and the IOFactory.FILESYSTEM_FACTORY.
Parameters:
basename - the basename of the index.
properties - the properties obtained from the given basename.
termMap - the term map for this index, or null for no term map.
prefixMap - the prefix map for this index, or null for no prefix map.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
public static Index getInstance(IOFactory ioFactory, CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, using preloaded Properties and possibly guessing reasonable term and prefix maps from the basename.
Parameters:
ioFactory - the factory that will be used to perform I/O.
basename - the basename of the index.
properties - the properties obtained by stemming basename.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded.
maps - if true, term and prefix maps will be guessed and loaded.
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
IllegalAccessException
InstantiationException
ClassNotFoundException
IOException
See Also:
getInstance(CharSequence, Properties, StringMap, PrefixMap, boolean, boolean, EnumMap)
public static Index getInstance(CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, using preloaded Properties and possibly guessing reasonable term and prefix maps from the basename.
Parameters:
basename - the basename of the index.
properties - the properties obtained by stemming basename.
randomAccess - whether the index should be accessible randomly.
documentSizes - if true, document sizes will be loaded.
maps - if true, term and prefix maps will be guessed and loaded.
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
IllegalAccessException
InstantiationException
ClassNotFoundException
IOException
See Also:
getInstance(CharSequence, Properties, StringMap, PrefixMap, boolean, boolean, EnumMap)
public static Index getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<Index.UriKeys,String> queryProperties) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
If there is a term map file (basename stemmed with .termmap), it is used as term map and, in case it implements PrefixMap, also as prefix map. Otherwise, we search for a prefix map (basename stemmed with .prefixmap) and, if it implements StringMap and no term map has been found, we use it also as term map.
Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
queryProperties - a map containing associations between Index.UriKeys and values, or null.
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
public static Index getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
If there is a term map file (basename stemmed with .termmap), it is used as term map and, in case it implements PrefixMap, also as prefix map. Otherwise, we search for a prefix map (basename stemmed with .prefixmap) and, if it implements StringMap and no term map has been found, we use it also as term map.
Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
See Also:
getInstance(CharSequence, boolean, boolean, boolean, EnumMap)
public static Index getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new disk-based index, guessing reasonable term and prefix maps from the basename.
Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
public static Index getInstance(CharSequence basename, boolean randomAccess) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new local index, trying to guess reasonable term and prefix maps from the basename, and loading document sizes only if it is necessary.
Parameters:
basename - the basename of the index.
randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index).
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException
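public static Index getInstance(CharSequence basename) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException
Returns a new local index, trying to guess reasonable term and prefix maps from the basename, loading offsets but loading document sizes only if it is necessary.
Parameters:
basename - the basename of the index.
Throws:
ConfigurationException
ClassNotFoundException
IOException
InstantiationException
IllegalAccessException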