Class DiskBasedIndex

  • public class DiskBasedIndex
    extends Object
    A static container providing facilities to load an index based on data stored on disk.

    This class contains several useful static methods, such as readOffsets(InputBitStream, long), readSizes(CharSequence, long) and loadLongBigList(CharSequence, ByteOrder), and static factory methods, such as getInstance(CharSequence, boolean, boolean, boolean, EnumMap), that take care of reading the properties associated with the index, identifying the correct Index implementation that should be used to load the index, and loading the necessary data into memory.

    As an option, a disk-based index can be loaded into main memory (key: Index.UriKeys.INMEMORY) or memory-mapped (key: Index.UriKeys.MAPPED); the value assigned to either key is irrelevant.
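
    The following sketch illustrates the "presence only" semantics of these keys. It does not use MG4J: UriKey is a hypothetical stand-in for Index.UriKeys, LoadOptions and Residency are invented names, and giving INMEMORY precedence when both keys are present is an arbitrary choice of the sketch, not documented behaviour.

```java
import java.util.EnumMap;

// Hypothetical stand-in for Index.UriKeys; only three keys are mirrored here.
enum UriKey { INMEMORY, MAPPED, OFFSETSTEP }

final class LoadOptions {
    /** Where the index data would live once loaded (names invented for this sketch). */
    enum Residency { ON_DISK, IN_MEMORY, MAPPED }

    private LoadOptions() {}

    /** Decides residency from the presence of a key; the associated value is never read. */
    static Residency residency(EnumMap<UriKey, String> options) {
        // Assumption of this sketch: INMEMORY wins if both keys are present.
        if (options.containsKey(UriKey.INMEMORY)) return Residency.IN_MEMORY;
        if (options.containsKey(UriKey.MAPPED)) return Residency.MAPPED;
        return Residency.ON_DISK;
    }
}
```

    With an empty map the sketch reports ON_DISK; putting MAPPED with any value at all (even an empty string) switches it to MAPPED.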

    Note that quasi-succinct indices are memory-mapped by default, and for bitstream indices there is a limit of two gigabytes for in-memory indices.

    By default the term-offset list is accessed using a SemiExternalOffsetBigList with a step of DEFAULT_OFFSET_STEP. This behaviour can be changed using the URI key Index.UriKeys.OFFSETSTEP.
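
    To give an intuition of what a step means for a semi-external list, the sketch below keeps in memory only the offsets whose index is a multiple of the step and rebuilds the others by summing gaps from the nearest sample. It is a simplified illustration of the general sampling idea, not MG4J's implementation: SampledOffsets is an invented class, and a plain long[] of gaps stands in for the on-disk offsets file.

```java
final class SampledOffsets {
    private final long[] gaps;    // stand-in for the offsets file: gap from the previous offset
    private final long[] samples; // offsets at indices 0, step, 2*step, ... kept in memory
    private final int step;

    SampledOffsets(long[] gaps, int step) {
        if (step <= 0) throw new IllegalArgumentException("step must be positive");
        this.gaps = gaps.clone();
        this.step = step;
        this.samples = new long[(gaps.length + step - 1) / step];
        long offset = 0;
        for (int i = 0; i < gaps.length; i++) {
            offset += gaps[i];
            if (i % step == 0) samples[i / step] = offset; // remember every step-th offset
        }
    }

    /** Returns the offset at the given index, scanning at most step - 1 gaps. */
    long get(int index) {
        if (index < 0 || index >= gaps.length) throw new IndexOutOfBoundsException("index: " + index);
        int block = index / step;
        long offset = samples[block];
        for (int i = block * step + 1; i <= index; i++) offset += gaps[i];
        return offset;
    }
}
```

    In this model a larger step means fewer samples held in memory and a longer scan per lookup; a smaller step trades memory for speed.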

    Disk-based indices are the workhorse of MG4J. All other indices (clustered, remote, etc.) ultimately rely on disk-based indices to provide results.

    Note that not all data produced by Scan and by the other indexing utilities are actually necessary to run a disk-based index. Usually the property file and the index files are sufficient. If random access is needed, the offsets file must also be present; if the compression method requires document sizes, or if sizes are requested explicitly, the sizes file must also be present. A StringMap and possibly a PrefixMap will be fetched automatically by getInstance(CharSequence, boolean, boolean) using standard extensions.
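
    The rule above can be summarised as a small checklist. The sketch below is self-contained and does not call MG4J; RequiredFiles is an invented helper, and the extensions ".properties", ".index", ".offsets" and ".sizes" are assumptions made for illustration (some index formats store their data in more than one index file, which this sketch ignores).

```java
import java.util.ArrayList;
import java.util.List;

final class RequiredFiles {
    private RequiredFiles() {}

    /** Lists the files that must exist for the requested features (extensions are assumed). */
    static List<String> of(String basename, boolean randomAccess, boolean documentSizes) {
        List<String> files = new ArrayList<>();
        files.add(basename + ".properties"); // always needed: index properties
        files.add(basename + ".index");      // always needed: index data
        if (randomAccess) files.add(basename + ".offsets"); // only for random access
        if (documentSizes) files.add(basename + ".sizes");  // only if sizes are required or requested
        return files;
    }
}
```

    A caller could check that each returned name exists (for instance with java.nio.file.Files.exists) before attempting to load the index.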

    Thread safety

    A disk-based index is thread safe as long as the offset list, the size list and the term/prefix map are. The static factory methods provided by this class load offsets and sizes using thread-safe data structures. If you call a constructor directly, however, it is your responsibility to pass thread-safe data structures.

    Sebastiano Vigna