Class SemiExternalOffsetBigList

  • All Implemented Interfaces:
    BigList<Long>, LongBigList, LongCollection, LongIterable, LongStack, Size64, Stack<Long>, Comparable<BigList<? extends Long>>, Iterable<Long>, Collection<Long>

    public class SemiExternalOffsetBigList
    extends AbstractLongBigList
    Provides semi-external random access to offsets of an index.

    This class is a semi-external LongBigList that MG4J uses by default for accessing term offsets.

    When the number of terms in the index grows, storing each offset as a long in an array can consume hundreds of megabytes of memory. Most of this memory is wasted, as it is occupied by offsets of hapax legomena (terms occurring just once in the collection). Instead, this class accesses offsets in their compressed form and provides entry points for random access to each offset. At construction time, entry points are computed with a certain step, which is the number of offsets accessible from each entry point or, equivalently, the maximum number of offsets that must be read to access a given offset.
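
    The entry-point scheme described above can be illustrated with a minimal, self-contained sketch. It is not the MG4J implementation: the class name SteppedOffsetSketch is hypothetical, the offsets are assumed to be non-decreasing, and a simple variable-byte code stands in for the γ-coded bit stream so that the example needs only the standard library.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration of stepped entry points; not the MG4J code. */
final class SteppedOffsetSketch {
    private final byte[] data;       // deltas, variable-byte encoded (stand-in for γ codes)
    private final int step;          // number of offsets reachable from each entry point
    private final long size;         // overall number of offsets
    private final int[] entryPos;    // byte position of the delta of offset k * step
    private final long[] entryBase;  // value of the offset preceding k * step (0 for k = 0)

    SteppedOffsetSketch(long[] offsets, int step) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int numEntries = (offsets.length + step - 1) / step;
        entryPos = new int[numEntries];
        entryBase = new long[numEntries];
        long prev = 0;
        for (int i = 0; i < offsets.length; i++) {
            if (i % step == 0) {
                // Record where decoding can restart, and from which value.
                entryPos[i / step] = out.size();
                entryBase[i / step] = prev;
            }
            long delta = offsets[i] - prev; // assumes non-decreasing offsets
            while (delta >= 0x80) {
                out.write((int) (delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write((int) delta);
            prev = offsets[i];
        }
        this.data = out.toByteArray();
        this.step = step;
        this.size = offsets.length;
    }

    /** Decodes at most {@code step} deltas, starting from the nearest entry point. */
    long getLong(long index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException("index: " + index);
        int entry = (int) (index / step);
        int pos = entryPos[entry];
        long value = entryBase[entry];
        for (long i = (long) entry * step; i <= index; i++) {
            long delta = 0;
            int shift = 0;
            int b;
            do {
                b = data[pos++] & 0xFF;
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            value += delta;
        }
        return value;
    }

    long size64() {
        return size;
    }
}
```

    In this sketch a larger step means fewer entry points kept in memory but more deltas decoded per access, which is the trade-off the step parameter controls.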

    This class uses a small map (at most CACHE_MAX_SIZE entries) to keep track of the most recently used indices, so that queries for those indices can be answered more quickly.
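
    A bounded map of recently used indices could look roughly like the following sketch. It assumes a java.util.LinkedHashMap in access order with least-recently-used eviction; the map type and eviction policy actually used by this class are not stated here, and the names RecentIndexCacheSketch, slowLookup and maxSize are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongUnaryOperator;

/** Hypothetical illustration of a small most-recently-used cache; not the MG4J code. */
final class RecentIndexCacheSketch {
    private final LongUnaryOperator slowLookup; // stands in for decoding from an entry point
    private final Map<Long, Long> cache;
    int misses;                                 // exposed only to make the behaviour observable

    RecentIndexCacheSketch(LongUnaryOperator slowLookup, final int maxSize) {
        this.slowLookup = slowLookup;
        // accessOrder = true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<Long, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Long> eldest) {
                return size() > maxSize;
            }
        };
    }

    long getLong(long index) {
        Long cached = cache.get(index);
        if (cached != null) return cached;
        misses++;
        long value = slowLookup.applyAsLong(index);
        cache.put(index, value);
        return value;
    }
}
```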

    Warning: This class is not thread safe; access to an instance must be synchronised externally if it is shared among threads.
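
    One way to follow this warning is to route every access through a single lock. The wrapper below is only a sketch under that assumption: SynchronizedLookupSketch is a hypothetical name, and a LongUnaryOperator stands in for the list's getLong method so that the example is self-contained.

```java
import java.util.function.LongUnaryOperator;

/** Hypothetical illustration of external synchronisation; not part of MG4J. */
final class SynchronizedLookupSketch implements LongUnaryOperator {
    private final LongUnaryOperator delegate; // the non-thread-safe lookup

    SynchronizedLookupSketch(LongUnaryOperator delegate) {
        this.delegate = delegate;
    }

    /** Only one thread at a time may reach the delegate. */
    @Override
    public synchronized long applyAsLong(long index) {
        return delegate.applyAsLong(index);
    }
}
```

    Since all calls are serialised, threads that need high throughput may prefer one list instance per thread, at the cost of extra memory.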

    Author:
    Fabien Campagne, Sebastiano Vigna
    • Field Detail

      • CACHE_MAX_SIZE

        public static final int CACHE_MAX_SIZE
        The maximum number of entries in the cache map.
        See Also:
        Constant Field Values
    • Constructor Detail

      • SemiExternalOffsetBigList

        public SemiExternalOffsetBigList(InputBitStream offsetRawData,
                                         int offsetStep,
                                         long numOffsets)
                                  throws IOException
        Creates a new semi-external list.
        Parameters:
        offsetRawData - a bit stream containing the offsets in compressed form (γ-encoded deltas).
        offsetStep - the step used to build random-access entry points.
        numOffsets - the overall number of offsets (i.e., the number of terms).
        Throws:
        IOException
    • Method Detail

      • getLong

        public final long getLong(long index)
      • size64

        public long size64()