Class TermCollectionVisitor

  • All Implemented Interfaces:
    DocumentIteratorVisitor<Boolean>

    public class TermCollectionVisitor
    extends AbstractDocumentIteratorVisitor
    A visitor collecting information about terms appearing in a DocumentIterator.

    The purpose of this visitor is to explore the structure of a DocumentIterator before iteration, in order to count how many terms are actually used and to collect the terms appearing in all leaves of nonzero frequency (the latter condition is used to skip empty iterators), possibly considering just a subset of indices. For this visitor to work, all leaves of nonzero frequency must return a non-null value on a call to IndexIterator.term().

    During the visit, we keep track of which index/term pairs have already been seen. Each pair is assigned a distinct offset (a number between zero and the overall number of distinct pairs), which is stored in each index iterator's id and is used afterwards to quickly access data about the pair. Note that duplicate index/term pairs get the same offset. The overall number of distinct pairs is returned by numberOfPairs() after a visit.
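
    As an illustration of the offset assignment described above, here is a minimal sketch that does not use MG4J at all. Indices are modeled as plain names rather than Index objects, and the class PairOffsetSketch and its visit method are hypothetical. Assigning offsets in order of first appearance is an assumption of the sketch, not something this page guarantees.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of offset assignment; indices are plain names, not MG4J Index objects.
class PairOffsetSketch {
    // index name -> (term -> offset)
    private final Map<String, Map<String, Integer>> offsets = new LinkedHashMap<>();
    private int numberOfPairs = 0;

    // Returns the offset of the pair, assigning the next free offset the first time it is seen.
    int visit(String index, String term) {
        Map<String, Integer> byTerm = offsets.computeIfAbsent(index, k -> new LinkedHashMap<>());
        Integer known = byTerm.get(term);
        if (known != null) return known; // duplicate pair: reuse its offset
        byTerm.put(term, numberOfPairs);
        return numberOfPairs++;
    }

    // Number of distinct pairs seen so far.
    int numberOfPairs() {
        return numberOfPairs;
    }

    public static void main(String[] args) {
        PairOffsetSketch sketch = new PairOffsetSketch();
        System.out.println(sketch.visit("title", "cat")); // 0
        System.out.println(sketch.visit("text", "cat"));  // 1 (same term, different index)
        System.out.println(sketch.visit("title", "cat")); // 0 (duplicate pair)
        System.out.println(sketch.numberOfPairs());       // 2
    }
}
```

    The same term under two different indices counts as two pairs, while a repeated pair keeps the offset it was given first.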

    The indices appearing in some valid pair are recorded; they are accessible as a vector returned by indices(), and the map from positions in this vector to indices is inverted by indexMap().
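
    The relation between the two views can be sketched as follows, again without MG4J: the array plays the role of indices() and the map the role of indexMap(). Index names stand in for Index objects, a java.util.HashMap stands in for the Reference2IntMap, and IndexMapSketch.invert is a hypothetical helper.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: the map is the inverse of the position -> index array.
class IndexMapSketch {
    // Builds index -> position from a position -> index array.
    static Map<String, Integer> invert(String[] indices) {
        Map<String, Integer> indexMap = new HashMap<>();
        for (int i = 0; i < indices.length; i++) indexMap.put(indices[i], i);
        return indexMap;
    }

    public static void main(String[] args) {
        String[] indices = { "title", "text" };
        Map<String, Integer> indexMap = invert(indices);
        // For every position i, indexMap maps indices[i] back to i.
        for (int i = 0; i < indices.length; i++) {
            System.out.println(indices[i] + " -> " + indexMap.get(indices[i]));
        }
    }
}
```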

    If you need to fix the index map in advance, call prepare(ReferenceSet) with the desired indices. In that case, only terms associated with indices in the provided set will be collected.

    Warning: the semantics of prepare(ReferenceSet) described above was implemented in MG4J 4.0. Previously, the only effect of prepare(ReferenceSet) was to artificially add indices to the index set.
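
    The filtering implied by the MG4J 4.0 semantics can be sketched as below. This is a model under stated assumptions, not the library's code: leaves are given as index/term name pairs with a frequency each, the allowed set is a java.util.Set of names rather than a ReferenceSet<Index>, and RestrictedCollectionSketch.collect is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of restricted term collection; not MG4J code.
class RestrictedCollectionSketch {
    // Returns the distinct "index/term" pairs that would be collected.
    // leaves[i] = { index name, term }; frequencies[i] = frequency of that leaf.
    static List<String> collect(String[][] leaves, int[] frequencies, Set<String> allowed) {
        List<String> collected = new ArrayList<>();
        for (int i = 0; i < leaves.length; i++) {
            String index = leaves[i][0];
            String term = leaves[i][1];
            if (frequencies[i] == 0) continue; // skip empty iterators
            // An empty set means "collect all indices".
            if (!allowed.isEmpty() && !allowed.contains(index)) continue;
            String pair = index + "/" + term;
            if (!collected.contains(pair)) collected.add(pair);
        }
        return collected;
    }

    public static void main(String[] args) {
        String[][] leaves = { { "title", "cat" }, { "text", "cat" }, { "text", "dog" } };
        int[] frequencies = { 3, 5, 0 };
        System.out.println(collect(leaves, frequencies, Set.of()));       // [title/cat, text/cat]
        System.out.println(collect(leaves, frequencies, Set.of("text"))); // [text/cat]
    }
}
```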

    The offset assigned to each pair index/term is returned by offset(Index, String). Should you need to know the terms associated with each index, they are returned by terms(Index).

    After term collection, counters are usually set up by a visit of a CounterSetupVisitor.

    • Constructor Detail

      • TermCollectionVisitor

        public TermCollectionVisitor()
        Creates a new term-collection visitor.
    • Method Detail

      • prepare

        public TermCollectionVisitor prepare​(ReferenceSet<Index> indices)
        Prepares this term-collection visitor, possibly specifying the indices that should be collected.
        Parameters:
        indices - the set of indices that will be collected; if empty, all indices will be collected (i.e., the call is equivalent to prepare()).
        Returns:
        this term-collection visitor.
      • numberOfPairs

        public int numberOfPairs()
        Returns the number of distinct index/term pairs corresponding to nonzero-frequency index iterators in the last visit.
        Returns:
        the number of distinct index/term pairs corresponding to nonzero-frequency index iterators.
      • indices

        public Index[] indices()
        Returns the indices met during term collection.

        Note that the returned array does not include indices associated only with index iterators of zero frequency, unless prepare(ReferenceSet) was called with a nonempty argument.

        Returns:
        the indices met during term collection.
      • indexMap

        public Reference2IntMap<Index> indexMap()
        Returns a map from indices met during term collection to their position in indices().

        Note that the keys of the returned map do not include indices associated only with index iterators of zero frequency, unless prepare(ReferenceSet) was called with a nonempty argument.

        Returns:
        a map from indices met during term collection to their position in indices().
      • terms

        public String[] terms​(Index index)
        Returns the terms associated with the given index.
        Parameters:
        index - an index.
        Returns:
        the terms associated with index, in the same order in which they appeared during the visit, skipping duplicates, if some nonzero-frequency iterator based on index was found; null otherwise.
      • term2Id

        public Object2IntLinkedOpenHashMap<String> term2Id()
        Returns a map associating terms appearing in the query with ids.
        Returns:
        a map from terms appearing in the query (in indices with counts) to ids.
      • offset

        public int offset​(Index index,
                          String term)
        Returns the offset associated with a given pair index/term.
        Parameters:
        index - an index appearing in indices().
        term - a term appearing in the array returned by terms(Index) with argument index.
        Returns:
        the offset associated with the pair index/term.
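
        The contract documented for terms(Index) above (first-appearance order, duplicates skipped, null when no nonzero-frequency iterator is based on the index) can be modeled with a LinkedHashSet. This is only a sketch: index names stand in for Index objects, and TermsSketch.terms is a hypothetical helper, not the MG4J method.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the terms(Index) contract; not MG4J code.
class TermsSketch {
    // leaves[i] = { index name, term }; frequencies[i] = frequency of that leaf.
    static String[] terms(String index, String[][] leaves, int[] frequencies) {
        Set<String> seen = new LinkedHashSet<>(); // keeps first-appearance order, drops duplicates
        for (int i = 0; i < leaves.length; i++) {
            if (frequencies[i] != 0 && leaves[i][0].equals(index)) seen.add(leaves[i][1]);
        }
        // No nonzero-frequency iterator based on this index: return null.
        return seen.isEmpty() ? null : seen.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[][] leaves = { { "title", "cat" }, { "text", "cat" }, { "text", "dog" }, { "text", "cat" } };
        int[] frequencies = { 3, 5, 2, 1 };
        System.out.println(String.join(",", terms("text", leaves, frequencies))); // cat,dog
        System.out.println(terms("anchor", leaves, frequencies));                 // null
    }
}
```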