it.unimi.di.mg4j.search.visitor
Class TermCollectionVisitor

java.lang.Object
  extended by it.unimi.di.mg4j.search.visitor.AbstractDocumentIteratorVisitor
      extended by it.unimi.di.mg4j.search.visitor.TermCollectionVisitor
All Implemented Interfaces:
DocumentIteratorVisitor<Boolean>

public class TermCollectionVisitor
extends AbstractDocumentIteratorVisitor

A visitor collecting information about terms appearing in a DocumentIterator.

This visitor explores the structure of a DocumentIterator before iteration, to count how many terms are actually used and to set up some preliminary access data. More precisely, it counts the distinct index/term pairs appearing in all leaves of nonzero frequency (the latter condition skips empty iterators), possibly considering just a subset of indices. For this visitor to work, all leaves of nonzero frequency must return a non-null value on a call to IndexIterator.term().

During the visit, we keep track of which index/term pairs have already been seen. Each pair is assigned a distinct offset (a number between zero, inclusive, and the overall number of distinct pairs, exclusive), which is stored in the id of each index iterator and is used afterwards to quickly access data about the pair. Note that duplicate index/term pairs get the same offset. The overall number of distinct pairs is returned by numberOfPairs() after a visit.
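
The offset assignment described above can be modeled in a few lines of plain Java. The following is only a sketch: PairOffsetSketch, visitLeaf and the use of index names as strings are hypothetical stand-ins, not MG4J classes or methods, and returning -1 for a zero-frequency leaf is a convention of this sketch alone.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the offset bookkeeping; it does not use MG4J classes. */
final class PairOffsetSketch {
    /** Maps each (index name, term) pair to its offset, in order of first appearance. */
    private final Map<List<String>, Integer> pairToOffset = new LinkedHashMap<>();

    /** Records a leaf and returns its offset; zero-frequency leaves are skipped and yield -1. */
    int visitLeaf(String indexName, String term, long frequency) {
        if (frequency == 0) return -1; // empty iterator: not counted (-1 is this sketch's convention)
        final List<String> pair = Arrays.asList(indexName, term);
        Integer offset = pairToOffset.get(pair);
        if (offset == null) {
            offset = pairToOffset.size(); // next unused offset
            pairToOffset.put(pair, offset);
        }
        return offset;
    }

    /** Returns the number of distinct pairs seen so far. */
    int numberOfPairs() {
        return pairToOffset.size();
    }

    public static void main(String[] args) {
        final PairOffsetSketch sketch = new PairOffsetSketch();
        System.out.println(sketch.visitLeaf("title", "cat", 10)); // prints 0
        System.out.println(sketch.visitLeaf("text", "cat", 42));  // prints 1
        System.out.println(sketch.visitLeaf("title", "cat", 10)); // prints 0 (duplicate pair)
        System.out.println(sketch.visitLeaf("text", "dog", 0));   // prints -1 (zero frequency)
        System.out.println(sketch.numberOfPairs());               // prints 2
    }
}
```

In this model the same term in two different indices yields two pairs, while a repeated pair reuses its offset, which matches the behaviour stated in the paragraph above.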

The indices appearing in some valid pair are recorded; they are accessible as a vector returned by indices(), and indexMap() provides the inverse mapping, from indices to their positions in this vector.
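
The relationship between a position vector and its inverse map can be illustrated as follows. This is a sketch under stated assumptions: IndexMapSketch, record and positionOf are hypothetical names, plain Object instances stand in for Index, and java.util.IdentityHashMap is used only to mimic the reference-based key comparison suggested by the Reference2IntMap return type.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a position vector and its inverse; it does not use MG4J classes. */
final class IndexMapSketch {
    private final List<Object> indices = new ArrayList<>();
    // Keys are compared by reference (==), not by equals().
    private final Map<Object, Integer> indexMap = new IdentityHashMap<>();

    /** Records an index the first time it is met and returns its position. */
    int record(Object index) {
        Integer position = indexMap.get(index);
        if (position == null) {
            position = indices.size();
            indices.add(index);
            indexMap.put(index, position);
        }
        return position;
    }

    /** Returns the recorded indices, in order of first appearance. */
    Object[] indices() {
        return indices.toArray();
    }

    /** Returns the position of the given index, or -1 if it was never recorded. */
    int positionOf(Object index) {
        final Integer position = indexMap.get(index);
        return position == null ? -1 : position;
    }

    public static void main(String[] args) {
        final IndexMapSketch sketch = new IndexMapSketch();
        final Object title = new Object();
        final Object text = new Object();
        sketch.record(title);
        sketch.record(text);
        sketch.record(title); // already recorded: position unchanged
        final Object[] vector = sketch.indices();
        for (int i = 0; i < vector.length; i++) {
            System.out.println(sketch.positionOf(vector[i]) == i); // prints true
        }
    }
}
```

The invariant shown in main (looking up the element at position i gives back i) is what it means for the map to invert the vector.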

If you need to fix the index map in advance, use the special prepare(ReferenceSet) method. In that case, only terms associated with indices in the provided set will be collected.

Warning: the semantics of prepare(ReferenceSet) described above was introduced in MG4J 4.0. Previously, the only effect of prepare(ReferenceSet) was to artificially add indices to the index set.
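
The filtering semantics (MG4J 4.0 and later) can be sketched as below. The class RestrictedCollectionSketch and its methods are hypothetical and use index names as strings rather than MG4J Index objects; the sketch only models which pairs are kept, with an empty set meaning no restriction.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of restricting collection to a set of indices; it does not use MG4J classes. */
final class RestrictedCollectionSketch {
    private final Set<String> allowed; // empty means "collect from every index"
    private final Set<List<String>> pairs = new LinkedHashSet<>();

    RestrictedCollectionSketch(Set<String> allowed) {
        this.allowed = allowed;
    }

    /** Collects the pair if its index passes the filter; returns whether the index passed. */
    boolean visitLeaf(String indexName, String term) {
        if (!allowed.isEmpty() && !allowed.contains(indexName)) return false;
        pairs.add(Arrays.asList(indexName, term)); // duplicates are absorbed by the set
        return true;
    }

    /** Returns the number of distinct pairs collected. */
    int numberOfPairs() {
        return pairs.size();
    }

    public static void main(String[] args) {
        final RestrictedCollectionSketch restricted =
                new RestrictedCollectionSketch(new HashSet<>(Arrays.asList("title")));
        restricted.visitLeaf("title", "cat");
        restricted.visitLeaf("text", "cat"); // ignored: "text" is not in the set
        System.out.println(restricted.numberOfPairs()); // prints 1

        final RestrictedCollectionSketch unrestricted =
                new RestrictedCollectionSketch(Collections.<String>emptySet());
        unrestricted.visitLeaf("title", "cat");
        unrestricted.visitLeaf("text", "cat");
        System.out.println(unrestricted.numberOfPairs()); // prints 2
    }
}
```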

The offset assigned to each pair index/term is returned by offset(Index, String). Should you need to know the terms associated with each index, they are returned by terms(Index).

After term collection, counters are usually set up by a visit of CounterSetupVisitor.


Constructor Summary
TermCollectionVisitor()
          Creates a new term-collection visitor.
 
Method Summary
 Reference2IntMap<Index> indexMap()
          Returns a map from indices met during term collection to their position in indices().
 Index[] indices()
          Returns the indices met during pair collection.
 int numberOfPairs()
          Returns the number of distinct index/term pairs corresponding to nonzero-frequency index iterators in the last visit.
 int offset(Index index, String term)
          Returns the offset associated with a given pair index/term.
 TermCollectionVisitor prepare()
          Prepares this term-collection visitor.
 TermCollectionVisitor prepare(ReferenceSet<Index> indices)
          Prepares this term-collection visitor, possibly specifying the indices that should be collected.
 Object2IntLinkedOpenHashMap<String> term2Id()
          Returns a map associating terms appearing in the query with ids.
 String[] terms(Index index)
          Returns the terms associated with the given index.
 String toString()
           
 Boolean visit(IndexIterator indexIterator)
          Visits an IndexIterator leaf.
 
Methods inherited from class it.unimi.di.mg4j.search.visitor.AbstractDocumentIteratorVisitor
newArray, visit, visit, visit, visitPost, visitPre
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TermCollectionVisitor

public TermCollectionVisitor()
Creates a new term-collection visitor.

Method Detail

prepare

public TermCollectionVisitor prepare()
Prepares this term-collection visitor.

Specified by:
prepare in interface DocumentIteratorVisitor<Boolean>
Overrides:
prepare in class AbstractDocumentIteratorVisitor
Returns:
this term-collection visitor.

prepare

public TermCollectionVisitor prepare(ReferenceSet<Index> indices)
Prepares this term-collection visitor, possibly specifying the indices that should be collected.

Parameters:
indices - the set of indices that will be collected; if empty, all indices will be collected (i.e., the call is equivalent to prepare()).
Returns:
this term-collection visitor.

visit

public Boolean visit(IndexIterator indexIterator)
              throws IOException
Description copied from interface: DocumentIteratorVisitor
Visits an IndexIterator leaf.

Parameters:
indexIterator - the leaf to be visited.
Returns:
an appropriate return value if the visit should continue, or null.
Throws:
IOException

numberOfPairs

public int numberOfPairs()
Returns the number of distinct index/term pairs corresponding to nonzero-frequency index iterators in the last visit.

Returns:
the number of distinct index/term pairs corresponding to nonzero-frequency index iterators.

indices

public Index[] indices()
Returns the indices met during pair collection.

Note that the returned array does not include indices associated only with index iterators of zero frequency, unless prepare(ReferenceSet) was called with a nonempty argument.

Returns:
the indices met during term collection.

indexMap

public Reference2IntMap<Index> indexMap()
Returns a map from indices met during term collection to their position in indices().

Note that the keys of the returned map do not include indices associated only with index iterators of zero frequency, unless prepare(ReferenceSet) was called with a nonempty argument.

Returns:
a map from indices met during term collection to their position in indices().

terms

public String[] terms(Index index)
Returns the terms associated with the given index.

Parameters:
index - an index.
Returns:
the terms associated with index, in order of first appearance during the visit and without duplicates, if some nonzero-frequency iterator based on index was found; null otherwise.
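
The return contract above (first-appearance order, no duplicates, null when no nonzero-frequency iterator was found for the index) can be modeled with an insertion-ordered set. This is a minimal sketch: TermsPerIndexSketch and visitLeaf are hypothetical names, and index names as strings stand in for Index objects.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of per-index term lists; it does not use MG4J classes. */
final class TermsPerIndexSketch {
    // LinkedHashSet keeps first-appearance order and drops duplicates.
    private final Map<String, Set<String>> termsByIndex = new LinkedHashMap<>();

    /** Records the term for the given index, skipping zero-frequency leaves. */
    void visitLeaf(String indexName, String term, long frequency) {
        if (frequency == 0) return;
        Set<String> terms = termsByIndex.get(indexName);
        if (terms == null) {
            terms = new LinkedHashSet<>();
            termsByIndex.put(indexName, terms);
        }
        terms.add(term);
    }

    /** Returns the terms recorded for the index, or null if none was recorded. */
    String[] terms(String indexName) {
        final Set<String> terms = termsByIndex.get(indexName);
        return terms == null ? null : terms.toArray(new String[0]);
    }

    public static void main(String[] args) {
        final TermsPerIndexSketch sketch = new TermsPerIndexSketch();
        sketch.visitLeaf("text", "cat", 5);
        sketch.visitLeaf("text", "dog", 3);
        sketch.visitLeaf("text", "cat", 5);   // duplicate: ignored
        sketch.visitLeaf("anchor", "cat", 0); // zero frequency: ignored
        System.out.println(Arrays.toString(sketch.terms("text")));   // prints [cat, dog]
        System.out.println(Arrays.toString(sketch.terms("anchor"))); // prints null
    }
}
```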

term2Id

public Object2IntLinkedOpenHashMap<String> term2Id()
Returns a map associating terms appearing in the query with ids.

Returns:
a map from terms appearing in the query (in indices with counts) to ids.

offset

public int offset(Index index,
                  String term)
Returns the offset associated with a given pair index/term.

Parameters:
index - an index appearing in indices().
term - a term appearing in the array returned by terms(Index) with argument index.
Returns:
the offset associated with the pair index/term.

toString

public String toString()
Overrides:
toString in class Object