it.unimi.di.mg4j.search.score
Class BM25FScorer

java.lang.Object
  extended by it.unimi.di.mg4j.search.score.AbstractScorer
      extended by it.unimi.di.mg4j.search.score.AbstractWeightedScorer
          extended by it.unimi.di.mg4j.search.score.BM25FScorer
All Implemented Interfaces:
DelegatingScorer, Scorer, FlyweightPrototype<Scorer>

public class BM25FScorer
extends AbstractWeightedScorer
implements DelegatingScorer

A scorer that implements the BM25F ranking scheme.

BM25F is an evolution of BM25 described by Stephen Robertson, Hugo Zaragoza and Michael Taylor in “Simple BM25 extension to multiple weighted fields”, CIKM '04: Proceedings of the thirteenth ACM international Conference on Information and Knowledge Management, pages 42−49, ACM Press, 2004.

The idea behind BM25F is that adding up (albeit with weights) BM25 scores from different fields breaks down the nonlinearity of BM25. Instead, we should work on a virtual document collection: more precisely, we should behave as if all fields were concatenated in a single stream of text. For instance, if weights are integers, the formula behaves as if the text of each field is concatenated as many times as its weight to form a global text, which is then scored using BM25.

Note that, for this to happen, we would need to know the corresponding frequency—that is, for each term, the number of documents in which the term appears in at least one of the fields. This number must be provided at construction time: more precisely, you must specify a StringMap that maps each term appearing in some field to an index into a LongList containing the correct frequencies. These data is accessed only in the preparatory phase, so access can be reasonably slow.

Important: the only source of knowledge about the overall set of indices involved in query resolution is given by calls to AbstractWeightedScorer.setWeights(it.unimi.dsi.fastutil.objects.Reference2DoubleMap). That is, this scorer will assume that all indices appearing in a query are also keys of the weight function passed to AbstractWeightedScorer.setWeights(it.unimi.dsi.fastutil.objects.Reference2DoubleMap). An exception will be raised if these guidelines are not followed.

Computing frequency data

The tool Paste can be used to create the metadata of the virtual collection. To do so, simply run Paste on the indices of all fields over which you want to compute BM25F with the --metadata-only option. The resulting frequency file is what you need to pass to the constructor, and from the term file you can build a StringMap (e.g., using an ImmutableExternalPrefixMap) that will be used to index the frequencies.

Boldi's variant

Providing global frequency data makes it possible to compute the classical BM25F formula. If no frequency data is provided, this class implements Paolo Boldi's variant of BM25. In this case, we multiply the IDF score by the weighted count of each term to form the virtual count that will be passed through BM25's nonlinear function.

Using this scorer

This scorer assigns to each pair index/term reachable by true paths a score that depends on the virtual count of the term, which is the count of the term for the given index multiplied by the weight of the index. To obtain the “classical” BM25F score you must write a query q that contains no index selector and multiplexes it on all indices, e.g., a:q | b:q | c:q. If a term appears only in some specific index/query pair, its score will be computed using a smaller virtual count, obtained just by adding up the values associated with the actually present index/query pairs. Usually, the simplest way to obtain this result is to use a MultiIndexTermExpander, which can be even set from the command-line interface provided by Query.

Correctness

The code in this scorer is verified by unit tests developed jointly with Hugo Zaragoza. This is an important point, as the definition of BM25F contains many subtleties.

Author:
Sebastiano Vigna
See Also:
BM25Scorer

Nested Class Summary
protected static class BM25FScorer.Visitor
           
 
Field Summary
 Reference2DoubleMap<Index> bByIndex
          The parameter b; you must provide one value for each index.
static double DEFAULT_B
          The default value used for the parameter b.
static double DEFAULT_K1
          The default value used for the parameter k1.
static double EPSILON_SCORE
          The value of the document-frequency part for terms appearing in more than half of the documents.
 double k1
          The parameter k1.
 
Fields inherited from class it.unimi.di.mg4j.search.score.AbstractWeightedScorer
index2Weight
 
Fields inherited from class it.unimi.di.mg4j.search.score.AbstractScorer
documentIterator, indexIterator
 
Constructor Summary
BM25FScorer(double k1, Reference2DoubleMap<Index> b)
          Creates a BM25F scorer using Boldi's variant (frequencies are not needed).
BM25FScorer(double k1, Reference2DoubleMap<Index> b, StringMap<? extends CharSequence> termMap, LongList frequencies)
          Creates a BM25F scorer.
BM25FScorer(double k1, StringMap<? extends CharSequence> termMap, LongList frequencies, Object2DoubleMap<String> b)
          Creates a BM25F scorer.
BM25FScorer(String... arg)
          Creates a BM25F scorer using parameters specified by strings.
 
Method Summary
 BM25FScorer copy()
           
 double score()
          Computes a score by calling Scorer.score(Index) for each index in the current document iterator, and adding the weighted results.
 double score(Index index)
          Returns a score for the current document of the last document iterator given to Scorer.wrap(DocumentIterator), but considering only a given index (optional operation).
 boolean usesIntervals()
          Whether this scorer uses intervals.
 void wrap(DocumentIterator d)
          Wraps the given document iterator.
 
Methods inherited from class it.unimi.di.mg4j.search.score.AbstractWeightedScorer
getWeights, setWeights
 
Methods inherited from class it.unimi.di.mg4j.search.score.AbstractScorer
nextDocument
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface it.unimi.di.mg4j.search.score.Scorer
getWeights, nextDocument, setWeights
 

Field Detail

DEFAULT_K1

public static final double DEFAULT_K1
The default value used for the parameter k1.

See Also:
Constant Field Values

DEFAULT_B

public static final double DEFAULT_B
The default value used for the parameter b.

See Also:
Constant Field Values

EPSILON_SCORE

public static final double EPSILON_SCORE
The value of the document-frequency part for terms appearing in more than half of the documents.

See Also:
Constant Field Values

k1

public final double k1
The parameter k1.


bByIndex

public final Reference2DoubleMap<Index> bByIndex
The parameter b; you must provide one value for each index.

Constructor Detail

BM25FScorer

public BM25FScorer(double k1,
                   Reference2DoubleMap<Index> b,
                   StringMap<? extends CharSequence> termMap,
                   LongList frequencies)
Creates a BM25F scorer.

Parameters:
k1 - the k1 parameter.
b - the b parameter, specified as a map from indices to values.
termMap - a map from terms to positions in frequencies, or null if frequencies is null.
frequencies - the frequencies, or null for Boldi's variant.

BM25FScorer

public BM25FScorer(double k1,
                   StringMap<? extends CharSequence> termMap,
                   LongList frequencies,
                   Object2DoubleMap<String> b)
Creates a BM25F scorer.

This constructor exists to provide a typed counterpart to the string-based constructor (mainly for documentation purposes).

Parameters:
k1 - the k1 parameter.
termMap - a map from terms to positions in frequencies, or null if frequencies is null.
frequencies - the frequencies, or null for Boldi's variant.
b - the b parameter, specified as a map from indices to values.

BM25FScorer

public BM25FScorer(double k1,
                   Reference2DoubleMap<Index> b)
Creates a BM25F scorer using Boldi's variant (frequencies are not needed).

Parameters:
k1 - the k1 parameter.
b - the b parameter, specified as a map from indices to values.

BM25FScorer

public BM25FScorer(String... arg)
            throws NumberFormatException,
                   FileNotFoundException,
                   IOException,
                   ClassNotFoundException
Creates a BM25F scorer using parameters specified by strings.

This constructor has string parameters that correspond to the arguments of BM25FScorer(double, StringMap, LongList, Object2DoubleMap). The two middle arguments can be omitted by specifying them as empty. The last argument is represented by a number of assignments index=b, separated by commas (as if they were multiple arguments), which will be compacted into a function representing the values of b.

Throws:
NumberFormatException
FileNotFoundException
IOException
ClassNotFoundException
Method Detail

copy

public BM25FScorer copy()
Specified by:
copy in interface DelegatingScorer
Specified by:
copy in interface Scorer
Specified by:
copy in interface FlyweightPrototype<Scorer>

score

public double score()
             throws IOException
Description copied from class: AbstractWeightedScorer
Computes a score by calling Scorer.score(Index) for each index in the current document iterator, and adding the weighted results.

Specified by:
score in interface Scorer
Overrides:
score in class AbstractWeightedScorer
Returns:
the combined weighted score.
Throws:
IOException

score

public double score(Index index)
Description copied from interface: Scorer
Returns a score for the current document of the last document iterator given to Scorer.wrap(DocumentIterator), but considering only a given index (optional operation).

Specified by:
score in interface Scorer
Parameters:
index - the only index to be considered.
Returns:
the score.

wrap

public void wrap(DocumentIterator d)
          throws IOException
Description copied from class: AbstractScorer
Wraps the given document iterator.

This method records internally the provided iterator.

Specified by:
wrap in interface Scorer
Overrides:
wrap in class AbstractWeightedScorer
Parameters:
d - the document iterator that will be used in subsequent calls to Scorer.score() and Scorer.score(Index).
Throws:
IOException

usesIntervals

public boolean usesIntervals()
Description copied from interface: Scorer
Whether this scorer uses intervals.

This method is essential when aggregating scorers, because if several scores need intervals, a CachingDocumentIterator will be necessary.

Specified by:
usesIntervals in interface Scorer
Returns:
true if this scorer uses intervals.