Class BM25FScorer

  • All Implemented Interfaces:
    DelegatingScorer, Scorer, FlyweightPrototype<Scorer>

    public class BM25FScorer
    extends AbstractWeightedScorer
    implements DelegatingScorer
    A scorer that implements the BM25F ranking scheme.

    BM25F is an evolution of BM25 described by Stephen Robertson, Hugo Zaragoza and Michael Taylor in “Simple BM25 extension to multiple weighted fields”, CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 42–49, ACM Press, 2004.

    The idea behind BM25F is that adding up (albeit with weights) BM25 scores from different fields breaks down the nonlinearity of BM25. Instead, we should work on a virtual document collection: more precisely, we should behave as if all fields were concatenated in a single stream of text. For instance, if weights are integers, the formula behaves as if the text of each field is concatenated as many times as its weight to form a global text, which is then scored using BM25.
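    The difference between the two approaches can be illustrated with a minimal sketch. It is not the code of this class: the class and method names are hypothetical, and the saturation function is a simplified BM25 term (count / (count + k1)) that leaves out IDF and length normalization.

```java
public class VirtualCountSketch {
    // Simplified BM25 saturation: grows with the count but flattens out.
    // Assumption: IDF and document-length normalization are omitted.
    static double saturate(double count, double k1) {
        return count / (count + k1);
    }

    // Weighted sum of per-field saturated counts: each field saturates on its own.
    static double sumOfFieldScores(double[] counts, double[] weights, double k1) {
        double sum = 0;
        for (int i = 0; i < counts.length; i++) {
            sum += weights[i] * saturate(counts[i], k1);
        }
        return sum;
    }

    // BM25F-style: combine the weighted counts first, then saturate once.
    static double virtualCountScore(double[] counts, double[] weights, double k1) {
        double virtualCount = 0;
        for (int i = 0; i < counts.length; i++) {
            virtualCount += weights[i] * counts[i];
        }
        return saturate(virtualCount, k1);
    }

    public static void main(String[] args) {
        double[] counts = { 3, 1 };   // occurrences of the term in two fields
        double[] weights = { 2, 1 };  // integer weights: first field "repeated" twice
        double k1 = 1.2;
        System.out.println("sum of field scores: " + sumOfFieldScores(counts, weights, k1));
        System.out.println("virtual-count score: " + virtualCountScore(counts, weights, k1));
    }
}
```

    With weights 2 and 1 and counts 3 and 1, the virtual count is 7, exactly the count the term would have if the first field were concatenated twice with the second. The single saturation keeps the result below 1, whereas the weighted sum of per-field scores does not.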

    Note that, for this to happen, we need to know the frequency of each term in the virtual collection, that is, the number of documents in which the term appears in at least one of the fields. This number must be provided at construction time: more precisely, you must specify a StringMap that maps each term appearing in some field to an index into a LongList containing the correct frequencies. These data are accessed only in the preparatory phase, so access can be reasonably slow.
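    The two-step lookup (term to index, index to frequency) can be sketched with standard collections. This is only an illustration: a HashMap stands in for the StringMap and a long array for the LongList, and the class name and the -1 convention for unknown terms are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class GlobalFrequencySketch {
    // Stand-in for the StringMap: term -> position in the frequency list.
    private final Map<String, Integer> termToIndex = new HashMap<>();
    // Stand-in for the LongList: frequency of each term in the virtual collection.
    private final long[] frequencies;

    GlobalFrequencySketch(String[] terms, long[] frequencies) {
        if (terms.length != frequencies.length) {
            throw new IllegalArgumentException("terms and frequencies must be parallel");
        }
        for (int i = 0; i < terms.length; i++) {
            termToIndex.put(terms[i], i);
        }
        this.frequencies = frequencies.clone();
    }

    // Number of documents containing the term in at least one field,
    // or -1 if the term is unknown (a convention chosen for this sketch).
    long frequency(String term) {
        Integer index = termToIndex.get(term);
        return index == null ? -1 : frequencies[index];
    }

    public static void main(String[] args) {
        GlobalFrequencySketch data = new GlobalFrequencySketch(
            new String[] { "apple", "banana" }, new long[] { 120, 7 });
        System.out.println(data.frequency("banana"));
    }
}
```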

    Important: the only source of knowledge about the overall set of indices involved in query resolution is the weight function passed to AbstractWeightedScorer.setWeights(it.unimi.dsi.fastutil.objects.Reference2DoubleMap). That is, this scorer assumes that every index appearing in a query is also a key of that weight function. An exception will be raised if this is not the case.

    Computing frequency data

    The tool Paste can be used to create the metadata of the virtual collection. To do so, run Paste with the --metadata-only option on the indices of all fields over which you want to compute BM25F. The resulting frequency file is what you need to pass to the constructor, and from the term file you can build a StringMap (e.g., using an ImmutableExternalPrefixMap) that will be used to index the frequencies.

    Boldi's variant

    Providing global frequency data makes it possible to compute the classical BM25F formula. If no frequency data is provided, this class implements Paolo Boldi's variant of BM25F. In this case, we multiply the IDF score by the weighted count of each term to form the virtual count that will be passed through BM25's nonlinear function.
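    A rough sketch of the variant follows. It is not the code of this class: the names are hypothetical, length normalization is omitted, and the use of a separate IDF value for each index is an assumption, since the description above does not say which IDF is used.

```java
public class BoldiVariantSketch {
    // Simplified BM25 nonlinear function (length normalization omitted).
    static double saturate(double virtualCount, double k1) {
        return virtualCount / (virtualCount + k1);
    }

    // Assumption: each index contributes idf[i] * weights[i] * counts[i]
    // to the virtual count, which is then saturated once.
    static double score(double[] idf, double[] counts, double[] weights, double k1) {
        double virtualCount = 0;
        for (int i = 0; i < counts.length; i++) {
            virtualCount += idf[i] * weights[i] * counts[i];
        }
        return saturate(virtualCount, k1);
    }

    public static void main(String[] args) {
        double[] idf = { 2.0, 1.0 };   // hypothetical per-index IDF values
        double[] counts = { 3, 1 };    // occurrences of the term in each index
        double[] weights = { 2, 1 };   // index weights
        System.out.println(score(idf, counts, weights, 1.2));
    }
}
```

    The point of the sketch is the order of operations: the IDF enters the virtual count before the nonlinear function is applied, rather than multiplying the saturated value afterwards.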

    Using this scorer

    This scorer assigns to each index/term pair reachable by true paths a score that depends on the virtual count of the term, which is the count of the term for the given index multiplied by the weight of the index. To obtain the “classical” BM25F score you must write a query q that contains no index selector and multiplex it on all indices, e.g., a:q | b:q | c:q. If a term appears only in some specific index/query pairs, its score will be computed using a smaller virtual count, obtained just by adding up the values associated with the index/query pairs that are actually present. Usually, the simplest way to obtain this result is to use a MultiIndexTermExpander, which can even be set from the command-line interface provided by Query.
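    The multiplexed form a:q | b:q | c:q can be built mechanically. The helper below is a hypothetical sketch for a single-term query written as plain text. It only produces the query string and is not a substitute for MultiIndexTermExpander.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MultiplexSketch {
    // Prefixes a single term with each index alias and joins the results with OR.
    static String multiplex(String term, List<String> indexAliases) {
        return indexAliases.stream()
            .map(alias -> alias + ":" + term)
            .collect(Collectors.joining(" | "));
    }

    public static void main(String[] args) {
        System.out.println(multiplex("q", Arrays.asList("a", "b", "c")));
    }
}
```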

    Correctness

    The code in this scorer is verified by unit tests developed jointly with Hugo Zaragoza. This is an important point, as the definition of BM25F contains many subtleties.

    Author:
    Sebastiano Vigna
    See Also:
    BM25Scorer