Class BM25FScorer
- java.lang.Object
-
- it.unimi.di.big.mg4j.search.score.AbstractScorer
-
- it.unimi.di.big.mg4j.search.score.AbstractWeightedScorer
-
- it.unimi.di.big.mg4j.search.score.BM25FScorer
-
- All Implemented Interfaces:
DelegatingScorer
,Scorer
,FlyweightPrototype<Scorer>
public class BM25FScorer extends AbstractWeightedScorer implements DelegatingScorer
A scorer that implements the BM25F ranking scheme.BM25F is an evolution of BM25 described by Stephen Robertson, Hugo Zaragoza and Michael Taylor in “Simple BM25 extension to multiple weighted fields”, CIKM '04: Proceedings of the thirteenth ACM international Conference on Information and Knowledge Management, pages 42−49, ACM Press, 2004.
The idea behind BM25F is that adding up (albeit with weights) BM25 scores from different fields breaks down the nonlinearity of BM25. Instead, we should work on a virtual document collection: more precisely, we should behave as if all fields were concatenated in a single stream of text. For instance, if weights are integers, the formula behaves as if the text of each field is concatenated as many times as its weight to form a global text, which is then scored using BM25.
Note that, for this to happen, we would need to know the corresponding frequency—that is, for each term, the number of documents in which the term appears in at least one of the fields. This number must be provided at construction time: more precisely, you must specify a
StringMap
that maps each term appearing in some field to an index into aLongList
containing the correct frequencies. These data is accessed only in the preparatory phase, so access can be reasonably slow.Important: the only source of knowledge about the overall set of indices involved in query resolution is given by calls to
AbstractWeightedScorer.setWeights(it.unimi.dsi.fastutil.objects.Reference2DoubleMap)
. That is, this scorer will assume that all indices appearing in a query are also keys of the weight function passed toAbstractWeightedScorer.setWeights(it.unimi.dsi.fastutil.objects.Reference2DoubleMap)
. An exception will be raised if these guidelines are not followed.Computing frequency data
The tool
Paste
can be used to create the metadata of the virtual collection. To do so, simply runPaste
on the indices of all fields over which you want to compute BM25F with the --metadata-only option. The resulting frequency file is what you need to pass to the constructor, and from the term file you can build aStringMap
(e.g., using anImmutableExternalPrefixMap
) that will be used to index the frequencies.Boldi's variant
Providing global frequency data makes it possible to compute the classical BM25F formula. If no frequency data is provided, this class implements Paolo Boldi's variant of BM25. In this case, we multiply the IDF score by the weighted count of each term to form the virtual count that will be passed through BM25's nonlinear function.
Using this scorer
This scorer assigns to each pair index/term reachable by true paths a score that depends on the virtual count of the term, which is the count of the term for the given index multiplied by the weight of the index. To obtain the “classical” BM25F score you must write a query q that contains no index selector and multiplexes it on all indices, e.g., a:q | b:q | c:q. If a term appears only in some specific index/query pair, its score will be computed using a smaller virtual count, obtained just by adding up the values associated with the actually present index/query pairs. Usually, the simplest way to obtain this result is to use a
MultiIndexTermExpander
, which can be even set from the command-line interface provided byQuery
.Correctness
The code in this scorer is verified by unit tests developed jointly with Hugo Zaragoza. This is an important point, as the definition of BM25F contains many subtleties.
- Author:
- Sebastiano Vigna
- See Also:
BM25Scorer
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description protected static class
BM25FScorer.Visitor
-
Field Summary
Fields Modifier and Type Field Description Reference2DoubleMap<Index>
bByIndex
The parameter b; you must provide one value for each index.static double
DEFAULT_B
The default value used for the parameter b.static double
DEFAULT_K1
The default value used for the parameter k1.static double
EPSILON_SCORE
The value of the document-frequency part for terms appearing in more than half of the documents.double
k1
The parameter k1.-
Fields inherited from class it.unimi.di.big.mg4j.search.score.AbstractWeightedScorer
index2Weight
-
Fields inherited from class it.unimi.di.big.mg4j.search.score.AbstractScorer
documentIterator, indexIterator
-
-
Constructor Summary
Constructors Constructor Description BM25FScorer(double k1, StringMap<? extends CharSequence> termMap, LongBigList frequencies, Object2DoubleMap<String> b)
Creates a BM25F scorer.BM25FScorer(double k1, Reference2DoubleMap<Index> b)
Creates a BM25F scorer using Boldi's variant (frequencies are not needed).BM25FScorer(double k1, Reference2DoubleMap<Index> b, StringMap<? extends CharSequence> termMap, LongBigList frequencies)
Creates a BM25F scorer.BM25FScorer(String... arg)
Creates a BM25F scorer using parameters specified by strings.
-
Method Summary
Modifier and Type Method Description BM25FScorer
copy()
double
score()
Computes a score by callingScorer.score(Index)
for each index in the current document iterator, and adding the weighted results.double
score(Index index)
Returns a score for the current document of the last document iterator given toScorer.wrap(DocumentIterator)
, but considering only a given index (optional operation).boolean
usesIntervals()
Whether this scorer uses intervals.void
wrap(DocumentIterator d)
Wraps the given document iterator.-
Methods inherited from class it.unimi.di.big.mg4j.search.score.AbstractWeightedScorer
getWeights, setWeights
-
Methods inherited from class it.unimi.di.big.mg4j.search.score.AbstractScorer
nextDocument
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface it.unimi.di.big.mg4j.search.score.Scorer
getWeights, nextDocument, setWeights
-
-
-
-
Field Detail
-
DEFAULT_K1
public static final double DEFAULT_K1
The default value used for the parameter k1.- See Also:
- Constant Field Values
-
DEFAULT_B
public static final double DEFAULT_B
The default value used for the parameter b.- See Also:
- Constant Field Values
-
EPSILON_SCORE
public static final double EPSILON_SCORE
The value of the document-frequency part for terms appearing in more than half of the documents.- See Also:
- Constant Field Values
-
k1
public final double k1
The parameter k1.
-
bByIndex
public final Reference2DoubleMap<Index> bByIndex
The parameter b; you must provide one value for each index.
-
-
Constructor Detail
-
BM25FScorer
public BM25FScorer(double k1, Reference2DoubleMap<Index> b, StringMap<? extends CharSequence> termMap, LongBigList frequencies)
Creates a BM25F scorer.- Parameters:
k1
- the k1 parameter.b
- the b parameter, specified as a map from indices to values.termMap
- a map from terms to positions infrequencies
, ornull
iffrequencies
isnull
.frequencies
- the frequencies, ornull
for Boldi's variant.
-
BM25FScorer
public BM25FScorer(double k1, StringMap<? extends CharSequence> termMap, LongBigList frequencies, Object2DoubleMap<String> b)
Creates a BM25F scorer.This constructor exists to provide a typed counterpart to the string-based constructor (mainly for documentation purposes).
- Parameters:
k1
- the k1 parameter.termMap
- a map from terms to positions infrequencies
, ornull
iffrequencies
isnull
.frequencies
- the frequencies, ornull
for Boldi's variant.b
- the b parameter, specified as a map from indices to values.
-
BM25FScorer
public BM25FScorer(double k1, Reference2DoubleMap<Index> b)
Creates a BM25F scorer using Boldi's variant (frequencies are not needed).- Parameters:
k1
- the k1 parameter.b
- the b parameter, specified as a map from indices to values.
-
BM25FScorer
public BM25FScorer(String... arg) throws NumberFormatException, FileNotFoundException, IOException, ClassNotFoundException
Creates a BM25F scorer using parameters specified by strings.This constructor has string parameters that correspond to the arguments of
BM25FScorer(double, StringMap, LongBigList, Object2DoubleMap)
. The two middle arguments can be omitted by specifying them as empty. The last argument is represented by a number of assignments index=b, separated by commas (as if they were multiple arguments), which will be compacted into a function representing the values of b.
-
-
Method Detail
-
copy
public BM25FScorer copy()
- Specified by:
copy
in interfaceFlyweightPrototype<Scorer>
- Specified by:
copy
in interfaceScorer
-
score
public double score() throws IOException
Description copied from class:AbstractWeightedScorer
Computes a score by callingScorer.score(Index)
for each index in the current document iterator, and adding the weighted results.- Specified by:
score
in interfaceScorer
- Overrides:
score
in classAbstractWeightedScorer
- Returns:
- the combined weighted score.
- Throws:
IOException
-
score
public double score(Index index)
Description copied from interface:Scorer
Returns a score for the current document of the last document iterator given toScorer.wrap(DocumentIterator)
, but considering only a given index (optional operation).
-
wrap
public void wrap(DocumentIterator d) throws IOException
Description copied from class:AbstractScorer
Wraps the given document iterator.This method records internally the provided iterator.
- Specified by:
wrap
in interfaceScorer
- Overrides:
wrap
in classAbstractWeightedScorer
- Parameters:
d
- the document iterator that will be used in subsequent calls toScorer.score()
andScorer.score(Index)
.- Throws:
IOException
-
usesIntervals
public boolean usesIntervals()
Description copied from interface:Scorer
Whether this scorer uses intervals.This method is essential when aggregating scorers, because if several scores need intervals, a
CachingDocumentIterator
will be necessary.- Specified by:
usesIntervals
in interfaceScorer
- Returns:
- true if this scorer uses intervals.
-
-