|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object it.unimi.di.mg4j.search.score.AbstractScorer it.unimi.di.mg4j.search.score.AbstractWeightedScorer it.unimi.di.mg4j.search.score.BM25FScorer
public class BM25FScorer
A scorer that implements the BM25F ranking scheme.
BM25F is an evolution of BM25 described by Stephen Robertson, Hugo Zaragoza and Michael Taylor in “Simple BM25 extension to multiple weighted fields”, CIKM '04: Proceedings of the thirteenth ACM international Conference on Information and Knowledge Management, pages 42−49, ACM Press, 2004.
The idea behind BM25F is that adding up (albeit with weights) BM25 scores from different fields breaks down the nonlinearity of BM25. Instead, we should work on a virtual document collection: more precisely, we should behave as if all fields were concatenated in a single stream of text. For instance, if weights are integers, the formula behaves as if the text of each field is concatenated as many times as its weight to form a global text, which is then scored using BM25.
Note that, for this to happen, we would need to know the corresponding frequency—that is, for each term, the number of documents in which
the term appears in at least one of the fields. This number must be provided at construction time: more precisely, you must specify
a StringMap
that maps each term appearing in some field to an index into a LongList
containing
the correct frequencies. These data is
accessed only in the preparatory phase, so access can be reasonably slow.
Important: the only source of knowledge about the overall set of indices involved in query resolution is given
by calls to AbstractWeightedScorer.setWeights(it.unimi.dsi.fastutil.objects.Reference2DoubleMap)
. That is, this scorer will assume that all indices
appearing in a query are also keys of the weight function passed to AbstractWeightedScorer.setWeights(it.unimi.dsi.fastutil.objects.Reference2DoubleMap)
. An
exception will be raised if these guidelines are not followed.
The tool Paste
can be used to create the metadata of the virtual collection. To do so, simply run Paste
on
the indices of all fields over which you want to compute BM25F with the --metadata-only option.
The resulting frequency file is what you need to pass
to the constructor, and from the term file you can build a StringMap
(e.g., using an ImmutableExternalPrefixMap
)
that will be used to index the frequencies.
Providing global frequency data makes it possible to compute the classical BM25F formula. If no frequency data is provided, this class implements Paolo Boldi's variant of BM25. In this case, we multiply the IDF score by the weighted count of each term to form the virtual count that will be passed through BM25's nonlinear function.
This scorer assigns to each pair index/term reachable by true paths
a score that depends on the virtual count of the term, which is the count of the term for the given index multiplied by the weight of the index.
To obtain the “classical” BM25F score you must write a query q that contains no index selector and multiplexes it on all indices,
e.g., a:q | b:q | c:q. If a term appears only in some specific index/query pair, its score will be computed
using a smaller virtual count, obtained just by adding up the values associated with the actually present index/query pairs. Usually, the
simplest way to obtain this result is to use a MultiIndexTermExpander
, which can be even set from the command-line interface
provided by Query
.
The code in this scorer is verified by unit tests developed jointly with Hugo Zaragoza. This is an important point, as the definition of BM25F contains many subtleties.
BM25Scorer
Nested Class Summary | |
---|---|
protected static class |
BM25FScorer.Visitor
|
Field Summary | |
---|---|
Reference2DoubleMap<Index> |
bByIndex
The parameter b; you must provide one value for each index. |
static double |
DEFAULT_B
The default value used for the parameter b. |
static double |
DEFAULT_K1
The default value used for the parameter k1. |
static double |
EPSILON_SCORE
The value of the document-frequency part for terms appearing in more than half of the documents. |
double |
k1
The parameter k1. |
Fields inherited from class it.unimi.di.mg4j.search.score.AbstractWeightedScorer |
---|
index2Weight |
Fields inherited from class it.unimi.di.mg4j.search.score.AbstractScorer |
---|
documentIterator, indexIterator |
Constructor Summary | |
---|---|
BM25FScorer(double k1,
Reference2DoubleMap<Index> b)
Creates a BM25F scorer using Boldi's variant (frequencies are not needed). |
|
BM25FScorer(double k1,
Reference2DoubleMap<Index> b,
StringMap<? extends CharSequence> termMap,
LongList frequencies)
Creates a BM25F scorer. |
|
BM25FScorer(double k1,
StringMap<? extends CharSequence> termMap,
LongList frequencies,
Object2DoubleMap<String> b)
Creates a BM25F scorer. |
|
BM25FScorer(String... arg)
Creates a BM25F scorer using parameters specified by strings. |
Method Summary | |
---|---|
BM25FScorer |
copy()
|
double |
score()
Computes a score by calling Scorer.score(Index) for
each index in the current document iterator, and adding the weighted results. |
double |
score(Index index)
Returns a score for the current document of the last document iterator given to Scorer.wrap(DocumentIterator) , but
considering only a given index (optional operation). |
boolean |
usesIntervals()
Whether this scorer uses intervals. |
void |
wrap(DocumentIterator d)
Wraps the given document iterator. |
Methods inherited from class it.unimi.di.mg4j.search.score.AbstractWeightedScorer |
---|
getWeights, setWeights |
Methods inherited from class it.unimi.di.mg4j.search.score.AbstractScorer |
---|
nextDocument |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface it.unimi.di.mg4j.search.score.Scorer |
---|
getWeights, nextDocument, setWeights |
Field Detail |
---|
public static final double DEFAULT_K1
public static final double DEFAULT_B
public static final double EPSILON_SCORE
public final double k1
public final Reference2DoubleMap<Index> bByIndex
Constructor Detail |
---|
public BM25FScorer(double k1, Reference2DoubleMap<Index> b, StringMap<? extends CharSequence> termMap, LongList frequencies)
k1
- the k1 parameter.b
- the b parameter, specified as a map from indices to values.termMap
- a map from terms to positions in frequencies
, or null
if frequencies
is null
.frequencies
- the frequencies, or null
for Boldi's variant.public BM25FScorer(double k1, StringMap<? extends CharSequence> termMap, LongList frequencies, Object2DoubleMap<String> b)
This constructor exists to provide a typed counterpart to the string-based constructor (mainly for documentation purposes).
k1
- the k1 parameter.termMap
- a map from terms to positions in frequencies
, or null
if frequencies
is null
.frequencies
- the frequencies, or null
for Boldi's variant.b
- the b parameter, specified as a map from indices to values.public BM25FScorer(double k1, Reference2DoubleMap<Index> b)
k1
- the k1 parameter.b
- the b parameter, specified as a map from indices to values.public BM25FScorer(String... arg) throws NumberFormatException, FileNotFoundException, IOException, ClassNotFoundException
This constructor has string parameters that correspond to the arguments of BM25FScorer(double, StringMap, LongList, Object2DoubleMap)
.
The two middle arguments can be omitted by specifying them as empty. The last argument is represented by a number of
assignments index=b, separated by commas (as if they were multiple arguments), which
will be compacted into a function representing the values of b.
NumberFormatException
FileNotFoundException
IOException
ClassNotFoundException
Method Detail |
---|
public BM25FScorer copy()
copy
in interface DelegatingScorer
copy
in interface Scorer
copy
in interface FlyweightPrototype<Scorer>
public double score() throws IOException
AbstractWeightedScorer
Scorer.score(Index)
for
each index in the current document iterator, and adding the weighted results.
score
in interface Scorer
score
in class AbstractWeightedScorer
IOException
public double score(Index index)
Scorer
Scorer.wrap(DocumentIterator)
, but
considering only a given index (optional operation).
score
in interface Scorer
index
- the only index to be considered.
public void wrap(DocumentIterator d) throws IOException
AbstractScorer
This method records internally the provided iterator.
wrap
in interface Scorer
wrap
in class AbstractWeightedScorer
d
- the document iterator that will be used in subsequent calls to
Scorer.score()
and Scorer.score(Index)
.
IOException
public boolean usesIntervals()
Scorer
This method is essential when aggregating scorers,
because if several scores need intervals, a CachingDocumentIterator
will be necessary.
usesIntervals
in interface Scorer
|
|||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | ||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |