Class BM25Scorer
- java.lang.Object
-
- it.unimi.di.big.mg4j.search.score.AbstractScorer
-
- it.unimi.di.big.mg4j.search.score.AbstractWeightedScorer
-
- it.unimi.di.big.mg4j.search.score.BM25Scorer
-
- All Implemented Interfaces:
DelegatingScorer
,Scorer
,FlyweightPrototype<Scorer>
public class BM25Scorer extends AbstractWeightedScorer implements DelegatingScorer
A scorer that implements the BM25 ranking scheme.BM25 is the name of a ranking scheme for text derived from the probabilistic model. The essential feature of the scheme is that of assigning to each term appearing in a given document a weight depending both on the count (the number of occurrences of the term in the document), on the frequency (the number of the documents in which the term appears) and on the document length (in words). It was devised in the early nineties, and it provides a significant improvement over the classical TF/IDF scheme. Karen Spärck Jones, Steve Walker and Stephen E. Robertson give a full account of BM25 and of the probabilistic model in “A probabilistic model of information retrieval: development and comparative experiments”, Inf. Process. Management 36(6):779−840, 2000.
There are a number of incarnations with small variations of the formula itself. Here, the weight assigned to a term which appears in f documents out of a collection of N documents w.r.t. to a document of length l in which the term appears c times is
log( (N − f + 1/2) / (f + 1/2) ) ( k1 + 1 ) c ⁄ ( c + k1 ((1 − b) + bl / L) ),where L is the average document length, and k1 and b are parameters that default toDEFAULT_K1
andDEFAULT_B
: these values were chosen following the suggestions given in “Efficiency vs. effectiveness in Terabyte-scale information retrieval”, by Stefan Büttcher and Charles L. A. Clarke, in Proceedings of the 14th Text REtrieval Conference (TREC 2005). Gaithersburg, USA, November 2005. The logarithmic part (a.k.a. idf (inverse document-frequency) part) is actually maximised withEPSILON_SCORE
, so it is never negative (the net effect being that terms appearing in more than half of the documents have almost no weight).Evaluation
This class has two modes of evaluation, generic and flat. The generic evaluator uses an internal visitor building on
CounterSetupVisitor
and related classes (by means ofDocumentIterator.acceptOnTruePaths(it.unimi.di.big.mg4j.search.visitor.DocumentIteratorVisitor)
) to take into consideration only terms that are actually involved in query semantics for the current document. The flat evaluator simulates the behaviour of the generic evaluator on a special subset of queries, that is, queries that are formed by an index iterator or a composite document iterator whose underlying queries are all index iterators, by means of a simple loop. This is significantly faster than the generic evaluator (as there is no recursive visit) either if document iterator is a subclass ofAbstractIntersectionDocumentIterator
, or if it is a subclass ofAbstractUnionDocumentIterator
and the disjuncts are not too many (less thanMAX_FLAT_DISJUNCTS
).- Author:
- Mauro Mereu, Sebastiano Vigna
-
-
Field Summary
Fields Modifier and Type Field Description static boolean
DEBUG
static double
DEFAULT_B
The default value used for the parameter b.static double
DEFAULT_K1
The default value used for the parameter k1.static double
EPSILON_SCORE
The value of the document-frequency part for terms appearing in more than half of the documents.static Logger
LOGGER
static int
MAX_FLAT_DISJUNCTS
Disjunctive queries on index iterators are handled using the flat evaluator only if they contain less than this number of disjuncts.-
Fields inherited from class it.unimi.di.big.mg4j.search.score.AbstractWeightedScorer
index2Weight
-
Fields inherited from class it.unimi.di.big.mg4j.search.score.AbstractScorer
documentIterator, indexIterator
-
-
Constructor Summary
Constructors Constructor Description BM25Scorer()
Creates a BM25 scorer usingDEFAULT_K1
andDEFAULT_B
as parameters.BM25Scorer(double k1, double b)
Creates a BM25 scorer using specified k1 and b parameters.BM25Scorer(String k1, String b)
Creates a BM25 scorer using specified k1 and b parameters specified by strings.
-
Method Summary
Modifier and Type Method Description BM25Scorer
copy()
double
score()
Computes a score by callingScorer.score(Index)
for each index in the current document iterator, and adding the weighted results.double
score(Index index)
Returns a score for the current document of the last document iterator given toScorer.wrap(DocumentIterator)
, but considering only a given index (optional operation).boolean
usesIntervals()
Whether this scorer uses intervals.void
wrap(DocumentIterator d)
Wraps the given document iterator.-
Methods inherited from class it.unimi.di.big.mg4j.search.score.AbstractWeightedScorer
getWeights, setWeights
-
Methods inherited from class it.unimi.di.big.mg4j.search.score.AbstractScorer
nextDocument
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface it.unimi.di.big.mg4j.search.score.Scorer
getWeights, nextDocument, setWeights
-
-
-
-
Field Detail
-
LOGGER
public static final Logger LOGGER
-
DEBUG
public static final boolean DEBUG
- See Also:
- Constant Field Values
-
DEFAULT_K1
public static final double DEFAULT_K1
The default value used for the parameter k1.- See Also:
- Constant Field Values
-
DEFAULT_B
public static final double DEFAULT_B
The default value used for the parameter b.- See Also:
- Constant Field Values
-
EPSILON_SCORE
public static final double EPSILON_SCORE
The value of the document-frequency part for terms appearing in more than half of the documents.- See Also:
- Constant Field Values
-
MAX_FLAT_DISJUNCTS
public static final int MAX_FLAT_DISJUNCTS
Disjunctive queries on index iterators are handled using the flat evaluator only if they contain less than this number of disjuncts. The generic evaluator is more efficient if there are several disjuncts, as it invokesIndexIterator.count()
only on the terms that are part of the front. This value is largely architecture, query, term-distribution, and whatever else dependent.- See Also:
- Constant Field Values
-
-
Constructor Detail
-
BM25Scorer
public BM25Scorer()
Creates a BM25 scorer usingDEFAULT_K1
andDEFAULT_B
as parameters.
-
BM25Scorer
public BM25Scorer(double k1, double b)
Creates a BM25 scorer using specified k1 and b parameters.- Parameters:
k1
- the k1 parameter.b
- the b parameter.
-
-
Method Detail
-
copy
public BM25Scorer copy()
- Specified by:
copy
in interfaceFlyweightPrototype<Scorer>
- Specified by:
copy
in interfaceScorer
-
score
public double score() throws IOException
Description copied from class:AbstractWeightedScorer
Computes a score by callingScorer.score(Index)
for each index in the current document iterator, and adding the weighted results.- Specified by:
score
in interfaceScorer
- Overrides:
score
in classAbstractWeightedScorer
- Returns:
- the combined weighted score.
- Throws:
IOException
-
score
public double score(Index index)
Description copied from interface:Scorer
Returns a score for the current document of the last document iterator given toScorer.wrap(DocumentIterator)
, but considering only a given index (optional operation).
-
wrap
public void wrap(DocumentIterator d) throws IOException
Description copied from class:AbstractScorer
Wraps the given document iterator.This method records internally the provided iterator.
- Specified by:
wrap
in interfaceScorer
- Overrides:
wrap
in classAbstractWeightedScorer
- Parameters:
d
- the document iterator that will be used in subsequent calls toScorer.score()
andScorer.score(Index)
.- Throws:
IOException
-
usesIntervals
public boolean usesIntervals()
Description copied from interface:Scorer
Whether this scorer uses intervals.This method is essential when aggregating scorers, because if several scores need intervals, a
CachingDocumentIterator
will be necessary.- Specified by:
usesIntervals
in interfaceScorer
- Returns:
- true if this scorer uses intervals.
-
-