java.lang.Object
  it.unimi.di.mg4j.search.score.AbstractScorer
    it.unimi.di.mg4j.search.score.AbstractWeightedScorer
      it.unimi.di.mg4j.search.score.BM25Scorer
public class BM25Scorer
A scorer that implements the BM25 ranking scheme.
BM25 is a ranking scheme for text derived from the probabilistic model. Its essential feature is that it assigns to each term appearing in a given document a weight that depends on the count (the number of occurrences of the term in the document), on the frequency (the number of documents in which the term appears), and on the document length (in words). It was devised in the early nineties, and it provides a significant improvement over the classical TF/IDF scheme. Karen Spärck Jones, Steve Walker and Stephen E. Robertson give a full account of BM25 and of the probabilistic model in “A probabilistic model of information retrieval: development and comparative experiments”, Inf. Process. Management 36(6):779–840, 2000.
There are a number of incarnations of the formula with small variations. Here, a weight is assigned to a term that appears in f documents out of a collection of N documents, with respect to a document of length l in which the term appears c times. The parameters k1 and b of the formula default to DEFAULT_K1 and DEFAULT_B: these values were chosen following the suggestions given in “Efficiency vs. effectiveness in Terabyte-scale information retrieval”, by Stefan Büttcher and Charles L. A. Clarke, in Proceedings of the 14th Text REtrieval Conference (TREC 2005), Gaithersburg, USA, November 2005. The logarithmic part (a.k.a. the idf, or inverse document-frequency, part) is actually replaced by EPSILON_SCORE whenever it would be smaller than that value, so it is never negative (the net effect being that terms appearing in more than half of the documents have almost no weight).
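The weighting described above can be illustrated with a minimal sketch. It uses one widely known form of BM25 with the symbols f, N, l, c, k1 and b from the text; since several variants of the formula exist, this is not necessarily the exact expression this class evaluates. The class name `Bm25WeightSketch`, the method names, the `avgDocLength` argument and all numeric values are assumptions made for illustration, not part of MG4J.

```java
// A hedged sketch of a common BM25 variant; not MG4J code.
class Bm25WeightSketch {

    /** The idf part, floored at epsilon so that it is never negative. */
    static double idf(long numDocs, long docFreq, double epsilon) {
        double raw = Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
        return Math.max(epsilon, raw);
    }

    /**
     * Weight of a term appearing count times in a document of length docLength,
     * where the term appears in docFreq documents out of numDocs.
     * avgDocLength (the average document length) is an assumed extra input.
     */
    static double weight(int count, long docFreq, long numDocs,
                         double docLength, double avgDocLength,
                         double k1, double b, double epsilon) {
        // Length normalisation: longer-than-average documents are penalised.
        double norm = k1 * ((1 - b) + b * docLength / avgDocLength);
        return idf(numDocs, docFreq, epsilon) * (k1 + 1) * count / (count + norm);
    }

    public static void main(String[] args) {
        // Illustrative parameter values only; they are not claimed to be the class defaults.
        double k1 = 1.2, b = 0.5, epsilon = 1e-6;
        System.out.println("rare term:   " + weight(3, 10, 1000, 120, 100, k1, b, epsilon));
        System.out.println("common term: " + weight(3, 600, 1000, 120, 100, k1, b, epsilon));
    }
}
```

In this sketch a term that appears in more than half of the documents gets a negative logarithm, so its idf part collapses to the small epsilon value and the term contributes almost nothing to the score.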
This class has two modes of evaluation, generic and flat. The generic evaluator uses an internal visitor building on CounterSetupVisitor and related classes (by means of DocumentIterator.acceptOnTruePaths(it.unimi.di.mg4j.search.visitor.DocumentIteratorVisitor)) to take into consideration only terms that are actually involved in query semantics for the current document. The flat evaluator uses a simple loop to simulate the behaviour of the generic evaluator on a special subset of queries: those formed by a single index iterator, or by a composite document iterator whose underlying queries are all index iterators. As there is no recursive visit, it is significantly faster than the generic evaluator if the document iterator is a subclass of AbstractIntersectionDocumentIterator, or if it is a subclass of AbstractUnionDocumentIterator and the disjuncts are not too many (fewer than MAX_FLAT_DISJUNCTS).
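The choice between the two evaluators can be summarised as a small decision rule. The sketch below only restates the conditions given in the text; the enum `QueryShape`, the class `EvaluatorChoiceSketch` and the method `useFlat` are hypothetical names, and the way the real class inspects its document iterator is not shown here.

```java
// A hedged restatement of the flat-versus-generic rule; not MG4J code.
class EvaluatorChoiceSketch {

    /** Hypothetical classification of the wrapped query. */
    enum QueryShape {
        SINGLE_INDEX_ITERATOR,      // a lone index iterator
        INTERSECTION_OF_ITERATORS,  // a conjunction whose operands are all index iterators
        UNION_OF_ITERATORS,         // a disjunction whose operands are all index iterators
        OTHER                       // anything else: nested or mixed queries
    }

    /** Returns true if the flat evaluator applies, false if the generic one must be used. */
    static boolean useFlat(QueryShape shape, int operands, int maxFlatDisjuncts) {
        switch (shape) {
            case SINGLE_INDEX_ITERATOR:
            case INTERSECTION_OF_ITERATORS:
                return true;
            case UNION_OF_ITERATORS:
                // Flat only when the disjuncts are fewer than the threshold.
                return operands < maxFlatDisjuncts;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        int threshold = 16; // illustrative value, not the actual MAX_FLAT_DISJUNCTS
        System.out.println(useFlat(QueryShape.INTERSECTION_OF_ITERATORS, 40, threshold));
        System.out.println(useFlat(QueryShape.UNION_OF_ITERATORS, 40, threshold));
    }
}
```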
Field Summary | |
---|---|
static boolean | DEBUG |
static double | DEFAULT_B: The default value used for the parameter b. |
static double | DEFAULT_K1: The default value used for the parameter k1. |
static double | EPSILON_SCORE: The value of the document-frequency part for terms appearing in more than half of the documents. |
static Logger | LOGGER |
static int | MAX_FLAT_DISJUNCTS: Disjunctive queries on index iterators are handled using the flat evaluator only if they contain fewer than this number of disjuncts. |
Fields inherited from class it.unimi.di.mg4j.search.score.AbstractWeightedScorer |
---|
index2Weight |
Fields inherited from class it.unimi.di.mg4j.search.score.AbstractScorer |
---|
documentIterator, indexIterator |
Constructor Summary | |
---|---|
BM25Scorer() | Creates a BM25 scorer using DEFAULT_K1 and DEFAULT_B as parameters. |
BM25Scorer(double k1, double b) | Creates a BM25 scorer using the specified k1 and b parameters. |
BM25Scorer(String k1, String b) | Creates a BM25 scorer using k1 and b parameters specified by strings. |
Method Summary | |
---|---|
BM25Scorer | copy() |
double | score(): Computes a score by calling Scorer.score(Index) for each index in the current document iterator, and adding the weighted results. |
double | score(Index index): Returns a score for the current document of the last document iterator given to Scorer.wrap(DocumentIterator), but considering only a given index (optional operation). |
boolean | usesIntervals(): Whether this scorer uses intervals. |
void | wrap(DocumentIterator d): Wraps the given document iterator. |
Methods inherited from class it.unimi.di.mg4j.search.score.AbstractWeightedScorer |
---|
getWeights, setWeights |
Methods inherited from class it.unimi.di.mg4j.search.score.AbstractScorer |
---|
nextDocument |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface it.unimi.di.mg4j.search.score.Scorer |
---|
getWeights, nextDocument, setWeights |
Field Detail |
---|
public static final Logger LOGGER
public static final boolean DEBUG
public static final double DEFAULT_K1
public static final double DEFAULT_B
public static final double EPSILON_SCORE
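public static final int MAX_FLAT_DISJUNCTS
Disjunctive queries on index iterators are handled using the flat evaluator only if they contain fewer than this number of disjuncts. Beyond that number the generic evaluator is used, as it calls IndexIterator.count() only on the terms that are part of the front. The best value depends heavily on the architecture, the queries, the term distribution, and other factors.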
Constructor Detail |
---|
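public BM25Scorer()
Creates a BM25 scorer using DEFAULT_K1 and DEFAULT_B as parameters.
public BM25Scorer(double k1, double b)
Creates a BM25 scorer using the specified k1 and b parameters.
Parameters:
k1 - the k1 parameter.
b - the b parameter.
public BM25Scorer(String k1, String b)
Creates a BM25 scorer using k1 and b parameters specified by strings.
Parameters:
k1 - the k1 parameter.
b - the b parameter.
Method Detail |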
---|
public BM25Scorer copy()
Specified by: copy in interface DelegatingScorer
Specified by: copy in interface Scorer
Specified by: copy in interface FlyweightPrototype<Scorer>
public double score() throws IOException
Description copied from class: AbstractWeightedScorer
Computes a score by calling Scorer.score(Index) for each index in the current document iterator, and adding the weighted results.
Specified by: score in interface Scorer
Overrides: score in class AbstractWeightedScorer
Throws: IOException
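The "weighted results" wording can be made concrete with a small sketch: each index has a weight, a per-index score is computed, and the products are summed. The names `WeightedScoreSketch` and `weightedScore`, the use of strings as index identifiers, and the sample weights are all assumptions for illustration; the actual class works on Index objects and its inherited index2Weight field.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// A hedged sketch of combining per-index scores by weight; not MG4J code.
class WeightedScoreSketch {

    /** Sums weight(index) * score(index) over all indices that have a weight. */
    static double weightedScore(Map<String, Double> indexToWeight,
                                ToDoubleFunction<String> perIndexScore) {
        double total = 0;
        for (Map.Entry<String, Double> entry : indexToWeight.entrySet()) {
            total += entry.getValue() * perIndexScore.applyAsDouble(entry.getKey());
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("title", 2.0); // hypothetical index names and weights
        weights.put("text", 1.0);
        // Stand-in for the per-index scoring call.
        double result = weightedScore(weights, index -> index.equals("title") ? 0.5 : 3.0);
        System.out.println(result);
    }
}
```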
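public double score(Index index)
Description copied from interface: Scorer
Returns a score for the current document of the last document iterator given to Scorer.wrap(DocumentIterator), but considering only a given index (optional operation).
Specified by: score in interface Scorer
Parameters:
index - the only index to be considered.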
public void wrap(DocumentIterator d) throws IOException
Description copied from class: AbstractScorer
Wraps the given document iterator. This method records internally the provided iterator.
Specified by: wrap in interface Scorer
Overrides: wrap in class AbstractWeightedScorer
Parameters:
d - the document iterator that will be used in subsequent calls to Scorer.score() and Scorer.score(Index).
Throws: IOException
public boolean usesIntervals()
Description copied from interface: Scorer
Whether this scorer uses intervals. This method is essential when aggregating scorers, because if several scorers need intervals, a CachingDocumentIterator will be necessary.
Specified by: usesIntervals in interface Scorer
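The aggregation remark can be sketched as a simple check over a list of scorers. The interface `IntervalAware` and the method `needsCaching` are hypothetical stand-ins, and reading "several" as "more than one" is an assumption; how an aggregator actually wraps its iterator in a CachingDocumentIterator is not shown.

```java
import java.util.Arrays;
import java.util.List;

// A hedged sketch of the check an aggregator might perform; not MG4J code.
class IntervalAggregationSketch {

    /** Minimal stand-in for the part of a scorer that matters here. */
    interface IntervalAware {
        boolean usesIntervals();
    }

    /** Returns true if more than one scorer reports that it uses intervals. */
    static boolean needsCaching(List<? extends IntervalAware> scorers) {
        int intervalUsers = 0;
        for (IntervalAware scorer : scorers) {
            if (scorer.usesIntervals()) intervalUsers++;
        }
        return intervalUsers > 1;
    }

    public static void main(String[] args) {
        IntervalAware usesThem = () -> true;
        IntervalAware ignoresThem = () -> false;
        System.out.println(needsCaching(Arrays.asList(usesThem, ignoresThem)));
        System.out.println(needsCaching(Arrays.asList(usesThem, usesThem)));
    }
}
```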