Package smile.nlp.relevance
Class BM25
java.lang.Object
smile.nlp.relevance.BM25
- All Implemented Interfaces:
RelevanceRanker
The BM25 weighting scheme, often called Okapi weighting, after the system in
which it was first implemented, was developed as a way of building a
probabilistic model sensitive to term frequency and document length while
not introducing too many additional parameters into the model. It is not
a single function, but actually a whole family of scoring functions, with
slightly different components and parameters.
At the extreme values of the coefficient b, BM25 turns into ranking functions known as BM11 (for b = 1) and BM15 (for b = 0). BM25F is a modification of BM25 in which the document is considered to be composed of several fields (such as headlines, main text, anchor text) with possibly different degrees of importance.
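The effect of b at its two extremes can be illustrated with the length-normalization factor from the textbook Okapi formulation, 1 - b + b * (docSize / avgDocSize). The sketch below is a standalone illustration under that assumption; the class and method names are hypothetical and are not part of smile.nlp.relevance.

```java
final class LengthNormSketch {
    /** Textbook length-normalization factor: 1 - b + b * (docSize / avgDocSize). */
    static double lengthNorm(double b, int docSize, double avgDocSize) {
        return 1.0 - b + b * (docSize / avgDocSize);
    }

    public static void main(String[] args) {
        int docSize = 200;          // a document twice as long as average
        double avgDocSize = 100.0;
        // b = 0 (BM15): the factor is 1, so document length is ignored.
        System.out.println(lengthNorm(0.0, docSize, avgDocSize));   // 1.0
        // b = 0.75: partial normalization.
        System.out.println(lengthNorm(0.75, docSize, avgDocSize));  // 1.75
        // b = 1 (BM11): the factor equals docSize / avgDocSize.
        System.out.println(lengthNorm(1.0, docSize, avgDocSize));   // 2.0
    }
}
```

A larger factor means the term frequency is discounted more heavily, so longer-than-average documents are penalized most when b = 1.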
BM25 and its newer variants represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval, such as web search.
-
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
double | rank | Returns the relevance score between a set of terms and a document based on a corpus.
double | rank | Returns the relevance score between a term and a document based on a corpus.
double | score(double freq, int docSize, double avgDocSize, long N, long n) | Returns the relevance score between a term and a document based on a corpus.
double | score(double freq, long N, long n) | Returns the relevance score between a term and a document based on a corpus.
double | score(int termFreq, int docSize, double avgDocSize, int titleTermFreq, int titleSize, double avgTitleSize, int anchorTermFreq, int anchorSize, double avgAnchorSize, long N, long n) | Returns the relevance score between a term and a document based on a corpus.
-
Constructor Details
-
BM25
public BM25()
Default constructor with k1 = 1.2, b = 0.75, delta = 1.0.
-
BM25
public BM25(double k1, double b, double delta)
Constructor.
- Parameters:
k1 - a positive tuning parameter that calibrates the document term frequency scaling. A k1 value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency.
b - a tuning parameter (0 <= b <= 1) that determines the scaling by document length: b = 1 corresponds to fully scaling the term weight by the document length, while b = 0 corresponds to no length normalization.
delta - the control parameter in BM25+. In standard BM25, the component of term frequency normalization by document length is not properly lower-bounded; as a result, long documents that do match the query term can often be scored unfairly as having relevance similar to shorter documents that do not contain the query term at all.
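The roles of k1 and delta can be seen in the saturating term-frequency component of the textbook formulation, (k1 + 1) * tf / (k1 + tf), with delta added for matching terms in the BM25+ variant. The following is a minimal sketch under those assumptions (length normalization is left out for clarity); the names are hypothetical and this is not taken from the class's source.

```java
final class TfSaturationSketch {
    /**
     * Saturating term-frequency component with a BM25+-style lower bound.
     * Returns 0 for a term that does not occur in the document.
     */
    static double tfComponent(double tf, double k1, double delta) {
        if (tf <= 0) return 0.0;
        return (k1 + 1.0) * tf / (k1 + tf) + delta;
    }

    public static void main(String[] args) {
        // k1 = 0: binary model, any positive frequency maps to 1.
        System.out.println(tfComponent(5, 0.0, 0.0));  // 1.0
        // k1 = 1.2: the component grows with tf but stays below k1 + 1.
        System.out.println(tfComponent(1, 1.2, 0.0));
        System.out.println(tfComponent(50, 1.2, 0.0));
        // Very large k1: the component approaches the raw term frequency.
        System.out.println(tfComponent(3, 1e9, 0.0));
        // delta > 0: every matching term contributes at least delta.
        System.out.println(tfComponent(1, 1.2, 1.0));
    }
}
```

With delta > 0, a document that contains the term always scores at least delta times the term's IDF above one that does not, however long the matching document is.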
-
-
Method Details
-
score
public double score(int termFreq, int docSize, double avgDocSize, int titleTermFreq, int titleSize, double avgTitleSize, int anchorTermFreq, int anchorSize, double avgAnchorSize, long N, long n)
Returns the relevance score between a term and a document based on a corpus.
- Parameters:
termFreq - the term frequency in the text body.
docSize - the text length.
avgDocSize - the average text length in the corpus.
titleTermFreq - the term frequency in the title.
titleSize - the title length.
avgTitleSize - the average title length in the corpus.
anchorTermFreq - the term frequency in the anchor.
anchorSize - the anchor length.
avgAnchorSize - the average anchor length in the corpus.
N - the number of documents in the corpus.
n - the number of documents containing the given term in the corpus.
- Returns:
the relevance score.
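This overload takes per-field statistics for body, title, and anchor text, which matches the BM25F idea described in the class overview. One common way to combine fields is to length-normalize each field's frequency, weight it, sum the results, and saturate the sum once. The sketch below illustrates that approach only; the field weights, the b values, and the k1 value are hypothetical, and this page does not state how the class itself combines fields.

```java
final class FieldWeightedSketch {
    // Hypothetical field weights and parameters, chosen only for illustration.
    static final double W_BODY = 1.0, W_TITLE = 3.0, W_ANCHOR = 2.0;
    static final double B_BODY = 0.75, B_TITLE = 0.75, B_ANCHOR = 0.75;
    static final double K1 = 1.2;

    /** Length-normalized frequency of the term in one field. */
    static double normalized(int freq, int size, double avgSize, double b) {
        if (freq <= 0) return 0.0;
        return freq / (1.0 - b + b * (size / avgSize));
    }

    static double score(int termFreq, int docSize, double avgDocSize,
                        int titleTermFreq, int titleSize, double avgTitleSize,
                        int anchorTermFreq, int anchorSize, double avgAnchorSize,
                        long N, long n) {
        // Weighted sum of per-field normalized frequencies.
        double tf = W_BODY * normalized(termFreq, docSize, avgDocSize, B_BODY)
                  + W_TITLE * normalized(titleTermFreq, titleSize, avgTitleSize, B_TITLE)
                  + W_ANCHOR * normalized(anchorTermFreq, anchorSize, avgAnchorSize, B_ANCHOR);
        // Saturate the combined frequency once, then scale by IDF.
        double saturated = tf / (K1 + tf);
        double idf = Math.log((N - n + 0.5) / (n + 0.5));
        return saturated * idf;
    }

    public static void main(String[] args) {
        // Same body statistics; the second document also has the term in its title.
        System.out.println(score(2, 100, 100.0, 0, 5, 5.0, 0, 0, 1.0, 1000, 10));
        System.out.println(score(2, 100, 100.0, 1, 5, 5.0, 0, 0, 1.0, 1000, 10));
    }
}
```

Saturating after the weighted sum, rather than per field, keeps a term that appears in several fields from being rewarded as if it were several independent matches.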
-
score
public double score(double freq, long N, long n)
Returns the relevance score between a term and a document based on a corpus.
- Parameters:
freq - the normalized frequency of the search term in the document being ranked.
N - the number of documents in the corpus.
n - the number of documents containing the given term in the corpus.
- Returns:
the relevance score.
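The N and n arguments feed the inverse document frequency (IDF) part of the score. A minimal sketch of the classic Robertson/Sparck Jones form, log((N - n + 0.5) / (n + 0.5)), follows. Whether this class uses exactly this form is an assumption, and the class name is hypothetical.

```java
final class IdfSketch {
    /** Classic BM25 inverse document frequency. */
    static double idf(long N, long n) {
        return Math.log((N - n + 0.5) / (n + 0.5));
    }

    public static void main(String[] args) {
        long N = 1000;
        // Rare terms receive a high weight.
        System.out.println(idf(N, 10));
        // More common terms receive a lower weight.
        System.out.println(idf(N, 100));
        // In this form, a term found in more than half the documents gets a negative weight.
        System.out.println(idf(N, 900));
    }
}
```

Some implementations clamp the result at zero or add 1 inside the logarithm to avoid negative weights for very common terms.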
-
score
public double score(double freq, int docSize, double avgDocSize, long N, long n)
Returns the relevance score between a term and a document based on a corpus.
- Parameters:
freq - the frequency of the search term in the document being ranked.
docSize - the size of the document being ranked.
avgDocSize - the average size of documents in the corpus.
N - the number of documents in the corpus.
n - the number of documents containing the given term in the corpus.
- Returns:
the relevance score.
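Putting the pieces together, a per-term score in the textbook BM25+ formulation multiplies a length-normalized, saturating term-frequency component (plus delta) by the IDF. The sketch below mirrors this method's parameter list but passes k1, b, and delta explicitly. It is a standalone illustration under those assumptions, not the class's actual source, and the class name is hypothetical.

```java
final class Bm25TermScoreSketch {
    static double score(double freq, int docSize, double avgDocSize, long N, long n,
                        double k1, double b, double delta) {
        // A term that does not occur in the document contributes nothing.
        if (freq <= 0) return 0.0;
        // Length normalization: 1 when the document has average length.
        double norm = 1.0 - b + b * (docSize / avgDocSize);
        // Saturating term-frequency component.
        double tf = (k1 + 1.0) * freq / (freq + k1 * norm);
        // Inverse document frequency.
        double idf = Math.log((N - n + 0.5) / (n + 0.5));
        return (tf + delta) * idf;
    }

    public static void main(String[] args) {
        // Same term frequency, average-length versus double-length document,
        // using the default parameter values listed for the no-argument constructor.
        System.out.println(score(3, 100, 100.0, 1000, 10, 1.2, 0.75, 1.0));
        System.out.println(score(3, 200, 100.0, 1000, 10, 1.2, 0.75, 1.0));
    }
}
```

For a rare term (positive IDF), the longer document scores lower because the same three occurrences make up a smaller share of its text.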
-
rank
Description copied from interface: RelevanceRanker
Returns the relevance score between a term and a document based on a corpus.
- Specified by:
rank in interface RelevanceRanker
- Parameters:
corpus - the corpus.
doc - the document to rank.
term - the search term.
tf - the term frequency in the document.
n - the number of documents containing the given term in the corpus.
- Returns:
the relevance score.
-
rank
Description copied from interface: RelevanceRanker
Returns the relevance score between a set of terms and a document based on a corpus.
- Specified by:
rank in interface RelevanceRanker
- Parameters:
corpus - the corpus.
doc - the document to rank.
terms - the search terms.
tf - the term frequencies in the document.
n - the number of documents containing the given term in the corpus.
- Returns:
the relevance score.
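In the standard BM25 definition, the score of a multi-term query is the sum of the per-term scores. The sketch below illustrates that summation with plain arrays. The per-term document-frequency array df, the fixed K1 and B constants, and all names are hypothetical and do not reflect the RelevanceRanker signature.

```java
final class MultiTermRankSketch {
    static final double K1 = 1.2;   // illustrative values
    static final double B = 0.75;

    /** Per-term score in the textbook formulation (no delta). */
    static double termScore(int tf, int docSize, double avgDocSize, long N, long n) {
        if (tf <= 0) return 0.0;
        double norm = 1.0 - B + B * (docSize / avgDocSize);
        double idf = Math.log((N - n + 0.5) / (n + 0.5));
        return (K1 + 1.0) * tf / (tf + K1 * norm) * idf;
    }

    /** Sums per-term scores; tf[i] and df[i] describe the i-th query term. */
    static double rank(int[] tf, long[] df, int docSize, double avgDocSize, long N) {
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            sum += termScore(tf[i], docSize, avgDocSize, N, df[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] tf = {2, 0, 1};        // the second query term is absent from the document
        long[] df = {10, 20, 30};    // hypothetical document frequencies per term
        System.out.println(rank(tf, df, 120, 100.0, 1000));
    }
}
```

Query terms that do not occur in the document contribute zero, so only matching terms affect the ranking.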
-