Class TFIDF
- All Implemented Interfaces:
RelevanceRanker
One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document d, let tfmax(d) be the maximum tf over all terms in d. Then, we compute a normalized term frequency for each term t in document d by
tf = a + (1 - a) tf_{t,d} / tfmax(d)
where a is a value between 0 and 1 and is generally set to 0.4, although some early work used the value 0.5. The term a is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in d. The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. Maximum tf normalization does suffer from the following issues:
- The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
- A document may contain an outlier term that occurs unusually often and is not representative of the content of that document.
- More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
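The normalization formula above can be sketched in a few lines of Java. This is a minimal illustration, not this class's actual implementation; the names `MaxTfNormalization` and `normalize` are hypothetical, and the argument checks are an assumption added for the example.

```java
final class MaxTfNormalization {
    /** Commonly used value of the smoothing parameter a. */
    static final double DEFAULT_SMOOTHING = 0.4;

    /**
     * Computes a + (1 - a) * tf / maxtf.
     *
     * @param tf    raw frequency of the term in the document
     * @param maxtf maximum raw frequency over all terms in the document
     * @param a     smoothing parameter between 0 and 1
     */
    static double normalize(int tf, int maxtf, double a) {
        if (maxtf <= 0) {
            throw new IllegalArgumentException("maxtf must be positive");
        }
        if (tf < 0 || tf > maxtf) {
            throw new IllegalArgumentException("tf must be in [0, maxtf]");
        }
        // Multiplying by the double (1.0 - a) first avoids integer division.
        return a + (1.0 - a) * tf / maxtf;
    }

    public static void main(String[] args) {
        // A term seen 5 times where the most frequent term is seen 10 times: about 0.7.
        System.out.println(normalize(5, 10, DEFAULT_SMOOTHING));
        // Same term, but one outlier term occurs 100 times: about 0.43.
        System.out.println(normalize(5, 100, DEFAULT_SMOOTHING));
    }
}
```

The second call illustrates the outlier issue listed above: a single very frequent term pushes the normalized weight of every other term toward a.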
-
Constructor Summary
-
Method Summary
- double rank(int tf, int maxtf, long N, long n): Returns the relevance score between a term and a document based on a corpus.
- double rank(corpus, doc, terms, tf, n): Returns the relevance score between a set of terms and a document based on a corpus.
- double rank(corpus, doc, term, tf, n): Returns the relevance score between a term and a document based on a corpus.
-
Constructor Details
-
TFIDF
public TFIDF()
Constructor.
-
TFIDF
public TFIDF(double smoothing)
Constructor.
- Parameters:
smoothing - the smoothing parameter in maximum tf normalization.
-
-
Method Details
-
rank
public double rank(int tf, int maxtf, long N, long n)
Returns the relevance score between a term and a document based on a corpus.
- Parameters:
tf - the frequency of the search term in the document to rank.
maxtf - the maximum frequency over all terms in the document.
N - the number of documents in the corpus.
n - the number of documents containing the given term in the corpus.
- Returns:
the relevance score.
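As a rough illustration of how these four arguments might combine, the sketch below multiplies the maximum-tf-normalized term frequency by an inverse document frequency of log(N / n). The idf form, the zero score for a term that does not occur, the explicit smoothing argument `a`, and the names `TfIdfSketch` and `score` are all assumptions made for this example; this page does not state the formula the class actually uses.

```java
final class TfIdfSketch {
    /**
     * Hypothetical tf-idf score: (a + (1 - a) * tf / maxtf) * log(N / n).
     *
     * @param tf    frequency of the search term in the document
     * @param maxtf maximum frequency over all terms in the document
     * @param N     number of documents in the corpus
     * @param n     number of documents containing the term
     * @param a     smoothing parameter of maximum tf normalization
     */
    static double score(int tf, int maxtf, long N, long n, double a) {
        // Assumption: a term absent from the document contributes nothing.
        if (tf <= 0) {
            return 0.0;
        }
        double normalizedTf = a + (1.0 - a) * tf / maxtf;
        // Cast to double so that N / n is not truncated by integer division.
        double idf = Math.log((double) N / n);
        return normalizedTf * idf;
    }

    public static void main(String[] args) {
        // The same in-document frequency scores higher for a rarer term.
        System.out.println(score(5, 10, 1000, 10, 0.4));
        System.out.println(score(5, 10, 1000, 500, 0.4));
    }
}
```

Under these assumptions, a term that appears in every document (n equal to N) scores zero regardless of its frequency, because log(1) is 0.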
-
rank
Description copied from interface: RelevanceRanker
Returns the relevance score between a term and a document based on a corpus.
- Specified by:
rank in interface RelevanceRanker
- Parameters:
corpus - the corpus.
doc - the document to rank.
term - the search term.
tf - the term frequency in the document.
n - the number of documents containing the given term in the corpus.
- Returns:
the relevance score.
-
rank
Description copied from interface: RelevanceRanker
Returns the relevance score between a set of terms and a document based on a corpus.
- Specified by:
rank in interface RelevanceRanker
- Parameters:
corpus - the corpus.
doc - the document to rank.
terms - the search terms.
tf - the term frequencies in the document.
n - the number of documents containing the given term in the corpus.
- Returns:
the relevance score.
-