Package smile.nlp.keyword
Interface CooccurrenceKeywords
public interface CooccurrenceKeywords
Keyword extraction from a single document using word co-occurrence statistical information.
The algorithm was proposed by Y. Matsuo and M. Ishizuka. It consists of six steps:
- Stem words with the Porter algorithm and extract phrases based on the Apriori algorithm (up to 4 words, with frequency more than 3 times). Discard stop words.
- Select the most frequent terms, up to 30% of the number of running terms.
- Cluster the frequent terms. Two terms are placed in the same cluster if either their Jensen-Shannon divergence or their mutual information is above its threshold (0.95 * log 2 and log 2, respectively).
- Calculate the expected co-occurrence probability.
- Calculate the refined χ² value of each term, which removes the contribution of the maximal term.
- Output a given number of terms with the largest refined χ² values.
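The last three steps can be sketched in plain Java. This is a minimal illustration, not Smile's internal implementation. It assumes the definitions usually given for this algorithm: for a candidate term w and each cluster g of frequent terms, freq(w, g) is the observed co-occurrence count, n_w is the total number of terms in sentences containing w, and p_g is the expected co-occurrence probability of g. The class name `RefinedChiSquareSketch` and the methods `refinedChiSquare` and `topTerms` are hypothetical names introduced only for this example.

```java
import java.util.Arrays;

/** Illustrative sketch only; the names here are hypothetical and not part of Smile. */
final class RefinedChiSquareSketch {

    /**
     * Computes the refined chi-square score of one candidate term.
     *
     * @param cooccurrence cooccurrence[g] is freq(w, g), the observed co-occurrence
     *                     count of the term with cluster g.
     * @param termTotal    n_w, the total number of terms in sentences containing the term.
     * @param expectedProb expectedProb[g] is p_g, the expected co-occurrence probability.
     * @return the sum of the per-cluster chi-square contributions minus the largest one.
     */
    static double refinedChiSquare(int[] cooccurrence, double termTotal, double[] expectedProb) {
        double sum = 0.0;
        double max = 0.0;
        for (int g = 0; g < cooccurrence.length; g++) {
            double expected = termTotal * expectedProb[g];
            if (expected <= 0.0) {
                continue; // a cluster with no expected mass contributes nothing
            }
            double diff = cooccurrence[g] - expected;
            double contribution = diff * diff / expected;
            sum += contribution;
            max = Math.max(max, contribution);
        }
        // Removing the maximal contribution keeps a term from ranking high
        // merely because it co-occurs with one single cluster.
        return sum - max;
    }

    /** Returns up to k terms ordered by descending score. */
    static String[] topTerms(String[] terms, double[] scores, int k) {
        Integer[] order = new Integer[terms.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        String[] top = new String[Math.min(k, terms.length)];
        for (int i = 0; i < top.length; i++) {
            top[i] = terms[order[i]];
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up counts for three clusters, for illustration only.
        double[] p = {0.5, 0.3, 0.2};
        double biased = refinedChiSquare(new int[] {8, 2, 0}, 10, p);
        double neutral = refinedChiSquare(new int[] {5, 3, 2}, 10, p);
        String[] ranked = topTerms(new String[] {"biased", "neutral"},
                                   new double[] {biased, neutral}, 1);
        System.out.println(Arrays.toString(ranked));
    }
}
```

A term whose co-occurrence counts match the expected distribution scores zero, while a term biased toward particular clusters scores higher, which is why the largest refined values are taken as keywords.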
Method Details
of
Returns the top 10 keywords.
Parameters:
text - A single document.
Returns:
The top 10 keywords.
of
Returns a given number of top keywords.
Parameters:
text - A single document.
maxNumKeywords - The maximum number of keywords.
Returns:
The top keywords.