public interface CooccurrenceKeywords

Extracts keywords from a single document using word co-occurrence statistics. The algorithm was proposed by Y. Matsuo and M. Ishizuka. It consists of six steps:

1. Stem words with the Porter algorithm and extract phrases with the Apriori algorithm (up to 4 words, occurring more than 3 times). Discard stop words.
2. Select the most frequent terms, up to 30% of the running terms.
3. Cluster the frequent terms. Two terms are in the same cluster if either their Jensen-Shannon divergence or their mutual information is above the threshold (0.95 * log 2 and log 2, respectively).
4. Calculate the expected co-occurrence probability.
5. Calculate the refined χ² value of each term, which excludes the maximal term.
6. Output a given number of terms with the largest refined χ² values.
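
The refined χ² step can be illustrated with a minimal sketch. It assumes the formulation usually given for this algorithm: for a candidate term w and each cluster g, the contribution is (freq(w, g) - n_w * p_g)² / (n_w * p_g), and the refined value is the sum of the contributions minus the largest one. The class name, method name, and input arrays below are hypothetical and are not part of this interface; the counts and probabilities are assumed to have been computed already.

```java
class RefinedChiSquareSketch {
    /**
     * Hypothetical helper, not part of CooccurrenceKeywords.
     *
     * @param freq freq[g] is the observed co-occurrence count of the
     *             candidate term with cluster g
     * @param nw   total number of terms in the sentences that contain
     *             the candidate term
     * @param p    p[g] is the expected probability of co-occurring
     *             with cluster g
     * @return the chi-square sum with its largest contribution removed
     */
    static double refinedChiSquare(int[] freq, int nw, double[] p) {
        double sum = 0.0;
        double max = 0.0;
        for (int g = 0; g < freq.length; g++) {
            double expected = nw * p[g];
            if (expected <= 0.0) {
                continue; // skip clusters with no expected co-occurrence
            }
            double diff = freq[g] - expected;
            double contribution = diff * diff / expected;
            sum += contribution;
            max = Math.max(max, contribution);
        }
        // Removing the maximal contribution keeps a term from ranking
        // high only because it co-occurs with one particular cluster.
        return sum - max;
    }

    public static void main(String[] args) {
        int[] freq = {8, 1, 1};            // assumed example counts
        double[] p = {0.5, 0.25, 0.25};    // assumed example probabilities
        System.out.println(refinedChiSquare(freq, 10, p));
    }
}
```

With these example inputs the three contributions are about 1.8, 0.9, and 0.9, so the refined value is roughly 1.8 once the largest contribution is dropped. Terms would then be ranked by this value in the final step.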

Method Summary

Static Methods

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static NGram[] | of(String text) | Returns the top 10 keywords. |
| static NGram[] | of(String text, int maxNumKeywords) | Returns a given number of top keywords. |

Method Details
- of
  
  static NGram[] of(String text)
  
  Returns the top 10 keywords.
  
  Parameters:
  
  text - A single document.
  
  Returns:
  
  The top 10 keywords.
- of
  
  static NGram[] of(String text, int maxNumKeywords)
  
  Returns a given number of top keywords.
  
  Parameters:
  
  text - A single document.
  
  maxNumKeywords - The maximum number of keywords to return.
  
  Returns:
  
  The top keywords.
