Package smile.nlp.collocation
Class NGram
java.lang.Object
smile.nlp.NGram
smile.nlp.collocation.NGram
- All Implemented Interfaces:
Comparable<NGram>
An n-gram is a contiguous sequence of n words from a given sequence of text.
An n-gram of size 1 is referred to as a unigram; size 2 is a bigram;
size 3 is a trigram.
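To make the definition concrete, here is a minimal sketch that lists the contiguous n-grams of a tokenized sentence with a sliding window. The class `NGramWindowSketch` and its `ngrams` helper are hypothetical names used only for illustration; they are not part of the Smile API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of Smile.
class NGramWindowSketch {
    /** Returns every contiguous run of n words, in order of appearance. */
    static List<String> ngrams(String[] words, int n) {
        List<String> result = new ArrayList<>();
        // Slide a window of width n across the word sequence.
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"the", "quick", "brown", "fox"};
        System.out.println(ngrams(words, 1)); // unigrams
        System.out.println(ngrams(words, 2)); // bigrams
        System.out.println(ngrams(words, 3)); // trigrams
    }
}
```

A sentence of k words yields k - n + 1 n-grams when n is at most k, and none otherwise.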
-
Field Details
-
count
public final int count
The frequency of the n-gram in the corpus.
-
-
Constructor Details
-
NGram
Constructor.
- Parameters:
words - the n-gram word sequence.
count - the frequency of the n-gram in the corpus.
-
-
Method Details
-
toString
-
compareTo
- Specified by:
compareTo in interface Comparable<NGram>
-
of
Extracts n-gram phrases with an Apriori-like algorithm, proposed in "A Study Using n-gram Features for Text Categorization" by Johannes Fürnkranz. The algorithm takes a collection of sentences and generates all n-grams of length at most maxNGramSize that occur at least minFrequency times in the sentences.
- Parameters:
sentences - a collection of sentences (already split).
maxNGramSize - the maximum length of an n-gram.
minFrequency - the minimum frequency of an n-gram in the sentences.
- Returns:
- An array of n-gram sets. The i-th entry is the set of i-grams.
-
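The Apriori-like idea described for `of` can be sketched as follows. This is a simplified illustration, not Smile's implementation: the class `AprioriNGramSketch`, its `extract` method, and the choice to return a list of maps (n-gram text to count) instead of an array of `NGram` objects are all assumptions made for the example. It also assumes that words contain no spaces, since n-grams are keyed by their space-joined text. The pruning step counts an n-gram only when both its leading and trailing (n-1)-grams were already found frequent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class, not part of Smile.
class AprioriNGramSketch {
    /** Entry i of the result maps each frequent i-gram to its count; entry 0 is empty. */
    static List<Map<String, Integer>> extract(
            Collection<String[]> sentences, int maxNGramSize, int minFrequency) {
        List<Map<String, Integer>> result = new ArrayList<>();
        result.add(new HashMap<>()); // index 0 is unused so that index i holds the i-grams
        Set<String> previous = null; // frequent (n-1)-grams from the previous pass

        for (int n = 1; n <= maxNGramSize; n++) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] s : sentences) {
                for (int i = 0; i + n <= s.length; i++) {
                    if (n > 1) {
                        // Apriori pruning: an n-gram can only be frequent
                        // if both of its (n-1)-gram parts are frequent.
                        String prefix = join(s, i, i + n - 1);
                        String suffix = join(s, i + 1, i + n);
                        if (!previous.contains(prefix) || !previous.contains(suffix)) {
                            continue;
                        }
                    }
                    counts.merge(join(s, i, i + n), 1, Integer::sum);
                }
            }
            // Keep only n-grams that reach the minimum frequency.
            counts.values().removeIf(c -> c < minFrequency);
            result.add(counts);
            previous = counts.keySet();
        }
        return result;
    }

    /** Joins words[from..to) with single spaces. */
    static String join(String[] words, int from, int to) {
        return String.join(" ", Arrays.copyOfRange(words, from, to));
    }
}
```

Because each pass reuses the frequent set from the previous pass, infrequent candidates are discarded before they are counted, which keeps the candidate tables small as n grows.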