Package smile.nlp
Class SimpleCorpus
java.lang.Object
smile.nlp.SimpleCorpus
All Implemented Interfaces: Corpus
An in-memory text corpus. Useful for text feature engineering.
Constructor Summary
Constructor - Description
SimpleCorpus() - Constructor.
SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations) - Constructor.
Method Summary
Modifier and Type, Method - Description
add - Adds a document to the corpus.
int avgDocSize() - Returns the average size of documents in the corpus.
bigrams() - Returns the iterator over the bigrams in the corpus.
int count - Returns the total frequency of the term in the corpus.
int count - Returns the total frequency of the bigram in the corpus.
long nbigram() - Returns the number of bigrams in the corpus.
int ndoc() - Returns the number of documents in the corpus.
int nterm() - Returns the number of unique terms in the corpus.
search - Returns the iterator over the set of documents containing the given term.
search(RelevanceRanker ranker, String term) - Returns the iterator over the set of documents containing the given term in descending order of relevance.
search(RelevanceRanker ranker, String[] terms) - Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
long size() - Returns the number of words in the corpus.
terms() - Returns the iterator over the terms in the corpus.
Constructor Details
SimpleCorpus
public SimpleCorpus()
Constructor.
SimpleCorpus
public SimpleCorpus(SentenceSplitter splitter, Tokenizer tokenizer, StopWords stopWords, Punctuations punctuations)
Constructor.
Parameters:
splitter - the sentence splitter.
tokenizer - the word tokenizer.
stopWords - the set of stop words to exclude.
punctuations - the set of punctuation marks to exclude. Set to null to keep all punctuation marks.
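The four constructor arguments describe a preprocessing pipeline: split text into sentences, tokenize each sentence, then drop stop words and, optionally, punctuation. The following is a minimal self-contained sketch of that idea, not Smile's implementation. It assumes naive regular-expression splitting and a hand-picked stop-word set in place of the SentenceSplitter, Tokenizer, StopWords, and Punctuations types, and the names PipelineSketch and preprocess are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipelineSketch {
    // Hypothetical stand-ins for a stop-word list and a punctuation list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "on", "of"));
    static final Set<String> PUNCTUATION =
            new HashSet<>(Arrays.asList(".", ",", "!", "?", ";", ":"));

    // Naive tokenizer: a run of letters/digits, or any single non-space symbol.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|\\S");

    /** Pass null for punctuation to keep all punctuation marks. */
    static List<String> preprocess(String text, Set<String> stopWords, Set<String> punctuation) {
        List<String> tokens = new ArrayList<>();
        // Naive sentence split: break at whitespace that follows '.', '!' or '?'.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            Matcher m = TOKEN.matcher(sentence);
            while (m.find()) {
                String token = m.group().toLowerCase(Locale.ROOT);
                if (stopWords.contains(token)) continue;
                if (punctuation != null && punctuation.contains(token)) continue;
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The cat sat on the mat. A dog barked!";
        // prints [cat, sat, mat, dog, barked]
        System.out.println(preprocess(text, STOP_WORDS, PUNCTUATION));
    }
}
```

Passing null as the last argument mirrors the documented behavior of keeping all punctuation marks; in this sketch the "." and "!" tokens then stay in the output.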
Method Details
add
Adds a document to the corpus.
Parameters:
text - the document text.
Returns:
the document.
size
public long size()
Description copied from interface: Corpus
Returns the number of words in the corpus.
ndoc
public int ndoc()
Description copied from interface: Corpus
Returns the number of documents in the corpus.
nterm
public int nterm()
Description copied from interface: Corpus
Returns the number of unique terms in the corpus.
nbigram
public long nbigram()
Description copied from interface: Corpus
Returns the number of bigrams in the corpus.
avgDocSize
public int avgDocSize()
Description copied from interface: Corpus
Returns the average size of documents in the corpus.
Specified by:
avgDocSize in interface Corpus
Returns:
the average size of documents in the corpus.
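To make the relationship between size(), ndoc(), nterm(), and avgDocSize() concrete, here is a small sketch of how such counters could be maintained as documents are added. It is an illustration under stated assumptions, not Smile's code: the class StatsSketch is hypothetical, documents are taken to be already tokenized, and the average is assumed to be a truncating integer division because the method returns an int.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StatsSketch {
    private long size;                                       // total number of words seen
    private int ndoc;                                        // number of documents added
    private final Set<String> vocabulary = new HashSet<>();  // unique terms

    void add(List<String> tokens) {
        ndoc++;
        size += tokens.size();
        vocabulary.addAll(tokens);
    }

    long size() { return size; }

    int ndoc() { return ndoc; }

    int nterm() { return vocabulary.size(); }

    // Assumption: integer division, so the average is truncated.
    int avgDocSize() { return ndoc == 0 ? 0 : (int) (size / ndoc); }

    public static void main(String[] args) {
        StatsSketch stats = new StatsSketch();
        stats.add(Arrays.asList("a", "b", "a"));
        stats.add(Arrays.asList("b", "c"));
        // prints: 5 words, 2 documents, 3 unique terms, average size 2
        System.out.println(stats.size() + " words, " + stats.ndoc() + " documents, "
                + stats.nterm() + " unique terms, average size " + stats.avgDocSize());
    }
}
```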
count
Description copied from interface: Corpus
Returns the total frequency of the term in the corpus.
count
Description copied from interface: Corpus
Returns the total frequency of the bigram in the corpus.
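The two count methods report how often a term or a bigram occurs across the whole corpus. A rough sketch of the bookkeeping behind such counts is shown below. The class FrequencySketch is hypothetical, and it assumes a bigram is any pair of adjacent tokens in an already tokenized document; the real class may form bigrams differently (for example per sentence, or after stop-word filtering).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencySketch {
    private final Map<String, Integer> termFreq = new HashMap<>();
    // A bigram is keyed by a two-element list, which has value-based equals/hashCode.
    private final Map<List<String>, Integer> bigramFreq = new HashMap<>();

    void add(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            termFreq.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size()) {
                List<String> bigram = Arrays.asList(tokens.get(i), tokens.get(i + 1));
                bigramFreq.merge(bigram, 1, Integer::sum);
            }
        }
    }

    /** Total frequency of a term across all added documents. */
    int count(String term) {
        return termFreq.getOrDefault(term, 0);
    }

    /** Total frequency of the adjacent pair (first, second). */
    int count(String first, String second) {
        return bigramFreq.getOrDefault(Arrays.asList(first, second), 0);
    }

    public static void main(String[] args) {
        FrequencySketch freq = new FrequencySketch();
        freq.add(Arrays.asList("new", "york", "city"));
        freq.add(Arrays.asList("new", "york"));
        System.out.println(freq.count("new"));           // prints 2
        System.out.println(freq.count("new", "york"));   // prints 2
        System.out.println(freq.count("york", "city"));  // prints 1
    }
}
```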
terms
Description copied from interface: Corpus
Returns the iterator over the terms in the corpus.
bigrams
Description copied from interface: Corpus
Returns the iterator over the bigrams in the corpus.
search
Description copied from interface: Corpus
Returns the iterator over the set of documents containing the given term.
search
Description copied from interface: Corpus
Returns the iterator over the set of documents containing the given term in descending order of relevance.
search
Description copied from interface: Corpus
Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
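The three search variants correspond to a lookup in an inverted index, optionally followed by relevance ranking. The sketch below illustrates that general technique only. The class SearchSketch and its method names are hypothetical, documents are identified by integer ids, and the ranking uses a plain TF-IDF score with idf = ln(1 + N/df) as a stand-in for whatever a RelevanceRanker would compute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SearchSketch {
    private int ndoc;
    // Inverted index: term -> (document id -> frequency of the term in that document).
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    /** Adds a tokenized document and returns its id. */
    int add(List<String> tokens) {
        int id = ndoc++;
        for (String t : tokens) {
            index.computeIfAbsent(t, k -> new LinkedHashMap<>()).merge(id, 1, Integer::sum);
        }
        return id;
    }

    /** Unranked search: ids of documents containing the term, in insertion order. */
    List<Integer> search(String term) {
        Map<Integer, Integer> postings = index.getOrDefault(term, Collections.emptyMap());
        return new ArrayList<>(postings.keySet());
    }

    /** Ranked search: documents containing at least one term, highest score first. */
    List<Integer> rankedSearch(String... terms) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : terms) {
            Map<Integer, Integer> postings = index.getOrDefault(term, Collections.emptyMap());
            if (postings.isEmpty()) continue;
            // Assumed scoring: term frequency times ln(1 + N / document frequency).
            double idf = Math.log(1.0 + (double) ndoc / postings.size());
            for (Map.Entry<Integer, Integer> e : postings.entrySet()) {
                scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort(Comparator.comparingDouble((Integer id) -> scores.get(id)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        SearchSketch corpus = new SearchSketch();
        corpus.add(Arrays.asList("apple", "banana", "apple"));
        corpus.add(Arrays.asList("banana", "cherry"));
        corpus.add(Arrays.asList("cherry", "cherry", "date"));
        System.out.println(corpus.search("banana"));                // prints [0, 1]
        System.out.println(corpus.rankedSearch("apple", "cherry")); // prints [0, 2, 1]
    }
}
```

In the ranked call, document 0 scores highest because "apple" is both frequent in it and rare in the corpus; document 2 outranks document 1 because it contains "cherry" twice.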