Package smile.nlp
Interface Corpus
All Known Implementing Classes:
SimpleCorpus
public interface Corpus
A corpus is a collection of documents.
-
Method Summary
Modifier and Type, Method, Description
int avgDocSize()
Returns the average size of documents in the corpus.
long bigramCount()
Returns the number of bigrams in the corpus.
bigrams()
Returns the iterator over the bigrams in the corpus.
int count(String term)
Returns the total frequency of the term in the corpus.
int count(Bigram bigram)
Returns the total frequency of the bigram in the corpus.
int docCount()
Returns the number of documents in the corpus.
search(String term)
Returns the iterator over the set of documents containing the given term.
search(RelevanceRanker ranker, String term)
Returns the iterator over the set of documents containing the given term in descending order of relevance.
search(RelevanceRanker ranker, String[] terms)
Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
long size()
Returns the number of words in the corpus.
int termCount()
Returns the number of unique terms in the corpus.
terms()
Returns the iterator over the terms in the corpus.
-
Method Details
-
size
long size()
Returns the number of words in the corpus.
Returns:
the number of words in the corpus.
-
docCount
int docCount()
Returns the number of documents in the corpus.
Returns:
the number of documents in the corpus.
-
termCount
int termCount()
Returns the number of unique terms in the corpus.
Returns:
the number of unique terms in the corpus.
-
bigramCount
long bigramCount()
Returns the number of bigrams in the corpus.
Returns:
the number of bigrams in the corpus.
-
avgDocSize
int avgDocSize()
Returns the average size of documents in the corpus.
Returns:
the average size of documents in the corpus.
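To illustrate how the size statistics above relate to one another, here is a minimal sketch over a small in-memory list of tokenized documents. The class `CorpusStatsSketch` is hypothetical and not part of Smile. Computing the average as an integer division of the word count by the document count is an assumption; this page only states that an `int` is returned.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of smile.nlp. It mirrors a few of the
// counting methods of Corpus over documents already split into words.
final class CorpusStatsSketch {
    private final List<List<String>> docs;

    CorpusStatsSketch(List<List<String>> docs) {
        this.docs = docs;
    }

    // Total number of words across all documents.
    long size() {
        long n = 0;
        for (List<String> doc : docs) {
            n += doc.size();
        }
        return n;
    }

    // Number of documents.
    int docCount() {
        return docs.size();
    }

    // Number of distinct words across all documents.
    int termCount() {
        Set<String> unique = new HashSet<>();
        for (List<String> doc : docs) {
            unique.addAll(doc);
        }
        return unique.size();
    }

    // Assumption: average size is the truncated quotient size / docCount.
    int avgDocSize() {
        return docs.isEmpty() ? 0 : (int) (size() / docs.size());
    }
}
```

For two documents of three and five words, this sketch reports a size of 8, a document count of 2, and an average document size of 4.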
-
count
Returns the total frequency of the term in the corpus.
Parameters:
term - the term.
Returns:
the total frequency of the term in the corpus.
-
count
Returns the total frequency of the bigram in the corpus.
Parameters:
bigram - the bigram.
Returns:
the total frequency of the bigram in the corpus.
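The two `count` overloads can be pictured as lookups into frequency tables built while documents are added. The sketch below is one way to do that. `FrequencySketch` is a hypothetical class, not Smile's implementation. It represents a bigram as two adjacent words joined by a space rather than as a dedicated bigram type, and it assumes that an unseen term or bigram has frequency 0.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical frequency tables, not part of smile.nlp.
final class FrequencySketch {
    private final Map<String, Integer> termFreq = new HashMap<>();
    private final Map<String, Integer> bigramFreq = new HashMap<>();

    // Counts every word and every pair of adjacent words in one document.
    void add(List<String> words) {
        for (int i = 0; i < words.size(); i++) {
            termFreq.merge(words.get(i), 1, Integer::sum);
            if (i + 1 < words.size()) {
                String key = words.get(i) + " " + words.get(i + 1);
                bigramFreq.merge(key, 1, Integer::sum);
            }
        }
    }

    // Total frequency of a term; 0 if it never occurred (assumption).
    int count(String term) {
        return termFreq.getOrDefault(term, 0);
    }

    // Total frequency of the bigram (w1, w2); 0 if it never occurred (assumption).
    int count(String w1, String w2) {
        return bigramFreq.getOrDefault(w1 + " " + w2, 0);
    }
}
```

Frequencies here are totals over the whole collection, so a word that appears twice in one document contributes 2.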
-
terms
Returns the iterator over the terms in the corpus.
Returns:
the iterator of terms.
-
bigrams
Returns the iterator over the bigrams in the corpus.
Returns:
the iterator of bigrams.
-
search
Returns the iterator over the set of documents containing the given term.
Parameters:
term - the search term.
Returns:
the iterator of documents containing the term.
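A term lookup of this kind is commonly backed by an inverted index that maps each term to the documents containing it. The following sketch shows the idea under stated assumptions. `InvertedIndexSketch` is hypothetical, documents are identified by integer ids rather than by a document type, and an unknown term yields an empty iterator. This page does not specify how an implementation handles a missing term.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical inverted index, not part of smile.nlp.
final class InvertedIndexSketch {
    // term -> ids of the documents containing it, in insertion order
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    // Indexes one tokenized document and returns its id.
    int add(List<String> words) {
        int id = nextId++;
        for (String w : words) {
            postings.computeIfAbsent(w, k -> new LinkedHashSet<>()).add(id);
        }
        return id;
    }

    // Iterates over the ids of documents containing the term.
    // Assumption: an unknown term gives an empty iterator.
    Iterator<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet()).iterator();
    }
}
```

Because the postings are sets, a document that mentions the term several times is returned only once.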
-
search
Returns the iterator over the set of documents containing the given term in descending order of relevance.
Parameters:
ranker - the relevance ranker.
term - the search term.
Returns:
the iterator of documents in descending order of relevance.
-
search
Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
Parameters:
ranker - the relevance ranker.
terms - the search terms.
Returns:
the iterator of documents in descending order of relevance.
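The ranked, multi-term search can be sketched as: keep every document that contains at least one query term, score it, and return the matches sorted by descending score. Everything in the block below is hypothetical. `Scorer` is a stand-in for a relevance ranker (the actual `RelevanceRanker` signature is not shown on this page), `Hit` is an illustrative result type, and summing per-term scores is an assumption about how scores for several terms might be combined.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical ranked search, not part of smile.nlp.
final class RankedSearchSketch {
    // Stand-in for a relevance ranker: scores one document for one term.
    interface Scorer {
        double score(List<String> doc, String term);
    }

    // A matching document id together with its relevance score.
    static final class Hit {
        final int docId;
        final double score;

        Hit(int docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Returns documents containing at least one term, most relevant first.
    static Iterator<Hit> search(List<List<String>> docs, Scorer scorer, String[] terms) {
        List<Hit> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            List<String> doc = docs.get(id);
            boolean matched = false;
            double total = 0.0;
            for (String term : terms) {
                if (doc.contains(term)) {
                    matched = true;
                    total += scorer.score(doc, term); // assumption: scores add up
                }
            }
            if (matched) {
                hits.add(new Hit(id, total));
            }
        }
        hits.sort((a, b) -> Double.compare(b.score, a.score));
        return hits.iterator();
    }

    // A deliberately simple scorer: raw term frequency within the document.
    static double termFrequency(List<String> doc, String term) {
        return Collections.frequency(doc, term);
    }
}
```

Passing `RankedSearchSketch::termFrequency` as the scorer ranks documents by how often the query terms occur in them; a more realistic scorer would also weigh document length and how rare each term is across the collection.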
-