Package smile.nlp

Interface Corpus

All Known Implementing Classes:
SimpleCorpus

public interface Corpus
A corpus is a collection of documents.
  • Method Summary

    Modifier and Type
    Method
    Description
    int
    avgDocSize()
    Returns the average size of documents in the corpus.
    long
    bigramCount()
    Returns the number of bigrams in the corpus.
    Iterator<Bigram>
    bigrams()
    Returns the iterator over the bigrams in the corpus.
    int
    count(String term)
    Returns the total frequency of the term in the corpus.
    int
    count(Bigram bigram)
    Returns the total frequency of the bigram in the corpus.
    int
    docCount()
    Returns the number of documents in the corpus.
    Iterator<Text>
    search(String term)
    Returns the iterator over the set of documents containing the given term.
    Iterator<Relevance>
    search(RelevanceRanker ranker, String term)
    Returns the iterator over the set of documents containing the given term in descending order of relevance.
    Iterator<Relevance>
    search(RelevanceRanker ranker, String[] terms)
    Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
    long
    size()
    Returns the number of words in the corpus.
    int
    termCount()
    Returns the number of unique terms in the corpus.
    Iterator<String>
    terms()
    Returns the iterator over the terms in the corpus.
  • Method Details

    • size

      long size()
      Returns the number of words in the corpus.
      Returns:
      the number of words in the corpus.
    • docCount

      int docCount()
      Returns the number of documents in the corpus.
      Returns:
      the number of documents in the corpus.
    • termCount

      int termCount()
      Returns the number of unique terms in the corpus.
      Returns:
      the number of unique terms in the corpus.
    • bigramCount

      long bigramCount()
      Returns the number of bigrams in the corpus.
      Returns:
      the number of bigrams in the corpus.
    • avgDocSize

      int avgDocSize()
      Returns the average size of documents in the corpus.
      Returns:
      the average size of documents in the corpus.
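      The sketch below illustrates how size(), docCount() and avgDocSize() may relate to one another. CorpusCounts and ToyCorpus are hypothetical stand-ins written for this example, not part of smile.nlp, and the truncating integer division in avgDocSize() is an assumption rather than documented behavior.

```java
import java.util.List;

// Hypothetical stand-in that mirrors only the three counting methods
// documented above; it is not the smile.nlp.Corpus interface itself.
interface CorpusCounts {
    long size();
    int docCount();
    int avgDocSize();
}

// Toy in-memory corpus: each document is a list of already tokenized words.
final class ToyCorpus implements CorpusCounts {
    private final List<List<String>> docs;

    ToyCorpus(List<List<String>> docs) {
        this.docs = docs;
    }

    @Override
    public long size() {
        // Total number of words across all documents.
        long words = 0;
        for (List<String> doc : docs) {
            words += doc.size();
        }
        return words;
    }

    @Override
    public int docCount() {
        return docs.size();
    }

    @Override
    public int avgDocSize() {
        // Assumption: total words divided by document count, truncated to int.
        return docs.isEmpty() ? 0 : (int) (size() / docCount());
    }
}
```

      Under these assumptions, a corpus with one three-word document and one two-word document has size() 5, docCount() 2 and avgDocSize() 2.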
    • count

      int count(String term)
      Returns the total frequency of the term in the corpus.
      Parameters:
      term - the term.
      Returns:
      the total frequency of the term in the corpus.
    • count

      int count(Bigram bigram)
      Returns the total frequency of the bigram in the corpus.
      Parameters:
      bigram - the bigram.
      Returns:
      the total frequency of the bigram in the corpus.
    • terms

      Iterator<String> terms()
      Returns the iterator over the terms in the corpus.
      Returns:
      the iterator of terms.
    • bigrams

      Iterator<Bigram> bigrams()
      Returns the iterator over the bigrams in the corpus.
      Returns:
      the iterator of bigrams.
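      As a usage sketch, terms() can be combined with count(String) and size() to derive relative term frequencies. TermSource is a hypothetical interface that copies only the three signatures needed here so the example compiles on its own; TermFrequencies and relativeFrequencies are likewise names invented for illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in exposing only the methods this sketch needs.
interface TermSource {
    long size();
    int count(String term);
    Iterator<String> terms();
}

final class TermFrequencies {
    private TermFrequencies() {}

    // Maps each term to count(term) / size(), in the iterator's order.
    static Map<String, Double> relativeFrequencies(TermSource corpus) {
        Map<String, Double> freq = new LinkedHashMap<>();
        long total = corpus.size();
        if (total == 0) {
            return freq; // an empty corpus has no frequencies to report
        }
        Iterator<String> it = corpus.terms();
        while (it.hasNext()) {
            String term = it.next();
            freq.put(term, (double) corpus.count(term) / total);
        }
        return freq;
    }
}
```

      The same loop shape should apply to bigrams() with count(Bigram), although the appropriate denominator there would presumably be bigramCount() rather than size().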
    • search

      Iterator<Text> search(String term)
      Returns the iterator over the set of documents containing the given term.
      Parameters:
      term - the search term.
      Returns:
      the iterator of documents containing the term.
    • search

      Iterator<Relevance> search(RelevanceRanker ranker, String term)
      Returns the iterator over the set of documents containing the given term in descending order of relevance.
      Parameters:
      ranker - the relevance ranker.
      term - the search term.
      Returns:
      the iterator of documents in descending order of relevance.
    • search

      Iterator<Relevance> search(RelevanceRanker ranker, String[] terms)
      Returns the iterator over the set of documents containing (at least one of) the given terms in descending order of relevance.
      Parameters:
      ranker - the relevance ranker.
      terms - the search terms.
      Returns:
      the iterator of documents in descending order of relevance.
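      Because the ranked search methods return an iterator in descending order of relevance, taking the best k hits amounts to reading the first k elements. The helper below is a minimal sketch of that idea; SearchResults and firstK are hypothetical names, and the helper is generic so that it does not depend on the Text or Relevance types.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SearchResults {
    private SearchResults() {}

    // Collects at most k elements from the iterator, preserving its order.
    // If the iterator is already sorted by descending relevance, the
    // returned list holds the k most relevant results.
    static <T> List<T> firstK(Iterator<T> results, int k) {
        List<T> top = new ArrayList<>();
        while (top.size() < k && results.hasNext()) {
            top.add(results.next());
        }
        return top;
    }
}
```

      Assuming a Corpus named corpus and a RelevanceRanker named ranker are already available, a call might look like SearchResults.firstK(corpus.search(ranker, "term"), 10).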