smile.nlp.tokenizer

package smile.nlp.tokenizer

Sentence splitters, paragraph splitters, and word tokenizers.

Related Packages

Package

Description

smile.nlp

Natural language processing.

Class

Description

BreakIteratorSentenceSplitter

A sentence splitter based on java.text.BreakIterator, which supports multiple natural languages (selected by the locale setting).

BreakIteratorTokenizer

A word tokenizer based on java.text.BreakIterator, which supports multiple natural languages (selected by the locale setting).

ParagraphSplitter

A paragraph splitter segments text into paragraphs.

PennTreebankTokenizer

A word tokenizer that tokenizes English sentences following the Penn Treebank conventions.

SentenceSplitter

A sentence splitter segments text into sentences, where a sentence is a string of words satisfying the grammatical rules of a language.

SimpleParagraphSplitter

A simple paragraph splitter that treats text blocks separated by one or more blank lines as paragraphs.

SimpleSentenceSplitter

A simple sentence splitter for English text.

SimpleTokenizer

A word tokenizer that tokenizes English sentences with some differences from PennTreebankTokenizer, notably in its handling of not-contractions.

Tokenizer

A tokenizer forms tokens from an input stream of characters. A token is a string of characters, categorized according to the rules as a symbol.
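
Example

BreakIteratorSentenceSplitter and BreakIteratorTokenizer are described above as being based on java.text.BreakIterator. The following is a minimal sketch of that underlying JDK mechanism only, not of Smile's own implementation. The class name BreakIteratorSketch and its methods splitSentences and tokenize are hypothetical and exist purely for illustration. Dropping whitespace-only segments is likewise an assumption of this sketch.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; it is not part of smile.nlp.tokenizer.
final class BreakIteratorSketch {

    // Walks the boundaries reported by the iterator and collects the
    // trimmed, non-empty text segments between consecutive boundaries.
    private static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    // Sentence boundaries follow the rules of the given locale.
    public static List<String> splitSentences(String text, Locale locale) {
        return segments(BreakIterator.getSentenceInstance(locale), text);
    }

    // Word boundaries follow the rules of the given locale;
    // punctuation marks come back as separate segments.
    public static List<String> tokenize(String text, Locale locale) {
        return segments(BreakIterator.getWordInstance(locale), text);
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you? Fine!";
        for (String sentence : splitSentences(text, Locale.US)) {
            System.out.println(sentence + " -> " + tokenize(sentence, Locale.US));
        }
    }
}
```

Passing a different Locale to either method selects that language's boundary rules, which is the sense in which the BreakIterator-based classes support multiple natural languages.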