Package smile.nlp.tokenizer
Class PennTreebankTokenizer
java.lang.Object
smile.nlp.tokenizer.PennTreebankTokenizer
A word tokenizer that tokenizes English sentences using the conventions
used by the Penn Treebank. Most punctuation is split from adjoining words.
Verb contractions and the Anglo-Saxon genitive of nouns are split into their
component morphemes, and each morpheme is tagged separately. Examples:
- children's -> children 's
- parents' -> parents '
- won't -> wo n't
- can't -> ca n't
- weren't -> were n't
- cannot -> can not
- 'tisn't -> 't is n't
- 'tis -> 't is
- gonna -> gon na
- I'm -> I 'm
- he'll -> he 'll
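
The splitting conventions listed above can be illustrated with a small, self-contained sketch. This is not the Smile implementation: the class name PtbContractionSketch, its tokenize method, and the specific regular expressions are hypothetical and cover only the example forms shown in the list. Punctuation handling, which the real tokenizer also performs, is left out.

```java
import java.util.regex.Pattern;

/**
 * Illustrative sketch only (hypothetical class, not part of Smile).
 * Applies a few Penn Treebank style contraction rules with regular
 * expressions, then splits the result on whitespace.
 */
final class PtbContractionSketch {
    // Negative contractions: won't -> wo n't, weren't -> were n't
    private static final Pattern NOT = Pattern.compile("(?i)\\b(\\p{L}+)(n't)\\b");
    // cannot -> can not
    private static final Pattern CANNOT = Pattern.compile("(?i)\\b(can)(not)\\b");
    // gonna -> gon na
    private static final Pattern GONNA = Pattern.compile("(?i)\\b(gon)(na)\\b");
    // 'tis -> 't is (the apostrophe must not follow a letter)
    private static final Pattern TIS = Pattern.compile("(?i)(?<!\\p{L})('t)(is)\\b");
    // Clitics and singular genitive: I'm -> I 'm, he'll -> he 'll, children's -> children 's
    private static final Pattern CLITIC =
            Pattern.compile("(?i)(\\p{L})('s|'m|'ll|'re|'ve|'d)\\b");
    // Plural genitive: parents' -> parents '
    private static final Pattern PLURAL_GENITIVE = Pattern.compile("(\\p{L}s)'(?!\\p{L})");

    private PtbContractionSketch() { }

    /** Returns the tokens of {@code text} after applying the rules above in order. */
    static String[] tokenize(String text) {
        String s = text;
        s = NOT.matcher(s).replaceAll("$1 $2");
        s = CANNOT.matcher(s).replaceAll("$1 $2");
        s = GONNA.matcher(s).replaceAll("$1 $2");
        s = TIS.matcher(s).replaceAll("$1 $2");
        s = CLITIC.matcher(s).replaceAll("$1 $2");
        s = PLURAL_GENITIVE.matcher(s).replaceAll("$1 '");
        return s.trim().split("\\s+");
    }
}
```

Under these assumptions, PtbContractionSketch.tokenize("I can't find the children's toys") yields the tokens I, ca, n't, find, the, children, 's, toys. The rules run in a fixed order so that a form such as 'tisn't is first split at n't and only then at 't.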