Class SimpleSentenceSplitter
- All Implemented Interfaces: SentenceSplitter
Recognizing the end of a sentence is not an easy task for a computer. In English, the punctuation marks that usually end a sentence do not always do so. The period is the worst offender. A period can end a sentence, but it can also be part of an abbreviation or acronym, an ellipsis, a decimal number, or a bracket of periods surrounding a Roman numeral. A period can even mark the end of an abbreviation and the end of a sentence at the same time. On the other hand, some poems may not contain any sentence punctuation at all.
Another problematic punctuation mark is the single quote, which can open a quotation or begin a contraction such as 'tis. Leading-quote contractions are uncommon in contemporary English texts, but appear frequently in Early Modern English texts.
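The period ambiguity described above can be illustrated with a minimal sketch. It is not the rule set this class uses: PeriodAmbiguityDemo, naiveSplit, guardedSplit, and the short abbreviation list are all hypothetical names invented for the illustration. The naive rule breaks after every period, question mark, or exclamation mark that is followed by whitespace; the guarded rule additionally refuses to break after a known abbreviation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PeriodAmbiguityDemo {
    // Hypothetical, deliberately tiny abbreviation list (illustration only).
    private static final Set<String> ABBREVIATIONS =
            Set.of("Dr.", "Mr.", "Mrs.", "St.", "etc.");

    /** Naive rule: '.', '!' or '?' followed by whitespace always ends a sentence. */
    static List<String> naiveSplit(String text) {
        List<String> out = new ArrayList<>();
        for (String chunk : text.trim().split("(?<=[.!?])\\s+")) {
            if (!chunk.isEmpty()) {
                out.add(chunk);
            }
        }
        return out;
    }

    /** Same rule, but a chunk whose last word is a known abbreviation is joined to the next chunk. */
    static List<String> guardedSplit(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String chunk : naiveSplit(text)) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(chunk);
            String lastWord = chunk.substring(chunk.lastIndexOf(' ') + 1);
            if (!ABBREVIATIONS.contains(lastWord)) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString()); // text ended on an abbreviation
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Dr. Smith paid 3.50 dollars. He left.";
        System.out.println(naiveSplit(text));   // breaks after "Dr."
        System.out.println(guardedSplit(text)); // keeps "Dr. Smith ..." together
    }
}
```

The decimal number survives both versions only because no whitespace follows its period. The guarded version still fails when an abbreviation also ends a sentence, which is exactly the double-duty case mentioned above and one reason a rule-based splitter needs more context than a lookup table.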
This tokenizer assumes that the text has already been segmented into paragraphs. Any carriage returns will be replaced by whitespace.
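Because the input is expected to be one paragraph at a time, a caller typically segments the raw text first. The following is a minimal sketch of such a preprocessing step, assuming that blank lines separate paragraphs; that convention, and the names ParagraphPrep, paragraphs, and flatten, are assumptions made for the example and are not part of this class.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphPrep {
    /** Splits raw text into paragraphs at blank lines (an assumed convention). */
    static List<String> paragraphs(String raw) {
        // Normalize Windows and old Mac line endings to '\n' first.
        String normalized = raw.replace("\r\n", "\n").replace('\r', '\n');
        List<String> out = new ArrayList<>();
        for (String block : normalized.split("\n\\s*\n")) {
            String flat = flatten(block);
            if (!flat.isEmpty()) {
                out.add(flat);
            }
        }
        return out;
    }

    /** Replaces each run of carriage returns and line feeds with a single space. */
    static String flatten(String paragraph) {
        return paragraph.replaceAll("[\\r\\n]+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "First line\r\nsecond line.\r\n\r\nNext paragraph.";
        for (String p : paragraphs(raw)) {
            System.out.println(p);
        }
    }
}
```

Each string returned by paragraphs could then be passed to a sentence splitter on its own, so that a sentence is never assumed to span a paragraph boundary.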
References
- Paul Clough. A Perl program for sentence splitting using rules.
Method Summary
Modifier and Type | Method | Description
static SimpleSentenceSplitter | getInstance | Returns the singleton instance.
String[] | split | Splits the text into sentences.
Method Details
getInstance
Returns the singleton instance.
- Returns: the singleton instance.
split
Description copied from interface: SentenceSplitter
Splits the text into sentences.
- Specified by: split in interface SentenceSplitter
- Parameters: text - the text.
- Returns: the sentences.
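The call pattern implied by the two methods, obtaining the shared instance and then calling split on it, can be sketched with stand-in types. ToySentenceSplitter is a hypothetical class written for this example, and the local SentenceSplitter interface only mirrors the documented shape (a split method that takes the text and returns String[]); the String parameter type and the splitting rule are assumptions, and neither type is the library's implementation.

```java
final class ToySentenceSplitter implements SentenceSplitter {
    // The single shared instance, created once when the class is loaded.
    private static final ToySentenceSplitter INSTANCE = new ToySentenceSplitter();

    // Private constructor: callers must go through getInstance().
    private ToySentenceSplitter() { }

    static ToySentenceSplitter getInstance() {
        return INSTANCE;
    }

    /** Toy rule (assumption): line breaks become spaces, then break after '.', '!' or '?'. */
    @Override
    public String[] split(String text) {
        String flat = text.replaceAll("[\\r\\n]+", " ").trim();
        return flat.isEmpty() ? new String[0] : flat.split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        SentenceSplitter splitter = ToySentenceSplitter.getInstance();
        for (String sentence : splitter.split("It rained.\nWe stayed in!")) {
            System.out.println(sentence);
        }
    }
}

// Stand-in for the documented interface shape (assumption: one method, String in, String[] out).
interface SentenceSplitter {
    String[] split(String text);
}
```

Since the stand-in holds no per-call state, sharing one instance is safe here; whether the same holds for the real class is not stated on this page.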