IN - The type of the tokens in the sentences
public class WordToSentenceProcessor<IN> extends java.lang.Object implements ListProcessor<IN,java.util.List<IN>>
<p>' tag. If two of these follow each other, they are
coalesced: no empty Sentence is output. The end-of-file is not
represented in this Set, but the code behaves as if it were a member.
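The coalescing behavior described above can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the class name `DiscardBoundarySketch`, the method `split`, and the use of `<p>` as the only discard token are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified illustration only; hypothetical names, not the CoreNLP code.
final class DiscardBoundarySketch {

    // Splits tokens at every member of `discard` and drops the boundary token itself.
    // Adjacent boundary tokens are coalesced, so no empty sentence is emitted.
    static List<List<String>> split(List<String> tokens, Set<String> discard) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (discard.contains(token)) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(token);
            }
        }
        // End of input is treated as if it were one more boundary token.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("Hello", "world", "<p>", "<p>", "Second", "paragraph");
        // prints [[Hello, world], [Second, paragraph]]
        System.out.println(split(tokens, Set.of("<p>")));
    }
}
```

The two consecutive `<p>` tokens yield a single split, and the trailing tokens form a sentence even though no boundary token follows them.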
| Modifier and Type | Field and Description |
|---|---|
| static java.util.Set<java.lang.String> | DEFAULT_BOUNDARY_FOLLOWERS |
| static java.util.Set<java.lang.String> | DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD |
| Constructor and Description |
|---|
| WordToSentenceProcessor(): Create a WordToSentenceProcessor using a sensible default list of tokens to split on for English/Latin writing systems. |
| WordToSentenceProcessor(java.lang.String boundaryTokenRegex): Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary-following tokens and sentence-boundary-to-discard tokens (based on English and Penn Treebank encoding). |
| WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.util.Set<java.lang.String> boundaryFollowers, java.util.Set<java.lang.String> boundaryToDiscard): Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of tokens that are sentence boundaries that should be discarded. |
| Modifier and Type | Method and Description |
|---|---|
| void | addHtmlSentenceBoundaryToDiscard(java.util.Set<java.lang.String> set) |
| boolean | allowEmptySentences() |
| boolean | isOneSentence() |
| java.util.List<java.util.List<IN>> | process(java.util.List<? extends IN> words): Take a List (including a Sentence) of input, and return a List that has been processed in some way. |
| <L,F> Document<L,F,java.util.List<IN>> | processDocument(Document<L,F,IN> in) |
| void | setAllowEmptySentences(boolean allowEmptySentences) |
| void | setOneSentence(boolean oneSentence) |
| void | setSentenceBoundaryToDiscard(java.util.Set<java.lang.String> regexSet) |
| java.util.List<java.util.List<IN>> | wordsToSentences(java.util.List<? extends IN> words): Returns a List of Lists where each element is built from a run of Words in the input Document. |
public static final java.util.Set<java.lang.String> DEFAULT_BOUNDARY_FOLLOWERS
public static final java.util.Set<java.lang.String> DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD
public WordToSentenceProcessor()
Create a WordToSentenceProcessor using a sensible default list of tokens to split on for English/Latin writing systems. The default set is: {".","?","!"} and any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!.
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex)
boundaryTokenRegex - The set of boundary tokens
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex,
java.util.Set<java.lang.String> boundaryFollowers,
java.util.Set<java.lang.String> boundaryToDiscard)
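The default boundary set described for the no-argument constructor (".", "?", "!", and any combination of ! or ?) can be written as a regular expression. The sketch below is one way to do that, assuming the pattern `\.|[!?]+` and whole-token matching. The class name `DefaultBoundaryRegexSketch` and the helper `isBoundary` are hypothetical, and the pattern the library actually uses may differ.

```java
import java.util.List;
import java.util.regex.Pattern;

// Simplified illustration only; the pattern is an assumption based on the description above.
final class DefaultBoundaryRegexSketch {

    // A single period, or one or more '!' / '?' characters in any order.
    static final Pattern BOUNDARY = Pattern.compile("\\.|[!?]+");

    // A token counts as a boundary only if the whole token matches the pattern.
    static boolean isBoundary(String token) {
        return BOUNDARY.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : List.of(".", "?", "!", "!!!?!?!?!!!?!!?!!!", "...", "word")) {
            System.out.println(token + " -> " + isBoundary(token));
        }
    }
}
```

Under this assumed pattern, an ellipsis token such as `...` is not a boundary, because only a single period is accepted. A string passed as `boundaryTokenRegex` would be the place to widen or narrow that set.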
public void setSentenceBoundaryToDiscard(java.util.Set<java.lang.String> regexSet)
public boolean isOneSentence()
public void setOneSentence(boolean oneSentence)
public boolean allowEmptySentences()
public void setAllowEmptySentences(boolean allowEmptySentences)
public void addHtmlSentenceBoundaryToDiscard(java.util.Set<java.lang.String> set)
public java.util.List<java.util.List<IN>> process(java.util.List<? extends IN> words)
Specified by: process in interface ListProcessor<IN,java.util.List<IN>>
public java.util.List<java.util.List<IN>> wordsToSentences(java.util.List<? extends IN> words)
PTBTokenizer).
words - A list of already tokenized words (must implement HasWord or be a String)
See also: WordToSentenceProcessor(String, Set, Set, Pattern, Pattern)
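To make the roles of the boundary regex and the boundary followers concrete, here is a minimal sketch of grouping already tokenized words into sentences. It is not the library's algorithm. It assumes that a token matching the boundary regex ends a sentence and that any immediately following token from the follower set (for example a closing quote) is attached to that same sentence. The class name `SentenceSplitSketch` is hypothetical, and plain `String` tokens stand in for the generic `IN` type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Simplified illustration only; hypothetical names, not the CoreNLP code.
final class SentenceSplitSketch {

    static List<List<String>> wordsToSentences(List<String> words,
                                               Pattern boundary,
                                               Set<String> followers) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean afterBoundary = false;
        for (String word : words) {
            // A boundary has been seen and this token is not a follower:
            // close the current sentence before adding the token.
            if (afterBoundary && !followers.contains(word)) {
                sentences.add(current);
                current = new ArrayList<>();
                afterBoundary = false;
            }
            current.add(word);
            if (boundary.matcher(word).matches()) {
                afterBoundary = true;
            }
        }
        // Whatever remains at the end of input forms the last sentence.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> words = List.of("He", "said", "``", "Stop", "!", "''",
                                     "Then", "he", "left", ".");
        // prints [[He, said, ``, Stop, !, ''], [Then, he, left, .]]
        System.out.println(wordsToSentences(words, Pattern.compile("\\.|[!?]+"),
                                            Set.of("''", ")")));
    }
}
```

The closing quote `''` stays with the first sentence because it is in the follower set. Without it, the quote would open the second sentence.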