public class FrenchTokenizer<T extends HasWord> extends AbstractTokenizer<T>
The tokenizer implicitly inserts segmentation markers by not normalizing the apostrophe and hyphen. Detokenization can thus be performed by right-concatenating apostrophes and left-concatenating hyphens.
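The detokenization rule described above can be sketched as follows. This is a minimal illustration and not part of the library: the class name `FrenchDetokenizerSketch` and the method `detokenize` are hypothetical, and the sketch assumes that tokens ending in an ASCII apostrophe attach to the token on their right and tokens beginning with an ASCII hyphen attach to the token on their left. Spacing around punctuation is not handled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the tokenizer's API.
class FrenchDetokenizerSketch {

    // Joins tokens with single spaces, except that a token ending in an
    // apostrophe is glued to the following token and a token starting
    // with a hyphen is glued to the preceding token.
    static String detokenize(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        boolean glueNext = false; // true when the previous token ended with '
        for (String tok : tokens) {
            if (tok.isEmpty()) {
                continue;
            }
            // A lone "-" is treated as an ordinary token (assumption).
            boolean glueLeft = tok.length() > 1 && tok.charAt(0) == '-';
            if (sb.length() > 0 && !glueNext && !glueLeft) {
                sb.append(' ');
            }
            sb.append(tok);
            glueNext = tok.length() > 1 && tok.charAt(tok.length() - 1) == '\'';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(detokenize(Arrays.asList("l'", "homme", "dit", "-il")));
    }
}
```

The token sequence in `main` is an assumed example of tokenizer output, chosen only to exercise both rules.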
A single instance of a FrenchTokenizer is not thread safe, because it uses a non-thread-safe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of a FrenchTokenizerFactory is also not thread safe, because it keeps its options in a local variable.
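One way to respect this constraint is to give every task its own instance instead of sharing one across threads. The sketch below shows only the general pattern: `StatefulTokenizer` is a hypothetical stand-in with mutable internal state, not the real FrenchTokenizer, and `tokenizeAll` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerTaskTokenizerSketch {

    // Hypothetical stand-in for a tokenizer that keeps mutable state
    // and therefore must not be shared between threads.
    static class StatefulTokenizer {
        private final StringBuilder buffer = new StringBuilder();

        List<String> tokenize(String text) {
            List<String> out = new ArrayList<>();
            buffer.setLength(0);
            for (char c : text.toCharArray()) {
                if (Character.isWhitespace(c)) {
                    flush(out);
                } else {
                    buffer.append(c);
                }
            }
            flush(out);
            return out;
        }

        private void flush(List<String> out) {
            if (buffer.length() > 0) {
                out.add(buffer.toString());
                buffer.setLength(0);
            }
        }
    }

    // Tokenizes each document on a thread pool. Every task constructs
    // its own tokenizer, so no instance is visible to more than one thread.
    static List<List<String>> tokenizeAll(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String doc : docs) {
                futures.add(pool.submit(() -> new StatefulTokenizer().tokenize(doc)));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get()); // results keep the input order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With the real classes, the same idea would presumably mean creating a separate factory and tokenizer inside each task rather than passing one shared instance to the pool.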
| Modifier and Type | Class and Description |
|---|---|
| static class | FrenchTokenizer.FrenchTokenizerFactory<T extends HasWord> |

Fields inherited from class AbstractTokenizer: nextToken

| Constructor and Description |
|---|
| FrenchTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties) |
| Modifier and Type | Method and Description |
|---|---|
| static TokenizerFactory<CoreLabel> | factory() |
| static TokenizerFactory<CoreLabel> | ftbFactory() |
| protected T | getNext() Internally fetches the next token. |
| static void | main(java.lang.String[] args) A fast, rule-based tokenizer for Modern Standard French. |
| static FrenchTokenizer<CoreLabel> | newFrenchTokenizer(java.io.Reader r, java.util.Properties lexerProperties) |
public FrenchTokenizer(java.io.Reader r,
LexedTokenFactory<T> tf,
java.util.Properties lexerProperties)
public static FrenchTokenizer<CoreLabel> newFrenchTokenizer(java.io.Reader r, java.util.Properties lexerProperties)
protected T getNext()
Specified by: getNext in class AbstractTokenizer<T extends HasWord>
public static TokenizerFactory<CoreLabel> factory()
public static TokenizerFactory<CoreLabel> ftbFactory()
public static void main(java.lang.String[] args)
Currently, this tokenizer does not do line splitting. It assumes that the input file is delimited by the system line separator. The output will be equivalently delimited.
args -
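The line-delimited contract of `main` can be illustrated with a small sketch. It is an assumption about the observable behavior, not the actual implementation: `LineDelimitedSketch`, `tokenizeLines`, and the `tokenizeLine` parameter are hypothetical names, and the per-line tokenizer is passed in as a plain function so the example stays self-contained. Each input line is tokenized on its own and written back as one output line terminated by the system line separator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.function.Function;

class LineDelimitedSketch {

    // Reads the input line by line, tokenizes each line independently,
    // and emits one space-separated line of tokens per input line.
    static String tokenizeLines(Reader in, Function<String, List<String>> tokenizeLine) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            // Note: readLine() also accepts \n, \r, and \r\n, which is more
            // lenient than requiring the system line separator.
            while ((line = br.readLine()) != null) {
                out.append(String.join(" ", tokenizeLine.apply(line)));
                out.append(System.lineSeparator());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

No sentence splitting happens here: a line containing several sentences stays a single output line, which matches the statement that the tokenizer does not do line splitting.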