edu.stanford.nlp.ie.regexp
Class NumberSequenceClassifier

java.lang.Object
  extended by edu.stanford.nlp.ie.AbstractSequenceClassifier<CoreLabel>
      extended by edu.stanford.nlp.ie.regexp.NumberSequenceClassifier
All Implemented Interfaces:
Function<java.lang.String,java.lang.String>

public class NumberSequenceClassifier
extends AbstractSequenceClassifier<CoreLabel>

A set of deterministic rules for marking certain entities, both to add categories and to correct for failures of statistical NER taggers. This is an extremely simple, ungeneralized implementation of AbstractSequenceClassifier that was written for PASCAL RTE; it could profitably be extended and generalized. It applies the following rules:

It marks a NUMBER category deterministically, based on part-of-speech tags.
It marks an ORDINAL category deterministically, based on word form.
It tags as MONEY currency signs and things tagged CD after a currency sign.
It marks a number before a month name as a DATE.
It marks as a DATE a word of the form xx/xx/xxxx (where x is a digit from a suitable range).
It marks as a TIME a word of the form x(x):xx (where x is a digit).
It marks everything else tagged "CD" as a NUMBER, along with instances of "and" appearing between CD tags in contexts suggestive of a number.

It requires text to be POS-tagged (that is, each token must have the getString(TagAnnotation.class) attribute). In effect, these rules assume that this classifier will be used as a secondary classifier by code such as ClassifierCombiner: it marks most CD tokens as NUMBER, on the assumption that something else with higher priority marks the ones that are PERCENT, ADDRESS, etc.
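
The rules described above can be illustrated with a minimal, self-contained sketch. It is not this class's implementation: the class name NumberRuleSketch, the label method, the "O" background label, the rule ordering, and all of the regular expressions are assumptions made for illustration, and the library's own pattern constants may differ. The sketch covers only some of the rules (it omits the "and" between CD tags rule).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, simplified re-creation of a few of the rules described above.
// These patterns are illustrative and are NOT the library's own constants.
class NumberRuleSketch {
  static final Pattern CURRENCY = Pattern.compile("[$\\u00A3\\u20AC]");
  static final Pattern ORDINAL = Pattern.compile("(?i)\\d+(st|nd|rd|th)|first|second|third");
  static final Pattern DATE = Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{4}");
  static final Pattern TIME = Pattern.compile("\\d{1,2}:\\d{2}");
  static final Pattern MONTH = Pattern.compile(
      "(?i)january|february|march|april|may|june|july|august|september|october|november|december");

  /** Labels each word given its POS tag; "O" is used here as the background label. */
  static List<String> label(String[] words, String[] tags) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < words.length; i++) {
      String w = words[i];
      boolean cd = "CD".equals(tags[i]);
      boolean afterCurrency = i > 0 && CURRENCY.matcher(words[i - 1]).matches();
      boolean beforeMonth = i + 1 < words.length && MONTH.matcher(words[i + 1]).matches();
      if (CURRENCY.matcher(w).matches() || (cd && afterCurrency)) {
        out.add("MONEY");    // currency sign, or CD right after one
      } else if (DATE.matcher(w).matches() || (cd && beforeMonth)) {
        out.add("DATE");     // xx/xx/xxxx, or a number before a month name
      } else if (TIME.matcher(w).matches()) {
        out.add("TIME");     // x(x):xx
      } else if (ORDINAL.matcher(w).matches()) {
        out.add("ORDINAL");  // decided by word form, not by tag
      } else if (cd) {
        out.add("NUMBER");   // everything else tagged CD
      } else {
        out.add("O");
      }
    }
    return out;
  }
}
```

Because the fallback rule labels every remaining CD token as NUMBER, a sketch like this only makes sense when a higher-priority classifier handles categories such as PERCENT.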

Author:
Christopher Manning, Mihai (integrated with NumberNormalizer, SUTime)

Field Summary
static java.util.regex.Pattern AM_PM
           
static java.util.regex.Pattern ARMY_TIME_MORNING
           
static java.util.regex.Pattern CURRENCY_SYMBOL_PATTERN
           
static java.util.regex.Pattern CURRENCY_WORD_PATTERN
           
static java.util.regex.Pattern DATE_PATTERN
           
static java.util.regex.Pattern DATE_PATTERN2
           
static java.util.regex.Pattern DAY_PATTERN
           
static java.util.regex.Pattern GENERIC_TIME_WORDS
           
static java.util.regex.Pattern MONTH_PATTERN
           
static java.util.regex.Pattern ORDINAL_PATTERN
           
static java.util.regex.Pattern PERCENT_SYMBOL_PATTERN
           
static java.util.regex.Pattern PERCENT_WORD_PATTERN
           
static java.util.regex.Pattern TIME_PATTERN
           
static java.util.regex.Pattern TIME_PATTERN2
           
static boolean USE_SUTIME_DEFAULT
           
static java.lang.String USE_SUTIME_PROPERTY
           
static java.util.regex.Pattern YEAR_PATTERN
           
 
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
classIndex, featureFactory, flags, knownLCWords, pad, windowSize
 
Constructor Summary
NumberSequenceClassifier()
           
NumberSequenceClassifier(boolean useSUTime)
           
NumberSequenceClassifier(java.util.Properties props, boolean useSUTime, java.util.Properties sutimeProps)
           
 
Method Summary
static CoreMap alignSentence(CoreMap sentence)
          Copies one sentence replicating only information necessary for SUTime
 java.util.List<CoreLabel> classify(java.util.List<CoreLabel> document)
          Classify a List of CoreLabels.
 java.util.List<CoreLabel> classifyWithGlobalInformation(java.util.List<CoreLabel> tokens, CoreMap document, CoreMap sentence)
          Classify a List of objects that extend CoreMap, using whatever is stored in the document and sentence as additional information.
static java.util.List<CoreLabel> copyTokens(java.util.List<CoreLabel> srcTokens, CoreMap srcSentence)
          Create a copy of srcTokens, detecting on the fly if character offsets need adjusting
 void loadClassifier(java.io.ObjectInputStream in, java.util.Properties props)
          Load a classifier from the specified input stream.
static void main(java.lang.String[] args)
           
 void printProbsDocument(java.util.List<CoreLabel> document)
           
 void serializeClassifier(java.lang.String serializePath)
          Serialize a sequence classifier to a file on the given path.
 void train(java.util.Collection<java.util.List<CoreLabel>> docs, DocumentReaderAndWriter<CoreLabel> readerAndWriter)
          Trains a classifier from a Collection of sequences.
static void transferAnnotations(CoreLabel src, CoreLabel dst)
          Transfer from src to dst all annotations generated by SUTime and NumberNormalizer
 
Methods inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
apply, backgroundSymbol, classify, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswersKBest, classifyAndWriteAnswersKBest, classifyAndWriteViterbiSearchGraph, classifyFile, classifyKBest, classifyRaw, classifySentence, classifySentenceWithGlobalInformation, classifyStdin, classifyStdin, classifyToCharacterOffsets, classifyToString, classifyToString, classifyWithInlineXML, countResults, countResults, countResultsIOB, countResultsIOB2, defaultReaderAndWriter, getSampler, getSequenceModel, getViterbiSearchGraph, labels, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadJarClassifier, makeObjectBankFromFile, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromReader, makeObjectBankFromString, makePlainTextReaderAndWriter, makeReaderAndWriter, plainTextReaderAndWriter, printFeatureLists, printFeatures, printProbs, printProbsDocuments, printResults, reinit, segmentString, segmentString, tallyOneEntityIOB, train, train, train, train, train, train, windowSize, writeAnswers
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

USE_SUTIME_DEFAULT

public static final boolean USE_SUTIME_DEFAULT

USE_SUTIME_PROPERTY

public static final java.lang.String USE_SUTIME_PROPERTY
See Also:
Constant Field Values
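
A property-name constant paired with a default-value constant is typically consumed as in the sketch below. This page does not state the values of USE_SUTIME_PROPERTY or USE_SUTIME_DEFAULT, nor how the class reads them, so the key "example.useSUTime", the default of true, and the names SUTimeFlagSketch and useSUTime are all placeholders.

```java
import java.util.Properties;

// Hypothetical sketch of resolving a boolean flag from Properties with a default.
class SUTimeFlagSketch {
  // Placeholder values: the real constants' values are not given on this page.
  static final String HYPOTHETICAL_KEY = "example.useSUTime";
  static final boolean HYPOTHETICAL_DEFAULT = true;

  /** Returns the flag from props, or the default when the key is absent. */
  static boolean useSUTime(Properties props) {
    String raw = props.getProperty(HYPOTHETICAL_KEY);
    return raw == null ? HYPOTHETICAL_DEFAULT : Boolean.parseBoolean(raw.trim());
  }
}
```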

MONTH_PATTERN

public static final java.util.regex.Pattern MONTH_PATTERN

YEAR_PATTERN

public static final java.util.regex.Pattern YEAR_PATTERN

DAY_PATTERN

public static final java.util.regex.Pattern DAY_PATTERN

DATE_PATTERN

public static final java.util.regex.Pattern DATE_PATTERN

DATE_PATTERN2

public static final java.util.regex.Pattern DATE_PATTERN2

TIME_PATTERN

public static final java.util.regex.Pattern TIME_PATTERN

TIME_PATTERN2

public static final java.util.regex.Pattern TIME_PATTERN2

AM_PM

public static final java.util.regex.Pattern AM_PM

CURRENCY_WORD_PATTERN

public static final java.util.regex.Pattern CURRENCY_WORD_PATTERN

CURRENCY_SYMBOL_PATTERN

public static final java.util.regex.Pattern CURRENCY_SYMBOL_PATTERN

ORDINAL_PATTERN

public static final java.util.regex.Pattern ORDINAL_PATTERN

ARMY_TIME_MORNING

public static final java.util.regex.Pattern ARMY_TIME_MORNING

GENERIC_TIME_WORDS

public static final java.util.regex.Pattern GENERIC_TIME_WORDS

PERCENT_WORD_PATTERN

public static final java.util.regex.Pattern PERCENT_WORD_PATTERN

PERCENT_SYMBOL_PATTERN

public static final java.util.regex.Pattern PERCENT_SYMBOL_PATTERN

Constructor Detail

NumberSequenceClassifier

public NumberSequenceClassifier()

NumberSequenceClassifier

public NumberSequenceClassifier(boolean useSUTime)

NumberSequenceClassifier

public NumberSequenceClassifier(java.util.Properties props,
                                boolean useSUTime,
                                java.util.Properties sutimeProps)

Method Detail

classify

public java.util.List<CoreLabel> classify(java.util.List<CoreLabel> document)
Classify a List of CoreLabels.

Specified by:
classify in class AbstractSequenceClassifier<CoreLabel>
Parameters:
document - A List of CoreLabels.
Returns:
the same List, but with the elements annotated with their answers.

classifyWithGlobalInformation

public java.util.List<CoreLabel> classifyWithGlobalInformation(java.util.List<CoreLabel> tokens,
                                                               CoreMap document,
                                                               CoreMap sentence)
Description copied from class: AbstractSequenceClassifier
Classify a List of objects that extend CoreMap, using whatever is stored in the document and sentence as additional information. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.

Specified by:
classifyWithGlobalInformation in class AbstractSequenceClassifier<CoreLabel>
Returns:
Classified version of the input tokens
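
To see why the document date matters, consider the sketch below, which resolves a few relative expressions against a reference date using java.time. It is not SUTime's algorithm: the class RelativeDateSketch, the resolve method, and the three supported expressions are assumptions chosen for illustration.

```java
import java.time.LocalDate;
import java.util.Locale;

// Hypothetical sketch: a relative date has no value without a reference date.
class RelativeDateSketch {
  /** Resolves a few relative expressions against documentDate; returns null if unknown. */
  static LocalDate resolve(String expr, LocalDate documentDate) {
    switch (expr.toLowerCase(Locale.ROOT)) {
      case "today":
        return documentDate;
      case "yesterday":
        return documentDate.minusDays(1);
      case "tomorrow":
        return documentDate.plusDays(1);
      default:
        return null; // expression not handled by this sketch
    }
  }
}
```

The same word resolves to different calendar dates for different documents, which is why the document is passed in alongside the tokens.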

alignSentence

public static CoreMap alignSentence(CoreMap sentence)
Copies one sentence replicating only information necessary for SUTime

Parameters:
sentence - The sentence to copy

transferAnnotations

public static void transferAnnotations(CoreLabel src,
                                       CoreLabel dst)
Transfer from src to dst all annotations generated by SUTime and NumberNormalizer

Parameters:
src - The CoreLabel to copy annotations from
dst - The CoreLabel that receives the annotations
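
The idea of transferring only a chosen set of annotations can be sketched with plain maps. This is not the library's code: TransferSketch, its transfer method, and the string keys are hypothetical stand-ins, and the annotation keys that the real method copies are not listed on this page.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: copy only the listed keys from src to dst.
class TransferSketch {
  /** Copies each key in keys that is present in src; other entries of dst are untouched. */
  static void transfer(Map<String, Object> src, Map<String, Object> dst, List<String> keys) {
    for (String key : keys) {
      if (src.containsKey(key)) {
        dst.put(key, src.get(key));
      }
    }
  }
}
```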

copyTokens

public static java.util.List<CoreLabel> copyTokens(java.util.List<CoreLabel> srcTokens,
                                                   CoreMap srcSentence)
Create a copy of srcTokens, detecting on the fly if character offsets need adjusting

Parameters:
srcTokens - The tokens to copy
srcSentence - The sentence that contains srcTokens
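
One way to detect "on the fly" whether character offsets need adjusting is to check whether each token's offsets already index into the sentence text, and to shift them by the sentence's starting offset when they do not. The sketch below shows that idea only; the criterion the library actually uses is not documented here, and OffsetCopySketch, Tok, and copyWithAlignedOffsets are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of copying tokens while rebasing document-relative offsets.
class OffsetCopySketch {
  /** Minimal stand-in for a token with character offsets (not CoreLabel). */
  static final class Tok {
    final String word;
    final int begin;
    final int end;

    Tok(String word, int begin, int end) {
      this.word = word;
      this.begin = begin;
      this.end = end;
    }
  }

  /** True when the token's span lies inside text and spells out the token's word. */
  static boolean indexesInto(Tok t, String text) {
    return t.begin >= 0 && t.begin <= t.end && t.end <= text.length()
        && text.substring(t.begin, t.end).equals(t.word);
  }

  /** Copies src; subtracts sentenceBegin from offsets unless they already fit sentenceText. */
  static List<Tok> copyWithAlignedOffsets(List<Tok> src, String sentenceText, int sentenceBegin) {
    boolean aligned = true;
    for (Tok t : src) {
      if (!indexesInto(t, sentenceText)) {
        aligned = false;
        break;
      }
    }
    int shift = aligned ? 0 : sentenceBegin;
    List<Tok> out = new ArrayList<>();
    for (Tok t : src) {
      out.add(new Tok(t.word, t.begin - shift, t.end - shift));
    }
    return out;
  }
}
```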

train

public void train(java.util.Collection<java.util.List<CoreLabel>> docs,
                  DocumentReaderAndWriter<CoreLabel> readerAndWriter)
Description copied from class: AbstractSequenceClassifier
Trains a classifier from a Collection of sequences. Note that the Collection can be (and usually is) an ObjectBank.

Specified by:
train in class AbstractSequenceClassifier<CoreLabel>
Parameters:
docs - An ObjectBank or a collection of sequences of IN
readerAndWriter - A DocumentReaderAndWriter to use when loading test files

printProbsDocument

public void printProbsDocument(java.util.List<CoreLabel> document)
Specified by:
printProbsDocument in class AbstractSequenceClassifier<CoreLabel>

serializeClassifier

public void serializeClassifier(java.lang.String serializePath)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to a file on the given path.

Specified by:
serializeClassifier in class AbstractSequenceClassifier<CoreLabel>
Parameters:
serializePath - The path/filename to write the classifier to.

loadClassifier

public void loadClassifier(java.io.ObjectInputStream in,
                           java.util.Properties props)
                    throws java.io.IOException,
                           java.lang.ClassCastException,
                           java.lang.ClassNotFoundException
Description copied from class: AbstractSequenceClassifier
Load a classifier from the specified input stream. The classifier is reinitialized from the flags serialized in the classifier.

Specified by:
loadClassifier in class AbstractSequenceClassifier<CoreLabel>
Parameters:
in - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
java.io.IOException - If there are problems accessing the input stream
java.lang.ClassCastException - If there are problems interpreting the serialized data
java.lang.ClassNotFoundException - If there are problems interpreting the serialized data

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Throws:
java.lang.Exception


Stanford NLP Group