edu.stanford.nlp.ie.regexp
Class RegexNERSequenceClassifier

java.lang.Object
  extended by edu.stanford.nlp.ie.AbstractSequenceClassifier<CoreLabel>
      extended by edu.stanford.nlp.ie.regexp.RegexNERSequenceClassifier
All Implemented Interfaces:
Function<java.lang.String,java.lang.String>

public class RegexNERSequenceClassifier
extends AbstractSequenceClassifier<CoreLabel>

A sequence classifier that labels tokens with types based on a simple manual mapping from regular expressions to the types of the entities they are meant to describe. The user provides a file formatted as follows:

    regex1    TYPE    overwritableType1,Type2...    priority
    regex2    TYPE    overwritableType1,Type2...    priority
    ...
 
where the columns are tab-separated and the last two columns are optional. Several regexes can be associated with a single type. When multiple regexes match a phrase, the priority ranking is used to choose between the possible types.

This classifier is designed to be used as part of a full NER system to label entities that don't fall into the usual NER categories. It records a label only if the token has not already been NER-annotated, or if it has been annotated with an NER type that has been designated overwritable (the third column). Note that this is evaluated token-wise in this classifier, so it may assign a label to a token sequence that is partly background and partly overwritable. (In contrast, RegexNERAnnotator doesn't allow this.) It assigns labels to AnswerAnnotation, while checking for existing labels in NamedEntityTagAnnotation.

The first column may be a sequence of regexes, separated by whitespace (matching "\\s+"). The rule matches if the successive regexes match a sequence of tokens in the input. Spaces can only be used to separate the per-token regular expressions; within a token's regex, \s or a similar non-space representation must be used instead.

Notes:

Following Java regex conventions, some characters in the file need to be escaped. Only a single backslash should be used though, as these are not String literals.

The input to RegexNER will have already been tokenized. So, for example, with our usual English tokenization, things like genitives and commas at the end of words will be separated in the input and matched as separate tokens.

This class isn't implemented very efficiently, since every regex is evaluated at every token position. So it can and does get quite slow if you have a lot of patterns in your NER rules.

TokensRegex is a more general framework that provides the functionality of this class. But at present we still use this class.
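
The file format described above can be illustrated with a small, standalone parser. This is only a sketch under stated assumptions, not the code this class actually uses: the class name MappingLineSketch, its fields, and the default priority of 0.0 for a missing fourth column are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical holder for one parsed rule line; not a CoreNLP class.
class MappingLineSketch {
  final List<Pattern> tokenPatterns;    // one compiled regex per token
  final String type;                    // label the rule assigns
  final Set<String> overwritableTypes;  // existing NER types the rule may replace
  final double priority;                // used to choose between competing rules

  MappingLineSketch(List<Pattern> tokenPatterns, String type,
                    Set<String> overwritableTypes, double priority) {
    this.tokenPatterns = tokenPatterns;
    this.type = type;
    this.overwritableTypes = overwritableTypes;
    this.priority = priority;
  }

  // Parses "regex<TAB>TYPE[<TAB>overwritable1,overwritable2[<TAB>priority]]".
  static MappingLineSketch parse(String line) {
    String[] cols = line.split("\t");
    if (cols.length < 2 || cols.length > 4) {
      throw new IllegalArgumentException("Expected 2 to 4 tab-separated columns: " + line);
    }
    // The first column is a whitespace-separated sequence of per-token regexes.
    List<Pattern> patterns = new ArrayList<>();
    for (String regex : cols[0].trim().split("\\s+")) {
      patterns.add(Pattern.compile(regex));
    }
    // Optional third column: comma-separated overwritable types.
    Set<String> overwritable = new LinkedHashSet<>();
    if (cols.length > 2 && !cols[2].trim().isEmpty()) {
      overwritable.addAll(Arrays.asList(cols[2].trim().split(",")));
    }
    // Assumption of this sketch: a missing priority defaults to 0.0.
    double priority = cols.length > 3 ? Double.parseDouble(cols[3].trim()) : 0.0;
    return new MappingLineSketch(patterns, cols[1].trim(), overwritable, priority);
  }

  public static void main(String[] args) {
    MappingLineSketch rule = parse("Stanford University\tSCHOOL\tORGANIZATION,LOCATION\t2.0");
    System.out.println(rule.tokenPatterns.size() + " token regexes, type " + rule.type
        + ", overwrites " + rule.overwritableTypes + ", priority " + rule.priority);
  }
}
```

In the Java string literal above, "\t" stands for the literal tab character that separates columns in the mapping file itself.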

Author:
jtibs, Mihai

Field Summary
static java.lang.String DEFAULT_VALID_POS
           
 
Fields inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
classIndex, featureFactory, flags, knownLCWords, pad, windowSize
 
Constructor Summary
RegexNERSequenceClassifier(java.io.BufferedReader reader, boolean ignoreCase, boolean overwriteMyLabels, java.lang.String validPosRegex)
          Make a new instance of this classifier.
RegexNERSequenceClassifier(java.lang.String mapping, boolean ignoreCase, boolean overwriteMyLabels)
           
RegexNERSequenceClassifier(java.lang.String mapping, boolean ignoreCase, boolean overwriteMyLabels, java.lang.String validPosRegex)
          Make a new instance of this classifier.
 
Method Summary
 java.util.List<CoreLabel> classify(java.util.List<CoreLabel> document)
          Classify a List of something that extends CoreMap.
 java.util.List<CoreLabel> classifyWithGlobalInformation(java.util.List<CoreLabel> tokenSeq, CoreMap doc, CoreMap sent)
          Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence.
 void loadClassifier(java.io.ObjectInputStream in, java.util.Properties props)
          Load a classifier from the specified input stream.
 void printProbsDocument(java.util.List<CoreLabel> document)
           
 void serializeClassifier(java.lang.String serializePath)
          Serialize a sequence classifier to a file on the given path.
 void train(java.util.Collection<java.util.List<CoreLabel>> docs, DocumentReaderAndWriter<CoreLabel> readerAndWriter)
          Trains a classifier from a Collection of sequences.
 
Methods inherited from class edu.stanford.nlp.ie.AbstractSequenceClassifier
apply, backgroundSymbol, classify, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswers, classifyAndWriteAnswersKBest, classifyAndWriteAnswersKBest, classifyAndWriteViterbiSearchGraph, classifyFile, classifyFilesAndWriteAnswers, classifyFilesAndWriteAnswers, classifyKBest, classifyRaw, classifySentence, classifySentenceWithGlobalInformation, classifyStdin, classifyStdin, classifyToCharacterOffsets, classifyToString, classifyToString, classifyWithInlineXML, countResults, countResults, countResultsIOB, countResultsIOB2, defaultReaderAndWriter, getSampler, getSequenceModel, getViterbiSearchGraph, labels, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifier, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadClassifierNoExceptions, loadJarClassifier, makeObjectBankFromFile, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromFiles, makeObjectBankFromReader, makeObjectBankFromString, makePlainTextReaderAndWriter, makeReaderAndWriter, plainTextReaderAndWriter, printFeatureLists, printFeatures, printProbs, printProbsDocuments, printResults, reinit, segmentString, segmentString, tallyOneEntityIOB, train, train, train, train, train, train, windowSize, writeAnswers
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_VALID_POS

public static final java.lang.String DEFAULT_VALID_POS
See Also:
Constant Field Values
Constructor Detail

RegexNERSequenceClassifier

public RegexNERSequenceClassifier(java.lang.String mapping,
                                  boolean ignoreCase,
                                  boolean overwriteMyLabels)

RegexNERSequenceClassifier

public RegexNERSequenceClassifier(java.lang.String mapping,
                                  boolean ignoreCase,
                                  boolean overwriteMyLabels,
                                  java.lang.String validPosRegex)
Make a new instance of this classifier. The ignoreCase option enables case-insensitive regular expression matching, which is useful when the provided file is just a manual list of the possible entities for each type.

Parameters:
mapping - A String describing a file/classpath/URI for the RegexNER patterns
ignoreCase - The regex in the mapping file should be compiled ignoring case
overwriteMyLabels - If true, this classifier overwrites NE labels generated through this regex NER. This is necessary because sometimes the RegexNERSequenceClassifier is run successively over the same text (e.g., to overwrite some older annotations).
validPosRegex - May be null or an empty String, in which case any (or no) POS is valid in matching. Otherwise, this is a regex which is matched with find() [not matches()] and which must be matched by the POS of at least one word in the sequence for it to be labeled via any matching rules. (Note that this is a postfilter; using this will not speed up matching.)
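
The difference between find() and matches() in the validPosRegex description can be sketched with plain java.util.regex. This is an illustrative sketch only: the class ValidPosFilterSketch and its passes method are hypothetical names, and the POS tags are passed in as plain strings rather than read from CoreLabel objects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical postfilter mirroring the validPosRegex description; not CoreNLP code.
class ValidPosFilterSketch {

  // Returns true if a matched token span passes the POS postfilter:
  // a null or empty regex accepts everything; otherwise at least one
  // POS tag in the span must contain a match (find(), not matches()).
  static boolean passes(String validPosRegex, List<String> posTags) {
    if (validPosRegex == null || validPosRegex.isEmpty()) {
      return true;
    }
    Pattern validPos = Pattern.compile(validPosRegex);
    for (String pos : posTags) {
      if (pos != null && validPos.matcher(pos).find()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> span = Arrays.asList("DT", "NNPS");
    // find() succeeds on "NNPS" because it contains "NN";
    // matches() would require the whole tag to equal "NN".
    System.out.println(passes("NN", span));
    System.out.println(Pattern.compile("NN").matcher("NNPS").matches());
  }
}
```

Because find() looks for a match anywhere in the tag, anchor the regex (for example with ^ and $) when an exact tag is intended.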

RegexNERSequenceClassifier

public RegexNERSequenceClassifier(java.io.BufferedReader reader,
                                  boolean ignoreCase,
                                  boolean overwriteMyLabels,
                                  java.lang.String validPosRegex)
Make a new instance of this classifier. The ignoreCase option enables case-insensitive regular expression matching, which is useful when the provided file is just a manual list of the possible entities for each type.

Parameters:
reader - A Reader for the RegexNER patterns
ignoreCase - The regex in the mapping file should be compiled ignoring case
overwriteMyLabels - If true, this classifier overwrites NE labels generated through this regex NER. This is necessary because sometimes the RegexNERSequenceClassifier is run successively over the same text (e.g., to overwrite some older annotations).
validPosRegex - May be null or an empty String, in which case any (or no) POS is valid in matching. Otherwise, this is a regex, and only words with a POS that matches the regex will be labeled via any matching rules.
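
As a rough illustration of the reader and ignoreCase parameters, the following standalone sketch reads rule lines from a BufferedReader and compiles the first column with or without case-insensitivity. It is not this class's implementation: ReaderPatternsSketch and firstColumnPatterns are hypothetical names, the sketch assumes single-token rules and skips blank lines, and the exact regex flags the real class passes are not stated in this documentation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of CoreNLP.
class ReaderPatternsSketch {

  // Compiles the first (regex) column of every non-blank line.
  // Assumption: each rule here is a single-token regex.
  static List<Pattern> firstColumnPatterns(BufferedReader reader, boolean ignoreCase) {
    int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
    List<Pattern> patterns = new ArrayList<>();
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.trim().isEmpty()) {
          continue; // assumption of this sketch: blank lines are ignored
        }
        String regexColumn = line.split("\t")[0];
        patterns.add(Pattern.compile(regexColumn, flags));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return patterns;
  }

  public static void main(String[] args) {
    String rules = "stanford\tSCHOOL\nberkeley\tSCHOOL\n";
    List<Pattern> patterns =
        firstColumnPatterns(new BufferedReader(new StringReader(rules)), true);
    // With ignoreCase set, a lower-cased entity list still matches capitalized tokens.
    System.out.println(patterns.get(0).matcher("Stanford").matches());
  }
}
```

Wrapping a StringReader in a BufferedReader, as in main, is a convenient way to supply rules that are built in memory rather than stored in a file.
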
Method Detail

classify

public java.util.List<CoreLabel> classify(java.util.List<CoreLabel> document)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap. The classifications are added in place to the items of the document, which is also returned by this method.

Specified by:
classify in class AbstractSequenceClassifier<CoreLabel>
Parameters:
document - A List of something that extends CoreMap.
Returns:
The same List, but with the elements annotated with their answers (stored under the CoreAnnotations.AnswerAnnotation key).
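
The in-place, token-wise labeling described for this class can be sketched without CoreNLP as follows. Everything here is a simplified assumption rather than the actual implementation: Token is a hypothetical stand-in for CoreLabel (with ner playing the role of NamedEntityTagAnnotation and answer the role of AnswerAnnotation), "O" is assumed to be the background symbol, each per-token regex is assumed to be applied with matches(), and a span is assumed to be labeled only when every token in it is individually unlabeled or overwritable.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch of token-wise labeling; not CoreNLP code.
class TokenwiseLabelSketch {

  // Stand-in for CoreLabel.
  static class Token {
    final String word;
    final String ner;  // existing NER tag; null means not yet annotated
    String answer;     // slot this sketch writes its label into
    Token(String word, String ner) {
      this.word = word;
      this.ner = ner;
    }
  }

  static final String BACKGROUND = "O"; // assumed background symbol

  // True if the regexes match the tokens starting at start, one regex per token.
  static boolean matchesAt(List<Pattern> regexes, List<Token> tokens, int start) {
    if (start + regexes.size() > tokens.size()) {
      return false;
    }
    for (int i = 0; i < regexes.size(); i++) {
      if (!regexes.get(i).matcher(tokens.get(start + i).word).matches()) {
        return false;
      }
    }
    return true;
  }

  // Eligibility is decided per token: background/unlabeled, or an overwritable type.
  static boolean eligible(Token t, Set<String> overwritable) {
    return t.ner == null || t.ner.equals(BACKGROUND) || overwritable.contains(t.ner);
  }

  // Tries the rule at every token position and labels matching spans in place.
  static List<Token> label(List<Token> tokens, List<Pattern> regexes,
                           String type, Set<String> overwritable) {
    for (int start = 0; start < tokens.size(); start++) {
      if (!matchesAt(regexes, tokens, start)) {
        continue;
      }
      List<Token> span = tokens.subList(start, start + regexes.size());
      boolean allEligible = true;
      for (Token t : span) {
        allEligible &= eligible(t, overwritable);
      }
      if (allEligible) {
        for (Token t : span) {
          t.answer = type;
        }
      }
    }
    return tokens; // the same List that was passed in
  }

  public static void main(String[] args) {
    List<Token> doc = Arrays.asList(
        new Token("Stanford", "ORGANIZATION"), new Token("University", "O"));
    List<Pattern> rule = Arrays.asList(
        Pattern.compile("Stanford"), Pattern.compile("University"));
    label(doc, rule, "SCHOOL", new HashSet<>(Arrays.asList("ORGANIZATION")));
    for (Token t : doc) {
      System.out.println(t.word + "\t" + t.answer);
    }
  }
}
```

The example in main covers the mixed case mentioned in the class description: one token carries an overwritable type and the other is background, and both receive the new label.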

classifyWithGlobalInformation

public java.util.List<CoreLabel> classifyWithGlobalInformation(java.util.List<CoreLabel> tokenSeq,
                                                               CoreMap doc,
                                                               CoreMap sent)
Description copied from class: AbstractSequenceClassifier
Classify a List of something that extends CoreMap using as additional information whatever is stored in the document and sentence. This is needed for SUTime (NumberSequenceClassifier), which requires the document date to resolve relative dates.

Specified by:
classifyWithGlobalInformation in class AbstractSequenceClassifier<CoreLabel>
Returns:
Classified version of the input tokenSequence

train

public void train(java.util.Collection<java.util.List<CoreLabel>> docs,
                  DocumentReaderAndWriter<CoreLabel> readerAndWriter)
Description copied from class: AbstractSequenceClassifier
Trains a classifier from a Collection of sequences. Note that the Collection can be (and usually is) an ObjectBank.

Specified by:
train in class AbstractSequenceClassifier<CoreLabel>
Parameters:
docs - An ObjectBank or a collection of sequences of IN
readerAndWriter - A DocumentReaderAndWriter to use when loading test files

printProbsDocument

public void printProbsDocument(java.util.List<CoreLabel> document)
Specified by:
printProbsDocument in class AbstractSequenceClassifier<CoreLabel>

serializeClassifier

public void serializeClassifier(java.lang.String serializePath)
Description copied from class: AbstractSequenceClassifier
Serialize a sequence classifier to a file on the given path.

Specified by:
serializeClassifier in class AbstractSequenceClassifier<CoreLabel>
Parameters:
serializePath - The path/filename to write the classifier to.

loadClassifier

public void loadClassifier(java.io.ObjectInputStream in,
                           java.util.Properties props)
                    throws java.io.IOException,
                           java.lang.ClassCastException,
                           java.lang.ClassNotFoundException
Description copied from class: AbstractSequenceClassifier
Load a classifier from the specified input stream. The classifier is reinitialized from the flags serialized in the classifier.

Specified by:
loadClassifier in class AbstractSequenceClassifier<CoreLabel>
Parameters:
in - The InputStream to load the serialized classifier from
props - This Properties object will be used to update the SeqClassifierFlags which are read from the serialized classifier
Throws:
java.io.IOException - If there are problems accessing the input stream
java.lang.ClassCastException - If there are problems interpreting the serialized data
java.lang.ClassNotFoundException - If there are problems interpreting the serialized data


Stanford NLP Group