edu.stanford.nlp.ie
Class NumberNormalizer

java.lang.Object
  extended by edu.stanford.nlp.ie.NumberNormalizer

public class NumberNormalizer
extends java.lang.Object

Provides functions for converting words to numbers. Unlike QuantifiableEntityNormalizer, which normalizes various types of quantifiable entities such as money and dates, NumberNormalizer normalizes only numeric expressions (e.g. one => 1, two hundred => 200.0).
This code is somewhat hacked together, so should be reworked.
There is a Perl library for parsing English numbers: http://blog.cordiner.net/2010/01/02/parsing-english-numbers-with-perl/

TODO: To be merged into QuantifiableEntityNormalizer. It can be used by QuantifiableEntityNormalizer to first convert numbers expressed as words into numeric quantities before figuring out how to do higher level combos (like one hundred dollars and five cents)
TODO: Known to not handle the following:
  - oh: two oh one
  - non-integers: one and a half, one point five, three fifth
  - funky numbers: pi
TODO: This class is very language dependent. It should really be AmericanEnglishNumberNormalizer.
TODO: Make things not static

Author:
Angel Chang

Field Summary
protected static java.util.regex.Pattern digitsPattern
           
 
Method Summary
static java.util.List<CoreMap> findAndAnnotateNumericExpressions(CoreMap annotation)
           
static java.util.List<CoreMap> findAndAnnotateNumericExpressionsWithRanges(CoreMap annotation)
           
static java.util.List<CoreMap> findAndMergeNumbers(CoreMap annotationRaw)
          Takes annotation and identifies numbers in the annotation. Returns a list of tokens (as CoreMaps) with numbers merged. As a by-product, also marks each individual token with the TokenBeginAnnotation and TokenEndAnnotation; this is mainly to make it easier for the rest of the code to figure out what the token offsets are.
static java.util.List<CoreMap> findNumberRanges(CoreMap annotation)
          Find and mark number ranges. Ranges are NUM1 [-|to] NUM2 where NUM2 > NUM1. Each number range is marked with CoreAnnotations.NumericTypeAnnotation.class (NUMBER_RANGE) and CoreAnnotations.NumericObjectAnnotation.class (a Pair representing the start/end of the range).
static java.util.List<CoreMap> findNumbers(CoreMap annotation)
          Find and mark numbers (does not need NumberSequenceClassifier). Each token is annotated with the numeric value and type: CoreAnnotations.NumericTypeAnnotation.class is ORDINAL, UNIT (hundred, thousand,..., dozen, gross,...) or NUMBER; CoreAnnotations.NumericValueAnnotation.class is a Number representing the numeric value of the token ( two thousand => 2 1000 ). Also tries to separate individual numbers like four five six, while keeping numbers like four hundred and seven together. Tokens belonging to each composite number are annotated with CoreAnnotations.NumericCompositeTypeAnnotation.class, which is ORDINAL (1st, 2nd) or NUMBER (one hundred), and CoreAnnotations.NumericCompositeValueAnnotation.class, a Number representing the composite numeric value ( two thousand => 2000 2000 ). Also returns a list of CoreMap representing the identified numbers. The function is overly aggressive in marking possible numbers; it should either do more checks or be used in conjunction with NumberSequenceClassifier to avoid marking certain tokens (like second/NN) as numbers.
static Env getNewEnv()
           
static void initEnv(Env env)
           
static void setVerbose(boolean verbose)
           
static java.lang.Number wordToNumber(java.lang.String str)
          Fairly generous utility function to convert a string representing a number (hopefully) to a Number. Assumes that something else has somehow determined that the string makes ONE suitable number.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

digitsPattern

protected static java.util.regex.Pattern digitsPattern
Method Detail

setVerbose

public static void setVerbose(boolean verbose)

wordToNumber

public static java.lang.Number wordToNumber(java.lang.String str)
Fairly generous utility function to convert a string representing a number (hopefully) to a Number. Assumes that something else has somehow determined that the string makes ONE suitable number.
The value of the number is determined by:
0. Breaking up the string into pieces using whitespace (stuff like "and", "-", "," is turned into whitespace)
1. Determining the numeric value of the pieces
2. Finding the numeric value of each piece
3. Combining the pieces together to form the overall value
   a. Find the largest component and its value (X)
   b. Let B = overall value of pieces to the left (recursive)
   c. Let C = overall value of pieces to the right (recursive)
   d. The overall value = B*X + C

Parameters:
str - the string to convert
Returns:
numeric value of string
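
The combination rule in step 3 can be illustrated with a small standalone sketch. This is not the NumberNormalizer source: the class name WordToNumberSketch, the tiny vocabulary table, and the long return type are all assumptions made for illustration (the real method returns a java.lang.Number). The sketch also assumes that a missing left side counts as B = 1 and a missing right side as C = 0, which the description above does not state explicitly.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the actual NumberNormalizer implementation.
class WordToNumberSketch {
    // Hypothetical lookup table covering a small English vocabulary.
    private static final Map<String, Long> VALUES = new HashMap<>();
    static {
        String[] small = {"zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen"};
        for (int i = 0; i < small.length; i++) VALUES.put(small[i], (long) i);
        String[] tens = {"twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
        for (int i = 0; i < tens.length; i++) VALUES.put(tens[i], 10L * (i + 2));
        VALUES.put("hundred", 100L);
        VALUES.put("thousand", 1_000L);
        VALUES.put("million", 1_000_000L);
    }

    static long wordToNumber(String str) {
        // Step 0: turn "-", "," and the word "and" into whitespace, then split.
        String cleaned = str.toLowerCase()
            .replaceAll("[-,]", " ")
            .replaceAll("\\band\\b", " ")
            .trim();
        if (cleaned.isEmpty()) throw new IllegalArgumentException("no number words found");
        String[] pieces = cleaned.split("\\s+");

        // Steps 1-2: look up the numeric value of each piece.
        long[] values = new long[pieces.length];
        for (int i = 0; i < pieces.length; i++) {
            Long v = VALUES.get(pieces[i]);
            if (v == null) throw new IllegalArgumentException("unknown number word: " + pieces[i]);
            values[i] = v;
        }
        // Step 3: combine the pieces recursively.
        return combine(values, 0, values.length);
    }

    // Combines values[from, to) as B*X + C around the largest component X.
    private static long combine(long[] values, int from, int to) {
        int maxIdx = from;
        for (int i = from + 1; i < to; i++) {
            if (values[i] > values[maxIdx]) maxIdx = i;
        }
        // Assumption: an empty left side multiplies by 1, an empty right side adds 0.
        long b = (maxIdx > from) ? combine(values, from, maxIdx) : 1L;
        long c = (maxIdx + 1 < to) ? combine(values, maxIdx + 1, to) : 0L;
        return b * values[maxIdx] + c;
    }

    public static void main(String[] args) {
        System.out.println(wordToNumber("four hundred and seven"));
        System.out.println(wordToNumber("two thousand three hundred forty-five"));
    }
}
```

For "four hundred and seven" the largest component is hundred (X = 100), with B = 4 on the left and C = 7 on the right, giving 4*100 + 7 = 407.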

getNewEnv

public static Env getNewEnv()

initEnv

public static void initEnv(Env env)

findNumbers

public static java.util.List<CoreMap> findNumbers(CoreMap annotation)
Find and mark numbers (does not need NumberSequenceClassifier).
Each token is annotated with the numeric value and type:
- CoreAnnotations.NumericTypeAnnotation.class: ORDINAL, UNIT (hundred, thousand,..., dozen, gross,...), NUMBER
- CoreAnnotations.NumericValueAnnotation.class: Number representing the numeric value of the token ( two thousand => 2 1000 )
Also tries to separate individual numbers like four five six, while keeping numbers like four hundred and seven together.
Tokens belonging to each composite number are annotated with:
- CoreAnnotations.NumericCompositeTypeAnnotation.class: ORDINAL (1st, 2nd), NUMBER (one hundred)
- CoreAnnotations.NumericCompositeValueAnnotation.class: Number representing the composite numeric value ( two thousand => 2000 2000 )
Also returns a list of CoreMap representing the identified numbers.
The function is overly aggressive in marking possible numbers; it should either do more checks or be used in conjunction with NumberSequenceClassifier to avoid marking certain tokens (like second/NN) as numbers.

Parameters:
annotation - the annotation to find numbers in
Returns:
list of CoreMap representing the identified numbers
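
To make the per-token typing concrete, here is a minimal standalone sketch of how tokens might be assigned a type (ORDINAL, UNIT, NUMBER) and a value. It does not use CoreMap or the CoreAnnotations classes; the class TokenNumberSketch, the Tagged holder, and the small word lists are hypothetical stand-ins, and the composite-number grouping is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; it does not use CoreMap or the CoreAnnotations classes.
class TokenNumberSketch {
    enum NumericType { ORDINAL, UNIT, NUMBER }

    // Hypothetical holder standing in for the per-token type and value annotations.
    static final class Tagged {
        final String word;
        final NumericType type;
        final long value;

        Tagged(String word, NumericType type, long value) {
            this.word = word;
            this.type = type;
            this.value = value;
        }
    }

    // Tiny illustrative vocabularies; a real table would be much larger.
    private static final Map<String, Long> NUMBERS = Map.of(
        "one", 1L, "two", 2L, "three", 3L, "four", 4L, "five", 5L, "six", 6L, "seven", 7L);
    private static final Map<String, Long> UNITS = Map.of(
        "hundred", 100L, "thousand", 1000L, "dozen", 12L, "gross", 144L);
    private static final Map<String, Long> ORDINALS = Map.of(
        "first", 1L, "second", 2L, "third", 3L);

    // Returns one Tagged entry per token that looks numeric; other tokens are skipped.
    static List<Tagged> tag(List<String> tokens) {
        List<Tagged> out = new ArrayList<>();
        for (String token : tokens) {
            String w = token.toLowerCase();
            if (UNITS.containsKey(w)) {
                out.add(new Tagged(token, NumericType.UNIT, UNITS.get(w)));
            } else if (ORDINALS.containsKey(w)) {
                out.add(new Tagged(token, NumericType.ORDINAL, ORDINALS.get(w)));
            } else if (NUMBERS.containsKey(w)) {
                out.add(new Tagged(token, NumericType.NUMBER, NUMBERS.get(w)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tagged t : tag(List.of("wait", "a", "second"))) {
            System.out.println(t.word + " " + t.type + " " + t.value);
        }
    }
}
```

With this kind of lookup, "two thousand" yields the per-token values 2 and 1000, and the noun "second" in "wait a second" is tagged as an ORDINAL, which is the sort of over-marking the caveat above warns about.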

findNumberRanges

public static java.util.List<CoreMap> findNumberRanges(CoreMap annotation)
Find and mark number ranges.
Ranges are NUM1 [-|to] NUM2 where NUM2 > NUM1.
Each number range is marked with:
- CoreAnnotations.NumericTypeAnnotation.class: NUMBER_RANGE
- CoreAnnotations.NumericObjectAnnotation.class: Pair representing the start/end of the range

Parameters:
annotation - annotation where numbers have already been identified
Returns:
list of CoreMap representing the identified number ranges
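
The range pattern NUM1 [-|to] NUM2 with NUM2 > NUM1 can be sketched on a plain token list as follows. This is an illustration under simplifying assumptions, not the library code: the class NumberRangeSketch is hypothetical, numbers are assumed to already appear as digit strings, and each range is returned as a two-element long array rather than a Pair annotation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the real method works on annotated CoreMaps.
class NumberRangeSketch {
    // Scans for NUM1 [-|to] NUM2 and keeps only matches where NUM2 > NUM1.
    static List<long[]> findRanges(List<String> tokens) {
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i + 2 < tokens.size(); i++) {
            String separator = tokens.get(i + 1);
            if (!separator.equals("-") && !separator.equalsIgnoreCase("to")) continue;
            Long start = parse(tokens.get(i));
            Long end = parse(tokens.get(i + 2));
            if (start != null && end != null && end > start) {
                ranges.add(new long[] {start, end});
                i += 2; // skip past the tokens consumed by this range
            }
        }
        return ranges;
    }

    // Returns the value of a plain digit token, or null if the token is not one.
    private static Long parse(String token) {
        if (!token.matches("\\d{1,18}")) return null;
        return Long.parseLong(token);
    }

    public static void main(String[] args) {
        for (long[] r : findRanges(List.of("pages", "10", "to", "20"))) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Under this sketch "10 to 20" is reported as a range, while "20 - 10" and "5 - 5" are not, because the second number must be strictly greater than the first.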

findAndMergeNumbers

public static java.util.List<CoreMap> findAndMergeNumbers(CoreMap annotationRaw)
Takes annotation and identifies numbers in the annotation. Returns a list of tokens (as CoreMaps) with numbers merged. As a by-product, also marks each individual token with the TokenBeginAnnotation and TokenEndAnnotation; this is mainly to make it easier for the rest of the code to figure out what the token offsets are. Note that this copies the annotation, since it modifies token offsets in the original.

Parameters:
annotationRaw - The annotation to find numbers in
Returns:
list of CoreMap representing the identified numbers

findAndAnnotateNumericExpressions

public static java.util.List<CoreMap> findAndAnnotateNumericExpressions(CoreMap annotation)

findAndAnnotateNumericExpressionsWithRanges

public static java.util.List<CoreMap> findAndAnnotateNumericExpressionsWithRanges(CoreMap annotation)


Stanford NLP Group