public class GetPatternsFromDataMultiClass<E extends Pattern> extends Object implements Serializable
This class is multi-threaded (the nthread parameter sets the number of
threads). It takes as input the flags described below.
To use the default options, run
java -mx1000m edu.stanford.nlp.patterns.GetPatternsFromDataMultiClass -file text_file -seedWordsFiles label1,seedwordlist1;label2,seedwordlist2;... -outDir output_directory (optional)
fileFormat: (Optional) Default is text. Valid values are text
(or txt) and ser, where the serialized file is of the type Map<String,
List<CoreLabel>>.
file: (Required) Input file(s) (default assumed text). Can be
one or more of (concatenated by comma or semi-colon): file, directory, files
with regex in the filename (for example: "mydir/health-.*-processed.txt")
seedWordsFiles: (Required)
label1,file_seed_words1;label2,file_seed_words2;... where file_seed_words are
files with list of seed words, one in each line
outDir: (Optional) output directory where visualization/output
files are stored
For other flags, see individual comments for each flag.
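The seedWordsFiles value packs several label/file pairs into one string. As a rough illustration of that format only (this is not the library's own parser), the sketch below splits such a value into a label-to-path map. The class name SeedWordsSpec, its parse method, and the example labels and paths are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the library: splits "label1,file1;label2,file2" into label -> file. */
final class SeedWordsSpec {

  static Map<String, String> parse(String spec) {
    Map<String, String> labelToFile = new LinkedHashMap<>();
    // Pairs are separated by ';', and each pair is "label,file".
    for (String raw : spec.split(";")) {
      String entry = raw.trim();
      if (entry.isEmpty()) {
        continue; // tolerate a trailing ';'
      }
      int comma = entry.indexOf(',');
      if (comma <= 0 || comma == entry.length() - 1) {
        throw new IllegalArgumentException("Expected label,file but got: " + entry);
      }
      String label = entry.substring(0, comma).trim();
      String file = entry.substring(comma + 1).trim();
      labelToFile.put(label, file);
    }
    return labelToFile;
  }

  public static void main(String[] args) {
    // Example labels and paths are made up for illustration.
    System.out.println(parse("DISEASE,seeds/disease.txt;DRUG,seeds/drug.txt"));
  }
}
```

When the value is passed on a command line, note that `;` is a command separator in POSIX shells, so the whole seedWordsFiles argument generally needs to be quoted.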
To use a properties file, see
projects/core/data/edu/stanford/nlp/patterns/surface/example.properties or patterns/example.properties (depends on which codebase you are using)
as an example for the flags and their brief descriptions. Run the code as:
java -mx1000m -cp classpath edu.stanford.nlp.patterns.GetPatternsFromDataMultiClass -props dir-as-above/example.properties
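The same options can presumably be assembled in code as a java.util.Properties object. The sketch below is illustrative only: it assumes the property keys are the same as the command-line flag names listed above (file, seedWordsFiles, fileFormat, outDir). PatternPropsSketch and build are hypothetical names, and the argument values are the placeholders from the command line above.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

/** Hypothetical helper, not part of the library: collects the basic flags into a Properties object. */
final class PatternPropsSketch {

  static Properties build(String file, String seedWordsFiles, String outDir) {
    Properties props = new Properties();
    props.setProperty("file", file);                     // required
    props.setProperty("seedWordsFiles", seedWordsFiles); // required
    props.setProperty("fileFormat", "text");             // optional; text is the documented default
    if (outDir != null) {
      props.setProperty("outDir", outDir);               // optional
    }
    return props;
  }

  public static void main(String[] args) throws IOException {
    Properties props =
        build("text_file", "label1,seedwordlist1;label2,seedwordlist2", "output_directory");
    // Print the properties in .properties syntax.
    StringWriter out = new StringWriter();
    props.store(out, "sketch of a properties file");
    System.out.println(out);
    // Assumption: with the library on the classpath, this object could be passed to
    // GetPatternsFromDataMultiClass.run(props), whose signature is listed on this page.
  }
}
```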
IMPORTANT: Many flags are described in the classes
ConstantsAndVariables, CreatePatterns, and
PhraseScorer.
| Modifier and Type | Class and Description |
|---|---|
| static class | GetPatternsFromDataMultiClass.LabelWithSeedWords - Warning: sets labels of words that are not in the given seed set as O! |
| static class | GetPatternsFromDataMultiClass.PatternScoring - RlogF is from Riloff 1996, when R's denominator is (pos+neg+unlabeled) |
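
The RlogF measure mentioned for PatternScoring can be sketched as below. This is a minimal illustration, not the library's implementation: it assumes the score is R * log2(pos) with R = pos / (pos + neg + unlabeled), as the description suggests. The log base, the handling of zero counts, and the names RlogFSketch and rlogf are assumptions.

```java
/** Hypothetical sketch of an RlogF-style pattern score; not the library's code. */
final class RlogFSketch {

  /**
   * @param pos       number of positive (correctly labeled) matches of a pattern
   * @param neg       number of negative matches
   * @param unlabeled number of unlabeled matches
   */
  static double rlogf(double pos, double neg, double unlabeled) {
    double total = pos + neg + unlabeled;
    if (pos <= 0 || total <= 0) {
      return 0.0; // assumption: a pattern with no positive matches scores zero
    }
    double r = pos / total;               // R, with denominator (pos+neg+unlabeled)
    double log2Pos = Math.log(pos) / Math.log(2);
    return r * log2Pos;
  }

  public static void main(String[] args) {
    // A precise pattern with few matches versus a noisier pattern with many matches.
    System.out.println(rlogf(4, 0, 0));
    System.out.println(rlogf(16, 8, 8));
  }
}
```

Under these assumptions, a pattern that matches exactly one positive phrase scores zero because log2(1) = 0, so the measure favors patterns that are both precise and frequent.
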
| Constructor and Description |
|---|
| GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Map<String,Set<CandidatePhrase>> seedSets, boolean labelUsingSeedSets) |
| GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Map<String,Set<CandidatePhrase>> seedSets, boolean labelUsingSeedSets, Map<String,Class<? extends TypesafeMap.Key<String>>> answerClass) |
| GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Map<String,Set<CandidatePhrase>> seedSets, boolean labelUsingSeedSets, Map<String,Class<? extends TypesafeMap.Key<String>>> answerClass, Map<String,Class> generalizeClasses, Map<String,Map<Class,Object>> ignoreClasses) - generalizeClasses maps label strings to a map of generalized strings and the corresponding class. The ignoreClasses have to be boolean. |
| GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Set<CandidatePhrase> seedSet, boolean labelUsingSeedSets, Class answerClass, String answerLabel) |
| GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Set<CandidatePhrase> seedSet, boolean labelUsingSeedSets, Class answerClass, String answerLabel, Map<String,Class> generalizeClasses, Map<Class,Object> ignoreClasses) |
| GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Set<CandidatePhrase> seedSet, boolean labelUsingSeedSets, String answerLabel) |
| GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Set<CandidatePhrase> seedSet, boolean labelUsingSeedSets, String answerLabel, Map<String,Class> generalizeClasses, Map<Class,Object> ignoreClasses) |
| Modifier and Type | Method and Description |
|---|---|
| static void | countResults(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare, boolean evalPerEntity) |
| static boolean | countResultsPerEntity(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare) - Copied from CRFClassifier: count the successes and failures of the model on the given document. |
| static void | countResultsPerToken(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare) - Count the successes and failures of the model on the given document, token-based. |
| static String | elapsedTime(Date d1, Date d2) |
| void | evaluate(Map<String,DataInstance> testSentences, boolean evalPerEntity) |
| static <D> Counter<D> | FScore(Counter<D> precision, Counter<D> recall, double beta) |
| double | FScore(double precision, double recall, double beta) |
| static List<File> | getAllFiles(String file) |
| Map<String,String> | getAllOptions() |
| static void | getFeatures(SemanticGraph graph, IndexedWord vertex, boolean isHead, Collection<String> features, GrammaticalRelation reln) |
| Map<String,Counter<E>> | getLearnedPatterns() |
| Counter<E> | getLearnedPatterns(String label) |
| Set<String> | getNonBackgroundLabels(CoreLabel l) |
| PatternsForEachToken | getPatsForEachToken() |
| Counter<E> | getPatterns(String label, Set<E> alreadyIdentifiedPatterns, E p0, Counter<CandidatePhrase> p0Set, Set<E> ignorePatterns) |
| static Class | getPatternScoringClass(GetPatternsFromDataMultiClass.PatternScoring patternScoring) |
| static List<Integer> | getSubListIndex(String[] l1, String[] l2, String[] subl2, Set<String> doNotLabelTheseWords, HashSet<String> seenFuzzyMatches, int minLen4Fuzzy, boolean fuzzyMatch) - If l1 is a part of l2, finds the starting index of l1 in l2. If l1 is not a sub-array of l2, returns -1. Note that l2 should have the exact elements and order as in l1. |
| static <E> List<List<E>> | getThreadBatches(List<E> keyset, int numThreads) |
| void | iterateExtractApply() |
| void | iterateExtractApply(Map<String,E> p0, Map<String,Counter<CandidatePhrase>> p0Set, String wordsOutputFile, String sentsOutFile, String patternsOutFile, Map<String,Set<E>> ignorePatterns) |
| Pair<Counter<E>,Counter<CandidatePhrase>> | iterateExtractApply4Label(String label, E p0, Counter<CandidatePhrase> p0Set, BufferedWriter wordsOutput, String sentsOutFile, BufferedWriter patternsOut, Set<E> ignorePatterns, int numIter, Set<CandidatePhrase> ignoreWords, CollectionValuedMap<E,Triple<String,Integer,Integer>> matchedTokensByPat, TwoDimensionalCounter<String,E> terms) |
| void | labelWords(String label, Map<String,DataInstance> sents, Collection<CandidatePhrase> identifiedWords) |
| void | labelWords(String label, Map<String,DataInstance> sents, Collection<CandidatePhrase> identifiedWords, String outFile, CollectionValuedMap<E,Triple<String,Integer,Integer>> matchedTokensByPat) |
| static void | main(String[] args) |
| static String | matchedTokensByPhraseJsonString() |
| static String | matchedTokensByPhraseJsonString(String phrase) |
| static <E> Counter<E> | normalizeSoftMaxMinMaxScores(Counter<E> scores, boolean minMaxNorm, boolean softmax, boolean oneMinusSoftMax) |
| void | processSents(Map<String,DataInstance> sents, Boolean deleteExistingIndex) |
| static Pair | processSents(Properties props, Set<String> labels) |
| static Map<String,Set<CandidatePhrase>> | readSeedWords(Properties props) |
| static Map<String,Set<CandidatePhrase>> | readSeedWords(String seedWordsFiles) |
| static Map<String,Set<CandidatePhrase>> | readSeedWordsFromJSONString(String str) |
| void | removeOverLappingLabels(Map<String,DataInstance> sents) - If a token is labeled for two or more labels, then keep the one that has the longest matching phrase. |
| static <E extends Pattern> GetPatternsFromDataMultiClass<E> | run(Properties props) - Execute the system given a properties file or object. |
| static void | runLabelSeedWords(Map<String,DataInstance> sents, Class answerclass, String label, Collection<CandidatePhrase> seedWords, ConstantsAndVariables constVars, boolean overwriteExistingLabels) - Warning: sets labels of words that are not in the given seed set as O! |
| static Map<String,DataInstance> | runPOSNEROnTokens(List<CoreMap> sentsCM, String posModelPath, boolean useTargetNERRestriction, String prefix, boolean useTargetParserParentRestriction, String numThreads, PatternFactory.PatternType type) |
| void | setLearnedPatterns(Counter<E> patterns, String label) |
| static <E> List<List<E>> | splitIntoNumThreadsWithSampling(List<E> c, int n, int numThreads) |
| static int | tokenize(Iterator<String> textReader, String posModelPath, boolean lowercase, boolean useTargetNERRestriction, String sentIDPrefix, boolean useTargetParserParentRestriction, String numThreads, boolean batchProcessSents, int numMaxSentencesPerBatchFile, File saveSentencesSerDirFile, Map<String,DataInstance> sents, int numFilesTillNow, PatternFactory.PatternType type) |
| static void | writeColumnOutput(String outFile, boolean batchProcessSents, Map<String,Class<? extends TypesafeMap.Key<String>>> answerclasses) |
| void | writeLabeledData(String outFile) |
public Map<String,TwoDimensionalCounter<String,E>> wordsPatExtracted
public ScorePhrases scorePhrases
public ConstantsAndVariables constVars
public CreatePatterns createPats
public Map<String,TwoDimensionalCounter<E,CandidatePhrase>> patternsandWords
public TwoDimensionalCounter<String,ConstantsAndVariables.ScorePhraseMeasures> phInPatScoresCache
public GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Set<CandidatePhrase> seedSet, boolean labelUsingSeedSets, String answerLabel) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Set<CandidatePhrase> seedSet, boolean labelUsingSeedSets, Class answerClass, String answerLabel) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Set<CandidatePhrase> seedSet, boolean labelUsingSeedSets, String answerLabel, Map<String,Class> generalizeClasses, Map<Class,Object> ignoreClasses) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Set<CandidatePhrase> seedSet, boolean labelUsingSeedSets, Class answerClass, String answerLabel, Map<String,Class> generalizeClasses, Map<Class,Object> ignoreClasses) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Map<String,Set<CandidatePhrase>> seedSets, boolean labelUsingSeedSets) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, ClassNotFoundException, InterruptedException, ExecutionException
public GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Map<String,Set<CandidatePhrase>> seedSets, boolean labelUsingSeedSets, Map<String,Class<? extends TypesafeMap.Key<String>>> answerClass) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public GetPatternsFromDataMultiClass(Properties props, Map<String,DataInstance> sents, Map<String,Set<CandidatePhrase>> seedSets, boolean labelUsingSeedSets, Map<String,Class<? extends TypesafeMap.Key<String>>> answerClass, Map<String,Class> generalizeClasses, Map<String,Map<Class,Object>> ignoreClasses) throws IOException, InstantiationException, IllegalAccessException, IllegalArgumentException, InvocationTargetException, NoSuchMethodException, SecurityException, InterruptedException, ExecutionException, ClassNotFoundException
public PatternsForEachToken getPatsForEachToken()
public void removeOverLappingLabels(Map<String,DataInstance> sents)
If a token is labeled for two or more labels, then keep the one that has the longest matching phrase. Assumes PatternsAnnotations.Ln is set, which is already done in the runLabelSeedWords function.
public static Map<String,DataInstance> runPOSNEROnTokens(List<CoreMap> sentsCM, String posModelPath, boolean useTargetNERRestriction, String prefix, boolean useTargetParserParentRestriction, String numThreads, PatternFactory.PatternType type)
public static int tokenize(Iterator<String> textReader, String posModelPath, boolean lowercase, boolean useTargetNERRestriction, String sentIDPrefix, boolean useTargetParserParentRestriction, String numThreads, boolean batchProcessSents, int numMaxSentencesPerBatchFile, File saveSentencesSerDirFile, Map<String,DataInstance> sents, int numFilesTillNow, PatternFactory.PatternType type) throws InterruptedException, ExecutionException, IOException
public static List<Integer> getSubListIndex(String[] l1, String[] l2, String[] subl2, Set<String> doNotLabelTheseWords, HashSet<String> seenFuzzyMatches, int minLen4Fuzzy, boolean fuzzyMatch)
l1 - array you want to find in l2
l2 -
public static void runLabelSeedWords(Map<String,DataInstance> sents, Class answerclass, String label, Collection<CandidatePhrase> seedWords, ConstantsAndVariables constVars, boolean overwriteExistingLabels) throws InterruptedException, ExecutionException, IOException
public static void getFeatures(SemanticGraph graph, IndexedWord vertex, boolean isHead, Collection<String> features, GrammaticalRelation reln)
public void processSents(Map<String,DataInstance> sents, Boolean deleteExistingIndex) throws IOException, ClassNotFoundException
public Counter<E> getPatterns(String label, Set<E> alreadyIdentifiedPatterns, E p0, Counter<CandidatePhrase> p0Set, Set<E> ignorePatterns) throws IOException, ClassNotFoundException
public static Class getPatternScoringClass(GetPatternsFromDataMultiClass.PatternScoring patternScoring)
public static <E> List<List<E>> splitIntoNumThreadsWithSampling(List<E> c, int n, int numThreads)
public static <E> Counter<E> normalizeSoftMaxMinMaxScores(Counter<E> scores, boolean minMaxNorm, boolean softmax, boolean oneMinusSoftMax)
public void labelWords(String label, Map<String,DataInstance> sents, Collection<CandidatePhrase> identifiedWords) throws IOException
public void labelWords(String label, Map<String,DataInstance> sents, Collection<CandidatePhrase> identifiedWords, String outFile, CollectionValuedMap<E,Triple<String,Integer,Integer>> matchedTokensByPat) throws IOException
public void iterateExtractApply() throws IOException, ClassNotFoundException
public void iterateExtractApply(Map<String,E> p0, Map<String,Counter<CandidatePhrase>> p0Set, String wordsOutputFile, String sentsOutFile, String patternsOutFile, Map<String,Set<E>> ignorePatterns) throws IOException, ClassNotFoundException
p0 - Null in most cases. Only used for BPB
p0Set - Null in most cases
wordsOutputFile - If null, output is in the output directory
sentsOutFile -
patternsOutFile -
ignorePatterns -
public static String matchedTokensByPhraseJsonString()
public Pair<Counter<E>,Counter<CandidatePhrase>> iterateExtractApply4Label(String label, E p0, Counter<CandidatePhrase> p0Set, BufferedWriter wordsOutput, String sentsOutFile, BufferedWriter patternsOut, Set<E> ignorePatterns, int numIter, Set<CandidatePhrase> ignoreWords, CollectionValuedMap<E,Triple<String,Integer,Integer>> matchedTokensByPat, TwoDimensionalCounter<String,E> terms) throws IOException, ClassNotFoundException
public static boolean countResultsPerEntity(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare)
public static void countResultsPerToken(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare)
public static void countResults(List<CoreLabel> doc, Counter<String> entityTP, Counter<String> entityFP, Counter<String> entityFN, String background, Counter<String> wordTP, Counter<String> wordTN, Counter<String> wordFP, Counter<String> wordFN, Class<? extends TypesafeMap.Key<String>> whichClassToCompare, boolean evalPerEntity)
public void writeLabeledData(String outFile) throws IOException, ClassNotFoundException
public static void writeColumnOutput(String outFile, boolean batchProcessSents, Map<String,Class<? extends TypesafeMap.Key<String>>> answerclasses) throws IOException, ClassNotFoundException
public void evaluate(Map<String,DataInstance> testSentences, boolean evalPerEntity) throws IOException
public double FScore(double precision, double recall, double beta)
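The page gives no description for FScore. The sketch below shows the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), which the method presumably computes; that correspondence is an assumption, as are the zero-denominator handling and the names FScoreSketch and fScore.

```java
/** Hypothetical sketch of the standard F-beta score; not the library's code. */
final class FScoreSketch {

  static double fScore(double precision, double recall, double beta) {
    double betaSq = beta * beta;
    double denominator = betaSq * precision + recall;
    if (denominator == 0.0) {
      return 0.0; // assumption: define the score as 0 when precision and recall are both 0
    }
    return (1 + betaSq) * precision * recall / denominator;
  }

  public static void main(String[] args) {
    // beta = 1 gives the harmonic mean of precision and recall (F1).
    System.out.println(fScore(0.5, 0.5, 1.0));
    // beta > 1 weights recall more heavily than precision.
    System.out.println(fScore(0.5, 1.0, 2.0));
  }
}
```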
public static Map<String,Set<CandidatePhrase>> readSeedWordsFromJSONString(String str)
public static Map<String,Set<CandidatePhrase>> readSeedWords(Properties props)
public static Map<String,Set<CandidatePhrase>> readSeedWords(String seedWordsFiles)
public static Pair processSents(Properties props, Set<String> labels) throws IOException, ExecutionException, InterruptedException, ClassNotFoundException
public static <E extends Pattern> GetPatternsFromDataMultiClass<E> run(Properties props) throws IOException, ClassNotFoundException, IllegalAccessException, InterruptedException, ExecutionException, InstantiationException, NoSuchMethodException, InvocationTargetException, SQLException
public static void main(String[] args)