public class ArabicUnknownWordModel extends BaseUnknownWordModel
See getSignature(String, int).
Implementation note: the contents of this class tend to overlap somewhat
with EnglishUnknownWordModel and were originally included in BaseLexicon.

| Modifier and Type | Field and Description |
|---|---|
| protected boolean | smartMutation |
| protected int | unknownPrefixSize |
| protected int | unknownSuffixSize |

Fields inherited from class BaseUnknownWordModel: NULL_ITW, nullTag, nullWord, tagHash, tagIndex, trainOptions, unknown, unknownLevel, unSeenCounter, useFirst, useGT, VERBOSE, wordIndex

| Constructor and Description |
|---|
| ArabicUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex) - This constructor creates a UWM with empty data structures. |
| ArabicUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter) |

| Modifier and Type | Method and Description |
|---|---|
| java.lang.String | getSignature(java.lang.String word, int loc) - 6-9 were added for Arabic. |
| int | getSignatureIndex(int index, int sentencePosition, java.lang.String word) - Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features. |
| int | getUnknownLevel() - Get the level of equivalence classing for the model. |
| float | score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, java.lang.String word) - Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them. |
| void | setUnknownLevel(int unknownLevel) - One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class. |
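
The summary above describes a signature as the String representation of unknown word features, and getSignatureIndex as returning that signature's index. The following is a minimal, self-contained sketch of that general idea only. It does not reproduce the Arabic features this class computes: the class name UnknownWordSignatureSketch, the "UNK-P:/-S:" format, and the list-backed index are all hypothetical, and the prefix and suffix sizes merely echo the unknownPrefixSize and unknownSuffixSize fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the documented library.
final class UnknownWordSignatureSketch {

    // Builds a signature string from a word's leading and trailing characters.
    // The "UNK-P:...-S:..." layout is an assumption made for this example.
    static String signature(String word, int prefixSize, int suffixSize) {
        StringBuilder sb = new StringBuilder("UNK");
        // Count code points so multi-unit characters are not split.
        int n = word.codePointCount(0, word.length());
        if (prefixSize > 0 && n > prefixSize) {
            int end = word.offsetByCodePoints(0, prefixSize);
            sb.append("-P:").append(word, 0, end);
        }
        if (suffixSize > 0 && n > suffixSize) {
            int start = word.offsetByCodePoints(0, n - suffixSize);
            sb.append("-S:").append(word, start, word.length());
        }
        return sb.toString();
    }

    // Maps a signature string to a stable integer, adding it on first sight.
    // A plain list stands in for the library's Index type here.
    static int signatureIndex(List<String> index, String sig) {
        int i = index.indexOf(sig);
        if (i < 0) {
            index.add(sig);
            i = index.size() - 1;
        }
        return i;
    }

    public static void main(String[] args) {
        List<String> index = new ArrayList<>();
        String sig = signature("wAlktAb", 2, 2);
        System.out.println(sig + " -> " + signatureIndex(index, sig));
    }
}
```

Words that share a signature collapse into one equivalence class, which is what lets a model pool counts across unseen words.
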

Methods inherited from class BaseUnknownWordModel: addTagging, getLexicon, score, scoreGT, scoreProbTagGivenWordSignature, unSeenCounter

protected boolean smartMutation

protected int unknownSuffixSize

protected int unknownPrefixSize

public ArabicUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter)

public ArabicUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)

public float score(IntTaggedWord iTW, int loc, double c_Tseen, double total, double smooth, java.lang.String word)
Description copied from class: BaseUnknownWordModel
Specified by: score in interface UnknownWordModel
Overrides: score in class BaseUnknownWordModel
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
c_Tseen - Total count of this tag (on seen words) in training
total - Total count of word tokens in training
smooth - Weighting on prior P(T|U) in estimate
word - The word itself; useful so we don't look it up in the index

public int getSignatureIndex(int index, int sentencePosition, java.lang.String word)
Specified by: getSignatureIndex in interface UnknownWordModel
Overrides: getSignatureIndex in class BaseUnknownWordModel

public java.lang.String getSignature(java.lang.String word, int loc)
Specified by: getSignature in interface UnknownWordModel
Overrides: getSignature in class BaseUnknownWordModel
Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)

public void setUnknownLevel(int unknownLevel)
Description copied from interface: UnknownWordModel
Specified by: setUnknownLevel in interface UnknownWordModel
Overrides: setUnknownLevel in class BaseUnknownWordModel
Parameters:
unknownLevel - Provides a choice between different unknown word processing schemes

public int getUnknownLevel()
Description copied from interface: UnknownWordModel
Specified by: getUnknownLevel in interface UnknownWordModel
Overrides: getUnknownLevel in class BaseUnknownWordModel
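
The score parameters (c_Tseen, total, and smooth as a weighting on the prior P(T|U)) suggest a smoothed relative-frequency estimate. The sketch below shows one way such an estimate could be formed; it is not taken from this class or its superclass. The class name SmoothedUnknownScoreSketch, the signature counts cTagSig and cSig, the prior passed in as an argument, and the Bayes inversion to a log P(W|T) value are all assumptions made for illustration.

```java
// Hypothetical illustration; not the library's scoring code.
final class SmoothedUnknownScoreSketch {

    // Smoothed P(T|S): the observed tag-with-signature count is blended with
    // the prior P(T|U), and `smooth` controls how much the prior weighs.
    static double smoothedTagGivenSignature(double cTagSig, double cSig,
                                            double priorTagGivenUnknown,
                                            double smooth) {
        return (cTagSig + smooth * priorTagGivenUnknown) / (cSig + smooth);
    }

    // Assumed Bayes inversion: log( P(T|S) * P(S) / P(T) ), where
    // P(S) = cSig / total and P(T) = cTseen / total.
    static float logWordGivenTag(double pTagGivenSig, double cSig,
                                 double cTseen, double total) {
        double pSig = cSig / total;
        double pTag = cTseen / total;
        return (float) Math.log(pTagGivenSig * pSig / pTag);
    }

    public static void main(String[] args) {
        // Made-up counts, chosen only to exercise the arithmetic.
        double pTagGivenSig = smoothedTagGivenSignature(3.0, 4.0, 0.5, 1.0);
        float logScore = logWordGivenTag(pTagGivenSig, 4.0, 100.0, 1000.0);
        System.out.println("P(T|S) = " + pTagGivenSig);
        System.out.println("log score = " + logScore);
    }
}
```

With smooth set to 0 the first method reduces to the plain ratio cTagSig / cSig; larger values pull the estimate toward the prior, which matters most for signatures seen only a few times.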