edu.stanford.nlp.international.arabic.process
Class ArabicSegmenter

java.lang.Object
  extended by edu.stanford.nlp.international.arabic.process.ArabicSegmenter
All Implemented Interfaces:
WordSegmenter, ThreadsafeProcessor<java.lang.String,java.lang.String>, java.io.Serializable

public class ArabicSegmenter
extends java.lang.Object
implements WordSegmenter, java.io.Serializable, ThreadsafeProcessor<java.lang.String,java.lang.String>

Arabic word segmentation model based on conditional random fields (CRF). This is a re-implementation, with extensions, of the model described by Green and DeNero (2012).

This package includes a JFlex-based orthographic normalizer that runs on the input before the CRF-based segmentation model processes it. The normalization options are configurable, but the same options must be used for both training and test data.
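The consistency requirement can be illustrated with a minimal sketch that does not use this package. It assumes three common Arabic normalization steps (diacritic removal, tatweel removal, alef unification) and drives them from a single java.util.Properties object. The class name, the option names, and the rules themselves are hypothetical stand-ins, not the JFlex rules or the property keys of this package.

```java
import java.util.Properties;

/** Illustrative only: a toy normalizer driven by one shared Properties object. */
class NormalizationSketch {
    // Hypothetical option names invented for this sketch.
    static final String REMOVE_DIACRITICS = "removeDiacritics";
    static final String REMOVE_TATWEEL = "removeTatweel";
    static final String NORMALIZE_ALEF = "normalizeAlef";

    /** Applies the enabled normalization steps to the text, in a fixed order. */
    static String normalize(String text, Properties opts) {
        String out = text;
        if (Boolean.parseBoolean(opts.getProperty(REMOVE_DIACRITICS, "false"))) {
            // Strip the short-vowel and related marks in the range U+064B..U+0652.
            out = out.replaceAll("[\u064B-\u0652]", "");
        }
        if (Boolean.parseBoolean(opts.getProperty(REMOVE_TATWEEL, "false"))) {
            // Strip the elongation character (tatweel, U+0640).
            out = out.replace("\u0640", "");
        }
        if (Boolean.parseBoolean(opts.getProperty(NORMALIZE_ALEF, "false"))) {
            // Map alef with madda or hamza (U+0622, U+0623, U+0625) to bare alef (U+0627).
            out = out.replaceAll("[\u0622\u0623\u0625]", "\u0627");
        }
        return out;
    }

    public static void main(String[] args) {
        Properties opts = new Properties();
        opts.setProperty(REMOVE_DIACRITICS, "true");
        opts.setProperty(REMOVE_TATWEEL, "true");
        opts.setProperty(NORMALIZE_ALEF, "true");
        // The same opts object would be passed at training time and at test time.
        String trainLine = normalize("\u0623\u064E\u0643\u0640\u0644", opts);
        String testLine = normalize("\u0627\u0643\u0644", opts);
        System.out.println(trainLine.equals(testLine));
    }
}
```

Passing one Properties object to both phases is the point of the sketch: if the training text were normalized with one set of options and the test text with another, the model would see character sequences at test time that it never saw during training.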

Author:
Spence Green
See Also:
Serialized Form

Constructor Summary
ArabicSegmenter(ArabicSegmenter other)
          Copy constructor.
ArabicSegmenter(java.util.Properties props)
           
 
Method Summary
 void finishTraining()
           
 void initializeTraining(double numTrees)
           
 void loadSegmenter(java.lang.String filename)
           
 void loadSegmenter(java.lang.String filename, java.util.Properties p)
           
static void main(java.lang.String[] args)
           
 ThreadsafeProcessor<java.lang.String,java.lang.String> newInstance()
          Return a new threadsafe instance.
 java.lang.String process(java.lang.String nextInput)
          Set the input item that will be processed when a thread is allocated to this processor.
 long segment(java.io.BufferedReader br, java.io.PrintWriter pwOut)
          Segment all strings from an input.
 java.util.List<HasWord> segment(java.lang.String line)
           
 java.lang.String segmentString(java.lang.String line)
           
 void serializeSegmenter(java.lang.String filename)
           
 void train()
          Train a segmenter from raw text.
 void train(java.util.Collection<Tree> trees)
           
 void train(java.util.List<TaggedWord> sentence)
           
 void train(Tree tree)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ArabicSegmenter

public ArabicSegmenter(java.util.Properties props)

ArabicSegmenter

public ArabicSegmenter(ArabicSegmenter other)
Copy constructor.

Parameters:
other - the segmenter to copy
Method Detail

initializeTraining

public void initializeTraining(double numTrees)
Specified by:
initializeTraining in interface WordSegmenter

train

public void train(java.util.Collection<Tree> trees)
Specified by:
train in interface WordSegmenter

train

public void train(Tree tree)
Specified by:
train in interface WordSegmenter

train

public void train(java.util.List<TaggedWord> sentence)
Specified by:
train in interface WordSegmenter

finishTraining

public void finishTraining()
Specified by:
finishTraining in interface WordSegmenter

process

public java.lang.String process(java.lang.String nextInput)
Description copied from interface: ThreadsafeProcessor
Set the input item that will be processed when a thread is allocated to this processor.

Specified by:
process in interface ThreadsafeProcessor<java.lang.String,java.lang.String>

newInstance

public ThreadsafeProcessor<java.lang.String,java.lang.String> newInstance()
Description copied from interface: ThreadsafeProcessor
Return a new threadsafe instance.

Specified by:
newInstance in interface ThreadsafeProcessor<java.lang.String,java.lang.String>
Returns:
a new threadsafe instance
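
The process/newInstance pair suggests a one-instance-per-thread usage pattern. The following is a minimal sketch of that pattern under stated assumptions: Processor is a local stand-in that mirrors only the two method signatures documented above, PlusSplitter is a toy processor, and runAll is a hypothetical helper. None of these names belong to this package, and the sketch does not show how the library itself schedules work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadInstanceSketch {
    /** Local stand-in with the two documented methods; not the real interface. */
    interface Processor<I, O> {
        O process(I input);
        Processor<I, O> newInstance();
    }

    /** Toy processor: treats '+' as a segment boundary and replaces it with a space. */
    static class PlusSplitter implements Processor<String, String> {
        public String process(String input) {
            return input.replace('+', ' ');
        }
        public Processor<String, String> newInstance() {
            return new PlusSplitter();
        }
    }

    /** Processes every line on a fixed pool; each worker thread gets its own instance. */
    static List<String> runAll(Processor<String, String> prototype, List<String> lines, int nThreads) {
        // Each pool thread lazily obtains a private copy from the prototype.
        ThreadLocal<Processor<String, String>> perThread = ThreadLocal.withInitial(prototype::newInstance);
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String line : lines) {
                futures.add(pool.submit(() -> perThread.get().process(line)));
            }
            // Collect results in submission order so output order matches input order.
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(new PlusSplitter(), Arrays.asList("a+b", "c+d+e"), 2));
    }
}
```

Collecting the futures in submission order keeps the output aligned with the input regardless of which thread finishes first.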

segment

public java.util.List<HasWord> segment(java.lang.String line)
Specified by:
segment in interface WordSegmenter

segmentString

public java.lang.String segmentString(java.lang.String line)

segment

public long segment(java.io.BufferedReader br,
                    java.io.PrintWriter pwOut)
Segment all strings from an input.

Parameters:
br - input stream to segment
pwOut - output stream to write the segmented text
Returns:
number of input characters segmented
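
The reader-to-writer contract described above can be sketched as a simple loop. This is an illustration under assumptions, not the implementation of this method: segmentAll is a hypothetical helper, the per-line segmenter is passed in as a function, and the sketch counts only the characters of each line, excluding line terminators (the documentation above does not say whether terminators are counted).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;

class StreamSegmentSketch {
    /**
     * Reads the input line by line, writes one segmented line per input line,
     * and returns the number of input characters read (line terminators excluded).
     */
    static long segmentAll(BufferedReader br, PrintWriter pwOut, UnaryOperator<String> segmentLine) {
        long nChars = 0;
        try {
            for (String line; (line = br.readLine()) != null; ) {
                pwOut.println(segmentLine.apply(line));
                nChars += line.length();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        pwOut.flush();
        return nChars;
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        // Toy per-line segmenter: '+' marks a boundary and becomes a space.
        long n = segmentAll(new BufferedReader(new StringReader("ab+c\nde")),
                            new PrintWriter(sink),
                            s -> s.replace('+', ' '));
        System.out.println(n);
        System.out.print(sink);
    }
}
```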

train

public void train()
Train a segmenter from raw text. Gold segmentation markers are required.
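
Why gold markers are needed can be seen from how a character-level sequence model is typically supervised: each character needs a label saying whether it starts a new segment. The sketch below assumes a hypothetical '+' marker and a two-label scheme ("B" for the first character of a segment, "I" for the rest). Both the marker and the label set are invented for illustration; the training format and label set actually used by this class are not specified here.

```java
import java.util.ArrayList;
import java.util.List;

class GoldLabelSketch {
    // Hypothetical boundary marker, chosen only for this sketch.
    static final char MARKER = '+';

    /** Returns one label per non-marker character: "B" begins a segment, "I" continues it. */
    static List<String> labels(String goldToken) {
        List<String> out = new ArrayList<>();
        boolean startOfSegment = true;
        for (int i = 0; i < goldToken.length(); i++) {
            char c = goldToken.charAt(i);
            if (c == MARKER) {
                // The marker itself gets no label; it flags the next character as a segment start.
                startOfSegment = true;
                continue;
            }
            out.add(startOfSegment ? "B" : "I");
            startOfSegment = false;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(labels("wa+ktb"));
    }
}
```

Without the markers, every character after the first would receive the same label, so raw unsegmented text carries no supervision signal for the boundary decision.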


serializeSegmenter

public void serializeSegmenter(java.lang.String filename)

loadSegmenter

public void loadSegmenter(java.lang.String filename,
                          java.util.Properties p)

loadSegmenter

public void loadSegmenter(java.lang.String filename)
Specified by:
loadSegmenter in interface WordSegmenter

main

public static void main(java.lang.String[] args)
Parameters:
args - command-line arguments


Stanford NLP Group