edu.stanford.nlp.process
Class ChineseDocumentToSentenceProcessor

java.lang.Object
  extended by edu.stanford.nlp.process.ChineseDocumentToSentenceProcessor
All Implemented Interfaces:
java.io.Serializable

public class ChineseDocumentToSentenceProcessor
extends java.lang.Object
implements java.io.Serializable

Converts a Chinese document into a List of sentence Strings.

Author:
Pi-Chuan Chang
See Also:
Serialized Form

Constructor Summary
ChineseDocumentToSentenceProcessor()
           
ChineseDocumentToSentenceProcessor(java.lang.String normalizationTableFile)
           
 
Method Summary
 java.util.List<java.lang.String> fromHTML(java.lang.String inputString)
          Strip off HTML tags before processing.
static java.util.List<java.lang.String> fromPlainText(java.lang.String contentString)
           
static java.util.List<java.lang.String> fromPlainText(java.lang.String contentString, boolean segmented)
           
static void main(java.lang.String[] args)
          usage: java ChineseDocumentToSentenceProcessor [-segmentIBM] -file filename [-encoding encoding]
 java.lang.String normalization(java.lang.String in)
          Discouraged: this method should fall out of use, and callers should call ChineseUtils directly (CDM, June 2006).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ChineseDocumentToSentenceProcessor

public ChineseDocumentToSentenceProcessor()

ChineseDocumentToSentenceProcessor

public ChineseDocumentToSentenceProcessor(java.lang.String normalizationTableFile)
Parameters:
normalizationTableFile - A file listing character pairs for normalization. Currently the normalization table must be in UTF-8. If this parameter is null, the default normalization of the zero-argument constructor is used.
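
As an illustration of what normalizing by character pairs can mean, here is a minimal sketch that replaces each character found in a lookup table with its paired form. It is not this class's implementation. The on-disk format of the normalization table is not described here, so the sketch builds the table in memory; the class name NormalizationSketch, the method normalize, and the example pair (fullwidth A mapped to ASCII A) are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the Stanford implementation.
final class NormalizationSketch {

    // Replace every character that has an entry in the table; keep all others unchanged.
    static String normalize(String in, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            char mapped = table.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed example pair: FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) -> ASCII 'A'.
        Map<Character, Character> table = new HashMap<>();
        table.put('\uFF21', 'A');
        System.out.println(normalize("\uFF21\u597D", table));
    }
}
```

A per-character map only covers characters in the Basic Multilingual Plane; a table with supplementary characters would need to iterate over code points instead.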
Method Detail

normalization

public java.lang.String normalization(java.lang.String in)
Discouraged: this method should fall out of use, and callers should call ChineseUtils directly (CDM, June 2006).


main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
usage: java ChineseDocumentToSentenceProcessor [-segmentIBM] -file filename [-encoding encoding]

The -segmentIBM option is for IBM GALE-specific splitting of an XML element into sentences.

Throws:
java.io.IOException

fromHTML

public java.util.List<java.lang.String> fromHTML(java.lang.String inputString)
                                          throws java.io.IOException
Strip off HTML tags before processing. Only the simplest tag stripping is implemented.

Parameters:
inputString - Chinese document text which contains HTML tags
Returns:
a List of sentence strings
Throws:
java.io.IOException
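
To give a sense of what "the simplest tag stripping" could look like, the sketch below removes anything between a '<' and the next '>' with a single regular expression. This is an assumption about the general approach, not the code used by fromHTML; the class name TagStripSketch and the method stripTags are hypothetical.

```java
// Hypothetical illustration; not the Stanford implementation.
final class TagStripSketch {

    // Delete every run of the form <...>. Entities, comments, and script or
    // style bodies are not handled, which is the limitation of this approach.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Input is two short Chinese sentences wrapped in <p> and followed by <br/>.
        String html = "<p>\u4F60\u597D\u3002</p><br/>\u518D\u89C1\u3002";
        System.out.println(stripTags(html));
    }
}
```

With the library on the classpath, the stripped text would not need to be produced by hand: the documented entry point is fromHTML(inputString), called on an instance of ChineseDocumentToSentenceProcessor.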

fromPlainText

public static java.util.List<java.lang.String> fromPlainText(java.lang.String contentString)
                                                      throws java.io.IOException
Parameters:
contentString - Chinese document text
Returns:
a List of sentence strings
Throws:
java.io.IOException
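
The shape of the result, one String per sentence, can be illustrated without the library. The sketch below is not the algorithm used by fromPlainText; it assumes, purely for illustration, that a sentence ends at one of the characters 。！？ and keeps the terminator with its sentence. The class name SentenceSplitSketch and the method split are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the Stanford implementation.
final class SentenceSplitSketch {

    // Assumed sentence terminators: IDEOGRAPHIC FULL STOP (U+3002),
    // FULLWIDTH EXCLAMATION MARK (U+FF01), FULLWIDTH QUESTION MARK (U+FF1F).
    private static final String TERMINATORS = "\u3002\uFF01\uFF1F";

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);
            if (TERMINATORS.indexOf(c) >= 0) {
                addIfNotBlank(sentences, current);
                current.setLength(0);
            }
        }
        // Keep any trailing text that lacks a terminator.
        addIfNotBlank(sentences, current);
        return sentences;
    }

    private static void addIfNotBlank(List<String> out, StringBuilder sb) {
        String s = sb.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
    }

    public static void main(String[] args) {
        // Two short sentences: one ending in a full stop, one in an exclamation mark.
        for (String s : split("\u4F60\u597D\u3002\u518D\u89C1\uFF01")) {
            System.out.println(s);
        }
    }
}
```

With the library on the classpath, the corresponding documented call is the static method ChineseDocumentToSentenceProcessor.fromPlainText(contentString), which returns a java.util.List<java.lang.String> and may throw java.io.IOException.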

fromPlainText

public static java.util.List<java.lang.String> fromPlainText(java.lang.String contentString,
                                                             boolean segmented)
                                                      throws java.io.IOException
Throws:
java.io.IOException


Stanford NLP Group