Type parameters: L - label type
public class DocumentReader<L> extends Object
| Modifier and Type | Field and Description |
|---|---|
| protected BufferedReader | in: Reader used to read in document text. |
| protected boolean | keepOriginalText: Whether to keep source text in document along with tokenized words. |
| protected TokenizerFactory<? extends HasWord> | tokenizerFactory: Tokenizer used to chop up document text into words. |
| Constructor and Description |
|---|
| DocumentReader(): Constructs a new DocumentReader without an initial input source. |
| DocumentReader(Reader in): Constructs a new DocumentReader that uses a PTBTokenizerFactory and keeps the original text. |
| DocumentReader(Reader in, TokenizerFactory<? extends HasWord> tokenizerFactory, boolean keepOriginalText): Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given TokenizerFactory. |
| Modifier and Type | Method and Description |
|---|---|
| static BufferedReader | getBufferedReader(Reader in): Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader. |
| boolean | getKeepOriginalText(): Returns whether created documents will store their source text along with tokenized words. |
| Reader | getReader(): Returns the reader for the text input source of this DocumentReader. |
| static Reader | getReader(File file): Returns a Reader that reads in the given file. |
| static Reader | getReader(InputStream in): Returns a Reader that reads in the given InputStream. |
| static Reader | getReader(String text): Returns a Reader that reads in the given text. |
| static Reader | getReader(URL url): Returns a Reader that reads in the given URL. |
| TokenizerFactory<? extends HasWord> | getTokenizerFactory(): Returns the tokenizer used to chop up text into words for the documents. |
| protected BasicDocument<L> | parseDocumentText(String text): Creates a new Document for the given text. |
| BasicDocument<L> | readDocument(): Reads the next document's worth of text from the reader and turns it into a Document. |
| protected String | readNextDocumentText(): Reads the next document's worth of text from the reader. |
| static String | readText(Reader in): Returns everything that can be read from the given Reader as a String. |
| void | setKeepOriginalText(boolean keepOriginalText): Sets whether created documents should store their source text along with tokenized words. |
| void | setReader(Reader in): Sets the reader from which to read and create documents. |
| void | setTokenizerFactory(TokenizerFactory<? extends HasWord> tokenizerFactory): Sets the tokenizer used to chop up text into words for the documents. |
protected BufferedReader in
protected TokenizerFactory<? extends HasWord> tokenizerFactory
protected boolean keepOriginalText
public DocumentReader()
Call setReader(java.io.Reader) before trying to read any documents.
Uses a PTBTokenizer and keeps original text.
public DocumentReader(Reader in)
in - The Reader
public DocumentReader(Reader in, TokenizerFactory<? extends HasWord> tokenizerFactory, boolean keepOriginalText)
public Reader getReader()
public void setReader(Reader in)
public TokenizerFactory<? extends HasWord> getTokenizerFactory()
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> tokenizerFactory)
public boolean getKeepOriginalText()
public void setKeepOriginalText(boolean keepOriginalText)
public BasicDocument<L> readDocument() throws IOException
Calls readNextDocumentText()
and passes the result to parseDocumentText(java.lang.String) to create the document.
Subclasses may wish to override either or both of those methods to handle
custom formats of document collections and individual documents,
respectively. This method can also be overridden in its entirety to
provide custom reading and construction of documents from input text.
Throws: IOException
protected String readNextDocumentText() throws IOException
Throws: IOException
protected BasicDocument<L> parseDocumentText(String text)
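The split between readDocument(), readNextDocumentText(), and parseDocumentText(String) can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: the class names TemplateReaderSketch and LinePerDocumentReaderSketch are hypothetical, a List<String> stands in for BasicDocument<L>, a whitespace split stands in for the TokenizerFactory, and returning null at end of input is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for DocumentReader; a List<String> of tokens
// stands in for BasicDocument<L>.
class TemplateReaderSketch {
    protected BufferedReader in;

    TemplateReaderSketch(Reader in) {
        this.in = new BufferedReader(in);
    }

    // Template method: fetch the next document's raw text, then parse it.
    // Returning null when no text is available is an assumption of this sketch.
    public List<String> readDocument() throws IOException {
        String text = readNextDocumentText();
        return text == null ? null : parseDocumentText(text);
    }

    // Assumed default: everything left in the reader is one document.
    protected String readNextDocumentText() throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Whitespace splitting stands in for a real tokenizer.
    protected List<String> parseDocumentText(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }
}

// Overrides only the collection-format step: one document per input line.
class LinePerDocumentReaderSketch extends TemplateReaderSketch {
    LinePerDocumentReaderSketch(Reader in) {
        super(in);
    }

    @Override
    protected String readNextDocumentText() throws IOException {
        return in.readLine(); // null once the input is exhausted
    }
}
```

Overriding only readNextDocumentText() changes how a collection is divided into documents while the parsing step is reused unchanged, which is the kind of customization the description above points to.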
public static BufferedReader getBufferedReader(Reader in)
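The wrap-or-return behavior described for getBufferedReader(Reader) might look roughly like the sketch below. The class name BufferedReaderSketch is hypothetical, and the sketch does not cover how the library treats a null argument.

```java
import java.io.BufferedReader;
import java.io.Reader;

// Hypothetical helper class; only the wrap-or-return idea is taken from the description.
class BufferedReaderSketch {
    static BufferedReader getBufferedReader(Reader in) {
        if (in instanceof BufferedReader) {
            // Already buffered: hand back the same object instead of double-wrapping.
            return (BufferedReader) in;
        }
        return new BufferedReader(in);
    }
}
```

Returning the existing instance avoids stacking a second buffer on top of a reader that is already buffered.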
public static String readText(Reader in) throws IOException
Throws: IOException
public static Reader getReader(String text)
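As a rough illustration of how readText(Reader) and getReader(String) fit together, the sketch below round-trips a string through a Reader using only java.io. The class name ReadTextSketch is hypothetical. This version copies characters exactly; whether the library's readText preserves line terminators in the same way is not stated here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper class mirroring two of the static utilities by name only.
class ReadTextSketch {
    // A String can be exposed as a Reader via java.io.StringReader.
    static Reader getReader(String text) {
        return new StringReader(text);
    }

    // Drains the reader into a String, one buffer at a time.
    static String readText(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```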
public static Reader getReader(File file) throws FileNotFoundException
Throws: FileNotFoundException
public static Reader getReader(URL url) throws IOException
Throws: IOException
public static Reader getReader(InputStream in)
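The getReader overloads for File, URL, and InputStream can each be reduced to wrapping a byte stream in a Reader. The sketch below shows one way to do that with the standard library; the class name ReaderSourcesSketch is hypothetical, and decoding as UTF-8 is an assumption, since the character encoding the library uses is not stated here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the UTF-8 choice is an assumption of this sketch.
class ReaderSourcesSketch {
    // Base case: decode bytes from any InputStream as characters.
    static Reader getReader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    // FileInputStream's constructor is the source of FileNotFoundException.
    static Reader getReader(File file) throws FileNotFoundException {
        return getReader(new FileInputStream(file));
    }

    // URL.openStream() can fail with an IOException.
    static Reader getReader(URL url) throws IOException {
        return getReader(url.openStream());
    }
}
```

The checked exceptions in this sketch line up with the signatures listed above: the File overload can fail when the file is opened, and the URL overload can fail when the connection is opened.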