java.lang.Object
  Jet.Lex.Tokenizer
Tokenizer contains the methods for dividing a string into tokens.
The rules generally follow those of the Penn Tree Bank, with two exceptions: hyphenated items are split, with the hyphen as a separate token, and a single quote (') is always treated as a separate token unless it is part of a standard suffix ('s, 'm, 'd, 're, 've, n't, 'll).
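The splitting rules above can be illustrated with a small standalone sketch. This is not Jet's implementation: the class `SuffixSplitSketch` and the method `splitWord` are hypothetical names, the sketch handles one whitespace-delimited word at a time, and it assumes that suffix matching is case-insensitive and that a suffix only counts when something precedes it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the rules described above; not part of Jet.
final class SuffixSplitSketch {
    // The standard suffixes listed in the class description.
    private static final String[] SUFFIXES = {"n't", "'s", "'m", "'d", "'re", "'ve", "'ll"};

    // Splits one whitespace-delimited word into tokens.
    static List<String> splitWord(String word) {
        List<String> tokens = new ArrayList<>();
        String stem = word;
        String suffix = null;
        String lower = word.toLowerCase(Locale.ROOT);
        // Assumption: a suffix is recognized only if a non-empty stem remains.
        for (String s : SUFFIXES) {
            if (lower.length() > s.length() && lower.endsWith(s)) {
                stem = word.substring(0, word.length() - s.length());
                suffix = word.substring(word.length() - s.length());
                break;
            }
        }
        // Within the stem, every hyphen and single quote becomes its own token.
        StringBuilder current = new StringBuilder();
        for (char c : stem.toCharArray()) {
            if (c == '-' || c == '\'') {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add(String.valueOf(c));
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        if (suffix != null) {
            tokens.add(suffix);
        }
        return tokens;
    }
}
```

Under these assumptions, `splitWord("well-known")` yields `well`, `-`, `known`, and `splitWord("John's")` yields `John`, `'s`. The real tokenizer may treat other punctuation and edge cases differently.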
A capitalized word receives the feature case=cap, except at the beginning of a sentence, where the token is marked case=forcedCap. Words following a ``, ", or _ are also marked case=forcedCap.
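The case-marking rule above might be sketched as follows. The class `CaseFeatureSketch` and the method `caseFeature` are hypothetical and not part of Jet. The sketch assumes that "capitalized" means the first character is upper case, that a `null` previous token stands for the beginning of a sentence, and that only capitalized words receive a case feature at all.

```java
// Hypothetical illustration of the case feature rule; not part of Jet.
final class CaseFeatureSketch {
    // Returns "cap", "forcedCap", or null when no case feature applies.
    // previousToken == null is taken to mean "start of sentence" (assumption).
    static String caseFeature(String token, String previousToken) {
        if (token.isEmpty() || !Character.isUpperCase(token.charAt(0))) {
            return null; // not capitalized: no case feature in this sketch
        }
        boolean forced = previousToken == null
                || previousToken.equals("``")
                || previousToken.equals("\"")
                || previousToken.equals("_");
        return forced ? "forcedCap" : "cap";
    }
}
```

The distinction presumably lets later components tell an informative capital, as in a name in mid-sentence, from one that the position in the text would produce anyway.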
The tokenizer is loosely based on the version for OAK.
Constructor Summary
Tokenizer()

Method Summary
static Annotation[] gatherTokens(Document doc, Span span)
    Returns an array containing all token annotations in span of doc.
static java.lang.String[] gatherTokenStrings(Document doc, Span span)
    Returns an array of Strings corresponding to all the tokens in span of doc.
static int skipWS(Document doc, int posn, int end)
    Advances to the next non-whitespace character in a document.
static int skipWS(java.lang.String text, int posn, int end)
static int skipWSX(Document doc, int posn, int end)
    Advances to the next non-whitespace character in a document, skipping any XML tags.
static int skipWSX(java.lang.String text, int posn, int end)
static void tokenize(Document doc, Span span)
    Tokenizes the portion of Document doc covered by span.
static java.lang.String[] tokenize(java.lang.String text)
    Tokenizes the argument string.
static void tokenizeOnWS(Document doc, Span span)
    Tokenizes portion span of doc, splitting only on white space.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
public Tokenizer()
Method Detail
public static void tokenize(Document doc, Span span)
public static java.lang.String[] tokenize(java.lang.String text)
public static void tokenizeOnWS(Document doc, Span span)
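tokenizeOnWS is summarized as splitting only on white space. A minimal sketch of that idea on a plain String is shown here, returning character offsets instead of adding annotations to a Document. The class `WhitespaceSplitSketch` and the method `splitOnWhitespace` are hypothetical. The sketch assumes `Character.isWhitespace` as the definition of white space and records spans that cover only the non-whitespace characters, which may not match the spans Jet assigns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of whitespace-only tokenization; not part of Jet.
final class WhitespaceSplitSketch {
    // Returns {start, end} offsets (end exclusive) of each run of
    // non-whitespace characters in text[start, end).
    static List<int[]> splitOnWhitespace(String text, int start, int end) {
        List<int[]> spans = new ArrayList<>();
        int i = start;
        while (i < end) {
            // Skip leading whitespace.
            while (i < end && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= end) {
                break;
            }
            int tokenStart = i;
            // Consume the token up to the next whitespace character.
            while (i < end && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            spans.add(new int[] {tokenStart, i});
        }
        return spans;
    }
}
```

Unlike the Penn Tree Bank style rules, this leaves hyphens, quotes, and suffixes attached to the surrounding word.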
public static int skipWS(Document doc, int posn, int end)
posn is a character position within Document doc. Returns posn (if that character position is occupied by a non-whitespace character), or the position of the next non-whitespace character, or end if all the characters up to end are whitespace.
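The return-value contract described above can be sketched on a plain String as follows. This is a standalone illustration, not Jet's source: the class `SkipWsSketch` is hypothetical, the sketch assumes `0 <= posn <= end <= text.length()`, and it uses `Character.isWhitespace` as the definition of whitespace.

```java
// Hypothetical illustration of the skipWS contract; not part of Jet.
final class SkipWsSketch {
    // Returns posn if text.charAt(posn) is non-whitespace, otherwise the
    // position of the next non-whitespace character, or end if every
    // character in text[posn, end) is whitespace.
    // Assumption: 0 <= posn <= end <= text.length().
    static int skipWS(String text, int posn, int end) {
        while (posn < end && Character.isWhitespace(text.charAt(posn))) {
            posn++;
        }
        return posn;
    }
}
```

For the text `"  ab"` with posn 0 and end 4, the sketch returns 2; for `"a   "` with posn 1 and end 4, it returns 4, the value of end.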
public static int skipWS(java.lang.String text, int posn, int end)
public static int skipWSX(Document doc, int posn, int end)
public static int skipWSX(java.lang.String text, int posn, int end)
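skipWSX is summarized as advancing to the next non-whitespace character while skipping any XML tags. One way to picture this on a plain String is sketched here. The class `SkipWsxSketch` is hypothetical, and several behaviors are assumptions, not documented facts: a tag is taken to be everything from `<` through the next `>` before end, and a `<` with no closing `>` before end is treated as an ordinary non-whitespace character.

```java
// Hypothetical illustration of skipping whitespace and XML tags; not part of Jet.
final class SkipWsxSketch {
    // Assumption: 0 <= posn <= end <= text.length().
    static int skipWSX(String text, int posn, int end) {
        while (posn < end) {
            char c = text.charAt(posn);
            if (Character.isWhitespace(c)) {
                posn++;
            } else if (c == '<') {
                int close = text.indexOf('>', posn);
                if (close < 0 || close >= end) {
                    // Assumption: an unterminated '<' is ordinary text.
                    return posn;
                }
                posn = close + 1; // jump past the whole tag
            } else {
                return posn;
            }
        }
        return end;
    }
}
```

For the text `" <b> x"` with posn 0 and end 6, the sketch skips the space, the `<b>` tag, and the following space, and returns 5.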
public static Annotation[] gatherTokens(Document doc, Span span)
Returns an array containing all token annotations in span of doc.
public static java.lang.String[] gatherTokenStrings(Document doc, Span span)
Returns an array of Strings corresponding to all the tokens in span of doc.