Jet.Lex
Class Tokenizer

java.lang.Object
  extended by Jet.Lex.Tokenizer

public class Tokenizer
extends java.lang.Object

Tokenizer contains the methods for dividing a string into tokens.

The rules generally follow those of the Penn Treebank, except that hyphenated items are split, with the hyphen as a separate token, and single quotes (') are always treated as separate tokens unless they are part of a standard suffix ('s, 'm, 'd, 're, 've, n't, 'll).

For a capitalized word, we set the feature case=cap, except that at the beginning of a sentence, where the token is marked case=forcedCap. In addition, words following ``, ", or _ are marked case=forcedCap.

The tokenizer is loosely based on the version for OAK.
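
The following is a minimal illustrative sketch of these rules, using the tokenize(String) method documented below; the splits noted in the comments follow the stated rules and are not verified program output.

    import Jet.Lex.Tokenizer;

    public class TokenizeStringDemo {
        public static void main(String[] args) {
            String[] tokens = Tokenizer.tokenize("The well-known author isn't John's friend.");
            for (String token : tokens) {
                System.out.println(token);
            }
            // Following the rules above, "well-known" should yield three tokens
            // ("well", "-", "known"), and the clitics "n't" and "'s" should be
            // split off as separate tokens.
        }
    }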


Constructor Summary
Tokenizer()
           
 
Method Summary
static Annotation[] gatherTokens(Document doc, Span span)
          returns an array containing all token annotations in span of doc.
static java.lang.String[] gatherTokenStrings(Document doc, Span span)
          returns an array of Strings corresponding to all the tokens in span of doc.
static int skipWS(Document doc, int posn, int end)
          advances to the next non-whitespace character in a document.
static int skipWS(java.lang.String text, int posn, int end)
          advances to the next non-whitespace character in a string.
static int skipWSX(Document doc, int posn, int end)
          advances to the next non-whitespace character in a document, skipping any XML tags.
static int skipWSX(java.lang.String text, int posn, int end)
          advances to the next non-whitespace character in a string, skipping any XML tags.
static void tokenize(Document doc, Span span)
          tokenizes the portion of Document doc covered by span.
static java.lang.String[] tokenize(java.lang.String text)
          tokenizes the argument string.
static void tokenizeOnWS(Document doc, Span span)
          tokenizes portion 'span' of 'doc', splitting only on white space.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Tokenizer

public Tokenizer()
Method Detail

tokenize

public static void tokenize(Document doc,
                            Span span)
tokenizes the portion of Document doc covered by span. For each token, adds to doc an annotation of type token.
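
A minimal sketch of a typical call. It assumes, as elsewhere in the Jet toolkit, that Document and Span live in the Jet.Tipster package, that Document can be constructed from a String, and that Span takes a pair of character offsets; consult those classes for the actual constructors. gatherTokenStrings (documented below) is used to read the results back.

    import Jet.Lex.Tokenizer;
    import Jet.Tipster.Document;
    import Jet.Tipster.Span;

    public class TokenizeDocumentDemo {
        public static void main(String[] args) {
            String text = "Dogs bark.  Cats meow.";
            Document doc = new Document(text);        // assumed constructor
            Span span = new Span(0, text.length());   // assumed constructor
            Tokenizer.tokenize(doc, span);            // adds annotations of type token to doc
            System.out.println(String.join(" | ", Tokenizer.gatherTokenStrings(doc, span)));
        }
    }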


tokenize

public static java.lang.String[] tokenize(java.lang.String text)
tokenizes the argument string. Returns an array of Strings, each element of which is the character string for one token in the argument.


tokenizeOnWS

public static void tokenizeOnWS(Document doc,
                                Span span)
tokenizes portion 'span' of 'doc', splitting only on white space.
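
A sketch contrasting whitespace-only tokenization with the full tokenizer, under the same Document and Span assumptions as the sketch above.

    import Jet.Lex.Tokenizer;
    import Jet.Tipster.Document;
    import Jet.Tipster.Span;

    public class TokenizeOnWSDemo {
        public static void main(String[] args) {
            String text = "It isn't well-known.";
            Document doc = new Document(text);        // assumed constructor
            Span span = new Span(0, text.length());   // assumed constructor
            Tokenizer.tokenizeOnWS(doc, span);
            // Splitting only on white space should leave "isn't" and
            // "well-known." as single tokens.
            System.out.println(String.join(" | ", Tokenizer.gatherTokenStrings(doc, span)));
        }
    }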


skipWS

public static int skipWS(Document doc,
                         int posn,
                         int end)
advances to the next non-whitespace character in a document. posn is a character position within Document doc. Returns posn (if that character position is occupied by a non-whitespace character), or the position of the next non-whitespace character, or end if all the characters up to end are whitespace.


skipWS

public static int skipWS(java.lang.String text,
                         int posn,
                         int end)
advances to the next non-whitespace character in a string; the String counterpart of skipWS(Document, int, int) above. posn is a character position within text.

skipWSX

public static int skipWSX(Document doc,
                          int posn,
                          int end)
advances to the next non-whitespace character in a document, skipping any XML tags.


skipWSX

public static int skipWSX(java.lang.String text,
                          int posn,
                          int end)
advances to the next non-whitespace character in a string, skipping any XML tags; the String counterpart of skipWSX(Document, int, int) above.
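
A sketch of the two String overloads; posn and end are character offsets into text, and the offsets noted in the comments follow the descriptions above.

    import Jet.Lex.Tokenizer;

    public class SkipWhitespaceDemo {
        public static void main(String[] args) {
            String text = "   <tag>  word";
            // skipWS should stop at the first non-whitespace character,
            // the '<' at offset 3.
            int afterWS = Tokenizer.skipWS(text, 0, text.length());
            // skipWSX should also skip the XML tag and the whitespace after it,
            // stopping at the 'w' of "word" at offset 10.
            int afterWSX = Tokenizer.skipWSX(text, 0, text.length());
            System.out.println(afterWS + " " + afterWSX);
        }
    }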

gatherTokens

public static Annotation[] gatherTokens(Document doc,
                                        Span span)
returns an array containing all token annotations in span of doc.
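
A sketch that retrieves the token annotations themselves rather than their strings, assuming Annotation lives in Jet.Tipster alongside Document and Span; it relies only on the inherited toString() to display each annotation.

    import Jet.Lex.Tokenizer;
    import Jet.Tipster.Annotation;
    import Jet.Tipster.Document;
    import Jet.Tipster.Span;

    public class GatherTokensDemo {
        public static void main(String[] args) {
            String text = "Time flies.";
            Document doc = new Document(text);        // assumed constructor
            Span span = new Span(0, text.length());   // assumed constructor
            Tokenizer.tokenize(doc, span);
            Annotation[] tokens = Tokenizer.gatherTokens(doc, span);
            System.out.println(tokens.length + " token annotations:");
            for (Annotation token : tokens) {
                System.out.println(token);
            }
        }
    }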


gatherTokenStrings

public static java.lang.String[] gatherTokenStrings(Document doc,
                                                    Span span)
returns an array of Strings corresponding to all the tokens in span of doc.