Tokenizer
The tokenizer divides the input text into tokens
(roughly, words and punctuation). It is typically the first annotator
applied to a span of text. It adds annotations of type token
to the text. Three types of tokens are recognized:
- words, consisting of one or more letters.
  If the first letter is capitalized, the token annotation gets the feature
  case with the value cap.
- numbers, consisting of one or more digits.
  The token annotation is assigned the feature intvalue, whose value
  is the numeric value of the integer.
- special characters. Any non-whitespace character
  other than a letter or digit is treated as a single-character token.
Whitespace (blanks, tabs, and newlines) does not form tokens of its own.
The whitespace following a token is included in the span of that token's annotation,
so that the end of one token is the start of the next token.
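
The rules above can be illustrated with a short Python sketch. It is not the tokenizer's actual implementation. The feature names case, cap, and intvalue come from the description above, while the Token class, the tokenize function, and the regular expression are hypothetical names introduced only for this example. The sketch also makes two assumptions: digits are the ASCII digits 0-9, and whitespace before the first token is skipped without being assigned to any span.

```python
import re
from dataclasses import dataclass, field

# One alternative per token type, tried in order: word, number, special.
# Trailing whitespace is captured separately so it can be folded into the span.
# [^\W\d_] matches a letter (a word character that is neither digit nor "_").
_TOKEN_RE = re.compile(
    r"(?:(?P<word>[^\W\d_]+)|(?P<number>[0-9]+)|(?P<special>\S))(?P<ws>\s*)"
)


@dataclass
class Token:
    """Hypothetical stand-in for a token annotation."""
    start: int                 # offset of the first character of the token
    end: int                   # exclusive; includes any trailing whitespace
    text: str                  # token text without the trailing whitespace
    features: dict = field(default_factory=dict)


def tokenize(text):
    """Return a list of Token objects following the three rules above."""
    tokens = []
    for m in _TOKEN_RE.finditer(text):
        features = {}
        if m.group("word"):
            # Words starting with a capital letter get case=cap.
            if m.group("word")[0].isupper():
                features["case"] = "cap"
        elif m.group("number"):
            # Numbers carry their integer value.
            features["intvalue"] = int(m.group("number"))
        token_text = m.group("word") or m.group("number") or m.group("special")
        # m.end() lies past the trailing whitespace, so spans are contiguous.
        tokens.append(Token(m.start(), m.end(), token_text, features))
    return tokens


if __name__ == "__main__":
    for tok in tokenize("Hello, world 42!"):
        print(tok.start, tok.end, repr(tok.text), tok.features)
```

Because each span runs up to the start of the next token, consecutive tokens in this sketch always satisfy previous.end == next.start, which is the property described in the last sentence above.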