The tokenizer divides the input text into tokens
-- roughly, words and punctuation. It is typically the first annotator
applied to a span of text. It adds to the text annotations
of type token.
Whitespace (blanks, tabs, and newlines) does not form tokens of its own.
Instead, the whitespace following a token is included in the span of the token annotation,
so that the end of one token is the start of the next token.
Three types of tokens are recognized:
- Words, consisting of one or more letters.
  If the first letter is capitalized, the token annotation gets the feature
  case with the value cap.
- Numbers, consisting of one or more digits.
  The token annotation is assigned the feature intvalue, whose value
  is the numeric value of the integer.
- Special characters. Any non-whitespace character
  other than a letter or digit is treated as a single-character token.
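
The rules above can be illustrated with a minimal Python sketch. It is not the tokenizer's actual implementation: the names Token and tokenize, and the use of a dictionary for features, are hypothetical and chosen only for illustration. Skipping whitespace at the very start of the text is also an assumption, since the description does not say how leading whitespace is handled.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical stand-in for a token annotation."""
    start: int   # offset of the first character of the token
    end: int     # offset just past the token's trailing whitespace
    text: str    # the token's characters, without trailing whitespace
    features: dict = field(default_factory=dict)


def tokenize(text):
    """Split text into word, number, and special-character tokens."""
    tokens = []
    i, n = 0, len(text)
    # Assumption: whitespace before the first token belongs to no token.
    while i < n and text[i].isspace():
        i += 1
    while i < n:
        start = i
        features = {}
        if text[i].isalpha():
            # Word: one or more letters.
            while i < n and text[i].isalpha():
                i += 1
            if text[start].isupper():
                features["case"] = "cap"
        elif text[i].isdecimal():
            # Number: one or more digits.
            while i < n and text[i].isdecimal():
                i += 1
            features["intvalue"] = int(text[start:i])
        else:
            # Special character: always a single-character token.
            i += 1
        token_text = text[start:i]
        # Fold trailing whitespace into this token's span, so that its
        # end offset equals the start offset of the next token.
        while i < n and text[i].isspace():
            i += 1
        tokens.append(Token(start, i, token_text, features))
    return tokens


if __name__ == "__main__":
    for tok in tokenize("Call 42 now!"):
        print(tok.start, tok.end, repr(tok.text), tok.features)
```

For the input "Call 42 now!" this sketch produces four tokens with spans 0-5, 5-8, 8-11, and 11-12. The first carries case=cap, the second intvalue=42, and the last two carry no features.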