Tokenizer

The tokenizer divides the input text into tokens (roughly, words and punctuation). It is typically the first annotator applied to a span of text, and it adds annotations of type token to that text. Three types of tokens are recognized:

words, consisting of one or more letters. If the first letter is capitalized, the token annotation gets the feature case with the value cap.
numbers, consisting of one or more digits. The token annotation is assigned the feature intvalue whose value is the numeric value of the integer.
special characters. Any non-whitespace character other than a letter or digit is treated as a single-character token.

Whitespace (blanks, tabs, and newlines) does not produce tokens of its own. Instead, the whitespace following a token is included in the span of that token's annotation, so that the end of one token is the start of the next token.
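
The rules above can be illustrated with a minimal Python sketch. It is not the tokenizer's actual implementation: the function name tokenize, the (start, end, features) tuple layout, and the restriction of numbers to ASCII digits are all assumptions made for illustration. Only the feature names case and intvalue and the span behavior come from the description above.

```python
import re

# Assumed pattern: a run of letters, a run of ASCII digits, or one other
# non-whitespace character, each followed by any trailing whitespace.
TOKEN_RE = re.compile(
    r"(?:(?P<word>[^\W\d_]+)|(?P<number>[0-9]+)|(?P<special>\S))\s*"
)

def tokenize(text):
    """Return (start, end, features) tuples; end includes trailing whitespace."""
    tokens = []
    # finditer skips any leading whitespace, since no token can start there.
    for m in TOKEN_RE.finditer(text):
        features = {}
        word = m.group("word")
        number = m.group("number")
        if word is not None and word[0].isupper():
            features["case"] = "cap"            # capitalized word
        elif number is not None:
            features["intvalue"] = int(number)  # numeric value of the digits
        tokens.append((m.start(), m.end(), features))
    return tokens

for start, end, features in tokenize("Fred has 27 cats."):
    print(start, end, features)
```

For the input "Fred has 27 cats." this sketch yields the spans 0-5, 5-9, 9-12, 12-16, and 16-17. Each span ends where the next begins, because the trailing blank is counted as part of the preceding token.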