Tokenizer

The tokenizer divides the input text into tokens (roughly, words and punctuation).  It is typically the first annotator applied to a span of text.  It adds annotations of type token to the text.

Three types of tokens are recognized:

Whitespace (blanks, tabs, and newlines) is not treated as a token.  Instead, the whitespace following a token is included in the span of the token annotation, so that the end of one token is the start of the next token.
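
The span convention described above (trailing whitespace belongs to the preceding token, so spans are contiguous) can be illustrated with a minimal sketch.  This is not the tokenizer's actual implementation: the function name token_spans and the simple word/punctuation pattern are assumptions made only for illustration, and the real tokenizer's rules are likely richer.

```python
import re
from typing import List, Tuple

# Hypothetical pattern, for illustration only: a token is either a run of
# word characters or a single punctuation character.  Any whitespace that
# follows the token is absorbed into the same match.
_TOKEN_WITH_TRAILING_SPACE = re.compile(r"(?:\w+|[^\w\s])\s*")


def token_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) offsets for each token in text.

    Each span covers the token plus the whitespace after it, so the end
    of one span equals the start of the next.
    """
    return [(m.start(), m.end())
            for m in _TOKEN_WITH_TRAILING_SPACE.finditer(text)]


if __name__ == "__main__":
    sample = "Hello, world!  Bye"
    for start, end in token_spans(sample):
        # repr() makes the absorbed trailing whitespace visible
        print(start, end, repr(sample[start:end]))
```

One consequence of this convention is that the token spans tile the text with no gaps, so an annotation that ends at a token boundary is immediately followed by one that starts at the same offset.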