Tokenization Rules

Text elements annotated for the MUC-6 Named Entity and Coreference tasks must consist of one or more complete tokens. Normally, whitespace surrounding a single character or a group of characters defines an explicit token (a word). This document explains where the boundaries of tagged strings are meant to be located when there is NO explicit whitespace between alphanumeric characters and a punctuation mark or other special character.

Named Entity tagging is used in this document as an example of the effects of tokenization. The tokenization rules also apply to the Coreference and Information Extraction tasks.

1 - Punctuation and special characters are normally considered separate tokens.
2 - When a proper name or number contains an internal punctuation mark or other special character, the word containing that character is treated as just one token.
3 - Hyphen at end of line
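
As an illustration only, rules 1 and 2 can be sketched in Python as follows. This is not the MUC-6 reference tokenizer; the function name tokenize and the sample sentence are hypothetical. The sketch assumes that every special character at the start or end of a whitespace-delimited string is its own token (rule 1), and that any special character with alphanumerics on both sides stays inside its word (rule 2). It does not check whether the word really is a proper name or number, and it does not attempt rule 3, since the hyphen rule is only named here.

```python
def tokenize(text):
    """Rough sketch of rules 1 and 2 (assumptions noted in comments)."""
    tokens = []
    for chunk in text.split():  # whitespace defines the explicit tokens
        start, end = 0, len(chunk)
        # Skip over leading non-alphanumeric characters.
        while start < end and not chunk[start].isalnum():
            start += 1
        # Skip over trailing non-alphanumeric characters.
        while end > start and not chunk[end - 1].isalnum():
            end -= 1
        # Rule 1 (simplified): each edge special character is a separate token.
        tokens.extend(chunk[:start])
        # Rule 2 (simplified): the core keeps its internal punctuation and
        # stays one token. Assumption: applied to every word, not only to
        # proper names and numbers.
        if start < end:
            tokens.append(chunk[start:end])
        tokens.extend(chunk[end:])
    return tokens


print(tokenize("Shares of AT&T rose 3.5%, said O'Brien."))
# ['Shares', 'of', 'AT&T', 'rose', '3.5', '%', ',', 'said', "O'Brien", '.']
```

Under these assumptions a tagged string such as AT&T or 3.5 can be marked as a complete token without including the adjacent comma, percent sign, or period.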

Tokenization Rules - 14 JUN 95
