Tokenization Rules

1 Punctuation and special characters are normally considered separate tokens.

1.1 - Examples with period (ellipsis, sentence-end punctuation)
1.2 - Examples with hyphen or dash
1.3 - Examples with apostrophe
1.4 - Examples with other punctuation
1.5 - Examples with special characters
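As a rough illustration of rule 1, the sketch below splits a string so that each punctuation mark or special character becomes its own token. The function name `tokenize` and the regular expression are assumptions made for this illustration and are not part of the guidelines. In particular, keeping a dotted abbreviation such as "U.S." together, and treating "..." and "--" as single tokens, are readings inferred from the examples in sections 1.1 and 1.2, not stated rules.

```python
import re

# Hypothetical token pattern (an assumption, not taken from the guidelines).
# Alternatives are tried in order:
#   1. dotted abbreviations such as "U.S." (kept whole, as in section 1.2)
#   2. an ellipsis "..."
#   3. a double-hyphen dash "--"
#   4. a run of letters
#   5. a number, optionally with internal "." or "," separators
#   6. any other single non-space character (punctuation, "$", etc.)
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.){2,}"
    r"|\.{3}"
    r"|--"
    r"|[A-Za-z]+"
    r"|\d+(?:[.,]\d+)*"
    r"|\S"
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    for sample in ["Chicago-based", "U.S.-based", "California's", "(IBM)", "US$10"]:
        print(sample, "->", tokenize(sample))
```

Under these assumptions, "Chicago-based" yields the three tokens `Chicago`, `-`, `based`, and "US$10" yields `US`, `$`, `10`, which is what allows the markup in the examples to end immediately before the hyphen or the dollar sign.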

1.1 Examples with period (ellipsis, sentence-end punctuation)

"...Jaguar company in Britain."

...<ENAMEX TYPE="ORGANIZATION">Jaguar</ENAMEX> company in <ENAMEX TYPE="LOCATION">Britain</ENAMEX>.

1.2 Examples with hyphen or dash

"Chicago-based"

<ENAMEX TYPE="LOCATION">Chicago</ENAMEX>-based

"U.S.-based"

<ENAMEX TYPE="LOCATION">U.S.</ENAMEX>-based

"U.S.-Japan trade negotiations"

<ENAMEX TYPE="LOCATION">U.S.</ENAMEX>-<ENAMEX TYPE="LOCATION">Japan</ENAMEX> trade negotiations

"an Eaton-Sumitomo joint venture"

an <ENAMEX TYPE="ORGANIZATION">Eaton</ENAMEX>-<ENAMEX TYPE="ORGANIZATION">Sumitomo</ENAMEX> joint venture

"PHILADELPHIA--A new recycling center has been built."

<ENAMEX TYPE="LOCATION">PHILADELPHIA</ENAMEX>--A new recycling center has been built.

1.3 Examples with apostrophe

"California's"

<ENAMEX TYPE="LOCATION">California</ENAMEX>'s

"Guinness' Schenley Industries"

<ENAMEX TYPE="ORGANIZATION">Guinness</ENAMEX>' <ENAMEX TYPE="ORGANIZATION">Schenley Industries</ENAMEX>

1.4 Examples with other punctuation

"(IBM)"

(<ENAMEX TYPE="ORGANIZATION">IBM</ENAMEX>)

""IBM stock fell today," he said" [note the double quote preceding IBM]

"<ENAMEX TYPE="ORGANIZATION">IBM</ENAMEX> stock fell today," he said

1.5 Examples with special characters

"US$10"

<ENAMEX TYPE="LOCATION">US</ENAMEX><NUMEX TYPE="MONEY">$10</NUMEX>
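In every example above, the markup opens and closes at a point where one token ends or another begins. Assuming that reading is intended, the following sketch checks an annotated string for tags that fall inside a token. The names `tag_offsets` and `tags_on_token_boundaries`, and the token pattern itself, are hypothetical and serve only this illustration. The pattern encodes the same inferred conventions as the examples (dotted abbreviations, "..." and "--" kept whole).

```python
import re

# Hypothetical token pattern (an assumption, not taken from the guidelines):
# dotted abbreviation, ellipsis, double hyphen, letters, number, or any
# other single non-space character.
TOKEN = re.compile(r"(?:[A-Za-z]\.){2,}|\.{3}|--|[A-Za-z]+|\d+(?:[.,]\d+)*|\S")

# Opening or closing ENAMEX / NUMEX tags, the two tag types used above.
TAG = re.compile(r"</?(?:ENAMEX|NUMEX)\b[^>]*>")


def tag_offsets(annotated):
    """Strip the tags; return the plain text and each tag's offset in it."""
    pieces, offsets = [], []
    last, length = 0, 0
    for match in TAG.finditer(annotated):
        chunk = annotated[last:match.start()]
        pieces.append(chunk)
        length += len(chunk)
        offsets.append(length)  # position of this tag in the plain text
        last = match.end()
    pieces.append(annotated[last:])
    return "".join(pieces), offsets


def tags_on_token_boundaries(annotated):
    """Return True if every tag sits at the start or end of some token."""
    plain, offsets = tag_offsets(annotated)
    boundaries = {0, len(plain)}
    for match in TOKEN.finditer(plain):
        boundaries.add(match.start())
        boundaries.add(match.end())
    return all(offset in boundaries for offset in offsets)


if __name__ == "__main__":
    good = '<ENAMEX TYPE="LOCATION">US</ENAMEX><NUMEX TYPE="MONEY">$10</NUMEX>'
    bad = '<ENAMEX TYPE="LOCATION">Chi</ENAMEX>cago-based'
    print(tags_on_token_boundaries(good))
    print(tags_on_token_boundaries(bad))
```

The first string passes because "US", "$", and "10" are separate tokens. The second fails because its closing tag falls in the middle of the token "Chicago".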


Tokenization Rules - 14 JUN 95