
Tokenization Rules

2 When a proper name or number contains an internal punctuation mark or other special character, the word containing that character is treated as just one token.

2.1 - Examples with period
2.1.1 - A period that marks an abbreviation is considered part of the abbreviation token, even when the abbreviation appears at the end of a sentence.
2.1.2 - A period used as a decimal marker is considered integral to the number token.
2.2 - Examples with hyphen or dash (see also section 3, below)
2.3 - Examples with slash
2.4 - Examples with other punctuation
2.5 - Examples with special characters

2.1 Examples with period

2.1.1 A period that marks an abbreviation is considered part of the abbreviation token, even when the abbreviation appears at the end of a sentence.

"U.K. industry"

<ENAMEX TYPE="LOCATION">U.K.</ENAMEX> industry

"Microtest Inc."

<ENAMEX TYPE="ORGANIZATION">Microtest Inc.</ENAMEX>

"Spokane, Wash."

<ENAMEX TYPE="LOCATION">Spokane</ENAMEX>, <ENAMEX TYPE="LOCATION">Wash.</ENAMEX>

"Limousines are manufactured in the U.K."

Limousines are manufactured in the <ENAMEX TYPE="LOCATION">U.K.</ENAMEX>

2.1.2 A period used as a decimal marker is considered integral to the number token.

"$5.10"

<NUMEX TYPE="MONEY">$5.10</NUMEX>
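
The two period rules in section 2.1 can be sketched in Python as follows. This is a minimal illustration, not part of the MUC-6 definition: the abbreviation list, the function names, and the decision to split off an ordinary sentence-final period are all assumptions made for the example. Only periods are handled; other punctuation such as commas is left attached to the word.

```python
import re

# Hypothetical abbreviation lexicon (an assumption); a real system
# would need a much larger list.
ABBREVIATIONS = {"U.K.", "Inc.", "Wash.", "Co.", "F."}

# A number with an internal decimal point and an optional leading "$",
# for example "$5.10" or "5.10".
DECIMAL = re.compile(r"\$?\d+\.\d+")


def split_trailing_period(word):
    """Return the token or tokens for one whitespace-delimited word."""
    if word in ABBREVIATIONS or DECIMAL.fullmatch(word):
        # Rules 2.1.1 and 2.1.2: the period is part of the token.
        return [word]
    if word.endswith(".") and len(word) > 1:
        # Assumed behavior: any other trailing period is its own token.
        return [word[:-1], "."]
    return [word]


def tokenize(text):
    """Split on whitespace, then apply the period rules to each word."""
    tokens = []
    for word in text.split():
        tokens.extend(split_trailing_period(word))
    return tokens
```

Under these assumptions, `tokenize("Limousines are manufactured in the U.K.")` keeps "U.K." as one token even though it ends the sentence, while `tokenize("It costs $5.10.")` keeps "$5.10" intact and splits off only the final period.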

2.2 Examples with hyphen or dash (see also section 3, below)

"F. Gregory Fitz-Gerald"

<ENAMEX TYPE="PERSON">F. Gregory Fitz-Gerald</ENAMEX>

"Prudential-Bache Securities"

<ENAMEX TYPE="ORGANIZATION">Prudential-Bache Securities</ENAMEX>

"Allen-Bradley Co. and Hewlett-Packard Co. have undertaken a joint marketing and development program linking A-B manufacturing automation equipment with HP Unix-based computers."

<ENAMEX TYPE="ORGANIZATION">Allen-Bradley Co.</ENAMEX> and <ENAMEX TYPE="ORGANIZATION">Hewlett-Packard Co.</ENAMEX> have... <ENAMEX TYPE="ORGANIZATION">A-B</ENAMEX>... <ENAMEX TYPE="ORGANIZATION">HP</ENAMEX>...

"one-hundred percent"

<NUMEX TYPE="PERCENT">one-hundred percent</NUMEX>

2.3 Examples with slash

"The venture will be called Quality Spring/Togo Inc."

The venture will be called <ENAMEX TYPE="ORGANIZATION">Quality Spring/Togo Inc.</ENAMEX>

"10/13/89"

<TIMEX TYPE="DATE">10/13/89</TIMEX>

2.4 Examples with other punctuation

"McDonald's burger company"

<ENAMEX TYPE="ORGANIZATION">McDonald's</ENAMEX> burger company

"'87"

<TIMEX TYPE="DATE">'87</TIMEX>

2.5 Examples with special characters

"S&P 500 Index"

<ENAMEX TYPE="ORGANIZATION">S&P</ENAMEX> 500 Index
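
The general rule of section 2, that a word containing an internal punctuation mark or special character is one token, can be approximated with a single regular expression. The sketch below is an assumption-laden illustration rather than the MUC-6 tokenizer: the pattern, the name `tokenize_internal`, and the restriction to ASCII letters and digits are choices made for the example. It does not consult an abbreviation list, so a trailing period as in "Inc." is split off here.

```python
import re

# One token is either:
#   - a run of letters/digits, optionally preceded by an apostrophe when
#     a digit follows (as in '87), and optionally continued by an internal
#     hyphen, slash, apostrophe, ampersand, or period plus more
#     letters/digits; or
#   - any other single non-space character.
TOKEN = re.compile(r"(?:'(?=\d))?[A-Za-z0-9]+(?:[-/'&.][A-Za-z0-9]+)*|\S")


def tokenize_internal(text):
    """Return tokens, keeping internal punctuation inside a word."""
    return TOKEN.findall(text)
```

With this pattern, "Prudential-Bache", "Spring/Togo", "10/13/89", "McDonald's", "'87", and "S&P" each come out as a single token, matching the examples in sections 2.2 through 2.5.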


Tokenization Rules - 14 JUN 95
