Named Entity Task Definition
For many text processing systems, such identifiers are recognized primarily using local pattern-matching techniques. The TEI (Text Encoding Initiative) Guidelines for Electronic Text Encoding and Interchange cover such identifiers (plus abbreviations) together in section 6.4 and explain that the identifiers comprise "textual features which it is often convenient to distinguish from their surrounding text. Names, dates and numbers are likely to be of particular importance to the scholar treating a text as source for a database; distinguishing such items from the surrounding text is however equally important to the scholar primarily interested in lexis."
The task is to identify all instances of the three types of expressions in each text in the test set and to subcategorize the expressions. The original texts contain some SGML tags already; the Named Entity task is to be performed within the text delimited by the TXT, HL, DATELINE, and DD tags. (Note, however, that the DD tag sometimes doesn't appear at all, sometimes appears once, and sometimes appears twice. When it appears twice, only the second instance is to be marked up for the Named Entity task.)
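As an illustration of the scoping rule above, the following is a minimal sketch of how a system might isolate the regions to be marked up. It assumes the tags appear without attributes and with explicit closing tags; the function name `task_regions` is hypothetical and is not part of the task definition.

```python
import re

# Elements whose content is in scope for the Named Entity task.
TASK_ELEMENTS = ("HL", "DD", "DATELINE", "TXT")


def task_regions(document):
    """Return (tag, content) pairs for the regions to be marked up.

    Assumption: tags carry no attributes and are explicitly closed.
    """
    regions = []
    for tag in TASK_ELEMENTS:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        matches = [m.group(1) for m in pattern.finditer(document)]
        if tag == "DD" and len(matches) == 2:
            # When DD appears twice, only the second instance is in scope.
            matches = matches[1:]
        regions.extend((tag, text.strip()) for text in matches)
    return regions
```

A document with no DD element simply yields no DD region, which matches the note that the tag is sometimes absent.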
The system must produce a single, unambiguous output for any relevant string in the text; thus, this evaluation is not based on a view of a pipelined system architecture in which Named Entity recognition would be completely handled as a preprocess to sentence and discourse analysis. The task requires that the system recognize what a string represents, not just its superficial appearance. Sometimes, the right answer is superficially apparent, as in the case of most, if not all, NUMEX expressions, and can be obtained by local pattern-matching techniques. In other cases, the right answer is not superficially apparent, as when a single capitalized word could represent the name of a location, person, or organization, and the answer may have to be obtained using techniques that draw information from a larger context or from reference lists.
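The "superficially apparent" case can be sketched with a pair of regular expressions. The patterns and the labels `percent` and `money` below are illustrative assumptions, not the categories or rules of the task definition; they cover only a few common surface forms.

```python
import re

# Hypothetical local patterns for numeric expressions. The labels are
# illustrative only; the actual TYPE values are defined by the markup section.
NUMERIC_PATTERNS = {
    "percent": re.compile(r"\d+(?:\.\d+)?\s?(?:%|percent)"),
    "money": re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?"),
}


def find_numeric_expressions(text):
    """Return (start, end, label, string) tuples sorted by position."""
    found = []
    for label, pattern in NUMERIC_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group(0)))
    return sorted(found)
```

No comparable pattern can decide whether a lone capitalized word names a location, a person, or an organization, which is why those cases call for wider context or reference lists.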
The three subtasks correspond to three SGML tag elements: ENAMEX, TIMEX, and NUMEX. The subcategorization is captured by an SGML tag attribute called TYPE, which is defined to have a different set of possible values for each tag element. The markup is described in section 2, below.
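To make the element-plus-attribute structure concrete, the sketch below reads entities back out of marked-up text. It assumes the TYPE value is written in double quotes and that elements are not nested; the TYPE values in the usage example are placeholders, since the permitted values are given in the markup section rather than here.

```python
import re

# Matches <ENAMEX TYPE="...">...</ENAMEX> and likewise for TIMEX and NUMEX.
# Assumption: TYPE is double-quoted and elements are not nested.
ENTITY_RE = re.compile(
    r'<(ENAMEX|TIMEX|NUMEX)\s+TYPE="([^"]+)">(.*?)</\1>', re.DOTALL
)


def extract_entities(marked_up):
    """Return (element, type, string) triples in document order."""
    return [
        (m.group(1), m.group(2), m.group(3))
        for m in ENTITY_RE.finditer(marked_up)
    ]
```

For example, `extract_entities('<ENAMEX TYPE="PERSON">John Smith</ENAMEX> left.')` yields a single triple whose element is `ENAMEX` and whose string is `John Smith`.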
Cumulative scores will be generated at several levels of description of the task, e.g.,
* across subtasks,
* for each subtask,
* for the subcategorization aspect of each subtask,
* for each part of the article that is included in the task (<HL>, <DATELINE>, <DD>, <TXT>).
1.2 Performance Evaluation
Scoring of this task will be done using the same kinds of metrics that are used for scoring template-filling (information extraction) tasks. For specific information on the scoring, refer to "MUC-6 Scoring System User's Manual," prepared for MUC-6 by SAIC.
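The official metrics are specified in the scoring manual cited above and are not reproduced here. Purely as an illustration, the sketch below assumes recall and precision computed over exact matches between an answer key and a system response, broken down by tag element. The function names and the tuple layout are hypothetical, and the real scorer may treat partial matches differently.

```python
def score(key, response):
    """Return (recall, precision) over exact matches.

    key, response: sets of (start, end, element, type) tuples.
    Assumption: only exact matches count; this is a simplification.
    """
    correct = len(key & response)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    return recall, precision


def score_by_element(key, response):
    """Return {element: (recall, precision)} for each tag element seen."""
    elements = {item[2] for item in key | response}
    return {
        e: score(
            {k for k in key if k[2] == e},
            {r for r in response if r[2] == e},
        )
        for e in elements
    }
```

Calling `score` on the full sets gives the across-subtask figure, while `score_by_element` gives one figure per subtask; the same filtering idea would extend to a breakdown by article part.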
Named Entity Task Definition - 02 JUN 95