Information Extraction Task Definition
/* Top-Level Object -- applies to Scenario Template subtask only */

<TEMPLATE> :=
    DOC_NR: "NUMBER"^
    CONTENT: <scenario-specific-object>*
    COMMENT: " "-

/* Template Element Objects -- apply to Template Element subtask; apply selectively to Scenario Template subtask */

<ORGANIZATION> :=
    ORG_NAME: "NAME"-
    ORG_ALIAS: "ALIAS"*
    ORG_DESCRIPTOR: "DESCRIPTOR"-
    ORG_TYPE: {GOVERNMENT, COMPANY, OTHER}^
    ORG_LOCALE: LOCALE-STRING {{LOC_TYPE}} *
    ORG_COUNTRY: NORMALIZED-COUNTRY | COUNTRY-STRING *
    ORG_NATIONALITY: NORMALIZED-COUNTRY-or-REGION | COUNTRY-or-REGION-STRING *
    OBJ_STATUS: {OPTIONAL}-
    COMMENT: " "-

<PERSON> :=
    PER_NAME: "NAME"^
    PER_ALIAS: "ALIAS"*
    PER_TITLE: "TITLE"*
    OBJ_STATUS: {OPTIONAL}-
    COMMENT: " "-

<ARTIFACT> :=
    ART_ID: "ID"-
    ART_DESCRIPTOR: "DESCRIPTOR"-
    ART_TYPE: {{scenario-specific-set-fill}}
    OBJ_STATUS: {OPTIONAL}-
    COMMENT: " "-

LOC_TYPE :: {CITY, PROVINCE, COUNTRY, REGION, UNK}

/* Template Element Slots -- Apply to Scenario Template subtask only. Valence is scenario-dependent. */

LOCALE: LOCALE-STRING {{LOC_TYPE}}
COUNTRY: COUNTRY | COUNTRY-STRING
DATE: {BEFORE, AFTER, ON} DATE-EXP | BETWEEN DATE-EXP DATE-EXP

DATE-EXP :: ([[01-31]]|{EA, MD, LT, EO, BO})[[01-12]][[00-99]YY]
          | {EA, MD, LT, EO, BO} {FA, WI, SP, SU, 1Q, 2Q, 3Q, 4Q, 1F, 2F, 3F, 4F, FY} [[00-99]YY]
          | {EA, MD, LT, EO, BO, FA, WI, SP, SU, 1Q, 2Q, 3Q, 4Q, 1F, 2F, 3F, 4F, FY} [[00-99]YY]
          | [[01-12]][[00-99]YY]
          | [[00-99]]
          | DESCRIPTOR
SET FILL.
To be filled in by selection from a prespecified list of categories defined in the fill rules for a given slot.
STRING FILL.
To be filled in with an exact copy of a text string from the article under analysis. The fill may be enclosed in double quotes, if desired. See the "Tokenization Rules" document for information on what counts as a word token in certain special cases.
NORMALIZED FILL.
To be filled with a text string that is converted to a canonical form in accordance with the fill rules for a given slot. The fill may be enclosed in double quotes, if desired.
INDEX FILL (POINTER).
To be filled with the index of an object, i.e., a pointer to an object. The fill is to be enclosed in angled brackets.
Since the Template Element subtask does not include the creation of pointers to the template element objects, the optionality of ORGANIZATION, PERSON, and ARTIFACT objects is indicated via the OBJ_STATUS slot within the optional object itself. The OBJ_STATUS slot is not used for the Scenario Template subtask.
The COMMENT slot may contain notes that the analyst wants to record concerning the answer key. The slot is not scored. (Analysts should avoid entering double quotes within the comment, as they will prevent the template-filling tool, Tabula Rasa, from being able to reload the template file.)
Lines within the <TXT> portion of the article that start with the "@" sign signify a table within the text and should NOT be annotated. (However, such lines may also appear within the <HL> portion of the article, and these should be annotated.)
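As a minimal sketch of the table-line rule above, the following Python function drops "@" lines from the <TXT> portion only. The function name and the assumption that the portion has already been split into a list of lines are illustrative and not part of the task definition.

```python
from typing import List


def annotatable_txt_lines(txt_lines: List[str]) -> List[str]:
    """Return the <TXT> lines that remain candidates for annotation.

    Assumption: txt_lines holds only lines from the <TXT> portion.
    Lines from the <HL> portion must not be passed through this filter,
    because "@" lines there are still to be annotated.
    """
    return [line for line in txt_lines if not line.startswith("@")]


sample = [
    "Acme Corp. said profits rose.",
    "@ Year  Revenue",
    "@ 1994  10.2",
    "It expects further growth.",
]
kept = annotatable_txt_lines(sample)
```

Here `kept` contains the first and last lines of `sample`; the two table lines are excluded.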
Top-level object. Applies to Scenario Template subtask only.
MINIMUM INSTANTIATION CONDITIONS:
For every Scenario Template, instantiate one TEMPLATE object.
Article identifier. To be copied from the DOCNO tagged string in the text. Normalize the string to remove any internal dashes, e.g., 870101-0001 becomes 8701010001. This slot is not scored; it is used only to assist people in reading the template.
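The normalization described above amounts to deleting the internal dashes. A minimal sketch in Python follows; the function name is hypothetical, and stripping surrounding whitespace is an added assumption about how the DOCNO string is read from the text.

```python
def normalize_doc_nr(docno: str) -> str:
    """Remove internal dashes from a DOCNO string to produce the DOC_NR fill.

    Assumption: surrounding whitespace left over from the tagged string
    is not meaningful and can be stripped.
    """
    return docno.strip().replace("-", "")


doc_nr = normalize_doc_nr("870101-0001")  # "8701010001"
```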
Pointer to object that captures info relevant to a given scenario. It is possible for CONTENT to have multiple values, corresponding to different relevant events described. Relevant events are defined as being different when the value of one slot in the scenario object is incompatible with the value of another.
MINIMUM INSTANTIATION CONDITIONS:
Depends on scenario definition.
Corporate, governmental, or other kind of organization.
MINIMUM INSTANTIATION CONDITIONS:
Text must refer to a particular organization and must provide fill for at least one of the following slots: ORG_NAME, ORG_DESCRIPTOR.
The proper name of the organization, including any corporate designators (see reference document titled "Table of Corporate Designator Abbreviations"). If a document contains more than one variant of the name, the ORG_NAME slot is to be filled with the most complete variant.
MINIMUM INSTANTIATION CONDITIONS:
The name must appear in the text.
SPECIAL USAGE NOTES:
1. This slot has a 0 or 1 valence to allow the situation where an unnamed organization participates in an event (or relation) of interest and is perhaps referenced only by a descriptive phrase.
2. If an organization is changing name, report the current name as ORG_NAME and the past or future name as ORG_ALIAS.
3. See "Named Entity Task Definition" for information on treatment of names such as "McDonald's of Japan."
Variant of the proper name entered in the ORG_NAME slot. There may be more than one value for this slot.
MINIMUM INSTANTIATION CONDITIONS:
The variant must appear explicitly in the text. This slot can be filled only if ORG_NAME is filled also.
SPECIAL USAGE NOTES:
1. Misspelled variants of the name reported in ORG_NAME are to be reported in ORG_ALIAS.
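The slot dependencies stated so far for ORGANIZATION objects (a fill for at least one of ORG_NAME and ORG_DESCRIPTOR; ORG_ALIAS only when ORG_NAME is filled) can be sketched as a consistency check. The function name and the error messages are illustrative assumptions; the check cannot decide whether the text actually refers to a particular organization.

```python
from typing import List, Optional, Sequence


def organization_slot_errors(
    org_name: Optional[str],
    org_aliases: Sequence[str],
    org_descriptor: Optional[str],
) -> List[str]:
    """Return a list of violated slot dependencies (empty if none)."""
    errors = []
    # Minimum instantiation: at least one of ORG_NAME, ORG_DESCRIPTOR.
    if not org_name and not org_descriptor:
        errors.append("need ORG_NAME or ORG_DESCRIPTOR")
    # ORG_ALIAS can be filled only if ORG_NAME is filled also.
    if org_aliases and not org_name:
        errors.append("ORG_ALIAS requires ORG_NAME")
    return errors
```

For example, an object with only a descriptor and an alias would be reported as violating the ORG_ALIAS dependency, while an object with a name and an alias would pass.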
Noun phrase describing or referring to an organization without naming it. This slot is not permitted to have more than one value.
MINIMUM INSTANTIATION CONDITIONS:
Text must provide a string that describes the organization and that does not fit the definition of the ORG_NAME slot. The string cannot be a pronoun, e.g., "it."
SPECIAL USAGE NOTES:
1. The answer key will provide alternative correct answers if the text supplies more than one substantive descriptor string. If the text provides one or more substantive descriptors in addition to an insubstantial one such as "the company" or "the organization," the answer key will contain only the substantive descriptors.
Categorization of organization as a corporate entity, a government entity, or some other kind of organizational entity.
MINIMUM INSTANTIATION CONDITIONS:
The ORG_TYPE fill should be based on evidence from the text or on world knowledge; the slot should never be left blank.
SPECIAL USAGE NOTES:
1. The categories that are to be used for ORG_TYPE are defined as follows:
COMPANY -- any profit-making or nonprofit legal (usually) entity, including universities, partnerships, corporations, proprietorships, consortiums, enterprises, government-owned corporations, etc.
GOVERNMENT -- the government of a country, state, municipality, etc., or government body such as a government ministry, agency, commission, or committee. In the case of a string such as "IBM announced a joint venture with China," report "China" as type GOVERNMENT unless there is evidence for a different type elsewhere in the text.
OTHER -- organizational entities that do not fit the above categories, such as "the Apache Indian tribe," "OPEC," "the Medellin cartel," "NATO."
Specific place where an organization is located. Only the most specific place is to be reported. (This will enable accurate, automatic scoring.) The literal string that appears in the text, plus a categorization of the place name, appear in this slot as a complex (two-part) fill.
MINIMUM INSTANTIATION CONDITIONS:
The locale must be specifically mentioned in the text. NOTE: Except in the case of organizations of type GOVERNMENT, the name itself is not to be used as a source of information for the ORG_LOCALE slot.
SPECIAL USAGE NOTES:
1. NAMES
a. The "MUC-6 Reference Gazetteer" does not contain an exhaustive list of the place names that may be used to fill the ORG_LOCALE slot, nor does it usually provide alternative spellings for place names. Use UNK as locale type only if the type cannot be determined from the text.
b. If the text provides only a relative locale such as "near Tokyo" or "60 miles from Tokyo", report "Tokyo" as ORG_LOCALE name.
2. TYPES
a. The location categories that are to be used for ORG_LOCALE are defined as follows:
CITY -- a town, city, port, suburb, or other local settlement
PROVINCE -- a state, province, island or similar subnational geographically or politically defined area
COUNTRY -- a nation, country, colony, federation of countries such as the Commonwealth of Independent States (the former USSR), or other similar national entity
REGION -- an international region such as Eastern Europe, the Pacific Rim, or the Malay Archipelago
UNK -- a location whose possible type cannot be identified from evidence in the text or from world knowledge
b. The "MUC-6 Reference Gazetteer" uses more location categories than are to be reported in ORG_LOCALE. The following mappings apply:
PORT and AIRPORT in gazetteer are to be reported as CITY in ORG_LOCALE.
ISLAND in gazetteer is to be reported as PROVINCE in ORG_LOCALE.
ISLAND-GROUP in gazetteer is to be reported as either PROVINCE (if part of a single country) or as REGION (if part of an international region).
CONTINENT in gazetteer is to be reported as REGION in ORG_LOCALE.
3. ORG_LOCALE vs ORG_NATIONALITY
a. When a candidate fill for the ORG_LOCALE slot is the name of a country or a reference to a country, there is potential ambiguity as to whether the fill belongs in ORG_LOCALE or in ORG_NATIONALITY. For the purpose of maintaining consistency of extraction by the human analysts, the following kinds of text expressions have been identified as some typical ones that occasion a fill in ORG_LOCALE. (For info on the distinct criteria for ORG_NATIONALITY, see 3.2.2.7.)
<org> "of" <country name> /* "Honda Inc. of America" */
<org> "in" <country name> /* "Honda Inc. in America" */
<org> "based in" <country name> /* "GM Corp., based in America", "the largest auto manufacturer based in America" */
<org> "headquartered in" <country name> /* "GM Corp., headquartered in America" */
<country name>"-based" <org> /* "the U.S.-based company", "U.S.-based Rockwell International" */
<country name> /* "Spain" [i.e., a country name that metonymically represents the government of that country] */
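The gazetteer-to-ORG_LOCALE category mappings listed in note 2b above can be sketched as a lookup. The function name, the `single_country` flag, and the fallback to UNK for unrecognized categories are assumptions made for illustration; the "MUC-6 Reference Gazetteer" itself is not consulted here.

```python
from typing import Optional

LOC_TYPES = {"CITY", "PROVINCE", "COUNTRY", "REGION", "UNK"}

# Direct mappings stated in the fill rules.
GAZETTEER_TO_LOC_TYPE = {
    "PORT": "CITY",
    "AIRPORT": "CITY",
    "ISLAND": "PROVINCE",
    "CONTINENT": "REGION",
}


def loc_type_from_gazetteer(category: str, single_country: Optional[bool] = None) -> str:
    """Map a gazetteer category to a LOC_TYPE value."""
    if category == "ISLAND-GROUP":
        # The rules require knowing whether the group belongs to one country.
        if single_country is None:
            raise ValueError("ISLAND-GROUP needs single_country=True or False")
        return "PROVINCE" if single_country else "REGION"
    if category in GAZETTEER_TO_LOC_TYPE:
        return GAZETTEER_TO_LOC_TYPE[category]
    if category in LOC_TYPES:
        return category
    # Assumption: anything unrecognized is treated as UNK.
    return "UNK"
```

The ISLAND-GROUP case is kept separate because the rules give it two possible outcomes, so the caller has to supply the distinguishing fact.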
The country in which ORG_LOCALE is located. A defining list of country names is contained in "MUC-6 Country and Region List." (This list contains only canonical forms. NLP system developers must define their own mappings from the "MUC-6 Reference Gazetteer" and/or other gazetteer resources to this list.)
MINIMUM INSTANTIATION CONDITIONS:
To be reported only if ORG_LOCALE is filled with a locale of type CITY, PROVINCE, COUNTRY or, in some cases, UNK. Fill is to be inferred, if necessary.
SPECIAL USAGE NOTES:
1. If ORG_LOCALE is filled in by a name of type COUNTRY, report the country name in this slot as a normalized form drawn from "MUC-6 Country and Region List".
2. Note that the "MUC-6 Country and Region List" may not contain a complete list of countries. If a canonical form for the name of the country does not appear on the list, report the name in noun or adjective form (whichever appears in the text) as a string fill.
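The choice between a normalized fill and a string fill described in the two notes above can be sketched as follows. The mapping shown is a hypothetical stand-in for a developer-defined mapping onto the "MUC-6 Country and Region List"; its entries, the lowercase lookup keys, and the function name are assumptions, and the double quotes around the string fill are optional under the string-fill rules.

```python
from typing import Dict

# Hypothetical mapping from forms seen in text (lowercased) to canonical forms.
CANONICAL_COUNTRIES: Dict[str, str] = {
    "switzerland": "SWITZERLAND",
    "swiss": "SWITZERLAND",
    "united states": "UNITED STATES",
}


def org_country_fill(text_form: str, canonical: Dict[str, str]) -> str:
    """Return a normalized fill if a canonical form is known, else a string fill."""
    normalized = canonical.get(text_form.lower())
    if normalized is not None:
        return normalized
    # No canonical form on the list: report the noun or adjective form
    # exactly as it appears in the text, as a string fill.
    return f'"{text_form}"'
```

With this mapping, "Swiss" yields the normalized fill SWITZERLAND, while a form that is missing from the mapping is returned verbatim in double quotes.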
The name of the home country or home region of an organization. A defining list of country names is contained in "MUC-6 Country and Region List." (This list contains only canonical forms. NLP system developers must define their own mappings from the "MUC-6 Reference Gazetteer" and/or other gazetteer resources to this list.)
MINIMUM INSTANTIATION CONDITIONS:
Text must specify the nationality, which often is done in phrases such as "the Japanese automaker," "Indonesia's largest electronics retailer," or "a US auto maker." Except in the case of organizations of type GOVERNMENT, the name (or alias) itself is not to be used as a source of information for the ORG_LOCALE slot. May be filled in by inference from the text or inference based on knowledge of geography, e.g., that Zurich is in Switzerland. Not to be filled in on the basis of general world knowledge alone. For example, if the text mentions "Swiss" or "Zurich" in the appropriate context, fill in SWITZERLAND, but if the article mentions "Boeing" without allusions to nationality, do not fill in UNITED STATES.
SPECIAL USAGE NOTES:
1. This slot may be filled with the name of type REGION rather than by a name of type COUNTRY if the text provides a general reference such as "Eastern European" or "Asian."
2. Note that the "MUC-6 Country and Region List" may not contain a complete list of countries and regions. If a canonical form for the name of the country or region does not appear on the list, report the name in noun or adjective form (whichever appears in the text) as a string fill.
3. As a default, assume that "American" refers to "United States."
4. There is potential ambiguity as to whether a text expression should be captured in ORG_LOCALE rather than in ORG_NATIONALITY. For the purpose of maintaining consistency of extraction by the human analysts, the following kinds of text expressions have been identified as some typical ones that occasion a fill in ORG_NATIONALITY. Note that the ORG_NATIONALITY fill is a premodifier in all cases. (For info on the distinct criteria for ORG_LOCALE, see 3.2.2.5.)
<country name> <org> /* "the Indonesia company", "the U.S. Government" */
<country name>"'s" <org> /* "Indonesia's company", "Spain's government", "U.S.'s Monsanto Inc." */
<nationality expressed in adjective form> <org> /* "the Spanish company", "the Spanish government" */
"the domestic" <org> /* "the domestic company" [requiring inference of referent] */
"the nation's" <org> /* "the nation's largest carrier" [requiring inference of referent] */
An (unincorporated) person or family.
MINIMUM INSTANTIATION CONDITIONS:
Text must supply fill for PER_NAME slot. The guidelines for instantiating a PERSON object are the same as the guidelines given in "Named Entity Task Definition" for annotating person names.
The proper name of the person or family.
MINIMUM INSTANTIATION CONDITIONS:
The text must supply a person or family name.
Variant of the proper name reported in the PER_NAME slot. There may be more than one value for this slot.
MINIMUM INSTANTIATION CONDITIONS:
The variant must appear explicitly in the text. This slot can be filled only if PER_NAME is filled also.
SPECIAL USAGE NOTES:
1. Misspelled variants of the name reported in PER_NAME are to be reported in PER_ALIAS.
An innate title such as "Dr." or "Ms.," as distinct from a person's role such as "President" or "CEO." (The latter would be captured by a scenario-specific template element such as a Relational object.)
MINIMUM INSTANTIATION CONDITIONS:
To be reported only if PER_NAME is filled. The text must explicitly mention the person's title.
A product or natural commodity. The nature of the specific artifact(s) to be reported is task-dependent and is therefore defined for a given Scenario Template subtask in the scenario task documentation.
MINIMUM INSTANTIATION CONDITIONS:
The text must supply a fill for at least one of the following slots: ART_ID, ART_DESCRIPTOR.
A unique identifier for the artifact.
MINIMUM INSTANTIATION CONDITIONS:
Depends on scenario definition.
Noun phrase describing or referring to an artifact without naming it. This slot is not permitted to have more than one value.
MINIMUM INSTANTIATION CONDITIONS:
Text must provide a string that describes the artifact and that does not fit the definition of the ART_ID slot. The string cannot be a pronoun, e.g., "it."
SPECIAL USAGE NOTES:
1. The answer key will provide alternative correct answers if the text supplies more than one substantive descriptor string. If the text provides only uninformative descriptors, e.g., "the product," the fills in the answer key will all be marked as optional.
A categorization of the artifact. Inventory of categories depends on scenario definition.
MINIMUM INSTANTIATION CONDITIONS:
Depends on scenario definition.
Task-independent slots (location and time data) that are separate from the predefined objects. They may be defined selectively for a given scenario, e.g., to provide the location and time of an event.
MINIMUM INSTANTIATION CONDITIONS:
Depends on scenario definition.
SPECIAL USAGE NOTES:
1. These slots will not be part of the Template Element evaluation. Instead, one or more of them may play a role in one or more Scenario Template subtasks. In such cases, their role will be defined in the scenario task documentation.
Specific locale of an entity or event.
MINIMUM INSTANTIATION CONDITIONS:
Depends on scenario definition.
Country locale of an entity or event.
MINIMUM INSTANTIATION CONDITIONS:
Depends on scenario definition.
An absolute or relative date or date range.
MINIMUM INSTANTIATION CONDITIONS:
Depends on scenario definition.
SPECIAL USAGE NOTES:
1. The YY option and DESCRIPTOR option are to be used only if the article contains no DD tags. Use YY if only a partial date is given in the text, e.g., "on 27 March;" the output of extraction for that example would be "ON 2703YY". Use descriptor string option if a time phrase is used that cannot be represented in the usual date format; for example, "last week" ("ON last week") or "Tuesday" ("ON Tuesday").
2. See separate documentation titled "A Revised Template Description for Time (v3)" and "Supplement to Time Treatment Used for MUC-5" for further information.
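As a sketch of the first DATE-EXP alternative in the template BNF (day, month, then a two-digit year or the literal YY), the function below builds the expression from numeric parts. The function name is hypothetical, the other DATE-EXP alternatives are not covered, and the check on DD tags that governs use of the YY option is left to the caller.

```python
from typing import Optional


def date_exp(day: int, month: int, year: Optional[int] = None) -> str:
    """Build a DDMMYY date expression; use the literal YY if the year is unknown."""
    if not 1 <= day <= 31:
        raise ValueError("day must be in 01-31")
    if not 1 <= month <= 12:
        raise ValueError("month must be in 01-12")
    yy = "YY" if year is None else f"{year % 100:02d}"
    return f"{day:02d}{month:02d}{yy}"


fill = "ON " + date_exp(27, 3)  # "ON 2703YY", matching the example in note 1
```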
3.1.2.2 Object Identifiers
All objects are identified by the object name (from the template BNF), the document number (from the DOCNO tag in the text), and a one-up number; a dash is used to separate those three elements. For Wall Street Journal articles, the dash internal to the value of DOCNO must be suppressed; thus, a valid ORGANIZATION object identifier for DOCNO 891026-0100 would be <ORGANIZATION-8910260100-1>.
3.1.2.3 Notation Reserved for Use in Answer Keys
Legitimate ambiguity or vagueness in the text is reflected in the answer key by the presence of alternative acceptable fills. The "/" notation is reserved for this use; such fills are *not* to be generated by the system under evaluation. The notation allows the answer key to present alternate acceptable single fills for a slot, alternate sets of fills for a slot, optional fills (one fill or zero fills), and combinations thereof. An object is treated as optional if all pointers to it are either optional or in a list of alternatives.
3.2 Fill Rules
The input text contains some SGML tags, including TXT; the IE task is to be performed on the text delimited by the TXT, HL, DATELINE, and DD tags. (Note, however, that the DD tag sometimes doesn't appear at all, sometimes appears once, and sometimes appears twice.)
3.2.1 TEMPLATE Object
DEFINITION:
3.2.1.1 DOC_NR Slot
DEFINITION:
3.2.1.2 CONTENT Slot
DEFINITION:
3.2.2 ORGANIZATION Object
DEFINITION:
3.2.2.1 ORG_NAME Slot
DEFINITION:
3.2.2.2 ORG_ALIAS Slot
DEFINITION:
3.2.2.3 ORG_DESCRIPTOR Slot
DEFINITION:
3.2.2.4 ORG_TYPE Slot
DEFINITION:
3.2.2.5 ORG_LOCALE Slot
DEFINITION:
3.2.2.6 ORG_COUNTRY Slot
DEFINITION:
3.2.2.7 ORG_NATIONALITY Slot
DEFINITION:
3.2.3 PERSON Object
DEFINITION:
3.2.3.1 PER_NAME Slot
DEFINITION:
3.2.3.2 PER_ALIAS Slot
DEFINITION:
3.2.3.3 PER_TITLE Slot
DEFINITION:
3.2.4 ARTIFACT Object
DEFINITION:
3.2.4.1 ART_ID Slot
DEFINITION:
3.2.4.2 ART_DESCRIPTOR Slot
DEFINITION:
3.2.4.3 ART_TYPE Slot
DEFINITION:
3.2.5 Template Element Slots
DEFINITION:
3.2.5.1 LOCALE Slot
DEFINITION:
3.2.5.2 COUNTRY Slot
DEFINITION:
3.2.5.3 DATE Slot
DEFINITION:
Information Extraction Task Definition - 14 JUN 95