File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/m98-1028_metho.xml
Size: 22,443 bytes
Last Modified: 2025-10-06 14:14:49
<?xml version="1.0" standalone="yes"?> <Paper uid="M98-1028"> <Title>MUC-7 Named Entity Task Definition</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. TASK OVERVIEW </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Markup Description </SectionTitle> <Paragraph position="0"> The output of the systems to be evaluated will be in the form of SGML text markup. The only insertions allowed during tagging are tags enclosed in angled brackets. No extra whitespace or carriage returns are to be inserted; otherwise, the offset count would change, which would adversely affect scoring.</Paragraph> <Paragraph position="1"> The markup will have the following form: The markup is defined in SGML Document Type Descriptions (DTDs), written for MUC-7 use and maintained by personnel at SAIC. The DTDs enable annotators and system developers to use SGML validation tools to check the correctness of the SGML-tagged texts produced by the annotator or the system. 
The validation tools are available to MUC-7 participants in the file called muc7-sgml-tools and in the form of the scorer's parser, both available via anonymous ftp from ftp.muc.saic.com (or online.muc.saic.com) under the MUC subdirectory.</Paragraph> <Paragraph position="2"> Annotators are using a software tool provided for MUC-7 and MET-2 by SRA Corporation to assist in generating the answer keys to be used for system training and testing.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Named Entities (ENAMEX tag element) </SectionTitle> <Paragraph position="0"> This subtask is limited to proper names, acronyms, and perhaps miscellaneous other unique identifiers, which are categorized via the TYPE attribute as follows: ORGANIZATION: named corporate, governmental, or other organizational entity; PERSON: named person or family; LOCATION: name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Temporal Expressions (TIMEX tag element) </SectionTitle> <Paragraph position="0"> This subtask is for &quot;absolute&quot; and &quot;relative&quot; temporal expressions only; explanation is provided in appendix B.</Paragraph> <Paragraph position="1"> The tagged tokens are categorized only via the TYPE attribute as follows: DATE: complete or partial date expression; TIME: complete or partial expression of time of day. The TYPE attribute does not distinguish &quot;absolute&quot; and &quot;relative&quot; temporal expressions from each other.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Number Expressions (NUMEX tag element) </SectionTitle> <Paragraph position="0"> This subtask is for two useful types of numeric expressions, monetary expressions and 
percentages. The numbers may be expressed in either numeric or alphabetic form.</Paragraph> <Paragraph position="1"> The task covers the complete expression, which is categorized via the TYPE attribute as follows: MONEY: monetary expression PERCENT: percentage</Paragraph> </Section> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. NOTATION RESERVED FOR USE IN THE ANSWER KEYS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Expressing Alternative Attribute Values </SectionTitle> <Paragraph position="0"> A vertical bar is being used to separate alternative TYPE attribute values in the answer key.</Paragraph> <Paragraph position="1"> Alternative values will be given when the annotator does not have enough information to make a unique categorization, even considering the context and the annotator's knowledge of the world.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Expressing Optional Markup (STATUS Attribute) </SectionTitle> <Paragraph position="0"> When it is not certain that a string should be marked up, the annotator will include the STATUS attribute in the markup to indicate that the markup is optional. The only value of the STATUS attribute is &quot;OPT.&quot; Examples of its possible use can be found in the appendices, such as in appendix B (holiday names).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Expressing Alternative or Minimum String Boundaries (ALT or MIN Attribute) </SectionTitle> <Paragraph position="0"> The ALT or MIN attribute will be used when the tagged string contains one or more substrings that should be considered correct for the purposes of scoring the system response. 
Certain premodifiers (&quot;a,&quot; &quot;an,&quot; and &quot;the&quot;) are automatically ignored by the scoring program (via the &quot;configuration&quot; file) and thus do not need to be specially marked in the key.</Paragraph> <Paragraph position="1"> ALT was the original term used for this attribute in past NE definitions and scorers, but the markup tool calls it MIN. Either term is appropriate because the alternative string is smaller than the tagged string in all cases.</Paragraph> <Paragraph position="2"> The ALT or MIN attribute will be used sparingly. A possible TIMEX example is shown below.</Paragraph> <Paragraph position="3"> &quot;all of 1987&quot; <TIMEX TYPE=&quot;DATE&quot; ALT=&quot;1987&quot;>all of 1987</TIMEX></Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. GUIDELINES FOR MARKUP OF EXCEPTIONAL CONSTRUCTIONS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Conjunction and Elision in Multi-name, Multi-modifier, and Numeric Range Expressions </SectionTitle> <Paragraph position="0"> Conjoined named entities in general are marked separately except for those in the following categories. All cases in these categories are tagged as *single* expressions.</Paragraph> <Paragraph position="1"> A conjoined multi-name expression, in which there is elision of the head of one conjunct, should be marked up as a single expression.</Paragraph> <Paragraph position="2"> &quot;North and South America&quot; <ENAMEX TYPE=&quot;LOCATION&quot;>North and South America</ENAMEX> A similar case occurs with elision in multi-number expressions: &quot;10- and 20-dollar bills&quot; (i.e. 
10-dollar bills and 20-dollar bills) <NUMEX TYPE=&quot;MONEY&quot;>10- and 20-dollar</NUMEX> bills A single-name expression containing conjoined modifiers with no elision also should be marked up as a single expression.</Paragraph> <Paragraph position="3"> &quot;U.S. Fish and Wildlife Service&quot; (which does NOT mean two entities, i.e. &quot;the U.S. Fish Service and the U.S.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Effects of Tokenization Conventions </SectionTitle> <Paragraph position="0"> The systems must incorporate certain tokenization conventions. These conventions are contained in a separate document titled &quot;Tokenization Rules.&quot; The tokenization conventions for MUC-7 have an impact on the boundaries of the strings to be tagged. For example, the conventions call for treating possessive forms, e.g., &quot;California's,&quot; as multiple tokens, unless there is a name such as &quot;McDonald's [burger company]&quot; that is inherently possessive. See the separate documentation titled &quot;Tokenization Rules&quot; for further information and examples.</Paragraph> <Paragraph position="1"> Some sources use special characters that end up within the marked string because they are contiguous with it, although a reader would ignore them. For example, in the Wall Street Journal an @ appears at the beginning of some lines in the headline. In the New York Times News Service articles there are some codes such as &quot;&MD;&quot; which are not always separated by white space from their environment. These will generally be marked up, and the scorer will not be able to delete them because of the segmentation problem.</Paragraph> <Paragraph position="2"> Although such cases are infrequent, the rule we follow will be to include these characters if they are string-internal and to exclude them otherwise. 
It is unlikely that scores will be seriously affected, so the scorer will not specially treat these codes.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Nested Expressions </SectionTitle> <Paragraph position="0"> No nested expressions will be marked. Even in cases where LOCATION (ENAMEX) expressions occur within TIMEX and NUMEX expressions, they are not to be tagged. Also, entity names that appear within ENAMEX tags are *not* to be tagged.</Paragraph> <Paragraph position="1"> &quot;8:24 a.m. Chicago time&quot; <TIMEX TYPE=&quot;TIME&quot;>8:24 a.m. Chicago time</TIMEX> &quot;U.S. $10 million&quot; <NUMEX TYPE=&quot;MONEY&quot;>U.S. $10 million</NUMEX> &quot;the U.S. Customs Service&quot; the <ENAMEX TYPE=&quot;ORGANIZATION&quot;>U.S. Customs Service</ENAMEX> 5. APPENDICES 5.1 Naming Conventions for Section Headlines in the Appendices 1. An &quot;Entity (/Temporal/Numeric)-Expression&quot; identifies something that MUST be tagged; 2. An &quot;Entity-String&quot; identifies something that MIGHT be tagged, but not in the context described; 3. A &quot;Non-entity&quot; identifies something that is NEVER tagged, according to current MUC/MET conventions. APPENDIX A. ENAMEX: SPECIFIC GUIDELINES A.1 Guidelines That Pertain to All Three TYPEs (PERSON, LOCATION, and ORGANIZATION) A.1.1 Entity-Expressions that Modify Non-entities Entity names used as modifiers in complex NPs that are not proper names are to be tagged when it is clear to the annotator from context or the annotator's knowledge of the world that the name is that of an organization, person, or location.</Paragraph> <Paragraph position="2"> In some cases, multi-word strings that are proper names will contain entity name substrings; such strings are not decomposable; therefore, the substrings are not to be tagged. (See A.1.2 re special cases involving prenominal modifiers of person identifiers.) 
based on the name of a unique structure or facility in which the organization holds office. The association between the name and the organization should be idiosyncratic enough to justify its inclusion in the dictionary definition of the term (in contrast with &quot;common&quot; metonyms, discussed below), as a kind of nickname for the organization. Some examples follow.</Paragraph> <Paragraph position="3"> &quot;The White House announced ...&quot; [alias for the U.S. president's executive organization] The <ENAMEX TYPE=&quot;ORGANIZATION&quot;>White House</ENAMEX> announced ... &quot;The Pentagon announced ...&quot; The <ENAMEX TYPE=&quot;ORGANIZATION&quot;>Pentagon</ENAMEX> announced ... Taggable aliases will NOT include the following forms of entity names: * Common nouns, including pronouns, used in anaphoric reference to taggable entity names, such as &quot;IBM announced that the company would lay off ...&quot; [no markup for &quot;the company&quot;] * Aliases that refer to broad industrial sectors, political power centers, etc., rather than to specific organizations. For example, do not tag &quot;Wall Street&quot; as an alias for the U.S. stock market, &quot;Japan Incorporated&quot; as an alias for Japanese industries, &quot;Uncle Sam&quot; and &quot;Washington&quot; as aliases for the U.S. government, or &quot;Capitol Hill&quot; as an alias for the Congress, since these do not refer to specific organizations. The &quot;Ivy League&quot; refers to a specific set of universities, but does not seem to be a specific organization in its own right. 
Similarly, the &quot;Axis&quot; (WWII Germany-Japan-Italy) and the &quot;Iron Curtain countries&quot; are aliases for finite sets of entities, but not for specific organizations with corporation-like infrastructures.</Paragraph> <Paragraph position="4"> * Metonyms, herein designated &quot;common&quot; metonyms, that reference political, military, athletic, and other organizations by the name of a city, country, or other associated location. In these cases, the association between the name's semantic type and the organization is sufficiently predictable and non-idiosyncratic as to preclude a dictionary gloss; hence the name should be tagged as a LOCATION. Some examples of &quot;common&quot; metonyms follow.</Paragraph> <Paragraph position="5"> &quot;Germany invaded Poland in 1939.&quot; <ENAMEX TYPE=&quot;LOCATION&quot;>Germany</ENAMEX> invaded ...</Paragraph> <Paragraph position="6"> &quot;Baltimore defeated the Yankees by a score of 4 to 3.&quot;</Paragraph> <Paragraph position="7"> <ENAMEX TYPE=&quot;LOCATION&quot;>Baltimore</ENAMEX> defeated the <ENAMEX TYPE=&quot;ORGANIZATION&quot;>Yankees</ENAMEX> ...</Paragraph> <Paragraph position="8"> Note that links from LOCATION-tagged names to organizations (e.g. &quot;Baltimore&quot; to the &quot;Baltimore Orioles&quot; baseball team) are left to occur, along with anaphora-resolution, at a processing level higher than Named Entity tagging.</Paragraph> <Paragraph position="9"> A.1.6.1 Quotation Marks Around an Alias Quotes are included in the tag if they appear within a person's name.</Paragraph> <Paragraph position="10"> [no markup, not even for &quot;Dow Jones&quot;] Note that, just as in A.1.1, entity names used as modifiers in complex NPs that are not to be marked as entities are to be tagged when it is clear to the annotator from context or the annotator's knowledge of the world that the name is that of an organization, person, or location. 
More specifically, in cases where the manufacturer and the product are both named, the manufacturer will be tagged. The product will not be tagged. However, the scorer ignores corporate designators as listed in the configuration file, so the scoring is lenient in this respect. It is possible that at a later date only partial credit will be given if the existing corporate designator is not included in the markup.</Paragraph> <Paragraph position="11"> Miscellaneous types of proper names that are to be tagged as ORGANIZATION include stock exchanges, multinational organizations, political parties, orchestras, unions, non-generic governmental entity names such as &quot;Congress&quot; or &quot;Chamber of Deputies&quot;, sports teams and armies (unless designated only by country names, which are tagged as LOCATION), The scorer does ignore a list of premodifiers that it is given in its configuration file because some of these articles can be included unconsciously during human markup and systems should not be penalized. The answer keys are not consistent with respect to including or excluding articles, and the examples in this document reflect the inconsistency of humans. 
However, the scoring ignores these premodifiers.</Paragraph> <Paragraph position="12"> countries, provinces, counties, cities, regions, districts, towns, villages, neighborhoods, airports, highways, street names, street addresses, oceans, seas, straits, bays, channels, sounds, rivers, islands, lakes, national parks, mountains, fictional or mythical locations, and monumental structures, such as the Eiffel Tower and Washington Monument, that were built primarily as monuments.</Paragraph> <Paragraph position="13"> &quot;flew to Plymouth Airport&quot; flew to <ENAMEX TYPE=&quot;LOCATION&quot;>Plymouth Airport</ENAMEX> &quot;created a backup at O'Hare International Airport&quot; created a backup at <ENAMEX TYPE=&quot;LOCATION&quot;>O'Hare International Airport</ENAMEX> If the name of the airport refers to the organization or business of the airport and not its location or facilities, then it is still marked as a LOCATION.</Paragraph> <Paragraph position="14"> <ENAMEX TYPE=&quot;ORGANIZATION&quot;>Massport</ENAMEX>, which owns and operates <ENAMEX TYPE=&quot;LOCATION&quot;>Logan</ENAMEX>, defended the attempts ...</Paragraph> <Paragraph position="15"> A.4.1 Embedded Locative Entity-Strings and Conjoined Locative Entity-Expressions The phrase &quot;of &lt;place-name&gt;&quot; following an organization name may or may not be part of the organization name proper. The annotation in the answer key will follow these guidelines: (1) If there is a corporate designator, it marks the end of the organization name; (2) if there is no corporate designator, the &quot;of &lt;place-name&gt;&quot; is part of the organization name.</Paragraph> <Paragraph position="16"> Designators that are integrally associated with a place name are to be tagged as part of the name. For example, include in the tagged string the word &quot;River&quot; in the name of a river, &quot;Mountain&quot; in the name of a mountain, &quot;City&quot; in the name of a city, etc., if such words are contained in the string. 
Note that, due to the political significance of the Jordan River's west bank, the term &quot;West Bank&quot; may, in the context of discussions about the Middle East, assume the status of a named entity expression. A similar example is the term &quot;Left Bank&quot; (of the Seine River) as a name for an area of Paris. Use context and world knowledge to determine whether such a term is being used as a specifying non-entity following a place name, or as an entity expression (a proper noun) representing a particular LOCATION.</Paragraph> <Paragraph position="17"> Do not tag names of sub-national regions when referenced only by compass-point modifiers. Do not tag &quot;the South&quot; or the &quot;mid-West&quot;, analogies to &quot;the Middle East&quot; notwithstanding, because, unlike the latter term, their referential value varies from country to country. For example, &quot;the Southwest region&quot; [no markup] Do tag names of sub-national regions when they are associated with specific regions, if they are identifiable even when the name is disassociated from context. Examples include &quot;the Ruhr&quot;, &quot;the Auvergne&quot;, and &quot;Amazonia&quot;. 
Note that these names generally straddle, or lie within, geo-political jurisdictions such as states or provinces.</Paragraph> <Paragraph position="18"> A.4.6 Time and Space Modifiers of Locative Entity Expressions Historic-time modifiers (&quot;former&quot;, &quot;present-day&quot;) and directional modifiers (&quot;north&quot;, &quot;south&quot;, &quot;east&quot;, &quot;west&quot;, &quot;upper&quot;, &quot;lower&quot;, and combinations thereof) are taggable only when they are intrinsic parts of a location's official name, as in &quot;Upper Volta&quot; or &quot;North Dakota.&quot; Do not include them in tagged expressions when used as ad hoc modifiers that are readily separable from the name.</Paragraph> <Paragraph position="19"> Contrast &quot;Premier of the former Soviet Union&quot; and &quot;formerly Premier of the Soviet Union&quot;; &quot;east Baltimore&quot; and &quot;eastern section of Baltimore&quot;; and &quot;Upper Volta&quot; and &quot;upper section of Volta&quot; to see the separability of these modifiers.</Paragraph> <Paragraph position="20"> APPENDIX B. TIMEX: SPECIFIC GUIDELINES</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> B.1 Introduction </SectionTitle> <Paragraph position="0"> Both &quot;absolute&quot; time expressions and certain &quot;relative&quot; time expressions, as specified below (B.1.2), are to be tagged in MUC-7. Note that the tag itself does not differentiate between &quot;absolute&quot; and &quot;relative&quot; types, i.e., all time expressions are labeled with the same type of tag. The salient feature of the time expressions that are marked is that, whether absolute or relative, they can be anchored on a timeline; unanchored durations, for example, are not marked.</Paragraph> <Paragraph position="1"> The TIME sub-type is defined as a temporal unit shorter than a full day, such as second, minute, or hour. The DATE sub-type is a temporal unit of a full day or longer. 
Both DATE and TIME expressions may be either absolute or relative. Absolute and relative times alike are tagged as TIME, and absolute and relative dates alike are tagged as DATE.</Paragraph> <Paragraph position="2"> Temporal expressions are to be tagged as a single item. Contiguous subparts (month/day/year) are not to be separately tagged unless they are taggable expressions of two distinct TIMEX sub-types (date followed by time or time followed by date).</Paragraph> <Paragraph position="3"> Determiners that introduce the expressions are not to be tagged. Words or phrases modifying the expressions (such as &quot;around&quot; or &quot;about&quot;) also will not be tagged. Only the actual temporal expression itself is to be tagged. &quot;around the 4th of May&quot; around the <TIMEX TYPE=&quot;DATE&quot;>4th of May</TIMEX> &quot;shortly after the 4th of May&quot; shortly after the <TIMEX TYPE=&quot;DATE&quot;>4th of May</TIMEX></Paragraph> </Section> </Paper>