File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/a97-1030_metho.xml
Size: 29,642 bytes
Last Modified: 2025-10-06 14:14:30
<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1030"> <Title>Categorizing and standardizing proper nouns for efficient information retrieval, In B. Boguraev and</Title> <Section position="4" start_page="202" end_page="203" type="metho"> <SectionTitle> 2 The Ambiguity of Proper Names </SectionTitle> <Paragraph position="0"> Name identification requires resolution of a subset of the types of structural and semantic ambiguities encountered in the analysis of nouns and noun phrases (NPs) in natural language processing. Like common nouns ((Jensen and Binot 1987), (Hindle and Rooth 1993) and (Brill and Resnick 1994)), proper names exhibit structural ambiguity in prepositional phrase (PP) attachment and in conjunction scope.</Paragraph> <Paragraph position="1"> A PP may be attached to the preceding NP and form part of a single large name, as in NP[Midwest Center PP[for NP[Computer Research]]]. Alternatively it may be independent of the preceding NP, as in NP[Carnegie Hall] PP[for NP[Irwin Berlin]], where for separates two distinct names, Carnegie Hall and Irwin Berlin.</Paragraph> <Paragraph position="2"> As with PP-attachment of common noun phrases, the ambiguity is not always resolved, even in human sentence parsing (cf. the famous example I saw the girl in the park with the telescope). The location of an organization, for instance, could be part of its name (City University of New York) or an attached modifier (The Museum of Modern Art in New York City). Without knowledge of the official name, it is sometimes difficult to determine the exact boundaries of a proper name. Consider examples such as Western Co. of North America, Commodity Exchange in New York and Hebrew University in Jerusalem, Israel.</Paragraph> <Paragraph position="3"> Proper names contain ambiguous conjoined phrases. 
The components of Victoria and Albert Museum and IBM and Bell Laboratories look identical; however, and is part of the name of the museum in the first example, but a conjunction joining two computer company names in the second. Although this problem is well known, a search of the computational literature shows that few solutions have been proposed, perhaps because the conjunct ambiguity problem is harder than PP attachment (though see (Agarwal and Boggess 1992) for a method of conjunct identification that relies on syntactic category and semantic label).</Paragraph> <Paragraph position="4"> Similar structural ambiguity exists with respect to the possessive pronoun, which may indicate a relationship between two names (e.g., Israel's Shimon Peres) or may constitute a component of a single name (e.g., Donoghue's Money Fund Report).</Paragraph> <Paragraph position="5"> The resolution of structural ambiguity such as PP attachment and conjunction scope is required in order to automatically establish the exact boundaries of proper names. Once these boundaries have been established, there is another type of well-known structural ambiguity, involving the internal structure of the proper name. For example, Professor of Far Eastern Art John Blake is parsed as [[Professor [of Far Eastern Art]] John Blake] whereas Professor Art Klein is [[Professor] Art Klein].</Paragraph> <Paragraph position="6"> Proper names also display semantic ambiguity.</Paragraph> <Paragraph position="7"> Identification of the type of proper nouns resembles the problem of sense disambiguation for common nouns where, for instance, state taken out of context may refer either to a government body or the condition of a person or entity. A name variant taken out of context may be one of many types, e.g., Ford by itself could be a person (Gerald Ford), an organization (Ford Motors), a make of car (Ford), or a place (Ford, Michigan). 
Entity-type ambiguity is quite common, as places are named after famous people and companies are named after their owners or locations. In addition, naming conventions are sometimes disregarded by people who enjoy creating novel and unconventional names. A store named Mr.</Paragraph> <Paragraph position="8"> Tall and a woman named April Wednesday (McDonald 1993) come to mind.</Paragraph> <Paragraph position="9"> Like common nouns, proper nouns exhibit systematic metonymy: United States refers either to a geographical area or to the political body which governs this area; Wall Street Journal refers to the printed object, its content, and the commercial entity that produces it.</Paragraph> <Paragraph position="10"> In addition, proper names resemble definite noun phrases in that their intended referent may be ambiguous. The man may refer to more than one male individual previously mentioned in the discourse or present in the non-linguistic context; J. Smith may similarly refer to more than one individual named Joseph Smith, John Smith, Jane Smith, etc. Semantic ambiguity of names is very common because of the standard practice of using shorter names to stand for longer ones. Shared knowledge and context are crucial disambiguation factors. Paris usually refers to the capital of France, rather than a city in Texas or the Trojan prince, but in a particular context, such as a discussion of Greek mythology, the presumed referent changes.</Paragraph> <Paragraph position="11"> Beyond the ambiguities that proper names share with common nouns, some ambiguities are particular to names: noun phrases may be ambiguous between a name reading and a common noun phrase, as in Candy, the person's name, versus candy the food, or The House as an organization versus a house referring to a building. 
In English, capitalization usually disambiguates the two, though not at sentence beginnings: at the beginning of a sentence, the components and capitalization patterns of New Coke and New Sears are identical; only world knowledge informs us that New Coke is a product and Sears is a company.</Paragraph> <Paragraph position="12"> Furthermore, capitalization does not always disambiguate names from non-names because what constitutes a name as opposed to a non-name is not always clear. According to (Quirk et al. 1972) names, which consist of proper nouns (classified into personal names like Shakespeare, temporal names like Monday, or geographical names like Australia) have 'unique' reference. Proper nouns differ in their linguistic behavior from common nouns in that they mostly do not take determiners or have a plural form. However, some names do take determiners, as in The New York Times; in this case, they &quot;are perfectly regular in taking the definite article since they are basically premodified count nouns... The difference between an ordinary common noun and an ordinary common noun turned name is that the unique reference of the name has been institutionalized, as is made overt in writing by initial capital letter.&quot; Quirk et al.'s description of names seems to indicate that capitalized words like Egyptian (an adjective) or Frenchmen (a noun referring to a set of individuals) are not names. It leaves capitalized sequences like Minimum Alternative Tax, Annual Report, and Chairman undetermined as to whether or not they are names.</Paragraph> <Paragraph position="13"> All of these ambiguities must be dealt with if proper names are to be identified correctly. 
In the rest of the paper we describe the resources and heuristics we have designed and implemented in Nominator and the extent to which they resolve these ambiguities.</Paragraph> </Section> <Section position="5" start_page="203" end_page="203" type="metho"> <SectionTitle> 3 Disambiguation Resources </SectionTitle> <Paragraph position="0"> In general, two types of resources are available for disambiguation: context and world knowledge. Each of these can be exploited along a continuum, from 'cheaper' to computationally and manually more expensive usage. 'Cheaper' models, which include no context or world knowledge, do very little disambiguation. More 'expensive' models, which use full syntactic parsing, discourse models, inference and reasoning, require computational and human resources that may not always be available, as when massive amounts of text have to be rapidly processed on a regular basis. In addition, given the current state of the art, full parsing and extensive world knowledge would still not yield complete automatic ambiguity resolution.</Paragraph> <Paragraph position="1"> In designing Nominator, we have tried to achieve a balance between high accuracy and speed by adopting a model which uses minimal context and world knowledge. Nominator uses no syntactic contextual information. It applies a set of heuristics to a list of (multi-word) strings, based on patterns of capitalization, punctuation and location within the sentence and the document. This design choice differentiates our approach from that of several similar projects. Most proper name recognizers that have been reported on in print either take as input text tagged by part-of-speech (e.g., the systems of (Paik et al. 1993) and (Mani et al. 1993)) or perform syntactic and/or morphological analysis on all words, including capitalized ones, that are part of candidate proper names (e.g., (Coates-Stephens 1993) and (McDonald 1993)). Several (e.g., (McDonald 1993), (Mani et al. 1993), (Paik et al. 
1993) and (Cowie et al. 1992)) look in the local context of the candidate proper name for external information such as appositives (e.g., in a sequence such as Robin Clark, president of Clark Co.) or for human-subject verbs (e.g., say, plan) in order to determine the category of the candidate proper name. Nominator does not use this type of external context.</Paragraph> <Paragraph position="2"> Instead, Nominator makes use of a different kind of contextual information -- proper names co-occurring in the document. It is a fairly standard convention in an edited document for one of the first references to an entity (excluding a reference in the title) to include a relatively full form of its name.</Paragraph> <Paragraph position="3"> In a kind of discourse anaphora, other references to the entity take the form of shorter, more ambiguous variants. Nominator identifies the referent of the full form (see below) and then takes advantage of the discourse context provided by the list of names to associate shorter, more ambiguous name occurrences with their intended referents.</Paragraph> <Paragraph position="4"> In terms of world knowledge, the most obvious resource is a database of known names. In fact, this is what many commercially available name identification applications use (e.g., Hayes 1994). A reliable database provides both accuracy and efficiency, if fast look-up methods are incorporated. A database also has the potential to resolve structural ambiguity; for example, if IBM and Apple Computers are listed individually in the database but IBM and Apple Computers is not, it may indicate a conjunction of two distinct names. A database may also contain default world knowledge information: e.g., with no other over-riding information, it may be safe to assume that the string McDonald's refers to an organization. 
But even if an existing database is reliable, names that are not yet in it must be discovered and information in the database must be over-ridden when appropriate. For example, if a new name such as IBM Credit Corp. occurs in the text but not in the database, while IBM exists in the database, automatic identification of IBM should be blocked in favor of the new name IBM Credit Corp.</Paragraph> <Paragraph position="5"> If a name database exists, Nominator can take advantage of it. However, our goal has been to design Nominator to function optimally in the absence of such a resource. In this case, Nominator consults a small authority file which contains information on about 3000 special 'name words' and their relevant lexical features. Listed are personal titles (e.g., Mr., King), organizational identifiers (including strong identifiers such as Inc. and weaker domain identifiers such as Arts) and names of large places (e.g., Los Angeles, California, but not Scarsdale, N.Y.). Also listed are exception words, such as upper-case lexical items that are unlikely to be single-word proper names (e.g., Very, I or TV) and lower-case lexical items (e.g., and and van) that can be parts of proper names. In addition, the authority file contains about 20,000 first names.</Paragraph> <Paragraph position="6"> Our choice of disambiguation resources makes Nominator fast and robust. The precision and recall of Nominator, operating without a database of pre-existing proper names, is in the 90's while the processing rate is over 40Mg of text per hour on a RISC/6000 machine. (See (Ravin and Wacholder 1996) for details.) This efficient processing has been achieved at the cost of limiting the extent to which the program can 'understand' the text being analyzed and resolve potential ambiguity. Many word sequences that are easily recognized by human readers as names are ambiguous for Nominator, given the restricted set of tools available to it. 
In cases where Nominator cannot resolve an ambiguity with relatively high confidence, we follow the principle that 'noisy information' is to be preferred to data omitted, so that no information is lost. In ambiguous cases, the module is designed to make conservative decisions, such as including non-names or non-name parts in otherwise valid name sequences. It assigns weak types such as ?HUMAN or fails to assign a type if the available information is not sufficient.</Paragraph> </Section> <Section position="6" start_page="203" end_page="204" type="metho"> <SectionTitle> 4 The Name Discovery Process </SectionTitle> <Paragraph position="0"> In this section, we give an overview of the process by which Nominator identifies and classifies proper names. Nominator's first step is to build a list of candidate names for a document. Next, 'splitting' heuristics are applied to all candidate names for the purpose of breaking up complex names into smaller ones. Finally, Nominator groups together name variants that refer to the same entity. After information about names and their referents has been extracted from individual documents, an aggregation process combines the names collected from all the documents into a dictionary, or database of names, representative of the document collection. (For more details on the process, see (Ravin and Wacholder 1996).)</Paragraph> <Paragraph position="1"> We illustrate the process of name discovery with an excerpt taken from a Wall Street Journal article in the TIPSTER CD-ROM collection (NIST 1993).</Paragraph> <Paragraph position="2"> Paragraph breaks are omitted to conserve space.</Paragraph> <Paragraph position="3"> ... The professional conduct of lawyers in other jurisdictions is guided by American Bar Association rules or by state bar ethics codes, none of which permit non-lawyers to be partners in law firms. 
The ABA has steadfastly reserved the title of partner and partnership perks (which include getting a stake of the firm's profit) for those with law degrees. But Robert Jordan, a partner at Steptoe &amp; Johnson who took the lead in drafting the new district bar code, said the ABA's rules were viewed as &quot;too restrictive&quot; by lawyers here. &quot;The practice of law in Washington is very different from what it is in Dubuque,&quot; he said .... Some of these non-lawyer employees are paid at partners' levels. Yet, not having the partner title &quot;makes non-lawyers working in law firms second-class citizens,&quot; said Mr. Jordan of Steptoe &amp; Johnson ....</Paragraph> <Paragraph position="4"> Before the text is processed by Nominator, it is analyzed into tokens -- sentences, words, tags, and punctuation elements. Nominator forms a candidate name list by scanning the tokenized document and collecting sequences of capitalized tokens (or words) as well as some special lower-case tokens, such as conjunctions and prepositions.</Paragraph> <Paragraph position="5"> The list of candidate names extracted from the excerpt includes Mr. Jordan of Steptoe &amp; Johnson. Each candidate name is examined for the presence of conjunctions, prepositions or possessive 's. A set of heuristics is applied to determine whether each candidate name should be split into smaller independent names. For example, Mr. Jordan of Steptoe &amp; Johnson is split into Mr. Jordan and Steptoe &amp; Johnson.</Paragraph> <Paragraph position="6"> Finally, Nominator links together variants that refer to the same entity. Because of standard English-language naming conventions, Mr. Jordan is grouped with Robert Jordan. ABA is grouped with American Bar Association as a possible abbreviation of the longer name. Each linked group is categorized by an entity type and assigned a 'canonical name' as its identifier. The canonical name is the fullest, least ambiguous label that can be used to refer to the entity. 
It may be one of the variants found in the document or it may be constructed from components of different ones. As the links are formed, each group is assigned a type. In the sample output shown below, each canonical name is followed by its entity type and by the variants linked to it.</Paragraph> <Paragraph position="7"> After the whole document collection has been processed, linked groups are merged across documents and their variants combined. Thus, if in one document President Clinton was a variant of William Clinton, while in another document Governor Clinton was a variant of William Clinton, both are treated as variants of an aggregated William Clinton group. In this minimal sense, Nominator uses the larger context of the document collection to 'learn' more variants for a given name.</Paragraph> <Paragraph position="8"> In the following sections we describe how ambiguity is resolved as part of the name discovery process.</Paragraph> </Section> <Section position="7" start_page="204" end_page="205" type="metho"> <SectionTitle> 5 Resolution of Structural </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="204" end_page="205" type="sub_section"> <SectionTitle> Ambiguity </SectionTitle> <Paragraph position="0"> We identify three indicators of potential structural ambiguity: prepositions, conjunctions and possessive pronouns, which we refer to as 'ambiguous operators'. In order to determine whether 'splitting' should occur, a name sequence containing an ambiguous operator is divided into three segments: the operator, the substring to its left and the substring to its right. 
The splitting process applies a set of heuristics based on patterns of capitalization, lexical features and the relative 'scope' of operators (see below) to name sequences containing these operators to determine whether or not they should be split into smaller names.</Paragraph> <Paragraph position="1"> We can describe the splitting heuristics as determining the scope of ambiguous operators, by analogy to the standard linguistic treatment of quantifiers.</Paragraph> <Paragraph position="2"> From Nominator's point of view, all three operator types behave in similar ways and often interact when they co-occur in the same name sequence, as in New York's MOMA and the Victoria and Albert Museum in London.</Paragraph> <Paragraph position="3"> The scope of ambiguous operators also interacts with the 'scope' of NP-heads, if we define the scope of NP-heads as the constituents they dominate. For example, in Victoria and Albert Museum, the conjunction is within the scope of the lexical head Museum because Museum is a noun that can take PP modification (Museum of Natural History) and hence pre-modification (Natural History Museum).</Paragraph> <Paragraph position="4"> Since pre-modifiers can contain conjunctions (Japanese Painting and Printing Museum), the conjunction is within the scope of the noun, and so the name is not split. Although the same relationship holds between the lexical head Laboratories and the conjunction and in IBM and Bell Laboratories, another heuristic takes precedence, one whose condition requires splitting a string if it contains an acronym immediately to the left or to the right of the ambiguous operator.</Paragraph> <Paragraph position="5"> It is not possible to determine relative scope strength for all the combinations of different operators. Contradictory examples abound: Gates of Microsoft and Gerstner of IBM suggests stronger scope of and over of; The Department of German Languages and Literature suggests the opposite. 
Since it is usually the case that a right-hand operator has stronger scope over a left-hand one, we evaluate strings containing operators from right to left. To illustrate, New York's MOMA and the Victoria and Albert Museum in London is first evaluated for splitting on in. Since the left and right substrings do not satisfy any conditions, we proceed to the next operator on the left -- and. Because of the strong scope of Museum, as mentioned above, no splitting occurs. Next, the second and from the right is evaluated. It causes a split because it is immediately preceded by an all-capitalized word. We have found this simple typographical heuristic to be powerful and surprisingly accurate.</Paragraph> <Paragraph position="6"> Ambiguous operators form recursive structures and so the splitting heuristics apply recursively to name sequences until no more splitting conditions hold. New York's MOMA is further split at 's because of a heuristic that checks for place names on the left of a possessive pronoun or a comma. Victoria and Albert Museum in London remains intact.</Paragraph> <Paragraph position="7"> Nominator's other heuristics resemble those discussed above in that they check for typographical patterns or for the presence of particular name types to the left or right of certain operators. Some heuristics weigh the relative scope strength in the substrings on either side of the operator. If the scope strength is similar, the string is split. We have observed that this type of heuristic works quite well. Thus, the string The Natural History Museum and The Board of Education is split at and because each of its substrings contains a strong-scope NP-head (as we define it) with modifiers within its scope. 
These two substrings are better balanced than the substrings of The Food and Drug Administration where the left substring does not contain a strong-scope NP-head while the right one does (Administration).</Paragraph> <Paragraph position="8"> Because of the principle that noisy data is preferable to loss of information, Nominator does not split names if relative strength cannot be determined. As a result, there occur in Nominator's output certain 'names' such as American Television &amp; Communications and Houston Industries Inc. or Dallas's MCorp and First RepublicBank and Houston's First City Bancorp. of Texas.</Paragraph> </Section> </Section> <Section position="8" start_page="205" end_page="205" type="metho"> <SectionTitle> 6 Resolution of Ambiguity at </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="205" end_page="205" type="sub_section"> <SectionTitle> Sentence Beginnings </SectionTitle> <Paragraph position="0"> Special treatment is required for words in sentence-initial position, which may be capitalized because they are part of a proper name or simply because they are sentence initial.</Paragraph> <Paragraph position="1"> While the heuristics for splitting names are linguistically motivated and rule-governed, the heuristics for handling sentence-initial names are based on patterns of word occurrence in the document. When all the names have been collected and split, names containing sentence-initial words are compared to other names on the list. If the sentence-initial candidate name also occurs as a non-sentence-initial name or as a substring of it, the candidate name is assumed to be valid and is retained. Otherwise, it is removed from the list. For example, if White occurs at sentence-initial position and also as a substring of another name (e.g., Mr. White) it is kept. 
If it is found only in sentence-initial position (e.g., White paint is ...), White is discarded.</Paragraph> <Paragraph position="2"> A more difficult situation arises when a sentence-initial candidate name contains a valid name that begins at the second word of the string. If the preceding word is an adverb, a pronoun, a verb or a preposition, it can safely be discarded. Thus a sentence beginning with Yesterday Columbia yields Columbia as a name. But cases involving other parts of speech remain unresolved. If they are sentence-initial, Nominator accepts as names both New Sears and New Coke; it also accepts sentence-initial Five Reagan as a variant of President Reagan, if the two co-occur in a document.</Paragraph> </Section> </Section> <Section position="9" start_page="205" end_page="206" type="metho"> <SectionTitle> 7 Resolution of Semantic Ambiguity </SectionTitle> <Paragraph position="0"> In a typical document, a single entity may be referred to by many name variants which differ in their degree of potential ambiguity. As noted above, Paris and Washington are highly ambiguous out of context but in well-edited text they are often disambiguated by the occurrence of a single unambiguous variant in the same document. Thus, Washington is likely to co-occur with either President Washington or Washington, D.C., but not with both. Indeed, we have observed that if several unambiguous variants do co-occur, as in documents that mention both the owner of a company and the company named after the owner, the editors refrain from using a variant that is ambiguous with respect to both.</Paragraph> <Paragraph position="1"> To disambiguate highly ambiguous variants then, we link them to unambiguous ones occurring within the same document. 
Nominator cycles through the list of names, identifying 'anchors', or variant names that unambiguously refer to certain entity types.</Paragraph> <Paragraph position="2"> When an anchor is identified, the list of name candidates is scanned for ambiguous variants that could refer to the same entity. They are linked to the anchor. Our measure of ambiguity is very pragmatic. It is based on the confidence scores yielded by heuristics that analyze a name and determine the entity types it can refer to. If the heuristic for a certain entity type (a person, for example) results in a high confidence score (highly confident that this is a person name), we determine that the name unambiguously refers to this type. Otherwise, we choose the highest score obtained by the various heuristics.</Paragraph> <Paragraph position="3"> A few simple indicators can unambiguously determine the entity type of a name, such as Mr. for a person or Inc. for an organization. More commonly, however, several pieces of positive and negative evidence are accumulated in order to make this judgement. We have defined a set of obligatory and optional components for each entity type. For a human name, these components include a professional title (e.g., Attorney General), a personal title (e.g., Dr.), a first name, middle name, nickname, last name, and suffix (e.g., Jr.). The combination of the various components is inspected. Some combinations may result in a high negative score -- highly confident that this cannot be a person name. For example, if the name lacks a personal title and a first name, and its last name is listed as an organization word (e.g., Department) in the authority list, it receives a high negative score. This is the case with Justice Department or Frank Sinatra Building. The same combination but with a last name that is not a listed organization word results in a low positive score, as for Justice Johnson or Frank Sinatra. 
The presence or absence of a personal title is also important for determining confidence: if present, the result is a high confidence score (e.g., Mrs. Ruth Lake); no personal title with a known first name results in a low positive confidence score (e.g., Ruth Lake, Beverly Hills); and no personal title with an unknown first name results in a zero score (e.g., Panorama Lake).</Paragraph> <Paragraph position="4"> By the end of the analysis process, Justice Department has a high negative score for person and a low positive score for organization, resulting in its classification as an organization. Beverly Hills, by contrast, has low positive scores both for place and for person. Names with low or zero scores are first tested as possible variants of names with high positive scores. However, if they are incompatible with any, they are assigned a weak entity type. Thus in the absence of any other evidence in the document, Beverly Hills is classified as a ?PERSON. (?PERSON is preferred over ?PLACE as it tends to be the correct choice most of the time.) This analysis of course can be over-ridden by a name database listing Beverly Hills as a place.</Paragraph> <Paragraph position="5"> Further disambiguation may be possible during aggregation across documents. As mentioned before, during aggregation, linked groups from different documents are merged if their canonical forms are identical. As a rule, their entity types should be identical as well, to prevent a merge of Boston (PLACE) and Boston (ORG). Weak entity types, however, are allowed to merge with stronger entity types. Thus, Jordan Hills (?PERSON) from one document is aggregated with Jordan Hills (PERSON) from another, where there was sufficient evidence, such as Mr. Hills, to make a firmer decision.</Paragraph> </Section> </Paper>
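The cross-document aggregation rule described at the end of Section 7 can be illustrated with a minimal sketch. This is not Nominator's implementation: the names `LinkedGroup`, `aggregate`, `strong_type` and `is_weak` are hypothetical, and the sketch assumes that a weak type such as ?PERSON merges only with the strong type of the same label (PERSON), which is the one case the text gives an example for.

```python
from dataclasses import dataclass, field


@dataclass
class LinkedGroup:
    """One linked group from a single document (hypothetical structure)."""
    canonical: str
    entity_type: str  # "PERSON", "PLACE", "ORG", or a weak type such as "?PERSON"
    variants: set = field(default_factory=set)


def is_weak(entity_type):
    """Weak types are written with a leading question mark, as in the text."""
    return entity_type.startswith("?")


def strong_type(entity_type):
    """Strip the weak marker, so "?PERSON" becomes "PERSON"."""
    return entity_type.lstrip("?")


def aggregate(groups):
    """Merge groups whose canonical names match and whose types are compatible.

    Assumption: types are compatible when they are identical once the weak
    marker is removed. Groups with different strong types stay separate.
    """
    merged = {}
    for group in groups:
        key = (group.canonical, strong_type(group.entity_type))
        if key not in merged:
            merged[key] = LinkedGroup(
                group.canonical, group.entity_type, set(group.variants)
            )
            continue
        target = merged[key]
        target.variants |= group.variants
        # Firmer evidence from any document replaces a weak type.
        if is_weak(target.entity_type) and not is_weak(group.entity_type):
            target.entity_type = group.entity_type
    return list(merged.values())


# Examples taken from the text: one group per (document, entity).
groups = [
    LinkedGroup("William Clinton", "PERSON", {"President Clinton"}),
    LinkedGroup("William Clinton", "PERSON", {"Governor Clinton"}),
    LinkedGroup("Jordan Hills", "?PERSON", {"Jordan Hills"}),
    LinkedGroup("Jordan Hills", "PERSON", {"Mr. Hills"}),
    LinkedGroup("Boston", "PLACE", {"Boston"}),
    LinkedGroup("Boston", "ORG", {"Boston"}),
]
result = {(g.canonical, g.entity_type): g.variants for g in aggregate(groups)}
```

Under these assumptions the two William Clinton groups collapse into one with both variants, Jordan Hills ends up typed PERSON, and Boston (PLACE) and Boston (ORG) remain distinct. Keying the merge on the canonical name plus the unmarked type keeps the rule a single dictionary lookup; a system that lets a weak type join a differently labeled strong type would need a different key.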