<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0105"> <Title>Identifying Unknown Proper Names in Newswire Text</Title> <Section position="3" start_page="45" end_page="46" type="metho"> <SectionTitle> 2 Proper Names - Syntactic Forms and Semantic Attributes </SectionTitle> <Paragraph position="0"> We first need to describe more precisely what we mean by proper names. In terms of syntactic categories, proper names are commonly identified as lexical NPs. In the examples in this paper, we use brackets to identify an internal proper name constituent of interest. Proper names often occur inside definite NPs, where the proper name can function as the syntactic head (&quot;the [President of France]&quot;, &quot;the [Gulf of California]&quot;, &quot;the Reagan [White House]&quot;, &quot;Iraq's president [Saddam Hussein]&quot;, &quot;Lake [George]&quot;), a complement (&quot;the president of [France]&quot;), or an adjunct or attributive NP (&quot;the [Reagan] White House&quot;, &quot;the [Bush] administration&quot;). They can also occur with indefinite determiners (&quot;an [Arnold Schwartznegger]&quot;, &quot;a [Washington Redskin]&quot;, &quot;an [IBM]&quot;). As lexical NPs, proper names have substantial internal structure: they can be formed out of primitive proper name elements (&quot;Oliver North&quot;, &quot;Gramm-Rudman&quot;, &quot;Villa-Lobos&quot;), other proper names (&quot;Lake George&quot;, &quot;the [President of France]&quot;, &quot;the [Reagan White House]&quot;, &quot;Anne of a Thousand Days&quot;) and also out of non-proper names (&quot;the [Savings and Loan] crisis&quot;, &quot;General Electric Co.&quot;, &quot;Federal Savings and Loan Insurance Corporation&quot;, &quot;Committee for the Protection of Public Welfare&quot;). A common resulting form is the open compound proper name (&quot;the [Carter Administration National Energy Conservation Committee]&quot;).</Paragraph> <Paragraph position="1"> Given an occurrence of a proper name in text, we can use the text itself to extract semantic attributes associated with that name. As mentioned earlier, the local context frequently offers valuable clues. Also, for certain varieties of names, such as organization names (&quot;Microelectronics and Computer Technology Corporation&quot;) and geographical location names (&quot;Easter Island&quot;), the internal structure of the name can be used to hypothesize various semantic attributes. A study reported in [Amsler, 87] on proper names in the New York Times containing the word &quot;center&quot; (such as &quot;Grand Forks Energy Research Center&quot; and &quot;Boston University's Center for Adaptive Systems&quot;) is suggestive of the scope of such techniques. Identifying idiomatic uses is obviously a problem: as [Amsler, 87] points out, &quot;Grand Funk Railroad&quot; is the name of a rock group. In keeping with such an approach, we have developed subgrammars which model the internal syntax and semantics of geographical names, which, in combination with information from the local context, can be used to guess the type of location.</Paragraph> </Section>
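To make the idea concrete, the following is a minimal sketch, not the paper's actual subgrammar, of how the internal structure of a geographical name can be mapped to a hypothesized location type; the pattern list and scores are invented for illustration.

```python
import re

# Illustrative only: a toy fragment of a geographical-name subgrammar.
# Each pattern maps an internal name shape to a hypothesized location type.
GEO_PATTERNS = [
    (re.compile(r"^(Gulf|Sea|Bay|Strait) of [A-Z][a-z]+"), "body-of-water"),
    (re.compile(r"^Lake [A-Z][a-z]+"), "lake"),
    (re.compile(r"^Mount [A-Z][a-z]+"), "mountain"),
    (re.compile(r"[A-Z][a-z]+ Island$"), "island"),
]

def guess_location_type(name):
    """Return scored location-type hypotheses for a candidate name."""
    return [(kind, 0.8) for pat, kind in GEO_PATTERNS if pat.search(name)]

print(guess_location_type("Gulf of California"))  # [('body-of-water', 0.8)]
print(guess_location_type("Easter Island"))       # [('island', 0.8)]
```

In the system described here, such structural evidence would be combined with evidence from the local context rather than used alone.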
<Section position="4" start_page="46" end_page="49" type="metho"> <SectionTitle> 3 Overall Algorithm </SectionTitle> <Paragraph position="0"> The approach of text skimming is associated with much recent work on data extraction from text (e.g. [Mauldin 89], [Jacobs 88], and many others). In general, this means that different parts of the text can be processed to different depths, with some parts being skipped over lightly. The text skimming approach also implies, in our case, that we lighten the burden of lexical semantics: in contrast to approaches like [Coates-Stephens, 91], we need only represent word meanings for words closely related in meaning to the semantic attributes we are attempting to extract. While we were attracted to such an approach, our work also explores some of the practical tradeoffs associated with text skimming.</Paragraph> <Paragraph position="1"> The overall algorithm involves first tokenizing the text into sentences and words, then proposing candidate name mentions, and finally allowing various knowledge sources (KSs) to vote on and propose hypotheses about a given mention. Each KS can generate multiple scored hypotheses about a given mention. The KSs are applied in a pre-determined order to a mention, with each KS refining the hypotheses generated by the previous KS.</Paragraph> <Paragraph position="2"> Names which are identified beyond a certain confidence level (a variable recall/precision threshold) are added to a hypothetical lexicon after asking the user about them. Over time, learnt names (or name elements) in the hypothetical lexicon increase the likelihood of recognizing a name mention.</Paragraph> <Paragraph position="3"> The system assumes a shallow knowledge base representing the specific concepts and attributes to be extracted. For example, a president is either a head-of-state or a corporate-officer, and a person has age, title, gender and occupation; a place may be a continent, country, state, city, etc. The semantic lexicon associated with this knowledge base is a small one, of the order of a few hundred words, consisting of titles, honorifics, location nouns and organizational suffixes extracted from phrases tagged as NP in the Penn Treebank Wall Street Journal (WSJ) corpus. Words associated with these entities are the only ones which currently have any lexical semantics in our system. (A notable exception comes from our work on place names, which exploits, for comparison purposes, a TIPSTER gazetteer). This small lexicon is complemented by the very large syntactic lexicon derived from the Lancaster-Oslo-Bergen corpus, which is used by our part-of-speech tagger and parser [de Marcken, 90].</Paragraph> <Paragraph position="4"> A variety of different grammars are used by the system. The simpler kind are regular expression grammars which rely on part-of-speech, some specific key lexical items from our semantic lexicon, and punctuation - these grammars drive a pattern matcher which is an extension of the one described in [Norvig, 92]. Such grammars are used for modeling the internal syntax and semantics of geographical names and person names, and also for locating various Context boundaries - for example, identifying an appositive construction. Further segmentation of the appositive (see Section 3.3) is done by a mixture of pattern-matching of the above kind and NP parsing (into head, pre-modifiers, and post-modifiers) using the MIT Fast Parser [de Marcken, 90] and its associated syntactic grammar. At present, we perform only a rudimentary analysis of organization names, merely hypothesizing whether a mention is a likely organization name or not.</Paragraph>
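As a rough illustration of the control structure described above, the sketch below (with invented names and data structures, not the system's own) shows KSs applied in a fixed order, each refining the scored hypotheses left by its predecessors, with sufficiently confident names feeding the hypothetical lexicon.

```python
# A rough sketch of the hypothesis-refinement loop; hypothesis records are
# plain dicts here, and all names and scores are invented for illustration.
def process_mention(mention, knowledge_sources, threshold, hypothetical_lexicon):
    hypotheses = []                   # scored hypotheses about this mention
    for ks in knowledge_sources:      # KSs applied in a pre-determined order
        hypotheses = ks(mention, hypotheses)
    best = max(hypotheses, key=lambda h: h["confidence"], default=None)
    # Names beyond the (variable recall/precision) confidence threshold are
    # added to the hypothetical lexicon -- after user confirmation, in the
    # actual system.
    if best is not None and best["confidence"] >= threshold:
        hypothetical_lexicon[mention] = best
    return hypotheses

def organization_ks(mention, hypotheses):
    # A trivial placeholder KS: organizationhood from a company suffix.
    if mention.endswith(("Inc.", "Corp.", "Co.")):
        hypotheses.append({"type": "organization", "confidence": 0.9})
    return hypotheses

lexicon = {}
process_mention("General Electric Co.", [organization_ks], 0.8, lexicon)
print(lexicon)  # {'General Electric Co.': {'type': 'organization', 'confidence': 0.9}}
```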
<Paragraph position="5"> We have used the WSJ as a training corpus. The mode of knowledge engineering has involved building a rudimentary proper name tagger, followed by iterations through a cycle of tagging the corpus with records of Mentions and their occurrence Contexts, examining the tagged corpus to improve the knowledge sources, and retagging. It is envisaged that over time, certain hypothesized individuals will be incorporated into the knowledge base.</Paragraph> <Section position="1" start_page="47" end_page="47" type="sub_section"> <SectionTitle> 3.1 The Mention Generator </SectionTitle> <Paragraph position="0"> Given text which distinguishes between upper case and lower case, the KS which proposes candidate mentions is based on finding contiguous capitalized words, including lower-case function words (e.g. &quot;of&quot;, &quot;and&quot;, &quot;de&quot;, etc.). Only those sentences containing such mentions are processed (partially) by other KSs. This capitalization heuristic recalls all the proper names, but it is slightly imprecise, especially since sentence-initial words are always capitalized in case-distinguished text. To eliminate these, a part-of-speech based filter is applied to each sentence-initial candidate sequence, discarding the initial word unless it is from a designated set (a noun, an adjective, an NP, the definite determiner &quot;the&quot;, or an unknown word) and excluding isolated definite determiners. In practice, this filter works extremely well. However, mentions may need to be split up later when more knowledge is available, since titles may need to be extracted, and function words like conjunctions and prepositions introduce attachment ambiguities (e.g. &quot;Democratic Sens. Dennis DeConcini and Alan Cranston&quot;, &quot;Food and Drug Administration&quot;).</Paragraph> <Paragraph position="2"> Given newswire text which makes no reliable case distinction (e.g. all-uppercase or all-lowercase text), the proposer proposes contiguous sequences of words with categories in the above designated set. The proposals include all the mentions proposed in case-sensitive mode, but the use of shallow processing here is obviously far less precise, generating 3 to 4 times as many mentions. However, incorrect candidates get filtered out eventually, since there are no significant hypotheses about them.</Paragraph> </Section>
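A minimal sketch of the capitalization heuristic follows, assuming whitespace tokenization; the function-word list is an invented abbreviation of the real one, and the real system additionally applies the part-of-speech filter described above to sentence-initial words.

```python
# Proposes contiguous capitalized words as candidate mentions, allowing
# designated lower-case function words inside a name. Trailing function
# words (and hence isolated determiners such as a sentence-initial "The")
# are stripped from a candidate.
FUNCTION_WORDS = {"of", "and", "de", "the", "for"}

def propose_mentions(tokens):
    mentions, current = [], []
    for tok in tokens + [""]:  # sentinel flushes the final candidate
        if tok and (tok[0].isupper() or (current and tok.lower() in FUNCTION_WORDS)):
            current.append(tok)
        else:
            while current and current[-1].lower() in FUNCTION_WORDS:
                current.pop()  # a name cannot end in a function word
            if current:
                mentions.append(" ".join(current))
            current = []
    return mentions

print(propose_mentions("The president of France met Saddam Hussein of Iraq yesterday".split()))
# ['France', 'Saddam Hussein of Iraq']
```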
<Section position="2" start_page="47" end_page="48" type="sub_section"> <SectionTitle> 3.2 Knowledge Sources </SectionTitle> <Paragraph position="0"> Each KS can have multiple hypotheses with different confidences. For example, the mention &quot;General Electric Co.&quot; may result in an initial hypothesis that it could be a person, based on interpreting &quot;General&quot; as a title, and other hypotheses that it could be a company or a county, based on the abbreviated suffix &quot;Co.&quot;. Each distinct filling of attributes corresponds to a distinct hypothesis. We currently use a somewhat crude thresholding scheme: viewing an attribute-KS as filling a single attribute, the confidence of a particular attribute-KS's hypothesis is a weighted sum of the match strength and the attribute-KS's strength, the latter being based on an initial global ranking followed by later calibration.</Paragraph> <Paragraph position="1"> The KSs are based on simple heuristics, which, except for Coreference, are interesting more in terms of their combined effect than in themselves. For example, Organization? is a KS which trivially determines organizationhood by the presence of certain company suffixes like &quot;Inc.&quot;. Honorifics uses the text occurrence of honorifics (&quot;Mr.&quot;, &quot;His Holiness&quot;, &quot;Lt. Col.&quot;) from the small semantic lexicon to make inferences about personhood, as well as gender and job occupation.</Paragraph> <Paragraph position="2"> The Job-Title and Age KSs extract their data from appositive constructions and premodifying adjective phrases and noun compounds. A job-title (a surface string like &quot;president-for-life&quot;) may or may not be in the syntactic or semantic lexicon; if it is present in the semantic lexicon, an effort is made to infer, based on context, the person's job occupation, as discussed in the next section. Person-Name is a weak KS which segments potential person-names without being able to determine personhood with any confidence.</Paragraph> <Paragraph position="3"> Name-Element upgrades the confidence of names which match learned name elements. Agent-of-Human-Action looks for verbs like &quot;lead&quot;, &quot;head&quot;, &quot;say&quot;, &quot;explain&quot;, &quot;think&quot;, &quot;admit&quot; in the syntactic context to estimate whether a given mention could be a person, though the assignment of agent role to the mention is only approximate; the frequent use of metonymy involving companies as agents makes this a relatively weak KS. A Short-Name? KS reflects a newspaper honorific convention of not using single-word titleless names in introductory people mentions (as in &quot;Yesterday [Kennedy] said..&quot;). The Location KS uses patterns involving locational category nouns from the semantic lexicon like &quot;town&quot;, &quot;sea&quot;, &quot;gulf&quot;, &quot;north&quot; to flag location mentions like &quot;town of Beit Sahoud&quot;.</Paragraph> </Section>
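For concreteness, here are toy versions of two of the heuristic KSs named above; hypothesis records are plain dicts, and the honorific table and location-noun list are abbreviated, invented stand-ins for the small semantic lexicon.

```python
import re

HONORIFICS = {
    "Mr.": {"gender": "male"},
    "Mrs.": {"gender": "female"},
    "Lt. Col.": {"occupation": "military-officer"},
}
LOCATION_NOUNS = {"town", "sea", "gulf", "north"}

def honorifics_ks(mention, hypotheses):
    # Infer personhood (plus gender or occupation) from a leading honorific.
    for honorific, attrs in HONORIFICS.items():
        if mention.startswith(honorific + " "):
            hypotheses.append({"type": "person", "confidence": 0.95, **attrs})
    return hypotheses

def location_ks(mention, left_context, hypotheses):
    # Flag location mentions introduced by a locational category noun,
    # as in "town of Beit Sahoud".
    if re.search(r"\b(%s) of\s*$" % "|".join(LOCATION_NOUNS), left_context):
        hypotheses.append({"type": "location", "confidence": 0.9})
    return hypotheses

print(honorifics_ks("Mr. Roh", []))
print(location_ks("Beit Sahoud", "the town of ", []))
```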
<Section position="3" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 3.3 Appositives </SectionTitle> <Paragraph position="0"> Appositives are important linguistic devices for introducing new mentions. We limit ourselves to constituents of the form <NP, NP>. These are of the form name-comma-appositive (e.g. &quot;<name>, <ORG>'s top managing director&quot;, &quot;<name>, a small Bay Area town&quot;), and appositive-comma-name (e.g. &quot;a top Japanese executive, <name>&quot;).</Paragraph> <Paragraph position="1"> We ignore double appositives, except for simple ones involving age, as in &quot;Osamu Nagayama, 33, senior vice president and chief financial officer of Chugai&quot;. Therefore, given a candidate name mention, the appositive modifier is an NP to the right or the left of the name. (A <NP, NP> constituent can of course be part of an enumerated, conjoined NP; however, if one conjunct is a name, it's likely that the other one may be too. Of course, a <NP, NP> sequence may not be a constituent in the first place).</Paragraph> <Paragraph position="2"> To identify appositive boundaries, we experimented with both (a) a regular expression grammar tuned to find appositives in the training corpus, and (b) syntactic-grammar based parsing using the MIT Fast Parser. Here we found pattern matching, based on looking for left and right delimiters such as comma and certain parts of speech, to be far more accurate. For example, given &quot;said Chugai's senior vice president for international trade, Osamu Nagayama&quot;, the appositive identifier would find &quot;Chugai's senior vice president for international trade&quot;. For extracting premodifiers, head and postmodifiers, we have found technique (b) to be somewhat more useful, though attachment errors still occur. The extracted premodifiers and head (or maximal fragment thereof) are then looked up in the semantic lexicon ontology; looking up &quot;senior vice president&quot; would yield corporate-officer or government-official. Hypotheses about &quot;Chugai&quot;, based on information from Coreference linking it to an earlier mention of &quot;Chugai Pharmaceutical Corp.&quot;, can be used to infer that &quot;Osamu Nagayama&quot; is more likely to be a corporate officer than a government official.</Paragraph> </Section>
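The sketch below is a deliberately simplified, invented stand-in for the tuned appositive grammar: a single regular expression for the name-comma-appositive form that looks for a candidate name, a comma as left delimiter, and an NP-like span ending at a right delimiter.

```python
import re

# Toy delimiter-based matcher for the name-comma-appositive form; the real
# grammar also uses parts of speech, not just surface shape and punctuation.
APPOSITIVE = re.compile(
    r"(?P<name>(?:[A-Z][a-z]+ )+[A-Z][a-z]+), "  # candidate name mention
    r"(?P<appos>(?:a |an |the )?[a-z][^,.]*)"    # appositive NP to its right
)

m = APPOSITIVE.search("Osamu Nagayama, senior vice president of Chugai, said ...")
if m:
    print(m.group("name"))   # Osamu Nagayama
    print(m.group("appos"))  # senior vice president of Chugai
```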
<Section position="5" start_page="49" end_page="50" type="metho"> <SectionTitle> 4 Coreference </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 4.1 Normalized Names </SectionTitle> <Paragraph position="0"> When a new mention is processed by the Coreference KS, pegs from previous mentions seen earlier in the document are considered as candidate coanchored mentions. Obviously, we wish to avoid considering the set of all previous pegs in the discourse. The use of focus information at some level could constrain this set, but that would in turn require strong assumptions about the discourse structure of texts - which could severely limit our applicable domains. Still, it seems unreasonable, given a mention of &quot;Bill Clinton&quot;, to consider a peg for &quot;New York City&quot; as a possible antecedent. This suggests we consider only previous mentions which are similar in some way. We do this by indexing each mention by a normalized name, and considering only pegs for mentions which have the same normalized name. This raises the issue of the choice of a normalized name key.</Paragraph> <Paragraph position="1"> Obviously, there can be considerable variability in the form of a name across different mentions. For example, a mention of &quot;President Clinton&quot; could be followed by &quot;Bill Clinton&quot;; one of &quot;Georgetown University&quot; by &quot;Georgetown&quot;; &quot;the Los Angeles Lakers&quot; by &quot;the Lakers&quot;. (See [Carroll, 85] for a discussion of the regularities and numerous irregularities in alternations in name forms, many of which involve metonymic reference).</Paragraph> <Paragraph position="2"> In the training corpus, the heuristic of choosing the last name element in the surface form of a name as a normalized name works well for people. This may reflect the fact that newspapers often impose their own normalization conventions. There are obvious exceptions to the last name element heuristic; for example, in the WSJ, a mention of &quot;Roh Tae Woo&quot; is followed by a co-referential mention of &quot;Mr. Roh&quot;. For organization names, our heuristic is to choose all but the last element as the normalized name, but to allow a degree of partial matching. Given a new name mention, upon failure to find a partition cell having previous mentions with the same normalized name, partition cells with neighboring normalized names are searched. (The closeness metric here involves having a high percentage of sequential words in common). Thus the WSJ mentions of &quot;Leaseway Transportation Corp&quot; followed by &quot;Leaseway&quot; would be tied together, as would &quot;Canadian Technical Tape Inc.&quot; and &quot;Technical Tape&quot;. Of course, at the time of invoking Coreference for a hypothesis associated with a mention, we may or may not have (depending in part on the ordering of knowledge sources) enough information to decide which normalized name heuristic to invoke, in which case we use the last name as a default.</Paragraph> <Paragraph position="3"> In practice the matching on normalized names works well, except for cases like Mr. Roh above, and in cases of spelling errors. If necessary, the system can use a strategy of iterative widening; if the system fails to find a coreferring mention, in iterative widening mode it attempts to search through the space of all other previous mentions. In this mode, the system can also separately collect and warn about mentions whose names are close in spelling to the current mention (using the Damerau-Levenshtein similarity metric) but not identical to it.</Paragraph> </Section> <Section position="2" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 4.2 Coreference Algorithm </SectionTitle> <Paragraph position="0"> At each peg site, the system unifies information from hypotheses associated with the new mention with information accumulated from the other mentions at the peg site. As a rule, successful unification results in coanchoring. The Coreference procedure terminates when all the pegs in the relevant normalized name partition cell have been considered. A failure of unification, which results from a conflict from a new mention at a peg site, can lead to three possible outcomes: (i) Ignoring the conflict, in which case coanchoring of the new mention to the peg is established; (ii) Overriding the earlier information accumulated at the peg in question, in which case coanchoring of the new mention to the peg is established, and coanchoring links from any other conflicting mentions to the peg are broken; or (iii) Honoring the conflict, leading to (a) considering some other peg, or, if none remains, (b) the creation of a new peg. The decision whether to Ignore or Override is based on the relative strength of the hypotheses emanating from different mentions: (i) Conflicts are Ignored when the information from the new mention has low confidence. (ii) Conflicts are Overridden when (a) (Weak-Opposition-Loses) the conflicting information from the new mention has high confidence and the conflicting information from the old mention has low confidence, or (b) (Strong-Majority-Wins) all the other evidence at the peg (there must be some) strongly confirms the new mention's hypothesis. Strong-Majority-Wins requires that there are at least two old mentions at the peg, with only one old mention giving rise to the conflict, and with all the other old mentions at the peg being compatible with the new mention at a high level of confidence for each attribute. Once a link from a mention is broken, the mention can be relinked to some other peg (either existing, or a new one). (iii) Otherwise, the conflict is Honored.</Paragraph>
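The Ignore/Override/Honor decision can be sketched as follows; HIGH and LOW are invented stand-ins for the system's calibrated confidence levels, and the peg and mention bookkeeping is elided.

```python
# A simplified sketch of conflict resolution at a peg site; thresholds and
# function signature are invented for illustration.
HIGH, LOW = 0.9, 0.3

def resolve_conflict(new_conf, old_conf, others_confirm_new, n_old_mentions):
    """Decide the outcome of a unification conflict at a peg site."""
    if new_conf <= LOW:
        return "ignore"    # weak new information: coanchor anyway
    if new_conf >= HIGH and old_conf <= LOW:
        return "override"  # Weak-Opposition-Loses
    if n_old_mentions >= 2 and others_confirm_new:
        return "override"  # Strong-Majority-Wins (one dissenter among many)
    return "honor"         # try another peg, or create a new one

print(resolve_conflict(0.95, 0.2, False, 1))  # override
print(resolve_conflict(0.5, 0.5, False, 1))   # honor
```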
<Paragraph position="1"> Figure 1 shows an example of Coreference and ambiguity resolution. To simplify the presentation, only one hypothesis is shown per mention, appositives are ignored, and each attribute of each hypothesis is assumed to have the same confidence. (A Mention is identified as a string, with the hypothesis directly below it.) Assume Mention 1 is discourse-initial; assume further that Person-Name and Age have fired. Coreference on Mention 1 leads to the creation of a new peg, Peg 1, representing the hypothetical entity Bill Clinton. Coreference on Mention 2 leads to a search in the normalized-name partition for Clinton. The system unifies the properties associated with Mention 2 with Mention 1's properties. In this case, since there is no conflict, both mentions are anchored to Peg 1. Mention 3 results in Coreference attempting a link to Peg 1. This leads to a conflict in unification with the properties from one of the other links to Mention 1, arising specifically from the full name and gender information extracted from Mention 1. These are conflicts because they violate a single-valued constraint for these attributes. The conflict with Mention 3 is honored, since there is no disparity in confidence measures. This results in Mention 3 being anchored to a new peg, Peg 2, representing a hypothetical entity Hillary Clinton. Mention 4's properties are compatible with both pegs, hence it is coanchored to both, making it ambiguous. Mention 5 leads to a conflict on name at Peg 1. There is no confidence disparity at Peg 1, so the conflict is honored, resulting in a search for some other peg. At Peg 2, there is a conflict on occupation, but since Mention 3 is compatible with Mention 5, by Strong-Majority-Wins, Mention 3 overrides the information from Mention 4. This leads to the breaking of the link of the conflicting mention with Peg 2, disambiguating Mention 4.</Paragraph> </Section> </Section> </Paper>