<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1021">
  <Title>A Knowledge-free Method for Capitalized Word Disambiguation</Title>
  <Section position="4" start_page="159" end_page="161" type="metho">
    <SectionTitle>
2 Bottom-Line Performance
</SectionTitle>
    <Paragraph position="0"> In general, the disambiguation of capitalized words in mixed-case texts does not seem too difficult: if a word is capitalized in an unambiguous position, e.g., not after a period or other punctuation which might require the following word to be capitalized (such as quotes or brackets), it is a proper name or part of a multi-word proper name. However, when a capitalized word is used in a position where it is expected to be capitalized, for instance, after a period or in a title, our task is to decide whether it acts as a proper name or as the expected capitalized common word.</Paragraph>
    <Paragraph position="1"> The first obvious strategy for deciding whether a capitalized word in an ambiguous position is a proper name or not is to apply lexicon lookup (possibly enhanced with a morphological word guesser, e.g., (Mikheev, 1997)) and mark as proper names the words which are not listed in the lexicon of common words. Let us investigate this strategy in more detail: In our experiments we used a corpus of 100 documents (64,337 words) from The New York Times 1996.</Paragraph>
    <Paragraph position="2"> This corpus was balanced to represent different domains and was used for the formal test run of the 7th Message Understanding Conference (MUC'7) (Chinchor, 1998) in the Named Entity Recognition task.</Paragraph>
    <Paragraph position="3"> First we ran a simple zoner which identified ambiguous positions for capitalized words: capitalized words after a period, quotes, colon or semicolon, in all-capital sentences and titles, and at the beginnings of itemized list entries.</Paragraph>
    <Paragraph position="4"> The 64,337-word corpus contained 2,677 capitalized words in ambiguous positions, out of which 2,012 were listed in the lexicon of English common words. Ten common words were not listed in the lexicon and not guessed by our morphological guesser: &quot;Forecasters&quot;, &quot;Benchmark&quot;, &quot;Eeverybody&quot;, &quot;Liftoff&quot;, &quot;Downloading&quot;, &quot;Pretax&quot;, &quot;Hailing&quot;, &quot;Birdbrain&quot;, &quot;Opting&quot; and &quot;Standalone&quot;. In all our experiments we did not try to disambiguate between singular and plural proper names and we also did not count as an error the adjectival reading of words which are always written capitalized (e.g.</Paragraph>
    <Paragraph position="5"> American, Russian, Okinawian, etc.). The distribution of proper names among the ambiguous capitalized words is shown in Table 1.</Paragraph>
    <Paragraph position="6"> Table 1 allows one to estimate the performance of the lexicon lookup strategy, which we take as the bottom line. First, using this strategy we would wrongly assign the ten common words which were not listed in the lexicon. More damaging is the blind assignment of the common word category to the words listed in the lexicon: out of 2,012 known word-tokens, 171 were actually used as proper names. In total this would give us 181 errors out of 2,677 tries - a misclassification error of about 6.76% on capitalized word-tokens in the ambiguous positions.</Paragraph>
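The bottom-line figure can be re-derived from the counts quoted above. The following is only a small arithmetic sketch; the variable and function names are illustrative and do not come from the original system.

```python
# Counts as reported in the text (names are hypothetical, for illustration).
ambiguous_tokens = 2677       # capitalized words in ambiguous positions
unlisted_common_words = 10    # common words missing from the lexicon
listed_proper_names = 171     # lexicon words actually used as proper names


def lookup_error_rate(total, false_proper, false_common):
    """Share of ambiguous tokens misclassified by plain lexicon lookup."""
    return (false_proper + false_common) / total


rate = lookup_error_rate(ambiguous_tokens,
                         unlisted_common_words,
                         listed_proper_names)
print(f"{rate:.2%}")  # 181 / 2677, i.e. about 6.76%
```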
    <Paragraph position="7"> The lexicon lookup strategy can be enhanced by accounting for the immediate context of the capitalized words in question. However, capitalized words in the ambiguous positions are not easily disambiguated by their surrounding part-of-speech context as attempted by part-of-speech taggers. For instance, many surnames are at the same time nouns or plural nouns in English and thus in both variants can be followed by a past tense verb. Capitalized words in the phrases Sails rose ... or Feeling himself ... can easily be interpreted either way and only knowledge of semantics disallows the plural noun interpretation of Stars can read.</Paragraph>
    <Paragraph position="8"> Another challenge is to decide whether the first capitalized word belongs to the group of the following proper nouns or is an external modifier and therefore not a proper noun. For instance, All American Bank is a single phrase but in All State Police the word &quot;All&quot; is an external modifier and can be safely decapitalized. One might argue that a part-of-speech tagger can capture that in the first case the word &quot;All&quot; modifies a singular proper noun (&quot;Bank&quot;) and hence is not grammatical as an external modifier, and that in the second case it is a grammatical external modifier since it modifies a plural proper noun (&quot;Police&quot;), but a simple counter-example - All American Games - defeats this line of reasoning.</Paragraph>
    <Paragraph position="9"> The third challenge is of a more local nature - it reflects a capitalization convention adopted by the author. For instance, words which reflect the occupation of a person can be used in an honorific mode, e.g. &quot;Chairman Mao&quot; vs.</Paragraph>
    <Paragraph position="10"> &quot;ATT chairman Smith&quot; or &quot;Astronaut Mario Runko&quot; vs. &quot;astronaut Mario Runko&quot;. When such a phrase opens a sentence, looking at the sentence only, even a human classifier has trouble making a decision.</Paragraph>
    <Paragraph position="11"> To evaluate the performance of part-of-speech taggers on the proper-noun identification task we ran an HMM trigram tagger (Mikheev, 1997) and the Brill tagger (Brill, 1995) on our corpus. Both taggers used the Penn Treebank tag-set and were trained on the Wall Street Journal corpus (Marcus et al., 1993). Since for our task the mismatch between plural proper noun (NNPS) and singular proper noun (NNP) was not important, we did not count this as an error. Depending on the smoothing technique, the misclassification error of the HMM tagger on capitalized common words in the ambiguous positions ranged from 5.3% to 4.5%, and the Brill tagger showed a similar pattern when we varied the lexicon acquisition heuristics.</Paragraph>
    <Paragraph position="12"> The taggers handled the cases when a potential adjective was followed by a verb or adverb (&quot;Golden added ...&quot;) well but they got confused with a potential noun followed by a verb or adverb (&quot;Butler was ...&quot; vs. &quot;Safety was ...&quot;), probably because the taggers could not distinguish between concrete and mass nouns. Not surprisingly the taggers did not do well on potential plural nouns and gerunds - none of them were assigned as a proper noun. The taggers also could not handle the case when a potential noun or adjective was followed by another capitalized word (&quot;General Accounting Office&quot;) well. In general, when the taggers did not have strong lexical preferences, apart from several obvious cases they tended to assign a common word category to known capitalized words in the ambiguous positions and the performance of the part-of-speech tagging approach was only about 2% superior to the simple bottom-line strategy.</Paragraph>
  </Section>
  <Section position="5" start_page="161" end_page="164" type="metho">
    <SectionTitle>
3 Our Knowledge-Free Method
</SectionTitle>
    <Paragraph position="0"> As we discussed above, the bad news (well, not really news) is that virtually any common word can potentially act as a proper name or part of a multi-word proper name. Fortunately, there is good news too: ambiguous things are usually unambiguously introduced at least once in the text unless they are part of common knowledge presupposed to be known by the readers.</Paragraph>
    <Paragraph position="1"> This is an observation which can be applied to a broader class of tasks. For example, people are often referred to by their surnames (e.g.</Paragraph>
    <Paragraph position="2"> &quot;Black&quot;) but usually introduced at least once in the text either with their first name (&quot;John Black&quot;) or with their title/profession affiliation (&quot;Mr. Black&quot;, &quot;President Bush&quot;) and it is only when their names are common knowledge that they don't need an introduction (e.g. &quot;Castro&quot;, &quot;Gorbachev&quot;).</Paragraph>
    <Paragraph position="3"> In the case of proper name identification we are not concerned with the semantic class of a name (e.g. whether it is a person name or location) but we simply want to distinguish whether this word in this particular occurrence acts as a proper name or part of a multi-word proper name. If we restrict our scope only to a single sentence, we might find that there is just not enough information to make a confident decision. For instance, Riders in the sentence &quot;Riders said later ...&quot; is equally likely to be a proper noun, a plural proper noun or a plural common noun, but if in the same text we find &quot;John Riders&quot; this sharply increases the likelihood of the proper noun interpretation, and conversely if we find &quot;many riders&quot; this suggests the plural noun interpretation. Thus our suggestion is to look at the unambiguous usage of the words in question in the entire document.</Paragraph>
    <Section position="1" start_page="161" end_page="162" type="sub_section">
      <SectionTitle>
3.1 The Sequence Strategy
</SectionTitle>
      <Paragraph position="0"> Our first strategy for the disambiguation of capitalized words in ambiguous positions is to explore sequences of proper nouns in unambiguous positions. We call it the Sequence Strategy. The rationale behind this is that if we detect a phrase of two or more capitalized words and this phrase starts from an unambiguous position we can be reasonably confident that even when the same phrase starts from an unreliable position all its words still have to be grouped together and hence are proper nouns. Moreover, this applies not just to the exact replication of such a phrase but to any partial ordering of its words of size two or more preserving their sequence. For instance, if we detect a phrase Rocket Systems Development Co. in the middle of a sentence, we can mark words in the sub-phrases Rocket Systems, Rocket Systems Co., Rocket Co., Systems Development, etc. as proper nouns even if they occur at the beginning of a sentence or in other ambiguous positions. A span of capitalized words can also include lower-cased words of length three or shorter. This allows us to capture phrases like A &amp; M, The Phantom of the Opera, etc. We generate partial orders from such phrases in a similar way but insist that every generated sub-phrase should start and end with a capitalized word.</Paragraph>
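The generation of sub-phrases described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name sub_phrases is hypothetical, the phrase is assumed to be already tokenized, and the limit of three characters on lower-cased words is assumed to have been enforced when the span was detected.

```python
from itertools import combinations


def sub_phrases(phrase_words):
    """Order-preserving sub-sequences of two or more words that
    start and end with a capitalized word (hypothetical helper)."""
    result = set()
    for size in range(2, len(phrase_words) + 1):
        # combinations() keeps the original left-to-right word order.
        for combo in combinations(phrase_words, size):
            if combo[0][0].isupper() and combo[-1][0].isupper():
                result.add(" ".join(combo))
    return result


subs = sub_phrases(["Rocket", "Systems", "Development", "Co."])
opera = sub_phrases(["The", "Phantom", "of", "the", "Opera"])
```

For the four-word example, subs contains "Rocket Systems", "Rocket Co." and "Systems Development" among others, while opera excludes candidates such as "Phantom of" that end in a lower-cased word. The number of sub-sequences grows exponentially with phrase length, so a real system would probably cap the span size.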
      <Paragraph position="1"> To make the Sequence Strategy robust to potential capitalization errors in the document we also use a set of negative evidence. This set is essentially a set of all lower-cased words of the document with their following words (bigrams).</Paragraph>
      <Paragraph position="2"> We don't attempt here to build longer sequences and their partial orders because we cannot in general restrict the scope of dependencies in such sequences. The negative evidence is then used together with the positive evidence of the Sequence Strategy to block the proper name assignment when a conflict is found. For instance, if in a document the system detects a capitalized phrase &quot;The President&quot; in an unambiguous position, then it will be assigned as a proper name even if found in ambiguous positions in the same document. To be more precise, the method will assign the word &quot;The&quot; as a proper noun since it should be grouped together with the word &quot;President&quot; into a single proper name. However, if in the same document the system detects alternative evidence, e.g. &quot;the President&quot; or &quot;the president&quot;, it then blocks such an assignment as unsafe.</Paragraph>
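A possible reading of the negative-evidence check is sketched below. The helper names are hypothetical, and two details are assumptions rather than facts from the text: the following word is matched case-insensitively (so that both "the President" and "the president" count), and only pairs whose first word is capitalized in the candidate phrase are tested, so that internal function words such as "of the" do not block every phrase.

```python
def lowercase_bigrams(tokens):
    """Negative evidence: each lower-cased word with the word after it."""
    return {(w, nxt) for w, nxt in zip(tokens, tokens[1:]) if w.islower()}


def is_blocked(phrase_words, negative_evidence):
    """True if a capitalized word of the phrase also occurs lower-cased
    before the same following word somewhere in the document."""
    # Assumption: the follower is compared case-insensitively.
    seen = {(a, b.lower()) for a, b in negative_evidence}
    return any((a.lower(), b.lower()) in seen
               for a, b in zip(phrase_words, phrase_words[1:])
               if a[0].isupper())


tokens = "He met the President of the club".split()
negative = lowercase_bigrams(tokens)
blocked = is_blocked(["The", "President"], negative)
allowed = not is_blocked(["News", "Corp."], negative)
```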
      <Paragraph position="3"> The Sequence Strategy is extremely useful when dealing with names of organizations since many of them are multi-word phrases composed of common words. And indeed, as is shown in Table 2, the precision of this strategy was 100% and the recall about 7.5%: out of 826 proper names in ambiguous positions, 62 were marked and all of them were marked correctly. If we concentrate only on difficult cases when proper names are at the same time common words of English, the recall of the Sequence Strategy rises to 18.7%: out of 171 common words which acted as proper names 32 were correctly marked. Among such words were &quot;News&quot; from &quot;News Corp.&quot;, &quot;Rocket&quot; from &quot;Rocket Systems Co.&quot;, &quot;Coast&quot; from &quot;Coast Guard&quot; and &quot;To&quot; from &quot;To B. Super&quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="162" end_page="163" type="sub_section">
      <SectionTitle>
3.2 Single Word Assignment
</SectionTitle>
      <Paragraph position="0"> The Sequence Strategy is accurate, but it covers only a part of potential proper names in ambiguous positions and at the same time it does not cover cases when capitalized words do not act as proper names. For this purpose we developed another strategy which also uses information from the entire document. We call this strategy Single Word Assignment, and it can be summarized as follows: if we detect a word which in the current document is seen capitalized in an unambiguous position and at the same time is not used lower-cased, this word in this particular document, even when used capitalized in ambiguous positions, is very likely to stand for a proper name as well. And conversely, if we detect a word which in the current document is used only lower-cased in unambiguous positions, it is extremely unlikely that this word will act as a proper name in an ambiguous position and thus such a word can be marked as a common word. The only exception here should be made for high-frequency sentence-initial words which do not normally act as proper names: even if such a word is observed in a document only as a proper name (usually as part of a multi-word proper name), it is still not safe to mark it as a proper name in ambiguous positions. Note, however, that these words can still be marked as proper names (or rather as parts of multi-word proper names) by the Sequence Strategy. To build such a list of stop-words we ran the Sequence Strategy and Single Word Assignment on the Brown Corpus (Francis &amp; Kucera, 1982), and reliably collected the 100 most frequent sentence-initial words.</Paragraph>
      <Paragraph position="1"> Table 2 shows the success of the Single Word Assignment strategy: it marked 511 proper names, of which 510 were marked correctly, and it marked 1,273 common words, of which 1,270 were marked correctly. The only word which was incorrectly marked as a proper name was the word &quot;Insurance&quot; in &quot;Insurance company ...&quot; because in the same document there was a proper phrase &quot;China-Pacific Insurance Co.&quot; and no lower-cased occurrences of the word &quot;insurance&quot; were found. The three words incorrectly marked as common words were: &quot;Defence&quot; in &quot;Defence officials ...&quot;, &quot;Trade&quot; in &quot;Trade Representation office ...&quot; and &quot;Satellite&quot; in &quot;Satellite Business News&quot;. Five out of the ten words which were not listed in the lexicon (&quot;Pretax&quot;, &quot;Benchmark&quot;, &quot;Liftoff&quot;, &quot;Downloading&quot; and &quot;Standalone&quot;) were correctly marked as common words because they were found to exist lower-cased in the text. In general the error rate of the assignment by this method was 4 out of 1,784, which is about 0.22%. It is interesting to mention that when we ran Single Word Assignment without the stop-list, it incorrectly marked as proper names only three extra common words (&quot;For&quot;, &quot;People&quot; and &quot;MORE&quot;).</Paragraph>
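The decision rule of Single Word Assignment could be sketched roughly as below. The function and parameter names are hypothetical, and the three input sets are assumed to have been collected from the current document (and, for the stop-list, from a reference corpus) in an earlier pass; the sketch leaves stop-listed words unassigned, as the text describes.

```python
def single_word_assignment(word, capitalized_unambiguous, lowercased, stop_words):
    """Label a capitalized word found in an ambiguous position.

    capitalized_unambiguous: words seen capitalized in unambiguous positions
    lowercased: lower-cased word forms seen in the same document
    stop_words: frequent sentence-initial words, never assigned here
    Returns "proper", "common", or None when no decision is made.
    """
    lower = word.lower()
    if lower in stop_words:
        return None  # left for the Sequence Strategy or later steps
    seen_capitalized = word in capitalized_unambiguous
    seen_lowercased = lower in lowercased
    if seen_capitalized and not seen_lowercased:
        return "proper"
    if seen_lowercased and not seen_capitalized:
        return "common"
    return None  # conflicting or missing evidence


capitalized = {"Riders", "Insurance", "The"}
lowered = {"many", "benchmark", "officials"}
stop = {"the", "for"}
```

With these toy sets, "Riders" is labelled as a proper name, "Benchmark" as a common word, and "The" is left unassigned. "Insurance" is also labelled as a proper name, which reproduces the single false positive discussed above.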
    </Section>
    <Section position="3" start_page="163" end_page="164" type="sub_section">
      <SectionTitle>
3.3 Taking Care of the Rest
</SectionTitle>
      <Paragraph position="0"> After Single Word Assignment we applied a simple strategy of marking as common words all unassigned words which were found in the stop-list of the most frequent sentence-initial words. This gave us no errors and covered an extra 298 common words. In fact, we could use this strategy before Single Word Assignment, since the words from the stop-list are not marked at that point anyway. Note, however, that the Sequence Strategy still has to be applied prior to the stop-list assignment. Among the words which failed to be assigned by either of our strategies were 243 proper names, but only 30 of them were in fact ambiguous, since they were listed in the lexicon of common words. So at this point we marked as proper names all unassigned words which were not listed in the lexicon of common words. This gave us 223 correct assignments and 5 incorrect ones - the remaining five of the ten common words which were not listed in the lexicon. So, in total, by the combination of the described methods we achieved a precision of $\frac{\text{correctly\_assigned}}{\text{all\_assigned}} = \frac{2363}{2363+9} = 99.62\%$ and a recall of $\frac{\text{all\_assigned}}{\text{total\_ambiguous}} = \frac{2363+9}{2677} = 88.7\%$. Now we have to decide what to do with the remaining 305 words which failed to be assigned.</Paragraph>
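The totals used in these figures can be re-assembled from the per-step counts reported in this section. The sketch below is bookkeeping only; the dictionary keys are illustrative labels, and recall is taken, as in the text, to be the share of ambiguous positions that received any assignment.

```python
# (correct, incorrect) assignments per step, as reported in the text.
# The step labels are hypothetical names chosen for this sketch.
steps = {
    "sequence_strategy": (62, 0),
    "single_word_proper": (510, 1),
    "single_word_common": (1270, 3),
    "stop_list": (298, 0),
    "lexicon_lookup": (223, 5),
}
total_ambiguous = 2677

correct = sum(c for c, _ in steps.values())
assigned = sum(c + e for c, e in steps.values())

precision = correct / assigned          # correctly_assigned / all_assigned
recall = assigned / total_ambiguous     # all_assigned / total_ambiguous
unassigned = total_ambiguous - assigned
```

Here correct comes to 2363, assigned to 2363 + 9, and unassigned to the 305 words still to be handled.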
      <Paragraph position="1"> Among such words there are 275 common words and 30 proper names, so if we simply mark all these words as common words we will increase our recall to 100% with some decrease in precision - from 99.62% down to 98.54%. Among the unclassified proper names there were a few which could be dealt with by a part-of-speech tagger: &quot;Gray, chief...&quot;, &quot;Gray said...&quot;, &quot;Bill Lattanzi...&quot;, &quot;Bill Wade...&quot;, &quot;Bill Gates...&quot;, &quot;Burns, an...&quot; and &quot;...Golden added&quot;. Another four unclassified proper names were capitalized words which followed the &quot;U.S.&quot; abbreviation, e.g. &quot;U.S. Supreme Court&quot;. This is a difficult case even for sentence boundary disambiguation systems ((Mikheev, 1998), (Palmer &amp; Hearst, 1997) and (Reynar &amp; Ratnaparkhi, 1997)) which are built for exactly that purpose, i.e., to decide whether a capitalized word which follows an abbreviation is attached to it or whether there is a sentence boundary between them. The &quot;U.S.&quot; abbreviation is one of the most difficult ones because it can be seen as often at the end of a sentence as at the beginning of multi-word proper names. Another nine unclassified proper names were stable phrases like &quot;Foreign Minister&quot;, &quot;Prime Minister&quot;, &quot;Congressional Republicans&quot;, &quot;Holy Grail&quot;, etc. mentioned just once in a document. And, finally, about seven or eight unclassified proper names were difficult to account for at all, e.g. &quot;Sate-owned&quot; or &quot;Freeman Zhang&quot;. Some of the above-mentioned proper names could be resolved if we accumulate multi-word proper names across several documents, i.e., we can use information from one document when we deal with another.</Paragraph>
      <Paragraph position="2"> This can be seen as an extension to our Sequence Strategy with the only difference that the proper noun sequences have to be taken not only from the current document but from the cache memory and all multi-word proper names identified in a document are to be appended to that cache. When we tried this strategy on our test corpus we were able to correctly assign 14 out of 30 remaining proper names which increased the system's precision on the corpus to 99.13% with 100% recall.</Paragraph>
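One way to picture the cache extension is the small sketch below. The class and method names are hypothetical, and the text does not specify how the cache is stored, bounded or expired, so an unbounded in-memory set is assumed here.

```python
class ProperNameCache:
    """Multi-word proper names accumulated over documents seen so far."""

    def __init__(self):
        self.phrases = set()

    def evidence_for(self, document_phrases):
        """Positive evidence for the current document: its own phrases
        plus everything cached from earlier documents."""
        return set(document_phrases) | self.phrases

    def update(self, document_phrases):
        """Append the phrases identified in a processed document."""
        self.phrases |= set(document_phrases)


cache = ProperNameCache()
cache.update({"Prime Minister", "Holy Grail"})   # from an earlier document
evidence = cache.evidence_for({"News Corp."})    # while processing the next
```

The Sequence Strategy would then draw its sub-phrases from evidence rather than from the current document alone.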
    </Section>
  </Section>
</Paper>