<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1021"> <Title>A Knowledge-free Method for Capitalized Word Disambiguation</Title> <Section position="3" start_page="0" end_page="159" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Disambiguation of capitalized words in mixed-case texts has received little attention in the natural language processing and information retrieval communities, but in fact it plays an important role in many tasks. Capitalized words usually denote proper names - names of organizations, locations, people, artifacts, etc. - but there are also other positions in the text where capitalization is expected. Such ambiguous positions include the first word in a sentence, words in all-capitalized titles or table entries, a capitalized word after a colon or open quote, the first capitalized word in a list entry, etc. Capitalized words in these and some other positions present a case of ambiguity: they can stand for proper names, as in &quot;White later said ...&quot;, or they can be just capitalized common words, as in &quot;White elephants are ...&quot;. Thus the disambiguation of capitalized words in the ambiguous positions leads to the identification of proper names, and in this paper we will use these two terms interchangeably. Note that this task does not involve the classification of proper names into semantic categories (person, organization, location, etc.), which is the objective of the Named Entity Recognition task.</Paragraph> <Paragraph position="1"> Many researchers have observed that the commonly used upper/lower case normalization does not necessarily help document retrieval. 
Church (1995), among other simple text normalization techniques, studied the effect of case normalization for different words and showed that &quot;...sometimes case variants refer to the same thing (hurricane and Hurricane), sometimes they refer to different things (continental and Continental) and sometimes they don't refer to much of anything (e.g. anytime and Anytime).&quot; Obviously these differences are due to the fact that some capitalized words stand for proper names (such as Continental, the name of an airline) and some don't.</Paragraph> <Paragraph position="2"> Proper names are the main concern of the Named Entity Recognition subtask (Chinchor, 1998) of Information Extraction. There the disambiguation of the first word of a sentence (and of words in other ambiguous positions) is one of the central problems. For instance, the word &quot;Black&quot; in the sentence-initial position can stand for a person's surname but can also refer to the colour. Even in multi-word capitalized phrases the first word can belong to the rest of the phrase or can be just an external modifier. In the sentence &quot;Daily, Mason and Partners lost their court case&quot; it is clear that &quot;Daily, Mason and Partners&quot; is the name of a company. In the sentence &quot;Unfortunately, Mason and Partners lost their court case&quot; the name of the company does not involve the word &quot;unfortunately&quot;, but the word &quot;Daily&quot; is just as common a word as &quot;unfortunately&quot;.</Paragraph> <Paragraph position="3"> ...ten capitalized but in fact can stand for an adjective (American president) as well as a proper noun (he was an American).</Paragraph> <Paragraph position="4"> Identification of proper names is also important in Machine Translation because normally proper names should be transliterated (i.e. phonetically translated) rather than properly (semantically) translated. 
In confidential texts, such as medical records, proper names must be identified and removed before such texts are made available to unauthorized people. And in general, most tasks which involve different kinds of text analysis will benefit from the robust disambiguation of capitalized words into proper names and capitalized common words.</Paragraph> <Paragraph position="5"> Despite the obvious importance of this problem, it has always been considered part of larger tasks and, to the authors' knowledge, has not been studied closely in its own right. In the part-of-speech tagging field, the disambiguation of capitalized words is treated similarly to the disambiguation of common words. However, as Church (1988) rightly pointed out, &quot;Proper nouns and capitalized words are particularly problematic: some capitalized words are proper nouns and some are not. Estimates from the Brown Corpus can be misleading. For example, the capitalized word &quot;Acts&quot; is found twice in Brown Corpus, both times as a proper noun (in a title). It would be misleading to infer from this evidence that the word &quot;Acts&quot; is always a proper noun.&quot; Church then proposed to include only high-frequency capitalized words in the lexicon and also to label words as proper nouns if they are &quot;adjacent to&quot; other capitalized words. For the rest of the capitalized common words he suggested that a small probability of proper noun interpretation should be assumed and that one should then hope that the surrounding context will help to make the right assignment.</Paragraph> <Paragraph position="6"> This approach is successful for some cases but, as we pointed out above, a sentence-initial capitalized word which is adjacent to other capitalized words is not necessarily a part of a proper name, and also many common nouns and plural nouns can be used as proper names (e.g. 
Riders) and their contextual expectations are not too different from those of their usual parts of speech.</Paragraph> <Paragraph position="7"> In the Information Extraction field the disambiguation of capitalized words in the ambiguous positions has always been tightly linked to the classification of proper names into semantic classes such as person name, location, company name, etc. and to the resolution of coreference between the identified and classified proper names. This gave rise to methods which address these tasks simultaneously.</Paragraph> <Paragraph position="8"> Mani and MacMillan (1995) describe a method of using contextual clues such as appositives (&quot;PERSON, the daughter of a prominent local physician&quot;) and felicity conditions for identifying names. The contextual clues themselves are then tapped for data concerning the referents of the names. The advantage of this approach is that these contextual clues not only indicate whether a capitalized word is a proper name, but also determine its semantic class. The disadvantage of this method lies in the cost and difficulty of building a wide-coverage set of contextual clues and in the dependence of these clues on the domain and text genre.</Paragraph> <Paragraph position="9"> Contextual clues are very sensitive to specific lexical and syntactic constructions, and clues developed for news-wire texts are not useful for legal or medical texts.</Paragraph> <Paragraph position="10"> In this paper we present a novel approach to the problem of capitalized word disambiguation.</Paragraph> <Paragraph position="11"> The main feature of our approach is that it uses a minimum of pre-built resources and tries to dynamically infer the disambiguation clues from the entire document under processing. This makes our approach domain- and genre-independent and thus inexpensive to apply when dealing with unrestricted texts. 
This approach was used in a named entity recognition system (Mikheev et al., 1998) where it proved to be one of the key factors in the system achieving near-human performance in the 7th Message Understanding Conference (MUC-7) evaluation (Chinchor, 1998).</Paragraph> </Section> </Paper>