<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1021"> <Title>A Knowledge-free Method for Capitalized Word Disambiguation</Title> <Section position="6" start_page="164" end_page="165" type="concl"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> In this paper we presented an approach to the disambiguation of capitalized common words when they are used in positions where capitalization is expected. Such words can act as proper names or can be just capitalized variants of common words. The main feature of our approach is that it uses a minimum of pre-built resources - we use only a list of common words of English and a list of the most frequent words which appear in sentence-starting positions.</Paragraph> <Paragraph position="1"> Both of these lists were acquired without any human intervention. To compensate for the lack of pre-acquired knowledge, the system tries to infer disambiguation clues from the entire document itself. This makes our approach domain-independent and closely targeted to each document. Initially our method was developed using the training data of the MUC-7 evaluation and tested on the withheld test set as described in this paper. We then applied it to the Brown Corpus and achieved similar results, with a degradation of only 0.7% in precision, mostly due to text zoning errors and unknown words. We deliberately shaped our approach so that it does not rely on pre-compiled statistics but rather acts by analogy. This is because the most interesting events are inherently infrequent and hence difficult to collect reliable statistics for, and, at the same time, pre-compiled statistics would be smoothed across multiple documents rather than targeted to a specific document.</Paragraph> <Paragraph position="2"> The main strategy of our approach is to scan the entire document for unambiguous usages of the words which have to be disambiguated.
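This document-level scan can be sketched roughly as follows. This is a minimal Python sketch under an assumed simple tokenization; the function name, the evidence counts, and the majority vote are illustrative, not the paper's implementation:

```python
def classify_sentence_initial(word, sentences):
    """Classify a capitalized sentence-initial `word` as a proper name
    or a common word by scanning the document for unambiguous usages.

    `sentences` is a list of token lists.  Positions after the first
    token are treated as unambiguous: capitalization there reflects a
    deliberate choice by the author.
    """
    proper_evidence = 0   # capitalized in a non-initial position
    common_evidence = 0   # lowercased anywhere

    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token.lower() != word.lower():
                continue
            if token[0].islower():
                common_evidence += 1
            elif i > 0:               # capitalized mid-sentence
                proper_evidence += 1

    if proper_evidence > common_evidence:
        return "proper"
    if common_evidence > proper_evidence:
        return "common"
    return "unknown"   # defer to later stages (stop list, lexicon)
```

Unresolved words fall through to the later, resource-based stages, which keeps the scan itself knowledge-free.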
The fact that the pre-built resources are used only at the latest stages of processing (Stop-List Assignment and Lexicon Lookup Assignment) ensures that the system can handle unknown words and disambiguate even very implausible proper names. For instance, it correctly assigned five out of ten unknown common words.</Paragraph> <Paragraph position="3"> Among the difficult cases resolved by the system were a multi-word proper name &quot;To B. Super&quot;, where both &quot;To&quot; and &quot;Super&quot; were correctly identified as proper nouns, and a multi-word proper name &quot;The Update&quot;, where &quot;The&quot; was correctly identified as part of the magazine name. Both &quot;To&quot; and &quot;The&quot; were listed in the stop-list and were therefore highly implausible candidates for proper nouns, but the system nevertheless handled them correctly. In its generic configuration the system achieved a precision of 99.62% at a recall of 88.7%, and a precision of 98.54% at 100% recall. When we enhanced the system with a multi-word proper name cache memory, the performance improved to 99.13% precision at 100% recall. This is a statistically significant improvement over the bottom-line performance of about 94% precision at 100% recall.</Paragraph> <Paragraph position="4"> One of the key factors in the success of the proposed method is accurate zoning of the documents. Since our method relies on capitalization in unambiguous positions, such positions should be robustly identified. In the general case this is not too difficult, but one should take special care with titles, quoted speech and list entries - if treated as ordinary text, they can provide false candidates for capitalization. Our method is in general not too sensitive to capitalization errors: the Sequence Strategy is complemented with negative evidence. This, together with the fact that the same words rarely appear capitalized by mistake more than once, makes the strategy robust.
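The late-stage, resource-based assignments mentioned above might be sketched as follows. This is a hedged illustration: the stop-list contents, the lexicon argument, and the function name are invented for the example, and only the ordering of the two lookups follows the description in the text:

```python
# Illustrative stop list: very frequent words seen in
# sentence-starting positions (contents are an assumption).
STOP_LIST = {"the", "to", "a", "in", "of", "and"}

def fallback_assignment(word, common_lexicon):
    """Late-stage assignment for words the document-level scan
    left unresolved.

    Stop-List Assignment: very frequent sentence-starting words are
    taken to be common words.  Lexicon Lookup Assignment: known
    common words default to 'common'; everything else, including
    unknown words, defaults to 'proper'.
    """
    w = word.lower()
    if w in STOP_LIST:
        return "common"          # Stop-List Assignment
    if w in common_lexicon:
        return "common"          # Lexicon Lookup Assignment
    return "proper"              # unknown words default to proper name
```

Because these lookups run last, document-internal evidence always takes priority, which is how stop-listed words such as &quot;To&quot; and &quot;The&quot; can still be recognised as parts of proper names.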
The Single Word Assignment strategy uses the stop list, which includes the most frequent common words; this screens out many potential errors.</Paragraph> <Paragraph position="5"> One notable difficulty for Single Word Assignment is presented by words which denote professions or titles. When modifying a person's name these words may require capitalization - &quot;Sheriff John Smith&quot; - but in the same document they can appear lower-cased - &quot;the sheriff&quot;. When the capitalized variant occurs only sentence-initially, our method predicts that it should be decapitalized. This, however, is an extremely difficult case even for human indexers - some writers tend to use certain professions, such as Sheriff, Governor, Astronaut, etc., as honorific affiliations, while others do not. This is a generally difficult case for Single Word Assignment: a word used as both a proper name and a common word in the same document, especially when one of these usages occurs only in an ambiguous position. For instance, in a document about steel the only occurrence of &quot;Steel Company&quot; happened to start a sentence. This led to an erroneous assignment of the word &quot;Steel&quot; as a common noun. Another example: in a document about &quot;the Acting Judge&quot;, the word &quot;acting&quot; in a sentence starting &quot;Acting on behalf..&quot; was wrongly classified as a proper name.</Paragraph> <Paragraph position="6"> The described approach is very easy to implement and requires neither training nor the installation of other software. The system can be used as is and, by implementing the cache memory of multi-word proper names, it can be targeted to a specific domain.
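A cache memory of multi-word proper names of the kind mentioned above could be sketched as follows. The class, its harvesting heuristic, and the requirement of two or more capitalized tokens are assumptions made for illustration, not the paper's code:

```python
class ProperNameCache:
    """Cache of multi-word proper names observed in unambiguous
    positions, used to resolve their components elsewhere."""

    def __init__(self):
        self.names = set()

    def harvest(self, tokens):
        """Record runs of two or more capitalized tokens that start
        after the first position of a sentence (sentence-initial
        tokens are ambiguous and therefore ignored)."""
        run = []
        for i, tok in enumerate(tokens):
            if i > 0 and tok[0].isupper():
                run.append(tok)
            else:
                if len(run) >= 2:
                    self.names.add(tuple(run))
                run = []
        if len(run) >= 2:
            self.names.add(tuple(run))

    def lookup(self, tokens, i):
        """True if tokens[i:] starts a cached multi-word proper name."""
        return any(tuple(tokens[i:i + len(n)]) == n for n in self.names)
```

Persisting such a cache across the documents of one domain is what would let the generic system be targeted to that domain.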
The system can also be used as a pre-processor to a part-of-speech tagger or a sentence boundary disambiguation program, which can then apply more sophisticated methods to the unresolved capitalized words. In fact, as a by-product of its performance, our system disambiguated about 17% (9 out of 60) of the ambiguous sentence boundaries where an abbreviation was followed by a capitalized word.</Paragraph> <Paragraph position="7"> Apart from collecting an extensive cache of multi-word proper names, another useful strategy which we are going to test in the future is to collect a list of common words which, at the beginning of a sentence, most frequently act as proper names, and to use such a list in a fashion similar to the list of stop-words. Such a list can be collected completely automatically, but this requires a corpus or corpora much larger than the Brown Corpus because the relevant sentences are rather infrequent. We are also planning to investigate the sensitivity of our method to document size in more detail.</Paragraph> </Section> </Paper>