File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/93/w93-0114_evalu.xml
Size: 5,090 bytes
Last Modified: 2025-10-06 14:00:15
<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0114"> <Title>CATEGORIZING AND STANDARDIZING PROPER NOUNS FOR EFFICIENT INFORMATION RETRIEVAL</Title> <Section position="6" start_page="155" end_page="158" type="evalu"> <SectionTitle> 5. Performance Evaluation </SectionTitle> <Paragraph position="0"> While we are currently processing more than one gigabyte of text using the new version of the proper noun categorizer for the TIPSTER 24 month testing, the evaluation of the proper noun categorizer reported here is based on 25 randomly selected Wall Street Journal documents, whose proper nouns were compared to the categorization done by a human. This document set was also used in evaluating our initial version of the categorizer (Paik et al, in press). Table 1 shows the performance of the categorizer on the 589 proper nouns occurring in the test set. In addition to these 589 proper nouns, 14 common words were incorrectly identified as proper nouns due to errors by the part of speech tagger and typos in the original text; and the boundaries of 11 proper nouns were incorrectly recognized due to unusual proper noun phrases such as 'Virginia Group to Alleviate Smoking in Public', which the proper noun boundary identification heuristics failed to bracket.</Paragraph> <Paragraph position="1"> 64 proper nouns were correctly categorized as miscellaneous as they did not belong to any of our 29 meaningful categories. This may be considered a coverage problem in our proper noun categorization scheme, not an error in our categorizer. Some examples of the proper nouns belonging to the miscellaneous category are: 'Promised Land', 'Mickey Mouse', and 'IUD'. The last row of Table 1 shows the overall precision of our categorizer based on the proper nouns which belong to the 29 meaningful categories.</Paragraph> <Paragraph position="2"> In our initial implementation (Paik et al, in press), errors in categorizing person and city names accounted for 68% of the total errors.
To improve performance, we added a list of common first names, which was semi-automatically extracted from Associated Press and Wall Street Journal corpora, as a special lexicon to consult when no other categorization procedure produces a match. This addition improved our precision in categorizing person names from the initial system's 46% to 90%.</Paragraph> <Paragraph position="3"> The errors in categorizing city names in our initial categorizer were mainly due to two problems: 1) The locational source of the news, when mentioned at the beginning of the document, is usually capitalized in the Wall Street Journal. This special convention of newspaper texts caused the locational proper nouns (usually city names) to be miscategorized as miscellaneous; and 2) some city names were not in our proper noun knowledge base. The first problem was handled in our new proper noun categorizer by moving the locational information of the news story to a new field, '&lt;DATELINE&gt;', and normalizing capitalization (from all upper case text to mixed case) at the document preprocessing stage, before the part of speech tagging. For example, if a story is about a company in Dallas then the text will be as ... For the second problem, we incorporated a context rule for identifying city names into our categorizer. The rule is that city names are followed by a country name or a province name from the United States and Canada unless the name is very well known. For example, 'Van Nuys' can now be categorized as a city name as it is followed by a valid United States province name.</Paragraph> <Paragraph position="4"> ... 
Van Nuys, Calif....</Paragraph> <Paragraph position="5"> By adding the above new procedures to our categorization system, as well as adding to our proper noun knowledge base some well known city names which are not province capitals or heavily populated places based on IDA's Gazetteer, the precision of categorizing city names has improved from the initial system's 25% to 100%.</Paragraph> <Paragraph position="6"> The overall precision of our new proper noun categorizer has improved to 93% from the 77% of our initial attempt (Paik et al, in press), including proper nouns which are correctly categorized as miscellaneous. This significant advancement was achieved by adding a few sensible context heuristics and modifying the knowledge base. These additions and modifications were based on the analysis of randomly selected documents. We feel the limitations of not manually updating our proper noun knowledge base for uncommon proper nouns when confronted with proper nouns such as 'Place de la Reunion' and 'Pacific Northwest'. Thus, we are currently developing a strategy based on context clues, using locational prepositions as well as appositional phrases, to improve categorization of uncommon proper nouns.</Paragraph> <Paragraph position="7"> Table 2 shows the overall recall figure of our categorizer, which is affected by the proper noun phrase boundary identification errors caused by the general-purpose phrase bracketer.</Paragraph> </Section></Paper>
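<!--
The city-name context rule described in Section 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is_city_name, the lookup tables REGION_NAMES and WELL_KNOWN_CITIES, and the comma-based pattern are all hypothetical, and the tables hold only a few sample entries. The sketch handles single-word region names only.

```python
import re

# Hypothetical, deliberately tiny lookup tables. The paper's actual
# knowledge base and gazetteer entries are not given in the text.
REGION_NAMES = {"Calif.", "California", "Texas", "Ontario", "Canada"}
WELL_KNOWN_CITIES = {"Dallas", "Chicago"}


def is_city_name(candidate: str, following_text: str) -> bool:
    """Return True if the candidate proper noun looks like a city name.

    A candidate qualifies if it is on the well-known list, or if the
    text right after it is a comma followed by a known region name.
    """
    if candidate in WELL_KNOWN_CITIES:
        return True
    # Assumed surface pattern: "<candidate>, <Region>" as in "Van Nuys, Calif."
    match = re.match(r"\s*,\s*([A-Z][A-Za-z]*\.?)", following_text)
    return bool(match) and match.group(1) in REGION_NAMES
```

With these sample tables, is_city_name("Van Nuys", ", Calif....") accepts the candidate because a known region name follows it, while a candidate followed by an unknown token such as ", Inc." is rejected.
-->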