<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1062"> <Title>INTERPRETATION OF PROPER NOUNS FOR INFORMATION RETRIEVAL</Title> <Section position="6" start_page="310" end_page="312" type="evalu"> <SectionTitle> 5. PERFORMANCE EVALUATION </SectionTitle> <Paragraph position="0"> Although we have processed more than one gigabyte of text with the current version of the proper noun categorizer for the TIPSTER 18-month testing, the evaluation reported here is based on 25 randomly selected Wall Street Journal documents, for which the categorizer's output was compared with a proper noun categorization done by a human. Table 1 shows the categorizer's performance over the 588 proper nouns occurring in the test set. In addition to these 588 proper nouns, 14 common words were incorrectly identified as proper nouns because of part-of-speech tagger errors and typos in the original text, and the boundaries of 17 proper nouns were incorrectly recognized because of errors by the general-purpose phrase bracketer.</Paragraph> <Paragraph position="1"> Sixty-five proper nouns were correctly categorized as miscellaneous because they did not belong to any of our 29 meaningful categories. This may be considered a coverage problem in our proper noun categorization scheme rather than an error in our categorizer. Examples of proper nouns in the miscellaneous category are 'Promised Land', 'Mickey Mouse', and 'IUD'. The last row of Table 1 shows the overall precision of our categorizer over the proper nouns that belong to the 29 meaningful categories.</Paragraph> <Paragraph position="2"> Most of the wrongly categorized proper nouns were assigned to the miscellaneous category rather than miscategorized into another meaningful category. The only notable case in which a proper noun was miscategorized into another meaningful category occurred between the city and the province categories.
Our categorizer assigned the province category (IDA's Gazetteer calls states provinces) to 'New York' when the proper noun actually referred to the city.</Paragraph> <Paragraph position="3"> Errors in the categorization of person and city names account for 68% of the total errors. To correct the categorization errors in person names, we are currently experimenting with a list of common first names as a special lexicon to consult when there is neither a match in the prefix and suffix lists nor any context clue pointing to another meaningful category. City names were miscategorized as miscellaneous proper nouns mainly because of a special convention of newspaper text: the locational source of the news, when mentioned at the beginning of the document, is usually written entirely in upper case. For example, if the story is about a company in Dallas, the text will start as follows: DALLAS: American Medical Insurance Inc. said that ... This problem will be solved in the new version of our proper noun categorizer by incorporating a capitalization normalizer which, before part-of-speech tagging, converts each all-upper-case word to lower case except for its first character. We are also in the process of incorporating context information for identifying city names, based on the observation that a city name is usually followed by a country name or by a province name from the United States or Canada.</Paragraph> <Paragraph position="4"> Low precision in categorizing region names such as 'Pacific Northwest' is due to incomplete coverage of possible region names in the proper noun database. We are currently developing a strategy based on context clues using locational prepositions.</Paragraph> <Paragraph position="5"> Table 2 shows the overall recall of our categorizer, which is affected by the proper noun phrase boundary identification errors caused by the general-purpose phrase bracketer.</Paragraph> </Section> </Paper>