<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0114">
  <Title>CATEGORIZING AND STANDARDIZING PROPER NOUNS FOR EFFICIENT INFORMATION RETRIEVAL</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CATEGORIZING AND STANDARDIZING PROPER NOUNS
FOR EFFICIENT INFORMATION RETRIEVAL
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 College of Engineering and Computer Science
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper, we describe the most recent implementation and evaluation of the proper noun categorization and standardization module of the DR-LINK document detection system being developed at Syracuse University, under the auspices of ARPA's TIPSTER program. We also discuss the expansion of group common nouns and group proper nouns to enhance retrieval recall. Successful proper noun boundary identification within the part of speech tagger is essential for successful categorization. The proper noun classification module is designed to assign a category code to each proper noun entity, using 30 categories generated from corpus analysis.</Paragraph>
    <Paragraph position="1"> Standardization of variant proper nouns occurs at three levels of processing. Expansion of group proper nouns and group common nouns is performed on queries.</Paragraph>
    <Paragraph position="2"> Standardization and categorization are performed on queries and documents. DR-LINK's overall precision for proper noun categorization was 93%, based on 589 proper nouns occurring in the evaluation set.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="154" type="metho">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> In information retrieval, proper nouns, group proper nouns, and group common nouns present unique problems. Proper nouns are recognized as an important source of information for detecting relevant documents in information retrieval and extracting contents from a text (Rau, 1991). Yet most of the unknown words in texts which degrade the performance of natural language processing information retrieval systems are proper nouns. Group proper nouns (e.g., Middle East) and group common nouns (e.g., third world) will not match on their constituents unless the group entity is mentioned in the document. The proper noun processor herein described is a module in the DR-LINK system (Liddy et al., in press) for document detection being developed under the auspices of ARPA's TIPSTER program.</Paragraph>
    <Paragraph position="1"> Our approach to solving the group common noun and the group proper noun problem has been to expand the appropriate terms in a query, such as 'third world,' to all possible names and variants of third world entities. For all proper nouns, our system assigns categories from a proper noun classification scheme to every proper noun in both documents and queries to permit proper noun matching at the category level. Category matching is more efficient than keyword matching if the query requires entities of a particular type. Standardization  provides a means of efficiently categorizing and retrieving documents containing variant forms of a proper noun.</Paragraph>
    <Paragraph position="2"> 2. Proper Noun Boundary</Paragraph>
    <Section position="1" start_page="0" end_page="154" type="sub_section">
      <SectionTitle>
Identification
</SectionTitle>
      <Paragraph position="0"> In our most recent implementation, which improves on our initial attempt (Paik et al., in press), documents are first processed using a probabilistic part of speech tagger (Meteer et al., 1991). Then a proper noun boundary identifier utilizes proper noun part of speech tags from the previous stage to bracket adjacent proper nouns. Additionally, heuristics developed through corpus analysis are applied to bracket proper noun phrases with embedded conjunctions and prepositions as one unit. For example, a proper noun from a predefined list is bracketed together with a following non-adjacent proper noun if 'of' is the embedded preposition. Examples of such preceding proper nouns include Council, Ministry, Secretary, and University.</Paragraph>
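The bracketing heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the DR-LINK implementation: the tag name NP, the trigger set OF_HEADS, and the function name bracket_proper_nouns are assumptions made for the example, and only the embedded-'of' case is handled.

```python
# Hypothetical tag name and trigger list; the paper names the trigger words
# but does not specify the tag set or the full list.
PROPER = "NP"
OF_HEADS = {"Council", "Ministry", "Secretary", "University"}


def bracket_proper_nouns(tagged):
    """Group (token, tag) pairs into bracketed proper noun phrases."""
    phrases, current = [], []
    for i, (token, tag) in enumerate(tagged):
        following = tagged[i + 1:i + 2]  # empty list at the end of input
        if tag == PROPER:
            # Adjacent proper nouns extend the current bracket.
            current.append(token)
        elif (token == "of" and current and current[-1] in OF_HEADS
              and following and following[0][1] == PROPER):
            # Embedded preposition: keep 'of' inside the bracket.
            current.append(token)
        else:
            # Any other token closes the current bracket, if one is open.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


sentence = [("The", "DT"), ("Ministry", "NP"), ("of", "IN"), ("Defense", "NP"),
            ("met", "VBD"), ("John", "NP"), ("Smith", "NP")]
print(bracket_proper_nouns(sentence))
```

Checking the next token before absorbing 'of' keeps a phrase such as "Ministry of defense" (with a common noun after the preposition) from being bracketed as one proper noun.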
      <Paragraph position="1"> The success ratio of our proper noun boundary identification module is approximately 96% in comparison to our initial system's 95% (Paik et al., in press). This improvement was achieved by re-ordering the data flow. A general-purpose phrase bracketer, which was previously applied before the proper noun boundary identification heuristics for non-adjacent proper nouns, is now applied to texts after all the proper noun categorization and standardization steps. Thus, we have eliminated one major source of error, which is the conflict between the general-purpose noun phrase bracketer and the proper noun boundary identification heuristics. For example, embedded prepositions in a proper noun phrase are sometimes recognized as the beginnings of prepositional phrases by the general-purpose phrase bracketer. The remaining 3% of error is due mainly to incorrect proper noun tags assigned to the uncommon first word of a sentence by the part of speech tagger.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="154" end_page="155" type="metho">
    <SectionTitle>
3. Proper Noun Classification Scheme
</SectionTitle>
    <Paragraph position="0"> Our proper noun classification scheme, which was developed through corpus analysis of newspaper texts, is organized as a hierarchy which consists of 9 branching nodes and 30 terminal nodes. Currently, we use only the terminal nodes to assign categories to proper nouns in texts. Based on an analysis of 588 proper nouns from a set of randomly selected documents from the Wall Street Journal, we found that our 29 meaningful categories correctly accounted for 89% of all proper nouns in texts. We reserve the last category as a miscellaneous category. Figure 1 shows a hierarchical view of our proper noun categorization scheme.</Paragraph>
    <Paragraph position="1"> The system categorizes all identified proper nouns using several methods applied in sequence. The first approach is to compare the proper noun with a list of all identified prefixes, infixes and suffixes for possible categorization based on these lexical clues. If the system cannot identify a category in this stage, the proper noun is passed to an alias database to determine if the proper noun has an alternate name form. If this is the case, the proper noun is standardized and categorized at this point. If there is no match in the alias database, the proper noun moves to the knowledge-base look up. These databases have been constructed using online lexical resources including the Gazetteer, the World Factbase, and the Executive Desk Reference. If the knowledge-base look up is not successful, the proper noun is run through a context heuristics application developed from corpus analysis, which suggests certain categories of proper nouns. For example, if a proper noun is followed by a comma and another proper noun which has been identified as a state, we will label the proper noun as a city name, e.g., Time, Illinois. Finally, if the proper noun has still not been categorized, it is compared against a list of first names generated from the corpus for a final personal name categorization check. If the proper noun has not been categorized at this stage, it will be labeled with the 'miscellaneous' category code.</Paragraph>
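The staged fallback described above might look like the following sketch. Every table entry, category code, and function name here is a hypothetical stand-in: the paper does not list its lexical clues, alias entries, or the names of its 30 categories, and infix clues are omitted for brevity.

```python
# All tables below are tiny illustrative stand-ins, not DR-LINK data.
PREFIX_CLUES = {"Mr.": "PERSON", "Dr.": "PERSON"}
SUFFIX_CLUES = {"Inc.": "COMPANY", "Corp.": "COMPANY"}
ALIASES = {"Big Blue": ("IBM", "COMPANY")}  # alias -> (official form, category)
KNOWLEDGE_BASE = {"Illinois": "STATE", "Japan": "COUNTRY"}
FIRST_NAMES = {"John", "Mary"}


def categorize(name, next_name=None):
    """Return (standard_form, category), trying each stage in order.

    next_name is the proper noun that follows `name` after a comma, if any.
    """
    words = name.split()
    if not words:
        return name, "MISCELLANEOUS"
    # Stage 1: lexical clues (prefixes and suffixes).
    if words[0] in PREFIX_CLUES:
        return name, PREFIX_CLUES[words[0]]
    if words[-1] in SUFFIX_CLUES:
        return name, SUFFIX_CLUES[words[-1]]
    # Stage 2: alias database, which also standardizes the name.
    if name in ALIASES:
        return ALIASES[name]
    # Stage 3: knowledge-base look up.
    if name in KNOWLEDGE_BASE:
        return name, KNOWLEDGE_BASE[name]
    # Stage 4: context heuristic, e.g. "Time, Illinois" implies a city.
    if next_name is not None and KNOWLEDGE_BASE.get(next_name) == "STATE":
        return name, "CITY"
    # Stage 5: first-name list as a final personal name check.
    if words[0] in FIRST_NAMES:
        return name, "PERSON"
    return name, "MISCELLANEOUS"


print(categorize("Time", next_name="Illinois"))
```

Ordering the stages from cheapest and most reliable evidence to weakest mirrors the sequence in the text; a name falls through to the miscellaneous code only when every stage fails.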
    <Paragraph position="2"> For the categorization system to work efficiently, variant terms must be standardized. This procedure is performed at three levels, with the prefixes, infixes and suffixes standardized first. Next, the proper nouns in alias forms are standardized into the official form where available. These standardization techniques improve the retrieval performance. Finally, if a proper noun is mentioned at least twice in a document, for instance first as Analog Devices, Inc. and later as Analog Devices, the partial string match is co-indexed with the full form for reference resolution. This technique allows for a full representation of a proper noun entity. Figure 2 illustrates the flow of the proper noun categorization system within the first stages of DR-LINK processing.</Paragraph>
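The third level, co-indexing a shortened mention with its fuller form, could be approximated as below. This is only a sketch under the assumption that a partial match means a prefix match ending at a word boundary; the paper does not define the matching rule, and the function names are invented for the example.

```python
def is_partial_match(short, full):
    """True if `full` starts with `short` and the match ends at a word boundary."""
    if not full.startswith(short):
        return False
    rest = full[len(short):]
    return rest == "" or rest[0] in " ,"


def coindex_mentions(mentions):
    """Map each mention to the longest mention in the document that extends it."""
    resolved = {}
    for mention in mentions:
        key = mention.rstrip(",. ")  # ignore trailing punctuation when matching
        full = mention
        for candidate in mentions:
            if is_partial_match(key, candidate) and len(candidate) > len(full):
                full = candidate
        resolved[mention] = full
    return resolved


mentions = ["Analog Devices, Inc.", "Analog Devices", "Intel"]
print(coindex_mentions(mentions))
```

The word-boundary check is a design choice of this sketch: it stops a mention such as "Intel" from being linked to an unrelated longer name that merely shares its first letters.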
    <Paragraph position="3"> When standardization and categorization have been completed, a new field is added to both the query and the document containing the proper noun and the corresponding category codes. These fields are then used for efficient matching and representation.</Paragraph>
    <Paragraph position="4"> 4. Use of Proper Nouns in Matching Either the lexical entry for the proper noun or the category code may be used for matching documents to queries. For example, if a query is about a border incursion, we can limit the potentially relevant documents to those documents which contain at least two different country names, flagged by the two country category codes in the proper noun field. Using the standardized form of a proper noun reduces the number of possible variants which the system would otherwise need to search for.</Paragraph>
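The border incursion example amounts to a filter on the proper noun field. A minimal sketch, assuming the field is a list of (standard form, category code) pairs and that COUNTRY is the relevant code (both are assumptions, since the paper does not show the field layout):

```python
def mentions_two_countries(proper_noun_field):
    """True if the field holds at least two distinct country names."""
    countries = {name for name, code in proper_noun_field if code == "COUNTRY"}
    return len(countries) >= 2


# Hypothetical proper noun fields for two documents.
documents = {
    "doc1": [("Iraq", "COUNTRY"), ("Kuwait", "COUNTRY"), ("Reuters", "COMPANY")],
    "doc2": [("Japan", "COUNTRY"), ("Japan", "COUNTRY")],
}
candidates = [doc_id for doc_id, field in documents.items()
              if mentions_two_countries(field)]
print(candidates)
```

Because the names are already standardized, repeated mentions of the same country collapse into one set element and do not satisfy the requirement on their own.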
    <Paragraph position="5"> While the category matching strategy is useful in many cases, an expansion of a group proper noun such as 'European Community', which occurs in a query, to member country names is also beneficial. Relevant documents for a query about sanctions against Japan by European Community countries are likely to mention actions against Japan by member countries by name rather than the term in the query, European Community. We are currently using a proper noun expansion database with 168 expandable entries for query processing. In addition, certain common nouns or noun phrases in queries such as 'socialist countries' need to be expanded to the names of the countries which satisfy the definition of the term to improve performance in detecting relevant documents. The system consults a list of common nouns and noun phrases which can be expanded into proper nouns and actively searches for these terms during the query processing stage. Currently, the common noun expansion database has 37 entries. [Figure 1: Proper Noun Categorization Scheme]</Paragraph>
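Query-side expansion of group terms can be pictured as a table look up. The sketch below is illustrative only: the two table entries are small hypothetical subsets rather than the contents of the 168-entry and 37-entry databases, and simple substring matching stands in for whatever term detection DR-LINK actually uses.

```python
# Illustrative subsets only; the real expansion databases are not published here.
GROUP_EXPANSIONS = {
    "european community": ["France", "Germany", "Italy"],
    "socialist countries": ["Cuba", "Vietnam"],
}


def expand_query(query):
    """Return the member names for every group term found in the query."""
    lowered = query.lower()
    expanded = []
    for group, members in GROUP_EXPANSIONS.items():
        if group in lowered:
            for member in members:
                if member not in expanded:  # keep order, drop duplicates
                    expanded.append(member)
    return expanded


print(expand_query("Sanctions against Japan by European Community countries"))
```

Expansion is applied to queries only, as in the text, so that documents naming individual member countries can still match a query that uses the group term.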
    <Paragraph position="6"> Proper noun information is first used in the DR-LINK system as an addition to the subject-content-based filtering module, which uses a scheme of 122 subject field codes (SFCs) from a machine readable dictionary rather than keywords to represent documents. Although SFC representation and matching provides a very good first level of document filtering, not all proper nouns reveal subject information, so the proper noun concepts in texts are not actually represented in the SFC vectors.</Paragraph>
    <Paragraph position="7"> In our new implementation, categorized and standardized proper nouns are combined with Text Structure (Liddy et al, in press-b) information for matching queries against documents. Text Structure is a recognition of a discernible, predictable schema of texts of a particular type. The Text Structurer module in the DR-LINK system delineates the discourse-level organization of document content so that processing at later stages can focus on those components identified by the Text Structurer as being the most likely location in the document where the information requested in a query is to be found.</Paragraph>
    <Paragraph position="8"> All proper nouns in a document collection are indexed in an inverted file with the document accession number, the Text Structure component in which the proper noun was located, and the category code. For processing the queries for their proper noun requirements, we have developed a Boolean criteria script which determines which proper nouns or combinations of proper nouns are needed from certain Text Structure components in each query. These requirements are then run against the proper noun inverted file to rank documents according to the extent to which they match these requirements.</Paragraph>
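One way to picture the inverted file and the ranking step is sketched below. It is a simplification made for illustration: each requirement is a single (proper noun, component) pair and a document's score is the count of requirements it meets, whereas the actual Boolean criteria script is not described in enough detail to reproduce. The component labels LEAD and BACKGROUND are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(records):
    """records: iterable of (proper_noun, doc_id, component, category)."""
    index = defaultdict(list)
    for noun, doc_id, component, category in records:
        index[noun].append((doc_id, component, category))
    return index


def rank_documents(index, requirements):
    """Rank documents by how many (proper_noun, component) requirements they meet.

    A component of None means the proper noun may occur anywhere in the document.
    """
    scores = defaultdict(int)
    for noun, component in requirements:
        matched = {doc_id for doc_id, comp, _ in index.get(noun, [])
                   if component is None or comp == component}
        for doc_id in matched:
            scores[doc_id] += 1
    # Highest score first; ties broken by document identifier.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


records = [
    ("Japan", "d1", "LEAD", "COUNTRY"),
    ("Japan", "d2", "BACKGROUND", "COUNTRY"),
    ("France", "d1", "LEAD", "COUNTRY"),
]
index = build_inverted_file(records)
print(rank_documents(index, [("Japan", None), ("France", "LEAD")]))
```

Storing the Text Structure component in each posting is what lets a requirement be restricted to one part of the document without rescanning the text.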
    <Paragraph position="9"> Also, the categorization information of proper nouns is currently used in a later module of the system, which extracts concepts and relations from text to produce a more refined representation. For example, proper nouns may reveal the location of a company or the nationality of an individual.</Paragraph>
    <Paragraph position="10"> We do not have information retrieval evaluation results based on the new implementation using the proper noun information in conjunction with the Text Structure information. However, in previous testing of our initial system which did not utilize Text Structure information (Paik et al., in press), reranking of documents received from the SFC module, based on the degree of proper noun requirements matching a set of queries against a document collection, resulted in placing all the relevant documents within the top 28% of the document collection. It should also be noted that the output of the SFC module plus the proper noun matching module produced very reasonable precision (.22 for the 11-point precision average), even though the combination of these two modules was never intended to function as a stand-alone retrieval system.</Paragraph>
    <Paragraph position="11"> Finally, the proper noun extraction and categorization module, although developed as part of the DR-LINK system, could be used to provide improved document representation for any information retrieval system.</Paragraph>
    <Paragraph position="12"> The standardization and categorization features permit queries and documents to be matched with greater precision, while the expansion functions of group proper nouns and group common nouns improve recall.</Paragraph>
  </Section>
</Paper>