File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/93/h93-1062_metho.xml
Size: 7,653 bytes
Last Modified: 2025-10-06 14:13:25
<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1062"> <Title>INTERPRETATION OF PROPER NOUNS FOR INFORMATION RETRIEVAL</Title> <Section position="3" start_page="0" end_page="309" type="metho"> <SectionTitle> 2. BOUNDARY IDENTIFICATION </SectionTitle> <Paragraph position="0"> The proper noun processor described here is a module in the DR-LINK System (Liddy et al., in press) for document detection, being developed under the auspices of DARPA's TIPSTER Program. In our implementation, documents are first processed using a probabilistic part-of-speech tagger (Meeter et al., 1991) and a general-purpose noun phrase bracketer which identifies proper nouns and proper noun phrases in texts. We have developed a special-purpose proper noun phrase boundary identification module which extends the proper noun bracketing to include proper noun phrases with embedded conjunctions and prepositions. The module uses heuristics developed through corpus analysis. The success ratio is approximately 95%.</Paragraph> <Paragraph position="1"> Incorrectly identified proper noun phrases have two main causes: 1) the part-of-speech tagger identifies common words as proper nouns; and 2) the general-purpose noun phrase bracketer and the special-purpose proper noun boundary identifier conflict.</Paragraph> <Paragraph position="2"> While the first source of error is difficult to fix, we are currently experimenting with applying the special-purpose proper noun boundary identifier before the general-purpose noun phrase bracketer. Our preliminary results show that this would result in a 97% correct ratio for identifying boundaries of proper nouns.</Paragraph> </Section> <Section position="4" start_page="309" end_page="309" type="metho"> <SectionTitle> 3. 
CATEGORIZATION </SectionTitle> <Paragraph position="0"> Next, the system categorizes all the identified proper nouns using several methods: 1) comparison to lists of known prefixes, infixes and suffixes for each category of proper noun; 2) consulting an alias database consisting of alternate names for some proper nouns; 3) look-up in a knowledge base of proper nouns and their categories extracted from online lexical resources (e.g., World Factbase, Gazetteer); and finally, 4) applying context heuristics, developed from corpus analysis, which capture the contexts that suggest certain categories of proper nouns.</Paragraph> <Paragraph position="1"> While being categorized, the proper nouns are standardized in three ways: 1) prefixes, infixes, and suffixes of proper nouns are standardized; 2) proper nouns in alias forms are translated into their official form; and 3) a partial string of a proper noun which was mentioned in full earlier in the document is co-indexed for reference resolution.</Paragraph> <Paragraph position="2"> A new field containing the list of each standardized proper noun and its category code is added to the document for later use in several stages of matching and representation. The first two techniques improve retrieval performance, while the co-indexing of references produces a full representation of a proper noun entity and all its accompanying information. Figure 2 shows a schematic view of DR-LINK's proper noun categorizer.</Paragraph> </Section> <Section position="5" start_page="309" end_page="310" type="metho"> <SectionTitle> 4. USE OF PROPER NOUN IN MATCHING </SectionTitle> <Paragraph position="0"> When matching documents to queries, the match can be made either on the lexical entry for the proper noun or at the category level, as each proper noun occurring in a document is recorded in the proper noun field of the document along with its appropriate category code. 
For example, if a query is about a business merger, we can limit the potentially relevant documents to those which contain at least two different company names, flagged by two company category codes in the proper noun field. For many queries, using the standardized form of a proper noun reduces the number of possible variants which the system would otherwise need to search for. For example, 'MCI Communications Corp.', 'MCI Communications', and 'MCI' are all standardized as 'MCI Communications CORP' by our proper noun categorizer. This process is similar in purpose to the common practice in standard retrieval matching of reducing variants by stemming. However, stemming is not a viable means for standardizing proper names.</Paragraph> <Paragraph position="1"> While the category matching strategy is useful in many cases, it is also beneficial to expand a group proper noun which occurs in a query, such as 'European Community', to its member country names. Documents relevant to a query about sanctions against Japan by European Community countries are likely to mention actions against Japan by member countries by name rather than by the term in the query, European Community. We are currently using a proper noun expansion database with 168 expandable entries for query processing. In addition, certain common nouns or noun phrases in queries, such as 'socialist countries', need to be expanded to the names of the countries which satisfy the definition of the term to improve performance in detecting relevant documents. The system consults a list of common nouns and noun phrases which can be expanded into proper nouns and actively searches for these terms during the query processing stage. 
Currently, the common noun expansion database has 37 entries.</Paragraph> <Paragraph position="2"> Proper noun information is first used in the DR-LINK system as an addition to the subject-content based filtering module, which uses a scheme of 122 subject field codes (SFCs) from a machine readable dictionary rather than keywords to represent documents. Although SFC representation and matching provides a very good first level of document filtering, not all proper nouns reveal subject information, so the proper noun concepts in texts are not actually represented in the SFC vectors.</Paragraph> <Paragraph position="3"> To process the queries for their proper noun requirements, we have developed a Boolean criteria script which determines which proper nouns or combinations of proper nouns are needed by each query. This requirement is then run against the proper noun field of each document to rank documents according to the extent to which they match this requirement. In the recent testing of our system, these values were used to rerank the ranked list of documents received from the SFC module. The results of this reranking placed all the relevant documents within the top 28% of the database.</Paragraph> <Paragraph position="4"> It should also be noted that the output of the SFC module plus the proper noun matching module yielded very reasonable precision (.22 for the 11-point precision average), even though the combination of these two modules was not intended to function as a stand-alone retrieval system. Also, the categorization information of proper nouns is currently used in a later module of the system, which extracts concepts and relations from text to produce a more refined representation. For example, proper nouns reveal the location of a company or the nationality of an individual. 
The proper noun extraction and categorization module, although developed as part of the DR-LINK System, could be used to provide improved document representation for any information retrieval system, because it permits queries and documents to be matched with greater precision and the expansion functions improve recall.</Paragraph> </Section></Paper>
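<!--
Illustrative sketch (not part of the original paper): the standardization, category-level
matching, and group expansion steps described in Sections 3 and 4 could look roughly like
the Python below. The table contents, the category code "COMPANY", the company name
"Acme Corp", the partial member list for 'European Community', and all function names are
assumptions made for illustration only; they are not DR-LINK's actual data or interfaces.

```python
from dataclasses import dataclass

# Hypothetical alias table: variant form mapped to the standardized form.
# The MCI variants are the ones quoted in the text; the table itself is assumed.
ALIASES = {
    "MCI Communications Corp.": "MCI Communications CORP",
    "MCI Communications": "MCI Communications CORP",
    "MCI": "MCI Communications CORP",
}

# Hypothetical expansion table for group proper nouns.
# The member list is deliberately partial and purely illustrative.
EXPANSIONS = {
    "European Community": ["France", "Germany", "Italy"],
}


@dataclass(frozen=True)
class ProperNoun:
    name: str      # standardized form
    category: str  # category code, e.g. "COMPANY" (assumed code name)


def standardize(name):
    """Map a known variant to its standardized form; leave unknown names as-is."""
    return ALIASES.get(name, name)


def proper_noun_field(mentions):
    """Build a document's proper noun field from (surface form, category) pairs."""
    return {ProperNoun(standardize(name), cat) for name, cat in mentions}


def satisfies_merger_requirement(field):
    """Category-level match: at least two different company names."""
    companies = {pn.name for pn in field if pn.category == "COMPANY"}
    return len(companies) >= 2


def expand_query_terms(terms):
    """Append member names after each expandable group proper noun."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(EXPANSIONS.get(term, []))
    return expanded
```

Because variants are standardized before the field is built, a document that mentions only
'MCI' and 'MCI Communications Corp.' counts as naming one company, so it would not pass
the two-company check for a merger query.
-->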