<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1107"> <Title>REXTOR: A System for Generating Relations from Natural Language</Title> <Section position="8" start_page="72" end_page="75" type="evalu"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> Informal analysis of documents using REXTOR reveals that it can potentially serve as an effective framework for extracting &quot;meaning&quot; from documents. In particular, the system is capable of identifying the following types of linguistic constructions and generating relations from them: * Simple sentences can be extracted by noting a simple NounGroup VerbGroup NounGroup pattern. From this, subject-verb-object (SVO) relations can be derived.</Paragraph> <Paragraph position="1"> * Predicative nominatives can be recognized by identifying the &quot;be&quot; verb and the NounGroup directly following it. These constructions may be useful in establishing ontological hierarchies, i.e., is-a trees.</Paragraph> <Paragraph position="2"> * Predicative adjectives can be recognized by the &quot;be&quot; verb and a succession of one or more adjectives (or an adjectival phrase). They may provide additional information regarding the attributes of entities, e.g., has-property.</Paragraph> <Paragraph position="3"> * Appositives are characteristically offset by commas and usually contain a single noun phrase; thus, they can be recognized relatively easily. Common in prose, appositives offer a wealth of additional information regarding various entities, e.g., location of sites, age or position of people, etc.</Paragraph> <Paragraph position="4"> * Prepositional phrases are relatively easy to extract, and may supply valuable relations that increase the precision of information retrieval systems. 
Ternary expressions allow for a better representation of prepositional phrases (compared to pairs) because they allow the preposition to more specifically determine the type of relation (thus, examples like &quot;boat by the water&quot; and &quot;boat under the water,&quot; which have completely different meanings, may be indexed separately and distinctly). However, the prepositional phrase attachment problem (in the general-domain case) is still an open research topic, and thus poses some problems to content analysis. Regardless, for the purposes of information retrieval, it may be acceptable to err on the side of over-generation in considering attachment, i.e., enumerate all possible relations. This will no doubt generate a large number of (possibly incorrect) relations, and more research is required to determine effective methods of controlling this explosion.</Paragraph> <Paragraph position="5"> * Relative clauses of some types can be identified by a finite-state language model. They may supply additional useful SVO relations for indexing purposes.</Paragraph> <Paragraph position="6"> We believe that future breakthroughs in natural language information retrieval will occur in the generation of meaningful relations. Although the finite-state language model of REXTOR is powerful enough to extract many linguistically interesting constructions, the approach is not fundamentally new. What differentiates our system from previous work such as FASTUS (Hobbs et al., 1996) is that REXTOR not only provides a mechanism for extraction, but also introduces the paradigm of ternary expressions to capture document content for information retrieval. The relations view of natural language documents is highly amenable to integration with information retrieval systems. 
Through a relations representation, REXTOR is able to distinguish the subtle differences in meaning between the pairs of sentences and phrases given in the introduction: (1) The man ate the dog.</Paragraph> <Paragraph position="7"> < man is-subject-of eat > < dog is-object-of eat > (1') The dog ate the man.</Paragraph> <Paragraph position="8"> < man is-object-of eat > < dog is-subject-of eat > (2) The meaning of life < meaning possessive-relation life > (2') A meaningful life < meaningful describes life > (3) The bank of the river < bank possessive-relation river > (3') The bank near the river < bank near-relation river > The ability to extract subject-verb-object relations, e.g., (1) and (1'), allows an IR system to distinguish between two very different statements. Similarly, REXTOR can differentiate between prepositional phrases (2) and adjectival modification (2'). Although the system does not have any notion of semantics (e.g., word sense), syntax may offer crucial clues to meaning in cases such as (3) and (3').</Paragraph> <Paragraph position="9"> Similarly, REXTOR is capable of performing linguistic normalization at the syntactic and morphological levels. Consider these sets of examples originally presented in the introduction: (4) What is Bill Gates' net worth? (4') What is the net worth of Bill Gates? 
< &quot;net worth&quot; related-to &quot;Bill Gates&quot; > (5) John gave the book to Mary.</Paragraph> <Paragraph position="10"> (5') John gave Mary the book.</Paragraph> <Paragraph position="11"> (5&quot;) Mary was given the book by John.</Paragraph> <Paragraph position="12"> < John is-subject-of give > < book is-direct-object-of give > < Mary is-indirect-object-of give > (6) The president surprised the country with his actions.</Paragraph> <Paragraph position="13"> < president is-subject-of surprise > < country is-object-of surprise > < surprise with actions > (6') The president's actions surprised his country.</Paragraph> <Paragraph position="14"> < actions related-to president > < actions is-subject-of surprise > < country is-object-of surprise > (7) Over 22 million people live in Taiwan.</Paragraph> <Paragraph position="15"> < &quot;22 million&quot; is-quantity-of people > < people is-subject-of live > < live in Taiwan > (7') The population of Taiwan is 22 million. < population is &quot;22 million&quot; > < population related-to Taiwan > With relations, different surface forms of expressing the &quot;possession relation&quot; may be normalized into the same structure, e.g., (4) and (4'). Similarly, alternative surface realizations of the same verb-headed relation can be recognized and equated with each other by writing different extraction rules that generate the same relations, e.g., (5), (5'), and (5&quot;). The process of normalization will hopefully lead to greater recall in information retrieval systems. Note that (6) and (6') demonstrate a limitation of REXTOR, namely its inability to deal with alternative realizations of verb arguments. Also, the system does not have any notion of semantics, and thus is unable to equate two sentences that have the same meaning, e.g., (7) and (7'). 
Although it is certainly possible to manually encode such semantic knowledge as extraction and relation rules, this solution is far from elegant.</Paragraph> <Paragraph position="16"> A potential solution to this semantic variations problem is to borrow the solution employed by START. A ternary expression representation of natural language mimics its syntactic organization, and hence sentences that differ in surface form but are close in meaning will not map into the same structure. In order to solve this problem, START deploys &quot;S-rules&quot; (Katz and Levin, 1988), which are reversible syntactic/semantic transformational rules that render explicit the relationship between alternate realizations of the same meaning. For example, a buy expression is semantically equivalent to a sell expression, except the subject and indirect objects are exchanged. Because many verbs can undergo the same alternations, they can in fact be grouped into verb classes, and hence governed by the same S-rules. Thus, S-rules can be viewed as metarules applied over ternary expressions. A similar technique for handling both syntactic and semantic variations can be found in (Grishman, 1995; Jacquemin et al., 1997). Both utilize metarules (e.g., for passive/active transformation) applied over textual patterns in order to generate and handle variations.</Paragraph> <Paragraph position="17"> Below we present a concrete example of how REXTOR could potentially improve the performance of existing keyword search engines dramatically. We indexed an electronic version of the Worldbook Encyclopedia at the sentence level using the following two techniques: 1. A simple inverted keyword index. All stop-words are thrown out, and all content words are stemmed. Retrieval was performed by matching content words in the query with content words in the encyclopedia articles.</Paragraph> <Paragraph position="18"> 2. A ternary expressions index using the relations generated by REXTOR. 
The grammar was written to extract possessive relations, description relations (adjective-noun modification), prepositional relations, subject-verb relations, and verb-object relations. Retrieval was performed by matching ternary expressions from the query (extracted using a separate grammar) with ternary expressions extracted from the encyclopedia articles.</Paragraph> <Paragraph position="19"> The following shows the results of the keyword search engine: Question: What do frogs eat? Answer: (R1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.</Paragraph> <Paragraph position="20"> (R2) Bowfins eat mainly other fish, frogs, and crayfish.</Paragraph> <Paragraph position="21"> (R3) Most cobras eat many kinds of animals, such as frogs, fishes, birds, and various small mammals.</Paragraph> <Paragraph position="22"> (R4) One group of South American frogs feeds mainly on other frogs.</Paragraph> <Paragraph position="23"> (R5) Cranes eat a variety of foods, including frogs, fishes, birds, and various small mammals. (R6) Frogs eat many other animals, including spiders, flies, and worms.</Paragraph> <Paragraph position="24"> (R7) ...</Paragraph> <Paragraph position="25"> After removing stopwords from the query, our simple keyword search engine returned 33 results that contain the keywords frog and eat. However, only (R1), (R4), and (R6) correctly answer the user query; the other results answer the question &quot;What eats frogs?&quot; or otherwise coincidentally contain those two terms. (Apparently, our poor frog has more predators than prey.) A bag-of-words approach fundamentally cannot differentiate between a query in which the frog is in the subject position and a query in which the frog is in the object position. However, by parsing subject-verb-object relations using REXTOR, a ternary expressions indexer can effectively filter out irrelevant results, returning the three correct responses. 
While indexing relations may potentially lower recall, due to unanticipated constructions, it has tremendous potential to increase precision.</Paragraph> <Paragraph position="26"> Furthermore, consider the following queries, in which REXTOR would outperform traditional keyword engines: (8) How many South Koreans were recently allowed to visit their North Korean relatives? (9) Where did John see Mary? (10) Regarding what issue did the president of Russia criticize China? (11) Are electronics the biggest export from Japan to the United States? A traditional search engine using the bag-of-words approach would suffer from poor precision when faced with the above queries. Many verbs take arguments of the same semantic type, and in most of these sentences, reordering the verb arguments drastically alters their meaning. For example, a keyword search engine would not be able to distinguish between a question regarding South Koreans visiting North Korea and North Koreans visiting South Korea (8) because both queries have the same keyword content. Similarly, the keyword approach would be unable to determine who did the seeing (9), or who did the criticizing (10). Modification relations also pose difficulties to the bag-of-words paradigm, e.g., was it the North Korean or South Korean relatives (8)? Was it the president of Russia or the president of China (10)? Furthermore, there are some constructions whose meaning critically depends on relations between the entities, e.g., (11), because &quot;from X to Y&quot; and &quot;from Y to X&quot; usually differ in meaning. The current version of REXTOR is merely a prototype; thus, we have made minimal attempts to optimize its processing speed. On a Pentium III 933 MHz Linux system with 512 megabytes of RAM, analyzing a sentence in the Worldbook Encyclopedia required 0.0378 seconds on average. This translates into a content analysis rate of roughly 340 words a second, or approximately 11.4 megabytes of text per hour. 
Although the system composed of REXTOR and the ternary expressions indexer is slower than the simple keyword indexer, we believe that the potential to dramatically increase precision offsets the longer processing time. However, REXTOR is not a memory-intensive system; RAM utilization during trial runs was rather low. This paper presents only the first stage of a linguistically motivated information retrieval system. Although we have presented the results of a preliminary investigation into the effectiveness of this approach, we cannot draw any conclusions until more comprehensive tests have been conducted. However, many prior techniques used in natural language information retrieval (e.g., head/modifier pairs) can be expressed within the REXTOR framework, and furthermore the system provides a playground for experimenting with new techniques. Thus, we believe that our approach shows great promise in moving towards higher-performance information retrieval systems.</Paragraph> </Section> </Paper>