<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1107"> <Title>REXTOR: A System for Generating Relations from Natural Language</Title> <Section position="4" start_page="67" end_page="68" type="metho"> <SectionTitle> \[POPULATION-1 IS 11044147\] \[POPULATION-1 RELATED-TO ZIMBABWE\] </SectionTitle> <Paragraph position="0"> Experience from START has shown that a robust full-text natural language question-answering system cannot be realistically expected any time soon. Numerous problems such as intersentential reference, paraphrasing, summarization, common sense implication, and many more, will take a long time to solve satisfactorily. In order to bypass intractable complexities of language, START uses computer-analyzable natural language annotations, which consist of simplified English sentences and phrases, to describe various information segments (which may be text, images, or even video and other multimedia content). These natural language annotations serve as metadata and inform START regarding the type of questions that a particular information segment is capable of answering (Katz, 1997). By performing retrieval on natural language annotations, the system is able to provide knowledge that it may not be able to analyze itself (either language that is too complex or non-textual segments). Because these annotations must be manually generated, expanding START's knowledge base is relatively time-intensive. REXTOR attempts to eliminate the need for human involvement during content analysis, and also aims to serve as the foundation of a natural language information retrieval system. 
Ultimately, we hope that REXTOR will serve as a stepping stone towards a comprehensive system capable of providing users with &quot;just the right information&quot; to queries posed in natural language.</Paragraph> </Section> <Section position="5" start_page="68" end_page="68" type="metho"> <SectionTitle> 3 Previous Work </SectionTitle> <Paragraph position="0"> The concept of indexing more than simple keywords is not new; the idea of indexing (parts of) phrases, for example, is more than a decade old (Fagan, 1987). Arampatzis (1998) introduced the phrase retrieval hypothesis, which asserted that phrases are a better indication of document content than keywords. Several researchers have also explored different techniques of linguistic normalization for information retrieval (Strzalkowski et al., 1996; Zhai et al., 1996; Arampatzis et al., 2000). The performance improvements were neither negligible nor dramatic, but despite the lack of any significant breakthroughs, the authors affirmed the potential value of linguistically-motivated indexing schemes and the advantages they offer over traditional IR.</Paragraph> <Paragraph position="1"> Previous research in linguistically motivated information retrieval concentrated primarily on noun phrases and their attached prepositional phrases. Techniques that involve head/modifier relations have been tried, e.g., indexing adjective/noun and noun/right adjunct pairs (which normalizes variants such as &quot;information retrieval&quot; and &quot;retrieval of information&quot;). However, there has been little experimentation with other types of linguistic relations, e.g., appositives, predicate nominatives (i.e., the is-a relation), predicate adjectives (i.e., the has-property relation), etc. Furthermore, indexing of word pairs and phrases in many previous systems was accomplished by converting those representations into lexical items and atomic terms, indexed in the same manner as single words. 
The treatment of these representational structures using a restrictive bag-of-words paradigm limits the type of queries that may be formulated. For example, treating adjective/noun pairs (\[adj., noun\]) as lexical atoms renders it impossible to find the equivalent of &quot;all big things,&quot; corresponding to the pair \[big, *\].</Paragraph> <Paragraph position="2"> The extraction of these relations from documents has been relatively inefficient and unsystematic. One approach is to first parse the document using a full-text parser, and then extract interesting relations from the resulting parse tree (Fagan, 1987; Grishman and Sterling, 1993; Loper, 2000). This approach is slow and inefficient because full-text parsing is very time-intensive.</Paragraph> <Paragraph position="3"> Due to current limitations of computational technology, only a small fraction of the information gathered by a full parser can be efficiently indexed.</Paragraph> <Paragraph position="4"> For the most part, relations that can be effectively utilized for information retrieval purposes only occupy a few nodes of a (possibly dense) parse tree; thus, most of the knowledge gathered by the parser is thrown away. Also, extracting non-linguistic relations from parse trees is very difficult; many interesting relations (from an IR point of view) have no linguistic foundation, e.g., adjacent word pairs. The other approach to extracting relations from text is to build simple filters for every new relation. This approach is unsystematic, and does not allow for rapid addition of new relations to a system.</Paragraph> <Paragraph position="5"> The REXTOR System utilizes an integrated model to systematically extract arbitrary textual patterns and relations (ternary expressions) from documents. The concept of coupling structure-building actions with parsing originated with augmented transition networks (ATNs) (Thorne et al., 1968; Woods, 1970). 
Similarly, PLNLP (Heidorn, 1972; Jensen et al., 1993) is a programming language for writing phrase structure rules that include specific conditions under which the rule can be applied. These rules may also be augmented by structure-building actions that are to be taken when the rule is applied. However, these systems that attempt full-text parsing are less efficient for information retrieval applications due to the long time necessary to generate full linguistic parse trees. REXTOR was designed with a simple language model and an equally simple, yet expressive, representation of &quot;meaning.&quot;</Paragraph> </Section> <Section position="6" start_page="68" end_page="70" type="metho"> <SectionTitle> 4 Bridging Natural Language and </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="68" end_page="68" type="sub_section"> <SectionTitle> Information Retrieval </SectionTitle> <Paragraph position="0"> In order to bridge the gap between natural language and information retrieval, natural language text must be distilled into a representational structure that is amenable to fast, large-scale indexing. We argue that a finite-state model of natural language with ternary expressions is currently the most suitable combination for this task.</Paragraph> </Section> <Section position="2" start_page="68" end_page="69" type="sub_section"> <SectionTitle> 4.1 Finite-State Language Model </SectionTitle> <Paragraph position="0"> Despite its limitations, a finite-state grammar seems to provide the best natural language model for information retrieval purposes. One of the most notable computational inadequacies of the finite-state model is the absence of a pushdown mechanism to suspend the processing of a constituent at a given level while using the same grammar to process an embedded constituent (Woods, 1970). 
Due to this inadequacy, certain English constructions, such as center embedding, cannot be described by any finite-state grammar (Chomsky, 1959a; Chomsky, 1959b). However, Church (1980) demonstrated that the finite-state language model is adequate to describe a performance model of language (i.e., constrained by memory, attention, and other realistic limitations) that approximates competence (i.e., language ability under optimal conditions without resource constraints). Many phenomena that cannot be handled by finite-state grammars are awkward from a psycholinguistic point of view, and hence rarely seen. More recently, Pereira and Wright (1991) developed formal methods of approximating context-free grammars with finite-state grammars (these approximations overgenerate, although in predictable, systematic ways). Thus, for practical purposes, computationally simple finite-state grammars can be utilized to adequately model natural language.</Paragraph> <Paragraph position="1"> Empirically, the effectiveness of the finite-state language model has been demonstrated in the Message Understanding Conferences (MUCs), which evaluated information extraction (IE) systems on a variety of domain-specific tasks. The conferences have shown that superficial parsing using finite-state grammars performs better than deep parsing using context-free grammars (at least under the current constraints of technology). The NYU team switched over from a system that performed full parsing (PROTEUS) in MUC-5 (Grishman and Sterling, 1993) to a regular expression matching parser in MUC-6 (Grishman, 1995). Full parsing was slow and error-prone, and the process of building a full syntactic analysis involved relatively unconstrained search, which consumed large amounts of both time and space. The longer debug cycles that resulted translated into fewer iterations with which to tune the system within a given amount of time. 
Furthermore, the complexity of a full context-free grammar contributed to maintenance problems; complex interactions within the grammar prevented rapid updating of the system to handle new constructions. Finite-state grammars have been used to extract entities such as proper nouns, names, locations, etc., with relatively high precision. To a lesser extent, these grammars have proven to be effective in identifying syntactic constructions such as noun phrases and verb phrases. FASTUS (Hobbs et al., 1996), the most notable of these systems, is modeled after cascaded, nondeterministic finite-state automata.</Paragraph> <Paragraph position="2"> The finite-state transducers are &quot;cascaded&quot; in that they are arranged in series; each one maps the output structures from the previous transducer into structures that comprise the input to the next transducer.</Paragraph> <Paragraph position="3"> There are many similarities between information extraction and building effective representational structures for information retrieval. Both tasks involve identifying entities (e.g., phrases) and the relationships between those entities.</Paragraph> <Paragraph position="4"> Thus, the application of proven information extraction techniques (i.e., finite-state technology) to information retrieval offers promise in raising the performance of IR systems.</Paragraph> </Section> <Section position="3" start_page="69" end_page="70" type="sub_section"> <SectionTitle> 4.2 Ternary Expressions </SectionTitle> <Paragraph position="0"> Ternary (three-place) expressions currently appear to be the most suitable representational structure for meaning extracted from text. They may be intuitively viewed as subject-relation-object triples, and can easily express many types of relations, e.g., subject-verb-object relations, possession relations, etc. 
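The cascading arrangement described above (each transducer consuming the previous stage's output) can be illustrated with a minimal sketch; the two rules below are our own hypothetical examples, not FASTUS's actual grammar:

```python
import re

# Each stage is a regular-expression transducer; the output string of one
# stage becomes the input to the next. The rules here are hypothetical.
STAGES = [
    # Stage 1: bracket noun groups (an optional determiner, any number of
    # adjectives, then one or more nouns) in word/TAG text.
    (re.compile(r"(?:\w+/DT )?(?:\w+/JJ )*(?:\w+/NN ?)+"), r"[NG \g<0>] "),
    # Stage 2: bracket a noun group, a preposition, and a second noun
    # group as a prepositional-phrase attachment.
    (re.compile(r"(\[NG [^][]*\] ?)\w+/IN (\[NG [^][]*\] ?)"), r"[PP \1\2] "),
]

def cascade(tagged: str) -> str:
    """Run the transducers in series over a word/TAG string."""
    for pattern, template in STAGES:
        tagged = pattern.sub(template, tagged)
    return tagged
```

Because each stage is a pure string-to-string transduction, stages can be added or revised independently, which is the maintenance property that favored cascades over a monolithic grammar.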
From a syntactic point of view, ternary expressions may be viewed as typed binary relations. Given the binary branching hypothesis of linguistic theory, ternary expressions are theoretically capable of expressing any arbitrary tree -- thus, ternary expressions are compatible with linguistic theory. From a semantic point of view, ternary expressions may be viewed as two-place predicates, and can be manipulated using predicate logic. Finally, ternary expressions are highly amenable to rapid large-scale indexing, which is a necessary prerequisite of information retrieval systems. Although other representational structures (e.g., trees or case frames) may be better adapted for some purposes, they are much more difficult to index and retrieve efficiently due to their size and complexity.</Paragraph> <Paragraph position="1"> In fact, indexing linguistic tree structures has been attempted (Smeaton et al., 1994), with very disappointing results: precision actually decreased due to the inability to handle variations in tree structure (i.e., the same semantic content could be expressed using different syntactic structures), and to the poor quality of the full-text natural language parser, which was also rather slow. Despite recent advances, full-text natural language parsers are still relatively error-prone; indexing incorrect parse trees is a source of performance degradation. Furthermore, matching trees and sub-trees is a computationally intensive task, especially since full linguistic parse trees may be relatively deep.</Paragraph> <Paragraph position="2"> Relations are easier to match because they are typically much simpler than parse trees. For example, the tree \[\[shiny happy people \] \[of \[Wonderland\]\]\] may be &quot;flattened&quot; into three relations: < shiny describes people > < happy describes people > < people related-to Wonderland > Indexing case frames has also been attempted (Croft and Lewis, 1987; Loper, 2000), but with limited success. 
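The &quot;flattening&quot; above can be sketched in a few lines; the function and encoding are our own illustration, not REXTOR's implementation:

```python
# A toy version of the flattening shown above; the function name and
# encoding are hypothetical, not REXTOR code.
def flatten(adjectives, head, related=None):
    """Emit one <adj describes head> triple per adjective, plus an
    optional <head related-to X> triple."""
    triples = [(adj, "describes", head) for adj in adjectives]
    if related is not None:
        triples.append((head, "related-to", related))
    return triples

# [[shiny happy people] [of [Wonderland]]] flattens into three relations.
triples = flatten(["shiny", "happy"], "people", related="Wonderland")
```

Matching against such triples is a constant-time lookup per triple, in contrast to the sub-tree matching discussed above.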
Full semantic analysis is still an open research problem, especially in the general domain. Since full semantic analysis cannot be performed without full-text parsing, case frame analysis inherits the unreliability of current parsers. Furthermore, semantic analysis requires extensive knowledge in the lexicon, which is extremely time-intensive to construct. Finally, due to the complex structure of case frames, they are more difficult to store and index than ternary expressions. Since ternary expressions are merely three-place relations, they may be indexed and retrieved in much the same way as rows within the table of a relational database;4 hence, well-known optimizations for databases may be applied for extremely high performance.</Paragraph> <Paragraph position="3"> Previous linguistically-motivated indexing schemes may easily be reformulated using ternary expressions. For example, indexing adjacent word pairs consists of indexing adjacent words with the adjacent relation. In fact, all pairs (e.g., adjective-noun, head-modifier) can be reformulated as ternary expressions by assigning a type to the pair. This finer granularity allows the capture of more intricate relations between words in a document.</Paragraph> </Section> </Section> <Section position="7" start_page="70" end_page="72" type="metho"> <SectionTitle> 5 The REXTOR System </SectionTitle> <Paragraph position="0"> Using its finite-state language model, the REXTOR System generates a set of ternary expressions that correspond to the content of a part-of-speech-tagged input document. Currently, the Brill Tagger (Brill, 1992) (with minor postprocessing) is used for the part-of-speech (POS) tagging. The relations construction process consists of two distinct processes, each guided by its own externally specified grammar file. Extraction rules are applied to match arbitrary patterns of text, based either on one of thirty-nine POS tags or on exact words. 
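The database analogy can be made concrete. The paper notes only that its first ternary-expression indexer used a SQL database; the schema below is our own assumption. It also shows how the pair \[big, *\] (&quot;all big things,&quot; which defeats a bag-of-words index of lexical atoms) becomes an ordinary query:

```python
import sqlite3

# A sketch of the database analogy; this schema and these rows are our
# own illustrative assumptions, not the system's actual implementation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relations (subj TEXT, rel TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO relations VALUES (?, ?, ?)",
    [("big", "describes", "wolf"),
     ("bad", "describes", "wolf"),
     ("dark", "describes", "forest")],
)

# The pair [big, *] ("all big things") is an ordinary wildcard query here.
rows = conn.execute(
    "SELECT obj FROM relations WHERE subj = ? AND rel = ?",
    ("big", "describes"),
).fetchall()
```

Standard database machinery (indexes on each column, query planning) then applies directly to relation retrieval.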
Whenever an item is extracted, a corresponding relation rule is triggered, which handles the actual generation of the ternary expressions (relations).</Paragraph> <Paragraph position="1"> 4In fact, our first implementation of a ternary expressions indexer used a SQL database.</Paragraph> <Section position="1" start_page="70" end_page="70" type="sub_section"> <SectionTitle> 5.1 Extraction Rules </SectionTitle> <Paragraph position="0"> Extraction rules are used to extract arbitrary patterns of text according to a grammar specification.</Paragraph> <Paragraph position="1"> The REXTOR grammar is written as regular expression rules, which are computationally equivalent to finite-state automata.5 Writing grammar rules in this fashion allows for perspicuity, the property whereby permitted types of constructions are readily apparent from the rules. Such a human-readable formulation simplifies maintenance of the grammar.</Paragraph> <Paragraph position="2"> The extraction stage of the REXTOR System performs a no-lookahead left-to-right scan of every input sentence, identifies the longest matching pattern (from any grammar rule), reduces the input sequence based on the matched rule, and continues with the next unmatched word. If a word cannot be included in any grammar rule, it is skipped.</Paragraph> <Paragraph position="3"> An extraction rule takes the following form:</Paragraph> <Paragraph position="5"> The rule can be read as EntityType is defined as template. A successful match of the pattern in template signifies a successfully extracted entity.</Paragraph> <Paragraph position="6"> The template consists of a series of legal tokens, which are shown in Table 1. In addition, token modifiers (also in Table 1) can alter the meaning of the immediately preceding token. Tokens surrounded by curly braces ({}) are saved as bound variables, which can be later utilized to build relations (ternary expressions). 
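The longest-match scan described above can be sketched as follows; encoding rules as regular expressions over the POS-tag sequence is our own simplification, not REXTOR's actual grammar syntax:

```python
import re

# Hypothetical rule encoding: each rule is a regex over the space-joined
# POS-tag sequence (a simplification of REXTOR's grammar syntax).
RULES = [
    ("NounGroup", re.compile(r"(?:DT )?(?:JJ )*(?:NN )+")),
    ("Adjective", re.compile(r"JJ ")),
]

def scan(tagged):
    """No-lookahead left-to-right scan: at each position take the longest
    match from any rule; words matching no rule are skipped."""
    out, i = [], 0
    while i < len(tagged):
        tags = " ".join(tag for _, tag in tagged[i:]) + " "
        best = None  # (length in tokens, rule name)
        for name, pattern in RULES:
            m = pattern.match(tags)
            if m:
                n = len(m.group().split())
                if best is None or n > best[0]:
                    best = (n, name)
        if best is None:
            i += 1  # no rule covers this word: skip it
        else:
            n, name = best
            out.append((name, [w for w, _ in tagged[i:i + n]]))
            i += n
    return out

spans = scan([("the", "DT"), ("big", "JJ"), ("wolf", "NN"), ("runs", "VB")])
```

In the example, NounGroup absorbs all three leading words because it is the longest match, and the verb, covered by no rule, is skipped.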
These variables are referenced numerically starting at zero (e.g., the 0th bound variable).</Paragraph> </Section> <Section position="2" start_page="70" end_page="72" type="sub_section"> <SectionTitle> 5.2 Relation Rules </SectionTitle> <Paragraph position="0"> A relation rule is triggered by the successful extraction of a particular entity (EntityType). The relations grammar directs the construction of the actual ternary expression. A relation rule takes the following form: EntityType :=> <atom1 atom2 atom3>; The EntityType is the trigger for the relation, i.e., the rule is applied whenever a string of that type is extracted. The right hand side of the relation rule is the ternary expression to be generated, which is a triple composed of three atoms. Valid atoms are shown in Table 2. They are either string literals or they manipulate the bound variables saved from the extraction process in some manner.</Paragraph> <Paragraph position="1"> 5For an algorithm converting regular expressions to nondeterministic finite-state automata, please refer to (Aho et al., 1988), Chapter 3.</Paragraph> <Paragraph position="2"> This matches any word tagged as the part-of-speech POS. This matches a specific word (string) of a specific part-of-speech (POS). This matches any extracted string of type EntityType. This expression matches any one of the alternative tokens given within the parentheses. Matches are attempted in the order in which they are written, e.g., the first token is tried first.</Paragraph> <Paragraph position="3"> Token Modifier Description This modifier matches zero or more occurrences of the previous token. This modifier matches zero or one occurrence of the previous token. This modifier matches one or more occurrences of the previous token. Evaluates to the nth bound variable of the trigger EntityType, interpreted as a string.</Paragraph> <Paragraph position="4"> Evaluates to the nth bound variable of the trigger EntityType, interpreted as a list of strings. 
The extraction rule token inside the bound variable is stripped of its outermost * or +, and the bound variable is broken into a list according to this pattern. For example, {JJX*} is interpreted as a list of JJX, or adjectives. This expression extracts a bound variable nested inside other bound variables. The ith bound variable of trigger EntityType is extracted; if this item is of type EntityType1, then the jth bound variable is extracted (the expression returns false if the entity types do not match); each comma separated unit is interpreted in this manner, up to an arbitrary depth.</Paragraph> <Paragraph position="5"> This compound expression evaluates to the disjunction of an arbitrary number of valid atoms (as defined in this table). Each alternative is evaluated in a left to right order; the disjunction evaluates to the first alternative that returns a non-empty string. A literal string.</Paragraph> <Paragraph position="6"> DT for determiners, JJX for adjectives, JJR for comparative adjectives, JJS for superlative adjectives, NNX for singular or mass nouns, NNS for plural nouns, NNPX for singular proper nouns, NNPS for plural proper nouns, IN for prepositions.)</Paragraph> </Section> <Section position="3" start_page="72" end_page="72" type="sub_section"> <SectionTitle> 5.3 Examples </SectionTitle> <Paragraph position="0"> A few extraction and relation rules are given in Figure 1. The first extraction rule defines a NounGroup as a sequence consisting of: an optional possessive pronoun or determiner, any number of adjectives, one or more nouns (of any type). Also, the sequence of adjectives is saved as the 0th bound variable, and the sequence of nouns is saved as the 1st bound variable. 
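The list interpretation of bound variables described above can be sketched as a small helper (hypothetical names, not REXTOR code):

```python
# Hypothetical helper (not REXTOR code) showing how a list-valued bound
# variable expands one relation into several distinct triples.
def expand(subject, relation, obj):
    """Expand a possibly list-valued subject into distinct triples."""
    subjects = subject if isinstance(subject, list) else [subject]
    return [(s, relation, obj) for s in subjects]

# A relation whose subject is the list-valued bound variable (big, bad).
triples = expand(["big", "bad"], "describes", "wolf")
```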
The rules for PrepositionalPhrase and ComplexNounGroup can be interpreted similarly.</Paragraph> <Paragraph position="1"> Consider the following noun phrase: the big, bad wolf of the dark forest REXTOR recognizes two NounGroups in the above phrase: the big, bad wolf and the dark forest. The corresponding relation rule triggers, and generates the following relations: < (big, bad) describes wolf > < (dark) describes forest > Note that the first bound variable in NounGroup is interpreted as a list; thus, the above two relations expand into three distinct relations when completely enumerated: < big describes wolf > < bad describes wolf > < dark describes forest > The ability to interpret bound variables as a list of strings allows for easy manipulation of repeated structure, like textual lists or enumerations. In addition, the entire noun phrase the big, bad wolf of the dark forest will be recognized as a ComplexNounGroup. This will result in the following relation: < wolf related-to forest > The relation rule associated with ComplexNounGroup involves extracting nested bound variables. The first atom evaluates to the 1st bound variable (a NounGroup) inside the 0th bound variable inside the trigger item ComplexNounGroup. The third atom is similarly evaluated.</Paragraph> </Section> </Section> </Paper>