File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1604_metho.xml
Size: 18,085 bytes
Last Modified: 2025-10-06 14:08:35
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1604"> <Title>Exploiting Paraphrases in a Question Answering System</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Question Answering System for </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Technical Domains </SectionTitle> <Paragraph position="0"> Over the past few years our research group has developed an Answer Extraction system (ExtrAns) that works by transforming documents and queries into a semantic representation called Minimal Logical Form (MLF) (Mollá et al., 2000a) and derives the answers by logical proof from the documents. A full linguistic (syntactic and semantic) analysis, complete with lexical alternations (synonyms and hyponyms), is performed. While documents are processed in an off-line stage, the query is processed on-line.</Paragraph> <Paragraph position="1"> Two real-world applications have so far been implemented with the same underlying technology. The original ExtrAns system (Mollá et al., 2000b) is used to extract answers to arbitrary user queries over the Unix documentation files (&quot;man pages&quot;). A set of 500+ unedited man pages has been used for this application. An on-line demo of ExtrAns can be found at the project web page. More recently we tackled a different domain, the Airplane Maintenance Manuals (AMM) of the Airbus A320 (Rinaldi et al., 2002b), which offered the additional challenges of an SGML-based format and a much larger size (120MB). Despite being developed initially for a specific domain, ExtrAns has demonstrated a high level of domain independence.</Paragraph> <Paragraph position="2"> As we work on relatively small volumes of data we can afford to process (in an off-line stage) all the documents in our collection rather than just a few selected paragraphs (see Figure 2). Clearly in some situations (e.g.
processing incoming news) such an approach might not be feasible and paragraph indexing techniques would need to be used. Our current approach is particularly targeted to small and medium-sized collections. In an initial phase all multi-word expressions from the domain are collected and structured in an external resource, which we will refer to as the TermBase (Rinaldi et al., 2003; Dowdall et al., 2003). The document sentences (and user queries) are syntactically processed with the Link Grammar (LG) parser (Sleator and Temperley, 1993), which uses a grammar with a wide coverage of English and has a robust treatment of ungrammatical sentences and unknown words. The multi-word terms from the thesaurus are identified and passed to the parser as single tokens. This prevents (futile) analysis of the internal structure of terms (see Figure 1), simplifying parsing by 46%. This solves the first of the problems that we have identified in the introduction (&quot;The Parsing Problem&quot;).</Paragraph> <Paragraph position="3"> In later stages of processing, a corpus-based approach (Brill and Resnik, 1994) is used to deal with ambiguities that cannot be solved with syntactic information only, in particular attachments of prepositional phrases, gerunds and infinitive constructions. ExtrAns adopts an anaphora resolution algorithm (Mollá et al., 2003) that is based on Lappin and Leass' approach (Lappin and Leass, 1994). The original algorithm, which was applied to the syntactic structures generated by McCord's Slot Grammar (McCord et al., 1992), has been ported to the output of Link Grammar. So far the resolution is restricted to sentence-internal pronouns but the same algorithm can be applied to sentence-external pronouns too. A lexicon of nominalisations based on NOMLEX (Meyers et al., 1998) is used for the most important cases.
The main problem here is that the semantic relationship between the base words (mostly, but not exclusively, verbs) and the derived words (mostly, but not exclusively, nouns) is not sufficiently systematic to allow a derivation lexicon to be compiled automatically. Only in relatively rare cases is the relationship as simple as with to edit &lt;a text&gt; ↔ editor of &lt;a text&gt; / &lt;text&gt; editor, as the effort that went into building resources such as NOMLEX also shows.</Paragraph> <Paragraph position="4"> User queries are processed on-line and converted into MLFs (possibly expanded by synonyms) and proved by refutation over the document knowledge base (see Figure 3). Pointers to the original text attached to the retrieved logical forms allow the system to identify and highlight those words in the retrieved sentence that contribute most to that particular answer. When the user clicks on one of the answers provided, the corresponding document will be displayed with the relevant passages highlighted. The meaning of the documents and of the queries produced by ExtrAns is expressed by means of Minimal Logical Forms (MLFs). The MLFs are designed so that they can be found for any sentence (using robust approaches to treat very complex or ungrammatical sentences), and they are optimized for NLP tasks that involve the semantic comparison of sentences, such as Answer Extraction.</Paragraph> <Paragraph position="5"> The expressivity of the MLFs is minimal in the sense that the main syntactic dependencies between the words are used to express verb-argument relations, and modifier and adjunct relations. However, complex quantification, tense and aspect, temporal relations, plurality, and modality are not expressed.
One of the effects of this kind of underspecification is that several natural language queries, although slightly different in meaning, produce the same logical form.</Paragraph> <Paragraph position="6"> The main feature of the MLFs is the use of reification (the expression of abstract concepts as concrete objects) to achieve flat expressions (Mollá et al., 2000b). The MLFs are expressed as conjunctions of predicates with all the variables existentially bound with wide scope. For example, the MLF of the sentence &quot;cp will quickly copy the files&quot; is:</Paragraph> <Paragraph position="8"> prop(quickly,p3,[e4]).</Paragraph> <Paragraph position="9"> In other words, there is an entity x1 which represents an object of type cp and of type command, there is an entity x6 (a file), there is an entity e4, which represents a copying event where the first argument is x1 and the second argument is x6, there is an entity p3 which states that e4 is done quickly, and the event e4, that is, the copying, holds. The entities o1, o2, o3, e4, and p3 are the result of reification. The reification of the event, e4, has been used to express that the event is done quickly. The other entities are not used in this MLF, but other more complex sentences may need to refer to the reification of properties (adjective-modifying adverbs) or object predicates (non-intersective adjectives such as the alleged suspect).</Paragraph> <Paragraph position="10"> ExtrAns finds the answers to the questions by forming the MLFs of the questions and then running Prolog's default resolution mechanism to find those MLFs that can prove the question. When no direct proof for the user query is found, the system is capable of relaxing the proof criteria in a stepwise manner. First, hyponyms of the query terms will be added as disjunctions in the logical form of the question, thus making it more general but still logically correct.
If that fails, the system will attempt approximate matching, in which the sentence (or sentences) with the highest overlap of predicates with the query is retrieved. The (partially) matching sentences are scored and the best fits are returned. In the case that this method finds too many answers because the overlap is too low, the system will attempt key-word matching, in which syntactic criteria are abandoned and only information about word classes is used. This last step corresponds approximately to a traditional passage-retrieval methodology with consideration of the POS tags.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Dealing with Paraphrases </SectionTitle> <Paragraph position="0"> The system is capable of dealing with paraphrases at two different levels. On the phrase level, different surface realizations (terms) which refer to the same domain concept will be mapped into a common identifier (synset identifier). On the sentence level, paraphrases which involve a (simple) syntactic transformation will be dealt with by mapping them into the same logical form. In this section we will describe these two approaches and discuss ways to cope with complex types of paraphrases.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Identifying Terminological Paraphrases </SectionTitle> <Paragraph position="0"> During the construction of the MLFs, thesaurus terms are replaced by their synset identifiers. This results in an implicit 'terminological normalization' for the domain.
The benefit to the QA process is an assurance that a query and answer need not involve exactly the same surface realization of a term.</Paragraph> <Paragraph position="1"> Utilizing the synsets in the semantic representation means that when the query includes a term, ExtrAns returns sentences that logically answer the query, involving any known paraphrase of that term.</Paragraph> <Paragraph position="2"> For example, the logical form of the query Where are the stowage compartments installed? is translated internally into the Horn query (2).</Paragraph> <Paragraph position="3"> (2) evt(install,A,[B,C]), object(D,E,[B]), object(s_stowage_compartment,G,[C]) This means that a term (belonging to the same synset as stowage compartment) is involved in an install event with an anonymous object. If there is an MLF from the document that can match example (2), then it is selected as a candidate answer and the sentence it originates from is shown to the user. The process of terminological variation is well investigated (Ibekwe-SanJuan and Dubois, 2002; Daille et al., 1996; Ibekwe-Sanjuan, 1998). The primary focus has been to use linguistically based variation to expand existing term sets through corpus investigation or to produce domain representations. A subset of such variations identifies terms which are strictly synonymous. ExtrAns gathers these morpho-syntactic variations into synsets. The sets are augmented with terms exhibiting three weaker synonymy relations described by Hamon &amp; Nazarenko (2001). These synsets are organized into a hyponymy (isa) hierarchy, a small example of which can be seen in Figure 5. Figure 4 shows a schematic representation of this process.</Paragraph> <Paragraph position="4"> The first stage is to normalize any terms that contain punctuation by creating a punctuation-free version and recording the fact that the two are strictly synonymous.
Further processing is involved in terms containing brackets to determine if the bracketed token is an acronym or simply optional. In the former case an acronym-free term is created and the acronym is stored as a synonym of the remaining tokens which contain it as a regular expression. So evac is synonymous with evacuation and ohsc is synonymous with overhead stowage compartment. In cases such as emergency (hard landings) the bracketed tokens cannot be interpreted as acronyms and so are not removed.</Paragraph> <Paragraph position="5"> The synonymy relations are identified using the terminology tool Fastr (Jacquemin, 2001). Every token of each term is associated with its part-of-speech, its morphological root, and its synonyms. Phrasal rules represent the manner in which tokens combine to form multi-token terms, and feature-value pairs carry the token-specific information. Metarules license the relation between two terms by constraining their phrase structures in conjunction with the morphological and semantic information on the individual tokens.</Paragraph> <Paragraph position="6"> The metarules can identify simple paraphrases that result from morpho-syntactic variation (cargo compartment door → doors of the cargo compartment), terms with synonymous heads (electrical cable → electrical line), terms with synonymous modifiers (fastener strip → attachment strip) and both (functional test → operational check). For a description of the frequency and range of types of variation present in the AMM see Rinaldi et al. (2002a).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Identifying Syntactic Paraphrases </SectionTitle> <Paragraph position="0"> An important effect of using a simplified semantic-based representation such as the Minimal Logical Forms is that various types of syntactic variations are automatically captured by a common representation.
This ensures that many potential paraphrases in a user query can map to the same answer in the manual.</Paragraph> <Paragraph position="1"> For example, the question shown in Figure 6 can be answered thanks to the combination of two factors. On the lexical level ExtrAns knows that APU is an abbreviation of Auxiliary Power Unit, while on the syntactic level the active and passive voices (supplies vs supplied with) map into the same underlying representation (the same MLF).</Paragraph> <Paragraph position="2"> Another type of paraphrase which can be detected at this level is the kind that was classified as type (3) in the introduction. For example, the question Is the sensor connected to the APU ECB? can locate the answer This sensor is connected to the Electronic Control Box (ECB) of the APU. This has been achieved by introducing meaning postulates that operate at the level of the MLFs (such as &quot;any predicate that affects an object will also affect the of-modifiers of that object&quot;).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Weaker Types of Paraphrases </SectionTitle> <Paragraph position="0"> When the thesaurus definition of terminological synonymy fails to locate an answer from the document collection, ExtrAns explores weaker types of paraphrases, where the equivalence between the two terms might not be complete.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="11" type="metho"> <SectionTitle> TERM </SectionTitle> <Paragraph position="0"> doors of the cargo compartment cargo compartment door Now the alternative objects are in a logical OR relation. This query finds the answer in Figure 7 (where stowage compartment is a hyperonym of overhead stowage compartment).</Paragraph> <Paragraph position="1"> We have implemented a very simple ad-hoc algorithm to determine lexical hyponymy between terms.
Term A is a hyponym of term B if (i) A has more tokens than B, (ii) all the tokens of B are present in A, and (iii) both terms have the same head. There are three provisions. First, ignore terms with dashes and brackets as cargo compartment is not a hyponym of cargo - compartment and this relation (synonymy) is already known from the normalisation process. Second, compare lemmatised versions of the terms to capture that stowage compartment is a hyperonym of overhead stowage compartments. Finally, the head of a term is the rightmost non-symbol token (i.e. a word) which can be determined from the part-of-speech tags. This hyponymy relation is comparable to the insertion variations defined by Daille et al. (1996).</Paragraph> <Paragraph position="2"> The expressivity of the MLF can further be expanded through the use of meaning postulates of the type: &quot;If x is installed in y, then x is in y&quot;. This ensures that the query Where are the equipment and furnishings? extracts the answer The equipment and furnishings are installed in the cockpit.</Paragraph> </Section> <Section position="7" start_page="11" end_page="11" type="metho"> <SectionTitle> 4 Related Work </SectionTitle> <Paragraph position="0"> The importance of detecting paraphrasing in Question Answering has been shown dramatically in TREC9 by the Falcon system (Harabagiu et al., 2001), which made use of an ad-hoc module capable of caching answers and detecting question similarity. As in that particular evaluation the organisers deliberately used a set of paraphrases of the same questions, such an approach certainly helped in boosting the performance of the system.
In an environment where the same question (in different formulations) is likely to be repeated a number of times, a module capable of detecting paraphrases can significantly improve the performance of a Question Answering system.</Paragraph> <Paragraph position="1"> Another example of application of paraphrases for Question Answering is given in (Murata and Isahara, 2001), which further argues for the importance of paraphrases for other applications such as summarisation, error correction and speech generation. Our approach for the acquisition of terminological paraphrases might have some points in common with the approach described in (Terada and Tokunaga, 2001). The motivation that they bring forward for the necessity of identifying abbreviations is related to the problem that we have called &quot;the Parsing Problem&quot;. A very different approach to paraphrases is taken in (Takahashi et al., 2001) where they formulate the problem as a special case of Machine Translation, where the source and target language are the same but special rules, based on different parameters, license different types of surface realizations.</Paragraph> <Paragraph position="2"> Hamon &amp; Nazarenko (2001) explore the terminological needs of consulting systems. This type of IR guides the user in query/keyword expansion or proposes various levels of access into the document base on the original query. A method of generating three types of synonymy relations is investigated using general language and domain specific dictionaries.</Paragraph> </Section> </Paper>