<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0508">
  <Title>Answering Questions in the Genomics Domain</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The original Question Answering system
</SectionTitle>
    <Paragraph position="0"> ExtrAns is a Question Answering system aimed at restricted domains, in particular terminology-rich domains. While open-domain Question Answering systems typically target large text collections and use relatively little linguistic information, ExtrAns answers questions over such restricted domains by exploiting linguistic knowledge from the documents and terminological knowledge about the specific domain. Various applications of the ExtrAns system have been developed, from the original prototype aimed at the Unix documentation files (Mollá et al., 2000) to a version targeting the Aircraft Maintenance Manuals (AMM) of the Airbus A320 (Mollá et al., 2003; Rinaldi et al., 2004). In the present paper we describe current work in applying the system to a different domain and text type: research papers in the genomics area.</Paragraph>
    <Paragraph position="1"> Our approach to Question Answering is computationally intensive: it performs a deeper linguistic analysis at the cost of higher processing time. The documents are analyzed in an off-line stage and transformed into a semantic representation (called 'Minimal Logical Forms' or MLFs), which is stored in a Knowledge Base (KB). In an on-line phase (see fig. 2) the user queries are analyzed using the same basic machinery (the cost of processing them is negligible, so there is no visible delay) and their semantic representation is matched against the KB. If a match is found, the sentences that gave rise to the match are presented as possible answers to the question.</Paragraph>
    <Paragraph position="2"> Documents (and queries) are first tokenized, then they go through a terminology-processing module.</Paragraph>
    <Paragraph position="3"> If a term belonging to a synset in the terminological knowledge base is detected, then the term is replaced by a synset identifier in the logical form.</Paragraph>
    <Paragraph position="4"> This results in a canonical form, where the synset identifier denotes the concept that each of the terms in the synset names. In this way any term contained in a user query is automatically mapped to all its variants. This approach amounts to an implicit 'terminological normalization' for the domain, where the synset identifier can be taken as a reference to the 'concept' that each of the terms in the synset describes (Kageura, 2002).</Paragraph>
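    <Paragraph> The normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ExtrAns implementation: the synset table, the identifier format ('syn_001') and the function name 'canonical' are hypothetical, and terms are assumed to have been detected already.</Paragraph>
    <Paragraph><![CDATA[
```python
# Hypothetical synset table: each identifier groups the variants of one concept.
SYNSETS = {
    "syn_001": ["interferon", "ifn"],
    "syn_002": ["adenovirus infection", "ad infection"],
}

# Invert the table so that every variant points to its synset identifier.
TERM_TO_SYNSET = {
    variant: synset_id
    for synset_id, variants in SYNSETS.items()
    for variant in variants
}


def canonical(term):
    """Return the synset identifier of a known term, else the term itself."""
    return TERM_TO_SYNSET.get(term.lower(), term)
```
]]></Paragraph>
    <Paragraph> Because a query term and a document term from the same synset receive the same identifier, matching can then operate on identifiers alone.</Paragraph>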
    <Paragraph position="5"> ExtrAns depends heavily on its use of logical forms, which are designed so that they are easy to build and to use, yet expressive enough for the task at hand (Mollá, 2001). The logical forms and associated semantic interpretation methods are designed to cope with problematic sentences, including very long sentences, sentences with spelling mistakes, and structures that are not recognized by the syntactic analyzer. An advantage of ExtrAns' Minimal Logical Forms (MLFs) is that they can be produced with minimal domain knowledge. This makes our technology easily portable to different domains. The domain has a real impact only on the preprocessing of the input text and on the creation of a thesaurus that reflects the specific terms used in the chosen domain, their lexical relations and their word senses.</Paragraph>
    <Paragraph position="6"> Unlike sentences in documents, user queries are processed on-line and the resulting MLFs are proved by deduction over the MLFs of document sentences stored in the KB. When no direct answer for a user query can be found, the system is able to relax the proof criteria in a stepwise manner. First, hyponyms are added to the query terms. This makes the query more general but maintains its logical correctness. If no answers can be found or the user determines that they are not good answers, the system will attempt approximate matching, in which the sentence that has the highest overlap of predicates with the query is retrieved. The matching sentences are scored and the best matches are returned. The MLFs contain pointers to the original text which allow ExtrAns to identify and highlight those words in the retrieved sentence that contribute most to a particular answer. An example of the output of ExtrAns can be seen in fig. 3. When the user clicks on one of the answers provided, the corresponding document will be displayed with the relevant passages highlighted. Another click displays the answer in the context of the document and allows the user to verify the justification of the answer.</Paragraph>
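    <Paragraph> The stepwise relaxation of the matching criteria can be illustrated with a deliberately simplified sketch. It assumes that each sentence is reduced to a flat set of predicate names, whereas ExtrAns proves full logical forms; the toy knowledge base, the thesaurus fragment and the function names are hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
# Hypothetical toy knowledge base: sentence id -> set of predicate names.
KB = {
    "s1": {"infection", "use", "model"},
    "s2": {"adenovirus_infection", "cause", "change"},
    "s3": {"transfection", "change"},
}

# Hypothetical thesaurus fragment: term -> its hyponyms (more specific terms).
HYPONYMS = {"infection": {"adenovirus_infection"}}


def satisfied(pred, preds):
    """A predicate is satisfied by itself or by one of its hyponyms."""
    return pred in preds or bool(HYPONYMS.get(pred, set()) & preds)


def answer(query):
    """Return (strategy, sentence ids), relaxing the match step by step."""
    # Step 1: strict match, every query predicate must occur in the sentence.
    strict = [sid for sid, preds in KB.items() if query <= preds]
    if strict:
        return "strict", strict

    # Step 2: add hyponyms, so a more specific term also satisfies the query.
    relaxed = [sid for sid, preds in KB.items()
               if all(satisfied(p, preds) for p in query)]
    if relaxed:
        return "hyponyms", relaxed

    # Step 3: approximate matching, keep the sentences with highest overlap.
    overlap = {sid: len(query & preds) for sid, preds in KB.items()}
    best = max(overlap.values())
    if best == 0:
        return "none", []
    return "overlap", [sid for sid, n in overlap.items() if n == best]
```
]]></Paragraph>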
    <Paragraph position="7"> 3 Moving to the new domain The first step in adapting the system to a new domain is identifying the specific set of documents to be analyzed. We have experimented with two different collections in the genomics domain. The first collection (here called the 'Biovista' corpus) has been generated from Medline using two seed term lists of genes and pathways (biological processes) to extract an initial corpus of research papers (full articles). The second collection is the GENIA corpus (Kim et al., 2003), which contains 2000 abstracts from Medline (a total of 18546 sentences). The advantage of the latter is that domain-specific terminology is already manually annotated. However, focusing only on that case would mean disregarding a number of real-world problems (in particular terminology detection).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Formatting information
</SectionTitle>
      <Paragraph position="0"> An XML-based filtering tool has been used to select zones of the documents that need to be processed in a specific fashion. Consider for instance the case of the bibliography. The initial structure of the document makes it easy to identify each bibliographical item. Isolating the authors, titles and publication information is then trivial, because it follows a regular structure. The names of the authors (together with the HTML cross-references) can then be used to identify the citations within the main body of the paper.</Paragraph>
      <Paragraph position="1"> If a preliminary zone identification (as described) were not performed, the names of the authors used in the citations would appear as spurious elements within sentences, making their analysis very difficult.</Paragraph>
      <Paragraph position="2"> Another common case is that of titles. Normally they are noun phrases (NPs) rather than sentences. If the parser expected to find a sentence, it would fail. However, using the knowledge that a title is being processed, we can modify the configuration of the parser so that it accepts an NP as a correct parse.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Terminology
</SectionTitle>
      <Paragraph position="0"> The high frequency of terminology in technical text produces various problems when locating answers.</Paragraph>
      <Paragraph position="1"> A primary problem is the increased difficulty of parsing text in a technical domain due to its domain-specific sublanguage. Various types of multi-word terms characterize these domains, in particular terms referring to specific concepts (e.g. genome sequences, proteins). These multi-word expressions might include lexical items that are unknown to a generic lexicon (e.g. &quot;arginine methylation&quot;) or that have a meaning specific to this domain. In addition, deverbal adjectives (and nouns) are often mistagged as verbs (e.g. &quot;mediated activation&quot;, &quot;cell killing&quot;). Abbreviations and acronyms, often complex (e.g.</Paragraph>
      <Paragraph position="2"> bracketed inside NPs, like &quot;adenovirus (ad) infection&quot;) are another common source of inconsistencies. In such cases the parser might fail to identify the compound as a phrase and consequently fail to parse the sentence that contains it. Alternatively, a parser might attempt to 'guess' the lexical category of the unknown items (within the set of open-class categories), leading to an exponential growth in the number of possible syntactic parses and often to incorrect decisions. Not only can the internal structure of the compound be multi-way ambiguous, but the boundaries of the compounds are also difficult to detect, and the parser may try odd combinations of the tokens belonging to the compounds with neighboring tokens.</Paragraph>
      <Paragraph position="3"> We have described in (Rinaldi et al., 2002) some approaches that might be taken towards terminology extraction for a specific domain. The GENIA corpus removes these problems completely by providing pre-annotated terminological units. This allows attention to be focused on other challenges of the QA task, rather than getting 'bogged down' with terminology extraction and organization.</Paragraph>
      <Paragraph position="4"> In the case of the Biovista corpus, we had to perform a phase of terminology discovery, which was facilitated by the existence of the seed lists of genes and pathways. We first marked up the terms that appear in the corpus using additional XML tags. This identified 900 genes and 218 pathways that occur in the corpus, represented as boxed tokens in fig. 4. Next, the entire corpus was chunked into nominal and verbal chunks using LT Chunk (Finch and Mikheev, 1997). Ignoring prepositions and gerunds, the chunks are minimal phrasal groups, represented by the square brackets in fig. 4. The corpus terms were then expanded to the boundary of the phrasal chunk they appear in. For example, NP3 in fig. 4 contains two terms of interest, producing the new term &quot;IFN-induced transcription&quot;. The 1118 corpus terms were expanded into 6697 new candidate terms. Of these, 1060 have a pathway in head position and 1154 a gene. The remaining 4483 candidate terms involve a novel head with at least one gene or pathway as a modifier.</Paragraph>
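      <Paragraph> The expansion of seed terms to chunk boundaries can be sketched as follows. The seed lists, the function name 'expand_terms', the splitting of hyphenated tokens and the choice of the last word as the head are all assumptions made for illustration; chunking itself (done with LT Chunk in our setting) is taken as given.</Paragraph>
      <Paragraph><![CDATA[
```python
# Hypothetical seed lists; the real seed lists are much larger.
GENES = {"IFN", "STAT1"}
PATHWAYS = {"transcription"}
SEEDS = GENES | PATHWAYS


def expand_terms(np_chunks):
    """Promote each noun chunk that contains a seed term to a candidate term."""
    candidates = []
    for chunk in np_chunks:
        phrase = " ".join(chunk)
        # Break hyphenated tokens so that 'IFN-induced' exposes the seed 'IFN'.
        words = [part for token in chunk for part in token.split("-") if part]
        if phrase in SEEDS or not any(word in SEEDS for word in words):
            continue  # already a known term, or no seed term inside
        head = words[-1]  # assumption: the head is the last word of the chunk
        if head in PATHWAYS:
            kind = "pathway head"
        elif head in GENES:
            kind = "gene head"
        else:
            kind = "novel head"
        candidates.append((phrase, kind))
    return candidates
```
]]></Paragraph>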
      <Paragraph position="5"> Once the terminology is available, it is necessary to detect relations among terms in order to exploit it. We have focused our attention in particular on the relations of synonymy and hyponymy, which are detected as described in (Dowdall et al., 2003) and gathered in a Thesaurus. The organizing unit is the WordNet-style synset, which includes strict synonymy as well as three weaker synonymy relations.</Paragraph>
      <Paragraph position="6"> These sets are further organized into an is-a hierarchy based on two definitions of hyponymy.</Paragraph>
      <Paragraph position="7"> One of the most serious problems that we have encountered in working in restricted domains is the syntactic ambiguity generated by multi-word units, in particular technical terms. Any generic parser, unless developed specifically for the domain at hand, will have serious problems dealing with these multi-word units. The solution that we have adopted is to parse multi-word terms as single syntactic units. The tokenizer detects the terms (previously collected in the Thesaurus) as they appear in the input stream, and packs them into single lexical tokens prior to syntactic analysis, assigning them the syntactic properties of their head word. In previous work this approach proved particularly effective, reducing the complexity of parsing by 46% (Rinaldi et al., 2002).</Paragraph>
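      <Paragraph> A minimal sketch of this packing step is given below. It is not the actual tokenizer: the term list, the toy part-of-speech lexicon, the tag names and the function name 'pack_terms' are hypothetical, longest-match lookup is an assumption, and the head is assumed to be the last word of the term.</Paragraph>
      <Paragraph><![CDATA[
```python
# Hypothetical thesaurus entries (as token tuples) and a toy POS lexicon.
TERMS = {("cell", "killing"), ("adenovirus", "infection")}
POS = {"killing": "NOUN", "infection": "NOUN", "cell": "NOUN",
       "adenovirus": "NOUN", "caused": "VERB", "by": "ADP"}
MAX_LEN = max(len(term) for term in TERMS)


def pack_terms(tokens):
    """Merge known multi-word terms into single tokens before parsing.

    Each packed token inherits the part of speech of its head word,
    assumed here to be the last word of the term.
    """
    packed = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            span = tuple(tokens[i:i + n])
            if span in TERMS:
                packed.append(("_".join(span), POS.get(span[-1], "UNK")))
                i += n
                break
        else:
            # No multi-word term starts here: keep the single token.
            packed.append((tokens[i], POS.get(tokens[i], "UNK")))
            i += 1
    return packed
```
]]></Paragraph>
      <Paragraph> The parser then sees one token per term, so it never has to consider attachments inside the term or across its boundaries.</Paragraph>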
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Parsing
</SectionTitle>
      <Paragraph position="0"> The deep syntactic analysis builds upon the chunks to identify sentence-level syntactic relations between the heads of the chunks. The output is a hierarchical structure of syntactic relations (functional dependency structures), represented by the directed arrows in fig. 4. The parser (Pro3Gres) uses hand-written declarative rules to encode acknowledged facts, such as the fact that verbs typically take one but never two subjects, combined with a statistical language model that calculates lexicalized attachment probabilities, similar to (Collins, 1999). Parsing is seen as a decision process: the probability of a total parse is the product of the probabilities of the individual decisions at each ambiguous point in the derivation.</Paragraph>
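      <Paragraph> The scoring of a parse as a product of decision probabilities can be illustrated with a small sketch. The probabilities and the candidate names below are invented for illustration and do not come from the parser's language model; summing logarithms instead of multiplying is an implementation choice that avoids numerical underflow on long derivations.</Paragraph>
      <Paragraph><![CDATA[
```python
import math


def parse_log_probability(decision_probs):
    """Log of the product of the probabilities of all attachment decisions."""
    return sum(math.log(p) for p in decision_probs)


def best_parse(candidates):
    """Return the name of the candidate parse with the highest probability."""
    return max(candidates, key=lambda name: parse_log_probability(candidates[name]))


# Hypothetical candidates: one probability per ambiguous decision point.
CANDIDATES = {
    "attach_to_verb": [0.9, 0.6, 0.7],   # product 0.378
    "attach_to_noun": [0.9, 0.3, 0.8],   # product 0.216
}
```
]]></Paragraph>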
      <Paragraph position="1"> Probabilistic parsers generally have the advantage that they are fast and robust, and that they resolve syntactic ambiguities with high accuracy.</Paragraph>
      <Paragraph position="2"> Both of these points are prerequisites for a statistical analysis that is feasible over large amounts of text and beneficial to the Q&amp;A system's performance.</Paragraph>
      <Paragraph position="3"> In comparison to shallow processing methods, parsing has the advantage that relations spanning long stretches of text can still be recognized, and that the parsing context contributes substantially to disambiguation. In comparison to deep-linguistic, formal grammar-based parsers, however, the output of probabilistic parsers is relatively shallow: pure context-free grammar (CFG) constituency output, that is, tree structures that include neither grammatical function annotation nor the co-indexation and empty-node annotation that expresses long-distance dependencies (LDD). In the simple example sentence &quot;John wants to leave&quot;, a deep-linguistic syntactic analysis expresses the identity of the explicit matrix clause subject and the implicit subordinate clause subject by co-indexing the explicit subject and the empty implicit subject trace t: &quot;[John1 wants [t1 to leave]]&quot;. A parser that fails to recognize these implicit subjects, so-called control subjects, misses very important information, quantitatively about 3% of all subjects. Although LDD annotation is actually provided in treebanks such as the Penn Treebank (Marcus et al., 1993), over which they are typically trained, most probabilistic parsers largely or fully ignore this information.
This means that the extraction of LDDs and the mapping to shallow semantic representations such as MLFs is not always possible: first, co-indexation information is not available; second, a single parsing error across a tree fragment containing an LDD makes its extraction impossible; third, some syntactic relations cannot be recovered. We therefore adapt ExtrAns to use a new statistical broad-coverage parser that is as fast as a probabilistic parser but more deep-linguistic, because it delivers grammatical relation structures that are closer to predicate-argument structures and shallow semantic structures like MLFs, and more informative if non-local dependencies are involved (Schneider, 2003). It has been evaluated and shown to have state-of-the-art performance.</Paragraph>
      <Paragraph position="4"> The parser expresses distinctions that are especially important for a predicate-argument based shallow semantic representation, as far as they are expressed in the Penn Treebank training data, such as PP-attachment, most LDDs, relative clause anaphora, participles, gerunds, and the argument/adjunct distinction for NPs.</Paragraph>
      <Paragraph position="5"> In some cases the parser makes functional relation distinctions that are not expressed in the Penn Treebank. For example, commas are disambiguated between apposition and conjunction, and the Penn tag IN is disambiguated between preposition and subordinating conjunction. Other distinctions that are less relevant or not clearly expressed in the Treebank are left underspecified, such as the distinction between PP arguments and adjuncts, or a number of types of subordinate clauses. The parser is robust in that it returns the most promising set of partial structures when it fails to find a complete parse for a sentence.</Paragraph>
      <Paragraph position="6"> For sentences syntactically more complex than this illustrative example, as many hierarchical relations are returned as possible. A screenshot of its graphical interface can be seen in fig. 5. Its parsing speed is about 300,000 words per hour.</Paragraph>
      <Paragraph position="7"> Fig. 4 displays the three levels of analysis that are performed on a simple sentence. Term expansion yields NP3 as a complete candidate term. However, NP1 and NP2 form two distinct, fully expanded noun phrase chunks. Their combination into a noun phrase with an embedded prepositional phrase is recovered from the parser's syntactic relations, giving the maximally projected noun phrase involving a term: &quot;Arginine methylation of STAT1&quot; (or, juxtaposed, &quot;STAT1 arginine methylation&quot;). Finally, the highest-level syntactic relations (subj and obj) identify a transitive predicate relation between these two candidate terms.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 MLFs
</SectionTitle>
      <Paragraph position="0"> The deep-linguistic dependency-based parser partly simplifies the construction of MLFs. First, the mapping between labeled dependencies and a surface semantic representation is often more direct than across a complex constituency subtree (Schneider, 2003), and often more accurate (Johnson, 2002).</Paragraph>
      <Paragraph position="1"> Second, dedicated labels can directly express complex relations, and the lexical participants needed for the construction are more locally available.</Paragraph>
      <Paragraph position="2"> Let us look at the example sentence &quot;Adenovirus infection and transfection were used to model changes in susceptibility to cell killing caused by E1A expression&quot;. The control relation (infection is the implicit subject of model) and the PP relation (including the description noun) are available locally. The reduced relative clause killing caused by is expressed by a local dedicated label (modpart).</Paragraph>
      <Paragraph position="3"> Only the conjunction infection and transfection, expressed here by bracketing, needs to be searched across the syntactic hierarchy.</Paragraph>
      <Paragraph position="4"> This leads to the following MLFs: object(infection, o1, [o1]).</Paragraph>
      <Paragraph position="5"> object(transfection, o2, [o2]).</Paragraph>
      <Paragraph position="6"> object(change, o3, [o3]).</Paragraph>
      <Paragraph position="7"> object(susceptibility, o4, [o4]).</Paragraph>
      <Paragraph position="8"> object(killing, o5, [o5]).</Paragraph>
      <Paragraph position="9"> object(expression, o6, [o6]).</Paragraph>
      <Paragraph position="10"> object(cell, o7, [o7]).</Paragraph>
      <Paragraph position="11"> evt(cause, e3, [o6]).</Paragraph>
      <Paragraph position="12"> evt(model, e1, [(o1,o2), o3]).</Paragraph>
      <Paragraph position="13"> evt(use, e2, [(o1,o2), e1]).</Paragraph>
      <Paragraph position="14"> by(e3, o6).</Paragraph>
      <Paragraph position="15"> in(o5, o7).</Paragraph>
      <Paragraph position="16"> to(o4, o5).</Paragraph>
      <Paragraph position="17"> in(o3, o4).</Paragraph>
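      <Paragraph> To show how such MLFs can be consulted, the object and event predicates listed above are transcribed below into plain Python structures. The transcription format and the helper names 'flatten' and 'events_involving' are assumptions made for this sketch; they are not part of ExtrAns, which proves queries by deduction over the logical forms.</Paragraph>
      <Paragraph><![CDATA[
```python
# Object predicates from the example: identifier -> word.
OBJECTS = {"o1": "infection", "o2": "transfection", "o3": "change",
           "o4": "susceptibility", "o5": "killing", "o6": "expression",
           "o7": "cell"}

# Event predicates: identifier -> (word, arguments).
# A nested tuple such as ("o1", "o2") stands for a conjunction of arguments.
EVENTS = {
    "e3": ("cause", ["o6"]),
    "e1": ("model", [("o1", "o2"), "o3"]),
    "e2": ("use", [("o1", "o2"), "e1"]),
}


def flatten(args):
    """Yield argument identifiers, unpacking conjunctions such as (o1, o2)."""
    for arg in args:
        if isinstance(arg, tuple):
            yield from arg
        else:
            yield arg


def events_involving(word):
    """Return the event words that take an object named `word` as argument."""
    ids = {oid for oid, name in OBJECTS.items() if name == word}
    return sorted(pred for pred, args in EVENTS.values()
                  if ids & set(flatten(args)))
```
]]></Paragraph>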
    </Section>
  </Section>
</Paper>