File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-2701_intro.xml
Size: 3,413 bytes
Last Modified: 2025-10-06 14:04:05
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2701"> <Title>Representing and Querying Multi-dimensional Markup for Question Answering</Title> <Section position="2" start_page="0" end_page="3" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Corpus-based question answering is a complex task that draws from information retrieval, information extraction and computational linguistics to pinpoint information users are interested in. The flexibility of natural language means that potential answers to questions may be phrased in different ways: lexical and syntactic variation, ambiguity, polysemy, and anaphoricity all contribute to a gap between questions and answers. Typically, QA systems rely on a range of linguistic analyses, provided by a variety of different tools, to bridge this gap from questions to possible answers.</Paragraph> <Paragraph position="1"> In our work, we focus on how we can integrate the analyses provided by completely independent linguistic processing components into a uniform QA framework. On the one hand, we would like to be able, as much as possible, to make use of off-the-shelf NLP tools from various sources without having to worry about whether the outputs of the tools are compatible, either in a strong sense of forming a single hierarchy or even in a weaker sense of simply sharing a common tokenization. On the other hand, we would like to be able to issue simple and clear queries that jointly draw upon annotations provided by different tools.</Paragraph> <Paragraph position="2"> To this end, we store annotated data as stand-off XML and query it using an extension of XQuery with our new StandOff axes, inspired by (Burkowski, 1992). Key to our approach is the use of stand-off annotation at every stage of the annotation process. The source text, or character data, is stored in a Binary Large OBject (BLOB), and all annotations are stored in a single XML document. 
To generate and manage the annotations we have adopted XIRAF (Alink, 2005), a framework for integrating annotation tools which has already been successfully used in digital forensic investigations.</Paragraph> <Paragraph position="3"> Before performing any linguistic analysis, the source documents, which may contain XML metadata, are split into a BLOB and an XML document, and the XML document is used as the initial annotation. Various linguistic analysis tools are run over the data, such as a named-entity tagger, a temporal expression (timex) tagger, and syntactic phrase structure and dependency parsers.</Paragraph> <Paragraph position="4"> The XML document will grow during this analysis phase as new annotations are added by the NLP tools, while the BLOB remains intact. In the end, the result is a fully annotated stand-off document, and this annotated document is the basis for our QA system, which uses XQuery extended with the new axes to access the annotations.</Paragraph> <Paragraph position="5"> The remainder of the paper is organized as follows. In Section 2 we briefly discuss related work.</Paragraph> <Paragraph position="6"> Section 3 is devoted to the issue of querying multi-dimensional markup. Then we describe how we coordinate the process of text annotation, in Section 4, before describing the application of our multi-dimensional approach to linguistic annotation to question answering in Section 5. We conclude in Section 6.</Paragraph> </Section> </Paper>
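
The stand-off storage scheme outlined in the introduction (source text kept apart from a single XML document of annotations) can be illustrated with a minimal Python sketch. This is not the XIRAF implementation or the paper's StandOff XQuery axes: the element names, the `start`/`end` attributes, the use of character offsets into a string instead of a real BLOB, and the helper `contained_in` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Source text kept separately from all markup (stands in for the BLOB).
BLOB = "Bob Dylan was born in 1941 in Duluth."

# Stand-off annotations as they might come from independent tools.
# Element names and start/end attributes are illustrative assumptions;
# offsets are half-open character ranges [start, end) into BLOB.
ANNOTATIONS = """
<annotations>
  <sentence start="0" end="37"/>
  <ne type="PERSON" start="0" end="9"/>
  <timex start="22" end="26"/>
  <ne type="LOCATION" start="30" end="36"/>
</annotations>
"""


def region(elem):
    """Return the (start, end) region an annotation element refers to."""
    return int(elem.get("start")), int(elem.get("end"))


def text_of(elem, blob=BLOB):
    """Recover the annotated text by slicing the untouched source text."""
    start, end = region(elem)
    return blob[start:end]


def contained_in(root, outer, tag):
    """Hypothetical region-based step: all `tag` annotations whose region
    lies inside the region of `outer`, regardless of XML nesting."""
    o_start, o_end = region(outer)
    return [
        e for e in root.iter(tag)
        if e is not outer and o_start <= region(e)[0] and region(e)[1] <= o_end
    ]


root = ET.fromstring(ANNOTATIONS)
sentence = root.find("sentence")
entities = [text_of(e) for e in contained_in(root, sentence, "ne")]
years = [text_of(e) for e in contained_in(root, sentence, "timex")]
```

Because containment is decided by regions and not by element nesting, annotations from tools that share neither a hierarchy nor a tokenization can still be combined in one query, which is the property the introduction asks for.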