<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2701"> <Title>Representing and Querying Multi-dimensional Markup for Question Answering</Title> <Section position="4" start_page="3" end_page="5" type="metho"> <SectionTitle> 3 Querying Multi-dimensional Markup </SectionTitle> <Paragraph position="0"> Our approach to markup is based on stand-off XML. Stand-off XML is already widely used, although it is often not recognized as such. It can be found in many present-day applications, especially where annotations of audio or video are concerned. Furthermore, many existing multidimensional-markup languages, such as LMNL, can be translated into stand-off XML.</Paragraph> <Paragraph position="1"> We split annotated data into two parts: the BLOB (Binary Large OBject) and the XML annotations that refer to specific regions of the BLOB. A BLOB may be an arbitrary byte string (e.g., the contents of a hard drive (Alink, 2005)), and the annotations may refer to regions using positions such as byte offsets, word offsets, points in time, or frame numbers (e.g., for audio or video applications). In text-based applications, such as those described in this paper, we use character offsets. The advantage of such character-based references over word- or token-based ones is that they allow us to reconcile possibly different tokenizations by different text analysis tools (cf. Section 4).</Paragraph> <Paragraph position="2"> In short, a multi-dimensional document consists of a BLOB and a set of stand-off XML annotations of the BLOB. Our approach to querying such documents extends the common XML query languages XPath and XQuery by defining four new axes that allow one to move from one XML tree to another. [Figure 1: Two annotations of the same data (text characters).] Until recently, there have been very few approaches to querying stand-off documents. We take the approach of (Alink, 2005), which allows the user to relate different annotations using containment and overlap conditions.
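As a concrete illustration of the character-offset convention just described, a stand-off document can be modelled as a plain string (the BLOB) plus labelled offset regions. This is our own minimal sketch; the class and layer names are illustrative, not from the paper.

```python
# Minimal sketch of a stand-off document: the BLOB is a plain string,
# and each annotation is a labelled region of it, addressed by
# character offsets. Names here are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    tag: str    # annotation element name
    start: int  # offset of the first character
    end: int    # offset one past the last character

    def text(self, blob):
        return blob[self.start:self.end]

blob = "John F. Kennedy was assassinated in 1963."
ne_layer = [Region("ne", 0, 15)]         # named-entity tagger's layer
timex_layer = [Region("timex", 36, 40)]  # temporal-expression layer

print(ne_layer[0].text(blob))    # John F. Kennedy
print(timex_layer[0].text(blob)) # 1963
```

Because both layers address the same BLOB by character offsets, tools with different tokenizations can annotate the same data without interfering with each other.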
This is done using the new StandOff XPath axis steps that we add to the XQuery language. This approach seems to be quite general: in (Alink, 2005) it is shown that many of the query scenarios given in (Iacob et al., 2004) can be easily handled by using these StandOff axis steps.</Paragraph> <Paragraph position="3"> Let us explain the axis steps by means of an example. Figure 1 shows two annotations of the same character string (BLOB), where the first</Paragraph> <Paragraph position="5"> can be queried using standard XML query languages; together they make up a more complex structure.</Paragraph> <Paragraph position="6"> StandOff axis steps, inspired by (Burkowski, 1992), allow for querying overlap and containment of regions, but otherwise behave like regular XPath steps, such as child (the step between A and B in Figure 1) or sibling (the step between C and D). The new StandOff axes, denoted select-narrow, select-wide, reject-narrow, and reject-wide, select contained, overlapping, non-contained, and non-overlapping region elements, respectively, from possibly distinct layers of XML annotation of the data. Table 1 lists some examples for the annotations of our example document.</Paragraph> <Paragraph position="7"> In XPath, the new axis steps are used in exactly the same way as the standard ones. For example, returns nodes that contain the span of B: in our case, the nodes A and E.</Paragraph> <Paragraph position="8"> In implementing the new steps, one of our design decisions was to put all stand-off annotations in a single document. For this, an XML processor is needed that is capable of handling large amounts of XML. We have decided to use MonetDB/XQuery, an XQuery implementation that consists of the Pathfinder compiler, which translates XQuery statements into a relational algebra, and the relational database MonetDB (Grust, 2002; Boncz, 2002).</Paragraph> <Paragraph position="9"> The implementation of the new axis steps in MonetDB/XQuery is quite efficient.
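The region semantics of the four StandOff axes can be sketched as plain predicates over (start, end) character regions. This is our own illustrative rendering of the informal semantics above, not the MonetDB/XQuery implementation.

```python
# Illustrative predicates for the four StandOff axes over (start, end)
# character regions; a sketch of the informal semantics described
# above, not the MonetDB/XQuery implementation itself.

def contains(ctx, tgt):
    # tgt lies entirely within ctx
    return tgt[0] >= ctx[0] and ctx[1] >= tgt[1]

def overlaps(ctx, tgt):
    # the two regions share at least one character position
    return ctx[1] > tgt[0] and tgt[1] > ctx[0]

def select_narrow(ctx, layer):   # contained regions
    return [r for r in layer if contains(ctx, r)]

def select_wide(ctx, layer):     # overlapping regions
    return [r for r in layer if overlaps(ctx, r)]

def reject_narrow(ctx, layer):   # non-contained regions
    return [r for r in layer if not contains(ctx, r)]

def reject_wide(ctx, layer):     # non-overlapping regions
    return [r for r in layer if not overlaps(ctx, r)]

sentence = (0, 42)                    # e.g., a sentence region
regions = [(0, 15), (36, 40), (40, 60)]
print(select_narrow(sentence, regions))  # [(0, 15), (36, 40)]
print(select_wide(sentence, regions))    # [(0, 15), (36, 40), (40, 60)]
```

Note that the target regions may come from any annotation layer, which is exactly what makes these steps useful for multi-dimensional markup.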
When the XMark benchmark documents (XMark, 2006) are represented using stand-off notation, querying with the StandOff axis steps is interactive for document sizes up to 1 GB. Even millions of regions are handled efficiently. The reason for the speed of the StandOff axis steps is twofold. First, they are accelerated by keeping a database index on the region attributes, which allows fast merge algorithms to be used in their evaluation. Such merge algorithms make a single linear scan through the index to compute each StandOff step. The second technical innovation is &quot;loop-lifting,&quot; a general principle in MonetDB/XQuery (Boncz et al., 2005) for the efficient execution of XPath steps that occur nested in XQuery iterations (i.e., inside for-loops). A naive strategy would invoke the StandOff algorithm for each iteration, leading to repeated (potentially many) sequential scans. Loop-lifted versions of the StandOff algorithms, in contrast, handle all iterations together in one sequential scan, keeping the average complexity of the StandOff steps linear.</Paragraph> <Paragraph position="10"> The StandOff axis steps are part of release 0.10 of the open-source MonetDB/XQuery product, which can be downloaded from http://www.monetdb.nl/XQuery.</Paragraph> <Paragraph position="11"> In addition to the StandOff axis steps, a keyword search function has been added to the XQuery system to allow queries asking for regions containing specific words. This function, so-contains($node, $needle), returns a boolean specifying whether $needle occurs in the region represented by the element $node.</Paragraph> </Section> <Section position="5" start_page="5" end_page="5" type="metho"> <SectionTitle> 4 Combining Annotations </SectionTitle> <Paragraph position="0"> In our QA application of multi-dimensional markup, we work with corpora of newspaper articles, each of which comes with some basic annotation, such as title, body, keywords, timestamp, topic, etc.
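The merge-based evaluation mentioned in Section 3 can be sketched as follows, assuming both region lists are kept sorted by start offset (as a database index would allow). This is our illustration of the single-scan idea only; the actual MonetDB/XQuery algorithms differ in detail.

```python
# Sketch of the merge idea behind the StandOff steps: with both region
# lists sorted by start offset, the regions contained in every context
# can be collected in (near-)linear passes over the index. This is an
# illustration, not MonetDB/XQuery's actual algorithm.

def merge_select_narrow(contexts, targets):
    # contexts, targets: lists of (start, end), sorted by start offset
    result = {ctx: [] for ctx in contexts}
    i = 0
    for cs, ce in contexts:
        # targets starting before cs cannot be contained in this
        # context, nor in any later one (starts only increase)
        while i != len(targets) and cs > targets[i][0]:
            i += 1
        j = i
        while j != len(targets) and ce >= targets[j][0]:
            ts, te = targets[j]
            if ce >= te:                 # fully inside [cs, ce)
                result[(cs, ce)].append((ts, te))
            j += 1
    return result

sentences = [(0, 10), (20, 30)]
entities = [(2, 5), (8, 12), (21, 25)]
print(merge_select_narrow(sentences, entities))
# {(0, 10): [(2, 5)], (20, 30): [(21, 25)]}
```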
We take this initial annotation structure and split it into raw data, which comprises all textual content, and the XML markup. The raw data is the BLOB, and the XML annotations are converted to stand-off format. To each XML element originally containing textual data (now stored in the BLOB), we add start and end attributes denoting its position in the BLOB.</Paragraph> <Paragraph position="1"> We use a separate system, XIRAF, to coordinate the process of automatically annotating the text. XIRAF (Figure 2) combines multiple text processing tools, each having an input descriptor and a tool-specific wrapper that converts the tool output into stand-off XML annotation. Figure 3 shows the interaction of XIRAF with an automatic annotation tool using a wrapper.</Paragraph> <Paragraph position="2"> The input descriptor associated with a tool is used to select regions in the data that are candidates for processing by that tool. The descriptor may select regions on the basis of the original metadata or annotations added by other tools. For example, both our sentence splitter and our temporal expression tagger use original document metadata to select their input: both select document text, with //TEXT. Other tools, such as syntactic parsers and named-entity taggers, require separated sentences as input and thus use the output annotations of the sentence splitter, with the input descriptor //sentence. In general, there may be arbitrary dependencies between text-processing tools, which XIRAF takes into account.</Paragraph> <Paragraph position="3"> In order to add the new annotations generated by a tool to the original document, the output of the tool must be represented using stand-off XML annotation of the input data. Many text processing tools (e.g., parsers or part-of-speech taggers) do not produce XML annotation per se, but their output can be easily converted to stand-off XML annotation.
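The splitting step described at the start of this section can be sketched with Python's standard XML library: walk the inline document, accumulate all textual content into the BLOB, and record each element's start/end offsets. This is a rough illustration, not XIRAF's actual converter; the element names are invented, and the tree is built programmatically only to keep the example self-contained.

```python
# Sketch of converting an inline XML document into a BLOB plus
# stand-off regions carrying start/end character offsets, as described
# above. Illustration only; not XIRAF's actual converter.
import xml.etree.ElementTree as ET

def to_standoff(elem, blob, regions):
    start = len(blob)
    if elem.text:
        blob += elem.text
    for child in elem:
        blob, regions = to_standoff(child, blob, regions)
        if child.tail:
            blob += child.tail
    regions.append((elem.tag, start, len(blob)))
    return blob, regions

# build a tiny article programmatically (element names are invented)
doc = ET.Element("DOC")
title = ET.SubElement(doc, "TITLE")
title.text = "News"
body = ET.SubElement(doc, "TEXT")
body.text = "Some body text."

blob, regions = to_standoff(doc, "", [])
print(blob)     # NewsSome body text.
print(regions)  # [('TITLE', 0, 4), ('TEXT', 4, 19), ('DOC', 0, 19)]
```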
More problematically, text processing tools may actually modify the input text in the course of adding annotations, so that the offsets referenced in the new annotations do not correspond to the original BLOB. Tools make a variety of modifications to their input text: some perform their own tokenization (i.e., insert whitespace or other word separators), others silently skip parts of the input (e.g., syntactic parsers when parsing fails) or replace special symbols (e.g., parentheses with -LRB- and -RRB-). For many of the available text processing tools, such possible modifications are not fully documented.</Paragraph> <Paragraph position="4"> XIRAF, then, must map the output of a processing tool back to the original BLOB before adding the new annotations to the original document. This re-alignment of the output of the processing tools with the original BLOB is one of the major hurdles in the development of our system. We approach the problem systematically: we compare the text data in the output of a given tool with the data that was given to it as input, and re-align input and output offsets of markup elements using an edit-distance algorithm with heuristically chosen weights of character edits. After re-aligning the output with the original BLOB and adjusting the offsets accordingly, the actual data returned by the tool is discarded and only the stand-off markup is added to the existing document annotations.</Paragraph> </Section> <Section position="6" start_page="5" end_page="7" type="metho"> <SectionTitle> 5 Question Answering </SectionTitle> <Paragraph position="0"> XQuesta, our corpus-based question-answering system for English and Dutch, makes use of the multi-dimensional approach to linguistic annotation embodied in XIRAF. The system analyzes an incoming question to determine the required answer type and keyword queries for retrieving relevant snippets from the corpus.
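The offset re-alignment described in Section 4 can be approximated with Python's standard difflib: map each character offset in the tool's modified output text back to an offset in the original input. The paper uses a weighted edit-distance with heuristic character weights; SequenceMatcher is used here only as a rough stand-in, and the example strings are ours.

```python
# Rough stand-in for the re-alignment step of Section 4: map character
# offsets in a tool's modified output text back to offsets in the
# original input. The paper's weighted edit-distance is replaced by
# difflib.SequenceMatcher for illustration.
import difflib

def offset_map(original, modified):
    sm = difflib.SequenceMatcher(a=modified, b=original, autojunk=False)
    mapping = {}
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping

original = "He said (yes)."
modified = "He said -LRB- yes -RRB- ."  # parser rewrote the parentheses
m = offset_map(original, modified)
print(m[14], m[16])  # the span of "yes" in the output maps back to 9, 11
```

Annotations whose offsets fall in unmatched stretches (the rewritten parentheses here) have no image in the map and would need the heuristic weighting the paper alludes to.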
From these snippets, candidate answers are extracted, ranked, and returned.</Paragraph> <Paragraph position="1"> The system consults Dutch and English newspaper corpora. Using XIRAF, we annotate the corpora with named entities (including type information), temporal expressions (normalized to ISO values), syntactic chunks, and syntactic parses (dependency parses for Dutch and phrase-structure parses for English).</Paragraph> <Paragraph position="2"> XQuesta's question analysis module maps questions to both a keyword query for retrieval of relevant passages and a query for extracting candidate answers. For example, for the question How many seats does a Boeing 747 have?, the keyword query is Boeing 747 seats, while the extraction query is the pure XPath expression: //phrase[@type=&quot;NP&quot;][.//WORD [@pos=&quot;CD&quot;]][so-contains(., &quot;seat&quot;)] This query can be glossed: find phrase elements of type NP that dominate a word element tagged as a cardinal number and that also contain the string &quot;seat&quot;. Note that phrase and word elements are annotations generated by a single tool (the phrase-structure parser) and thus in the same annotation layer, which is why standard XPath can be used to express this query.</Paragraph> <Paragraph position="3"> For the question When was Kennedy assassinated?, on the other hand, the extraction query is an XPath expression that uses a StandOff axis: //phrase[@type=&quot;S&quot; and headword= &quot;assassinated&quot; and so-contains(., &quot;Kennedy&quot;)]/select-narrow::timex This query can be glossed: find temporal expressions whose textual extent is contained inside a sentence (or clause) that is headed by assassinated and contains the string &quot;Kennedy&quot;. Note that phrase and timex elements are generated by different tools (the phrase-structure parser and the temporal expression tagger, respectively), and therefore belong to different annotation layers.
Thus, the select-narrow:: axis step must be used in place of the standard child:: or descendant:: steps.</Paragraph> <Paragraph position="4"> As another example of the use of the StandOff axes, consider the question Who killed John F. Kennedy? Here, the keyword query is kill John Kennedy, and the extraction query is the following. This query can be glossed: find person named-entities whose textual extent overlaps the textual extent of an NP phrase that is the subject of a sentence phrase that is headed by killed and contains the string &quot;Kennedy&quot;. Again, phrase elements and ne elements are generated by different tools (the phrase-structure parser and named-entity tagger, respectively), and therefore belong to different annotation layers. In this case, we further do not want to make the unwarranted assumption that the subject NP found by the parser properly contains the named-entity found by the named-entity tagger. Therefore, we use the select-wide:: axis to indicate that the named-entity which will serve as our candidate answer need only overlap with the sentential subject.</Paragraph> <Paragraph position="5"> How do we map from questions to queries like this? For now, we use hand-crafted patterns, but we are currently working on using machine learning methods to automatically acquire question-query mappings. For the purposes of demonstrating the utility of XIRAF to QA, however, it is immaterial how the mapping happens. What is important to note is that queries utilizing the StandOff axes arise naturally in the mapping of questions to queries against corpus data that has several layers of linguistic annotation.</Paragraph> </Section> </Paper>