<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2704"> <Title>Towards an Alternative Implementation of NXT's Query Language via XQuery</Title> <Section position="4" start_page="29" end_page="30" type="metho"> <SectionTitle> 5 Implementation Strategy </SectionTitle> <Paragraph position="0"> In our investigation, we compare two possible implementation strategies against NXT Search, our existing implementation.</Paragraph> <Section position="1" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 5.1 Using NXT's stand-off format </SectionTitle> <Paragraph position="0"> The first strategy is to use XQuery directly on NXT's stand-off data storage format. The bulk of the work here is in writing libraries of XQuery functions that correctly interpret NXT's stand-off child links in order to allow navigation over the same primary axes as are used in XPath, but with multiple parenthood, and operating over NXT's multiple files. The libraries can resolve the XLinks NXT uses both forwards and backwards.</Paragraph> <Paragraph position="1"> Backwards resolution requires functions that access the corpus metadata to find out which files could contain annotations that could stand in the correct relationship to the starting node. Built on top of this infrastructure would be functions which implement the NQL operators.</Paragraph> <Paragraph position="2"> Resolving ancestors is a rather expensive operation which involves searching an entire coding file for links to a node with a specified identity. Additionally, if a query includes variables which are not bound to a particular type, this precludes the possibility of reducing the search space to particular coding files.</Paragraph> <Paragraph position="3"> A drawback to using XPath to query a hierarchy which is serialised to multiple annotation files is that much of the efficiency of XPath expressions can be lost through the necessity of resolving XLinks at every child or parent step of the expression. 
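The mechanics of forward and backward link resolution can be sketched as follows. This is a hypothetical toy model in Python, not NXT's actual implementation: the node records, the (file, first-id, last-id) link format, and the helper names are all assumptions, chosen only to illustrate why dominance must be computed one child step at a time and why backwards resolution is expensive.

```python
# Toy model of NXT-style stand-off child links (all names are assumptions).
# A node holds its same-file children plus link ranges into other files.
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    ntype: str
    children: list = field(default_factory=list)   # same-file child Nodes
    links: list = field(default_factory=list)      # (filename, first_id, last_id)

def resolve_children(node, corpus):
    """Forward resolution: expand a node's children, following links
    into other files. corpus maps file name -> nodes in document order."""
    out = list(node.children)
    for fname, first, last in node.links:
        ordered = corpus[fname]
        ids = [n.id for n in ordered]
        out.extend(ordered[ids.index(first):ids.index(last) + 1])
    return out

def dominates(a, b, corpus):
    """Descendant test built from single child steps, since the
    descendant axis cannot be applied directly across files."""
    return any(c is b or dominates(c, b, corpus)
               for c in resolve_children(a, corpus))

def parents_of(node, corpus):
    """Backwards resolution: scan every file for nodes whose expanded
    children include the target -- the expensive operation noted above."""
    return [n for nodes in corpus.values() for n in nodes
            if any(c is node for c in resolve_children(n, corpus))]
```

The `parents_of` scan makes concrete why resolving ancestors is costly: without type information to narrow the candidate files, every coding file must be searched for a link to the node's identity.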
This means that even the descendant and ancestor axes of XPath may not be used directly but must be broken down into their constituent single-step axes.</Paragraph> <Paragraph position="4"> In addition to providing a transparent interface for navigating the data, it may be necessary to provide additional indexing of the data, to increase efficiency and avoid the duplication of calculations.</Paragraph> <Paragraph position="5"> An alternative is to overcome the stand-off nature of the data by resolving links explicitly, as described in the following section.</Paragraph> </Section> <Section position="2" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 5.2 Using a redundant data representation </SectionTitle> <Paragraph position="0"> The second strategy makes use of the classic trade-off between memory and speed by employing a redundant data representation that is easy to calculate from NXT's data storage format and that ensures that most of the navigation required exercises common parts of XPath, since these are the operations upon which XQuery implementations will have concentrated their resources.</Paragraph> <Paragraph position="1"> The particular redundancy we have in mind relies on NXT's concept of &quot;knitting&quot; data. In NXT's data model, every node may have multiple parents, but only one set of children. Where multiple parents exist, at most one will be in the same file as the child node, with the rest connected by XLinks. &quot;Knitting&quot; is the process of starting with one XML file and recursively following children and child links, storing the expanded result as an XML file. The redundant representation we used is then the smallest set of expanded files that contains within it every child link from the original data as an XML child. 
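The knitting expansion can be sketched as a recursive copy. This is a hedged toy model, not the LTXML implementation: the dictionary-based node records and the (file, first-id, last-id) link format are assumptions standing in for NXT's XML files and XLink syntax.

```python
# Toy sketch of NXT-style "knitting": starting from one node, recursively
# follow same-file children and cross-file child links, emitting a single
# fully expanded tree in which every child link is an ordinary XML child.
# Node and link representations are assumptions for illustration only.

def knit(node, corpus):
    """node: {"id", "type", "children": [...], "links": [(file, first, last)]}
    corpus: file name -> list of node dicts in document order."""
    kids = list(node.get("children", []))
    for fname, first, last in node.get("links", []):
        ordered = corpus[fname]
        ids = [n["id"] for n in ordered]
        kids.extend(ordered[ids.index(first):ids.index(last) + 1])
    return {"id": node["id"], "type": node["type"],
            "children": [knit(k, corpus) for k in kids]}
```

Because the recursion copies shared material into every parent that links to it, a node with multiple parents appears once per parent in the knitted output, which is exactly the redundancy the text describes.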
</Paragraph> <Paragraph position="2"> Although this approach has the advantage of using XPath more heavily than our first approach, it has the added costs of generating the knitted data and handling the redundancy. The knitting stylesheet that currently ships with NXT is very slow, but a very fast implementation of the knitting process that works with NXT format data has been developed and is expected as part of an upcoming LTXML release (University of Edinburgh Language Technology Group, nd). The cost of dealing with redundancy depends on the branching structure of the corpus. To date, most corpora with multiple parenthood have a number of quite shallow trees that do not branch themselves but all point to the same few base levels (e.g. orthography), suggesting we can at least avoid exponential expansion.</Paragraph> </Section> </Section> <Section position="5" start_page="30" end_page="31" type="metho"> <SectionTitle> 6 Tests </SectionTitle> <Paragraph position="0"> For initial testing, we chose a small set of queries which would allow us to judge potential implementations in terms of whether they could do everything we need to do, whether they would give the correct results, and how they would perform against our stated requirements. This allows us to form an opinion whilst only writing portions of the code required for a complete NQL implementation. Our set of queries is therefore designed to involve all of the basic operations required to exploit NXT's ability to represent multi-rooted trees and to traverse a large amount of data, so that they are computationally expensive and could return many results. In the tests, we ran the queries over the NXT translation for the Penn Treebank syntax-annotated version of one Switchboard dialogue (Carletta et al., 2004), sw4114. 
The full dialogue is approximately 426Kb in size and contains 1,101 word elements.</Paragraph> <Section position="1" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 6.1 Test queries </SectionTitle> <Paragraph position="0"> Our test queries were as follows.</Paragraph> <Paragraph position="2"> (words not dominated by any turn) In the data, the category &quot;nt&quot; represents syntactic non-terminals. The third query was chosen because it is particularly slow in the current NQL implementation, but is easily expressed as a path and therefore is likely to execute efficiently in XPath implementations.</Paragraph> <Paragraph position="3"> Although NXT's object model also allows for arbitrary relationships between nodes using pointers with named roles, increasing speed for queries over them is only a secondary concern, and we know that implementing operators over them is possible in XQuery because it is very similar to resolving stand-off child links. For this reason, none of our test queries involve pointers.</Paragraph> </Section> <Section position="2" start_page="30" end_page="31" type="sub_section"> <SectionTitle> 6.2 Test environment </SectionTitle> <Paragraph position="0"> For processing XQuery, we used Saxon (www.saxonica.com), which provides an API so that it can be called from Java. There are several available XQuery interpreters, and they will differ in their implementation details. We chose Saxon because it appears to be the most complete and is well-supported. 
Alternative interpreters, Galax (www.galaxquery.org) and Qexo (www.gnu.org/software/qexo/), provided only incomplete implementations at the time of writing.</Paragraph> </Section> <Section position="3" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 6.3 Comparability of results </SectionTitle> <Paragraph position="0"> It is not possible in a test like this to produce completely comparable results because the different implementation strategies are doing very different things to arrive at their results. For example, consider our second query. Apart from some primitive optimizations, on this and all queries, NXT Search does an exhaustive search of all possible k-tuples that match the types given in the query, varying the rightmost variable fastest. Our XQuery implementation on stand-off data first finds matches to $w1, $w2, and $np; then calls a function that calculates the ancestries for matches to $w1 and $w2; for each ($w1, $w2) pair, computes the intersection of the two ancestries; and finally filters this intersection against the list of $np matches.</Paragraph> <Paragraph position="1"> On the other hand, the implementation on the knitted data is shown in figure 1. It first sets variables representing the XML document containing our knitted data and all distinct nt elements within that document which both have a category attribute &quot;NP&quot; and have further word descendants. It then sets a variable to represent the sequence of results. The results are calculated by taking each NP-type element and checking its word descendants for those pairs where a word &quot;the&quot; precedes another word. The implementation also applies the condition that the NP-type element must not have another NP element as an ancestor -- this is to remove duplicates introduced by the way we find the initial set of NPs.</Paragraph> <Paragraph position="2"> In addition to the execution strategies, the methods used to start off processing were quite different. 
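The intersection-of-ancestries strategy just described for the stand-off implementation can be sketched as follows. This is a hypothetical Python reconstruction of the evaluation plan for query 2 (the "the"-before-word-under-NP query), not the actual XQuery library code; the node records and the `ancestry`/`precedes` helpers are assumptions.

```python
# Hedged sketch of the stand-off evaluation strategy for query 2:
# first match $w1, $w2, and $np; then, for each word pair, intersect
# the two ancestries and filter the intersection against the $np matches.
# All data shapes and helper functions here are illustrative assumptions.

def eval_query2(words, nps, ancestry, precedes):
    """words, nps: candidate node dicts; ancestry(n) -> set of ancestor ids
    (computed via stand-off link resolution); precedes(a, b) -> True when
    a comes before b in document order."""
    results = []
    for w1 in words:
        if w1["text"] != "the":
            continue
        anc1 = ancestry(w1)                      # ancestry of the first word
        for w2 in words:
            if w2 is w1 or not precedes(w1, w2):
                continue
            common = anc1 & ancestry(w2)         # intersection of ancestries
            results.extend(np for np in nps if np["id"] in common)
    return results
```

Even in miniature, the shape of the computation differs visibly from NXT Search's exhaustive k-tuple scan: the expensive ancestry calculation is done once per word rather than once per tuple.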
For each of the implementations, we did whatever gave the best performance. For the XQuery-based implementations, this meant writing a Java class to start up a static context for the execution of the query and reusing it to run the query repeatedly. For NXT, it meant using a shell script to run the command-line utility SaveQueryResults repeatedly on a set of observations, exiting each</Paragraph> <Paragraph position="4"> [Figure 1 (fragment), XQuery over the knitted data: where (struct:node-precedes($a, $b) and not($np/ancestor::nt[@cat=&quot;NP&quot;])) (: only return for the uppermost common NP ancestor :) return (element match {$a, $b}) ) ) union () return element result {attribute count count($result), $result}] Our aim in performing the comparison is to assess what is possible in each approach rather than to do the same thing in each, and this is why we have attempted to achieve best possible performance in each context rather than making the conditions as similar as possible. In all cases, the figures we report are the mean timings over five runs of what the Linux time command reports as 'real' time.</Paragraph> </Section> </Section> <Section position="6" start_page="31" end_page="33" type="metho"> <SectionTitle> 7 Speed Results </SectionTitle> <Paragraph position="0"> The results of our trial are shown in the following table. Timings which are purely in seconds are given to 2 decimal places; those which extend into the minutes are given to the nearest second.</Paragraph> <Paragraph position="1"> &quot;NXT&quot; means NXT Search; &quot;XQ&quot; is the condition with XQuery using stand-off data; and &quot;XQ-K&quot; is the condition with XQuery using the redundant knitted data.</Paragraph> <Paragraph position="2"> Although it would be wrong to read too much into our simple testing, these results do suggest some tentative conclusions. 
The first is that using XQuery on NXT's stand-off data format is unlikely to increase execution speed except for queries that are computationally very expensive for NXT, and may decrease performance for other queries. If users show any tolerance for delays, it is more likely to be for the delays to the former, and therefore this does not seem a winning strategy. On the other hand, using XQuery on the knitted data provides useful (and sometimes impressive) gains across the board.</Paragraph> <Paragraph position="3"> It should be noted that our results are based upon a single XQuery implementation and are inevitably implementation-specific. Future work will also attempt to make comparisons with alternatives, including those provided by XML databases.</Paragraph> <Section position="1" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 7.1 Memory results </SectionTitle> <Paragraph position="0"> To explore our second requirement, the ability to load more data, we generated a series of corpora which double in size from an initial set of 4 children with 2 parents.</Paragraph> <Paragraph position="1"> We ran both NXT Search and XQuery in Saxon on these corpora, with the Java Virtual Machine initialised with increasing amounts of memory, and recorded the maximum corpus size each was able to handle. Both query languages were exercised on NXT stand-off data, with the simple task of calculating parent/child relationships. Results are shown in the following table.</Paragraph> <Paragraph position="2"> These initial tests suggest that at its best, the XQuery implementation in Saxon can manage around 4 times as much data as NXT Search. It is interesting to note that the full set of tests took about 19 minutes for XQuery, but 18 hours for NXT Search. That is, Saxon appears to be far more efficient at managing large data sets. 
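The doubling test corpora can be sketched as follows. The paper states only that the corpora "double in size from an initial set of 4 children with 2 parents", so the exact wiring below is an assumption, a minimal generator consistent with that description rather than the actual test harness.

```python
# Hedged sketch of generating the doubling memory-test corpora: step k has
# 4 * 2**k word children and 2 * 2**k parents, each parent holding a
# stand-off link to a contiguous span of children. The link format and
# node records are illustrative assumptions.

def make_corpus(step):
    n_children = 4 * 2 ** step
    n_parents = 2 * 2 ** step
    children = [{"id": f"c{i}", "type": "word"} for i in range(n_children)]
    span = n_children // n_parents               # 2 children per parent
    parents = [{"id": f"p{j}", "type": "nt",
                "links": [("words.xml", f"c{j * span}",
                           f"c{(j + 1) * span - 1}")]}
               for j in range(n_parents)]
    return {"words.xml": children, "syntax.xml": parents}
```

Running the parent/child task over `make_corpus(0)`, `make_corpus(1)`, ... under a JVM with a fixed heap, and recording the largest step that completes, reproduces the shape of the experiment described above.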
We also discovered that the NXT results were different when a different query was used; we hope to elaborate these results more accurately in the future. We did not specifically run this test on the implementation that uses XQuery on knitted data because the basic characteristics would be the same as for the XQuery implementation with stand-off data. The size of a knitted data version will depend on the amount of redundancy that knitting creates. Knitting has the potential to greatly increase the amount of memory required, but it is worth noting that it does not always do so. The knitted version of the Switchboard dialogue used for these tests is actually smaller than the stand-off version, because the original stand-off stores terminals (words) in a separate file from syntax trees even though the terminals are defined to have only one parent. That is, there can be good reasons for using stand-off annotation, but it does have its own costs, as XLinks take space.</Paragraph> </Section> <Section position="2" start_page="32" end_page="33" type="sub_section"> <SectionTitle> 7.2 Query rewriting </SectionTitle> <Paragraph position="0"> In the testing described so far, we used the existing version of NXT Search. Rather than writing a new query language implementation, we could just invest our resources in improving NXT Search itself. It is possible that we could change the underlying XML handling to use libraries that are more memory-efficient, but this is unlikely to give real scalability. The biggest speed improvements could probably be made by re-ordering terms before query execution. 
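The core of such a term-reordering optimisation can be sketched as follows. This is a hypothetical Python model, not NXT Search code: the term records, the candidate-count cost model, and the nested-loop evaluator are assumptions used only to show why binding the most restrictive variables first shrinks the search space.

```python
# Hedged sketch of term reordering before query execution: sequence the
# variable bindings by increasing candidate-set size so that cheap,
# restrictive terms prune the exhaustive search early. All data shapes
# here are illustrative assumptions.

def reorder_terms(terms, match_counts):
    """terms: list of (variable, predicate) pairs; match_counts: variable ->
    size of its candidate set. Returns the terms sorted cheapest-first."""
    return sorted(terms, key=lambda t: match_counts[t[0]])

def evaluate(terms, candidates, joint_test):
    """Exhaustive nested search over the (re)ordered terms, applying each
    per-variable predicate as early as possible."""
    def go(bound, rest):
        if not rest:
            return [dict(bound)] if joint_test(bound) else []
        var, pred = rest[0]
        return [m for val in candidates[var] if pred(val)
                for m in go({**bound, var: val}, rest[1:])]
    return go({}, terms)
```

With the small variable bound first, the inner loops over large candidate sets run only for combinations that have already survived the cheap tests, which is the effect the hand-rewritten query below achieves.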
Experienced query authors can often speed up a query if they rewrite the terms to minimize the size of the search space, assuming they know the shape of the underlying data set.</Paragraph> <Paragraph position="1"> Although we do not yet have an algorithm for this rewriting, it roughly involves ignoring the &quot;exists&quot; quantifier, splitting the query into a complex one with one variable binding per subquery, sequencing the component queries by increasing order of match set size, and evaluating tests on the earliest subquery possible. For example, consider the query ($w1 word):text($w1)=&quot;the&quot; :: ($p nt):$p@cat eq &quot;NP&quot; && $p^$w1 :: ($w2 word): $p^$w2 && $w1<>$w2 This query, which bears a family resemblance to query 2, takes 4.31s, which is a considerable improvement. Of course, the result tree is a different shape from the one specified in the original query, and so this strategy for gaining speed improvements would incur the additional cost of rewriting the result tree after execution.</Paragraph> <Paragraph position="2"> Our testing suggests that if we want to make speed improvements, creating a new NQL implementation that uses XQuery on a redundant data representation is a good option. Although not the result we initially expected, it is perhaps unsurprising. This XQuery implementation strategy draws more heavily on XPath than the stand-off strategy, and XPath is the most well-exercised portion of XQuery. The advantages do not just come from recasting our computations as operations over trees. XPath allows us, for instance, to write a single expression that both binds a variable and performs condition tests on it, rather than requiring us to first bind the variable and then loop through each combination of nodes to determine which satisfy the constraints. Using a redundant data representation increases memory requirements, but the XQuery-based strategies use sufficiently less memory that the redundancy in itself will perhaps not be an issue. 
In order to settle this question, we must think more carefully about the size and shape of current and potential NXT corpora. Our other option for making speed improvements is to augment NXT Search with a query rewriting strategy. This needs further evaluation because the improvements will vary widely with the query being rewritten, but our initial test worked surprisingly well. However, augmenting the current NXT Search in this way will not reduce its memory use, and it is not clear whether this improvement can readily be made by other means.</Paragraph> </Section> </Section> </Paper>