<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1516">
  <Title>SemTAG, the LORIA toolbox for TAG-based Parsing and Generation</Title>
  <Section position="4" start_page="115" end_page="116" type="metho">
    <SectionTitle>
3 Two TAG parsers
</SectionTitle>
    <Paragraph position="0"> The toolbox includes two parsing systems: the LLP2 parser and the DyALog system. Both of them can be used in conjunction with XMG. First we will briefly introduce both of them, and then show that they can be used with a semantic grammar (e.g., SEMFRAG) to perform not only syntactic parsing but also semantic construction.</Paragraph>
    <Paragraph position="1"> LLP2 The LLP2 parser is based on a bottom-up algorithm described in (Lopez, 1999). It has relatively high parsing times but provides a user friendly graphical parsing environment with much statistical information (see Figure 2). It is well suited for teaching or for small scale projects.</Paragraph>
    <Paragraph position="2"> DyALog The DyALog system on the other hand, is a highly optimised parsing system based on tabulation and automata techniques (Villemonte de la Clergerie, 2005). It is implemented using the DyALog programming language (i.e., it is bootstrapped) and is also used to compile parsers for other types of grammars such as Tree Insertion Grammars.</Paragraph>
    <Paragraph position="3"> The DyALog system is coupled with a semantic construction module whose aim is to associate witheachparsedstringasemanticrepresentation5.</Paragraph>
    <Paragraph position="4"> This module assumes a TAG of the type described in (Gardent and Kallmeyer, 2003; Gardent, 2006)  where initial trees are associated with semantic information and unification is used to combine semantic representations. In such a grammar, the semantic representation of a derived tree is the union of the semantic representations of the trees entering in the derivation of that derived tree modulo the unifications entailed by analysis. As detailed in(GardentandParmentier, 2005), suchgrammars support two strategies for semantic construction.</Paragraph>
    <Paragraph position="5"> The first possible strategy is to use the full grammar and to perform semantic construction during derivation. In this case the parser must manipulate both syntactic trees and semantic representations. The advantage is that the approach is simple (the semantic representations can simply be an added feature on the anchor node of each tree). The drawback is that the presence of semantic information might reduce chart sharing.</Paragraph>
    <Paragraph position="6"> The second possibility involves extracting the semantic information contained in the grammar and storing it into a semantic lexicon. Parsing then proceeds with a purely syntactic grammar and semantic construction is done after parsing on the basis of the parser output and of the extracted semantic lexicon. This latter technique is more suitableforlargescalesemanticconstructionasitsup- null ports better sharing in the derivation forests. It is implemented in the LORIA toolbox where a module permits both extracting a semantic lexicon from a semantic TAG and constructing a semantic representation based on this lexicon and on the derivation forests output by DyALog (see Figure 3).</Paragraph>
    <Paragraph position="7"> The integration of the DyALog system into the toolbox is relatively new so that parsing evaluation  is still under progress. So far, evaluation has been restricted to parsing the TSNLP with DyALog with the following preliminary results. On sentences ranging from 1 to 18 words, with an average of 7 words per sentence, and with a grammar containing 5 069 trees, DyALog average parsing time is of 0.38 sec with a P4 processor 2.6 GHz and 1 GB of RAM6.</Paragraph>
  </Section>
  <Section position="5" start_page="116" end_page="118" type="metho">
    <SectionTitle>
4 A TAG-based surface realiser
</SectionTitle>
    <Paragraph position="0"> The surface realiser GenI takes a TAG and a flat semantic logical form as input, and produces all the sentences that are associated with that logical form by the grammar. It implements two bottomupalgorithms, onewhichmanipulatesderived trees as items and one which is based on Earley for TAG. Both of these algorithms integrate a number of optimisations such as delayed adjunction and polarity filtering (Kow, 2005; Gardent and Kow, 2005).</Paragraph>
    <Paragraph position="1"> GenI is written in Haskell and includes a graphical debugger to inspect the state of the generator at any point in the surface realisation process (see Figure 4). It also integrates a test harness for automated regression testing and benchmarking of the surface realiser and the grammar. The harness gtester is written in Python. It runs the surface realiser on a test suite, outputting a single document with a table of passes and failures and various performance charts (see Figures 5 and 6).</Paragraph>
    <Paragraph position="2"> Test suite and performance The test suite is built with an emphasis on testing the surface re- null formance data by the test harness.</Paragraph>
    <Paragraph position="3"> aliser's performance in the face of increasing paraphrastic power i.e., ambiguity. The suite consists of semantic inputs that select for and combines verbs with different valencies. For example, given a hypothetical English grammar, a valency (2,1) semantics might be realised in as Martin thinks Faye drinks (thinks takes 2 arguments and drinks takes 1), whereas a valency (2,3,2) one would be Dora says that Martin tells Bob that Faye likes music. The suite also adds a varying number of intersective modifiers into the mix, giving us for instance, The girl likes music, The pretty scary girl likes indie music.</Paragraph>
    <Paragraph position="4"> The sentences in the suite range from 2 to 15 words (8 average). Realisation times for the core suite range from 0.7 to 2.84 seconds CPU time (average 1.6 seconds).</Paragraph>
    <Paragraph position="5"> We estimate the ambiguity for each test case in two ways. The first is to count the number of paraphrases. Given our current grammar, the test cases in our suite have up to 669 paraphrases (average 41). The second estimate for ambiguity is the number of combinations of lexical items covering the input semantics.</Paragraph>
    <Paragraph position="6"> This second measure is based on optimisation known as polarity filtering (Gardent and Kow, 2005). This optimisation detects and eliminates combinations of lexical items that cannot be used to build a result. It associates the syntactic resources (root nodes) and requirements (substitution nodes) of the lexical items to polarities, which are then used to build &amp;quot;polarity automata&amp;quot;. The automata are minimised to eliminate lexical combinations where the polarities do not cancel out, that is those for which the number of root and substitution nodes for any given category do not equal each other.</Paragraph>
    <Paragraph position="7"> Once built, the polarity automata can also serve to estimate ambiguity. The number of paths in the automaton represent the number of possible combinations of lexical items. To determine how effective polarity filtering with respect to ambiguity, we compare the combinations before and after polarity filtering. Before filtering, we start with an initial polarity automaton in which all items are associated with a zero polarity. This gives us the lexical ambiguity before filtering. The polarity filter then builds upon this to form a final automaton where all polarities are taken into account. Counting the paths on this automaton gives us the ambiguity after filtering, and comparing this number with the lexical initial ambiguity provides an estimate on the usefulness of the polarity filter. In our suite, the initial automata for each case have 1 to 800 000 paths (76 000 average). The final automata have 1 to 6000 paths (192 average).</Paragraph>
    <Paragraph position="8"> Thiscanrepresentquitealargereductioninsearch space, 4000 times in the case of the largest automaton. The effect of this search space reduction is most pronounced on the larger sentences or those with the most modifiers. Indeed, realisation timeswithandwithoutfilteringarecomparablefor most of the test suite, but for the most complicated sentence in the core suite, polarity filtering makes surface realisation 94% faster, producing a result in 2.35 seconds instead of 37.38.</Paragraph>
    <Paragraph position="9"> 5 Benefits of an integrated toolset As described above, the LORIA toolbox for TAG based semantic processing includes a lexicon, a grammar, a parser, a semantic construction module and a surface realiser. Integrating these into a single platform provides some accrued benefits which we now discuss in more details.</Paragraph>
    <Paragraph position="10"> Simplified resource management The first advantage of an integrated toolkit is that it facilitates  the management of the linguistic resources used namely the grammar and the lexicon. Indeed it is common that each NLP tool (parser or generator) has its own representation format. Thus, managing the resources gets tiresome as one has to deal with several versions of a single resource. When one version is updated, the others have to be recomputed. Using an integrated toolset avoid such a drawback as the intermediate formats are hidden and the user can focus on linguistic description.</Paragraph>
    <Paragraph position="11"> Better support for grammar development When developing parsers or surface realisers, it is useful to test them out by running them on large, realistic grammars. Such grammars can explore nooks and crannies in our implementations that would otherwise have been overlooked by a toy grammar. For example, it was only when we ran GenIon our French grammar that we realised our implementation did not account for auxiliary trees with substitution nodes (this has been rectified).</Paragraph>
    <Paragraph position="12"> In this respect, one could argue that XMG could almost be seen as a parser/realiser debugging utility because it helps us to build and extend the large grammars that are crucial for testing.</Paragraph>
    <Paragraph position="13"> This perspective can also be inverted; parsers and surface realiser make for excellent grammardebugging devices. For example, one possible regression test is to run the parser on a suite of known sentences to make sure that the modified grammar still parses them correctly. The exact reverse is useful as well; we could also run the surface realiser over a suite of known semantic inputs and make sure that sentences are generated for each one. This is useful for two reasons. First, reading surface realiser output (sentences) is arguably easier for human beings than reading parser output (semantic formulas). Second, the surface realiser can tell us if the grammar overgeneratesbecauseitwouldoutputnonsensesentences. null Parsers,ontheotherhand,aremuchbetteradapted for testing for undergeneration because it is easier to write sentences than semantic formulas, which makes it easier to test phenomena which might not already be in the suite.</Paragraph>
    <Paragraph position="14"> Towards a reversible grammar Another advantage of using such a toolset relies on the fact that we can manage a common resource for both parsing and generation, and thus avoid inconsistency, redundancy and offer a better flexibility as advocated in (Neumann, 1994).</Paragraph>
    <Paragraph position="15"> On top of these practical questions, having a unique reversible resource can lead us further.</Paragraph>
    <Paragraph position="16"> For instance, (Neumann, 1994) proposes an interleaved parsing/realisation architecture where the parser is used to choose among a set of paraphrases proposed by the generator; paraphrases which are ambiguous (that have multiple parses) are discarded in favour of those whose meaning is most explicit. Concretely, we could do this with a simple pipeline using GenI to produce the paraphrases, DyALog to parse them, and a small shell script to pick the best result. This would only be a simulation, of course. (Neumann, 1994) goes as far as to interleave the processes, keeping the shared chart and using the parser to iteratively prune the search space as it is being explored by the generator. The version we propose would not have such niceties as a shared chart, but the point is that having all the tools at our disposable makes such experimentation possible in the first place.</Paragraph>
    <Paragraph position="17"> Moreover, there are several other interesting applications of the combined toolbox. We could use the surface realiser to build artificial corpora. These can in turn be parsed to semi-automatically create rich treebanks containing syntactico-semantic analyses `a la Redwoods (Oepen et al., 2002).</Paragraph>
    <Paragraph position="18"> Eventually, anotheruseforthetoolboxmightbe in components of standard NLP applications such as machine translation, questioning answering, or interactive dialogue systems.</Paragraph>
  </Section>
class="xml-element"></Paper>