Proceedings of the 8th International Workshop on Tree Adjoining Grammar and Related Formalisms, pages 115–120,
Sydney, July 2006. ©2006 Association for Computational Linguistics
SemTAG, the LORIA toolbox for TAG-based Parsing and Generation
Eric Kow
INRIA / LORIA
Université Henri Poincaré
615, rue du Jardin Botanique
F-54 600 Villers-lès-Nancy
kow@loria.fr
Yannick Parmentier
INRIA / LORIA
Université Henri Poincaré
615, rue du Jardin Botanique
F-54 600 Villers-lès-Nancy
parmenti@loria.fr
Claire Gardent
CNRS / LORIA
615, rue du Jardin Botanique
F-54 600 Villers-lès-Nancy
gardent@loria.fr
Abstract
In this paper, we introduce SEMTAG, a
toolbox for TAG-based parsing and gen-
eration. This environment supports the
development of wide-coverage grammars
and differs from existing environments
for TAG such as XTAG (XTAG-Research-Group, 2001)
in that it includes a semantic
dimension. SEMTAG is open-source and
freely available.
1 Introduction
In this paper we introduce a toolbox that allows for
both parsing and generation with TAG. This toolbox
combines existing software and aims at facilitating
grammar development. More precisely, this
toolbox includes¹:
• XMG: a grammar compiler which supports the
generation of a TAG from a factorised TAG
(Crabbé and Duchier, 2004),
• LLP2 and DyALog: two chart parsers, one
with a friendly user interface (Lopez, 2000)
and the other optimised for efficient parsing
(Villemonte de la Clergerie, 2005)²,
• GenI: a chart generator which has been
tested on a medium-sized grammar for French
(Gardent and Kow, 2005).
¹ All these tools are freely available; more information and
links at http://trac.loria.fr/~semtag.
² Note that DyALog refers in fact to a logic programming
language and a tabular compiler for this language. The
DyALog system is well-adapted to the compilation of effi-
cient tabular parsers.
2 XMG, a grammar writing environment
for Tree Based Grammars
XMG provides a grammar writing environment for
tree based grammars³ with three distinctive features.
First, XMG supports a highly factorised and
fully declarative description of tree based grammars.
Second, XMG permits the integration of a semantic
dimension into a TAG. Third, XMG is based
on well understood and efficient logic programming
techniques. Moreover, it offers a graphical
interface for exploring the resulting grammar (see
Figure 1).
Factorising information. In the XMG framework,
a TAG is defined by a set of classes organised
in an inheritance hierarchy where classes define
tree fragments (using a tree logic) and tree frag-
ment combinations (by conjunction or disjunc-
tion). XMG furthermore integrates a sophisticated
treatment of names whereby variable scope can
be local, global or user defined (i.e., local to part
of the hierarchy).
In practice, the resulting framework supports a
very high degree of factorisation. For instance, a
first core grammar (FRAG) for French comprising
4 200 trees was produced from roughly 300 XMG
classes.
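To make the factorisation idea concrete, here is a minimal Python sketch. It is not XMG syntax, and every class and fragment name in it is hypothetical. It models a class as a conjunction or disjunction of named tree fragments and enumerates the fragment sets such a class expands to, ignoring the tree logic and variable scoping entirely.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, List, Tuple, Union


@dataclass(frozen=True)
class Fragment:
    """A named tree fragment (stands in for a tree-logic description)."""
    name: str


@dataclass(frozen=True)
class Conj:
    """Conjunction: every part contributes to the resulting tree."""
    parts: Tuple["Node", ...]


@dataclass(frozen=True)
class Disj:
    """Disjunction: exactly one alternative is chosen."""
    alternatives: Tuple["Node", ...]


Node = Union[Fragment, Conj, Disj]


def expand(node: Node) -> List[FrozenSet[str]]:
    """Return every set of fragment names that the description denotes."""
    if isinstance(node, Fragment):
        return [frozenset([node.name])]
    if isinstance(node, Disj):
        return [s for alt in node.alternatives for s in expand(alt)]
    # Conj: pick one expansion per part and merge the chosen fragments.
    choices = [expand(part) for part in node.parts]
    return [frozenset().union(*combo) for combo in product(*choices)]


# Hypothetical mini-hierarchy: 7 fragments combined by 3 classes.
subject = Disj((Fragment("CanonicalSubject"),
                Fragment("RelativisedSubject"),
                Fragment("CliticSubject")))
obj = Disj((Fragment("CanonicalObject"),
            Fragment("CliticObject"),
            Fragment("WhObject")))
transitive = Conj((subject, Fragment("ActiveVerb"), obj))

trees = expand(transitive)
print(len(trees))  # 3 subject realisations x 3 object realisations
```

Each further disjunction multiplies the number of resulting trees while adding only one class, which is the effect the FRAG figures above illustrate at scale.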
Integrating semantic information. In XMG,
classes can be multi-dimensional. That is, they
can be used to describe several levels of linguis-
tic knowledge such as for instance, syntax, seman-
tics or prosody. At present, XMG supports classes
including both a syntactic and a semantic dimension.
As mentioned above, the syntactic dimension
is based on a tree logic and can be used to
describe (partial) tree fragments. The semantic dimension,
on the other hand, can be used to associate
with each tree a flat semantic formula. Such a
formula can furthermore include identifiers which
corefer with identifiers occurring in the associated
syntactic tree. In other words, XMG also provides
support for the interface between semantic formulae
and tree decorations. Note that the inclusion of
semantic information remains optional. That is, it
is possible to use XMG to define a purely syntactic
TAG.
Figure 1: XMG’s graphical interface
³ Although in this paper we only mention TAG, the XMG
framework is also used to develop so-called Interaction Grammars,
i.e., grammars whose basic units are tree descriptions
rather than trees (Parmentier and Le Roux, 2005).
XMG was used to develop a core grammar for
French (FRAG) which was evaluated to have 75%
coverage⁴ on the Test Suite for Natural Language
Processing (TSNLP; Lehmann et al., 1996). The
FRAG grammar was furthermore enriched with
semantic information using another 50 classes de-
scribing the semantic dimension (Gardent, 2006).
The resulting grammar (SEMFRAG) describes
both the syntax and the semantics of the French
core constructions.
Compiling an XMG specification. By building
on efficient techniques from logic programming,
and in particular on the idea of Warren’s Abstract
Machine (Ait-Kaci, 1991), the XMG compiler
allows for very reasonable compilation times
(Duchier et al., 2004). For instance, the compilation
of a TAG containing 6 000 trees takes about 15
minutes with a 2.6 GHz Pentium 4 processor and
1 GB of RAM.
⁴ This means that for 75% of the sentences, a TAG parser
can build at least one derivation.
Figure 2: The LLP2 parser.
3 Two TAG parsers
The toolbox includes two parsing systems: the
LLP2 parser and the DyALog system. Both of
them can be used in conjunction with XMG. First
we will briefly introduce both of them, and then
show that they can be used with a semantic gram-
mar (e.g., SEMFRAG) to perform not only syntac-
tic parsing but also semantic construction.
LLP2. The LLP2 parser is based on a bottom-up
algorithm described in (Lopez, 1999). It has
relatively high parsing times but provides a user
friendly graphical parsing environment with much
statistical information (see Figure 2). It is well
suited for teaching or for small scale projects.
DyALog. The DyALog system, on the other
hand, is a highly optimised parsing system based
on tabulation and automata techniques (Ville-
monte de la Clergerie, 2005). It is implemented
using the DyALog programming language (i.e.,
it is bootstrapped) and is also used to compile
parsers for other types of grammars such as Tree
Insertion Grammars.
The DyALog system is coupled with a semantic
construction module whose aim is to associate
with each parsed string a semantic representation⁵.
This module assumes a TAG of the type described
in (Gardent and Kallmeyer, 2003; Gardent, 2006)
where initial trees are associated with semantic information
and unification is used to combine semantic
representations. In such a grammar, the semantic
representation of a derived tree is the union
of the semantic representations of the trees entering
into the derivation of that derived tree, modulo
the unifications entailed by analysis. As detailed
in (Gardent and Parmentier, 2005), such grammars
support two strategies for semantic construction.
⁵ The corresponding system is called SemConst (cf. Section 6).
Figure 3: The SemConst system
The first possible strategy is to use the full
grammar and to perform semantic construction
during derivation. In this case the parser must ma-
nipulate both syntactic trees and semantic repre-
sentations. The advantage is that the approach is
simple (the semantic representations can simply
be an added feature on the anchor node of each
tree). The drawback is that the presence of seman-
tic information might reduce chart sharing.
The second possibility involves extracting the
semantic information contained in the grammar
and storing it in a semantic lexicon. Parsing then
proceeds with a purely syntactic grammar, and semantic
construction is done after parsing on the
basis of the parser output and of the extracted semantic
lexicon. This latter technique is more suitable
for large scale semantic construction as it supports
better sharing in the derivation forests. It
is implemented in the LORIA toolbox where a
module permits both extracting a semantic lexi-
con from a semantic TAG and constructing a se-
mantic representation based on this lexicon and on
the derivation forests output by DyALog (see Fig-
ure 3).
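The second strategy can be sketched in a few lines of Python. This is an illustration only and does not reflect SemConst's actual data formats: the lexicon entries, the variable names and the list of equations are all hypothetical, and each tree is assumed to use distinct variable names. The semantics of a derivation is computed as the union of the lexicon entries of the trees involved, with variables resolved according to the unifications the parse entails.

```python
from typing import Dict, Iterable, List, Tuple

Literal = Tuple[str, ...]   # (predicate, argument, ...); variables start with "?"

# Hypothetical semantic lexicon extracted from the grammar: tree name -> flat semantics.
SEM_LEXICON: Dict[str, List[Literal]] = {
    "john": [("john", "?j")],
    "runs": [("run", "?e", "?x")],
    "often": [("often", "?o")],
}


def resolve(bindings: Dict[str, str], term: str) -> str:
    """Follow variable bindings until an unbound term is reached."""
    while term in bindings:
        term = bindings[term]
    return term


def unify(bindings: Dict[str, str], left: str, right: str) -> None:
    """Record that two terms denote the same entity."""
    left, right = resolve(bindings, left), resolve(bindings, right)
    if left == right:
        return
    if left.startswith("?"):
        bindings[left] = right
    elif right.startswith("?"):
        bindings[right] = left
    else:
        raise ValueError(f"cannot unify {left} with {right}")


def build_semantics(trees: Iterable[str],
                    equations: Iterable[Tuple[str, str]]) -> List[Literal]:
    """Union of the trees' semantics, modulo the unifications from the derivation."""
    bindings: Dict[str, str] = {}
    for left, right in equations:
        unify(bindings, left, right)
    return [(lit[0], *(resolve(bindings, arg) for arg in lit[1:]))
            for tree in trees
            for lit in SEM_LEXICON[tree]]


# Hypothetical derivation: "john" substitutes into the subject of "runs",
# and "often" adjoins to the verb, sharing its event variable.
semantics = build_semantics(["john", "runs", "often"],
                            [("?x", "?j"), ("?o", "?e")])
print(semantics)
```

Because the parser only ever sees syntactic trees here, identical syntactic analyses with different semantics can share chart items, which is the motivation given above for this strategy.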
The integration of the DyALog system into the
toolbox is relatively new, so parsing evaluation
is still in progress. So far, evaluation has been
restricted to parsing the TSNLP with DyALog,
with the following preliminary results. On sentences
ranging from 1 to 18 words, with an average
of 7 words per sentence, and with a grammar
containing 5 069 trees, DyALog's average parsing
time is 0.38 s with a 2.6 GHz P4 processor
and 1 GB of RAM⁶.
Figure 4: The GenI debugger
4 A TAG-based surface realiser
The surface realiser GenI takes a TAG and a flat
semantic logical form as input, and produces all
the sentences that are associated with that logical
form by the grammar. It implements two bottom-up
algorithms, one which manipulates derived
trees as items and one which is based on Earley for
TAG. Both of these algorithms integrate a number
of optimisations such as delayed adjunction and
polarity filtering (Kow, 2005; Gardent and Kow,
2005).
GenI is written in Haskell and includes a
graphical debugger to inspect the state of the gen-
erator at any point in the surface realisation pro-
cess (see Figure 4). It also integrates a test harness
for automated regression testing and benchmark-
ing of the surface realiser and the grammar. The
harness gtester is written in Python. It runs the
surface realiser on a test suite, outputting a single
document with a table of passes and failures and
various performance charts (see Figures 5 and 6).
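The kind of regression run described here can be sketched as follows. This is not the code of gtester: the realiser callables, the semantic input strings and the test data are hypothetical stand-ins, and performance charts are left out. The sketch only shows how a pass/fail table with one column per algorithm might be produced.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Realiser = Callable[[str], Set[str]]   # semantic input -> realised sentences


def run_suite(cases: Iterable[Tuple[str, str, str]],
              realisers: Dict[str, Realiser]) -> List[List[str]]:
    """Run every realiser on every (name, semantics, expected sentence) case."""
    rows = [["test", "expected", *realisers]]
    for name, semantics, expected in cases:
        row = [name, expected]
        for realise in realisers.values():
            try:
                outcome = "pass" if expected in realise(semantics) else "FAIL"
            except Exception:          # crash or time-out inside the realiser
                outcome = "DIED"
            row.append(outcome)
        rows.append(row)
    return rows


def format_table(rows: List[List[str]]) -> str:
    """Render the rows as left-aligned plain-text columns."""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows)


# Hypothetical stand-ins for the two realisation algorithms.
def simple_stub(semantics: str) -> Set[str]:
    return {"vous venir"} if semantics == "venir(e, vous)" else {"il le accepter"}


def earley_stub(semantics: str) -> Set[str]:
    if semantics == "accepter(e, il, le)":
        raise TimeoutError("hypothetical time-out")
    return {"vous venir"}


cases = [("t1", "accepter(e, il, le)", "il le accepter"),
         ("t180", "venir(e, vous)", "vous venir")]
rows = run_suite(cases, {"simple": simple_stub, "earley": earley_stub})
print(format_table(rows))
```

Keeping one column per algorithm makes it easy to spot cases where the algorithms disagree or where only one of them fails to terminate.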
⁶ These figures only concern classic syntactic parsing, as
the semantic construction module has not been tested on real
grammars yet.
test   expected                        simple  earley
t1     il le accepter                  pass    pass
t32    il nous accepter                pass    pass
t83    le ingénieur le lui apprendre   pass    DIED
t114   le ingénieur nous le présenter  pass    pass
t145   le ingénieur vous le apprendre  pass    pass
t180   vous venir                      pass    pass
Figure 5: Fragment of test harness output. The
Earley algorithm timed out.
Figure 6: Automatically generated graph of performance
data by the test harness.
Test suite and performance. The test suite is
built with an emphasis on testing the surface realiser’s
performance in the face of increasing paraphrastic
power, i.e., ambiguity. The suite consists
of semantic inputs that select for and combine
verbs with different valencies. For example, given
a hypothetical English grammar, a valency (2,1)
semantics might be realised as Martin thinks
Faye drinks (thinks takes 2 arguments and drinks
takes 1), whereas a valency (2,3,2) one would be
Dora says that Martin tells Bob that Faye likes
music. The suite also adds a varying number of
intersective modifiers into the mix, giving us for
instance, The girl likes music, The pretty scary girl
likes indie music.
The sentences in the suite range from 2 to 15
words (8 average). Realisation times for the core
suite range from 0.7 to 2.84 seconds CPU time
(average 1.6 seconds).
We estimate the ambiguity for each test case
in two ways. The first is to count the number of
paraphrases. Given our current grammar, the test
cases in our suite have up to 669 paraphrases (av-
erage 41). The second estimate for ambiguity is
the number of combinations of lexical items cov-
ering the input semantics.
This second measure is based on an optimisation
known as polarity filtering (Gardent and Kow,
2005). This optimisation detects and eliminates
combinations of lexical items that cannot be used
to build a result. It associates the syntactic resources
(root nodes) and requirements (substitution
nodes) of the lexical items with polarities, which
are then used to build “polarity automata”. The
automata are minimised to eliminate lexical com-
binations where the polarities do not cancel out,
that is, those for which the numbers of root and substitution
nodes for any given category are not
equal.
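The cancellation criterion can be illustrated with a brute-force Python sketch. It enumerates lexical combinations directly instead of building and minimising an automaton, so it is a stand-in for the technique rather than GenI's implementation. The lexical items are hypothetical, and the sketch assumes, beyond what the text states, that the root of the sentence itself accounts for one unit of the goal category (an initial charge of -1 on S).

```python
from collections import Counter
from itertools import product
from typing import Iterable, List, Tuple

# A lexical item: (name, root category, categories of its substitution nodes).
Item = Tuple[str, str, Tuple[str, ...]]


def net_polarity(items: Iterable[Item], goal: str = "S") -> Counter:
    """+1 for each root node, -1 for each substitution node, per category."""
    total = Counter({goal: -1})   # assumption: the sentence consumes one goal root
    for _name, root, substitutions in items:
        total[root] += 1
        for category in substitutions:
            total[category] -= 1
    return total


def polarity_filter(candidates: List[List[Item]],
                    goal: str = "S") -> List[Tuple[Item, ...]]:
    """Keep the combinations (one item per literal) whose polarities cancel out."""
    return [combo for combo in product(*candidates)
            if not any(net_polarity(combo, goal).values())]


# Hypothetical candidates for a semantics with three literals.
candidates = [
    [("Martin", "NP", ())],
    [("drinks_transitive", "S", ("NP", "NP")),
     ("drinks_intransitive", "S", ("NP",)),
     ("drinking_nominal", "NP", ("NP",))],
    [("tea", "NP", ())],
]
kept = polarity_filter(candidates)
print([[name for name, _, _ in combo] for combo in kept])
# prints [['Martin', 'drinks_transitive', 'tea']]
```

The intransitive verb leaves one NP unused and the nominalisation provides no S, so both combinations are discarded before any tree combination is attempted.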
Once built, the polarity automata can also serve
to estimate ambiguity. The number of paths in the
automaton represents the number of possible combinations
of lexical items. To determine how effective
polarity filtering is with respect to ambiguity,
we compare the combinations before and after po-
larity filtering. Before filtering, we start with an
initial polarity automaton in which all items are
associated with a zero polarity. This gives us the
lexical ambiguity before filtering. The polarity fil-
ter then builds upon this to form a final automaton
where all polarities are taken into account. Counting
the paths on this automaton gives us the ambiguity
after filtering, and comparing this number
with the initial lexical ambiguity provides an estimate
of the usefulness of the polarity filter. In
our suite, the initial automata for each case have
1 to 800 000 paths (76 000 average). The final
automata have 1 to 6 000 paths (192 average).
This can represent quite a large reduction in search
space, 4 000 times in the case of the largest automaton.
The effect of this search space reduction
is most pronounced on the larger sentences or
those with the most modifiers. Indeed, realisation
times with and without filtering are comparable for
most of the test suite, but for the most complicated
sentence in the core suite, polarity filtering makes
surface realisation 94% faster, producing a result
in 2.35 seconds instead of 37.38.
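The before/after path counts can be reproduced in miniature with the following Python sketch. It is an assumption-laden illustration, not GenI's code: the lexical items are hypothetical, auxiliary trees are left out, the goal category S is given an initial charge of -1 (an assumption not stated in the text), and the automaton is never materialised or minimised. States are the accumulated polarity charges, and paths are counted layer by layer.

```python
from collections import Counter
from math import prod
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str, Tuple[str, ...]]   # (name, root category, substitution categories)
State = FrozenSet[Tuple[str, int]]        # non-zero charges accumulated so far


def step(state: State, item: Item) -> State:
    """Follow the transition labelled by one lexical item."""
    charges = Counter(dict(state))
    _name, root, substitutions = item
    charges[root] += 1
    for category in substitutions:
        charges[category] -= 1
    return frozenset((cat, n) for cat, n in charges.items() if n != 0)


def count_paths(candidates: List[List[Item]], goal: str = "S") -> Tuple[int, int]:
    """Return (paths before filtering, paths after filtering)."""
    # With every polarity set to zero there is one state per layer,
    # so the number of paths is the product of the alternatives.
    before = prod(len(alternatives) for alternatives in candidates)
    layer: Dict[State, int] = {frozenset({(goal, -1)}): 1}
    for alternatives in candidates:
        following: Dict[State, int] = {}
        for state, paths in layer.items():
            for item in alternatives:
                target = step(state, item)
                following[target] = following.get(target, 0) + paths
        layer = following
    # Only paths ending with all charges cancelled survive the filter.
    after = layer.get(frozenset(), 0)
    return before, after


# Hypothetical candidates, one list per semantic literal.
candidates = [
    [("girl", "NP", ())],
    [("likes_active", "S", ("NP", "NP")),
     ("likes_passive", "S", ("NP", "NP")),
     ("liking_nominal", "NP", ("NP", "NP"))],
    [("music_np", "NP", ()), ("music_n", "N", ())],
]
before, after = count_paths(candidates)
print(before, after)   # prints 6 2
```

Merging paths that reach the same charges into a single state is what keeps the count tractable even when the number of combinations is large.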
5 Benefits of an integrated toolset
As described above, the LORIA toolbox for
TAG-based semantic processing includes a lexicon, a
grammar, a parser, a semantic construction mod-
ule and a surface realiser. Integrating these into
a single platform provides some accrued benefits
which we now discuss in more detail.
Simplified resource management. The first advantage
of an integrated toolkit is that it facilitates
the management of the linguistic resources used,
namely the grammar and the lexicon. Indeed, it is
common that each NLP tool (parser or generator)
has its own representation format. Thus, manag-
ing the resources gets tiresome as one has to deal
with several versions of a single resource. When
one version is updated, the others have to be recomputed.
Using an integrated toolset avoids such
a drawback as the intermediate formats are hidden
and the user can focus on linguistic description.
Better support for grammar development.
When developing parsers or surface realisers, it is
useful to test them out by running them on large,
realistic grammars. Such grammars can explore
nooks and crannies in our implementations that
would otherwise have been overlooked by a toy
grammar. For example, it was only when we ran
GenI on our French grammar that we realised our
implementation did not account for auxiliary trees
with substitution nodes (this has been rectified).
In this respect, one could argue that XMG could al-
most be seen as a parser/realiser debugging utility
because it helps us to build and extend the large
grammars that are crucial for testing.
This perspective can also be inverted; parsers
and surface realisers make for excellent grammar-debugging
devices. For example, one possible
regression test is to run the parser on a suite of
known sentences to make sure that the modified
grammar still parses them correctly. The exact
reverse is useful as well; we could also run the
surface realiser over a suite of known semantic
inputs and make sure that sentences are gener-
ated for each one. This is useful for two reasons.
First, reading surface realiser output (sentences)
is arguably easier for human beings than reading
parser output (semantic formulas). Second, the
surface realiser can tell us if the grammar overgenerates
because it would output nonsense sentences.
Parsers, on the other hand, are much better adapted
for testing for undergeneration because it is easier
to write sentences than semantic formulas, which
makes it easier to test phenomena which might not
already be in the suite.
Towards a reversible grammar. Another advantage
of using such a toolset is
that we can manage a common resource for both
parsing and generation, and thus avoid inconsistency
and redundancy and offer better flexibility, as
advocated in (Neumann, 1994).
Beyond these practical questions, having a
single reversible resource can lead us further.
For instance, (Neumann, 1994) proposes an inter-
leaved parsing/realisation architecture where the
parser is used to choose among a set of para-
phrases proposed by the generator; paraphrases
which are ambiguous (that have multiple parses)
are discarded in favour of those whose meaning is
most explicit. Concretely, we could do this with a
simple pipeline using GenI to produce the para-
phrases, DyALog to parse them, and a small shell
script to pick the best result. This would only be
a simulation, of course. (Neumann, 1994) goes
as far as to interleave the processes, keeping the
shared chart and using the parser to iteratively
prune the search space as it is being explored by
the generator. The version we propose would not
have such niceties as a shared chart, but the point
is that having all the tools at our disposal makes
such experimentation possible in the first place.
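The simulated pipeline could look like the Python sketch below. The `generate` and `parse` callables are hypothetical stand-ins for calls to GenI and DyALog, whose actual interfaces are not shown here, and the sample sentences and analyses are invented for illustration. The selection rule is simply to keep the paraphrase with the fewest parses.

```python
from typing import Callable, Iterable, List, Optional


def pick_least_ambiguous(semantics: str,
                         generate: Callable[[str], Iterable[str]],
                         parse: Callable[[str], List[str]]) -> Optional[str]:
    """Return the paraphrase with the fewest parses (first one wins on ties)."""
    best: Optional[str] = None
    best_count: Optional[int] = None
    for sentence in generate(semantics):
        count = len(parse(sentence))
        if count == 0:
            continue              # the parser rejects it: not a usable candidate
        if best_count is None or count < best_count:
            best, best_count = sentence, count
    return best


# Hypothetical stand-ins for the generator and the parser.
PARAPHRASES = ["Martin saw the girl with the telescope",
               "Martin used the telescope to see the girl"]
PARSES = {
    PARAPHRASES[0]: ["instrument reading", "modifier reading"],
    PARAPHRASES[1]: ["instrument reading"],
}

choice = pick_least_ambiguous("see(e, martin, girl), with(e, telescope)",
                              lambda _semantics: PARAPHRASES,
                              lambda sentence: PARSES.get(sentence, []))
print(choice)   # prints Martin used the telescope to see the girl
```

Unlike the interleaved architecture, this runs the generator to completion before any parsing happens, so nothing is pruned during generation.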
Moreover, there are several other interest-
ing applications of the combined toolbox. We
could use the surface realiser to build artifi-
cial corpora. These can in turn be parsed to
semi-automatically create rich treebanks containing
syntactico-semantic analyses à la Redwoods
(Oepen et al., 2002).
Finally, another use for the toolbox might be
in components of standard NLP applications such
as machine translation, question answering, or
interactive dialogue systems.
6 Availability
The toolbox presented here is open-source and
freely available under the terms of the GPL⁷. More
information about the requirements and installation
procedure is available at
http://trac.loria.fr/~semtag. Note that this toolbox is
made of two main components: the GenI⁸ system
and the SemConst⁹ system, which respectively
perform generation and parsing from common
linguistic resources. The first is written in
Haskell (except the XMG part written in Oz) and is
multi-platform (Linux, Windows, Mac OS). The
latter is written in Oz (except the DyALog part
which is bootstrapped and contains some Intel assembler
code) and is available on Unix-like platforms
only.
⁷ Note that XMG is released under the terms of the
CeCILL license (http://www.cecill.info/index.en.html),
which is compatible with the GPL.
⁸ http://trac.loria.fr/~geni
⁹ http://trac.loria.fr/~semconst
7 Conclusion
The LORIA toolbox provides an integrated envi-
ronment for TAG based semantic processing: ei-
ther to construct the semantic representation of a
given sentence (parsing) or to generate a sentence
verbalising a given semantic content (generation).
Importantly, both the generator and the parsers
use the same grammar (SEMFRAG) so that both
tools can be used jointly to improve grammar precision.
All the sentences output by the surface
realiser should be parsed to have at least the se-
mantic representation given by the test suite, and
all parses of a sentence should be realised into at
least the same sentence.
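This round-trip requirement can be written down as a small consistency check. The Python sketch below uses hypothetical `generate` and `parse` callables and invented test data in place of the real tools, and it represents semantic inputs as plain strings for simplicity.

```python
from typing import Callable, Iterable, List, Set, Tuple

Error = Tuple[str, str, str]   # (kind of problem, semantics, sentence)


def round_trip_errors(suite: Iterable[str],
                      generate: Callable[[str], Set[str]],
                      parse: Callable[[str], Set[str]]) -> List[Error]:
    """Check that generation and parsing agree on every test-suite input."""
    errors: List[Error] = []
    for semantics in suite:
        for sentence in sorted(generate(semantics)):
            analyses = parse(sentence)
            # Every realised sentence must parse back to the input semantics.
            if semantics not in analyses:
                errors.append(("does not parse back", semantics, sentence))
            # Every parse of that sentence must realise back into it.
            for other in sorted(analyses):
                if sentence not in generate(other):
                    errors.append(("does not realise back", other, sentence))
    return errors


# Hypothetical grammar behaviour with one inconsistency.
GENERATED = {"venir(vous)": {"vous venir"},
             "accepter(il, le)": {"il le accepter", "il accepter le"}}
PARSED = {"vous venir": {"venir(vous)"},
          "il le accepter": {"accepter(il, le)"},
          "il accepter le": set()}

errors = round_trip_errors(GENERATED,
                           lambda semantics: GENERATED.get(semantics, set()),
                           lambda sentence: PARSED.get(sentence, set()))
print(errors)
```

Each reported error points at either overgeneration by the realiser or a gap on the parsing side, which is the kind of evidence needed to improve grammar precision.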
Current and future work concentrates on de-
veloping an automated error mining environment
for both parsing and generation; on extending the
grammar coverage; on integrating further optimi-
sations both in the parser (through parsing with
factorised trees) and in the generator (through
packing and accessibility filtering, cf. Carroll and
Oepen, 2005); and on experimenting with different
semantic construction strategies (Gardent and
Parmentier, 2005).

References
H. Ait-Kaci. 1991. Warren’s Abstract Machine: A Tu-
torial Reconstruction. In K. Furukawa, editor, Proc.
of the Eighth International Conference of Logic Pro-
gramming. MIT Press, Cambridge, MA.
J. Carroll and S. Oepen. 2005. High efficiency re-
alization for a wide-coverage unification grammar.
In R. Dale and K-F. Wong, editors, Proceedings of
the Second International Joint Conference on Natu-
ral Language Processing, volume 3651 of Springer
Lecture Notes in Artificial Intelligence, pages 165–
176.
B. Crabbé and D. Duchier. 2004. Metagrammar Redux.
In Proceedings of CSLP 2004, Copenhagen.
D. Duchier, J. Le Roux, and Y. Parmentier. 2004. The
Metagrammar Compiler: An NLP Application with
a Multi-paradigm Architecture. In 2nd International
Mozart/Oz Conference (MOZ’2004), Charleroi.
C. Gardent and L. Kallmeyer. 2003. Semantic con-
struction in FTAG. In Proceedings of EACL’03, Bu-
dapest.
C. Gardent and E. Kow. 2005. Generating and select-
ing grammatical paraphrases. ENLG, Aberdeen.
C. Gardent and Y. Parmentier. 2005. Large scale
semantic construction for tree adjoining grammars.
In Proceedings of The Fifth International Confer-
ence on Logical Aspects of Computational Linguis-
tics (LACL05).
C. Gardent. 2006. Intégration d’une dimension
sémantique dans les grammaires d’arbres adjoints.
In Actes de la conférence TALN’2006, Leuven.
E. Kow. 2005. Adapting polarised disambiguation
to surface realisation. In 17th European Summer
School in Logic, Language and Information - ESS-
LLI’05, Edinburgh, UK, Aug.
S. Lehmann, S. Oepen, S. Regnier-Prost, K. Netter,
V. Lux, J. Klein, K. Falkedal, F. Fouvry, D. Estival,
E. Dauphin, H. Compagnion, J. Baur, L. Balkan, and
D. Arnold. 1996. TSNLP - Test Suites for Natural
Language Processing. In Proceedings of COLING
1996, Copenhagen.
P. Lopez. 1999. Analyse d’énoncés oraux pour le dialogue
homme-machine à l’aide de grammaires lexicalisées
d’arbres. Ph.D. thesis, Université Henri
Poincaré – Nancy 1.
P. Lopez. 2000. Extended Partial Parsing for
Lexicalized Tree Grammars. In Proceedings of
the International Workshop on Parsing Technology
(IWPT2000), Trento, Italy.
G. Neumann. 1994. A Uniform Computational
Model for Natural Language Parsing and Gener-
ation. Ph.D. thesis, University of the Saarland,
Saarbrücken.
S. Oepen, E. Callahan, C. Manning, and K. Toutanova.
2002. LinGO Redwoods: a rich and dynamic treebank
for HPSG.
Y. Parmentier and J. Le Roux. 2005. XMG: an Exten-
sible Metagrammatical Framework. In Proceedings
of the Student Session of the 17th European Summer
School in Logic, Language and Information, Edinburgh,
Great Britain, Aug.
E. Villemonte de la Clergerie. 2005. DyALog: a tabu-
lar logic programming based environment for NLP.
In Proceedings of CSLP’05, Barcelona.
XTAG-Research-Group. 2001. A lexicalized
tree adjoining grammar for English.
Technical Report IRCS-01-03, IRCS, University
of Pennsylvania. Available at
http://www.cis.upenn.edu/~xtag/gramrelease.html.
