STRING-TREE CORRESPONDENCE GRAMMAR: A DECLARATIVE GRAMMAR FORMALISM FOR DEFINING 
THE CORRESPONDENCE BETWEEN STRINGS OF TERMS AND TREE STRUCTURES 
YUSOFF ZAHARIN 
Groupe d'Etudes pour la Traduction Automatique 
B.P. n ° 68 
Universit~ de Grenoble 
38402 SAINT-MARTIN-D'HERES 
FRANCE 
ABSTRACT 
The paper introduces a grammar formalism for 
defining the set of sentences in a language, a set 
of labeled trees (not the derivation trees of the 
grammar) for the representation of the interpreta- 
tion of the sentences, and the (possibly non-pro- 
jective) correspondence between subtrees of each 
tree and substrings of the related sentence. The 
grammar formalism is motivated by the linguistic 
approach (adopted at GETA) where a multilevel inter- 
pretative structure is associated to a sentence. The 
topology of the multilevel structure is 'meaning' 
motivated, and hence its substructures may not cor- 
respond projectively to the substrings of the rela- 
ted sentence. 
Grammar formalisms have been developed for va- 
rious purposes. Generative-Transformational Gram- 
mars, General Phrase Structure Grammars, Lexical 
Functional Gr-mmar, etc. were designed to be expla- 
natory models for human language performance, while 
others like the Definite Clause Grammars were more 
geared towards direct interpretability by machines. 
In this naper, we introduce a declarative grammar 
formalism for the task of establishing the relation 
between on one hand a set of strings of terms and 
on the other a set of structural representations - 
a structural representation being in a form amena- 
ble to processing (say for translation into another 
language), where all and only the relevant conten~.s 
or 'meaning' (in some sense adequate for the purpo- 
se) of the related string are exhibited. The gram- 
mar can also be interpreted to perform analysis 
(given a string of terms, to produce a structural 
representation capturing the 'meaning' of the 
string) or to perform generation (given a structu- 
ral representation, to produce a string of terms 
whose meaning is captured by the said structural 
representation). 
It must be emphasised here that the grammar 
writer is at liberty (within certain constraints)to 
design the structural representation for a given 
string of terms (because its topology is indepen- 
dent of the derivation tree of the grammar), as 
well as the nature of the correspondence between 
the two (for example, according to certain linguis- 
tic criteria). The grammar formalism is only a tool 
for expressing the structural representation, the 
related string, and the correspondence. 
The formalism is motivated by the linguistic 
approach (adopted at GETA) where a multilevel in~r- 
pretative structure is associated to a sentence. 
The multilevel structure is 'meaning' motivated, 
and hence its substructures may not correspond pro- 
jectively to the substrings of the related sentence 
The characteristic of the linguistic approach is 
the design of the multilevel structures, 
while the grammar formalism is the tool (notation) 
for expressing these multilevel structures, their 
related sentences, and the nature of the correspon- 
dence between the two. In this paper, we present 
only the grammar formalism ; a discussion on the 
linguistic approach can be found in \[Vauquois 78\] 
and \[Zaharin 87\]. 
For this grammar formalism, a structural 
representation is given in the form of a labeled 
tree, and the relation between a string of terms 
and a structural representation is defined as a 
mapping between elements of the set of substrings 
of the string and elements of the set of subtrees 
of the tree : such a relation is called a string- 
tree correspondence. An example of a string-tree 
correspondence is given in fig. I. 
TREE: NP 
I \[ 4..NP 
2 :AP J 8 :hunter I I I 
Fig.1 - A string-tree correspondence. 
The example is taken from \[Pullum 84\] where he 
called for a 'simple' grammar which can analyse/ 
generate the non-context free sublanguage of the 
African language Bambara given by : 
L = ~ o ~I~ in N* for some set of nouns N, I N.l~l } 
and at the same time the grammar must produce a 
'linguistically motivated' structural representa- 
tion for the corresponding string of words. For 
instance, the noun phrase "dog catcher hunter o dog 
catcher hunter" means "any dog catcher hunter" and 
so the structural representation should describe 
precisely that. 
160 
In the string-tree correspondence in fig. I, 
there are three concepts involved : the TREE which 
is a labeled tree taking the role of the structu- 
ral representation, the STRING which is a string 
of terms, and finally the correspondence which is 
a mapping (given by the arrows ~--.-.~>) defined 
between substrings of STRING and subtrees of TREE 
(a more formal notation using indices would be 
less readable for demonstrational purposes). In 
the TREE, a node is given by an identifier and a 
label (eg. |:NP). To avoid a very messy diagram, 
in fig. l we have omitted the other subcorrespon- 
dence between substrings and subtrees, for example 
between the whole TREE and the whole STRING (tri- 
vial), between the subtree 4(5(6),7) and the two 
occurrences of the substring "dog catcher" (non- 
trivial), etc. We shall do the same in the rest 
of this paper. (Then again, this is the string- 
tree correspondence we wish to express for our 
examples - recall the remark earlier saying that 
the grammar writer is at liberty to define the na- 
ture of the string-tree correspondence he or she 
desires, and this is done in the rules, see later). 
We also note that the nodes in the TREE are simply 
concepts in the structural representation and thus 
the interpretation is independent of any grammar 
that defines the correspondence (in fact, we have 
yet to speak of a grammar) ; for instance, the TREE 
in fig. 1 does not necessitate the presence of a 
rule of the form "AP NT hunter ~ NP" to be in the 
grammar. 
A more complex string-tree correspondence is 
given in fig. 2 where we choose to define a struc- 
tural representation of a particular form for each 
string in the language anbnc n. Here, the case for 
n=3 is given• The problem is akin to the 'respec- 
tively' problem, where for a sentence like "Peter, 
Paul and Mary gave a book, a pen and a pencil to 
Jane, Elisabeth and John respectively", we wish to 
associate a structural representation giving the 
'meaning' "Peter gave a book to Jane, Paul gave a 
len to Elisabeth, and Mary gave a pencil to John". 
TREE : 
21a 
STRING a 
I:S 
I i 3:b 4:c 5!S 
6:a 7:b 8:c 9:S 1', i, , 
\ ~ ~kl~:a l l:b 12:C 
Fig. 2 - A non-projective string-tree 
correspondence for a~bnc n 
At this point, again we repeat our earlier 
statement that the choice of such structural re- 
presentations and the need for such string-tree 
correspondence are not the topics of discussion in 
this paper. 
The aim of this paper is to introduce the tool, in 
the form of a grammar formalism, which can define 
such string-tree correspondence as well as be inter- 
pretable for analysis and for generation between 
strings of terms and structural representations. 
The grammar formalism for such a purpose is 
called the String-Tree Correspondence Grammar 
(STCG). The STCG is a more formal version of the 
Static Grammar developed by \[Chappuy 83\] \[Vauquois 
& Chappuy 85\]. The Static Grammar (shortly later 
renamed the Structural Correspondence Specification 
Grammar), was designed to be a declarative grammar 
formalism for defining linguistic structures and 
their correspondence with strings of utterances in 
natural languages. It has been extensively used for 
specification and documentation,as well as a (manua$ 
reference for writing the linguistic programs (ana- 
lysers and generators) in the machine translation 
system ARIANE-78 \[Boitet-et-al 82\]. Relatively lar- 
ge scale Static Grammars have been written for 
French in the French national machine translation 
project \[Boitet 86\] translating French intoEnglis~ 
and for Malay in the Malaysian national project 
\[Tong 86\] translating English to Malay ; the two 
projects share a common Static Grammar for English 
(naturally). The STCG derives its formal properties 
from the Static Gra~mmar, but with more formal defi- 
nitions of the properties. In the passage from the 
Static Grammar to the STCG, the form as well as 
some other characteristics have undergone certain 
changes, and hence the change to a more appropriate 
name. The STCG first appeared in \[Zaharin 86\], 
where the formal definitions of the grammar are 
given (but under the name Of the Tree Corresponden- 
ce Gran~nar). 
A STCG contains a set of correspondence rules, 
each of which defines a correspondence between a 
structural representation (or rather a set or fami- 
ly of) and a string of terms (similarly a set or 
family of). Each rule is of the form : 
.Rule: R 
CORRESPONDENCE: 
( ~4~, ) ..... (~,,, "~) 
The simplest form of such a rule is when al,...a n 
are terms and B is a tree. The rule then states 
that the string of terms ~l,...,ctn corresponds (") 
to the tree B, while the entry cORRESPONDENCE gives 
the substring-subtree correspondence between the 
terms ~i, ,~_ and the subtrees BI,...,B_ of B. An 
• * • 71 . LL * lexample is given by rule SI below whlch deflnes the 
Istring-tree correspondence in fig. 3. 
Rule : S l l:S 
(2:a)(3:b)(4:c) ~ 2:a 3:b 4:c 
CORRESPONDENCE : 
(2--2), (3~3), (4"4) 
161 
TREE : 1 : S 
I 
3:b 4:c 
STRING : a b c 
Fig. 3 - Correspondence 
defined by Sl 
Although in the example in fig. 3 above, the 
leaves of the TREE are labeled and ordered exactly 
as the terms in the STRING, this is not obligatory. 
For example, it is indeed possible to change the 
label of node 2 to something else, or to move the 
node to the right of node 4, or even to exclude 
the node altogether. In short, the string-tree 
correspondence defined by a rule need not be 
projective. 
Such elementary rules el...cz ~8 (with 
ul,..,u_ terms) can be generahsed to a form where 
each e."(i-l,..,n) represents a string of terms, 
say A.. I Here, generalities can be captured if u i 
spec~'~ies the name of a rule which defines a strlng- 
A.~T. tree correspondence--i I (for some tree T. given 
in the said rule, but it is of httle slgnlflcance 
here), in which case the interpretation of the 
string-tree correspondence defined by el..e ~8 is 
taken to be AI..A ~8 (here AI..A means thenconca - 
tenation of ~he s~rings Al,?. ,A-~. The substring- 
subtree correspondence will sti--~l be given by the 
entry CORRESPONDENCE. Fig. 4 illustrates this. 
The alternative to the above is to give each 
u. in terms of a tree (ie. without reference to any 
r~le), but then there is no guarantee that this 
tree will correspond to some string of terms. Even 
if it does, one cannot be certain that it would be 
the string of terms one wishes to include in the 
rule - after all, two entirely different strings of 
terms may correspond to the same tree (a paraphrase) 
by means of two different rules. 
We shall discard the alternative and adopt the 
first approach.The generalised rule ~l,..~n~8 (with 
each u. being the name of a rule) can be extended 
furthe~ by letting u. be a list of rule names, 
• 1 where this is Interpreted as a choice for the 
string-tree correspondence A.'-T. to be referred to, 
and hence the choice for th~ist~ing of terms A. 
represented by u.. In such a situation, it ma~Iso 
be possible thatZwe wish the topology oT the tree 
B to vary according to the choice of A., and this 
variation to be zn terms of the subtrees of the 
tree T.. For these reasons, we specify each ~. as 
a pairI(REFERENCE, STRUCTURE) where REFERENCEIis 
the said list of rule names and STRUCTURE is a tree 
schema containing variables, such that the struc- 
ture represents the tree found on the right hand 
side of the "~" in each rule referred to in the 
list REFERENCE. This way, the tree 8 can be defi- 
ned in terms of T i by means of the variables (for 
example those appearing simultaneously in both u. 
and 8). See the example later in fig. 5 for an i 
illustration. 
'it~.: R t . \[ RUZ.Z: RX 
~ R, .... R e are rule names; 
RULE, R^ I ~ the correspondence by 
I~ Rule R~ is interpreted 
and hence 
Rules RNI and RN2 below are examples 
of STCG rules in the form discussed 
above, where RN2 refers to RNI and 
itself. Variables in the entry STRUCTURE 
are given in boxes, eg. \[\] , where each 
variable can be instantiated to a linear 
ordered sequence of trees. For a given 
element (REFERENCE, STRUCTURE), the ins- 
tanciations of the variables in STRUCTU- 
RE can be obtained only by identifying 
(an operation intuitively similar to the 
standard notion of unification - again, 
see later in fig. 5) the STRUCTURE with 
the right hand side of a rule given in 
the entry REFERENCE. 
Fig.4 -String-tree correspondence with reference to other rules 
Rule= RN2 
STRI~DI~RE= 
l=mP 
2 ) , (z ~ \]. ) 
Rule: RN1. 
.Tt'Rt XTT~\]RE= 
1 z noun 
CORRP.(;PONDENCE.: 
(1 N 1) 
cU 
0 =NI~.. 
I 
1 ~ noun 
162 
As an immediate consequence to the above, an 
STCG rule thus defines a correspondence between a 
set of strings of terms on one hand and a set of 
trees on the other (by means of a linear sequence 
of sets of trees). The rule RN! describes a corres- 
pondence between a single term and a tree 
containing a node NP dominating a single leaf (for 
example, it gives the respective structural repre- 
sentations for "dog", "catcher", etc.). The rule 
RN2 describes a correspondence between two or more 
terms and a single tree - note the recursive 
REFERENCE in the first element of RN2 (for example, 
it gives the structural representation for "catcher 
hunter" as well as for "dog catcher hunter", see 
later in fig. 5). 
The entry STRUCTURE of an element may also 
act as a constraint by making explicit certain 
nodes in the STRUCTURE instead of just a node 
dominating a forest (we have no examples for this 
in this paper, but one can easily visualise the 
idea). This means that the entry STRUCTURE of an 
element u. = (REFERENCE, STRUCTURE) in a rule 
i . • ~I..~ ~B Is also a constralnt on the trees in T., 
n . . 1 and hence on the strlngs in A. (as A. and T. are 
• --i --i • now sets), in a correspondence A."T. deflne~ by a 
rule referred to by u. In its entry REFERENCE. 
.. I Whenever It is made use of, such a constralnt en- 
sures that only certain subsets of T., and hence of I 
A., are referred to and used in the correspondence 
descrlbed by ~I..~ ~. 
n 
The string-tree correspondence in fig. | is 
defined by rule RN3 below, which refers to rules 
RN! and RN2. We show how this is done in fig. 5. 
Note that if two variables in a single rule have 
the same label, then their instantiations must be 
identical. The concept of derivation as well as the 
derivation tree have been defined for the STCG 
\[Zaharin 86\], but it would be too long to explain 
them here. Instead, we shall use a diagram like the 
one in fig. 5, which should be quite self-explana- 
tory. 
Rulez RN3 
• // s,~ / ,,oJI, \[~ J 
CORI~.SPOg~\]~:E 
( ~ ,~ i ), ( 4.~ 2 ), ( s ~ i 
Going back to fig. 2 where the string-tree 
3 J 3 correspondence for a b c is given, each substruc- 
ture below a node S in the TREE corresponds to a 
substring "abe", but the terms in this substring 
are distributed over the whole STRING. In general, 
Jin a string-tree correspondence AI..A ~8 defined 
by a rule ~l..e--8, it is posslble that we w~sh to 
• n deflne a substrlng-subtree correspondence of 
~jjBk, are disjoint the form A. .. where A. ''•'~Jm --\]~ --J1 
substrings of the string A ...A and Sk is a sub- 
--I --n 
tree of 8, and that 8 k cannot be expressed in terms 
of the respective structural representations (if 
any) of ~j .... ~Jm" Such a correspondence cannot be 
I 
handled by a rule of the form discussed so far be- 
cause a structural representation (STRUCTURE) found 
on the left hand side can correspond only to a unit 
(connected) substring. 
We can overcome this problem by allowing a rule 
to define a subcorrespondence between a substructu- 
re in the TREE (in the RHS) and a disjoint sub- 
string in the STRING (in the LHS), where this sub- 
correspondence is described in another rule (ie. 
using a reference - SUBREFERENCE - for a substruc- 
ture in the TREE, rather than uniquely for the 
elements in the LHS). One also allows elements in 
the LHS to be given in terms of variables which can 
be instantiated to substrings. Rule $2 (after fig. 
5) gives an example of such a rule where X,Y,Z are 
variables. 
The rule $2 is of the following general type. 
(Recall that we wish to define a substrinq-subtree 
correspondence of the form A .... ~J~gk'm Where --Jl 
A. ,..,A, m-J are disjoint substrings of the string 
--J1 
A ..A and B k is a subtree of B, and that ~k cannot 
be expressed in terms of the respective structural 
,..,A.). In the rule representations (if any) of ~Jl 
~ m 
..~n~B, the elements ~ ''''~n are to be as before I ! 
except for those representing the substrings 
A. ,..,A. m-J which are to be left as unknowns, written 
--\]i 
say~jl .... ~Jm respectively. The correspondence 
A...A,~8 k is to be written in the entry CORRES- 
--3 I --3 m 
|PONDENCE as ~...~. ~gk, and this is given a refe- 
r---r-~ ~the correspondence elsewhere in another rule. In 
F~-)|this SUBREFERENCE, if a rule ~'..~'~8 is a possibi O 2zNP I P 
i NP \[--~J|lity, the identification between the sequences K...X. and ~'..~' must be given• The interpreta- 
3zany -\]i --3m I p 
tion of the rule is that the SUBREFERENCE gives a 
string-tree correspondence A'..A'~8'which precise- --I --p 
) \[y defines the string-tree correspondence 
£j s k ,where B k is identified with 8' and 
Aj ..A. is identified with A'..A' with the 
1. --Jm --i --~ 
separation points being obtained from the predeter- 
mined identification between X...X. and ~'..e' 
mentioned above. --3 1 --3 m l p 
A STCG containing the rules $I and $2 defines 
the language anbnc n, and associates a structural re- 
presentation like the one in fig.2 to every string 
in the language. Fig.6 illustrates how this grammar 
defines the string-tree correspondence in fig.2. 
163 
Rule : RN3 
TREE: 
STRING: 
Identical 
to give 
lull A m A1 1~. 
1~ o NP 
Rule: RN2 
in ~ 
I 
STRING: ~i 
~2 to give 
TREE: 2'~ 
Unknown in TREE -F~ 
Unknown in STRING -- A 
~2 
Rule: I~;i I 
STRING: hunter 
~.~=~cz(~o ..I to give "~<f.~. ~~ to ~ to gi~ 
STRING: dog STRING: catcher 
Fig.S - Rules RN1,R~2,RN3 to define the correspondence in fig.1 
164 
Rule: $2 
~rR= 
2:a 3:b 4:c 
CORRESPONDENCE: 
( 2 eu 2 ),( 3 ,"d3 ), ( 4,~J 4 ), 
( x Y z ~ 5 ) - SUBR~r~ENCE(by): 
l:S 
l I i I 
2:a 3:b 4:0 5:S I 
sl, _x- 2', "'- 3'. _~- 4' ") ~ 
(2',3',4' in referred S1) 
P or 
s2, _x- 2'_x', ~- 3'Y_', _z- 4'z' 
(2' ,3' ,4' ,x' ,_y' ,z' 
in referred S2 ) 
Rule: $2 I:S 
I I I 
2:a 3:b 4:c 5:S 
STRING: a X b Y c Z 
( no R=r~NCE in LHS ) 
Rule: $2' 1':S 
3' :b 
Unknown in tree - \[\] 
Onknown in STRING - 
x,Y,z 
SUBREFERENCE for S 
to $2 
to give : 
I I "\[\]in S2 
4':C 5':S 
and _X=a_X' 
Z-bX' 
_Z - c Z_' 
J~SUBREFERENCE for S 
J ~/ to give : 
STRING: a X' b Y' c z'/~ ~ and ~' - a 
( no ~.~uu~CE in LaS ) Y_' - b 
----~ Z' -c 
1:S Rule: $1 
TREE: 
2:a 3:b 4:c 
STRING: a b c 
Fig.6 - Rules SI,S2 to define the correspondence in fig.2 
165 
The informal discussion in this paper gives 
the motivation and some idea of the formal defini- 
tion of the String-Tree Correspondence Grammar. 
The grammar stresses not only the fact that one can 
express string-tree correspondence like the ones 
we have discussed, but also that it can be done in 
a 'natural' way using the formalism - meaning the 
structures and correspondence are explicit in the 
rule, and not implicit and dependent on the combi- 
nation of grammar rules applied (as in derivation 
trees). The inclusion of the substring-subtree 
correspondence is also another characteristic of 
the grammar formalism. One also sees that the 
grammar is declarative in nature, and thus it is 
interpretable both for analysis and for generation 
(for example, by ~nterpretlng the rules as tree 
rewriting rules with variables}. 
In an effort to demonstrate the principal 
properties of the formalism, the STCG presented in 
this paper is in a simple form, ie. treating trees 
with each node having a single label. In its gene- 
ral form, the STCG deals with labels having 
considerable internal structure (lists of features, 
etc.). Furthermore, one can also express 
constraints on the features in the nodes - on indi- 
vidual nodes or between different nodes. 
As mentioned, the concepts of direct derivation 
(=>) and derlvatzon (->), as well as the derivation 
tree are also defined for the STCG. (Note that the 
rules with properties similar to the rule $2 entail 
a definition of direct derivation which is more 
complex than the classical definition). The set of 
rules in a grammar forms a formal grammar, ie. it 
defines a language, in fact two languages, one of 
strings and the other of trees. 
At the moment, there is no large applications 
of the STCG, but as the STCG derives its formal 
properties from the Static Grammar, it would be 
quite a simple process to transfer applications in 
the Static Grammar into STCG applications. Like the 
Static Grammar, the STCG is basically a formalism 
for specification, but given its formal nature, one 
also aims for direct interpretability by a machine. 
Though still incomplete, work has begun to build 
such an interpreter \[Zajac 86\]. 
ACKNOWLEDGEMENTS 
I would like to thank Christian Boitet who 
had been a great help in the formulation of the 
ideas presented here. My gratitude also to Hans 
Karlgren and Eva Haji~ova for their remarks and 
criticisms on earlier versions of the paper. 
REFERENCES 
\[Boitet-et-al-82\] 
Ch. Boitet, P. Guillaume, M. Quezel-Ambrunaz 
"Implementation and conversational environmerL 
of ARIANE-T8.4". 
Proceedings of COLING-82, Prague. 
\[Boitet 86\] 
Ch. Boitet 
"The French National Project : technical orga- 
nization and translation results of CALLIOPE- 
AERO". 
IBM Conference on machine translation, 
Copenhagen, August 1986. 
\[Chappuy 83\] 
S. Chappuy 
"Formalisation de la description des niveaux 
d'interpr~tation des langues naturelles. 
Etude men~e en rue de l'analyse et de la g~n~- 
ration au moyen de transducteurs". 
Thase 3ame Cycle, Juillet 1983, INPG, Grenoble. 
\[Pullum 84\] 
G.K. Pullum 
"Syntactic and semantic parsability". 
Proceedings of COLING-84, Stanford. 
\[Tong 86\] 
Tong L.C. 
"English-Malay translation system : a labora- 
tory prototype". 
Proceedings of COLING-86, Bonn. 
\[Vauquois 78\] 
B. Vauquois 
"Description de la structure interm~diaire". 
Communication pr~sent~e au Colloque de 
Luxembourg, Avril 1978, GETA dot., Grenoble. 
\[Vauquois & Chappuy 85\] 
B. Vauquois, S. Chappuy 
"Static Grammars : a formalism for the des- 
cription of linguistic models". 
Proceedings of the conference on theoretical 
and methodological issues in machine transla- 
tion of natural languages, COLGATE Universit~ 
New York, August 1985. 
\[Zaharin 86\] 
Zaharin Y. 
"Strategies and heuristics in the analysis of 
a natural language in machine translation". 
PhD thesis, Universiti Sains Malaysia, Penang, 
March 1986. (Research conducted under the 
GETA-USM cooperation - GETA doc., Grenoble). 
\[Zaharin 87\] 
Zaharin Y. 
"The linguistic approach at GETA : a synopsis~' 
GETA document, January 1987, Grenoble. 
\[Zajac 86\] 
R, Zajac 
"SCSL : a linguistic specification language 
for MT". 
Proceedings of COLING-86, Bonn. 
166 
