Proceedings of EACL '99 
Chinese Numbers, MIX, Scrambling, 
and 
Range Concatenation Grammars 
Pierre Boullier 
INRIA-Rocquencourt 
Domaine de Voluceau 
B.P. 105 
78153 Le Chesnay Cedex, FRANCE 
Pierre.Boullier@inria.fr 
Abstract 
The notion of mild context-sensitivity 
was formulated in an attempt to express 
the formal power which is both neces- 
sary and sufficient to define the syntax 
of natural languages. However, some 
linguistic phenomena such as Chinese 
numbers and German word scrambling 
lie beyond the realm of mildly context- 
sensitive formalisms. On the other hand, 
the class of range concatenation gram- 
mars provides added power w.r.t, mildly 
context-sensitive grammars while keep- 
ing a polynomial parse time behavior. In 
this report, we show that this increased 
power can be used to define the above- 
mentioned linguistic phenomena with a 
polynomial parse time of a very low de- 
gree. 
1 Motivation 
The notion of mild context-sensitivity originates 
in an attempt by \[Joshi 85\] to express the for- 
mal power needed to define the syntax of nat- 
ural languages (NLs). We know that context- 
free grammars (CFGs) are not adequate to de- 
fine NLs since some phenomena are beyond their 
power (see \[Shieber 85\]). Popular incarnations 
of mildly context-sensitive (MCS) formalisms are 
tree adjoining grammars (TAGs) \[Vijay-Shanker 
87\] and linear context-free rewriting (LCFR) sys- 
tems \[Vijay-Shanker, Weir, and Joshi 87\]. How- 
ever, there are some linguistic phenomena which 
are known to lie beyond MCS formalisms. Chi- 
nese numbers have been studied in \[Radzinski 91\] 
where it is shown that the set of these numbers is 
not a LCFR language and that it appears also not 
to be MCS since it violates the constant growth 
property. Scrambling is a word-order phenomenon 
which also lies beyond LCFR systems (see \[Becket, 
Rambow, and Niv 92\]). 
On the other hand, range concatenation gram- 
mar (RCG), presented in \[Boullier 98a\], is a 
syntactic formalism which is a variant of sim- 
ple literal movement grammar (LMG), described 
in \[Groenink 97\], and which is also related to the 
framework of LFP developed by \[Rounds 88\]. In 
fact it may be considered to lie halfway between 
their respective string and integer versions; RCGs 
retain from the string version of LMGs or LFPs 
the notion of concatenation, applying it to ranges 
(couples of integers which denote occurrences of 
substrings in a source text) rather than strings, 
and from their integer version the ability to han- 
dle only (part of) the source text (this later feature 
being the key to tractability). RCGs can also be 
seen as definite clause grammars acting on a flat 
domain: its variables are bound to ranges. This 
formalism, which extends CFGs, aims at being a 
convincing challenger as a syntactic base for vari- 
ous tasks, especially in natural language process- 
ing. We have shown that the positive version of 
RCGs, as simple LMGs or integer indexing LFPs, 
exactly covers the class PTIME of languages rec- 
ognizable in deterministic polynomial time. Since 
the composition operations of RCGs are not re- 
stricted to be linear and non-erasing, its languages 
(RCLs) are not semi-linear. Therefore, RCGs are 
not MCS and are more powerful than LCFR sys- 
tems, while staying computationally tractable: its 
sentences can be parsed in polynomial time. How- 
ever, this formalism shares with LCFR systems 
the fact that its derivations are CF (i.e. the choice 
of the operation performed at each step only de- 
pends on the object to be derived from). As in 
the CF case, its derived trees can be packed into 
polynomial sized parse forests. For a CFG, the 
components of a parse forest are nodes labeled by 
couples (A, p) where A is a nonterminal symbol 
and p is a range, while for an RCG, the labels 
have the form (A, p-') where # is a vector (list) of 
ranges. Besides its power and efficiency, this for- 
malism possesses many other attractive proper- 
53 
Proceedings of EACL '99 
ties. Let us emphasize in this introduction the fact 
that RCLs are closed under intersection and com- 
plementation 1, and, like CFGs, RCGs can act as 
syntactic backbones upon which decorations from 
other domains (probabilities, logical terms, fea- 
ture structures) can be grafted. 
The purpose of this paper is to study whether 
the extra power of RCGs Cover LCFR systems) is 
sufficient to deal with Chinese numbers and Ger- 
man scrambling phenomena. 
2 Range Concatenation Grammars 
This section introduces the notion of RCG and 
presents some of its properties, more details ap- 
pear in \[Boullier 98a\]. 
Definition 1 A positive range concatenation 
grammar (PRCG) G = (N,T, V,P,S) is a 5-tuple 
where N is a finite set o\] predicate names, T and 
V are finite, disjoint sets of terminal symbols and 
variable symbols respectively, S E N is the start 
predicate name, and P is a finite set of clauses 
¢0 --* ¢1-.-Cm 
where m >_ 0 and each o\]¢0,¢1,... ,era is a pred- 
icate of the form 
A(al,..., ap) 
where p >_ 1 is its arity, A E N and each of ai E 
(T U V)*, 1 < i < p, is an argument. 
Each occurrence of a predicate in the RHS of a 
clause is a predicate call, it is a predicate defini- 
tion if it occurs in its LHS. Clauses which define 
predicate A are called A-clauses. This definition 
assigns a fixed arity to each predicate name. The 
arity of S, the start predicate name, is one. The 
arity k of a grammar (we have a k-PRCG), is the 
maximum arity of its predicates. 
Lower case letters such as a, b, c,... will denote 
terminal symbols, while late occurring upper case 
letters such as T, W, X, Y, Z will denote elements 
of V. 
The language defined by a PRCG is based on 
the notion of range. For a given input string w = 
al...an a range is a couple (i,j), 0 < i < j _< n 
of integers which denotes the occurrence of some 
substring ai+l.., aj in w. The number i is its 
lower bound, j is its upper bound and j - i is its 
size. If i = j, we have an empty range. We will 
1 Since this closure properties can be reached with- 
out changing the structure (grammar) of the con- 
stituents (i.e. we can get the intersection of two gram- 
mars G1 and G2 without changing neither G1 nor G2), 
this allows for a form of modularity which may lead to 
the design of libraries of reusable grammatical compo- 
nents. 
use several equivalent denotations for ranges: an 
explicit dotted notation like wl * w2 * w3 or, if w2 
extends from positions i + 1 through j, a tuple 
notation (i..j)~, or (i..j) when w is understood 
or of no importance. Of course, only consecutive 
ranges can be concatenated into new ranges. In 
any PRCG, terminals, variables and arguments in 
a clause are supposed to be bound to ranges by 
a substitution mechanism. An instantiated clause 
is a clause in which variables and arguments are 
consistently (w.r.t. the concatenation operation) 
replaced by ranges; its components are instanti- 
ated predicates. 
For example, A( (g..h), (i..j), (k..1) ) --* 
B((g+l..h), (i+l..j-1), (k..l-1)) is an instantiation 
of the clause A(aX, bYc, Zd) --* B(X, \]7, Z) 
if the source text al...an is such that 
ag+l = a,a~+l = b, aj = c and al = d. In 
this case, the variables X, Y and Z are bound to 
(g+l..h), (i+l..j-t) and (k..l-1) respectively. 2 
For a grammar G and a source text w, a derive 
relation, denoted by =~, is defined on strings of G,w 
instantiated predicates. If an instantiated pred- 
icate is the LHS of some instantiated clause, it 
can be replaced by the RHS of that instantiated 
clause. 
Definition 2 The language of a PRCG G = 
(N, T, V, P, S) is the set 
z::(G) = I G,w 
An input string w = al...an is a sentence if 
and only if the empty string (of instantiated pred- 
icates) can be derived from S((0..n)), the instan- 
tiation of the start predicate on the whole source 
text. 
The arguments of a given predicate may denote 
discontinuous or even overlapping ranges. Fun- 
damentally, a predicate name A defines a notion 
(property, structure, dependency,... ) between its 
arguments, whose ranges can be arbitrarily scat- 
tered over the source text. PRCGs are therefore 
well suited to describe long distance dependen- 
cies. Overlapping ranges arise as a consequence of 
the non-linearity of the formalism. For example, 
the same variable (denoting the same range) may 
occur in different arguments in the RHS of some 
clause, expressing different views (properties) of 
the same portion of the source text. 
2Often, for a variable X, instead of saying the range 
which is bound to X or denoted by X, we will say, the 
range X, or even instead of the string whose occur- 
rence is denoted by the range which is bound to X, we 
will say the string X. 
54 
Proceedings of EACL '99 
Note that the order of RI-IS predicates in a 
clause is of no importance. 
As an example of a PRCG, the following set of 
clauses describes the three-copy language {www \[ 
w • {a,b}*} which is not a CFL and even lies 
beyond the formal power of TAGs. 
S(XYZ) ~ A(X,Y,Z) 
A(aX, aY, aZ) --* A(X, Y, Z) 
A(bX, bY, bZ) --* A(X, Y, Z) 
A(c, ~, e) --* e 
Definition 3 A negative range concatenation 
grammar (NRCG) G = (N, T, V, P, S) is a 5- 
tuple, like a PRCG, except that some predicates 
occurring in RHS, have the form A(al,..., ctp). 
A predicate call of the form A(al,...,ap) is 
said to be a negative predicate call. The intuitive 
meaning is that an instantiated negative predicate 
succeeds if and only if its positive counterpart (al- 
ways) fails. The idea is that the language defined 
by A(al,...,ap) is the complementary w.r.t T* 
of the language defined by A(ax,...,ap). More 
formally, the couple A(p-') =~ e is in the derive 
relation if and only if /SA(p") ~ e. Therefore 
this definition is based on a "negation by failure" 
rule. However, in order to avoid inconsistencies 
occurring when an instantiated predicate is de- 
fined in terms of its negative counterpart, we pro- 
hibit derivations exhibiting this possibility. 3 Thus 
we only define sentences by so called consistent 
derivations. We say that a grammar is consistent 
if all its derivations are consistent. 
Definition 4 A range concatenation grammar 
(RCG) is a PRCG or a NRCG. 
The PRCG (resp. NRCG) term will be used to 
underline the absence (resp. presence) of negative 
predicate calls. 
3As an example, consider the NRCG G with two 
clauses S(X) --* S(X) and S(e) --* e and the source 
text w = a. Let us consider the sequence S(•a.) G,w 
S(•a•) ~ e. If, on the one hand, we consider this G,w 
sequence as a (valid) derivation, this shows, by defini- 
tion, that a is a sentence, and thus (S(•a•),e)   ~. G,w 
This last result is in contradiction with our hypothe- 
sis. On the other hand, if this sequence is not a (valid) 
derivation, and since the second clause cannot produce 
a (valid) derivation for S(•a•) either, we can conclude 
that we have S(•a•) =~ e. Since, by the first clause, G,zv 
for any binding p of X we have S(p) ~ S(p), we con- G,w 
clude that, in contradiction with our hypothesis, the 
initial sequence is a derivation. 
In \[Boullier 98a\], we presented a parsing algo- 
rithm which, for an RCG G and an input string 
of length n, produces a parse forest in time poly- 
nomial with n and linear with IGI. The degree of 
this polynomial is at most the maximum number 
of free (independent) bounds in a clause. Intu- 
itively, if we consider an instantiation of a clause, 
all its terminal symbols, variable, arguments are 
bound to ranges. This means that each position 
(bound) in its arguments is mapped onto a source 
index, a position in the source text. However, at 
some times, the knowledge of a basic subset of 
couples (bound, source index) is sufficient to de- 
duce the full mapping. 4 We call number of free 
bounds, the minimum cardinality of such a basic 
subset. 
In the sequel we will assume that the predicate 
names len, and eq are defined: s 
* len(l, X) checks that the size of the range de- 
noted by the variable X is the integer l, and 
• eq(X, Y) checks that the substrings selected 
by the ranges X and Y are equal. 
3 Chinese Numbers &: RCGs 
The number-name system of Chinese, specifically 
the Mandarin dialect, allows large number names 
to be constructed in the following way. The name 
for 1012 is zhao and the word for five is wu. The 
sequence uru zhao zhao wu zhao is a well-formed 
Chinese number name (i.e. 5 1024 + 5 1012) al- 
though wu zhao wu zhao zhao is not: the number 
4If XaY is some argument, if X • aY denotes a po- 
sition in this argument, and if (XoaY, i) is an element 
of the mapping, we know that (Xa • Y, i + 1) must be 
another element. Moreover, if we know that the size 
of the range X is 3 and that the sizes of the ranges 
X and Y are (always) equal (see for example the sub- 
sequent predicates len and eq), we can conclude that 
(•XaY, i - 3) and (XaY., i + 4) are also elements of 
the mapping. 
SThe current implementation of our prototype sys- 
tem predefines several predicate names including len, 
and eq. It must be noted that these predefined predi- 
cates do not increase the formal power of RCGs since 
each of them can be defined by a pure RCG. For 
example, len(1,X) can be defined by lenl(t) --* c 
which is a clause schema over all terminals t E T. 
Their introduction is not only justified by the fact that 
they are more efficiently implemented than their RCG 
defined counterpart but mainly because they convey 
some static information about the length of their ar- 
guments which can be used, as already noted, to de- 
crease the number of free bounds and thus lead to an 
improved parse time. In particular, the parse times 
for Chinese numbers, MIX, and German scrambling 
which are given in the next sections rely upon this 
statement. 
55 
Proceedings of EACL '99 
of consecutive zhao's must strictly decrease from 
left to right. All the well-formed number names 
composed only of instances of wu and zhao form 
the set 
{ wu zhao kl wu zhao k2 ... wu zhao kp I 
kl>k2>... >kp>0} 
which can be abstracted as 
CN -= {abklabk2...abkp l 
kl>ks>... >kp>0} 
These numbers have been studied in \[Radzinski 
91\], where it is shown that CN is not a LCFR 
language but an Indexed Language (IL) \[Aho 68\]. 
Radzinski also argued that CN also appears not 
to be MCS and moreover he says that he fails "to 
find a well-studied and attractive formalism that 
would seem to generate Numeric Chinese without 
generating the entire class of ILs (or some non- 
ILs)". 
We will show that CN is defined by the RCG in 
Figure 1. 
1 : S(aX) --* A(X, aX, X) 
2: A(W, TX, bY) --, len(1,T) A(W,X,Y) 
3 : A(WaY, X, aY) --* len(O, X) A(Y, W, Y) 
4 : A(W, X, ~) --* len(O, X) len(O, W) 
Figure 1: RCG of Chinese numbers. 
Let's call b k~ the i th slice. The core of this RCG 
is the predicate A of arity three. The string de- 
noted by its third argument has always the form 
bk~-labk'+l..., it is a suffix of the source text, 
its prefix ab k~ ...abk~-lab I has already been ex- 
amined. The property of the second argument is 
to have a size which is strictly greater than ki - l, 
the number of leading b's in the current slice still 
to be processed. The leading b's of the third ar- 
gument and the leading terminal symbols of the 
second argument are simultaneously scanned (and 
skipped) by the second clause, until either the 
next slice is introduced (by an a) in the third 
clause, or the whole source text is exhausted in 
the fourth clause. When the processing of a slice 
is completed, we must check that the size of the 
second argument is not null (i.e. that ki-1 > ki). 
This is performed by the negative calls len(O, X) 
in the third and fourth clause. However, doing 
that, the i th slice has been skipped, but, in order 
for the process to continue, this slice must be "re- 
built" since it will be used as second argument to 
process the next slice. This reconstruction pro- 
cess is performed with the help of the first argu- 
ment. At the beginning of the processing of a 
new slice, say the i th, both the first and third ar- 
gument denote the same string b k~ab ki+l .... The 
first argument will stay unchanged while the lead- 
ing b's of the third argument are processed (see 
the second clause). When the processing of the 
i th slice is completed, and if it is not the last one 
(case of the third clause), the first and third argu- 
ment respectively denote the strings bk~ab k~+l ... 
and ab k'+l .... Thus, the i th slice b kl can be ex- 
tracted "by difference", it is the string W if the 
first and third argument are respectively WaY 
and aY (see the third clause). Last, the whole 
process is initialized by the first clause. The first 
and third argument of A are equal, since we start 
a new slice, the size of the second argument is 
forced to be strictly greater than the third, doing 
that, we are sure that it is strictly greater than 
kl, the size of the first slice. Remark that the test 
fen(O, W) in the fourth clause checks that the size 
kp of the rightmost slice is not null, as stipulated 
in the language formal definition. The derivation 
for the sentence abbbab is shown in Figure 2 where 
=~ means that clause #p has been applied. 
S(eabbbab•) 
A(a • bbbab*, 
A(a • bbbab., 
2 A(a * bbbab*, 
A(a • bbbab•, 
A(abbba • b•, 
2 A(abbba • be, 
4 ~ g 
oabbbab., a * bbbab*) 
a • bbbab*, ab * bbabe) 
ab * bbab*, abb • bab•) 
abb • babe, abbb • ab• ) 
a • bbb • ab, abbba • b•) 
ab • bb * ab, abbbab • *) 
Figure 2: Derivation for the CN string abbbab. 
If we look at this grammar, for any input string 
of length n, we can see that the maximum number 
of steps in any derivation is n+l (this number is an 
upper limit which is only reached for sentences). 
Since, at each step the choice of the A-clause to 
apply is performed in constant time (three clauses 
to try), the overall parse time behavior is linear. 
Therefore, we have shown that Chinese num- 
bers can be parsed in linear time by an RCG. 
56 
Proceedings of EACL '99 
4 MIX 8z RCGs 
Originally described by Emmon Bach, the MIX 
language consists of strings in {a, b, c}* such that 
each string contains the same number of occur- 
rences of each letter. MIX is interesting because 
it has a very simple and intuitive characteriza- 
tion. However, Gazdar reported 6 that MIX may 
well be outside the class of ILs (as conjectured 
by Bill Marsh in an unpublished 1985 ASL pa- 
per). It has turned out to be a very difficult prob- 
lem. In \[Joshi, Vijay-Shanker, and Weir 91\] the 
authors have shown that MIX can be defined by 
a variant of TAGs with local dominance and lin- 
ear precedence (TAG(LD/LP)), but very little is 
known about this class of grammars, except that, 
as TAGs, they continue to satisfy the constant 
growth property. Below, we will show that MIX 
is an RCL which can be recognized in linear time. 
1: S(X) ~ M(X,X,X) 
2: M(aX, bY, cZ) --* M(X,Y,Z) 
3: M(TX, Y,Z) --. len(1,T) a(T) 
M(X, Y, Z) 
4: M(X, TY, Z) -.-, len(1,T) b(T) 
M(X, Y, Z) 
5 : M(X,Y, TZ) ~ len(1,T) c(T) 
M(X, Y, Z) 
6 : M(e,¢,¢) --* ¢ 
7: a(a) --* ¢ 
8: b(b) ~ ¢ 
9: c(c) ~ ¢ 
generalization to any number of letters. In the 
case where the three leading letters are respec- 
tively a, b and c, they are simultaneously skipped 
(see clause #2) and the clause #6 is eventually in- 
stantiated if and only if the input string contains 
the same number of occurrences of each letter. 
The leading steps in the derivation for the sen- 
tence baccba are shown in Figure 4 where =~ means 
that clause #p is applied and :~ means that clause 
#q cannot be applied, and thus implies the valida- 
tion of the corresponding negative predicate call. 
S(•baccba•) 
M(obaccba., obaccba*, obaccba.) 
a( ob • accba ) 
M ( b • accba• , obaccbao , *baccba. ) 
M(b • accba*, obaccba•, •baccbao) 
=~ c(ob • accba) 
M ( b • accba•, •baccba•, b • accba* ) 
g M(b * accba*, •baccba•, b • accba•) 
5 =V c(b • a • accba ) 
M ( b • accba., •baccba., ba * ccba• ) 
M (b • accba*, •baccba., ba • ccba• ) 
M (ba • ccba•, b • accba•, bac • cba• ) 
Figure 3: RCG of MIX. 
Consider the RCG in Figure 3. The source text 
is concurrently scanned three times by the three 
arguments of the predicate M (see the predicate 
call M(X, X, X) in the first clause). The first, sec- 
ond and third argument of M respectively only 
deal with the letters a, b and c. If the leading 
letter of any argument (which at any time is a 
suffix of the source text) is not the right letter, 
this letter is skipped. The third clause only pro- 
cess the first argument of M (the two others are 
passed unchanged), and skips any letter which is 
not an a. The analogous holds for the fourth and 
fifth clauses which respectively only consider the 
second and third argument of M, looking for a 
leading b or c. Note that the knowledge that a 
letter is not the right one is acquired via a nega- 
tive predicate call because this allows for an easy 
6See http://www.ccl.kuleuven.ac.be/LKR/dtr/ 
mixl.dtr. 
Figure 4: Derivation for the MIX string baccba. 
It is not difficult to see that the length of any 
derivation is linear in the length of the correspond- 
ing input string, and that the choice of any step 
in this derivation takes a constant time. There- 
fore, the parse time complexity of this grammar 
is linear. 
Of course, we can think of several generaliza- 
tions of MIX. We let the reader devise an RCG in 
which the relation between the number of occur- 
rences of each letter is not the equality, instead, 
we will study here the case where, on the one 
hand, the number of letters in T is not limited 
to three, and, on the other hand, all the letters 
in T do not necessarily appear in a sentence. If 
T = (bl,...,bq} is its terminal vocabulary, and 
if 7r is a permutation, the permutation language 
k .@)}, with ai E T, n = {w I w = 
0<p<qandi#j ~ai#aj, can be defined 
by the set of clauses in Figure 5. 
57 
Proceedings of EACL '99 
E 
S(TX) ~ len(1,T) 
A(T, TX, TX) 
A(T,W, T1X) -* len(1,T1) 
M, (T, W, T,, W) 
A(T,W,X) 
A(T, W, ¢) --* ¢ 
M4(T,T'X, T1,T~Y) -* eq(T,T') eq(T1,T~) 
M4(T,X,T~,Y) 
M4(T,T'X, T1,Y) ---* len(1,T') eq(T,T') 
M4 (T, X, T~, Y) 
M4(T,X, T1,T~Y) ---* len(1,T~) eq(T1,T~) 
M4(T,X, T1,Y) 
M4(T,s,TI,¢) -'* 
Figure 5: RCG of the permutation language H. 
The basic idea of this grammar is the following. 
In a source text w = tl...tm...tn, we choose a 
reference position r, 1 < r < n (for example, if 
r = 1, we choose the first position which corre- 
sponds to the leading letter tl), and a current po- 
sition c, 1 < c < n, and we check that the number 
of occurrences of the current terminal to, and the 
number of occurrences of the reference terminal 
tr are equal. Of course, if this check succeeds for 
all the current positions c and for one reference 
position r, the string w is in H. This check is per- 
formed by the predicate M4(T1, X, T2, Y) of arity 
four. Its first and third arguments respectively 
denote the reference position and the current po- 
sition (:/'1 and T2 are bound to ranges of size one 
which refer to tr and tc respectively) while the 
second and fourth arguments denote the strings 
in which the searches are performed: the occur- 
rences of the reference terminal G are searched 
in X and the occurrences of the current terminal 
tc are searched in Y. A call to M4 succeeds if 
and only if the number of occurrences of tr in X 
is equal to the number of occurrences of t¢ in Y. 
The S-clauses select the reference position (r -- 1, 
if w is not empty). The purpose of the A-clauses 
is to select all the current positions c and to call 
M4 for each such c's. Note that the variable W is 
always bound to the whole source text. We can 
easily see that the complexity of any predicate call 
M4(T1,X, T2,Y) is linear in \]X\[ + \[Y\[, and since 
the number of such calls from the third clause is 
n, we have a quadratic time RCG. 
5 Scrambling &: RCGs 
Scrambling is a word-order phenomenon which 
occurs in several languages such as German, 
Japanese, Hindi, ... and which is known to be 
beyond the formal power of TAGs (see \[Becker, 
Joshi, and Rainbow 91\]). In \[Becker, Ram- 
bow, and Niv 92\], the authors even show that 
LCFR systems cannot derive scrambling. This 
is of course also true for multi-components TAGs 
(see \[Rambow 94\]). In \[Groenink 97\], p. 171, the 
author said that "simple LMG formalism does not 
seem to provide any method that can be immedi- 
ately recognized as solving such problems". We 
will show below that scrambling can be expressed 
within the RCG framework. 
Scrambling can be seen as a leftward movement 
of arguments (nominal, prepositional or clausal). 
Groenink notices that similar phenomena also oc- 
cur in Dutch verb clusters, where the order of 
verbs (as opposed to objects) can in some case 
be reversed. 
In \[Becket, Rambow, and Niv 92\], from the fol- 
lowing German example 
... dab \[dem Kunden\]i \[den Kuehlschrank\]j 
... that the client (DAT) the refrigerator (ACC) 
bisher noch niemand 
so far yet no-one (NOM) 
ti \[\[tj zu reparieren\] zu versuchen\] 
to repair to try 
versprochen hat. 
promised has. 
• .. that so far no-one has promised the client to 
try to repair the refrigerator. 
the authors argued that scrambling may be "dou- 
bly unbounded" in the sense that: 
• there is no bound on the distance over which 
each element can scramble; 
there is no bound on the number of un- 
bounded dependencies that can occur in one 
sentence• 
They used the language {zr(nl ... n,~) vl ... Vm } 
where 7r is a permutation, as a formal representa- 
tion for a subset of scrambled German sentences, 
where it is assumed that each verb vi has exactly 
one overt nominal argument ni. 
However, in \[Becket, Joshi, and Rambow 91\], 
we can find the following example 
... dag \[des Verbrechens\]k \[der Detektiv\]i 
... that the crime (GEN) the detective (NOM) 
\[den VerdEchtigen\]j dem Klienten 
58 
Proceedings of EACL '99 
the suspect (ACC) the client (DAT) 
\[PRO/tj tk zu iiberfiihren\] versprochen hat. 
to indict promised has. 
... that the detective has promised the client to 
indict the suspect of the crime. 
where the verb of the embedded clause sub- 
categorizes for three NPs, one of which is an 
empty subject (PRO). Thus, the scrambling phe- 
nomenon can be abstracted by the language 
SCR = {~(nl...np) vl...vq}. We assume that 
the set T of terminal symbols is partitioned into 
the noun part .M = {nx,... ,nt} and the verb part 
Y = {vl,... ,v,~}, and that there is a mapping h 
from .M onto \]; which indicates, when v = h(n), 
that the noun n is an argument for the verb v. 
If h is an injective mapping, we describe the case 
where each verb has exactly one overt nominal 
argument, if h is not injective, we describe the 
case where several nominal arguments can be at- 
tached to a single verb. To be a sentence of SCR, 
the string ~r(nl ... n~... np)vl ... vj ... vq must be 
such that0<p<l, 0<q<_m, niE.M, vj EI;, 
i ¢ i' ==~ ni # ne, j ¢ j' =:=v vj ¢ vj,, Vn/3 W 
and Vvj3ni s.t. vj = h(ni), and r is a permuta- 
tion. The RCG in Figure 6 defines SCR. 
Of course, the predicate names .M, Y and h re- 
spectively define the set of nouns .M, the set of 
verbs \]; and the mapping h between .h\]" and V. 
The purpose of the predicate name .M+)2 + is to 
split any source text w in a prefix part which only 
contains nouns and a suffix part which only con- 
tains verbs. This is performed by a left-to-right 
scan of w during which nouns are skipped (see the 
first .M+V+-clause). When the first verb is found, 
we check, by the call Y*(Y), that the remaining 
suffix Y only contains verbs. Then, the predicates 
.Ms and ~;s are both called with two identical ar- 
guments, the first one is the prefix part and the 
second is the suffix part. Note how the prefix part 
X can be extracted by the predicate definition 
.M+lZ+(XTY, TY) from the first argument (which 
denotes the whole source text) in using the second 
argument TY. The predicate name.Ms (resp. Ys) 
is in charge to check that each noun ni of the pre- 
fix part (resp. each verb vj of the suffix part) has 
both a single occurrence in its own part, and that 
there is a verb vj in the suffix part (resp. a noun 
ni in the prefix part) such that h(ni,vj) is true. 
The prefix part is examined from left-to-right un- 
til completion by the .Ms-clauses. For each noun 
T in this prefix part, the single occurrence test 
is performed by a negative calls to TinT*(T, X), 
and the existence of a verb vj in the suffix part s.t. 
s(w) -~ 
.M+ V+(W, TY) 
.M+ ~;+(XTY, TY) 
.Ms(T X, Y) 
.Ms (e:, Y) -~ 
.Min lZ+ ( T, T'Y ) 
.MinY+(T, TIY --, 
Vs(X, TY) -~ 
Vs(X,e) 
~)in.M + (T, T'Y --* 
l)in.M + ( T, T'Y 
TinT*(T, T'Y) --* 
TinT*(T, T'Y) 
V*(TX) --, 
V*(~) -~ 
.M(nl ) --~ 
.M(nl) --* 
V(vl) -~ 
v(,,,.) 
h(nl, vx ) --* 
h(nt, vm) 
.M+v+ (w, w) 
len(1, T) .M(T) 
.M+ v+(w, Y) 
len(1,T) ~;(T) V*(Y) 
.Ms(X, TY) \];s(X, TY) 
fen(l, T) TinT*(T, X) 
.Min)2+(T, Y) .Ms(X, Y) 
len(1, T') h(T, T') 
.Min Y+ (T, Y) 
len(1, T') h(T, T') 
len(1, T) TinT*(T, Y) 
~;in.M+(T, X) )2s(X, Y) 
c 
fen(l, T') h(T', T) 
Yin.M+(T, Y) 
fen(l, T') h(T', T) 
len(1, T) eq(T, T') 
TinT*(T, Y) 
len(1, T) eq(T, T') 
len(1,T) 1;(T) \];*(X) 
e: 
Figure 6: RCG of scrambling. 
h(T, W), is performed by the.MinY+(T, Y) call. A 
call TinT*(T, X) is true if and only if the terminal 
symbol T occurs in X. The .MinV+-clauses spell 
from left-to-right the suffix part. If the noun T is 
not an argument of the verb T' (note the nega- 
tive predicate call), this verb is skipped, until an 
h relation between T and T' is eventually found. 
Of course, an analogous processing is performed 
for each verb in the suffix part. We can easily see 
that, the cutting of each source text w in a prefix 
part and a suffix part, and the checking that the 
suffix part only contains verbs, takes a time lin- 
ear in Iw\[. For each noun in the prefix part, the 
unique occurrence check takes a linear time and 
the check that there is a corresponding verb in 
the suffix part also takes a linear time. Of course, 
the same results hold for each verb in the suffix 
part. Thus, we can conclude that the scrambling 
phenomenon can be parsed in quadratic time. 
59 
Proceedings of EACL '99 
6 Conclusion 
The class of RCGs is a syntactic formalism which 
seems very promising since it has many interesting 
properties among which we can quote its power, 
above that of LCFR systems; its efficiency, with 
polynomial time parsing; its modularity; and the 
fact that the output of its parsers can be viewed 
as shared parse forests. It can thus be used as 
is to define languages or it can be used as an in- 
termediate (high-level) representation. This last 
possibility comes from the fact that many popu- 
lar formalisms can be translated into equivalent 
RCGs, without loosing any efficiency. For exam- 
ple, TAGs can be translated into equivalent RCGs 
which can be parsed in O(n 6) time (see \[Boullier 
985\]). 
In this paper, we have shown that this extra for- 
mal power can be used in NL processing. We turn 
our attention to the two phenomena of Chinese 
numbers and German scrambling which are both 
beyond the formal power of MCS formalisms. To 
our knowledge, Chinese numbers were only known 
to be an IL and it was not even known whether 
scrambling can be described by an IG. We have 
seen that these phenomena can both be defined by 
RCGs. Moreover, the corresponding parse time is 
polynomial with a very low degree. During this 
work we have also classified the famous MIX lan- 
guage, as a linear parse time RCL. 
References 
\[Aho 68\] Alfred Aho. 1968. Indexed grammars - 
an extension of context-free grammars. In Jour- 
nal of the ACM, Vol. 15, pages 647-671. 
\[Becker, Joshi, and Rambow 91\] Tilman Becket, 
Aravind Joshi, and Owen Rambow. 1991. Long 
distance scrambling and tree adjoining gram- 
mars. In Proceedings of the fifth Conference of 
the European Chapter of the Association for 
Computational Linguistics (EACL'91), pages 
21-26. 
\[Becker, Rambow, and Niv 92\] Tilman Becket, 
Owen Rambow, and Michael Niv. 1992. The 
Derivational Generative Power of Formal 
Systems or Scrambling is Beyond LCFRS. In 
Technical Report IRCS-92-38, Institute for 
Research in Cognitive Science, University of 
Pennsylvania, Philadelphia, PA. 
\[Boullier 98a\] Pierre Boullier. 1998. Proposal 
for a Natural Language Processing Syntactic 
Backbone. In Research Report No 3342 at 
http ://www. inria, fr/RRRT/RR-3342, html, 
INRIA-Rocquencourt, France, Jan. 1998, 41 
pages. 
\[Boullier 98b\] Pierre Boullier. 1998. A Generaliza- 
tion of Mildly Context-Sensitive Formalisms. In 
Proceedings of the Fourth International Work- 
shop on Tree Adjoining Grammars and Related 
Frameworks (TAG÷4), University of Pennsyl- 
vania, Philadelphia, PA, pages 17-20. 
\[Groenink 97\] Annius Groenink. 1997. SUR- 
FACE WITHOUT STRUCTURE Word order and 
tractability issues in natural language analysis. 
PhD thesis, Utrecht University, The Nether- 
lands, Nov. 1977, 250 pages. 
\[Joshi 85\] Aravind Joshi. 1985. How much 
context-sensitivity is necessary for characteriz- 
ing structural descriptions -- Tree Adjoining 
Grammars. In Natural Language Processing 
-- Theoretical, Computational and Psycholog- 
ical Perspective, D. Dowty, L. Karttunen, and 
A. Zwicky, editors, Cambridge University Press, 
New-York, NY. 
\[Joshi, Vijay-Shanker, and Weir 91\] Aravind 
Joshi, K. Vijay-Shanker, and David Weir. 1991. 
The convergence of mildly context-sensitive 
grammatical formalisms. In Foundational 
Issues in Natural Language Processing, P. Sells, 
S. Shieber, and T. Wasow editors, MIT Press, 
Cambridge, Mass. 
\[Radzinski 91\] Daniel Radzinski. 1991. Chinese 
Number-Names, Tree Adjoining Languages, 
and Mild Context-Sensitivity. In Computa- 
tional Linguistics, 17(3), pages 277-299. 
\[Rainbow 94\] Owen Rainbow. 1994. Formal and 
Computational Aspects of Natured Language 
Syntax. In PhD Thesis, University of Pennsyl- 
vania, Philadelphia, PA. 
\[Rounds 88\]'William Rounds. 1988. LFP: A Logic 
for Linguistic Descriptions and an Analysis of 
its Complexity. In ACL Computational Lin- 
guistics, Vol. 14(4), pages 1-9. 
\[Shieber 85\] Stuart Shieber. 1985. Evidence 
against the context-freeness of natural lan- 
guage. In Linguistics and Philosophy, Vol. 8, 
pages 333-343. 
\[Vijay-Shanker 87\] K. Vijay-Shanker. 1987. A 
study of tree adjoining grammars. PhD thesis, 
University of Pennsylvania, Philadelphia, PA. 
\[Vijay-Shanker, Weir, and Joshi 87\] K. Vijay- 
Shanker, David Weir, and Aravind Joshi. 1987. 
Characterizing Structural Descriptions Pro- 
duced by Various Grammatical Formalisms. In 
Proceedings of the 25th Meeting of the Associa- 
tion for Computational Linguistics (ACL'87), 
Stanford University, CA, pages 104-111. 
60 
