In: Proceedings of CoNLL-2000 and LLL-2000, pages 61-66, Lisbon, Portugal, 2000. 
Modeling the Effect of Cross-Language Ambiguity on 
Human Syntax Acquisition 
William Gregory Sakas 
Department of Computer Science 
Hunter College and The Graduate Center 
City University of New York 
New York, NY 10021 
sakas@hunter.cuny.edu
Abstract 
A computational framework is presented which
is used to model the process by which human
language learners acquire the syntactic component
of their native language. The focus is feasibility:
is acquisition possible within a reasonable amount
of time and/or with a reasonable amount of work?
The approach abstracts away from specific linguistic
descriptions in order to make a 'broad-stroke'
prediction of an acquisition model's behavior by
formalizing factors that contribute to cross-linguistic
ambiguity. Discussion centers around an application
to Fodor's Structural Triggers Learner (STL)
(1998) [1] and concludes with the proposal that
successful computational modeling requires a
parallel psycholinguistic investigation of the
distribution of ambiguity across the domain of
human languages.
1 Principles and Parameters 
Chomsky (1981) (and elsewhere) has proposed
that all natural languages share the same innate
universal principles (Universal Grammar, UG) and
differ only with respect to the settings of a finite
number of parameters. The syntactic component of a
grammar in the principles and parameters (henceforth
P&P) framework is simply a collection of parameter
values, one value per parameter. (Standardly, two
values are available per parameter.) The set of human
grammars is the set of all possible combinations
of parameter values (and lexicon).

[1] The STL is an acquisition model in the principles
and parameters paradigm. The results presented here
are not intended to forward an argument for or against
the model, or for that matter, for or against the
principles and parameters paradigm. Rather, the results
are presented to point out (the possibly not-so-earth-shattering
observation) that the acquisition mechanism
can be extremely sensitive to ambiguity.

The P&P framework was motivated to a large 
degree by psycholinguistic data demonstrating 
the extreme efficiency of human language acquisition.
Children acquire the grammar of their native language
at an early age, generally accepted to be in the
neighborhood of five years old. In the P&P framework,
even if the linguistic theory delineates over a billion
possible grammars, a learner need only determine the
correct 30 values that correspond to the grammar that
generates the sentences of the target language. [2]
Given that a successful syntactic theory must provide
for an efficient acquisition mechanism, and since,
prima facie, parameter values seem transparently
learnable, it is
not surprising that parameters have been incor- 
porated into current generative syntactic the- 
ories. However, the exact process of parame- 
ter setting has been studied only recently (e.g. 
Clark (1992), Gibson and Wexler (1994), Yang 
(1999), Briscoe (2000), among others) and al- 
though it has proved linguistically fruitful to 
construct parametric analyses, it turns out to 
be surprisingly difficult to construct a workable 
model of parameter-value acquisition. 
1.1 Parametric Ambiguity 
A sentence is parametrically ambiguous if it is
licensed by two or more distinct combinations
of parameter values. Ambiguity is a natural enemy
of efficient language acquisition. The problem is
that, due to ambiguity, there does not exist a
one-to-one correspondence between the linear
'word-order' surface strings of the input sample
and the correct parameter values that generate the
target language. Clearly, if every sentence of the
target language triggers one and only one set of
parameter values (i.e. every sentence is completely
unambiguous) and the learner, upon encountering an
input, can determine what those values are, the
parameter setting process is truly transparent.
Unfortunately, not all natural language sentences are
so distinctively earmarked by their parametric
signatures. However, if there exists some degree
of parametric unambiguity in a learner's input
sample, a learner can set parameters by: 1) decoding
the parametric signature of an input sentence,
2) determining if ambiguity exists, and 3) using the
input to guide parameter setting only in the case
that the sentence is parametrically unambiguous.
The motto of such a learner is: Don't learn from
ambiguous input. Learning efficiency can then be
measured by the number of sentences the learner has
to wait for usable, unambiguous inputs to occur in
the input stream. [3]

[2] 30 binary parameters entail approximately a billion
grammars (2^30 = 1,073,741,824).

2 The Structural Triggers Learner 
One recent model of human syntax acquisition,
the Structural Triggers Learner (STL)
(Fodor, 1998), employs the human
parsing mechanism to determine if an input is 
parametrically ambiguous. Parameter values 
are viewed as bits of tree structure (treelets). 
When the learner's current grammar is insuffi- 
cient to parse the current input sentence, the 
treelets may be utilized during the parsing process
in the same way as any natural language
grammar would be applied; no unusual parsing 
activity is necessary. The treelets are adopted 
as part of the learner's current grammar hy- 
pothesis when: 1) they are required for a suc- 
cessful parse of the current input sentence and 
2) the sentence is unambiguous. The STL thus
learns only from fully unambiguous sentences. [4]
[3] Of course, the extent to which such unambiguous
sentences exist in the domain of human languages is an
empirical issue. This is an important open research
question which is the focus of a recent research endeavor
here at CUNY. Our approach involves tagging a large,
cross-linguistic set of child-directed sentences, drawn
from the CHILDES database, with each sentence's parametric
signature. By cross-tabulating the shared parameter values
against different languages, the study should shed some
light on the shape of ambiguity in input samples
typically encountered by children.
[4] This is actually the strategy employed by just one
of several different STL variants, some of which are
designed to manage domains in which unambiguous
sentences are rare or nonexistent.

[Figure 1 diagram not reproduced; its labels are "Parameter value treelets", "Sentence" and "Current grammar".]
Figure 1: An example of how the STL acquires
new parameter values.

See Figure 1.
3 The Feasibility of the STL
The number of input sentences consumed by the
STL before convergence on the target grammar
can be derived from a relatively straightforward
Markov analysis. Importantly, the formulation
most useful for analyzing performance does not
require states which represent the grammars of
the parameter space (contra Niyogi and Berwick
(1996)). Instead, each state of the system
depicts the number of parameters that have been
set, t, and the state transitions represent the
probability that the STL will adopt some number
of new parameter values, w, on the basis of
the current state and whatever usable parametric
information is revealed by the current input
sentence. See Figure 2.

Figure 2: A transition diagram for the STL
performing in a parameter space of three parameters.
Nodes represent the current number of
parameters that have been correctly set. Arcs
indicate a change in the number that are correctly
set. In this diagram, after each input is
consumed, 0, 1 or 2 new parameters may be set.
Once the learner enters state 3, it has converged
on the target.

The following factors (described in detail below)
determine the transition probabilities:
• the number of parameters that have been
set (t)
• the number of relevant parameters (r)
• the expression rate (e)
• the effective expression rate (e')
Not all parameters are relevant parameters.
Irrelevant parameters control properties of phenomena
not present in the target language, such
as clitic order in a language without clitics. For
our purposes, the number of relevant parameters,
r, is the total number of parameters that
need to be set in order to license all and only
the sentences of the target language.
Of the parameters relevant to the target lan- 
guage as a whole, only some will be relevant to 
any given sentence. A sentence expresses those 
parameters for which a specific value is required 
in order to build a parse tree, i.e. those pa- 
rameters which are essential to the sentence's 
structural description. For instance, if a sen- 
tence does not have a relative clause, it will not 
express parameters that concern only relative 
clauses; if it is a declarative sentence, it won't 
express the properties peculiar to questions; and 
so on. The expression rate, e, for a language
is the average number of parameters expressed
by its input sentences. Suppose that each sentence,
on average, is ambiguous with respect to
a of the parameters it expresses. The effective
expression rate, e', is the mean proportion of
expressed parameters that are expressed unambiguously
(i.e. e' = (e - a)/e). It will also be
useful to consider a' = 1 - e'.
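As a small illustration of these two definitions, the sketch below computes e' and a' from e and a. The function name and the sample values (e = 10, a = 2) are illustrative assumptions, not figures taken from the analysis.

```python
def effective_rates(e, a):
    """Return (e', a') for expression rate e and a ambiguous parameters per sentence."""
    if e <= 0 or not 0 <= a <= e:
        raise ValueError("need e > 0 and 0 <= a <= e")
    e_prime = (e - a) / e   # proportion of expressed parameters that are unambiguous
    a_prime = 1 - e_prime   # proportion of expressed parameters that are ambiguous
    return e_prime, a_prime


# Hypothetical sample values: 10 parameters expressed, 2 of them ambiguously.
e_prime, a_prime = effective_rates(10, 2)
print(e_prime, a_prime)
```

With these sample values a' comes out at roughly 0.2, which corresponds to an ambiguity rate of 20% when a' is quoted as a percentage.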
3.1 Derivation of a Transition 
Probability Function 
To present the derivation of the probability that 
the system will change from an arbitrary state 
S_t to state S_{t+w} (0 ≤ w ≤ e), it is useful to set
ambiguity aside for a moment. In order to set all 
r parameters, the STL has to encounter enough 
batches of e parameter values, possibly overlap- 
ping with each other, to make up the full set of 
r parameter values that have to be established. 
Let H(w | t, r, e) be the probability that an arbitrary
input sentence expresses w new (i.e. as
yet unset) parameters, out of the e parameters 
expressed, given that the learner has already set 
t parameters (correctly), for a domain in which 
there are r total parameters that need to be set. 
This is a specification of the hypergeometric 
distribution and is given in Equation 1. 
$$
H(w \mid t, r, e) = \frac{\binom{r-t}{w}\binom{t}{e-w}}{\binom{r}{e}} \qquad (1)
$$
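As a worked instance of Equation 1, take the illustrative values r = 15, t = 5, e = 5 and w = 3 (chosen for this example, not drawn from the tables). Of the r - t = 10 unset parameters the sentence must express exactly 3, and the remaining e - w = 2 expressed parameters must come from the 5 that are already set:

```latex
H(3 \mid 5, 15, 5)
  = \frac{\binom{10}{3}\binom{5}{2}}{\binom{15}{5}}
  = \frac{120 \cdot 10}{3003}
  \approx 0.40
```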
Now, to deal with ambiguity, the effective
rate of expression, e', is brought into play. Recall
that e' is the proportion of expressed parameters
that are expressed unambiguously. It
follows that the probability that any single
parameter is expressed unambiguously is also e'
and, assuming parameters are ambiguous independently
of one another, the probability that all w of the
expressed but as yet unset parameters are expressed
unambiguously is e'^w. That is, the probability that
an input is effectively unambiguous and hence
usable for learning is equal to e'^w.
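$$
P(S_t \rightarrow S_{t+w}) = P(w \mid t, r, e, e') =
\begin{cases}
H(w \mid t, r, e)\, e'^{w}, & 0 < w \le e \\
H(0 \mid t, r, e) + \sum_{i=1}^{e} H(i \mid t, r, e)\,(1 - e'^{i}), & w = 0
\end{cases}
\qquad (2)
$$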
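A quick consistency check, not part of the original derivation: summed over all outcomes w = 0, ..., e, the two cases of Equation (2) recombine into the hypergeometric probabilities H, so the transition probabilities out of any state sum to 1.

```latex
\sum_{w=0}^{e} P(w \mid t, r, e, e')
  = H(0 \mid t, r, e)
    + \sum_{i=1}^{e} H(i \mid t, r, e)\,(1 - e'^{i})
    + \sum_{w=1}^{e} H(w \mid t, r, e)\, e'^{w}
  = \sum_{w=0}^{e} H(w \mid t, r, e)
  = 1
```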
Equation (2) can be used to calculate the prob- 
ability of any possible transition of the Markov 
system that models STL performance. One 
method to determine the number of sentences 
expected to be consumed by the STL is to sum 
the number of sentences consumed in each state. 
Let E(S_i) represent the expected number of sentences
that will be consumed in state S_i. E is
given by the following recurrence relation: [5]
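$$
E(S_0) = \frac{1}{e'^{e}}, \qquad
E(S_n) = \frac{1}{1 - P(S_n \rightarrow S_n)} \sum_{i=n-e}^{n-1} P(S_i \rightarrow S_n)\, E(S_i)
\qquad (3)
$$
The expected total is simply:
$$
E_{tot} = E(S_0) + \sum_{i=e}^{r-1} E(S_i) \qquad (4)
$$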
which is equal to the expected number to be
consumed before any parameters have been set
(= E(S_0)) plus the number expected to be consumed
after the first successful learning event
(at which point the learner will be in state S_e)
summed with the number of sentences expected
to be consumed in every other state up to the
state just before the target is attained (S_{r-1}).
E_tot can be tractably calculated using dynamic
programming.

[5] The functional E is derived from basic properties
of Markov chains. See Taylor and Karlin (1994) for a
general derivation.

 e  a'(%)        r = 15        r = 20        r = 25        r = 30
 1     20            62            90           119           150
       40            83           120           159           200
       60           124           180           238           300
       80           249           360           477           599
 5     20            15            22            29            36
       40            34            46            59            73
       60           144           176           210           245
       80         3,300         3,466         3,666         3,891
10     20            14            18            23            28
       40           174           187           203           221
       60         9,560         9,621         9,727         9,878
       80     9,765,731     9,766,375     9,768,376     9,772,740
15     20            28            32            37            41
       40         2,127         2,136         2,153         2,180
       60       931,323       931,352       931,479       931,822
       80                    ... over 10 billion ...
20     20                          87            91            95
       40                      27,351        27,361        27,383
       60                  90,949,470    90,949,504    90,949,728
       80                    ... in the trillions ...
Table 1: Average number of inputs consumed
by the waiting-STL before convergence. Fixed
rate of expression.
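The dynamic-programming calculation of E_tot can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation behind the reported results: the function names are hypothetical, H is taken to be the hypergeometric probability described for Equation 1, and the term 1 - P(S_n → S_n) is computed as the sum of the probabilities of leaving state S_n, which is equivalent because the transition probabilities out of a state sum to 1.

```python
from math import comb


def hypergeom(w, t, r, e):
    """Equation 1: chance that a sentence expressing e of r parameters
    expresses exactly w unset ones when t parameters are already set."""
    return comb(r - t, w) * comb(t, e - w) / comb(r, e)


def transition(w, t, r, e, e_prime):
    """Equation 2: probability of moving from state S_t to S_{t+w}."""
    if w > 0:
        return hypergeom(w, t, r, e) * e_prime ** w
    # w = 0: nothing new is expressed, or something new is expressed ambiguously.
    return hypergeom(0, t, r, e) + sum(
        hypergeom(i, t, r, e) * (1 - e_prime ** i) for i in range(1, e + 1)
    )


def expected_total(r, e, e_prime):
    """Equations 3 and 4: expected number of sentences consumed before convergence."""
    visits = [0.0] * r            # visits[n] holds E(S_n) for n = 0 .. r-1
    visits[0] = 1 / e_prime ** e  # base case E(S_0)
    for n in range(1, r):
        inflow = sum(
            transition(n - i, i, r, e, e_prime) * visits[i]
            for i in range(max(0, n - e), n)
        )
        # Probability of leaving S_n, i.e. 1 - P(S_n -> S_n).
        leave = sum(transition(w, n, r, e, e_prime) for w in range(1, e + 1))
        visits[n] = inflow / leave
    # States S_1 .. S_{e-1} are unreachable and contribute 0, so this is Equation 4.
    return sum(visits)


for r, e, a_percent in [(15, 1, 20), (20, 20, 20), (15, 15, 60)]:
    print(r, e, a_percent, round(expected_total(r, e, 1 - a_percent / 100)))
```

For the three parameter settings printed, the rounded totals can be compared against the corresponding entries of Table 1.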
3.2 Some Results 
Table 1 presents numerical results derived by
fixing different values of r, e, and e'. In order
to make assessments of performance across different
situations in terms of increasing rates of
ambiguity, a percentage measure of ambiguity,
a', is employed which is directly derived from
e': a' = 1 - e', and is presented in Table 1 as a
percent (the proportion is multiplied by 100).
Notice that the number of parameters to be 
set (r) has relatively little effect on convergence 
time. What dominates learning speed is ambiguity
and expression rates. When a' and e
are both high, the STL is consuming an unrea- 
sonable number of input sentences. However, 
the problem is not intrinsic to the STL model 
of acquisition. Rather, the problem is due to 
a too rigid restriction present in the current 
formulation of the input sample. By relaxing 
the restriction, the expected performance of the 
STL improves dramatically. But first, it is in- 
formative to discuss why the framework, as pre- 
sented so far, leads to the prediction that the 
STL will consume an extremely large number 
of sentences at rates of ambiguity and expression
approaching natural language.
By far the greatest amount of damage in- 
flicted by ambiguity occurs at the very earliest 
stages of learning. This is because before any 
learning takes place, the STL must wait for the 
occurrence of a sentence that is fully unambiguous.
Such sentences are bound to be extremely
rare if the expression rate and the degree of
ambiguity are high. For instance, a sentence with 20
out of 20 parameters unambiguous will virtually
never occur if parameters are ambiguous on average
99% of the time (the probability would be
(1/100)^20).
After learning gets underway, STL perfor- 
mance improves tremendously; the generally 
damaging effect of ambiguity is mitigated. Ev- 
ery successful learning event decreases the num- 
ber of parameters still to be set. Hence, the 
expression rate of unset parameters decreases 
as learning proceeds. And to be usable by the 
STL, the only parameters that need to be ex- 
pressed unambiguously are those that have not 
yet been set. For example, if 19 parameters
have already been set and e = r = 20 as in the
example above, the probability of encountering
a usable sentence in the case that parameters
are ambiguous on average 99% of the time and
the input sample consists of sentences expressing
20 parameters is (1/100)^1 = 1/100, compared
with (1/100)^20 before any parameter was set.
This can be derived by plugging w = 1, t = 19,
e = 20, and r = 20 into Equation (2), which gives
H(1 | 19, 20, 20)(1/100)^1 = (1)(1/100).
Clearly, the probability of seeing usable inputs
increases rapidly as the number of parameters
that are set increases. All that is needed,
therefore, is to get parameter setting started, so
that the learner can quickly be pulled down
into more comfortable regions of parametric
expression. Once parameter setting is underway,
the STL is extremely efficient.
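The effect can be made concrete for the special case used in the examples above, e = r = 20 with parameters ambiguous 99% of the time. The sketch below is only an illustration under that assumption: because every sentence expresses all r parameters, a sentence is usable exactly when each of the r - t unset parameters is unambiguous. The function name is hypothetical, and the expected wait is taken to be the mean of a geometric distribution.

```python
def usable_probability(t, r, e_prime):
    """Chance that a sentence expressing all r parameters is usable when t are
    already set: each of the r - t unset parameters must be unambiguous."""
    return e_prime ** (r - t)


r, e_prime = 20, 0.01  # e = r = 20, parameters ambiguous 99% of the time
for t in (0, 10, 19):
    p = usable_probability(t, r, e_prime)
    # Mean of a geometric distribution: expected sentences until the next usable one.
    print(f"t={t:2d}  P(usable)={p:.3g}  expected wait={1 / p:.3g} sentences")
```

The printed probabilities grow by a factor of 100 for each additional parameter that has been set, which is why the early states dominate the total.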
3.3 Distributed Expression Rate 
So far e has been conveniently taken to be fixed
across all sentences of the target language. In
that case, when e = 10, the learner will have
to wait for a sentence with exactly 10 unambiguously
expressed parameters in order to get
started on learning, and as discussed above, it
can be expected that this will be a very long
wait. However, if one takes the value of e to be
uniformly distributed (rather than fixed) then
the learner will encounter some sentences which
express fewer than 10 parameters, and which
are correspondingly more likely to be fully
unambiguous and hence usable for learning.
In fact, any distribution of e can be incorporated
into the framework presented so far. Let
D_I(x) denote the probability distribution of
expression of the input sample, that is, the
probability that an arbitrarily chosen sentence from
the input sample I expresses x parameters. For
example, if D_I imposes a uniform distribution,
then D_I(x) = 1/e_max where every sentence
expresses at least 1 parameter and e_max is the
maximum number of parameters expressed by
any sentence. Given D_I, a new transition
probability P'(S_t → S_{t+w}) = P'(w | t, r, e_max, e') can
be formulated as:
$$
P'(w \mid t, r, e_{max}, e') = \sum_{i=w}^{e_{max}} D_I(i)\, P(w \mid t, r, i, e') \qquad (5)
$$
where P is defined in (2) above and e_max represents
the maximum number of parameters that
a sentence may express instead of a fixed number
for all sentences.
To see why Equation (5) is valid, consider
that to set w new parameters at least w must
be expressed in the current input sentence. Also
usable are sentences that express more parameters
(w + 1, w + 2, w + 3, ..., e_max). Thus, the
probability of setting w new parameters is simply
the sum, over i from w to e_max, of the probability
that a sentence expressing i parameters is
encountered by the STL (= D_I(i)) times the
probability that the STL can set w additional
parameters given that i are expressed.
By replacing P with P' in Equation 3 and modifying
the derivation of the base case, [6] the total
expected number of sentences that will be
consumed by the STL given a distribution of
expression D_I(x) can be calculated.
Table 2 presents numerical results derived by 
fixing r and a' and allowing e to vary uniformly 
from 0 to emax. As in Table 1, a percentage 
measure of ambiguity, a', is employed. 
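The calculation with a distributed expression rate can be sketched in the same dynamic-programming style. This is a hedged illustration, not the original implementation: the function names are hypothetical, D_I is assumed to be uniform over 0, 1, ..., e_max (so D_I(x) = 1/(e_max + 1)), e_max is assumed not to exceed r, and only the w > 0 transitions are computed because the probability of staying in a state is their complement.

```python
from math import comb


def step_probability(w, t, r, e, e_prime):
    """Equation 2 for w > 0: a sentence expressing e parameters sets w new ones."""
    return comb(r - t, w) * comb(t, e - w) / comb(r, e) * e_prime ** w


def step_probability_uniform(w, t, r, e_max, e_prime):
    """Equation 5 for w > 0, assuming D_I is uniform over 0 .. e_max."""
    d = 1 / (e_max + 1)
    return sum(d * step_probability(w, t, r, i, e_prime) for i in range(w, e_max + 1))


def expected_total_uniform(r, e_max, e_prime):
    """Expected number of sentences consumed before all r parameters are set."""
    def leave(t):
        # Probability of leaving state S_t, i.e. 1 - P'(S_t -> S_t).
        return sum(
            step_probability_uniform(w, t, r, e_max, e_prime)
            for w in range(1, e_max + 1)
        )

    visits = [0.0] * r           # visits[n] holds E(S_n) for n = 0 .. r-1
    visits[0] = 1 / leave(0)     # modified base case: 1 / (1 - P'(S_0 -> S_0))
    for n in range(1, r):
        inflow = sum(
            step_probability_uniform(n - i, i, r, e_max, e_prime) * visits[i]
            for i in range(max(0, n - e_max), n)
        )
        visits[n] = inflow / leave(n)
    return sum(visits)


for r, e_max, a_percent in [(15, 1, 20), (20, 1, 20)]:
    print(r, e_max, a_percent, round(expected_total_uniform(r, e_max, 1 - a_percent / 100)))
```

With e_max = 1 half of the sentences express no parameter at all, so the expected totals are twice those obtained with e fixed at 1.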
The results displayed in the table indicate
a striking decrease in the number of sentences
that the STL can be expected to consume
compared to those obtained with a fixed
expression rate in place. As a rough comparison,
with the ambiguity rate (a') at 80% and r = 20: when
e varies uniformly from 0 to 10, the STL requires
430 sentences (from Table 2); when e is
fixed at 5, the number of sentences required is
3,466 (from Table 1).

[6] E(S_0) = 1/(1 - P'(S_0 → S_0))

e_max  a'(%)        r = 15        r = 20        r = 25        r = 30
    1     20           124           180           238           300
          40           166           240           318           399
          60           249           360           477           599
          80           498           720           954          1198
    5     20            28            40            53            67
          40            46            65            86           107
          60            89           124           161           199
          80           235           324           417           511
   10     20            17            24            32            40
          40            40            55            70            86
          60           102           137           173           209
          80           323           430           538           648
   15     20            15            21            27            33
          40            46            62            77            93
          60           134           176           219           262
          80           447           586           726           868
   20     20                          20            26            32
          40                          74            91           109
          60                         223           275           327
          80                         755           931          1108
Table 2: Average number of inputs consumed
by the STL before convergence. Uniformly
distributed rate of expression.
4 Discussion 
Although presented as a feasibility analysis of 
parameter-setting -- specifically of STL perfor- 
mance, it should be clear that the relevant fac- 
tors e', e, r, etc. can be applied to shape an 
abstract input domain for almost any learning 
strategy. This is important because questions of 
a model's feasibility have proved difficult to an- 
swer in spaces of a linguistically plausible size. 
Recent attempts necessarily rely on severely
small, highly circumscribed language domains
(e.g. Gibson and Wexler (1994), among others).
These studies frequently involve the construction
of an idealized language sample which is
(at best) an accurate subset of sentences that
a child might hear. A simulated learner is let 
loose on the input space and results consist of 
either the structure of the grammar(s) acquired 
or the specific circumstances under which the 
learner succeeds or fails to attain the target. 
Without question, this research agenda is valu- 
able and can bring to light interesting charac- 
teristics of the acquisition process. (cf. Gibson 
and Wexler's (1994) argument for certain de- 
fault parameter values based on the potential 
success or failure of verb-second acquisition in 
a three-parameter domain. And, for a different 
perspective, Elman et al.'s (1996) discussions of 
English part-of-speech and past-tense morphol- 
ogy acquisition in a connectionist framework.) 
I stress that my point here is not to give a
full accounting of STL performance. Substantial
work has been completed towards this end
(Sakas (2000), Sakas and Fodor (In press)),
as well as towards applying a similar framework
to other models (see Sakas and Demner-Fushman
(In prep.) for an application to Gibson
and Wexler's Triggering Learning Algorithm).
Rather, I intend to put forth the conjecture
that syntax acquisition is extremely sensitive
to the distribution of ambiguity, and, given
this extreme sensitivity, suggest that simulation
studies need to be conducted in conjunction
with a broader analysis which abstracts away
from whatever linguistic particulars are necessary
to bring about the sentences required to
build the input sample that feeds the simulated
learner.
Ultimately, whether a particular acquisition 
model is successful is an empirical issue and de- 
pends on the exact conditions under which the 
model performs well and the extent to which 
those favorable conditions are in line with the 
facts of human language. Thus, I believe a
three-fold approach to validate a computational 
model of acquisition is warranted. First, an 
abstract analysis (similar to the one presented 
here) should be constructed that can be used 
to uncover a model's sweet spots -- where the 
shape of ambiguity is favorable to learning per- 
formance. Second, a computational psycholin- 
guistic study should be undertaken to see if the 
model's sweet spots are in line with the distribution
of ambiguity in natural language. And
finally, a simulation should be carried out. 
Obviously, this is a huge proposal requiring
years of person-hours and coordinated planning
among researchers with diverse skills. But if
computational modeling is going to eventually
lay claim to a model which accurately mirrors
the human process of language acquisition,
years of fine grinding are necessary.
Acknowledgments. This work was supported 
in part by PSC-CUNY-30 Research Grant 61595- 
00-30. The three-fold approach is at the root of a 
project we have begun at The City University of New 
York. Many thanks to my collaborators Janet Dean
Fodor and Virginia Teller for many useful discussions 
and input, as well as to two anonymous reviewers for 
their helpful comments. 

References 
E.J. Briscoe. 2000. Grammatical Acquisition: 
Inductive Bias and Coevolution of Language 
and the Language Acquisition Device. Lan- 
guage, 76(2). 
N. Chomsky. 1981. Lectures on Government
and Binding. Foris, Dordrecht.
R. Clark. 1992. The selection of syntactic 
knowledge. Language Acquisition, 2(2):83- 
149. 
J.L. Elman, E. Bates, M.A. Johnson,
A. Karmiloff-Smith, D. Parisi, and K. Plunkett.
1996. Rethinking Innateness: A Connectionist
Perspective on Development. MIT Press,
Cambridge, MA.
J.D. Fodor. 1998. Unambiguous triggers. Lin- 
guistic Inquiry, 29(1):1-36. 
E. Gibson and K. Wexler. 1994. Triggers. Lin- 
guistic Inquiry, 25(3):407-454. 
P. Niyogi and R.C. Berwick. 1996. A language
learning model for finite parameter spaces.
Cognition, 61:161-193.
W.G. Sakas. 2000. Ambiguity and the Compu- 
tational Feasibility of Syntax Acquisition. Un- 
published Ph.D. dissertation, City University 
of New York. 
W.G. Sakas and D. Demner-Fushman. In Prep. 
Simulating Parameter Setting Performance in 
Domains with a Large Number of Parameters: 
A Hybrid Approach. 
W.G. Sakas and J.D. Fodor. In Press. The 
Structural Triggers Learner. In Stefano 
Bertolo, editor, Parametric Linguistics and 
Learnability: A Self-contained Tutorial for 
Linguists. Cambridge University Press,
Cambridge, UK.
H.M. Taylor and S. Karlin. 1994. An Introduc- 
tion to Stochastic Modeling. Academic Press, 
San Diego, CA. 
C.D. Yang. 1999. A selectionist theory of lan- 
guage acquisition. In Proceedings of the 37th 
Annual Meeting of the ACL. Association for
Computational Linguistics. 
