II 
i 
Hi 
HI 
m 
i 
m 
i 
m 
ill 
IH 
i 
IH 
IH 
HI 
I 
I 
I 
I 
i 
I 
Linguistic Theory in Statistical Language Learning 
Christer Samuelsson 
Bell Laboratories, Lucent Technologies 
600 Mountain Ave, Room 2D-339, 
Murray Hill, NJ 07974, USA 
christer@research, bell-labs, tom 
Abstract 
This article attempts to determine what ele- 
ments of linguistic theory are used in statisti- 
cal language learning, and why the extracted 
language models look like they do. The study 
indicates that some linguistic elements, such as 
the notion of a word, are simply too useful to 
be ignored. The second most important factor 
seems to be features inherited from the origi- 
nal task for which the technique was used, for 
example using hidden Markov models for part- 
of-speech tagging, rather than speech recogni- 
tion. The two remaining important factors are 
properties of the runtime processing scheme 
employing the extracted language model, and 
the properties of the available corpus resources 
to which the statistical learning techniques are 
applied. Deliberate attempts to include lin- 
guistic theory seem to end up in a fifth place. 
1 Introduction 
What role does linguistics play in statistical lan- 
guage learning? "None at all!" might be the answer, 
if we ask hard-core speech-recognition professionals. 
But even the most nonlinguistic language model, for 
example a statistic word bigram model, actually re- 
lies on key concepts integral to virtually all linguistic 
theories. Words, for example, and the notion that 
sequences of words form utterances. 
Statistical language learning is applied to some set 
of data to extract a language model of some kind. 
This language model can serve a purely decorative 
purpose, but is more often than not used to pro- 
cess data in some way, for example to aid speech 
recognition. Anyone working under the pressure of 
producing better results, and who employs language 
models to this purpose, such a researchers in the 
field of speech recognition, will have a high incen- 
tive of incorporating useful aspects of language into 
his or her language models. Now, the most useful, 
and thus least controversial ways of describing lan- 
guage will, due to their usefulness, find their way 
into most linguistic theories and, for the very same 
reason, be used in models that strive to model lan- 
guage successfully. 
So what do the linguistic theories underlying vari- 
ous statistical language models look like? And why? 
It may be useful to distinguish between those aspects 
of linguistic theory that are incidentally in the lan- 
guage model, and those that are there intentionally. 
We will start our tour of statistical language learn- 
ing by inspecting language models with "very little" 
linguistic content, and then proceed to analyse in- 
creasingly more linguistic models, until we end with 
models that are entirely linguistic, in the sense that 
they are pure grammars, associated with no statis- 
tical parameters. 
2 Word N-gr_~m Models 
Let us return to the simple bigram word model, 
where the probability of each next word is deter- 
mined from the current one. We already noted that 
this model relies on the notion of a word, the notion 
of an utterance, and the notion that an utterance is 
a sequence of words. 
The way this model is best visualized, and as it 
happens, best implemented, is as a finite-state au- 
tomaton (FSA), with arcs and states both labelled 
with words, and transition probabilities associated 
with each arc. For example, there will be one state 
labelled The with one arc to each other state, for 
example to the state Cat, and this arc will be la- 
belled cat. The reason for labelling both arcs and 
states with words is that the states constitute the 
only memory device available to an FSA. To re- 
member that the most recent word was "cat", all 
arcs labelled cat must fall into the same state Cat. 
The transition probability from the state The along 
the unique arc labelled cat to the state Cat will be 
the probability of the word "cat" following the word 
"the", P(cat l the). 
More generally, we enumerate the words 
Samuelsson 83 Linguistic Theory 
Christer Samuelsson, Bell Laboratories (1998) Linguistic Theory in Statistical Language Learning. In D.M.W. Powers (ed.) 
NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 83-89. 
{Wa,...,wlv} and associate a state Si with 
each word wl. Now the automaton has the states 
{$1,..., SN} and from each state Si there is an arc 
labelled wj to state Sj with transition probability 
P(wj I wi), the word bigram probability. To 
establish the probabilities of each word starting or 
finishing off the utterance, we introduce the special 
state So and special word w0 that marks the end 
of the utterance, and associate the arc from So to 
Si with the probability of wi starting an utterance, 
and the arc from Si to So with the probability of 
an utterance ending with word wl. 
If we want to calculate the probability of a word 
sequence wil ... wi,,, we simply multiply the bigram 
probabilities: 
P(wi~ . ..wi,~) = 
= P(wi, I WO)" P(wi2 I wi,)..... P(wo I Wi.) 
We now recall something from formal language 
theory about the equivalence between finite-state 
automata and regular languages. What does the 
equivalent regular language look like? Let's just first 
rename So S and, by stretching it just a little, let the 
end-of-utterance marker wo be ~, the empty string. 
S ~ wiSi P(wi \[ e) 
S~ ~ wjSj P(wj \[wi) 
Si -* e P(~ \[ wi) 
Does this give us any new insight? Yes, it does! Let's 
define a string rewrite in the usual way: cA7 =~ a~7 
if the rule A -+ fl is in the grammar. We can then 
derive the string Wil ... wl, from the top symbol S 
in n+l steps: 
S ::~ WilSil ~ WilWi2Si2 :=~ ... 
Wi 1 • . . Wi n 
Now comes the clever bit: if we define the deriva- 
tion probability as the product of the rewrite proba- 
bilities, and identify the rewrite and the rule proba- 
bilities, we realize that the string probability is sim- 
ply the derivation probability. This illustrates one 
of the most central aspects of probabilistic parsing: 
String probabilities are defined in terms o\] 
derivation probabilities. 
So the simple word bigram model not only em- 
ploys highly useful notions from linguistic theory, 
it implicitly employs the machinery of rewrite rules 
and derivations from formal language theory, and it 
also assigns string probabilities in terms of deriva- 
tion probabilities, just like most probabilistic pars- 
ing schemes around. However, the heritage from 
finite-state automata results in simplistic models of 
interword dependencies. 
General word N-gram models, of which word bi- 
gram models are a special case with "N" equal to 
two, can be accommodated in very much the same 
way by introducing states that remember not only 
the previous word, but the N-1 previous words. This 
generalization is purely technical and adds little or 
no linguistic fuel to the model from a theoretical 
point of view. From a practical point of view, the 
gain in predictive power using more conditioning in 
the probability distributions is very quickly over- 
come by the difficulty in estimating these probabil- 
ity distributions accurately from available training 
data; the perennial sparse-data problem. 
So why does this model look like it does? We 
conjecture the following explanations: Firstly, it is 
directly applicable to the representation used by an 
acoustic speech recognizer, and this can be done ef- 
ficiently as it essentially involves intersecting two 
finite-state automata. Secondly, the model parame- 
ters -- the word bigram probabilities -- Can be es- 
timated directly from electronically readable texts, 
and there is a lot of that a~ilable. 
3 Tag N-gram Models 
Let us now move on to a somewhat more linguisti- 
cally sophisticated language model, the tag N-gram 
model. Here, the interaction between words is me- 
diated by part-of-speech (PoS) tags, which consti- 
tute linguistically motivated labels that we assign to 
each word in an utterance. For example, we might 
look at the basic word classes adjectives, adverbs, 
articles, conjunctions, nouns, numbers, prepositions, 
pronouns and verbs, essentially introduced already 
by the ancient Greek Dionysius Thrax. We imme- 
diately realise that this gives us the opportunity to 
include a vast amount of linguistic knowledge into 
our model by selecting the set of PoS tags appro- 
priately; consequently, this is a much debated and 
controversial issue. 
Such a representation can be used for disambigua- 
tion, as in the case of the well-known, highly ambigu- 
ous example sentence "Time flies like an arrow". We 
can for example prescribe that "Time" is a noun, 
"flies" is a verb, "like" is a preposition (or adverb, 
according to your taste), "an" is an article, and that 
"arrow" is a noun. In effect, a label, i.e., a part- 
of-speech tag, has been assigned to each word. We 
realise that words may be assigned different labels 
in different context, or in different readings; for ex- 
ample, if we instead prescribe that '2\]ies" is a noun 
and "like" is a verb, we get another reading of the 
sentence. 
Samuelsson 84 Linguistic Theory 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
II 
m 
What does this language model look like in more 
detail? We can actually recast it in virtually the 
same terms as the word bigram model, the only dif- 
ference being that we interpret each state Si as a 
PoS tag (in the bigram case, and as a tag sequence 
in the general N-gram case): 
S -+ wkSi P(Si ~ wk \[ S) 
S~ -+ ~kSj P(Sj ~ wk I S~) 
S~ -+ e P(e I &) 
Note that we have now separated the words Wk from 
the states Si and that thus in principle, any state 
can generate any word. This is actually a slightly 
more powerful formalism than the standard hidden- 
markov model (HMM) used for N-gram PoS tagging 
(5). We recast it as follows: 
S ~ TiS~ P(S~ I s) 
s~ -+ TjSj P(s# I&) 
Si -+ e P(e I S i) 
Ti -~ w~ P(wk \[ Ti) 
Here we have the rules of the form Si ~ TjSj, with 
the corresponding probabilities P(Sj \[ Si), encoding 
the tag N-gram statistics. This is the probability 
that the tag Tj will follow the tag Ti, (in the bigram 
case, or the sequence encoded by Si in the general 
N-gram case). The rules Ti -+ wk with probabilities 
P(Wk \[ Ti) are the lexical probabilities, describing 
the probability of tag Ti being realised as word wk. 
The latter probabilities seem a bit backward, as we 
would rather think in terms of the converse probabil- 
ity P(Ti \[ Wk) of a particular word wk being assigned 
some PoS tag Ti, but one is easily recoverable form 
the other using Bayesian inversion: 
P(Ti I wk) " e(wk) P(wk IT i) 
= P(T~) 
We now connect the second formulation with the 
first one by unfolding each rule Tj ---> wk into each 
rule Si -+ TjSj. This lays bare the independence 
assumption 
P(sj & ~k I s~) = P(% I si) . P(Wk I Tj) 
As should be clear from this correspondence, the 
HMM-based PoS-tagging model can be formulated 
as a (deterministic) FSA, thus allowing very fast pro- 
cessing, linear in string length. 
The word string wkl ... wk, can be derived from 
the top symbol S in 2n+l steps: 
S ~ Ti, Si a =:~ wkxSi, =~ WklTi2Si2 ::~ 
Wkx Wk2 Si2 =~ • • • =~ Wkx • • • Wk, 
Samuelsson 85 
The interpretation of this is that we start off in the 
initial state S, select a PoS tag Tia at random, ac- 
cording to the probability distribution in state S, 
then generate the word wkl at random according 
to the lexical distribution associated with tag Til, 
then draw a next PoS tag Ti2 at random according 
to the transition probabilities associated with state 
Si~, hop to the corresponding state Si2, generate the 
word wk2 at random according to the lexical distri- 
bution associated with tag Ti2, etcetera. 
Another general lesson can be learned from this: If 
we wish to calculate the probability of a word string, 
rather than of a word string with a particular tag 
assbciated with each word, as the model does as it 
stands, it would be natural to sum over the set of 
possible ways of assigning PoS tags to the words of 
the string. This means that: 
The probability of a word string is the sum 
of its derivation probabilities. 
The model parameters P(Sj \[ Si) and P(wk I Tj) 
can be estimated essentially in two different ways. 
The first employs manually annotated training data 
and the other uses unannotated data and some rees- 
timation technique such as Baum-Welch reestima- 
tion (1). In both cases, an optimal set of parame- 
ters is sought, which will maximize the probability 
of the training data, supplemented with a portion 
of the black art of smoothing. In the former case, 
we are faced with two major problems: a shortage 
of training data, and a relatively high noise level in 
existing data, in terms of annotation inconsistencies. 
In the latter case, the problems are the instability of 
the resulting parameters as a function of the initial 
lexieal bias required, and the fact that the chances 
of finding a global optimum using any computation- 
ally feasible technique rapidly approach zero as the 
size of the model (in terms of the number of tags, 
and N) increases. Experience as shown that, despite 
the noise level, annotated training data yields better 
models. 
Let us take a step back and see what we have got: 
We have notion of a word, the notion of an utterance, 
the notion that an utterance is a sequence of words, 
the machinery of rewrite rules and derivations, and 
string probabilities are defined as the sum of the of 
derivation probabilities. In addition to this, we have 
the possibility to include a lot of linguistic knowl- 
edge into the model by selecting an appropriate set 
of PoS tags. We also need to somehow specify the 
model parameters P(Sj \[ Si) and P(wk I Tj). Once 
this is done, the model is completely determined. 
In particular, the only way that syntactic relations 
are modelled are by the probability of one PoS tag 
Linguistic Theory 
given the previous tag (or in the general N-gram 
case, given the previous N-1 tags). And just as in 
the case of word N-grams, the sparse data problem 
sets severe bounds on'N, effectively limiting it to 
about three. 
We conjecture that the explanation to why this 
model looks like it does is that it was imported 
wholesale from the field of speech recognition, and 
proved to allow fast, robust processing at accuracy 
level that until recently were superior to, or on par 
with, those of hand-crafted rule-based approaches. 
4 Stochastic Grammar Models 
To gain more control over the syntactic relation- 
ships between the words, we turn to stochastic 
context-free grammars (SCFGs), originally proposed 
by Booth and Thompson (4). This is the framework 
in which we have already discussed the N-gram mod- 
els, and it has been the starting point for many ex- 
cursions into probabilistic-parsing land. A stochas- 
tic context-free grammar is really just a context- 
free grammar where each grammar rule has been 
assigned a probability. If we keep the left-hand-side 
(LHS) symbol of the rule fix, and sum these proba- 
bilities over the different RHSs, we get one, since the 
probabilities are conditioned on the LHS symbol. 
The probability of a particular parse tree is the 
probability of its derivation, which in turn is the 
product of the probability of each derivation step. 
The probability of a derivation step is the proba- 
bility of rewriting a given symbol using some gram- 
mar rule, and equals the rule probability. Thus, the 
parse-tree probability is the product of the rule prob- 
abilities. Since the same parse tree can be derived 
in different ways by first rewriting some symbol and 
then another, or vice versa, we need to specify the or- 
der in which the nonterminal symbols of a sentential 
form are rewritten. We require that in each deriva- 
tion step, the leftmost nonterminal is always rewrit- 
ten, which yields us the leftmost derivation. This es- 
tablishes a one-to-one correspondence between parse 
trees and derivations. 
We now have plenty of opportunity to include lin- 
guistic theory into our model by the choice of syn- 
tactic categories, and by the selection of grammar 
rules. The probabilistic limitations of the model mir- 
ror the expressive power of context-free grammars, 
as the independence assumptions exactly match the 
compositionality assumptions. For this reason, there 
is an efficient algorithm for finding the most proba- 
ble pares tree, or calculating the string probability 
under an SCFG. The algorithm is a variant of the 
Cocke-Kasami-Younger (CKY) algorithm (17), but 
can also be seen as an incarnation of a more gen- 
eral dynamic-programming scheme, and it is cubic 
in string length and grammar size. We conjecture 
that exactly the properties of SCFGs discussed in 
this paragraph explain why the model looks like it 
does. 
We again have the choice between training the 
model parameters, the rule probabilities, on anno- 
tated data, or use unannotated data and some rees- 
timation method like the inside-outside algorithm, 
which is the natural generalization of the Baum- 
Welch method of the previous section. If the chances 
of finding a global optimum were slim using the 
Baurn-Welch algorithm, they're virtually zero us- 
ing the inside-outside algorithm. There is also very 
much instability in terms of what set of rule prob- 
abilities one arrives at as a function of the initial 
assignment of rule probabilities in the reestimation 
process. The other option, training on annotated 
data, is also problematic, as there is precious little 
of it available, and what exist is quite noisy. A cor- 
pus of CFG-analysed sentences is known as a tree 
bank, and tree banks will be the topic of the next 
section. 
As we have been stressing, the key idea is to assign 
probabilities to derivation steps. If we instead look 
at the rightmost derivation in reverse, as constructed 
by an LR parser, we can take as the derivation prob- 
ability the probability of the action sequence, i.e., 
the product of the probabilities of each shift and re- 
duce action in it. This isn't exactly the same think 
as an SCFG, since the probabilities are typically not 
conditioned on the LHS symbol of some grammar 
rule, but on the current internal state and the cur- 
rent lookahead symbol. As observed by Fernando 
Pereira (12), this gives us the possibility to throw in 
a few psycho-linguistic features such as right associ- 
ation and minimal attachment by preferring shift ac- 
tions to reductions, and longer reductions to shorter 
ones, respectively. So if these features are present in 
language, they should show up in our training data, 
and thus in our language model. Whether these fea- 
tures are introduced or incidental is debatable. 
We can take the idea of derivational stochastic 
grammars one step further and claim that a parse 
tree constructed by any sequence of derivation ac- 
tions, regardless of what the derivation actions are, 
should be assigned the product of the probabilities 
of each derivation step, appropriately conditioned. 
This idea will be crucial for the various extensions 
to SCFGs discussed in the next section. 
5 Models Using Tree Banks 
As previously mentioned, a tree bank is.a corpus of 
CFG-annotated sentences, i.e., a collection of parse 
Samuelsson 86 Linguistic Theory 
!1 
!i 
Ii 
II 
II 
!1 
!1 
II 
II 
II 
Ii 
Ii 
II 
II 
II 
II 
!1 
i 
ill 
ill 
HI 
i 
HI 
m 
m 
II 
HI 
Igl 
HI 
Hi 
E 
trees. The mere existence of a tree bank actu- 
ally inspired a statistic language model, namely the 
data-oriented parsing (DOP) model (3) advocated 
by Remko Scha and Rens Bod. This model parses 
not only with the entire tree bank as its grammar, 
but with a grammar consisting of each subtree of 
each tree in the tree bank. One interesting conse- 
quence of this is that there will in general be many 
different leftmost derivations of any given parse tree. 
This can most easily be seen by noting that there is 
one leftmost derivation for each way of cutting up a 
parse tree into subtrees. Therefore, the parse prob- 
ability is defined as the sum of the derivation prob- 
abilities, which is the source to the NP-hardness of 
finding the most probable parse tree for a given in- 
put sentence under this model, as demonstrated by 
Khalil Sima'an (15). 
There aren't really that many tree banks around, 
and the by far most popular one for experiment- 
ing with probabilistic parsing is the Penn Treebank 
(11). This leads usto the final source of influence on 
the linguistic theory employed in statistical language 
learning: the available training and testing data. 
The annotators of the Penn Treebank may have 
overrated the minimal-attachment principle, result- 
ing in very fiat rules with a minimum of recursion, 
and thus in very many rules. In fact, the Wall- 
Street-Journal portion of it consists of about a mil- 
lion words analysed using literally tens of thousands 
of distinct grammar rules. For example, there is one 
rule of the form 
NP ~ Det Noun (, Noun )n Conj Noun 
for each value of n seen in the corpus. There is 
not even close to enough data to accurately estimate 
the probabilities of most rules seen in the training 
data, let alone to achieve any type of robustness for 
unseen rules. This inspired David Magerman and 
subsequently Michael Collins to instead generate the 
RHS dynamically during parsing. 
Magerman (10) grounded this in the idea that a 
parse tree is constructed by a sequence of gener- 
alized derivation actions and the derivation prob- 
ability is the parse probability, a framework that is 
sometimes referred to as history-based parsing (2), 
at least when decision trees are employed to deter- 
mine the probability of each derivation action taken. 
More specifically, to allow us to assemble the RI-ISs 
as we go along, any previously constructed syntac- 
tic constituent is assigned the role of the leftmost, 
rightmost, middle or single daughter of some other 
constituent with some probability. It may or may 
not also be the syntactic head of the other con- 
stituent, and here we have another piece of highly 
Samuelsson 87 
useful linguistic theory incorporated into a statis- 
tical language model: the grammatical notion of a 
syntactic head. The idea here is to propagate up the 
lexical head to use (amongst other things) lexical 
collocation statistics on the dependency level to de- 
termine the constituent boundaries and attachment 
preferences. 
Collins (6; 7) followed up on these ideas and added 
further elegance to the scheme by instead generat- 
ing the head daughter first, and then the rest of the 
daughters as two zero-order Markov processes, one 
going left and one going right from it. He also man- 
aged to adapt essentially the standard SCFG pars- 
ing scheme to his model, thus allowing polynomial 
processing time. It is interesting to note that al- 
though the conditioning of the probabilities are top- 
down, parsing is performed bottom-up, just as is 
the case with SCPGs. This allows him to condi- 
tion his probabilities on the word string dominated 
by the constituent, which he does in terms of a dis- 
tance between the head constituent and the current 
one being generated. This in turn makes it possible 
to let phrase-boundary indicators such as punctua- 
tion marks influence the probabilities, and gives the 
model the chance to infer preferences for, e.g., right 
association. 
In addition to this, Collins incorporated the no- 
tion of lexical complements and wh-movement ~ la 
Generalized Phrase-Structure Grammar (GPSG) (8) 
into his probabilistic language model. The former is 
done by knocking off complements from a hypoth- 
esised complement list as the Markov chain of the 
siblings of the head constituent are generated. The 
latter is achieved by adding hypothesised NP gaps 
to these lists, requiring that they be either matched 
against an NP on the complement list, or passed on 
to one of the sibling constituents or the head con- 
stituent itself, thus mimicking the behavior of the 
"slash feature" used in GPSG. The model learns the 
probabilities for these rather sophisticated deriva- 
tion actions under various conditionings. Not bad 
for something that started out as a simple SCFG! 
6 A Non-Derivational Model 
The Constraint Grammar framework (9) introduced 
by Fred Karlsson and championed by Atro Vouti- 
lainen is a grammar formalism without derivations. 
It's not even constructive, but actually rather de- 
structive. In fact, most of it is concerned with de- 
stroying hypotheses. Of course, you first have to 
have some hypotheses if you are going to destroy 
them, so there are a few components whose task it 
is to generate hypotheses. The first one is a lexicon, 
which assigns a set of possible morphological read- 
Linguistic Theory 

References
L. E. Baum. 1972. An inequality and assiciated maximization technique in statistical estimation of probabilistic functions of Markov processes. Inequalities 3.
Ezra Black, Fred Jelinek, John Lafferty, David M. Magerman, Robert Mercer, and Salim Roukos. 1993. Towards History-based grammars: using richer models for probabilistic parsing. ACL.
Rens Bod. 1995. Enriching linguistics with statistics: performance models of natural language. ILLC Dissertation Series.
T. Booth and R. Thompson. 1973. Applying probability measures to abstract languages. IEEE Transactions on Computers.
Kenneth W. Church. 1998. A stochastic parts program and noun phrase parser for unrestricted text. ANLP.
Michael Collins. 1996. A new statistical parser based on bigram lexical dependencies. ACL.
Michael Collins. 1997. Three Generative lexicalized models for statistical parsing. ACL.
Gerald Gazdar, Ewan Klein, Geoffrey Pullum, and Ivan Sag. 1985. Generalized phrase structure grammar. Harvard Univ. Press.
Fred Karlsson, Atro Voutilainen, Juha Heikkila and Arto Anttila, eds. 1995. Constraint grammar. A language-independent system for parsing unrestricted text. Mouton de Gruyter.
David M. Magerman. 1995. Statistical decision-tree models for parsing. ACL.
Mitchell P. Marcus, Beatrice Santorini and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank". Computational Linguistics 19(2), pp. 313-330. ACL. 
\[12\] Fernando Pereira. 1985. "A New Character- 
ization of Attachment Preferences". In Natural 
Language Parsing, pp. 307-319. Cambridge Uni- 
versity Press. 
\[13\] Christer Samuelsson and Atro Voutilainen. 
1997. "Comparing a Linguistic and a Stochas- 
tic Tagger". In Procs. Joint 85th Annual Meet- 
ing of the Association for Computational Linguis- 
tics and 8th Conference of the European Chapter 
of the Association for Computational Linguistics, 
pp. 246-253. ACL. 
\[14\] Christer Samuelsson, Pasi Tapanainen and 
Atro Voutilainen. 1996. "Inducing Constraint 
Grammars". In Grammatical Inference: Learn- 
ing Syntaz from Sentences, pp. 146-155, Springer 
Verlag. 
\[15\] Khalil Sima'an. 1996. "Computational Corn- 
plexity of Probabilistic Disambiguations by means 
of Tree-Grammars". In Procs. 16th International 
Conference on Computational Linguistics, at the 
very end. ICCL. 
\[16\] Pasi Tapanainen. 1996. The Constraint Gram- 
mar Parser CG-~. Publ. 27, Dept. General Lin- 
guistics, University of Helsinki. 
\[17\] David H. Younger 1967. "Recognition and 
Parsing of Context-Free Languages in Time n 3". 
In Information and Control 10(2), pp. 189-208. 
Samuebson 89 Linguistic Theory 
