Parsing with Principles and Probabilities 
Andrew Fordham 
SCS Research Group 
Department of Sociology 
University of Surrey 
Guildford 
Surrey, GU2 5XH, UK 
ajf@soc.surrey.ac.uk
Abstract 
This paper is an attempt to bring together two ap- 
proaches to language analysis. The possible use of prob- 
abilistic information in principle-based grammars and 
parsers is considered, including discussion on some the- 
oretical and computational problems that arise. Finally 
a partial implementation of these ideas is presented, 
along with some preliminary results from testing on a 
small set of sentences. 
Introduction 
Both principle-based parsing and probabilistic methods 
for the analysis of natural language have become pop- 
ular in the last decade. While the former borrows from 
advanced linguistic specifications of syntax, the latter 
has been more concerned with extracting distributional 
regularities from language to aid the implementation of 
NLP systems and the analysis of corpora. 
These symbolic and statistical approaches are begin- 
ning to draw together as it becomes clear that one can- 
not exist entirely without the other: the knowledge of 
language posited over the years by theoretical linguists 
has been useful in constraining and guiding statistical 
approaches, and the corpora now available to linguists 
have resurrected the desire to account for real language 
data in a more principled way than had previously been 
attempted. 
This paper falls directly between these approaches, 
using statistical information derived from corpora anal- 
ysis to weight syntactic analyses produced by a 'prin- 
ciples and parameters' parser. The use of probabilistic 
information in principle-based grammars and parsers 
is considered, including discussion on some theoretical 
and computational problems that arise. Finally a par- 
tial implementation of these ideas is presented, along 
with some preliminary results from testing on a small 
set of sentences. 
Government-Binding Theory 
The principles and parameters paradigm in linguistics is 
most fully realised in the Government-Binding Theory 
(GB) of Chomsky \[Chomsky1981, Chomsky1986\] and 
others. The grammar is divided into modules which 
Matthew Crocker 
Centre for Cognitive Science 
University of Edinburgh 
2 Buccleuch Place 
Edinburgh, EH8 9LW 
Scotland 
mwc@cogsci.ed.ac.uk 
filter out ungrammatical structures at the various levels 
of representation; these levels are related by general 
transformations. A sketch of the organisation of GB 
(the 'T-model') is shown in figure 1. 

Figure 1: The T-model of grammar (D-Structure: X'-theory, lexical insertion, θ-criterion; S-Structure: Case Theory, Subjacency; levels related by movement (move-α); Logical Form: Empty Category Principle, Binding Theory; Phonetic Form) 
Little work has been done on the complexity of al- 
gorithms used to parse with a principle-based gram- 
mar, since such grammars do not exist as accepted 
mathematically well-defined constructs. It has been 
estimated that in general, principle-based parsing can 
only be accomplished in exponential time, i.e. O(2^n) 
\[Berwick and Weinberg1984, Weinberg1988\]. 
A feature of principle-based grammars is their po- 
tential to assign some meaningful representation to 
a string which is strictly ungrammatical. It is an 
inherent feature of phrase structure grammars that 
they classify the strings of words from a language 
into two (infinite) sets, one containing the grammat- 
ical strings and the other the ungrammatical strings. 
Although attempts have been made to modify PS gram- 
mars/parsers to cope with extragrammatical input, 
e.g. \[Carbonell and Hayes1983, Douglas and Dale1992, 
Jensen et al.1983, Mellish1989\], this is a feature which 
has to be 'added on' and tends to affect the statement 
of the grammar. 
Due to the lack of an accepted formalism for the 
specification of principle-based grammars, Crocker and 
Lewin \[Crocker and Lewin1992\] define the declarative 
'Proper Branch' formalism, which can be used with a 
number of different parsing methods. 
A proper branch is a set of three nodes -- a mother 
and two daughters -- which are constructed by the 
parser, using a simple mechanism such as a shift-reduce 
interpreter, and then 'licensed' by the principles of 
grammar. A complete phrase marker of the input string 
can then be constructed by following the manner in 
which the mother node from one proper branch is used 
as a daughter node in a dominating proper branch. 
Each proper branch is a binary branching struc- 
ture, and so all grammatical constraints will need to 
be encoded locally. Crocker \[Crocker1992\] develops 
"a 'representational' reformulation of the transforma- 
tional model which decomposes syntactic analysis into 
several representation types -- including phrase struc- 
ture, chains, and coindexation -- allowing one to main- 
tain the strictly local characterisation of principles 
with respect to their relevant representation types," 
\[Crocker and Lewin1992, p. 511\]. 
By using the proper branch method of axiomatis- 
ing the grammar, the structure building section of the 
parser is only constrained in that it must produce 
proper branches; it is therefore possible to experiment 
with different interpreters (i.e. structure proposing en- 
gines) while keeping the grammar constant. 
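The division of labour just described -- a structure-proposing interpreter whose output is filtered by locally stated principles -- can be sketched as follows. This is an illustrative Python sketch, not the original implementation; the node features and the toy X-bar predicate are invented for exposition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    category: str   # e.g. 'V', 'N', 'I'
    bar: int        # projection level: 0 (head), 1, 2 (maximal)

@dataclass(frozen=True)
class ProperBranch:
    mother: Node    # a mother and two daughters, proposed by the interpreter
    left: Node
    right: Node

def xbar_licenses(b: ProperBranch) -> bool:
    # Toy local constraint: the mother must be a projection of exactly
    # one daughter, at the same or the next bar level up.
    heads = [d for d in (b.left, b.right) if d.category == b.mother.category]
    return len(heads) == 1 and b.mother.bar - heads[0].bar in (0, 1)

# Each grammar module contributes a local predicate over single branches;
# a proposed branch survives only if every principle licenses it.
PRINCIPLES = [xbar_licenses]

def licensed(b: ProperBranch) -> bool:
    return all(p(b) for p in PRINCIPLES)
```

Because the interpreter is constrained only to produce proper branches, a different structure-proposing engine can be substituted without touching the list of principles.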
The Grammar and Parser 
A small principle-based parser was built, follow- 
ing the proper branch formalism developed in 
\[Crocker and Lewin1992\]. Although the grammar is 
very limited, the use of probabilities in ranking the 
parser's output can be seen as a first step towards im- 
plementing a principle-based parser using a more fully 
specified collection of grammar modules. 
The grammar is loosely based on three modules 
taken from Government-Binding Theory -- X-bar the- 
ory, Theta Theory and Case Theory. Although these 
embody the spirit of the constraints found in Chom- 
sky \[Chomsky1981\] they are not intended to be entirely 
faithful to this specification of syntactic theory. There 
is also only a single level of representation (which is 
explicitly constructed for output purposes but not con- 
sulted by the parser). This representation is interpreted 
as S-structure. 
Explanations of the knowledge contained within each 
grammar principle is given in the following sections. 
Theory 
X-bar Theory uses a set of schemata to license local 
subtrees. We use a parametrised version of the X-bar 
schemata, similar to that of Muysken \[Muysken1983\], 
but employing features which relate to the state of the 
head word's theta grid to give five schemata (figure 2). 

Figure 2: The X-bar Schemata (five parametrised binary-branching schemata over category, SPEC and COMP feature bundles) 

A node includes the following features (among others): 
1. Category: the standard category names are em- 
ployed. 
2. Specifier (SPEC): this feature specifies whether the 
word at the head of the phrase being built requires a 
specifier. 
3. Complement (COMP): the complement feature is re- 
dundant in that the information used to derive its 
value is already present in a word's theta grid, and 
will therefore be checked for well-formedness by the 
theta criterion. Since this information is not refer- 
enced until later, the COMP feature is used to limit 
the number of superfluous proper-branches generated 
by the parser. 
4. The head (i.e. lexical item) of a node is carried on 
each projection of that node along with its theta grid. 
The probabilities for occurrences of the X-bar schema 
were obtained from sentences from the preliminary 
Penn Treebank corpus of the Wall Street Journal, cho- 
sen because of their length and the head of their verb 
phrase (i.e. the main verbs were all from the set for 
which theta role data was obtained); the examples were 
manually parsed by the authors. 
The probabilities were calculated using the following 
equation, where X_a → Y_b Z_c is a specific schema, X 
is the set of X-bar schemata, and A, B and C 
are variables over category, SPEC and COMP feature 
bundles: 

P(X_a → Y_b Z_c | X) = C(X_a → Y_b Z_c) / C(A → B C)    (1) 
This differs from the manner in which probabilities are 
collected for stochastic context-free grammars, where 
the identity of the mother node is taken into account, 
as in the equation below: 

P(X_a → Y_b Z_c | X_a) = C(X_a → Y_b Z_c) / C(X_a → B C)    (2) 
This would result in misleading probabilities for the X- 
bar schemata since the use of schemata (3), (4), and 
(5) would immediately bring down the probability of 
a parse compared to a parse of the same string which 
happened to use only (1) and (2).* 
*The probabilities for (1) and (2) would be 1 as they have 
unique mothers. 
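The contrast between the two normalisations can be made concrete with a toy count table. In this illustrative Python sketch each observed branch is a (mother, left, right) triple of feature bundles; the triples and their counts are invented, not drawn from the treebank data.

```python
from collections import Counter

# Invented observations: two uses of a uniquely-mothered schema and
# one use each of two schemata sharing the mother 'Xs-'.
branches = [
    ('X=', 'Y=', 'Xs+'), ('X=', 'Y=', 'Xs+'),
    ('Xs-', 'Xs-', 'Y='), ('Xs-', 'Y=', 'Xs-'),
]

counts = Counter(branches)
total = len(branches)                            # C(A -> B C)
by_mother = Counter(m for m, _, _ in branches)   # C(X_a -> B C)

def p_schema(schema):
    # Equation (1): normalise over all branches, ignoring mother identity.
    return counts[schema] / total

def p_scfg(schema):
    # Equation (2): SCFG-style, normalise within the mother's expansions.
    return counts[schema] / by_mother[schema[0]]
```

Under equation (2) the uniquely-mothered schema gets probability 1 (`p_scfg(('X=', 'Y=', 'Xs+'))` is 1.0, as the footnote observes), whereas equation (1) gives it 0.5, so schemata sharing a mother are not automatically penalised.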
The overall (X-bar) likelihood of a parse can then be 
computed by multiplying together all the probabilities 
obtained from each application of the schemata, in a 
manner analogous to that used to obtain the probabil- 
ity of a phrase marker generated by an SCFG. Using 
the schemata in this way suggests that the building of 
structure is category independent, i.e. it is just as likely 
that a verb will have a (filled) specifier position as it is 
for a noun. The work on stochastic context-free gram- 
mars suggests a different set of results, in that the spe- 
cific categories involved in expansions are all important. 
While SCFGs will tend to deny that all categories ex- 
pand in certain ways with the same probabilities, they 
make this claim while using a homogeneous grammar 
formalism. When a more modular theory is employed, 
the source of the supposedly category specific informa- 
tion is not as obvious. The use of lexical probabilities 
on specifier and complement co-occurrence with specific 
heads (i.e. lexical items) could exhibit properties that 
appear to be category specific, but are in fact caused 
by common properties which are shared by lexical items 
of the same category. 2 Since it can be argued that the 
probabilistic information on lexical items will be needed 
independently, there is no need to use category specific 
information in assigning probabilities to syntactic con- 
figurations. 
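The computation of a parse's X-bar likelihood as a product over schema applications can be sketched as follows. The schema probabilities below are hypothetical placeholders (the real figures would come from the treebank counts); the schema numbers refer to figure 2.

```python
from math import prod

# Hypothetical, category-independent schema probabilities.
schema_p = {1: 0.35, 2: 0.30, 3: 0.15, 4: 0.12, 5: 0.08}

def xbar_likelihood(applications):
    """Multiply the probability of each schema application in a parse,
    as for the probability of an SCFG-generated phrase marker."""
    return prod(schema_p[s] for s in applications)
```

Note that the same figure applies whether the head is a verb or a noun; e.g. a parse built from schemata (1) and (2) alone, `xbar_likelihood([1, 2])`, outscores one of the same length that needs schema (3).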
Theta Theory 
Theta theory is concerned with the assignment of an 
argument structure to a sentence. A verb has a number 
of thematic (or 'theta') roles which must be assigned 
to its arguments, e.g. a transitive verb has one theta 
role to 'discharge' which must be assigned to an NP. 
If a binary branching formalism is employed, or in- 
deed any formalism where the arguments of an item 
and the item itself are not necessarily all sisters, the 
problem of when to access the probability of a theta 
application is presented. The easiest method of obtain- 
ing and applying theta probabilities will be with refer- 
ence to whole theta grids. Each theta grid for a word 
will be assigned a probability which is not dependent 
on any particular items in the grid, but rather on the 
occurrence of the theta grid as a whole. 
A preliminary version of the Penn Treebank brack- 
eted corpus was analysed to extract information on the 
sisters of particular verbs. Although the Penn Tree- 
bank data is unreliable since it does not always dis- 
tinguish complements from adjuncts, it was the only 
suitable parsed corpus to which the authors had access. 
Although the distinction between complements and ad- 
juncts is a theoretically interesting one, the process of 
determining which constructions fill which functional 
roles in the analysis of real text often creates a number 
of problems (see \[Hindle and Rooth1993\] for discussion 
2It is of course possible to store these cross-item similar- 
ities as lexical rules \[Bresnan1978\], but this alone does not 
entail that the properties are specific to a category, cf. the 
theta grids of verbs and their 'related' nouns. 
on this issue regarding output of the Fidditch parser 
\[Hindle1993\]). 
The probabilities for each of the verbs' theta grids 
were calculated using the equation below, where P(s_i|v) 
is the probability of the theta grid s_i occurring with the 
verb v, (v, s_i) is an occurrence of the items in s_i being 
licensed by v, and S ranges over all theta grids for v: 

P(s_i | v) = C(v, s_i) / C(v, S)    (3) 
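Equation (3) is a relative-frequency estimate, and can be sketched directly from co-occurrence counts. The verbs and grids in this illustrative Python fragment are invented, not taken from the treebank extraction.

```python
from collections import Counter

# Each observation pairs a verb with the theta grid it was seen licensing.
observations = [
    ('give', ('agent', 'theme', 'goal')),
    ('give', ('agent', 'theme', 'goal')),
    ('give', ('agent', 'theme')),
    ('sleep', ('agent',)),
]

grid_counts = Counter(observations)                # C(v, s_i)
verb_counts = Counter(v for v, _ in observations)  # C(v, S)

def p_grid(grid, verb):
    # Probability of the grid as a whole, independent of the items in it.
    return grid_counts[(verb, grid)] / verb_counts[verb]
```

Because the probability attaches to the grid as a whole, it can be looked up once per head word rather than at each point where an argument attaches.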
Case Theory 
In its simplest form, Case theory invokes the Case filter 
to ensure that all noun phrases in a parse are assigned 
(abstract) case. Case theory differs from both X-bar 
and Theta theory in that it is category specific: only 
NPs require, or indeed can be assigned, abstract case. 
If we are to implement a probabilistic version of a mod- 
ular grammar theory incorporating a Case component, 
a relevant question is: are there multiple ways of as- 
signing Case to noun phrases in a sentence? i.e. can 
ambiguity arise due to the presence of two candidate 
Case assigners? 
Case theory suggests that the answer to this is neg- 
ative, since Case assignment is linked to theta theory 
via visibility, and it is not possible for an NP to receive 
more than one theta role. As a result, the use of Case 
probabilities in a parser would be at best unimportant, 
since some form of ambiguity is needed in the module, 
i.e. it is possible to satisfy the Case filter in more than 
one way, for probabilities associated with the module 
to be of any use. While having a provision for using 
probabilities deduced from Case information, the im- 
plemented parser does not in fact use Case in its parse 
ranking operations. 
Local Calculation 
The use of a heterogeneous grammar formalism and 
multiple probabilities invokes the problem of their com- 
bination. There are at least two ways in which each 
mother's probabilities can be calculated; firstly, the 
probability information of the same type can be used: 
the daughters' X-bar probabilities alone could be used 
in calculating the mother's X-bar probability. Alterna- 
tively, a combination of some or all of the daughters' 
probability features could be employed, thus making, 
e.g., the X-bar probability of the mother dependent 
upon all the stochastic information from the daughters, 
including theta and Case probabilities, etc. 
The need for a method of combining the daughter 
probabilities into a useful figure for the calculation of 
the mother probabilities is likely to involve trial and er- 
ror, since theory thus far has had nothing to say on the 
subject. The former method, using only the relevant 
daughter probabilities, therefore seems to be the most 
fruitful path to follow at the outset, since it does not 
require a way of integrating probabilities from differ- 
ent modules while the parse is in progress, nor is it as 
computationally expensive. 
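The first method -- each module's figure on the mother computed from the daughters' figures for that module alone -- can be sketched as below. The field names and the multiplicative combination are illustrative assumptions, not a specification of the implemented parser.

```python
from dataclasses import dataclass

@dataclass
class NodeProbs:
    xbar: float    # one probability per grammar module, kept separate
    theta: float

def mother_probs(left: NodeProbs, right: NodeProbs,
                 local_xbar: float, local_theta: float) -> NodeProbs:
    # Each module's figure depends only on the daughters' figures for
    # that module, plus the local probability the module contributes
    # at this branch (schema probability, theta grid probability, ...).
    return NodeProbs(
        xbar=left.xbar * right.xbar * local_xbar,
        theta=left.theta * right.theta * local_theta,
    )
```

Keeping the modules separate in this way means no cross-module combination function is needed while the parse is in progress.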
Global Calculation 
The manner in which the global probability is calcu- 
lated will be partly dependent upon the information 
contained in the local probability calculations. 
If the probabilities for partial analyses have been cal- 
culated using only probabilities of the same types from 
the subanalyses -- e.g. X-bar, Theta -- the probabil- 
ities at the top level will have been calculated using 
informationally distinct figures. This has the advan- 
tage of making 'pure' probabilities available, in that 
the X-bar probability will reflect the likelihood of the 
structure alone, and will be 'uncontaminated' by any 
other information. It should then be possible to exper- 
iment with different methods of combining these prob- 
abilities, other than the obvious 'multiplying them to- 
gether' techniques, which could result in one type of 
probability emerging as the most important. 
On the other hand, if probabilities calculated dur- 
ing the parse take all the different types of probabilities 
into account at each calculation -- i.e. the X-bar, theta, 
(~tc. probabilities on daughters are all taken into account 
when calculating the mother's X-bar probability -- the 
probabilities at the top level will not be pure, and a lot 
of the information contained in them will be redundant 
since they will share a large subset of the probabilities 
used in their separate calculations. It will not therefore 
be easy to gain theoretical insight using these statis- 
tics, and their most profitable method of combination 
is likely to be a more haphazard affair than when more 
pure probabilities are used. 
The parser used in testing employed the first method 
and therefore produced separate module probabilities 
for each node. For lack of a better, theoretically mo- 
tivated method for combining these figures, the product 
of the probabilities was taken as the global probability 
for each parse. 
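The global figure used in testing -- the product of the per-module probabilities at the root, used to rank the parses of one string -- can be sketched as follows; the module names and the numbers are illustrative.

```python
def global_probability(modules: dict) -> float:
    # Product of the 'pure' per-module probabilities for one parse.
    score = 1.0
    for p in modules.values():
        score *= p
    return score

# Two candidate parses of the same string, each with its root-level
# module probabilities; the highest product is selected.
parses = [
    {'xbar': 0.04, 'theta': 0.50},
    {'xbar': 0.06, 'theta': 0.20},
]
best = max(parses, key=global_probability)
```

Because the module figures are kept pure up to the root, the product here could later be replaced by a weighted combination without recomputing the parse.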
Testing the Parser 
The parser was tested using sixteen sentences contain- 
ing verbs for which data had been collected from the 
Penn Treebank corpus. The sentences were created by 
the authors to exhibit at least a degree of ambiguity 
when it came to attaching a post-verbal phrase as an 
adjunct or a complement. In order to force the choice of 
the 'best' parse on to the verb, the probabilities of theta 
grids for nouns, prepositions, etc. were kept constant. 
Of these 16 highest ranked parses, 7 are the expected 
parse, with the other 9 exhibiting some form of mis- 
attachment. The fact that each string received multi- 
ple parses (the mean number of analyses being 9.135, 
and the median, 6) suggests that the probabilistic in- 
formation did favourably guide the selection of a single 
analysis. 
It is not really possible to say from these results how 
successful the whole approach of probabilistic principle- 
based parsing would be if it were fully implemented. 
The inconclusive nature of the results obtained was due 
to a number of limiting factors of the implementation 
including the simplicity of the grammar and the lack of 
available data. 
Discussion 
Limitations of the Grammar 
The grammar employed is a partial characterisation of 
Chomsky's Government-Binding theory \[Chomsky1981, 
Chomsky1986\] and only takes account of very local con- 
straints (i.e. X-bar, Theta and Case); a way of encod- 
ing all constraints in the proper branch formalism (e.g. 
\[Crocker1992\]) will be needed before a grammar of suf- 
ficient coverage to be useful in corpora analysis can be 
formulated. The problem with using results obtained 
from the implementation given here is that the gram- 
mar is so underspecified that it leaves too great 
a task for the probabilistic information. 
This approach could be viewed as putting the cart be- 
fore the horse; the usefulness of stochastic information 
in parsers presumes that a certain level of accuracy can 
be achieved by the grammar alone. While GB is an el- 
egant theory of cognitive syntax, it has yet to be shown 
that such a modular characterisation can be successfully 
employed in corpus analysis. 
Statistical Data and their Source 
The use of the preliminary Penn Treebank corpus for 
the extraction of probabilities used in the implementa- 
tion above was a choice forced by lack of suitable mate- 
rials. There are still very few parsed corpora available, 
and none that contain information which is specified to 
the level required by, e.g., a GB grammar. While this 
is not an absolute limitation, in that it is theoretically 
possible to extract this information manually or semi- 
automatically from a corpus, time constraints entailed 
the rejection of this approach. 
It would be ultimately desirable if the use of probabil- 
ities in principle-based parsing could be used to mirror 
the way that a syntactic theory such as Government- 
Binding handles constructions -- various modules of 
the grammar conspire to rule out illegal structures or 
derivations. It would be an elegant result if a construc- 
tion such as the passive were to use probabilities for 
chains, Case assignment etc. to select a parse that re- 
flected the lexical changes that had been undergone, e.g. 
the greater likelihood of an NP featuring in the verb's 
theta grid. It is this property of a number of modules 
working hand in hand that needs to be carried over into 
the probabilistic domain. 
The objections that linguists once held against sta- 
tistical methods are disappearing slowly, partly due to 
results in corpora analysis that show the inadequacy of 
linguistic theory when applied to naturally occurring 
data. It is also the case that the rise of the connection- 
ist phoenix has brought the idea of weighted (though 
not strictly probabilistic) functions of cognition back to 
the fore, freeing the hands of linguists who believe that 
while an explanatorily adequate theory of grammar is 
an elegant construct, its human implementation, and its 
usage in computational linguistics, may not be straight- 
forward. This paper has hopefully shown that an in- 
tegration of statistical methods and current linguistic 
theory is a goal worth pursuing. 

References 
\[Berwick and Weinberg1984\] Robert Cregar Berwick 
and Amy S. Weinberg. 1984. The Grammatical Basis 
of Linguistic Performance: Language Use and Ac- 
quisition. MIT Press. 
\[Bresnan1978\] Joan. W. Bresnan. 1978. A realistic 
transformational grammar. In M. Halle, J. Bresnan, 
and G. Miller, editors, Linguistic Theory and Psy- 
chological Reality. MIT Press, Cambridge, MA. 
\[Carbonell and Hayes1983\] Jaime G. Carbonell and 
Philip J. Hayes. 1983. Recovery strategies for pars- 
ing extragrammatical language. American Journal 
of Computational Linguistics, 9(3-4):123-146, July- 
December. 
\[Chomsky1981\] Noam Chomsky. 1981. Lectures on 
Government and Binding. Studies in Generative 
Grammar No. 9. Foris, Dordrecht. 
\[Chomsky1986\] Noam Chomsky. 1986. Knowledge of 
Language: Its Nature, Origin, and Use. Convergence. 
Praeger, New York. 
\[Crocker and Lewin1992\] Matthew Walter Crocker and 
Ian Lewin. 1992. Parsing as deduction: Rules versus 
principles. In B. Neumann, editor, ECAI 92, 10th 
European Conference on Artificial Intelligence, pages 
508-512. John Wiley and Sons, Ltd. 
\[Crocker1992\] Matthew Walter Crocker. 1992. A Log- 
ical Model of Competence and Performance in the 
Human Sentence Processor. Ph.D. thesis, Dept. Ar- 
tificial Intelligence, University of Edinburgh. 
\[Douglas and Dale1992\] Shona Douglas and Robert 
Dale. 1992. Towards robust PATR. In Ch. Boitet, ed- 
itor, COLING-92, Proceedings of the fifteenth Inter- 
national Conference on Computational Linguistics, 
pages 468-474. 
\[Hindle and Rooth1993\] Donald Hindle 
and Mats Rooth. 1993. Structural ambiguity and 
lexical relations. Computational Linguistics, 19(1). 
\[Hindle1993\] Donald Hindle. 1993. A parser for text 
corpora. In B. T. S. Atkins and A. Zampolli, editors, 
Computational Approaches to the Lexicon. 
\[Jensen et al.1983\] K. Jensen, G. E. Heidborn, L. A. 
Miller, and Y. Ravin. 1983. Parse fitting and prose 
fixing: Getting a hold on ill-formedness. American 
Journal of Computational Linguistics, 9(3-4):147- 
160, July-December. 
\[Mellish1989\] Christopher S. Mellish. 1989. Some 
chart-based techniques for parsing ill-formed input. 
In Proceedings of the 27th Annual Meeting of the As- 
sociation for Computational Linguistics, pages 102- 
109. 
\[Muysken1983\] Pieter Muysken. 1983. Parametrizing 
the notion of head. Journal of Linguistic Research, 
2:57-76. 
\[Weinberg1988\] Amy S. Weinberg. 1988. Mathematical 
properties of grammars. In Frederick J. Newmeyer, 
editor, Linguistics: the Cambridge Survey, Vol. 1, 
Linguistics Theory: Foundations, chapter 15, pages 
415-429. Cambridge University Press. 
