An Unsupervised Model for Statistically Determining Coordinate 
Phrase Attachment 
Miriam Goldberg 
Central High School & 
Dept. of Computer and Information Science 
200 South 33rd Street 
Philadelphia, PA 19104-6389 
University of Pennsylvania 
miriamgOunagi, cis. upenn, edu 
Abstract 
This paper examines the use of an unsuper- 
vised statistical model for determining the at- 
tachment of ambiguous coordinate phrases (CP) 
of the form nl p n2 cc n3. The model pre- 
sented here is based on JAR98\], an unsupervised 
model for determining prepositional phrase at- 
tachment. After training on unannotated 1988 
Wall Street Journal text, the model performs 
at 72% accuracy on a development set from 
sections 14 through 19 of the WSJ TreeBank 
\[MSM93\]. 
1 Introduction 
The coordinate phrase (CP) is a source of struc- 
tural ambiguity in natural language. For exam- 
ple, take the phrase: 
box of chocolates and roses 
'Roses' attaches either high to 'box' or low to 
'chocolates'. In this case, attachment is high, 
yielding: 
H-attach: ((box (of chocolates)) (and roses)) 
Consider, then, the phrase: 
salad of lettuce and tomatoes 
'Lettuce' attaches low to 'tomatoes', giving: 
L-attach: (salad (of ((lettuce) and 
(tomatoes))) 
\[AR98\] models. In addition to these, a corpus- 
based model for PP-attachment \[SN97\] has been 
reported that uses information from a semantic 
dictionary. 
Sparse data can be a major concern in corpus- 
based disambiguation. Supervised models are 
limited by the amount of annotated data avail- 
able for training. Such a model is useful only 
for languages in which annotated corpora are 
available. Because an unsupervised model does 
not rely on such corpora it may be modified for 
use in multiple languages as in \[AR98\]. 
The unsupervised model presented here 
trains from an unannotated version of the 1988 
Wall Street Journal. After tagging and chunk- 
ing the text, a rough heuristic is then employed 
to pick out training examples. This results in 
a training set that is less accurate, but much 
larger, than currently existing annotated cor- 
pora. It is the goal, then, of unsupervised train- 
ing data to be abundant in order to offset its 
noisiness. 
2 Background 
The statistical model must determine the prob- 
ability of a given CP attaching either high (H) 
or low (L), p( attachment I phrase). Results 
shown come from a development corpus of 500 
phrases of extracted head word tuples from the 
WSJ TreeBank \[MSM93\]. 64% of these phrases 
attach low and 36% attach high. After further 
development, final testing will be done on a sep- 
arate corpus. 
The phrase: 
Previous work has used corpus-based ap- 
proaches to solve the similar problem of prepo- 
sitional phrase attachment. These have in- 
cluded backed-off \[CB 95\], maximum entropy 
\[RRR94\], rule-based \[HR94\], and unsupervised 
(busloads (of ((executives) and (their wives))) 
gives the 6-tuple: 
L busloads of executives and wives 
610 
where, a = L, nl = busloads, p = of, n2 = 
executives, cc = and, n3 = wives. The CP at- 
tachment model must determine a for all (nl 
p n2 cc n3) sets. The attachment decision is 
correct if it is the same as the corresponding 
decision in the TreeBank set. 
The probability of a CP attaching high is 
conditional on the 5-tuple. The algorithm pre- 
sented in this paper estimates the probability: 
regular expressions that replace noun and quan- 
tifier phrases with their head words. These head 
words were then passed through a set of heuris- 
tics to extract the unambiguous phrases. The 
heuristics to find an unambiguous CP are: 
• wn is a coordinating conjunction (cc) if it 
is tagged cc. 
• w,~_~ is the leftmost noun (nl) if: 
I5 = (a l nl,p, n2, cc, n3) 
The parts of the CP are analogous to those 
of the prepositional phrase (PP) such that 
{nl,n2} - {n,v} and n3 - p. JAR98\] de- 
termines the probability p(v,n,p,a). To be 
consistent, here we determine the probability 
p(nl, n2, n3, a). 
3 Training Data Extraction 
A statistical learning model must train from un- 
ambiguous data. In annotated corpora ambigu- 
ous data are made unambiguous through classi- 
fications made by human annotators. In unan- 
notated corpora the data themselves must be 
unambiguous. Therefore, while this model dis- 
ambiguates CPs of the form (nl p n2 cc n3), it 
trains from implicitly unambiguous CPs of the 
form (n ccn). For example: 
- Wn-x is the first noun to occur within 
4 words to the left of cc. 
-no preposition occurs between this 
noun and cc. 
- no preposition occurs within 4 words 
to the left of this noun. 
• wn+x is the rightmost noun (n2) if: 
- it is the first noun to occur within 4 
words to the right of cc. 
- No preposition occurs between cc and 
this noun. 
The first noun to occur within 4 words to the 
right of cc is always extracted. This is ncc. Such 
nouns are also used in the statistical model. For 
example, the we process the sentence below as 
follows: 
dog and cat 
Because there are only two nouns in the un- 
ambiguous CP, we must redefine its compo- 
nents. The first noun will be referred to as nl. 
It is analogous to nl and n2 in the ambiguous 
CP. The second, terminal noun will be referred 
to as n3. It is analogous to the third noun in 
the ambiguous CP. Hence nl -- dog, cc --- and, 
n3 = cat. In addition to the unambiguous CPs, 
the model also uses any noun that follows acc. 
Such nouns are classified, ncc. 
We extracted 119629 unambiguous CPs and 
325261 nccs from the unannotated 1988 Wall 
Street Journal. First the raw text was fed into 
the part-of-speech tagger described in \[AR96\] 1. 
This was then passed to a simple chunker as 
used in \[AR98\], implemented with two small 
IBecause this tagger trained on annotated data, one 
may argue that the model presented here is not purely 
unsupervised. 
Several firms have also launched busi- 
ness subsidiaries and consulting arms 
specializing in trade, lobbying and 
other areas. 
First it is annotated with parts of speech: 
Several_JJ firms__NNS have_VBP 
also_RB launched_VBN business.aNN 
subsidiaries_NNS and_CC consult- 
ing_VBG armsANNS specializing_VBG 
in_IN tradeANN ,_, lobbying_NN 
and_CC other_JJ areas_NNS ._. 
From there, it is passed to the chunker yield- 
ing: 
firmsANNS have_VBP also_RB 
launched_VBN subsidiaries_NNS 
and_CC consulting_VBG armsANNS 
specializing_VBG in_IN tradeANN ,_, 
Iobbying_.NN and_CC areas_NNS ._. 
611 
Noun phrase heads of ambiguous and unam- 
biguous CPs are then extracted according to the 
heuristic, giving: 
subsidiaries and arms 
and areas 
where the extracted unambiguous CP is 
{nl = subsidiaries, cc = and, n3 = arms} and 
areas is extracted as a ncc because, although 
it is not part of an unambiguous CP, it occurs 
within four words after a conjunction. 
4 The Statistical Model 
First, we can factor p(a, nl, n2, n3) as follows: 
p(a, nl,n2, n3) = p(nl)p(n2) 
, p(alnl ,n2) 
, p(n3 I a, nl,n2) 
The terms p(nl) and p(n2) are independent 
of the attachment and need not be computed. 
The other two terms are more problematic. Be- 
cause the training phrases are unambiguous and 
of the form (nl cc n2), nl and n2 of the CP 
in question never appear together in the train- 
ing data. To compensate we use the following 
heuristic as in JAR98\]. Let the random variable 
¢ range over (true, false} and let it denote the 
presence or absence of any n3 that unambigu- 
ously attaches to the nl or n2 in question. If 
¢ = true when any n3 unambiguously attaches 
to nl, then p(¢ = true \[ nl) is the conditional 
probability that a particular nl occurs with an 
unambiguously attached n3. Now p(a I nl,n2) 
can be approximated as: 
p(a = H lnl, n2) p(true l nl) Z(nl,n2) 
p(a = L \[nl,n2) ~ p(true In2) 
" Z(nl, n2) 
where the normalization factor, Z(nl,n2) = 
p(true I nl) + p(true I n2). The reasoning be- 
hind this approximation is that the tendency of 
a CP to attach high (low) is related to the ten- 
dency of the nl (n2) in question to appear in 
an unambiguous CP in the training data. 
We approximate p(n3la, nl, n2) as follows: 
p(n3 I a = H, nl, n2) ~ p(n3 I true, nl) 
p(n3 I a = L, nl, n2) ~ p(n3 I true, n2) 
The reasoning behind this approximation is 
that when generating n3 given high (low) at- 
tachment, the only counts from the training 
data that matter are those which unambigu- 
ously attach to nl (n2), i.e., ¢ = true. Word 
statistics from the extracted CPs are used to 
formulate these probabilities. 
4.1 Generate ¢ 
The conditional probabilities p(truelnl) and 
p(true I n2) denote the probability of whether 
a noun will appear attached unambiguously to 
some n3. These probabilities are estimated as: 
{ $(.~1,true) iff(nl,true) >0 f(nl) 
p(truelnl) = .5 otherwise 
{ /(n2,~r~,e) if f(n2, true)> 0 /(n2) 
p(true\[n2) = .5 otherwise 
where f(n2, true) is the number of times n2 
appears in an unambiguously attached CP in 
the training data and f(n2) is the number of 
times this noun has appeared as either nl, n3, 
or ncc in the training data. 
4.2 Generate n3 
The terms p(n3 I nl, true) and p(n3 I n2, true) 
denote the probabilies that the noun n3 appears 
attached unambiguously to nl and n2 respec- 
tively. Bigram counts axe used to compute these 
as follows: 
f(nl,n3,true) p(n3 \[ true, nl) = l\](nl, TM) if I(nl,n3,true)>O 
otherwise 
f(n2,n3,true) p(n3 l true, n2) 
= 11(n2, TM) if f(n2,n3,true)>O 
otherwise 
where N is the set of all n3s and nets that 
occur in the training data. 
5 Results 
Decisions were deemed correct if they agreed 
with the decision in the corresponding Tree- 
Bank data. The correct attachment was chosen 
612 
72% of the time on the 500-phrase development 
corpus from the WSJ TreeBank. Because it is 
a forced binary decision, there are no measure- 
ments for recall or precision. If low attachment 
is always chosen, the accuracy is 64%. After fur- 
ther development the model will be tested on a 
testing corpus. 
When evaluating the effectiveness of an un- 
supervised model, it is helpful to compare its 
performance to that of an analogous supervised 
model. The smaller the error reduction when 
going from unsupervised to supervised models, 
the more comparable the unsupervised model 
is to its supervised counterpart. To our knowl- 
edge there has been very little if any work in the 
area of ambiguous CPs. In addition to develop- 
ing an unsupervised CP disambiguation model, 
In \[MG, in prep\] we have developed two super- 
vised models (one backed-off and one maximum 
entropy) for determining CP attachment. The 
backed-off model, closely based on \[CB95\] per- 
forms at 75.6% accuracy. The reduction error 
from the unsupervised model presented here to 
the backed-off model is 13%. This is compa- 
rable to the 14.3% error reduction found when 
going from JAR98\] to \[CB95\]. 
It is interesting to note that after reducing 
the volume of training data by half there was 
no drop in accuracy. In fact, accuracy remained 
exactly the same as the volume of data was in- 
creased from half to full. The backed-off model 
in \[MG, in prep\] trained on only 1380 train- 
ing phrases. The training corpus used in the 
study presented here consisted of 119629 train- 
ing phrases. Reducing this figure by half is not 
overly significant. 
6 Discussion 
In an effort to make the heuristic concise and 
portable, we may have oversimplified it thereby 
negatively affecting the performance of the 
model. For example, when the heuristic came 
upon a noun phrase consisting of more than one 
consecutive noun the noun closest to the cc was 
extracted. In a phrase like coffee and rhubarb 
apple pie the heuristic would chose rhubarb as 
the n3 when clearly pie should have been cho- 
sen. Also, the heuristic did not check if a prepo- 
sition occurred between either nl and cc or cc 
and n3. Such cases make the CP ambiguous 
thereby invalidating it as an unambiguous train- 
ing example. By including annotated training 
data from the TreeBank set, this model could 
be modified to become a partially-unsupervised 
classifier. 
Because the model presented here is basically 
a straight reimplementation of \[AR98\] it fails to 
take into account attributes that are specific to 
the CP. For example, whereas (nl ce n3) -- (n3 
cc nl), (v p n) ~ (n p v). In other words, there 
is no reason to make the distinction between 
"dog and cat" and "cat and dog." Modifying 
the model accordingly may greatly increase the 
usefulness of the training data. 
7 Acknowledgements 
We thank Mitch Marcus and Dennis Erlick for 
making this research possible, Mike Col\]in.~ for 
his guidance, and Adwait Ratnaparkhi and Ja- 
son Eisner for their helpful insights. 

References 
~\[CB95\] M. Collins, J. Brooks. 1995. Preposi- 
tional Phrase Attachment through a Backed- 
Off Model, A CL 3rd Workshop on Very Large 
Corpora, Pages 27-38, Cambridge, Mas- 
sachusetts, June. 
\[MG, in prep\] M. Goldberg. in preparation. 
Three Models for Statistically Determining 
Coordinate Phrase Attachment. 
\[HR93\] D. Hindle, M. Rooth. 1993. Structural 
Ambiguity and Lexical Relations. Computa- 
tional Linguistics, 19(1):103-120. 
\[MSM93\] M. Marcus, B. Santorini and M. 
Marcinkiewicz. 1993. Building a Large Anno- 
tated Corpus of English: the Penn Treebank, 
Computational Linguistics, 19(2):313-330. 
\[RRR94\] A. Ratnaparkhi, J. Reynar and S. 
Roukos. 1994. A Maximum Entropy Model 
for Prepositional Phrase Attachment, In Pro- 
ceedings of the ARPA Workshop on Human 
Language Technology, 1994. 
\[AR96\] A. Ratnaparkhi. 1996. A Maximum En- 
tropy Part-Of-Speech Tagger, In Proceedings 
of the Empirical Methods in Natural Lan- 
guage Processing Conference, May 17-18. 
\[AR98\] A. Ratnaparkhi. 1998. Unsupervised 
Statistical Models for Prepositional Phrase 
Attachment, In Proceedings of the Seven- 
teenth International Conference on Compu- 
tational Linguistics, Aug. 10-14, Montreal, 
Canada. 
\[SN97\] J. Stetina, M. Nagao. 1997. Corpus 
Based PP Attachment Ambiguity Resolution 
with a Semantic Dictionary. In Jou Shou and 
Kenneth Church, editors, Proceedings o\] the 
Fifth Workshop on Very Large Corpora, pages 
66-80, Beijing and Hong Kong, Aug. 18-20. 
