AUTOMATICALLY ACQUIRING PHRASE 
STRUCTURE USING DISTRIBUTIONAL ANALYSIS 
Eric Brill and Mitchell Marcus* 
Department of Computer Science 
University of Pennsylvania 
Philadelphia, Pa. 19104 
brill@unagi.cis.upenn.edu, mitch@unagi.cis.upenn.edu 
ABSTRACT 
In this paper, we present evidence that the acquisition of the 
phrase structure of a natural language is possible without su- 
pervision and with a very small initial grammar. We describe 
a language learner that extracts distributional information 
from a corpus annotated with parts of speech and is able to 
use this extracted information to accurately parse short sen- 
tences. The phrase structure learner is part of an ongoing 
project to determine just how much knowledge of language 
can be learned solely through distributional analysis. 
1. INTRODUCTION 
This paper is an exploration into the possibility of auto- 
matically acquiring the phrase structure of a language. 
We use distributional analysis techniques similar to the 
techniques originally proposed by Zellig Harris \[5\] for 
structural linguists to use as an aid in uncovering the 
structure of a language. Harris intended his techniques 
to be carried out by linguists doing field work, as a 
substitute for what he perceived as unscientific informa- 
tion gathering by linguists at the time. The procedures 
Harris describes are intended to uncover "regularities 
\[...\] in the distributional relations among the features 
of speech in question" (page 5). To use distributional 
analysis to determine empirically whether boy and girl 
are in the same word class, the linguist would need to 
determine whether the two words are licensed to occur 
in the same environments. Harris presented algorithms 
linguists could use to detect distributionally similar en- 
tities. 
Harris did not intend the procedures he proposed to be 
used as a model of child language acquisition or as a tool 
for computerized language learning. This would not be 
feasible because the method Harris describes for deter- 
mining distributional similarity does not seem amenable 
to unsupervised acquisition. One way of determining 
whether boy and girl are in the same word class is to see 
whether it is the case that for all sentences that boy oc- 
curs in, the same sentence with girl substituted for boy 
is an allowable sentence. To do this automatically from 
text, one would need a prohibitively large corpus. This 
lack of sufficient data does not arise in field work because 
the linguist has access to informants, who are in 
effect infinite corpora. If one hears the boy finished the 
homework, the informant can be queried whether the girl 
finished the homework is also permissible. 
*This work was supported by DARPA and AFOSR jointly under 
grant No. AFOSR-90-0066, and by ARO grant No. DAAL 
03-89-C0031 PRI. 
The procedures Harris outlines for the linguist to use to 
discover linguistic structure could be used to automat- 
ically acquire grammatical information if it were possi- 
ble to do away with the need for a human informant. 
It is possible that a variation of these procedures could 
extract information by observing distributional similari- 
ties in a sufficiently large corpus of unparsed text. In an 
earlier paper \[2\], we demonstrated that simple distribu- 
tional analysis over a corpus can lead to the discovery of 
word classes. In this paper, we describe work in which 
we apply distributional analysis in an attempt to auto- 
matically acquire the phrase structure of a language. 
We describe a system which automatically acquires En- 
glish phrase structure, given only the tagged Brown Cor- 
pus \[4\] as input. The system acquires a context-free 
grammar where each rule is assigned a score. Once the 
grammar is learned, it can be used to find and score 
phrase structure analyses of a string of part of speech 
tags. The nonterminal nodes of the resulting phrase 
structure tree are not labelled. The system is able to 
assign a phrase structure analysis consistent with the 
string of part of speech tags with high accuracy. 
There have been several other recent proposals for au- 
tomatic phrase structure acquisition based on statistics 
gathered over large corpora. In \[1, 9\], a statistic based 
on mutual information is used to find phrase boundaries. 
\[11\] defines a function to score the quality of parse trees, 
and then uses simulated annealing to heuristically ex- 
plore the entire space of possible parses for a given sen- 
tence. A number of papers describe results obtained us- 
ing the Inside-Outside algorithm to train a probabilistic 
context-free grammar \[10, 6, 8\]. Below we describe an 
alternate method of phrase structure acquisition. 
2. HOW IT WORKS 
The system automatically acquires a grammar of scored 
context-free rules, where each rule is binary branching. 
Two sources of distributional information are used to 
acquire and score the rules. The score for the rule tagx → 
tagy tagz is a function of: 
1. The distributional similarity of the part of speech 
tag tagx and the pair of tags tagy tagz. 
2. A comparison of the entropy of the environments 
tagy _ and tagy tagz _. The entropy of environment 
tagx _ is a measure of the randomness of the 
distribution of tags occurring immediately after tagx 
in the corpus. 
2.1. Substitutability 
The system is based upon the assumption that if two 
adjacent part of speech tags are distributionally similar 
to some single tag, then it is probable that the two tags 
form a constituent. If tagx is distributionally similar to 
tagy tagz, then tagx can be substituted for tagy tagz in 
many environments. If a single tag is substitutable for 
a pair of adjacent tags, it is highly likely that that pair 
of tags makes up a syntactically significant entity, i.e. a 
phrase. 
For example, words labelled with the tag Pronoun and 
words labelled with the tag pair Determiner Noun are 
distributionally similar. Distributionally, Pronoun can 
occur in almost all environments in which Determiner 
Noun can occur. In the tag sequence Determiner Noun 
Verb, we could discover that Determiner Noun is a con- 
stituent and Noun Verb is not, since no single lexical item 
has distributional behavior similar to the pair of tags 
Noun Verb. Once we know these distributional facts, 
as well as the fact that the single tag Verb and the tag 
pair Pronoun Verb distribute similarly (eat fish :: we 
eat fish), we can find the structure of the tag sequence 
Determiner Noun Verb by recursively substituting single 
part of speech tags for pairs of tags. This would result in 
the structurally correct analysis (ignore the nonterminal labels): 
((Determiner Noun) Verb), with root label Verb 
To carry out the above analysis, we made use of our 
knowledge of the language to determine that the tag 
Pronoun is distributionally similar to (substitutable for) 
the pair of tags Determiner Noun. Unfortunately, the 
system does not have access to such knowledge. How- 
ever, an approximation to this knowledge can be learned. 
For each possible context-free rule tagx → tagy tagz, the 
system assigns a value indicating the distributional similarity 
of tagx to the pair of tags tagy tagz. The measure 
used to compute the similarity of tagx to tagy tagz is 
known as divergence \[7\]. 
Let P1 and P2 be two probability distributions over environments. 
The relative entropy between P1 and P2 is: 
$$D(P_1 \| P_2) = \sum_{x} P_1(x) \log \frac{P_1(x)}{P_2(x)}$$
Relative entropy D(P1||P2) is a measure of the amount of 
extra information beyond P2 needed to describe P1. The 
divergence between P1 and P2 is defined as D(P1||P2) + 
D(P2||P1), and is a measure of how difficult it is to distinguish 
between the two distributions. Two entities will be 
considered to distribute similarly, and therefore be sub- 
stitutable, if the divergence of their probability distribu- 
tions over environments is low. In part, this work is an 
attempt to test the claim that a very local definition of 
environment is sufficient for determining distributional 
similarity. 1 
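The divergence measure defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the additive smoothing constant eps (needed because relative entropy is undefined when one distribution assigns zero probability to an environment the other has seen), the base-2 logarithm, and all function names are choices made for this sketch and are not specified in the paper. The toy counts are invented.

```python
import math
from collections import Counter

def to_distribution(counts, support, eps=1e-6):
    """Turn raw counts into a smoothed probability distribution over `support`.

    ASSUMPTION: additive smoothing with `eps` keeps every probability
    non-zero so that the logarithms below are always defined.
    """
    total = sum(counts.get(x, 0) for x in support) + eps * len(support)
    return {x: (counts.get(x, 0) + eps) / total for x in support}

def relative_entropy(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x))."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p)

def divergence(counts_a, counts_b, eps=1e-6):
    """Symmetric divergence D(Pa || Pb) + D(Pb || Pa) over all seen environments."""
    support = set(counts_a) | set(counts_b)
    p = to_distribution(counts_a, support, eps)
    q = to_distribution(counts_b, support, eps)
    return relative_entropy(p, q) + relative_entropy(q, p)

# Invented environment counts, keyed by (left_word, right_word).
single = Counter({("gave", "a"): 3, ("saw", "in"): 2})
similar_pair = Counter({("gave", "a"): 6, ("saw", "in"): 4})
different_pair = Counter({("the", "of"): 5})
print(divergence(single, similar_pair) < divergence(single, different_pair))
```

A low value means the two entities occur in roughly the same environments with roughly the same relative frequencies, which is the sense of "substitutable" used in this section.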
We will now describe how we can use the distributional 
similarity measure to extract a binary context-free gram- 
mar with scored rules from a corpus. Statistics of the 
following form are collected: 
1. word1 tagx word2 number 
2. word1 tagy tagz word2 number 
where in (1), number is the number of times in the corpus 
the word between words word1 and word2 is tagged with 
tagx, and in (2), number is the number of times that the 
pair of words between word1 and word2 is tagged with 
tagy tagz. For instance, in the Brown Corpus, the part 
of speech tag NP 2 appears between the words gave and 
a three times, and the tags AT NN 3 occur six times in 
this environment. 
1 Evidence that this claim is valid for word class discovery is 
presented in \[1, 2, 3\]. 
2 NP = proper noun. 
3 AT = article, NN = sing. noun. 
From this, we obtain a set of context-free rules tagx → 
tagy tagz, scored by the distributional similarity of tagx 
and tagy tagz. The score given to the rule is the divergence 
between the probability distributions of tagx and 
tagy tagz over environments, where an environment is of 
the form word _ word. 
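As a rough illustration of how the word _ word statistics described above might be gathered, the following Python sketch counts, for every single tag and every adjacent tag pair, the (left word, right word) environments in which it occurs. The input format (a list of sentences, each a list of (word, tag) tuples), the lowercasing of words, the decision to skip positions at sentence edges, and the function name are all assumptions of this sketch. The two-sentence corpus is invented and is not taken from the Brown Corpus.

```python
from collections import Counter, defaultdict

def collect_environment_counts(tagged_sentences):
    """Count the word _ word environments of single tags and adjacent tag pairs.

    Returns two dicts:
      tag            -> Counter of (left_word, right_word)
      (tag_y, tag_z) -> Counter of (left_word, right_word)
    """
    single = defaultdict(Counter)
    pair = defaultdict(Counter)
    for sent in tagged_sentences:
        words = [w.lower() for w, _ in sent]  # ASSUMPTION: case is ignored
        tags = [t for _, t in sent]
        # A single tag at position i sits between words i-1 and i+1.
        for i in range(1, len(sent) - 1):
            single[tags[i]][(words[i - 1], words[i + 1])] += 1
        # A tag pair at positions i, i+1 sits between words i-1 and i+2.
        for i in range(1, len(sent) - 2):
            pair[(tags[i], tags[i + 1])][(words[i - 1], words[i + 2])] += 1
    return single, pair

# Invented toy corpus.
corpus = [
    [("She", "PPS"), ("gave", "VBD"), ("John", "NP"), ("a", "AT"), ("book", "NN")],
    [("She", "PPS"), ("gave", "VBD"), ("the", "AT"), ("boy", "NN"),
     ("a", "AT"), ("book", "NN")],
]
single, pair = collect_environment_counts(corpus)
print(single["NP"][("gave", "a")], pair[("AT", "NN")][("gave", "a")])
```

Normalizing each Counter gives the probability distribution over environments for that tag or tag pair, and the divergence between the distribution for tagx and the distribution for tagy tagz is the score of the rule.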
Below are the five single tags found to be distributionally 
most similar to the pair of tags AT NN, found by measuring 
divergence of distributions over the environments 
word _ word: 
1. NP (Proper Noun) 
2. CD (Number) 
3. NN (Sing. Noun) 
4. NNS (Plural Noun) 
5. PPO (Object Personal Pronoun) 
Of all rules with AT NN on the right hand side, the rule 
NP → AT NN would be given the best score. Below 
are the five tag pairs found to be closest to the single tag 
NP. Of all rules with NP on the left hand side, NP → 
NP NP is given the best score. 
1. NP NP (Robert/NP Snodgrass/NP) 
2. PP$ NN (his/PP$ staff/NN) 
3. NN NNS (city/NN employees/NNS) 
4. NP$ NN (Gladden's/NP$ wife/NN) 
5. AT NN (the/AT man/NN) 
Once the scored context-free grammar is learned, there 
are a number of ways to use that grammar to search for 
the correct phrase structure analysis of a sentence. For 
the results reported at the end of the paper, we used the 
simplest method: find the best set of rules that allow 
the part of speech string to be reduced to a single part 
of speech. The best set is that set of rules whose scores 
sum to the lowest number. In other words, we search for 
the set of rules with the lowest total divergence between 
the pair of tags on the right hand side of the rule and 
the single tag these two tags will be reduced to. The 
structure assigned by this set of rules, ignoring nonterminal 
labels, is output as the structural description of 
the sentence. 
2.2. Adjusting Scores 
The scored CFG described above works fairly well, but 
makes a number of errors. There are a number of cases 
where a phrase is posited when the pair of symbols do 
not really constitute a phrase. For instance, VBD and 
VBD IN 4 have similar distributional behavior. (John 
and Mary kissed/VBD in/IN the car vs. John and 
Mary bought/VBD the car). If we had access to lexical 
information, this would not be a problem. The problem 
results from discarding the lexical items and replacing 
them with their part of speech tags. If we are to 
continue our analysis on part of speech tags, a different 
information source is needed to recognize problematic 
rules such as VBD → VBD IN which are incorrectly 
given a good score. We extract more n-gram statistics, 
this time of the form: 
1. tagx tagy number 
2. tagx tagy tagz number 
which is a file of pairs and triples of part of speech tags 
and the number of times the tag strings occur in the 
corpus. The entropy of the position after tagx in the 
corpus is a measure of how constrained that position is. 
This entropy (H) is computed as: 
$$H(tag_x\ \_) = -\sum_{tag_y \in TagSet} p(tag_y \mid tag_x) \log_2 p(tag_y \mid tag_x)$$
Likewise, we can compute the entropy of the position 
following the pair of tags tagx and tagy. If tagx tagy is 
indeed a constituent, we would expect: 
H(tagx _) < H(tagx tagy _) 
This is because a phrase internal position in a sentence 
is more constrained as to what can follow than a phrase 
boundary position. We can use this information to readjust 
the scores in the grammar. The score of each rule 
of the form tagz → tagx tagy is multiplied by a function 
of Entropy(tagx tagy _) - Entropy(tagx _), to reward 
those rules for which the entropy-based metric indicates 
that they span a true constituent and to penalize those 
involving nonconstituents. For instance, the measure 
Entropy(tagx tagy _) - Entropy(tagx _) has a value of 
1.4 for the pair of tags AT NN 5, and a value of -0.8 for 
the pair of tags VBD IN, the troublesome tag pair mentioned 
above. 
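The entropy comparison can be illustrated with a short Python sketch that estimates the entropy of the position following a context of one or two tags and then takes the difference. The function names and the input format (a list of tag sequences) are assumptions of this sketch, and the toy sequences are invented. The paper does not say which function of the difference is used to rescale rule scores, so that step is left out.

```python
import math
from collections import Counter, defaultdict

def following_tag_entropy(tag_sequences, context_len):
    """Entropy in bits of the tag that follows each context of `context_len` tags.

    Returns a dict mapping a context tuple to H(context _).
    """
    followers = defaultdict(Counter)
    for tags in tag_sequences:
        for i in range(len(tags) - context_len):
            context = tuple(tags[i:i + context_len])
            followers[context][tags[i + context_len]] += 1
    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy

def entropy_difference(tag_sequences, tag_x, tag_y):
    """Entropy(tag_x tag_y _) - Entropy(tag_x _).

    Positive values suggest that tag_x tag_y ends at a phrase boundary.
    Raises KeyError if either context never occurs with a following tag.
    """
    h1 = following_tag_entropy(tag_sequences, 1)
    h2 = following_tag_entropy(tag_sequences, 2)
    return h2[(tag_x, tag_y)] - h1[(tag_x,)]

# Invented tag sequences: AT is always followed by NN, while AT NN is
# followed by several different tags.
sequences = [
    ["AT", "NN", "VBD"],
    ["AT", "NN", "IN"],
    ["AT", "NN", "VBD"],
    ["AT", "NN", "MD"],
]
print(entropy_difference(sequences, "AT", "NN"))
```

In this toy data the position after AT is fully determined (entropy 0), while the position after AT NN is not, so the difference is positive, in the same direction as the AT NN value reported above.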
4 VBD = past verb, IN = preposition. 
5 AT NN = Determiner Noun - a true phrase. 
At this point the learner makes one major mistake on 
short sentences. Sometimes, but not always, the subject 
or some part of the subject is joined to the verb before 
the object is. For example, the system assigns a slightly 
better score to the parse ((PPS VBD) PPO) 6 than to 
the correct parse (PPS (VBD PPO)). To remedy this, 
we need a rule specifying that a matrix verb must join 
with its object before joining with its subject. 
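The search described in Section 2.1 (reduce the tag string to a single tag using the set of rules whose scores sum to the lowest number) can be sketched as a small chart parser. This is a minimal sketch, not the parser used for the reported results: it explores the full space exhaustively with dynamic programming, whereas the system reported here uses a beam search. The rule format, the function name, and the example scores are all hypothetical.

```python
def best_parse(tags, rules):
    """Return (total_score, tree) for the lowest-scoring reduction of `tags`
    to a single tag, or None if the rules cannot reduce the string.

    `rules` maps a right-hand side (tag_y, tag_z) to a list of (tag_x, score)
    pairs; lower scores are better. The tree is a nested tuple of tags with
    nonterminal labels omitted.
    """
    n = len(tags)
    if n == 0:
        return None
    # chart[(i, j)] maps a tag to (best score, tree) for the span tags[i:j].
    chart = {(i, i + 1): {t: (0.0, t)} for i, t in enumerate(tags)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):  # split point between left and right
                for left, (ls, lt) in chart[(i, k)].items():
                    for right, (rs, rt) in chart[(k, j)].items():
                        for parent, score in rules.get((left, right), []):
                            total = ls + rs + score
                            if parent not in cell or total < cell[parent][0]:
                                cell[parent] = (total, (lt, rt))
            chart[(i, j)] = cell
    if not chart[(0, n)]:
        return None
    return min(chart[(0, n)].values(), key=lambda entry: entry[0])

# Hypothetical scores for the Determiner Noun Verb example of Section 2.1.
rules = {
    ("Determiner", "Noun"): [("Pronoun", 0.5)],
    ("Pronoun", "Verb"): [("Verb", 1.0)],
    ("Noun", "Verb"): [("Verb", 4.0)],
    ("Determiner", "Verb"): [("Verb", 4.0)],
}
print(best_parse(["Determiner", "Noun", "Verb"], rules))
```

With these invented scores the bracketing ((Determiner Noun) Verb) costs 1.5 and (Determiner (Noun Verb)) costs 8.0, so the first is returned. A constraint such as "a matrix verb must join with its object before its subject" would have to be added on top of this search.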
3. RESULTS 
After running this learning procedure on the Brown Corpus, 
a grammar of 41,000 rules was acquired. We took a 
subset of these rules (about 7,500), choosing the fifteen 
best scoring rules for each tag pair appearing on the right 
hand side of some rule. 
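The pruning step can be sketched as follows. This is only an illustration: it assumes that a lower score is better (as with raw divergence), and the triple format and function name are hypothetical. The example scores are invented.

```python
import heapq
from collections import defaultdict

def prune_rules(scored_rules, k=15):
    """Keep the k best (lowest-scoring) rules for each right-hand side.

    `scored_rules` is an iterable of (tag_x, (tag_y, tag_z), score) triples.
    Returns a dict mapping (tag_y, tag_z) to a list of (tag_x, score) pairs,
    best first. ASSUMPTION: lower scores are better.
    """
    by_rhs = defaultdict(list)
    for lhs, rhs, score in scored_rules:
        by_rhs[rhs].append((score, lhs))
    return {
        rhs: [(lhs, score) for score, lhs in heapq.nsmallest(k, candidates)]
        for rhs, candidates in by_rhs.items()
    }

# Invented scores; k=2 keeps the two best left-hand sides per tag pair.
scored = [
    ("NP", ("AT", "NN"), 0.2),
    ("CD", ("AT", "NN"), 0.5),
    ("VBD", ("AT", "NN"), 3.0),
    ("VBD", ("VBD", "IN"), 0.3),
]
print(prune_rules(scored, k=2))
```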
The parser is given a string of part of speech tags as 
input and uses its automatically acquired grammar to 
output an unlabelled binary-branching syntactic tree for 
the string. Since lexical information is thrown away, a 
correct answer is considered to be an analysis that is con- 
sistent with the tag set. The goal of this work is to auto- 
matically create from a tagged corpus a corpus of simple 
sentences annotated with phrase structure. In the next 
phase of the project, we plan to extract a richer gram- 
mar from the corpus of trees. Therefore, we were not 
concerned when no answer was returned by the parser, 
as long as this did not happen with high probability. If 
the parser fails to parse a sentence, that sentence would 
not be present in the corpus of trees. However, if the 
parser incorrectly parses a sentence, the error will be en- 
tered into the corpus. The higher the error rate of this 
corpus, the more difficult the next stage of acquisition 
would be. 
The table below shows the results obtained by testing 
the system on simple sentences. A simple sentence is de- 
fined as a sentence with between five and fourteen words, 
containing no coordinates, quotations, or commas. 
                      Correct   Close   Wrong 
No Unparsed Sents       71%      11%     18% 
With Unparsed Sents     62%      10%     28% 
Table 1: Summary of Acquisition and Parsing Accuracy 
In the table, correct means that the parse was a valid 
parse for the string of tags; close means that by performing 
the operation of moving one bracket and then balancing 
brackets, the parse can be made correct; wrong 
means that the parse was more than one simple operation 
away from being correct. Of all test sentences, 
15% were not parsed by the system. Of those sentences, 
many failed because the beam search we implemented 
to speed up parsing does not explore the entire space of 
parses allowed by the grammar. Presumably, many of 
these sentences could be parsed by widening the beam 
when a sentence fails to parse. 
6 PPS = subject pers. pron., VBD = past verb, PPO = obj. 
pers. pron. 
One question that remains to be answered is whether 
there is a way to label the nonterminals in the trees 
output by the system. The tree below was given the best 
score for that particular part of speech tag sequence. 
\[Tree diagram: root label VB over the tag sequence AT JJ NN VBD PPO, for the sentence The daring boy chased him\] 
If each part of speech tag were assigned a particular nonterminal 
label (PPS and NN would be classed as NP; VB and VBD 
would be classed as VP) 7 and the tags in the tree were replaced 
with their nonterminal labels, we would get a properly 
labelled tree for the above structure. It remains 
to be seen whether this idea can be extended to accu- 
rately assign nonterminal labels to the trees output by 
the parser. 
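The relabelling idea can be illustrated with a few lines of Python. The tag-to-label mapping is the one given in the text; everything else is an assumption of this sketch, including the tree encoding, the function name, and in particular the internal node tags of the example tree, which are hypothetical and are not taken from the system's output.

```python
def relabel(tree, label_of):
    """Replace the tag on each internal node with its nonterminal label.

    A tree is either a leaf tag (str) or a tuple (node_tag, left, right).
    Leaves are left untouched; node tags missing from `label_of` are kept.
    """
    if isinstance(tree, str):
        return tree
    node_tag, left, right = tree
    return (
        label_of.get(node_tag, node_tag),
        relabel(left, label_of),
        relabel(right, label_of),
    )

# Mapping from the text: PPS and NN -> NP; VB and VBD -> VP.
label_of = {"PPS": "NP", "NN": "NP", "VB": "VP", "VBD": "VP"}

# HYPOTHETICAL tree for the tags AT JJ NN VBD PPO; the internal node tags
# are invented for illustration only.
tree = ("VB", ("NN", "AT", ("NN", "JJ", "NN")), ("VBD", "VBD", "PPO"))
print(relabel(tree, label_of))
```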
4. CONCLUSION 
We believe that these results are evidence that automatic 
phrase structure acquisition is feasible. In addition to 
the problem of labelling nonterminals, we are currently 
working on expanding the learner so it can handle more 
complex sentences and take lexical information into ac- 
count when parsing a sentence. 
7 PPS = 3rd sing. nom. pronoun, NN = sing. noun, VB = 
verb, VBD = past verb 
References 
1. Brill, E., Magerman, D., Marcus, M., and Santorini, B. 
(1990) Deducing linguistic structure from the statistics 
of large corpora. In Proceedings of the DARPA Speech 
and Natural Language Workshop, Morgan Kaufmann, 
1990. 
2. Brill, Eric. (1991) Discovering the lexical features of a 
language. In Proceedings of the 29th Annual Meeting of 
the Association for Computational Linguistics, Berkeley, 
CA. 
3. Brown, P., Della Pietra, V., Della Pietra, S. and Mer- 
cer, R. (1990) Class-based n-gram models of natural lan- 
guage. In Proceedings of the IBM Natural Language ITL, 
pp. 283-298, Paris, France. 
4. Francis, W. Nelson and Kučera, Henry, Frequency analysis 
of English usage. Lexicon and grammar. Houghton 
Mifflin, Boston, 1982. 
5. Harris, Zellig. (1951) Structural Linguistics. Chicago: 
University of Chicago Press. 
6. Jelinek, F., Lafferty, J., and Mercer, R. (1990) Basic 
methods of probabilistic context free grammars. Technical 
Report RC 16374 (72684), IBM, Yorktown Heights, 
New York 10598. 
7. Kullback, Solomon. (1959) Information Theory and 
Statistics. New York: John Wiley and Sons. 
8. Lari, K. and Young, S. (1990) The estimation of stochas- 
tic context-free grammars using the inside-outside algo- 
rithm. Computer Speech and Language, 4:35-56. 
9. Magerman, D. and Marcus, M. (1990) Parsing a natural 
language using mutual information statistics. Proceedings, 
Eighth National Conference on Artificial Intelligence 
(AAAI 90), 1990. 
10. Pereira, F. and Schabes, Y. (1992) Inside-outside reesti- 
mation from partially bracketed corpora. Also in these 
proceedings. 
11. Sampson, G. (1986) A stochastic approach to parsing. 
In Proceedings of COLING 1986, Bonn. 
