I 
I 
I 
I 
I, 
I 
l 
Grammar Acquisition Based on Clustering Analysis 
and Its Application to Statistical Parsing 
Thanaruk Theersr~unkong Manabu Okumura 
Japan Advanced Institute of Science and Technology 
1-1 As~h~dai Tatsunokuchi Nomi Ishikawa 923-12 Japan 
{ping, o~u}©j aist. a¢. jp 
Abstract 
This paper proposes a new method for learning a context-sensitive conditional probability 
context-free grammar from an unlabeled bracketed corpus based on clustering analysis and de- 
scribes a natural language parsing model which uses a probability-based scoring function of the 
grammar to rank parses of a sentence. By grouping brackets in s corpus into a number of sire;far 
bracket groups based on their local contextual information, the corpus is automatically labeled 
with some nonterm~=a\] labels, and consequently a grammar with conditional probabilities is ac- 
quired. The statistical parsing model provides a framework for finding the most likely parse of 
a sentence based on these conditional probabilities. Experiments using Wall Street Journal data 
show that our approach achieves a relatively high accuracy: 88 % recaJ1, 72 % precision and 0.7 
crossing brackets per sentence for sentences shorter than 10 words, and 71 ~ recall, 51 ~0 precision 
and 3.4 crossing brackets for sentences between 10-19 words. This result supports the assump- 
tion that local contextual statistics obtained from an unlabeled bracketed corpus are effective for 
learnln~ a useful grammar and parsing. 
1 Introduction 
Most natural language processing systems utilize grammars for parsing sentences in order to 
recognize their structure and finally to understand their meaning. Due to the ,l~mculty and 
complexity of constructing a grammar by hand, there were several approaches developed for 
a, uton~tically training grammars from a large corpus with some probabilistic models. These 
methods can be characterized by properties of the corpus they used, such as whether it includes 
information of brackets, lexical \]abels, nontermlnsl labels and so on. 
Recently several parsed corpora which include full bracketing, tagging and nonterm~l labels 
have been available for researchers to use for constructing a probaMlistic grammar\[Mag91, Bla92, 
Mag95, Co196\]. Most researches on these grammars calcuLzte statistics of a grammar from a fully- 
parsed corpus with nonterm;nal lsbeis and apply them to rank the possible parses of a sentence. 
While these researches report some promising results, due to the cost of corpus construction, it still 
seems worth inferring a probabilistic grammar from corpora with less information, such as ones 
without bracketing and/or nonterm~al labels, and use it for parsing. Unlike the way to annotate 
bracketings for corpora by hand, the hand-annotation of nonterm~nal labels need a process that a 
corpus builder have to determine types of nonterm~nal labels and their number. This process is, in 
some senses, arbitrary and most of such corpora occupy a set of very coarse-grained nonterminsl 
labels. Moreover, compared with corpora including nonterm~sl labels, there are more existing 
corpora which include bracketings without nonterm~nal labels such as EDR corpus\[EDR94\] and 
ATIS spoken language corpns\[Hem90\]. The well-known standard method to infer a prohabilistic 
conte.xt-free grammar from a bracketed/unbracketed corpus without nonterminal labels is so-called 
31 
I 
inside-outside algorithm which w~ originally proposed by Baker\[Bak79\] and was implemented as 
applications for speech and language in ~Lar90\], \[Per92\] and \[Sch93\]. Although encouraging results 
were shown in these works, the derived grammars were restricted to Chomsky normal-form CFGs 
and there were problems of the small size of acceptable trai=~ng corpora and the relatively high 
computation time required for training the grandams. 
Towards the problems, this paper proposes a new method which can learn a standard CFG 
with less computational cost by adopting techniques of clustering analysis to construct a context- 
sensitive probab'distic grammar from a bracketed corpus where nontermlnal labels are not an- 
notated. Another claim of this paper is that statistics from a large bracketed corpus without 
nonterminal labels combined with clustering techniques can help us construct a probabilistic 
grammar which produces an accurate natural language statistical parser. In this method, nonter- 
minal labels for brackets in a bracketed corpus can be automatically assigned by making use of 
local contextual information which is defined as a set of category pairs of left and right words of a 
constituent in the phrase structure of a sentence. In this research, based on the assumption that 
not all contexts are useful in every case, effectiveness of contexts is also investigated. By using 
only effective contexts, it is possible for us to improve training speed and memory space without 
a sacrifice of accuracy. Finally, a statistical parsing model bawd on the acquired grammar is 
provided and the performance is shown through some experiments using the WSJ corpus. 
I 
I 
I 
I 
I 
I 
2 Grammar Acquisition as Clustering Process I 
In the past, Theeramunkong\[The96\] proposed a method of grouping brackets in a bracketed corpus 
(with lexical tags but no nonterminal labels), according to their local contextual information, as 
a first step towards the automatic acquisition of a context-free grammar. The basic idea is to 
apply clustering analysis to find out a number of groups of s;m;\]ar brackets in the corpus and then 
to ~sign each group with a same nonterminal label. Clustering analysis is a generic name of a 
variety of mathematical methods that can be used to find out which objects in a set are s;mi\]sr. 
Its applications on natural language processing are varied such as in areas of word classification, 
text categorization and so on \[Per93\]\[Iwa95\]. However, there is still few researches which apply 
clustering analysis for grammar inference and parsing~Vior95\]. This section gives an explanation 
of grammar acquisition based on clustering analysis. In the first place, let us consider the following 
example of the parse strnctures of two sentences in the corpus in figure 1. 
In the parse structures, leaf nodes are given tags while there is no label for intermedLzte nodes. 
Note that each node corresponds to a bracket in the corpus. With this corpus, the grammar 
learning task corresponds to a process to determ~=e the label for each intermediate node. In other 
words, this task is concerned with the way to cluster the brackets into some certain groups based 
on their similarity and give each group a label. For instance, in figure 1, it is reasonable to classify 
the brackets (c2),(c4) and (c5) into a same group and give them a same label (e.g., NP(noun 
phrase)). As the result, we obtain three grammar rules: NP ~ (DT)(NN), NP ~ (PR.P$)(NN) 
and NP ~ (DT)(cl). To do this, the grammar acquisition algorithm operates in five steps as 
follows. 
I 
I 
I 
I 
I 
I 
1. Assign a unique label to each node of which lower nodes are assigned labels. At the initial 
step, such node is one whose lower nodes are lexical categories. For example, in figure 1, there 
are three unique labels derived: cl ~ (JJ)(NN), c2 ~ (DT)(NN) and ~ ~ (PRP.~)(NN). 
This process is performed throughout all parse trees in the corpus. 
2. Calculate the similarity of every pair of the derived labels. 
3. Merge the most ~m~lar pair to a single new label(i.e., a label group) and recalculate the 
slmilarity of this new label with other labels. 
4. Repeat (3) until a termination condition is detected. Finally, a certain set of label groups is 
derived. 
I 
I 
I 
I 
32 I 
I 
Sentence (1) : A big man slipped on the ice. 
Parse Tree (1) (((DT,"a")C(J3,"big")(NN,"man')))((VB,"slipped")((IN,"on") 
I : ((DT,'the') (NN,'ice'))))) 
Sentence (2) : The boy dropped his wallet somewhere. 
Parse Tzee (2) : (((DT,"the")(NN,"tx~yn))(((VB,'dropp ed~) 
I ((PI~P$," his") (NN,"wallet"))) (RB,"somewhere ~))) 
I 
! 
I , 
: = : ", : : : : ". : 
DT JJ NN VB IN DT NN DT NN VB PRP$ NN RIB 
A big man slipped on the ice The boy dropped his wallet somewhere 
i 
I 
I 
I 
I 
I 
Figure 1: The graphical representation of the parse structures of a big man slipped 
on the ice and the boy dropped his wallet somewhere 
5. Replace labels in each label group with a new label in the corpus. For example, if (DT)(NN) 
and (PRP$)(NN) are in the same label group, we replace them with a new label (such as 
NP) in the whole corpus. 
6. Repeat (1)-(5) until all nodes in the corpus are assigned labels. 
To compute the similarity of labels, the concept of local contextual information is applied. 
In this work, the local contextual information is defined as categories of the words immediately 
before and after a label. This information is shown to be powerful for acquiring phrase structures 
in a sentence in \[Bri92\]. In our prp|iminary experiments, we also found out that the information 
are potential for characterizing constituents in a sentence. 
I 
I 
I 
I 
I 
2.1 Distributional Sim;larity 
While there are a number of measures which can be used for representing the sir-ilarity of labels in 
the step 2, measures which make use of relative entropy (Kullback-Leibler distance) are of practical 
interest and scientific. One of these measures is divergence which has a symmetrical property. Its 
application on natural language processing was firstly proposed by Harris\[Hat51\] and was shown 
successfully for detecting phrase structures in \[Bri92\]\[Per93\]. Basically, divergence, as well as 
relative entropy, is not exactly s'nnilarity measure instead it indicates distributional dissimilarity. 
That means the large value it gets, the less similarity it means. The detail of divergence is 
iUustrated below. 
Let P¢I and Pc= be two probability distributions of labels cI and ~ over contexts, CT The 
relative entropy between P¢~ and P©= is: 
D(P,,.,IIP,==) = ~ pCel,--.:,.) × log pCelc') ,~c~ pCelc=) 
I 33 
Relative entropy D(Pc~ \[\[Pc2) is a measure of the amount of extra information beyond P©~ needed 
to describe Pc2- The divergence between Poe and P¢2 is defined as D(Pc~ \]lPc~)+D(Pc~\]lPcz), and is 
a measure of how di~icult it is to distinguish between the two distributions. The context is defined 
as a pair of words immediately before and after a label(bracket). Any two labels are considered to 
be identical when they are distributionally siml\]~.r, i.e., the divergence is low. From the practical 
point view, this measure addresses a problem of sparseness in limited data. Particularly, when 
p(eJcz) is zero, we cannot calculate the divergence of two probability distributions because the 
denomi-ator becomes zero. To cope with this problem, the original probability can be modified 
by a popular technique into the following formula. 
;~(~,e) (I I pCel~) 
= /v'(~) + - ;910~,1 
where, N(~) and N(c~, e) are the occurrence frequency of ~ and (~, e), respectively. IOrl is the 
number of possible contexts and A is an interpolation coefficient. As defin~-g contexts by the left 
and right lexical categories, \[CT\[ is the square of the number of existing lexical categories. In the 
formula, the first term means the original estimated probability and the second term expresses a 
uniform distribution, where the probability of all events is estimated to a fixed --~form number. 
is applied as a balancing weight between the observed distribution and the -=iform distribution. 
In our experimental results, A is assigned with ~ value of 0.6 which seems to make a good estimate. 
2.2 Termination Condition 
During iteratively merging the most slm~l~r labels, all labels will finally be gathered to a single 
group. Due to this, a criterion is needed for determining whether this merging process should be 
continued or terminated. In this section, we describe ~ criterion named differential entropy which 
is a measure of entropy (perplexity) fluctuation before and after merging a pah- of labels. Let 
cl and c2 be the most similar pair of labels. Also let cs be the result label p(e\[cl), p(e\[c2) and 
p(e\]c3) are probability distributions over contexts e of cl, c2 and ~, respectively, p(cl), p(c2) and 
p(c3) are estimated probabilities of cl, c2 and ca, respectively. The differential entropy (DE) is 
defined as follows. 
DE" = Consequence E~troFg - Previous E~t~opy 
= - pCc~) × ~p(elc~)logp(elc~) 
£ 
+ pCcl) × ~pCelcl)logpCelcl) + pCc2) × ~pCelc~)logpCe\[~) 
e 
where ~ep(elc/) log P(elc/) is the total entropy over various contexts of label c~. The larger DE 
is, the larger the information fluctuation before and after merging becomes. In general, a small 
fluctuation is preferred to s larger one because when DE is large, the current merging process 
introduces a large amount of information fluctuation and its reliability becomes low. 
3 Local Context Effectiveness 
As the s~ml\]~rity of any two labels is estimated based on local contextual information which is 
defined by a set of category pairs of left aad right words, there is an interesting question of which 
contexts are useful for calculation of s~ml\]~rity. In the past, effectiveness of contexts is indicated 
in some previous researches \[Bar95\]. One of suitable measures for representing effectiveness of a 
context is dispersion of the context on labels. This measure expresses that the number of useful 
contexts should be diverse for different labels. From this, the effectiveness (E) of a context (c) 
can be defined using variance as follow: 
~(c) -- ~ C~(a,c)-~(c)) ~ 
.~A I~1 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
! 
I 
I 
I 
I 
I 
I 
34 I 
i 
I 
I 
I 
I 
= IAI 
where A is a set of all labels and a is one of its individual members. N(a, c) is the number of times 
a label a and a context c are cooccurred. N(c) is an averaged value of NCa, c) on a label a. In 
order to take large advantage of context in clustering, it is preferable to choose a context c with 
a high value of E(c) because this context trends to have a high discrlm~nation for characterising 
labels. 1~aD~n~ the contexts by the effectiveness value E, some rank higher contexts are selected 
for elustering the labels instead of all contexts. This enables us to decrease computation time 
and space without sacrificing the accuracy of the clustering results and sometimes also helps us 
to remove some noises due to useless contexts. Some experiments were done to support this 
assumption and their results are shown in the next section. 
I 4 Statistical Parsing Model 
i , 
i 
I 
This section describes a statistical parsing model which takes a sentence as input and produce a 
phrase-structure tree as output: In this problem, there are two components taken into account: a 
statistical model and parsing process. The model assigns a probability to every candidate parse 
tree for a sentence. Formally, given a sentence S and a tree T, the model estimates the conditional 
probability P(T\[S). The most likely parse under the model is argma,zrP(T\[S ) and the parsing 
process is a method to find this parse. ~Vhile a model of a simple probabilistic CFG applies the 
probability of a parse which is defined as the multiplication of the probability of all applied rules, 
however, for the purposes of our model where left and right contexts of a constituent are taken 
into account, the model estimates P(T\[S) by ass-m~-g that each rule are dependent not only on 
the occurrences of the rule but also on its left and right context as follow. l 
P(TIS) P(r,,c,) 
I 
il 
I 
I 
I 
where r~ is an application rule in the tree and ~ is the left and right contexts at the place the rule 
is applied. SimS|at to most probabilistic models and our clustering process, there is a problem 
of low-frequency events in this model. Although some statistical NL applications apply backing- 
off estimation techniques to handle low-frequency events, our model uses a simple interpolation 
estimation by adding a lmlform probability to every event. Moreover, we make use of the geometric 
mean of the probability instead of the original probability in order to ~|imlnate the effect of the 
number of rule applications as done in \[Mag91\]. The modified model is: 
/'(TIS)=C (a*P(r',c4)+Cl-a)*N~N))r~ 
(~,,cDer 
Here, a is a balancing weight between the observed distribution and the uniform distribution and 
it is assigned with 0.95 in our experiments. The applied parsing algorithm is a simple bottom-up 
chart parser whose scoring function is based on this model. The grammar used is the one trained 
by the algorithm described in section 2. A dynamic programming algorithm is used: if there are 
two proposed constituents which span the same set of words and have the same Isbel, then the 
lower probability constituent can be safely discarded. 
I 5 Experimental Evaluation 
To give some support to our su~ested grammar acquisition metllod and statistical parsing model, 
three following evaluation experiments are made. The experiments use texts from the Wall Street 
Journal (WSJ) Corpus and its bracketed version provided by the Penn 'rreebank. Out of nearly 
48,000 sentences(i,222,065 words), we extracted 46,000 sentences(I,172,710 words) as possible 
material source for traiuing a grammar and 2000 sentences(49,355 words) as source for testing. 
35 
The first experiment involves an evaluation of performance of our proposed grammar learning 
method shown in the section 2. In this prp\]imi~ary experiment, only rules which have lexical 
categories as their right hand side are considered and the acquired nontermlnal labels are com- 
pared with those assigned in the WSJ corpus. The second experiment stands for investigating 
effectiveness of contexts described in section 3. The purpose is to find out useful contexts and use 
them instead of all contexts based on the assumption that not all contexts are useful for clustering 
brackets in grammar acquisition. Reducing the number of contexts will help us to improve the 
computation time and space. The last experiment is carried out for evaluating the whole gram- 
mar which is learned based on local contextual information and indicating the performance of our 
statistical parsing model using the acquired grammar. The measures used for this evaJuation are 
bracketing recall, precision and crossing. 
5.1 Evaluation of Clustering in Grammar Acquisition 
This subsection shows some results of our preliminary experiments to confirm effectiveness of 
the proposed grammar acquisition techniques. The grammar is learned from the WSJ bracketed 
corpus where all nonterm~nals are omitted. In this experiment, we focus on only the rules with 
I~ c~egories as th~ ~ght h~d ~de. For ~tance, ci -~ (~J)(~N), c2 -~ (DT)(NN) and 
Cs --~ (P.RP$)(N.N') in figure 1. Due to the reason of computation time and space, we use the 
rule tokens which appear more than 500 times in the corpus. The number of initial rules is 51. 
From these rules, the most similar pair is calculated and merged to a new label The merging 
process is cached out in an iterative way. In each iterative step of the merging process, differential 
entropies are calculated. During the merging process, there are some sharp pealr~ indicating the 
rapid fluctuation of entropy. These sharp peaks can be used as a step to terrnln~te the merging 
process. In the experhnents, a peak with .DE ~> 0.12 is applied. As the result, the process is 
halted up at the 45th step and 6 groups are obtained. 
This result is evaluated by comparing the system's result with nontermlnal symbols given in 
the WSJ corpus. The evaluation method utilizes a contingency table model which is introduced 
in\[Swe69\] and widely used in Information Retrieval and Psychology\[Aga95\]\[lwa95\]. The following 
measures are considered. 
• Positive Recall (PLY) : ~ 
• Positive Precision (PP) : ~----~ 
• NeKative Recall (l~t) : 
• Negative Precision (I~P) : 
* F-measure (FM) • (~2+I)×PP×PR /32 ×PP+ P.R 
where a is the number of the label pairs which the WS3 corpus assigns in the same group and so 
does the system, 5 is the number of the pairs which the ~WSJ corpus does not assign in the same 
group but the system does, c is the number of the pairs which the WSJ assigned but the system 
does not, and d is the number of the pairs which both the WSJ and the system does not assign 
in the same group. The F-measure is used as a combined measure of recall and precision, where 
/3 is the weight of recall relative to precision. Here, we use/5 ---- 1.0, equal weight. 
The result shows 0.93 ~o PR, 0.93 ~o PP, 0.92 ~0 ~ 0.92 % I~P and 0.93 % FM, which 
are all relativeiy good va/ues. Especially, PP shows that almost all same labels in the WSJ are 
assigned in same groups. In order to investigate whether the application of differentia/entropy to 
cut off the merging process is appropriate, we plot values of these measures at all merging steps 
as shown in figure 2. From the graphs, we found out that the best solution is located at around 
44th-45th merging steps. This is consistent with the grouping result of our approach. Moreover, 
the precision equals 100 % from 1st-38nd steps, indicating that the merging process is suitable. 
I 
I 
I 
I 
I 
i 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
36 I 
I 
I 
i 0.8 
I 
I o4 
I. ~ 0.2 
0 
I 
I 
I I " I I I I 
\[- -i -.• -! ..i ......... ! ...... 
Recall i i ~ r'-'k.-. 
II : A~ n-_-a : : : : : ~ ...... i:! ".ji' I,' i iv-r'lu¢.~l~ .... -: ! i i:..~J/ ~ i,~ 
J l IV-Precisioh ..... ~ ~ ~ zi',.~_l, i i :"~ ii 
I.~ ..... ~:-nF~s~r~ .... ~. ........ .~ ......... ~:.l....; ......... ~ ......... ~¢...~.i~. li - : .... "T"---" -: : : :°*. \[ !: : : , %: 
i ..... :/ .... = lv 
: I1"/: : , :; .... . • .../~V....h**~ ~°'° .. • . o °~ . :. 
i i i i i/ i i :ii 
| ~ ~ ~ ~ ~" I~ ~ ~ ~ " ~ 
t..... : : : :._.~° : : : : ; ~ 
: : : .;,, : : : : : • ~:; 
....... : ......... : ......... -. ....... F.-- ........ T .................................... : ...... '~'*==" : : : " : • : ~I ~ 
l : : ~" : __J : ~| 
\[ ........ , , 
0 5 10 15 20 25 30 35 40 45 50 
Merge Step 
Figure 2: The traz~sltion of PR, PP, NP~, NP and FM during the merging process 
i 5.2 Checking Context Effectiveness 
As another experiment, we e,y~mine elfectiveness of contexts in the clustering process in order 
to reduce the computation time and space. Variance is used for expressing effectiveness of a 
context. The assumption is that a context with has the highest variance is the most effective. 
The experiment is done by selecting the top jV of contexts and use it instead of all contexts in 
the clustering process. 
Besides cases of .N = 10, 50, 200, 400 and ali(2401), a case that 200 contexts are randomly 
chosen from all contexts, is taken into account in order to e~arnlne the assumption that vaxiance 
is efficient. In this case, 3 trials are made and the average value is employed. Due to the limit 
of paper space, we show only F-measure in figure 3. The graphs tell us that the case of top 200 
seems superior to the case of 200 random contexts in a11 merging step. This means that variance 
seems to be a good measure for selecting a set of effective contexts in the clustering process. 
l~rthermore, we can observe that a high accuracy can be achieved even if not a11 contexts are 
taXen into account. From this result, the best F-measures are a11 0.93 and the number of groups 
axe 2, 5, 5 and 6 for each case, i.e., 10, 50, 200 and 400. Except of the case of 10, a11 cases show a 
good restdt compared with all contexts (0.93, 6 groups). This resnlt tells us that it is reasonable 
to select contexts with large values of ~raxiance to ones with small v'4riance and a relatively \]axge 
number of contexts are enough for the clustering process. By pr~|im;nary experiments, we found 
out that the following criterion is sufficient for determining the number of contexts. Contexts axe 
selected in the order of their varLznce and a context wi\]1 be accepted when its variance is more 
than 10 % of the average v~iance of the previously selected contexts. 
5.3 Performance of Statistical Parsing Model 
Utilizing top N contexts, we learn the whole grammar based on the algorithm given in section 
2. Brackets(rules) which are occurred more than 40 times in the corpus are considered and the 
number of contexts used is determ;ned by the criterion described in the previous subsection. As 
the result of the grammar acquisition process, 1396 rtfles are acquized. These rules axe attached 
with the conditionalprobability based on contexts (the left and fight categories of the rules). The 
37 
¢0 = 
uL 
0.8 
0.6 
0.4 
0.2 
0 
0 
i ~ ~('J~ i '~ : 
N = 10 ; . ./~:~'.~ ~! 
........ ! ......... i ...... N..'~50"~=~.~; ........ h"~=~.÷:i~'""~~..~. "~ 
i i N ~ 400 ---: i ~_~.i;~./i \] ; i 
........ ~ ......... ~ ....... N. ~ al t~ .... ~ ........ ~ ~ - - . . :. . ~ .~ ....... " - .~.~ : .- . i ......... .; 
• " " : ~. .I • I • N = 200 random .... ~ i~. !; .'\[~" ! i 
........ i ......... ! ......... ~ ......... .:- ....... 4-----~-!-.'i-:----,~ ....... /-.:.- ..... i ....... 
........ : ......... .: ......... ~ ...... ~..~.~-.:/... ~.........=.....-: ......... ! ........ 
5 10 15 20 25 30 35 40 45 50 
Merge Step 
Figure 3: FMs when chsnging the number of context~(N) 
chart parser tries to find the best parse of the sentence. 46,000 sentences are used for training 
a grammar and 2000 sentences are for a test set. To evaluate the performance, the PA.I~.SEVAL 
measures as defined in \[Bla91\] are used: 
Precision = 
number of correct brackets in proposed parses 
number of brackets in proposed parses 
Recall = 
number of correct brackets in proposed parses 
number of brackets in treebank parses 
The parser generates the most likely parse based on context-seusitive condition probability of the 
grammar. Among 2000 test sentences, only 1874 sentences can be parsed owing to two following 
reasons: (1) our algorithm considers rules which occur more than 40 times in the corpus, (2) test 
sentences have different characteristics from training sentences. Table 1 displays the detail results 
of our statistical pexser evaluated against the WSJ corpus. 
93 ~0 of sentences can be parsed with 71 ~ recall, 52 ~0 precision aud 4.5 crossings per sentence. 
For short sentences (3-9 words), the parser achieves up to 88 % recall and 71% precision with 
only 0.71 crossings. For moderately long sentences (10-19 and 20-30 words), it works with 60-71 
% recall and 41-51% precision. ~om this result, the proposed parsing model is shown to succeed 
with high bracketing recalls to some degree. Although our parser cannot achieve good precision, 
it is not so a serious problem because our parser tries to give more detail bracketing for a sentence 
them that given in the WSJ corpus. In the next section, the comparison with other reseaxches will 
be discussed. 
6 Related Works and Discussion 
In this section, our approach is compared with some previous interesting methods. These methods 
can be classified into non-grammar-based and grammar-based approaches. For non-grammaz- 
based approaches, the most successful probabifistic parser named SPATTER is proposed by 
I 
I 
I 
I 
i 
I 
i 
I 
I 
l 
I 
I 
I 
I 
I 
I 
I 
I 
38 I 
LSent n h 
Comparisons 
Avg. Sent. Len. 
TBank Parses 
System's Parses 
Croasings/Sent. 
Sent. cross.= 0 
Sent. cross.< 1 
Sent. cross._< 2 i 
Recall 
Precision 
3--9 13-15 \[10,19 20-.3013-40 
393 988 875 484 1862 
7.0 10.3 = 14.0 24.0 16.33 
4.81 6.90 9.37 15.93 10.85 
10.86 16.58 23.14 40.73 27.18 
0.72 1.89 3.36 7.92 4.52 
56.7% 33.1% 13.6% 2.5% 19.0% 
79.4% 50.4% 25.4% 6.0% 30.3% 
93.4% 67.0% 41.5% 9.5% 41.8% 
88.2% 79.3% 71.2% 59.7% 70.8% 
71.9% 60.6% 51.3% 41.2% 52.1% 
i, 
Table 1: Parsing accuracy using the WSJ corpus 
Magerman\[Mag95\]. The parser is constructed by using decision-tree learning techniques and 
can succeed up to 86-90 % of bracketing accuracy(both recall and precision) when tr~ing with 
the WSJ corpus, a fully-parsed corpus with nontermlnvJ labels. Later Collins\[Col96\] introduced 
a statistical parser which is based on probabilities of bigzam dependencies between head-words 
in a parse tree. At least the same accuracy as SPATTER was acquired for this parser. These 
two methods ufflized a corpus which includes both lexical categories and nontermi~al categories. 
However, it seems a hard task to assign nontermlnsl labels for a corpus and the way to assign a 
nonterminal label to each constituent in the parsed sentence is arduous and arbitrary. It follows 
that it is worth trying to infer a grammar from corpora without nontermlnal labels. 
One of the most promising results of grammar inference based on grammar-based approaches is 
the inside-outside algorithm proposed by Laxi\[Lazg0\] to construct the gr~.mmax from unbracketed 
corpus. This algorithm is an extension of forward-backward algorithm which infers the parameters 
of a stochastic context-free grammar. In this research the acquired grammar is elr~.luated based 
on its entropy or perplexity where the accuracy of parsing is not taken into account. As another 
research, Pereira and Schabes\[Per921\[Sch93 \] proposed a modified method to infer a stochastic 
gran~ar from a partially parsed corpus and evaluated the results with a bracketed corpus. This 
approach gained up to around 90 % bracketing recall for short sentences(0-15 words) but it sut~ered 
with a large amount ambiguity for long ones(20-30) where 70 % recall is gained. The acquired 
gr~mrn~T is normally in Chomsky .normal-form which is a special case of gr~mTnar although he 
claimed that all of CFGs can be in this form. This type of the gr=tmrnar makes all output parses 
of this method be in the form of binary-branrMng trees and then the bracketing precision cannot 
be taken into account because correct parses in the corpus need not be in this form. On the other 
hand, our proposed approach can learn a standard CFG with 88 % recall for short sentences and 
60 % recall for long ones. This result shows that our method gets the same level of accuracy 
as the inside-outside algorithm does. However, our approach can learn a gr~tmm~.~, which is not 
restricted to Chomsky normal-form and performs with leas computational cost compared with the 
approaches applying the inside-outside algorithm. 
7 Conclusion 
In this paper, we proposed a method of applying clustering aaalysis to learn a context-sensitive 
probab'flistic grammar from an unlabeled bracketed corpus. Supported by some experiments, 
local contextual information which is left and right categories of a constituent was shown to 
be useful for acquiring a context-sensitive conditional probability context-free grammar from a 
corpus. A probabilistic parsing model using the acquired grammar was described and its potential 
was eT~m{ned. Through experiments, our parser can achieve high paxsing accuracy to some extent 
compared with other previous approaches with less computational cost. As our further work, there 
39 
are still many possibilities for improvement which are encouraging. For instance, it is possible 
to use lexical information and head information in clustering and constructing a probabilistic 
g~l~Yn Tn ~LY. 

References 
Agarwal, 1~.: Evaluation of Semantic Clusters, in Proceeding off $$rd Annual Meeting off 
the ACL, pp. 284-286, 1995. 
Baker, J.: Traina, ble grarom~rs for speech recognition, in Speech Coraranrdcation Papers 
for the 97th Meeting of the Acoustical Society of America (D.H. Klatt and J.J. Wolff, 
eda.), pp. 547-550, 1979. 
Bartell, B., G. Cottreil, and R. Belew: Representing Documents Using an Explicit Model 
of Their Sjm;|alJties, 3otlrr~al off the Amebean Society for Infformation Science, Vol. 46, 
No. 4, pp. 254-271, 1995. 
Black, E. and et al.: A procedure for quantitatively comparing the syntactic coverage of 
english grammars, in Proc. off the 1991 DARPA Speech and Natural .Language Workshop, 
pp. 306--311, 1991. 
Black, E., F. Jelinek, J. La~erty, D. Magerman, R. Mercer, and S. Roukos: Towards 
History-Based Grammars: Using Richer Models for Proba.bilistic Parsing, in Proc. off 
the 1992 DARPA ,Ypeech and .Natural Language Workshop, pp. 134--189, 1992. 
Brill, E.: Automatically Acquiring Phrase Structure using Distributional Analysis, in 
Proc. off Bpeech and Natural Language Workshop, pp. 155--159, 1992. 
Collins, M. J.: A New Statistical Parser Based on Bigram Lexical Dependencies, in 
Proc. off the 3~th Annual Meeting off the AUL, pp. 184-191, 1996. 
EDR: Japan Electronic Dictionary Research Institute: EDR Electric Dictionary User's 
Mannal (in Japanese), 2.1 edition, 1994. 
Harris, Z.: Structural LinguL~tics, Chicago: University of Chicago Press, 1951. 
Hemphill, C., G. J.J., and G. Doddington: The ATIS spoken lauguage systems pilot 
corpus, in DARPA Speech and .Natural Language Workshop, 1990. 
Iwayamv., M. and T. Tokunaga: Hierarchical Bayesian Clustering for Automatic Text 
Classification, in IJCAI, pp. 1322-1327, 1995. 
Lari, K. and S. Young: =The Estimation of Stochastic Context-free Grammars Using 
the Inside-Outside Algorithm", Computer speech and languagea, Vol. 4, pp. 35--56, 1990. 
Magerman, D. M. and M. P. Marcus: Pearl: A Probabilistic Chart Parser, in Proceedings 
off the Ecropean ACL Conference, 1991. 
Magerman, D. M.: Statistical Decision-Tree Models for Parsing, in Proceeding of $3rd 
Annual Meeting of the ACL, pp. 276-283, 1995. 
Mori, S. and M. Nagso: Parsing without Grammar, in Proc. off the ~th International 
Workshop on Parsing Technologies, pp. 174--185, 1995. 
Pereira, F. and Y. Schabes: Inside-Outside reestimation from partially bracketed cor- 
pora, in Proceedings of 30th Annual Meeting of the ACL, pp. 128--135, 1992. 
Pereira, F., N. Tishby, and L. Lee: Distributional Clustering of English Words, in 
Proceedings of 315~ Annual Meeting off the ACL, pp. 183--190, 1993. 
Schabes, Y., M. Both, and 1t. Osborne: Parsing the Wall Street Journal with the Insider 
Outside Algorithm, in Proc. off 6th European Cfiapter off ACL, pp. 341-347, 1993. 
Swets, J.: Effectiveness of Information Retrieval Methods, American Documentation, 
Vol. 20, pp. 72--89, 1969. 
Theeramunkong, T. and M. Okumura~ Towards Automatic Glr~mmar Acquisition from 
a Bracketed Corpus, in Proc. of the 4th International Workshop on Very Large Corpora, 
pp. 168-177, 1996. 
