Towards Automatic Grammar Acquisition from a Bracketed 
Corpus 
Thanaruk Theeramunkong 
Japan Advanced Institute of 
Science and Technology 
Graduate School of Information Science 
15 Asahidai Tatsunokuchi 
Nomi Ishikawa 923-12 Japan 
ping~j aist. ac. jp 
Manabu Okumura 
Japan Advanced Institute of 
Science and Technology 
Graduate School of Information Science 
15 Asahidai Tatsunokuchi 
Nomi Ishikawa 923-12 Japan 
0ku@j aist. ac. jp 
Abstract 
In this paper, we propose a method to group brackets in a bracketed corpus (with lexical tags), 
according to their local contextual information, as a first step towards the automatic acquisition 
of a context-free grammar. Using a bracketed corpus, the learning task is reduced to the problem 
of how to determine the nonterminal label of each bracket in the corpus. In a grouping process, a 
single nonterminai label is assigned to each group of brackets which are similar. Two techniques, 
distributional analysis and hierarchical Bayesian clustering, are applied to exploit local contextual 
information for computing similarity between two brackets. We also show a technique developed 
for determining the appropriate number of bracket groups based on the concept of entropy analysis. 
Finally, we present a set of experimental results and evaluate the obtained results with a model 
solution given by humans. 
1 Introduction 
Designing and refining a natural language grammar is a diiBcult and time-consuming task and re- 
quires a large amount of skilled effort. A hand-crafted grammar is usually not completely satisfactory 
and frequently fails to cover many unseen sentences. Automatic acquisition of grammars is a solu- 
tion to this problem. Recently, with the increasing availability of large, machine-readable, parsed 
corpora, there have been numerous attempts to automatically acquire a CFG grammar through the 
application of enormous existing corporaILar90\]\[Mi194\]\[Per92\]\[Shi95 \]. 
Lari and Young\[Lar90\] proposed so-called inside-outside algorithm, which constructs a grammar 
from an unbracketed corpus based on probability theory. The grammar acquired by this method 
is assumed to be in Chomsky normal form and a large amount of computation is required. Later, 
Pereira\[Per92\] applied this algorithm to a partially bracketed corpus to improve the computation 
time. Kiyono\[Kiy94b\]\[Kiy94a\] combined symbolic and statistical approaches to extract useful gram- 
mar rules from a partially bracketed corpus. To avoid generating a large number of grammar rules, 
some basic grammatical constraints, local boundaries constraints and X bar-theory were applied. 
Kiyono's approach performed a refinement of an original grammar by adding some additional rules 
while the inside-outside algorithm tries to construct a whole grammar from a corpus based on Max- 
imum Likelihood. However, it is costly to obtain a suitable grammar from an unbracketed corpus 
and hard to evaluate results of these approaches. As the increase of the construction of brack- 
eted corpora, an attempt to use a bracketed (tagged) corpus for grammar inference was made by 
Shiral\[Shi95\]. Shirai constructed a Japanese grammar based on some simple rules to give a name (a 
label) to each bracket in the corpus. To reduce the grammar size and ambiguity, some hand-encoded 
knowledge is applied in this approach. 
In our work, like Shirai's approach, we make use of a bracketed corpus with lexical tags, but 
instead of using a set of human-encoded predefined rules to give a name (a label) to each bracket, we 
introduce some statistical techniques to acquire such label automatically. Using a bracketed corpus, 
the grammar learning task is reduced to the problem of how to determine the nonterminal label of 
each bracket in the corpus. More precisely, this task is concerned with the way to classify brackets 
to some certain groups and give each group a label. We propose a method to group brackets in 
168 
a bracketed corpus (with lexical tags), according to their local contextual information, as a first 
step towards the automatic acquisition of a context-free grammar. In the grouping process, a single 
nontermina\] label is assigned to each group of brackets which are similar. To do this, we apply 
and compare two types of techniques called distributional analysis\[HarSl\] and hierarchical Bayesian 
clustering\[Iwa95\] for setting a measure representing similarity among the bracket groups. We also 
propose a method to determine the appropriate number of bracket groups based on the concept of 
entropy analysis. Finally, we present a set of experimental results and evaluate our methods with a 
model solution given by humans. 
2 Grammar Acquisition with a Bracketed Corpus 
In this section, we give a brief explanation of grammar acquisition using a bracketed corpus. In this 
work, the grammar acquisition utilizes a lexical-tagged corpus with bracketings. An example of the 
parse structures of two sentences in the corpus is shown graphically in Figure 1. 
Sentence (1) 
Parse Tree (1) 
Sentence (2) 
Parse Tree (2) 
: A big man slipped on the ice. 
: (((ART,"a") ((ADJ," big') (NOUN," man"))) 
((VI," slipped")((PREP,"on") ((ART,"the") (NOUN," ice'))))) 
: The boy dropped his wallet somewhere. 
: (((ART,"the") (NOUN,"boy')) 
(((VT,"dropped") ((PROg,"his") (NOUN,"wallet'))) (AVV,"somewhere"))) 
m m • | 
| t i | 
ART ADJ NOUN Vl PREP ART NOUN ART NOUN VT PRON NOUN ADV 
A big man slipped on the ice The boy dropped his wallet somewhere 
Figure 1: The graphical representation of the parse structures of a big man slipped on the ice and the 
boy dropped his wallet somewhere 
In the parse structures, each terminal category (leaf node) is given a name (tag) while there is no 
label for each nonterminal category (intermediate node). With this corpus, the grammar learning 
task corresponds to a process to determine the nonterminal label of each bracket in the corpus. 
More precisely, this task is concerned with the way to classify the brackets into some certain groups 
and give each group a label. For instance, in Figure 1, it is reasonable to classify the brackets 1 
(c2),(c4) and (c5) into a same group and give them a same label (e.g., NP(noun phrase)). As 
the result, we obtain three grammar rules: NP ~ (ART)(NOUN), NP ~ (PRON)(NOUN) and 
NP ~ (ART)(el). To perform this task, our grammar acquisition algorithm operates in five stages 
as follows. 
1. Assign a unique label to each node of which lower nodes are assigned labels. At the initial 
step, such node is one whose lower nodes are lexical categories 2. This process is performed 
throughout all parse trees in the corpus. 
1A bracket corresponds to a node in Figure 1. 
~In Figure 1, there are three unique labels derived: el---~(ADJ)(NOUN), cc2-*(ART)(NOUN) and 
cs ~ (PRON)(NOUJV). 
169 
2. Calculate the similarity of every pair of the derived labels. 
3. Merge the most similar pair to a single new label(i.e., a label group) and recalculate the 
similarity of this new label with other labels. 
4. Repeat (3) until a termination condition is detected. As the result of this step, a certain set 
of label groups is derived. 
5. Replace labels in each label group with a new label in the corpus. For example, if (ART)(NOUN) 
and (PRON)(NOUN) are in the same label group, we replace them with a new label (such as 
NP) in the whole corpus. 
6. Repeat (1)-(5) until all brackets(nodes) in the corpus are assigned labels. 
In this paper, as a first step of our grammar acquisition, we focus on step (1)-(4), that is how to 
group nodes of which lower nodes are lexical categories. Figure 2 depicts an example of the grouping 
process. 
G3 
I 
(ADJ){NOUN) C1   ---~- gl (rip without an article) 
(NOUN)(NOUN) C7 t 
(ART)(NOUN) C2 t 
(PRON)INOUN) C5 g2 (rip with an article) 
(INDEF)(NOUN) C6 
I 
| 
INDEF = { both, some, any .... } 
ART ={a, the .... } 
PRON = { my, his, her, their .... } 
NOUN = {trip, newspaper .... } 
ADJ = {high, available .... } 
Figure 2: A part of the bracket grouping process 
To compute the similarity of a pair of labels(in step 2), we propose two types of techniques called 
distributional analysis and hierarchical Bayesian cbtstering as shown in section 3. In section 4, we 
introduce the concept of differential entropy as the termination condition used in step (4). 
3 Local Contextual Information as Similarity Measure 
In this section, we describe two techniques which utilize "local context information" to calculate 
similarity between two labels. The term "local contextual information" considered here is repre- 
sented by a pair of words immediately before and after a label. In the rest of this section, we first 
describe distributional analysis in subsection 3.1. Next, we give the concept of Bayesian clustering 
in subsection 3.2. 
3.1 Distributional Analysis 
Distributional analysis is a statistical method originally proposed by Harris\[Harbl\] to uncover reg- 
ularities in the distributional relations among the features of speech. Applications of this technique 
are varied\[Bri92\]\[Per93\]. In this paper, we apply this technique to group similar brackets in a 
bracketed corpus. The detail of this technique is illustrated below. 
Let P1 and P2 be two probability distributions over environments. The relative entropy between 
P1 and P2 is: 
Pl(e) D(PIlIP2) = ~ Pz(e) x log 
P~(e) 
• E Env~iro~m¢~'ts 
Relative entropy D(PIlIP2 ) is a measure of the amount of extra information beyond P2 needed 
to describe Pl. The divergence between Pz and P= is defined as D(PIlIP2 ) + D(P2IIP1), and is 
a measure of how difficult it is to distinguish between the two distributions. The environment is 
170 
a pair of words immediately before and after a label(bracket). A pair of labels is considered to be 
identical when they are distributionaliy similar, i.e., the divergence of their probability distributions 
over environments is low. 
The probability distribution can be simply calculated by counting the occurrence of (c~) and 
(word1 c~ words). For the example in Figure 1, the numbers of appearances of (c1), (c2), (c5), 
(ART cz VI), (PREP c2 NULL) and (VT es ADV) are collected from the whole corpus. NULL 
stands for a blank tag representing the beginning or ending mark of a sentence. 
Sparse Data Considerations 
Utilizing divergence as a similarity measure, there is a serious problem caused by the sparseness 
of existing data or the characteristic of language itself. In the formula of relative entropy, there 
is a possibility that P2(e) becomes zero. In this condition, we cannot calculate the divergence of 
two probability distributions. To cope with this problem, we extend the original probability to one 
shown in the following formula. 
P(ac~ b) = A N(ac~b) ~.(l_A)Nt~g N(~) 
s 2 
where, N(a) is the occurrence frequency of o~, Ntag8 is the number of terminal categories and A is 
a interpolation coefficient. The first term in the right part of the formula is the original estimated 
probability. The second term is generally called a uniform distribution, where the probability of an 
unseen event is estimated to a uniform fixed number. A is applied as a balancing weight between 
the observed distribution and the uniform distribution. Intuitively, when the size of data is large, 
the small number should be used as A. In the experimental results in this paper, we assigned A with 
a value of 0.6. 
3.2 Hierarchical Bayesian Clustering Method 
As a probabilistic method, hierarchical Bayesian clustering was proposed by Iwaya~na\[Iwa95\] to 
automatically classify given texts. It was applied to improve the efficiency and the effectiveness of 
text retrieval/categorization. Referring to this method,we try to make use of Baycsiar~ posterior 
probability as another similarity measure for grouping the similar brackets. In this section, we 
conclude the concept of this measure as follows. 
Let's denote a posterior probability with P(GIC), where C is a collection of data (i.e., in Figure 
2, C = {c1,e2, ..., CN}) and G is a set of groups(clusters) (i.e., G = {gz,g2, ...}). Each group(cluster) 
gj is a set of data and the groups are mutually exclusive. In the initial stage, each group is a singleton 
set; g~ = {~} for all i. The method tries to select and merge the group pair that brings about the 
maximum value of the posterior probability ~ P(GIC). That is, in each step of merging, this method 
searches for the most plausible situation that the data in C are partitioned in the certain groups G. 
For instance, at a merge step h + 1 (0 < b < N - 1), a data collection C has been partitioned into 
a set of groups G~. That is each datum e belongs to a group g E Gk. The posterior probability at 
the merging step/¢ + 2 can be calculated using the posterior probability at the merging step/¢ + 1 
as shown below (for more detail, see\[Iwa95\]). 
PC(Gh) SC(g=)SC(g,) 
Here PC(G~) corresponds to the prior probability that N random data are classified in to a set of 
groups O~. As for the factor of ~ a well known estimate\[Ris89\] is applied and it is reduced PC(G~) ' 
to a constant value A -1 regardless of the merged pair. For a certain merging step, P(G~IC ) is 
identical independently of which groups are merged together. Therefore we can use the following 
measure to select the best group pair to merge. The similarity between two bracket groups(labels), 
g= and gv, can be defined by SIM(g=,gv). Here, the larger SIM(g=,g~) is, the more similar two 
brackets are. 
SO(g, U g,) SIM(g=,g,) = 
SC(g= )SC(g, ) 
SC(g) = I~ P(elg) 
cEg 
SMaximizing P(GIC ) is a generalization of Mac/mUSh L//cel/hood estimation. 
171 
= ~ P(clg, e)P(elg) P(clg) 
eEEnvi~onwtc~ta 
,~, E P(cle)p(elg) 
e 
P(elc) P(elg) = P(c) 
P(e) 
where SC(g) expresses the probability that all the labels in a group g are produced from the group, 
an elementai probability P(c\[g) means the probability that a group g produces its member c and 
P(elc ) denotes a relative frequency of an environment e of a label e, P(elg ) means a relative frequency 
of an environment e of a group g and P(e) is a relative frequency of an environment e of the entire 
label set. In the calculation of SIM(g=,gv), we can ignore the value of P(c) because it occurs 
Ig= U gvl times in both denominator and numerator. Normally, SIM(g=,gy) is ranged between 0 
and 1 due to the fact that P(c\[g= U gy) _< P(clg= ) when c E g=. 
4 Differential Entropy as Termination Condition 
During iteratively merging the most similar labels, all labels will finally be gathered to a single group. 
Due to this, it is necessary to provide a criterion for determining whether this merging process should 
be continued or terminated. In this section, we describe a criterion named differential entropy which 
is a measure of entropy (perplexity) fluctuation before and after merging a pair of labels, Let cl and 
c2 be the most similar pair of labels based on divergence or Bayesia~u posterior probability. Also let 
c3 be the result label. P~i (e), Pc= (e) and Pc3(e) are probability distributions over environment e of 
cl, e2 and c3, respectively. Pc1, Pc= and P~3 are estimated probabilities of cl, c2 and c3, respectively. 
The differentiaJ entropy (Z~E) is defined as follows. 
~E = Consequence Entropy - Previous Entropy 
= - P0~ x ~Po~(e)logPo~(e) 
e 
+ Po, x  Po,(e)logPo,(e) + Po= x  P0,(e)logPo,(e) 
• e 
where ~, Pc~ (e) log Pc, (e) is the total entropy over various environments of label c~. The larger ~E 
is, the larger the information fluctuation before aad after merging becomes. Generally, we prefer 
a small fluctuation to a larger one. When ZXE is large, the current merging process introduces a 
large amount of information fluctuation and its reliability should be low. From this viewpoint, we 
apply this measure as a criterion for determining the termination of the merging process which will 
be given in the next section. 
5 Preliminary Experimental Results 
In this section, we show some results of our preliminary experiments to confirm effectiveness of 
the proposed techniques. The corpus we used is constructed by EDR and includes nearly 48,000 
bracketed, tagged sentences\[EDR94\]. As mentioned in the previous sections, we focus on only 
the rules with lexical categories as their right hand side 4. For instance, cx --~ (ADJ)(NOUN), 
e2 --* (ART)(NOUN) and cs --* (PRON)(NOUN) in Figure 1. To evaluate our method, we use 
the rule tokens which appear more than 500 times in the corpus. Table I gives some characteristics 
of the corpus. 
From the 35 initial rules, we calculate the similarity between any two rules (i.e., any rule pair) 
based on divergence and Bayesian posterior probability (BPP). For the divergence measure, the 
smaller the vaiue is, the more similar the rule pair is. Inversely, for BPP, the larger the value is, the 
more similar the pair looks. After calculating all pairs' similarities, we merge the most similar pair 
(the minimum divergence or the maximum BPP) to a new label and recalculate the similarity of 
the new label with other remaining labels. The merging process is carried out in an iterative way. 
Figure 3 shows the minimum divergence (left) and the maximum Bayesian posterior probability 
(right) of each merge step. 
In each iterative step of the merging process, we calculate differential entropy for both cases. The 
differential entropy of each step equals to the entropy difference between the entropy of two rules 
before merging and the entropy of a new rule after merging as described in the previous section. 
4Other types of rules can be acquired in almost the same way and are left now as our further work. 
172 
No.of sentences 48259 
No.of initial rules (.f > 500) 35 
(from total 761 rules) 
Total number of rule tokens 136087 
(from total 152925) 
Table 1: Some features of the corpus 
8 == 
2.5 
i \[\] 
2 ............. ' ............. 4 .............. ~ .............. , .............. , .............. ,......: ..... 
15 ............................................................ ........... ......... .... .... 
1 ................................................................ ............ 
I 
0.5 , ~ -" ~BE\]~ 
i | 
0 5 10 15 20 25 30 35 
0.75 
0.5 
0.25 
% 
i ~aGa " 
0 5 10 15 20 25 30 35 
Merge Step " Merge Step 
Figure 3: The minimum divergence (left) and the maximum Bayesian posterior probability (right) of 
each merge step 
Two graphs in Figure 4 indicate the results of differential entropy (LXE calculated by the formula 
in section 4) when the merging process advanced with divergence and BPP as its similarity measures. 
There are some sharp peaks indicating the rapid fluctuation of entropy in the graphs. In this work, 
we use these peaks as a clue to find the timing we should to terminate the merging process. As the 
result, we halt up the process at the 22nd step and the 27th step for the cases of divergence and 
BPP, respectively. Table 2 shows the obtained grouping results. In these tables, there axe 13 groups 
for divergence and 8 groups for Bayesian posterior probability. To clarify the result in the tables, 
Some sample words of each label axe given in the appendix. 
0.5 
0.4 
O.3 
o.e 
0.1 
0 
I 
0 5 10 15 20 25 30 35 
0.2 i 
i 0.15 
............. "; .................................................................... ~'"'"J ..... i 
/ / 
0 5 10 15 20 25 30 35 
Merge Step Merge Step 
Figure 4: Differential entropy during the merging processes using divergence (left) and BPP (right) 
We also made an experiment to evaluate these results with the solution given by three human 
evaluators (later called A, B and C) who axe non-natlve but high-educated with more than 20 years 
of English education. The evaluators were told to construct 7-15 groups from the 35 initial rules, 
based on the grammatical similarity as they thought. As the result, the evaluators A, B and C 
classified the rules into 14, 13 and 14 groups, respectively. 
173 
Members 
(INDEF)(NOUN), (ART)(NOUN), 
(PRON)(NOUN), (DEMO)(NOUN), 
(NUM)(NOUN), (NUM)(UNIT), 
(NOUN)(NUM) 
(ADJ)(NOUN), (NOUN)(NOUN), 
(NOUN) (CONJ)(NOUN) 
3 (AUX)(VT) 
(PREP)(NOUN), (PREP)(NUM), 
(PREP)(PRON), (ADV)(ADV), 
(PTCL)(VI) 
m I\[ gDI  (VT)(NOUN), (VI)(ADV), 
(VT)(PRON), (AUX)(VI), (BE)(VI), 
(BE)(VT), (BE)(ADJ), (ADV)(VI), 
(VI)(PTCL) 
7 (ADV)(ADJ) 
8 (AUX)(ADV) 
9 (ADV)(VT), (VT)(PTGL), 
(VI)(PREP) 
10 (AUX)(BE) 
11 (BE)(ADV) 
12 (ADV)(BE) 
13 (PRON)(VT) 
Members 
1 (INDEF)(NOUN), (ART)(NOUN), 
(PRON)(NOUN), (DEMO)(NOUN), 
(NOUN)(CONJ)(NOUN) 
2 (ADJ)(NOUN), (NOUN)(NOUN) 
3 (AUX)(VT), (AUX)(BE), 
(BE)(ADV), 
(PTCL)(VT), (AUX)(ADV) 
4 (PREP)(NOUN), (ADV)(ADV), 
(PREP)(PRON), (PTCL)(VI), 
(PREP)(NUM) 
5 (VT)(NOUN), (Vl)(ADV), 
(VT)(PRON), (AUX)(VI), 
(BE)(Vl), (BE)(ADJ), 
(BE)(VT), (ADV)(VI), 
(VI)(PTCL) 
6 (ADV)(AD3), (NUM)(NOUN), 
(NUM)(UNIT) 
7 (ADV)(VT), (VT)(PTCL), 
(VI)(PREP) 
8 (NOUN)(NUM), (ADV)(BE), 
(PRON)(VT) 
Table 2: The grouping result using divergence (left) and BPP (fight) 
The system says Yes 
The system says No 
The Evaluator's Answer 
Yes ' No 
a b 
e d 
Table 3: The number of entry pairs for evaluating accuracy 
To evaluate the system with the model solutions, we applied a contingency table model as one 
shown in Table 3. This table model was introduced in \[Swe69\] and widely used in Information 
Retrieval and Psychology. In the table, a is the number of the label pairs which an evaiuator 
assigned in the same group and so did the system, b is the number of the pairs which an evaluator 
did not assign in the same group but the system did, e is the number of the pairs which an evaluator 
assigned but the system did not, and d is the number of the pairs which both an evaluator and 
the system did not assign in the same group. From this table, we define seven measures, as shown 
below, for evaluating performance of the proposed methods. This evaluation technique was also 
applied partly for computing "closeness" between a system's answer and an evaluator's answer in 
\[Hat93\]\[Aga95\]\[Iwa95\] 
• Positive Recall (PR) : 
• Positive Precision (PP) : 
d • Negative Recall (NR) : ~+---~ 
• Negative Precision (NP) : 7~ 
PR~NR • Averaged Recall (AR) : 2 
PP.~.NP • Averaged Precision (AP) : 2 
(fl3-1-1)×PPxPR • F-measure (FM) 
: fla×PP+PR 
The F-measure is used as a combined measure of recall and precision, where fl is the weight of 
recall relative to precision. Here, we use fl = 1.0, which corresponds to equal weighting of the two 
measures. The results compared with three human evaluators are shown in Table 4. 
174 
Evaluator A 
Evaluator B 
Evaluator C 
Averaged 
Similaxity Measures 
PR I PP NR NP AR AP FM 
Divergence 0.91 i 0.70 0.96 0.99 0.93 0.84 0.79 
BPP 0.63 i 0.46 0.92 0.96 0.77 0.71 0.53 
Divergence 0.73 i 0.66 0.95 0.97 0.84 0.81 0.69 
BPP 0.68 i 0.59 0.94 0.96 0.81 0.78 0.63 
Divergence 0.89 i 0.66 0.95 0.99 0.92 0.82 0.76 
BPP 0.80 0.57 0.94 0.98 0.87 0.77 0.66 
Divergence 0.84 0.67 0.95 0.98 0.90 0.82 0.75 
BPP 0.70 0.54 0.93 0.971 0.82 0.75 0.61 
Table 4: Eva~luation results using three human eva~uators' solutions 
From these results, we observe some features as follows. The divergence gives a better solution 
than Bayesian posterior probability does. Normally, the positive measures (PR and PP) have smaller 
values than the negative ones (NR and NP) do. This means that it is difficult to judge two labels 
to be in a same group rather than to judge them to be in a separate group. Using divergence 
as a similarity measure, we get, on average, 84 % positive recall and 67 ~ positive precision and 
up to 90 ~ and 82 % when considering both positive and negative measures. Even for the worst 
result(Evaluator B), we can get up to 84 % and 81% for averaged recall and precision. In order 
to confirm the performance of the system, the evaluators' results axe compared with each other. 
This comparison is useful for investigating the difficulty of the grouping problem. The comparison 
result is shown in Table 5. At this point, we can observe that the label grouping process is a hard 
problem that may make an evaluator's solution inconsistent with the others' solutions. However, 
our proposed method seem to give a reconciliation solution between those solutions. Especially, the 
method which applies divergence as the similarity measure, has a good performance in grouping 
brackets in the bracketed corpus. 
A+B 
B+A 
B+C 
C+B 
C+A 
A+C 
l Measures l PR \[ PP \[ NR \[ NP \[AR AP \[ FM 
0.55 0.47 0~94 0.95 0.74 0.71 0.51 
0.68 0.83 0.98 0.96 0.83 0.90 0.75 
0.57 0,55 0.95 0.96 0.76 0.76 0.56 
I, Averaged10"6110"61\[ 0"96 \[ 0"96 \[ 0"78 0"78\[0"61l 
Table 5: Comparing the grouping results obtained by the evaluators(A,B,C) 
We also make an experiment to evaluate whether divergence is a better measure than BPP, 
and whether the application of differential entropy to cut off the merging process is appropriate. 
This examination can be held by plotting values of recall, precision and F-measure during each step 
of merging process. Figure 5 shows the fluctuation of positive recall(PR), positive preclsion(AP), 
averaged recall(AR), averaged precision and F-measure (FM). 
From the graphs, we found out that the maximum value of F-measure is 0.75 in the case of 
divergence while it is only 0.65 in the case of BPP. That is, divergence provides a better solution 
than BPP. Moreover, the 22nd an 25th merge steps were the most suitable points to terminate the 
merging process for divergence and BPP, respectively. This result is consistent with the grouping 
result of our system (13 groups) in the case of divergence. Although differential entropy leads us 
to terminate the merging process at the 27th merge step in the case of BPP, we observe that there 
is just a little difference between the F-measure value of the 25th merge step and that of the 27th 
merge step. From this result, we conclude that differential entropy can be used a good measure to 
predict the cut-off timing of the merging process. 
175 
0.8 
E tL 0.6 
"~ 0.4 
~ 0.2 
0 0 
r--v'-"i ............. ! ........................... ! .............. ~ .............. 7 ....... ~- ...,...~........\ ~... ~ ~/ ! 
............ ~....,~ ................... ~.....p ......:.,,...w.....~ ............ ~.~ .......... 
i"" ~ ', .-" ~ ~ i ~, P..:..,.~.. 
............ ~/i'~ ............. P'~: ....... ~~ii~ {. .............. ~ .............. !--..v---.~.-÷ ........ ~--- 
', / : : • " __-L ~ :~ : ,. L../'~,~ % ,~ 
i," ~:A:: .... ~:~,.},--4. 
........ :~:.:::F" i,, "'i 
.~'ill ~-measbre --i- i .... 4 
~fi i i i i 
5 10 15 20 25 30 35 
i 0.8 
0.6 
~, 0.4 
" 0.2 
Q: 
0 
0 
I I I I I 
i ! i i i r~-~ ............. ? ............ I .............. I .............. ! ............. t ...... 
b,', _,¢,.., ~ ~ ~ ! ~ / 
........... \[ .......... );!~ ........ J ...... i ............ ~:,,::: 
t ~ ,~ ....... "~ " . 
F, ............. ~. ............. $.,~,*,::t..~ .............. ~~::~..~\[ii:~.,~..~'.~-.~_,.: ..t~.,,.~...~....:~., 
i ~ ~ ./ : ! X! \ 
............. ~ ............. ~- .......... ~.~::::--/-.~-.- -----i-.- .-----.:~.-\-. 
~ .,. ~.,,J Recalli i\\ 
i ~.--~" jPI, Pr~ision~ .... i\ 
......... 7"~.:: ............. .... 
5 10 15 20 25 30 
Merge Step Merge Step 
35 
Figure 5: The transition of PR, PP, AR, AP, FM during the merging process using divergence(left) and 
BPP(right) as the similarity measures 
6 Conclusion 
There has been an increase of the construction of many types of corpora, including bracketed cor- 
pora. In this work, we attempt to use a bracketed (tagged) corpus for grammar inference. Towards 
the automatic acquisition of a context-free grammar, we proposed some statistical techniques to 
group brackets in a bracketed corpus (with lexical tags), according to their local contextual infor- 
mation. Two measures, divergence and Bayesian posterior probability, were introduced to express 
the similarity between two brackets. Merging the most similar bracket pair iteratively, a set of 
label groups was constructed. To terminate a merging process at appropriate timing, we proposed 
differential entropy as a measure to represent the entropy difference before and after merging two 
brackets and stopl~ed the merging process at a large fluctuation. From the experimental results 
compared with the model solutions given by three human evaluators, we observed that divergence 
gave a better solution than Bayesian posterior probability. For divergence, we obtained 84 % recall 
and 67 % precision, and up to 90 ~ and 82 % when considering both positive and negative measures. 
We also investigated the fitness of using differential entropy for terminating the merging process by 
way of experiment and confirmed it. 
In this paper, we focus on only rules with lexical categories as their right hand side. As a further 
work, we are on the way to introduce the techniques introduced here to acquire the other rules(rules 
with nonterminal categories as their right hand side). At that time, it is also necessary for us to 
develop some suitable evaluation techniques for assessing the obtained grammar. 

References 
\[Aga95\] Agarwal, R.: Evaluation of Semantic Clusters, in Proceeding of 33th Annual Meeting of 
the AGL, pp. 284-286, 1995. 
\[Bri92\] BrlU, E.: Automatically Acquiring Phrase Structure using Distributional Analysis, in Pro¢. 
of Speech and Natural Language Workshop, pp. 155-159, 1992. 
\[EDR94\] EDR: Japan Electronic Dictionary Research Institute: EDR Electric Dictionary User's 
Manual (in Japanese), 2.1 edition, 1994. 
\[Har51\] Harris, Z.: Structural Linguistics, Chicago: University of Chicago Press, 1951. 
\[Hat93\] Hatziwassiloglou, V. and K. It. McKeown: Towards the Automatic Identification of Ad- 
jectival Scales: Clustering Adjectives according to Meaning, in Proceeding of 31st Annual 
Meeting of the ACL, pp. 172-182, 1993. 
\[Iwa95\] Iwayama, M. and T. Tokunaga: Hierarchical Bayesian Clustering for Automatic Text 
Classification, in IJCAI, pp. 1322-1327, 1995. 
\[Kiy94a\] Kiyono, M. and J. Tsujii: Combination of Symbolic and Statistical Approaches for Gram- 
matical Knowledge Acquisition, in Proc. of 4th Conference on Applied Natural Langnage 
Processing(ANLP'9,~), pp. 72-77, 1994. 
Kiyono, M. and J. Tsujii: Hypothesis Selection in Grammar Acquisition, in COLING.g~, 
pp. 837-841, 1994. 
Lari, K. and S. Young: "The Estimation of Stochastic Context-free Grammars Using the 
Inside-Outside Algorithm", Computer speech and languages, Vol. 4, pp. 35-56, 1990. 
Miller, S. and H. J. Fox: Automatic Grammar Acquisition, in Proc. of the Human Language 
Technology Workshop, pp. 268-271, 1994. 
Pereira, F. and Y. Schabes: Inside-Outside reestimatlon from partially bracketed corpora, 
in Proceeding of 30th Annual Meeting of the ACL, pp. 128-135, 1992. 
Pereira, F., N. Tishby, and L. Lee: Distributional Clustering of English Words, in Pro. 
ceeding of 31st Annual Meeting of the ACL, pp. 183-190, 1993. 
Rissanen, J.: Stochastic Complexity in Statistical Inquiry, World Scientific Publishing, 
1989. 
Shirai, K., T. Tokunaga, and H. Tanaka: Automatic Extraction of Japanese Grammar from 
a Bracketed Corpus, in Natural Language Processing Pacific Rim Symposium(NLPRS'gs), 
pp. 211-216, 1995. 
Swets, J.: Effectiveness of Information Retrieval Methods, American Documentation, 
Vol. 20, pp. 72-89, 1969. 
