Maximum Entropy Model Learning 
of Subcategorization Preference* 
Takehito Utsuro Takashi Miyata Yuji Matsumoto 
Graduate School of Information Science, 
Nara Institute of Science and Technology 
8916-5, Takayama-cho, Ikoma-shi, Nara, 630-01, JAPAN 
E-mail: {utsuro, takashi, matsu}@is.aist-nara.ac.jp
URL: http://cactus.aist-nara.ac.jp/staff/utsuro/home-e.html
Abstract 
This paper proposes a novel method for learning probabilistic models of the subcategorization
preference of verbs. In particular, we propose to consider the issues of case dependencies and noun
class generalization in a uniform way. We adopt the maximum entropy model learning method
and apply it to the task of learning a model of subcategorization preference. Case dependencies
and noun class generalization are represented as features in the maximum entropy approach.
The feature selection facility of maximum entropy model learning makes it possible to find
optimal case dependencies and optimal noun class generalization levels. We describe the results
of an experiment on learning probabilistic models of subcategorization preference from the EDR
Japanese bracketed corpus. We also evaluate the performance of the selected features and their
estimated parameters in the subcategorization preference task.
1 Introduction 
In corpus-based NLP, the extraction of linguistic knowledge such as lexical/semantic collocations is one
of the most important issues and has been intensively studied in recent years. In this line of research,
extracted lexical/semantic collocations are especially useful for ranking parses in syntactic
analysis as well as for the automatic construction of lexicons for NLP.
For example, in the context of syntactic disambiguation, Black (1993) and Magerman (1995)
proposed statistical parsing models based on decision-tree learning techniques, which incorporated
not only syntactic but also lexical/semantic information in the decision trees. As lexical/semantic
information, Black (1993) used about 50 semantic categories, while Magerman (1995) used lexical
forms of words. Collins (1996) proposed a statistical parser based on probabilities of
dependencies between head-words in the parse tree. In those works, lexical/semantic collocations
are used for ranking parses in syntactic analysis. They assume that syntactic and lexical/semantic
features are dependent on each other. In their models, syntactic and lexical/semantic
features are combined, so that each parameter depends on both syntactic and
lexical/semantic features.
On the other hand, in the context of automatic lexicon construction, the emphasis is mainly on
the extraction of lexical/semantic collocational knowledge of specific words rather than its use in
sentence parsing. For example, Haruno (1995) applied an information-theoretic data compression
technique to corpus-based case frame learning, and proposed a method of finding case frames of
verbs as a compressed representation of verb-noun collocational data in a corpus. That work concentrated
on the extraction of a declarative representation of case frames and did not consider their
performance in sentence parsing.
*The authors would like to thank Dr. Kentaro Inui and Mr. Kiyoaki Shirai of Tokyo Institute of Technology for
valuable information on implementing maximum entropy model learning. This research was partially supported by
the Ministry of Education, Science, Sports and Culture, Japan, Grant-in-Aid for Encouragement of Young Scientists,
09780338, 1997.
As in the case of the models of Black (1993), Magerman (1995), and Collins (1996), this paper
proposes a method of utilizing lexical/semantic features for ranking
parses in syntactic analysis. However, unlike the models of Black (1993), Magerman (1995), and
Collins (1996), we assume that syntactic and lexical/semantic features are independent.
We then focus on extracting lexical/semantic collocational knowledge of verbs that is useful in
syntactic analysis.
More specifically, we propose a novel method for learning a probabilistic model of the subcategorization
preference of verbs. In general, when learning lexical/semantic collocational knowledge of
verbs from a corpus, it is necessary to consider the following two issues:
1) Case dependencies 
2) Noun class generalization 
When considering 1), we have to decide which cases are dependent on each other and which cases
are optional and independent of other cases. When considering 2), we have to decide which superordinate
class generates each observed leaf class in the verb-noun collocation.
Several previous studies have addressed these two issues in learning collocational
knowledge of verbs and have also evaluated the results in terms of syntactic disambiguation. Resnik
(1993) and Li and Abe (1995) studied how to find an optimal abstraction level of an argument
noun in a tree-structured thesaurus. Although they evaluated the obtained abstraction level of
the argument noun by its performance in syntactic disambiguation, their work is limited to only
one argument. Li and Abe (1996) also studied a method for learning dependencies between case
slots and evaluated the discovered dependencies in the syntactic disambiguation task. They first
obtained optimal abstraction levels of the argument nouns by the method of Li and Abe (1995),
and then tried to discover dependencies between the class-based case slots. They reported that
dependencies were discovered only at the slot level and not at the class level.
In contrast to those previous works, this paper proposes to consider the above two issues
in a uniform way. First, we introduce a model of generating a collocation of a verb and its
argument/adjunct nouns and then view the model as a probabilistic model. As a model learning
method, we adopt the maximum entropy model learning method (Della Pietra, Della Pietra, and
Lafferty, 1997; Berger, Della Pietra, and Della Pietra, 1996) and apply it to the task of model
learning of subcategorization preference. Case dependencies and noun class generalization are
represented as features in the maximum entropy approach. In the maximum entropy approach, features
are allowed to overlap, and this is quite advantageous when we consider case dependencies and
noun class generalization in parameter estimation. The feature selection facility of the maximum
entropy model learning method also makes it possible to find an optimal set of features, i.e., optimal
case dependencies and optimal noun class generalization levels. We introduce several different
models according to the different assumptions on case dependencies. We describe the results of an experiment on
learning models of subcategorization preference from the EDR Japanese bracketed corpus (EDR,
1995). We also evaluate the performance of the selected features and their estimated parameters
in the subcategorization preference task.
2 A Model of Generating a Verb-Noun Collocation from Subcategorization Frame(s)
This section introduces a model of generating a verb-noun collocation from subcategorization 
frame(s). 
2.1 Data Structure 
2.1.1 Verb-Noun Collocation 
Verb-noun collocation is a data structure for the collocation of a verb and all of its argument/adjunct 
nouns. A verb-noun collocation e is represented by a feature structure which consists of the verb v 
and all the pairs of co-occurring case-markers p and thesaurus classes c of case-marked nouns: 
$$
e = \begin{bmatrix} \mathit{pred} : v \\ p_1 : c_1 \\ \vdots \\ p_k : c_k \end{bmatrix} \tag{1}
$$
We assume that a thesaurus is a tree-structured type hierarchy in which each node represents a
semantic class, and each thesaurus class $c_1, \ldots, c_k$ in a verb-noun collocation is a leaf class. We also
introduce $\preceq_c$ as the superordinate-subordinate relation of classes in a thesaurus: $c_1 \preceq_c c_2$ means
that $c_1$ is subordinate to $c_2$.¹
2.1.2 Subcategorization Frame 
A subcategorization frame s is represented by a feature structure which consists of a verb v and the 
pairs of case-markers p and sense restriction c of case-marked argument/adjunct nouns: 
$$
s = \begin{bmatrix} \mathit{pred} : v \\ p_1 : c_1 \\ \vdots \\ p_l : c_l \end{bmatrix} \tag{2}
$$
The sense restrictions $c_1, \ldots, c_l$ of case-marked argument/adjunct nouns are represented by classes at
arbitrary levels of the thesaurus. A subcategorization frame $s$ can be divided into two parts: one
is the verbal part $s_v$ containing the verb $v$, while the other is the nominal part $s_p$ containing all the
pairs of case-markers $p$ and sense restrictions $c$ of case-marked nouns:
$$
s = s_v \wedge s_p = \begin{bmatrix} \mathit{pred} : v \end{bmatrix} \wedge \begin{bmatrix} p_1 : c_1 \\ \vdots \\ p_l : c_l \end{bmatrix} \tag{3}
$$
2.1.3 Subsumption Relation 
We introduce the subsumption relation $\preceq_{sf}$ between a verb-noun collocation $e$ and a subcategorization frame $s$:

  $e \preceq_{sf} s$ iff for each case-marker $p_i$ in $s$ and its noun class $c_{si}$, there exists the same
  case-marker $p_i$ in $e$ and its noun class $c_{ei}$ is subordinate to $c_{si}$, i.e., $c_{ei} \preceq_c c_{si}$.

The subsumption relation $\preceq_{sf}$ also applies as a subsumption relation between two subcategorization
frames.
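As an illustration, the two relations can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation used in this paper: the thesaurus is a hypothetical toy parent map, the class names are invented for the example, the class relation is taken to be reflexive (a class is subordinate to itself), and the verbs of the two structures are required to match.

```python
# Hypothetical toy thesaurus: each class maps to its parent class (tree-structured).
PARENT = {
    "c_child": "c_hum", "c_hum": "c_ani",
    "c_juice": "c_bev", "c_bev": "c_liq",
    "c_park": "c_plc",
}

def is_subordinate(c1, c2):
    """True if c1 is c2 or c2 is an ancestor of c1 (assumed reflexive relation)."""
    while c1 is not None:
        if c1 == c2:
            return True
        c1 = PARENT.get(c1)
    return False

def subsumes(s, e):
    """True if the collocation (or frame) e is subsumed by the frame s."""
    if s["pred"] != e["pred"]:  # assumption: the verbs must be identical
        return False
    # Every case of s must occur in e with a subordinate noun class.
    return all(
        p in e["cases"] and is_subordinate(e["cases"][p], c_s)
        for p, c_s in s["cases"].items()
    )

e = {"pred": "nomu", "cases": {"ga": "c_child", "wo": "c_juice", "de": "c_park"}}
s = {"pred": "nomu", "cases": {"ga": "c_hum", "wo": "c_bev"}}
print(subsumes(s, e))
```

Note that the frame may mention fewer cases than the collocation: only the cases present in the frame are checked.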
2.2 Generating a Verb-Noun Collocation from Subcategorization Frame(s) 
Next, let us consider modeling the generation of a verb-noun collocation from a subcategorization 
frame. Especially, we describe the basic idea of incorporating case dependencies and noun class 
generalization into the model of generating a verb-noun collocation from a subcategorization frame. 
Suppose a verb-noun collocation e is given as: 
$$
e = \begin{bmatrix} \mathit{pred} : v \\ p_1 : c_{e1} \\ \vdots \\ p_k : c_{ek} \end{bmatrix}
$$
¹ Although we ignore sense ambiguities of case-marked nouns in the definitions of this section, in the current
implementation we deal with sense ambiguities of case-marked nouns by deciding that a class $c$ is superordinate to
an ambiguous leaf class $c_l$ if $c$ is superordinate to at least one of the possible unambiguous classes of $c_l$.
Then, we consider a subcategorization frame $s$ which can generate $e$ and assume that $s$ subsumes $e$:
$$
e \preceq_{sf} s
$$
We denote the generation of the verb-noun collocation $e$ from the subcategorization frame $s$ as:
$$
s \longrightarrow e \tag{4}
$$
2.2.1 Case Dependencies 
When considering a subcategorization frame which can generate a verb-noun collocation e, there 
are several possibilities of the case dependencies in the subcategorization frame. 
Consider the following example:
Example 1
Kodomo-ga kouen-de juusu-wo nomu.
child-NOM park-at juice-ACC drink
(A child drinks juice at the park.)
The verb-noun collocation is represented as the feature structure $e$ below:
$$
e = \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_c \\ \mathit{wo} : c_j \\ \mathit{de} : c_p \end{bmatrix} \tag{5}
$$
In this feature structure $e$, $c_c$, $c_p$, and $c_j$ represent the leaf classes (in the thesaurus) of the nouns
"kodomo(child)", "kouen(park)", and "juusu(juice)".
Next, we assume that the concepts "human", "place", and "beverage" are superordinate to
"kodomo(child)", "kouen(park)", and "juusu(juice)", respectively, and introduce the corresponding
classes $c_{hum}$, $c_{plc}$, and $c_{bev}$. Then, the following superordinate-subordinate relations hold:
$$
c_c \preceq_c c_{hum}, \quad c_p \preceq_c c_{plc}, \quad c_j \preceq_c c_{bev}
$$
Allowing these superordinate classes as sense restriction in subcategorization frames, let us consider 
several patterns of subcategorization frames each of which can generate the verb-noun collocation 
e. Those patterns of subcategorization frames vary according to the dependencies of cases within 
them. 
If the three cases "ga(NOM)", "wo(ACC)", and "de(at)" are dependent on each other and it
is not possible to find any division into several independent subcategorization frames, $e$ can be
regarded as generated from a subcategorization frame containing all of the three cases:
$$
\begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_{hum} \\ \mathit{wo} : c_{bev} \\ \mathit{de} : c_{plc} \end{bmatrix} \longrightarrow e \tag{6}
$$
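Otherwise, if only the two cases "ga(NOM)" and "wo(ACC)" are dependent on each other
and the "de(at)" case is independent of those two cases, $e$ can be regarded as generated from the
following two subcategorization frames independently:
$$
\begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_{hum} \\ \mathit{wo} : c_{bev} \end{bmatrix} \longrightarrow e, \qquad \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{de} : c_{plc} \end{bmatrix} \longrightarrow e \tag{7}
$$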
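Otherwise, if all the three cases "ga(NOM)", "wo(ACC)", and "de(at)" are independent of
each other, $e$ can be regarded as generated from the following three subcategorization frames
independently, each of which contains only one case:
$$
\begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_{hum} \end{bmatrix} \longrightarrow e, \qquad \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{wo} : c_{bev} \end{bmatrix} \longrightarrow e, \qquad \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{de} : c_{plc} \end{bmatrix} \longrightarrow e \tag{8}
$$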
2.2.2 Noun Class Generalization 
Similarly, when considering a subcategorization frame which can generate a verb-noun
collocation $e$, there are several possible noun class generalization levels for the sense
restrictions of the case-marked nouns.
For example, let us again consider Example 1. We assume that the concepts "animal" and "liquid"
are superordinate to "human" and "beverage", respectively, and introduce the corresponding
classes $c_{ani}$ and $c_{liq}$. Then, the following superordinate-subordinate relations hold:
$$
c_{hum} \preceq_c c_{ani}, \quad c_{bev} \preceq_c c_{liq}
$$
If we additionally allow these superordinate classes as sense restrictions in subcategorization frames,
we can consider several additional patterns of subcategorization frames which can generate the
verb-noun collocation $e$, along with those patterns described in the previous section.
Suppose that only the two cases "ga(NOM)" and "wo(ACC)" are dependent on each other and
the "de(at)" case is independent of those two cases, as in formula (7). Since the leaf class $c_c$
("child") can be generated from either $c_{hum}$ or $c_{ani}$, and the leaf class $c_j$ ("juice") can be
generated from either $c_{bev}$ or $c_{liq}$, $e$ can be regarded as generated according to any of four
formulas: the left-hand formula of (7) and the three formulas of (9):
$$
\begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_{ani} \\ \mathit{wo} : c_{bev} \end{bmatrix} \longrightarrow e, \qquad \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_{hum} \\ \mathit{wo} : c_{liq} \end{bmatrix} \longrightarrow e, \qquad \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_{ani} \\ \mathit{wo} : c_{liq} \end{bmatrix} \longrightarrow e \tag{9}
$$
2.3 Case Dependencies and the Design of the Generation Models 
As we described in the previous section, there are several possibilities of the case dependencies in 
a verb-noun collocation, and this results in the differences of the subcategorization frames which 
can generate the given verb-noun collocation. According to the different assumptions on the case 
dependencies, we can design several different models of generating a verb-noun collocation from 
subcategorization frame(s). 
2.3.1 Partial-Frame Model 
First, we make no assumption about the case dependencies in the given verb-noun collocation $e$, and
assume that any subcategorization frame $s$ which subsumes $e$ can generate $e$:
$$
e \preceq_{sf} s
$$
With this requirement, the subcategorization frame s does not have to have all the cases in e, but 
has to have only some part of the cases in e. We call the model satisfying this requirement the 
partial-frame model. All the examples of the formulas (6) and (9) satisfy this requirement and can 
be regarded as examples of the partial-frame model. 
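To make the size of this candidate space concrete, the following sketch enumerates every frame that subsumes a given collocation under the partial-frame model, by combining each non-empty subset of cases with each generalization level of the corresponding noun classes. It is a minimal sketch under assumptions: the parent map is a hypothetical toy thesaurus, leaf classes themselves are allowed as sense restrictions, and a frame is required to contain at least one case.

```python
from itertools import combinations, product

# Hypothetical toy thesaurus: each class maps to its parent class.
PARENT = {
    "c_child": "c_hum", "c_hum": "c_ani",
    "c_juice": "c_bev", "c_bev": "c_liq",
    "c_park": "c_plc",
}

def ancestors_or_self(c):
    """The class c followed by all of its superordinate classes."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT.get(c)
    return chain

def partial_frames(e):
    """All frames with a non-empty subset of e's cases whose classes generalize e's."""
    cases = sorted(e["cases"])
    frames = []
    for r in range(1, len(cases) + 1):
        for subset in combinations(cases, r):
            levels = [ancestors_or_self(e["cases"][p]) for p in subset]
            for classes in product(*levels):
                frames.append({"pred": e["pred"], "cases": dict(zip(subset, classes))})
    return frames

e = {"pred": "nomu", "cases": {"ga": "c_child", "wo": "c_juice", "de": "c_park"}}
frames = partial_frames(e)
print(len(frames))
```

With three, three, and two generalization levels for the three cases, the number of candidate frames is (1+3)(1+3)(1+2) - 1 = 47, which illustrates why feature selection is needed even for a single collocation.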
2.3.2 One-Frame Model 
Next, in addition to the requirement that $s$ subsumes $e$, we further assume that all the
cases in the given verb-noun collocation $e$ are dependent on each other and that a subcategorization
frame $s$ which can generate $e$ should have exactly the same cases as $e$ has:
$$
e = \begin{bmatrix} \mathit{pred} : v \\ p_1 : c_1 \\ \vdots \\ p_k : c_k \end{bmatrix} \preceq_{sf} s = \begin{bmatrix} \mathit{pred} : v \\ p_1 : c'_1 \\ \vdots \\ p_k : c'_k \end{bmatrix} \tag{10}
$$
We call the model satisfying this requirement the one-frame model. For example, supposing that
the verb-noun collocation $e$ in equation (5) is given, the example in formula (6) satisfies
this requirement.
2.3.3 Independent-Case Model
In addition to the requirement that $s$ subsumes $e$, we can also assume that all the cases
in the given verb-noun collocation $e$ are independent of each other and that a subcategorization
frame $s$ which has only one case of $e$ can generate $e$:
$$
e \preceq_{sf} s = \begin{bmatrix} \mathit{pred} : v \\ p_i : c'_i \end{bmatrix} \quad (1 \le i \le k)
$$
We call the model satisfying this requirement the independent-case model. For example, supposing
that the verb-noun collocation $e$ in equation (5) is given, the examples in formula
(8) satisfy this requirement.
2.3.4 Independent-Frame Model 
As can be seen in the definitions of the above three models, the basic idea of defining the model
of generating a verb-noun collocation from subcategorization frame(s) lies in identifying the
dependencies of the cases in the given verb-noun collocation and expressing the dependencies within
a subcategorization frame. Here, we briefly show a method of statistically identifying the
dependencies of the cases in verb-noun collocations from a corpus.² Then, by incorporating the identified
case dependencies into the generation model, we introduce a model of generating a verb-noun
collocation from a tuple of independent partial subcategorization frames. We call this model the
independent-frame model.
Partial Subcategorization Frame
Suppose a verb-noun collocation e is given as in the formula (10) and a subcategorization frame s 
satisfies the requirement of the one-frame model in section 2.3.2, i.e., as in the formula (10), s has 
exactly the same case-markers as e has, and s subsumes e. 
Then, we define a partial subcategorization frame $s_i$ of $s$ as a subcategorization frame which has
the same verb $v$ as $s$ as well as some of the case-markers of $s$ and their semantic classes. We
can then find a division of $s$ into a tuple $(s_1, \ldots, s_n)$ of partial subcategorization frames of $s$, where any
pair $s_i$ and $s_{i'}$ ($i \neq i'$) have no case-markers in common and the unification $s_1 \wedge \cdots \wedge s_n$ of all the
partial subcategorization frames equals $s$:
$$
s = s_1 \wedge \cdots \wedge s_n, \qquad s_i = \begin{bmatrix} \mathit{pred} : v \\ \vdots \\ p_{ij} : c_{ij} \\ \vdots \end{bmatrix}, \qquad \forall j \, \forall j' \;\; p_{ij} \neq p_{i'j'} \quad (i, i' = 1, \ldots, n, \; i \neq i') \tag{11}
$$
Independence of Partial Subcategorization Frames 
The conditional joint probability $p(s_1, \ldots, s_n \mid v)$ is estimated by summing up $p(e \mid v)$ over every $e$ that is
subsumed by all of $s_1, \ldots, s_n$ ($e \preceq_{sf} s_1, \ldots, s_n$):
$$
p(s_1, \ldots, s_n \mid v) \approx \sum_{e \, \preceq_{sf} \, s_1, \ldots, s_n} p(e \mid v) \tag{12}
$$
Then, we introduce a parameter $\alpha$ ($0 < \alpha < 1$) for relaxing the constraint of independence. Partial
subcategorization frames $s_1, \ldots, s_n$ are judged as independent if, for every subset $s_{i_1}, \ldots, s_{i_j}$ of $j$ of
these partial subcategorization frames ($j = 2, \ldots, n$), the following inequalities hold:
$$
\alpha \le \frac{p(s_{i_1}, \ldots, s_{i_j} \mid v)}{p(s_{i_1} \mid v) \cdots p(s_{i_j} \mid v)} \le \frac{1}{\alpha} \tag{13}
$$
This definition means that the condition for judging independence
becomes weaker as $\alpha$ decreases, while it becomes stricter as $\alpha$ increases.
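The independence test of inequality (13) can be sketched as follows. This is an illustrative sketch only: the function name, the representation of frames as hashable identifiers, and the probability table are hypothetical, and treating a zero denominator as "not independent" is an assumption of the sketch rather than something stated in the text.

```python
from itertools import combinations

def is_independent(frames, joint_prob, alpha):
    """Check inequality (13) for every subset of two or more partial frames.

    frames     : list of hashable frame identifiers
    joint_prob : function mapping a tuple of frames to p(s_i1, ..., s_ij | v)
    alpha      : relaxation parameter, 0 < alpha < 1
    """
    for j in range(2, len(frames) + 1):
        for subset in combinations(frames, j):
            denom = 1.0
            for s in subset:
                denom *= joint_prob((s,))
            if denom == 0.0:  # assumption: an unseen frame cannot be judged independent
                return False
            ratio = joint_prob(subset) / denom
            if not (alpha <= ratio <= 1.0 / alpha):
                return False
    return True

# Hypothetical probabilities for two partial frames A and B of one verb.
TABLE = {("A",): 0.5, ("B",): 0.4, ("A", "B"): 0.21}
lookup = lambda subset: TABLE[subset]

print(is_independent(["A", "B"], lookup, alpha=0.9))   # ratio 1.05 lies within [0.9, 1/0.9]
print(is_independent(["A", "B"], lookup, alpha=0.99))  # ratio 1.05 exceeds 1/0.99
```

The two calls show the effect described above: the same pair of frames passes the test for the smaller value of the parameter and fails it for the larger one.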
² Details of the method of statistically identifying the dependencies of the cases in verb-noun collocations are given
in Utsuro and Matsumoto (1997).
Generation from Independent Partial Subcategorization Frames 
Now, we denote the generation of $e$ from a tuple $(s_1, \ldots, s_n)$ of independent partial
subcategorization frames of $s$ as below:
$$
(s_1, \ldots, s_n) \longrightarrow e \tag{14}
$$
Example 
For example, suppose that a verb-noun collocation e is given as in the formula (5) in section 2.2.1. 
If the three cases in e are dependent on each other as in the generation of e in the formula (6), the 
generation of e is denoted as below in the case of the independent-frame model: 
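$$
\left( \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_{hum} \\ \mathit{wo} : c_{bev} \\ \mathit{de} : c_{plc} \end{bmatrix} \right) \longrightarrow e \tag{15}
$$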
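Otherwise, if only the two cases "ga(NOM)" and "wo(ACC)" are dependent on each other and
the "de(at)" case is independent of those two cases, as in the generation of $e$ in formula (7), the
generation of $e$ is denoted as below:
$$
\left( \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{ga} : c_{hum} \\ \mathit{wo} : c_{bev} \end{bmatrix}, \; \begin{bmatrix} \mathit{pred} : \mathit{nomu} \\ \mathit{de} : c_{plc} \end{bmatrix} \right) \longrightarrow e \tag{16}
$$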
3 Maximum Entropy Modeling 
This section gives a formal description of maximum entropy modeling (Della Pietra, Della Pietra,
and Lafferty, 1997; Berger, Della Pietra, and Della Pietra, 1996).
3.1 The Maximum Entropy Principle 
We consider a random process that produces an output value $y$, a member of a finite set $\mathcal{Y}$. In
generating $y$, the process may be influenced by some contextual information $x$, a member of a finite
set $\mathcal{X}$. Our task is to construct a stochastic model that accurately represents the behavior of the
random process. Such a model is a method of estimating the conditional probability that, given a
context $x$, the process will output $y$. We denote by $p(y \mid x)$ the probability that the model assigns
to $y$ in context $x$. We also denote by $\mathcal{P}$ the set of all conditional probability distributions. Thus a
model $p(y \mid x)$ is an element of $\mathcal{P}$.
To study the process, we observe the behavior of the random process by collecting a large
number of samples of the event $(x, y)$. We can summarize the training sample in terms of its
empirical probability distribution $\tilde{p}$, defined by:
$$
\tilde{p}(x, y) \equiv \frac{\mathit{freq}(x, y)}{\sum_{x, y} \mathit{freq}(x, y)} \tag{17}
$$
where $\mathit{freq}(x, y)$ is the number of times that the pair $(x, y)$ occurs in the sample.
Next, in order to express certain features of the whole event $(x, y)$, a binary-valued indicator
function is introduced and called a feature function. Usually, we suppose that there exists a large
collection $\mathcal{F}$ of candidate features, and include in the model only a subset $\mathcal{S}$ of the full set of
candidate features $\mathcal{F}$. We call $\mathcal{S}$ the set of active features. The choice of $\mathcal{S}$ must capture as much
information about the random process as possible, yet only include features whose expected values
can be reliably estimated. In this section and the next section, we assume that the set $\mathcal{S}$ of active
features can be found in some way. How to find $\mathcal{S}$ will be described in section 3.3.
Now, we assume that $\mathcal{S}$ contains $n$ feature functions. For each feature $f_i$ ($\in \mathcal{S}$), the sets $V_{xi}$ and
$V_{yi}$ will be given for indicating the sets of the values of $x$ and $y$ for that feature. According to those
sets, each feature function $f_i$ will be defined as follows:
$$
f_i(x, y) = \begin{cases} 1 & \text{if } x \in V_{xi} \text{ and } y \in V_{yi} \\ 0 & \text{otherwise} \end{cases} \tag{18}
$$
When we discover a feature that we feel is useful, we can acknowledge its importance by requiring
that our model accord with the feature's empirical distribution. In the maximum entropy modeling
approach, this is done by constraining the expected value of each $f_i$ with respect to the model
$p(y \mid x)$ (left-hand side) to be the same as that of $f_i$ in the training sample (right-hand side):
$$
\sum_{x, y} \tilde{p}(x) \, p(y \mid x) \, f_i(x, y) = \sum_{x, y} \tilde{p}(x, y) \, f_i(x, y) \quad \text{for } \forall f_i \in \mathcal{S} \tag{19}
$$
This requirement is called a constraint equation. It means that we would like $p$ to
lie in the subset of $\mathcal{P}$ whose members satisfy these constraints.
Then, among the possible models $p$, the philosophy of the maximum entropy modeling approach
is that we should select the most uniform distribution. A mathematical measure of the uniformity
of a conditional distribution $p(y \mid x)$ is provided by the conditional entropy:
$$
H(p) \equiv - \sum_{x, y} \tilde{p}(x) \, p(y \mid x) \log p(y \mid x) \tag{20}
$$
Now, we present the principle of maximum entropy:
Maximum Entropy Principle
To select a model from a set of allowed probability distributions, choose the model $p_*$
with maximum entropy $H(p)$:
$$
p_* = \operatorname*{argmax}_{p} H(p) \tag{21}
$$
3.2 Parameter Estimation 
It can be shown that there always exists a unique model $p_*$ with maximum entropy in any
constrained set. According to Della Pietra, Della Pietra, and Lafferty (1997) and Berger, Della Pietra,
and Della Pietra (1996), the solution can be found as the following $p_\lambda(y \mid x)$ of the form of the
exponential family:
$$
p_\lambda(y \mid x) = \frac{\exp\left( \sum_i \lambda_i f_i(x, y) \right)}{\sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)} \tag{22}
$$
where a parameter $\lambda_i$ is introduced for each feature $f_i$.
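For concreteness, the exponential form of formula (22) can be evaluated directly when the output set is small enough to enumerate. The sketch below assumes exactly that; the function name and the toy feature are hypothetical and serve only to illustrate the normalization over outputs.

```python
import math

def conditional_prob(y, x, outputs, features, lambdas):
    """p_lambda(y | x) for binary feature functions f(x, y) and weights lambdas."""
    def unnormalized(y_):
        return math.exp(sum(lam * f(x, y_) for f, lam in zip(features, lambdas)))
    z = sum(unnormalized(y_) for y_ in outputs)  # normalization over the finite output set
    return unnormalized(y) / z

# Hypothetical example: one feature that fires when the output is "a".
f_a = lambda x, y: 1 if y == "a" else 0
outputs = ["a", "b", "c"]
p_a = conditional_prob("a", "v", outputs, [f_a], [math.log(3.0)])
print(round(p_a, 6))
```

With the weight set to log 3, the output "a" receives unnormalized mass 3 against 1 for each of the other two outputs, so its probability is 3/5.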
Della Pietra, Della Pietra, and Lafferty (1997) and Berger, Della Pietra, and Della Pietra (1996)
also presented an optimization method for estimating the parameter values $\lambda_i^*$ that maximize the
entropy, which is called the Improved Iterative Scaling (IIS) algorithm.
3.3 Feature Selection 
Given the full set $\mathcal{F}$ of candidate features, this section outlines how to select an appropriate subset
$\mathcal{S}$ of active features. The feature selection process is an incremental procedure that builds up $\mathcal{S}$ by
successively adding features. At each step, we select the candidate feature which, when adjoined to
the set of active features $\mathcal{S}$, produces the greatest increase in the log-likelihood of the training sample.³
³ It is shown in Della Pietra, Della Pietra, and Lafferty (1997) and Berger, Della Pietra, and Della Pietra (1996)
that the model $p_*$ with maximum entropy $H(p)$ is the model in the parametric family $p_\lambda(y \mid x)$ of formula (22)
that maximizes the likelihood of the training sample $\tilde{p}$.
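The incremental procedure can be illustrated with a deliberately simple sketch. It is not the algorithm of the cited papers: parameters are fitted here by plain gradient ascent on the log-likelihood rather than by IIS, and each candidate is evaluated by fully refitting the model instead of by an approximate gain. All names and the toy data are hypothetical.

```python
import math

def model_probs(x, outputs, feats, lam):
    """Exponential-family distribution over outputs for context x."""
    w = {y: math.exp(sum(l * f(x, y) for f, l in zip(feats, lam))) for y in outputs}
    z = sum(w.values())
    return {y: w[y] / z for y in outputs}

def fit(events, outputs, feats, iters=300, lr=1.0):
    """Gradient ascent on the average log-likelihood (a stand-in for IIS)."""
    lam = [0.0] * len(feats)
    for _ in range(iters):
        grad = [0.0] * len(feats)
        for x, y in events:
            probs = model_probs(x, outputs, feats, lam)
            for i, f in enumerate(feats):
                # empirical feature value minus its expectation under the model
                grad[i] += f(x, y) - sum(probs[y2] * f(x, y2) for y2 in outputs)
        lam = [l + lr * g / len(events) for l, g in zip(lam, grad)]
    return lam

def log_likelihood(events, outputs, feats, lam):
    return sum(math.log(model_probs(x, outputs, feats, lam)[y]) for x, y in events)

def select_features(events, outputs, candidates, k):
    """Greedily add the candidate giving the largest log-likelihood after refitting."""
    active = []
    for _ in range(k):
        best = None
        for name in candidates:
            if name in active:
                continue
            feats = [candidates[n] for n in active + [name]]
            ll = log_likelihood(events, outputs, feats, fit(events, outputs, feats))
            if best is None or ll > best[0]:
                best = (ll, name)
        if best is None:  # no candidates left
            break
        active.append(best[1])
    return active

# Hypothetical sample: output "a" six times, "b" three times, "c" once.
events = [("v", "a")] * 6 + [("v", "b")] * 3 + [("v", "c")]
outputs = ["a", "b", "c"]
candidates = {
    "f_a": lambda x, y: 1.0 if y == "a" else 0.0,
    "f_b": lambda x, y: 1.0 if y == "b" else 0.0,
}
print(select_features(events, outputs, candidates, 2))
```

In this toy sample the feature for "a" is selected first, because matching the frequency of "a" moves the model further from the uniform distribution than matching the frequency of "b" does.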
4 Maximum Entropy Model Learning of Subcategorization Preference
This section describes how to apply the maximum entropy modeling approach to the task of model
learning of subcategorization preference.
4.1 Events
In our task of model learning of subcategorization preference, each event $(x, y)$ in the training sample
is a verb-noun collocation $e$, which is defined as in formula (1). Like a subcategorization
frame, a verb-noun collocation $e$ can be divided into two parts: one is the verbal part $e_v$ containing
the verb $v$, while the other is the nominal part $e_p$ containing all the pairs of case-markers $p$ and
thesaurus leaf classes $c$ of case-marked nouns:
$$
e = e_v \wedge e_p = \begin{bmatrix} \mathit{pred} : v \end{bmatrix} \wedge \begin{bmatrix} p_1 : c_1 \\ \vdots \\ p_k : c_k \end{bmatrix}
$$
Then, we define the context $x$ of an event $(x, y)$ as the verb $v$ and the output $y$ as the nominal part
$e_p$ of $e$, and each event in the training sample is denoted as $(v, e_p)$:
$$
x \equiv v, \qquad y \equiv e_p
$$
4.2 Features
Each (partial) subcategorization frame is represented as a feature in the maximum entropy modeling
approach. In the case of the partial-frame/one-frame/independent-case models in sections 2.3.1 to
2.3.3, a binary-valued feature function $f_s(v, e_p)$ is defined for each subcategorization frame $s$. In
the case of the independent-frame model in section 2.3.4, a binary-valued feature function $f_{s_i}(v, e_p)$
is defined for each partial subcategorization frame $s_i$ in the tuple of formula (14). Each feature
function $f$ has its own parameter $\lambda$, which is also the parameter of the corresponding (partial)
subcategorization frame. According to the possible variations of case dependencies and noun class
generalization, we consider every possible pattern of subcategorization frames which can generate
a verb-noun collocation, and then construct the full set $\mathcal{F}$ of candidate features.
In the following, we give formal definitions of the features in each of the
partial-frame/one-frame/independent-case/independent-frame models which we introduced in section 2.3.
4.2.1 Partial-Frame Model
Each feature function corresponds to a subcategorization frame $s$. For each subcategorization frame
$s$, a binary-valued feature function $f_s(v, e_p)$ is defined to be true if and only if the given verb-noun
collocation $e$ is subsumed by $s$:
$$
f_s(v, e_p) = \begin{cases} 1 & \text{if } e = ([\mathit{pred} : v] \wedge e_p) \preceq_{sf} s \\ 0 & \text{otherwise} \end{cases}
$$
4.2.2 One-Frame Model 
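Each feature function corresponds to a subcategorization frame $s$ which has exactly the same cases
as the given verb-noun collocation $e$ has. For each subcategorization frame $s$, a binary-valued
feature function $f_s(v, e_p)$ is defined to be true if and only if the given verb-noun collocation $e$ has
exactly the same cases as $s$ has and is also subsumed by $s$:
$$
e = \begin{bmatrix} \mathit{pred} : v \\ p_1 : c_1 \\ \vdots \\ p_k : c_k \end{bmatrix}, \quad s = \begin{bmatrix} \mathit{pred} : v \\ p_1 : c'_1 \\ \vdots \\ p_k : c'_k \end{bmatrix}, \quad f_s(v, e_p) = \begin{cases} 1 & \text{if } e = ([\mathit{pred} : v] \wedge e_p) \preceq_{sf} s \\ 0 & \text{otherwise} \end{cases}
$$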
4.2.3 Independent-Case Model 
Each feature function corresponds to a subcategorization frame $s$ which has only one case of the
given verb-noun collocation $e$. For each subcategorization frame $s$ which has only one case, a
binary-valued feature function $f_s(v, e_p)$ is defined to be true if and only if the given verb-noun
collocation $e$ has the same case and is also subsumed by $s$:
$$
e = \begin{bmatrix} \mathit{pred} : v \\ p_1 : c_1 \\ \vdots \\ p_k : c_k \end{bmatrix}, \quad s = \begin{bmatrix} \mathit{pred} : v \\ p_i : c'_i \end{bmatrix} \; (1 \le i \le k), \quad f_s(v, e_p) = \begin{cases} 1 & \text{if } e = ([\mathit{pred} : v] \wedge e_p) \preceq_{sf} s \\ 0 & \text{otherwise} \end{cases}
$$
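The feature definitions of the partial-frame, one-frame, and independent-case models differ only in which frames are admitted, which the following sketch makes explicit. It is a hedged illustration rather than the authors' implementation: the parent map is a hypothetical toy thesaurus, the nominal part of an event is represented as a dictionary from case-markers to leaf classes, and all function names are invented for the example.

```python
# Hypothetical toy thesaurus: each class maps to its parent class.
PARENT = {
    "c_child": "c_hum", "c_hum": "c_ani",
    "c_juice": "c_bev", "c_bev": "c_liq",
    "c_park": "c_plc",
}

def is_subordinate(c1, c2):
    """True if c1 is c2 or c2 is an ancestor of c1 (assumed reflexive)."""
    while c1 is not None:
        if c1 == c2:
            return True
        c1 = PARENT.get(c1)
    return False

def subsumed_by(v, e_p, s):
    """True if the event (v, e_p) is subsumed by the frame s."""
    return v == s["pred"] and all(
        p in e_p and is_subordinate(e_p[p], c) for p, c in s["cases"].items()
    )

def partial_frame_feature(s):
    """Fires whenever s subsumes the event."""
    return lambda v, e_p: int(subsumed_by(v, e_p, s))

def one_frame_feature(s):
    """Fires only if s has exactly the same cases as the event and subsumes it."""
    return lambda v, e_p: int(set(e_p) == set(s["cases"]) and subsumed_by(v, e_p, s))

def independent_case_feature(s):
    """Same test as the partial-frame feature, restricted to single-case frames."""
    if len(s["cases"]) != 1:
        raise ValueError("independent-case frames have exactly one case")
    return partial_frame_feature(s)

e_p = {"ga": "c_child", "wo": "c_juice", "de": "c_park"}
s_two = {"pred": "nomu", "cases": {"ga": "c_hum", "wo": "c_bev"}}
s_all = {"pred": "nomu", "cases": {"ga": "c_hum", "wo": "c_bev", "de": "c_plc"}}
s_de = {"pred": "nomu", "cases": {"de": "c_plc"}}

print(partial_frame_feature(s_two)("nomu", e_p), one_frame_feature(s_two)("nomu", e_p))
```

The two-case frame fires as a partial-frame feature but not as a one-frame feature, since the event has a third case that the frame does not mention.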
4.2.4 Independent-Frame Model
Each feature function corresponds to a partial subcategorization frame $s_i$ in the tuple of
independent partial subcategorization frames which can generate the given verb-noun collocation. First,
for the given verb-noun collocation $e$, tuples of independent partial subcategorization frames which
can generate $e$ are collected into the set $SF(e)$ as below:⁴ ⁵
$$
SF(e) \equiv \left\{ (s_1, \ldots, s_n) \mid (s_1, \ldots, s_n) \longrightarrow e \right\}
$$
Then, for each partial subcategorization frame $s$, a binary-valued feature function $f_s(v, e_p)$ is
defined to be true if and only if at least one element of the set $SF(e)$ is a tuple $(s_1, \ldots, s, \ldots, s_n)$
that contains $s$:
$$
f_s(v, e_p) = \begin{cases} 1 & \text{if } \exists (s_1, \ldots, s, \ldots, s_n) \in SF(e = ([\mathit{pred} : v] \wedge e_p)) \\ 0 & \text{otherwise} \end{cases} \tag{24}
$$
4.3 Parameter Estimation 
Let $\mathcal{E}$ be the training corpus consisting of training events of the form $(v, e_p)$. Let $\mathcal{F}$ be the full
set of candidate features, each element of which corresponds to a possible subcategorization frame.
Then, given the empirical distribution $\tilde{p}(v, e_p)$ of the training sample, the set $\mathcal{S}$ ($\subseteq \mathcal{F}$) of active
features is found according to the feature selection algorithm in section 3.3, and the parameters of
subcategorization frames are estimated according to the IIS algorithm (Della Pietra, Della Pietra, and
Lafferty, 1997; Berger, Della Pietra, and Della Pietra, 1996). Finally, the conditional probability
distribution $p_{\mathcal{S}}(e_p \mid v)$ is estimated:
$$
p_{\mathcal{S}}(e_p \mid v) = \frac{\exp\left( \sum_{f_i \in \mathcal{S}} \lambda_i f_i(v, e_p) \right)}{\sum_{e_p} \exp\left( \sum_{f_i \in \mathcal{S}} \lambda_i f_i(v, e_p) \right)} \tag{25}
$$
4.4 Subcategorization Preference in Parsing a Sentence 
Suppose that, after estimating the parameters of subcategorization preference from the training corpus
$\mathcal{E}$ of verb-noun collocations, we obtain the set $\mathcal{S}$ of active features and the model $p_{\mathcal{S}}(e_p \mid v)$
incorporating these features. Now, we describe how to rank the parse trees of a given input sentence
according to the estimated parameters of subcategorization preference of verbs.
⁴ More precisely, for a tuple $(s_1, \ldots, s_n)$ of independent partial subcategorization frames to be included in the
set $SF(e)$, the following requirement has to be satisfied: it is not possible to divide any of the partial frames
$s_1, \ldots, s_n$ into more than one frame and to construct a finer-grained tuple of independent
partial subcategorization frames.
⁵ When applying the learned probabilistic model to a held-out test event $e^{ts}$, the independence of the partial
subcategorization frames is judged using the probabilities of partial subcategorization frames estimated from the training
data (as described in section 2.3.4), and then the set $SF(e^{ts})$ is constructed.
4.4.1 Basic Model 
Let $w$ be the given input sentence, $T(w)$ be the set of parse trees of $w$, $t$ be a parse tree in $T(w)$, and
$E(t)$ be the set of verb-noun collocations contained in $t$. Then, each parse tree is assigned the
product of all the conditional probabilities $p_{\mathcal{S}}(e_p^{ts} \mid v)$ of the verb-noun collocations $(v, e_p^{ts})$ within it,
which is denoted by $\phi(t)$:
$$
\phi(t) = \prod_{(v, e_p^{ts}) \in E(t)} p_{\mathcal{S}}(e_p^{ts} \mid v)
$$
A parse tree $t$ ($\in T(w)$) with the greatest value of $\phi(t)$ is chosen as the best parse tree $\hat{t}$ of $w$:
$$
\hat{t} = \operatorname*{argmax}_{t \in T(w)} \phi(t)
$$
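The basic ranking model amounts to a product followed by an argmax, as the sketch below shows. It is illustrative only: the probability table, the parse names, and the representation of a nominal part as a tuple of (case-marker, class) pairs are hypothetical, and the product is computed as a sum of logarithms on the assumption that every probability is strictly positive, as it is for an exponential model.

```python
import math

def log_phi(collocations, prob):
    """Log of the product of p(e_p | v) over the collocations of one parse tree."""
    return sum(math.log(prob(v, e_p)) for v, e_p in collocations)

def best_parse(parses, prob):
    """Name of the parse tree with the greatest product of probabilities."""
    return max(parses, key=lambda t: log_phi(parses[t], prob))

# Hypothetical model probabilities for two readings of the same sentence.
TABLE = {
    ("nomu", (("ga", "c_child"), ("wo", "c_juice"))): 0.2,
    ("nomu", (("ga", "c_juice"), ("wo", "c_child"))): 0.01,
}
prob = lambda v, e_p: TABLE[(v, e_p)]

parses = {
    "t1": [("nomu", (("ga", "c_child"), ("wo", "c_juice")))],
    "t2": [("nomu", (("ga", "c_juice"), ("wo", "c_child")))],
}
print(best_parse(parses, prob))
```

Working in log space changes nothing about the ranking but avoids numerical underflow when a parse tree contains many collocations.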
4.4.2 Heuristics of Case Covering 
Along with the estimated conditional probabilities $p_{\mathcal{S}}(e_p^{ts} \mid v)$ and the basic model above, we
consider a heuristic concerning the covering of the cases of verb-noun collocations, described below, and evaluate
its effectiveness in the experiments of the next section.
Let $(v, e_p)$ be a test event which is not included in the training corpus $\mathcal{E}$ (i.e., $(v, e_p) \notin \mathcal{E}$).
The subcategorization preference of test events is determined according to whether each case $p$ (and
the leaf class marked by $p$) of $e_p$ is covered by at least one feature in $\mathcal{S}$.
More formally, we introduce the case covering relation $\preceq_{cv}$ between a verb-noun collocation $(v, e_p)$ and a
feature set $\mathcal{S}$:

  $(v, e_p) \preceq_{cv} \mathcal{S}$ iff for each case $p$ (and the leaf class $c_l$ marked by $p$) of $e_p$, at least one
  subcategorization frame corresponding to a feature in $\mathcal{S}$ has the
  same case $p$ and its sense restriction $c_s$ subsumes $c_l$, i.e., $c_l \preceq_c c_s$.
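According to this factor, $(v_1, e_{p1})$ is preferred to $(v_2, e_{p2})$ if and only if the following condition
holds, namely that the first event is covered by the feature set and the second is not:
$$
(v_1, e_{p1}) \preceq_{cv} \mathcal{S} \quad \text{and} \quad (v_2, e_{p2}) \not\preceq_{cv} \mathcal{S}
$$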
Ranking Parse Trees 
This heuristic can also be incorporated into ranking the parse trees of a given input sentence.
Let $w$ be the given input sentence, $T(w)$ be the set of parse trees of $w$, $t$ be a parse tree in
$T(w)$, and $E(t)$ be the set of verb-noun collocations contained in $t$. Let $E_{cv}(t)$ ($\subseteq E(t)$) be the set
of verb-noun collocations $(v, e_p)$ for which $(v, e_p) \preceq_{cv} \mathcal{S}$ holds, and $E_{ncv}(t)$ ($\subseteq E(t)$) be the set
of verb-noun collocations $(v, e_p)$ for which $(v, e_p) \preceq_{cv} \mathcal{S}$ does not hold. Then, the subcategorization
preference of parse trees is determined as follows: $t_1$ is preferred to $t_2$ if and only if one of the
following conditions (i) to (iii) holds:
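    (i)   $|E^S_{cv}(t_1)| > |E^S_{cv}(t_2)|$
    (ii)  $|E^S_{cv}(t_1)| = |E^S_{cv}(t_2)|$, and
          $\prod_{(v, e_p) \in E^S_{cv}(t_1)} p_S(e_p \mid v) > \prod_{(v, e_p) \in E^S_{cv}(t_2)} p_S(e_p \mid v)$
    (iii) $|E^S_{cv}(t_1)| = |E^S_{cv}(t_2)|$,
          $\prod_{(v, e_p) \in E^S_{cv}(t_1)} p_S(e_p \mid v) = \prod_{(v, e_p) \in E^S_{cv}(t_2)} p_S(e_p \mid v)$, and
          $\prod_{(v, e_p) \in E^S_{ncv}(t_1)} p_S(e_p \mid v) > \prod_{(v, e_p) \in E^S_{ncv}(t_2)} p_S(e_p \mid v)$
5 Experiments and Evaluation
5.1 Corpus and Thesaurus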
As the training and test corpus, we used the EDR Japanese bracketed corpus (EDR, 1995), which
contains about 210,000 sentences collected from newspaper and magazine articles. From the EDR
corpus, we extracted 153,014 verb-noun collocations of 835 verbs which appear more than 50 times
in the corpus. These verb-noun collocations contain about 270 case-markers. We constructed the
training set $\mathcal{E}$ from these 153,014 verb-noun collocations.

Table 1: Examples of Selected Features for "kau (buy, incur)" (Independent-Frame Model ($\alpha$ = 0.9))

Order  Feature                     Noun Class / Example Nouns                          # of events
-----  --------------------------  --------------------------------------------------  -----------
First 10 Selected Features
1      wo(ACC):1404                kippu(tickets), shouken(bills)                      22
2      wo(ACC):1524                tochi(land)                                         16
3      wo(ACC):1553                kabu(stock)                                         23
4      wo(ACC):14                  Products                                            158
5      wo(ACC):1196                Currency, Unit                                      32
6      wo(ACC):1301                ikari(anger)                                        9
7      wo(ACC):n51                 hanpatsu(repulsion)                                 22
8      wo(ACC):1462                Electronic Products                                 9
9      wo(ACC):1451                Container                                           2
10     wo(ACC):1302                hankan(enmity)                                      8
First 5 Selected Features with More Than One Case
30     ga(NOM):1259, wo(ACC):13    ga(NOM):Country, wo(ACC):kokusai(government loan)   2
53     ni(for):1200, wo(ACC):14    ni(for):watashi(I), wo(ACC):Products                1
54     ni(for):121, wo(ACC):145    (illegible)                                         1
61     ni(for):12, wo(ACC):1404    ni(for):Human, wo(ACC):kippu(tickets)               1
62     ni(for):1205, wo(ACC):140   ni(for):kodomo(child), wo(ACC):Products             1
We used 'Bunrui Goi Hyou' (BGH) (NLRI, 1993) as the Japanese thesaurus. BGH has a six-layered
abstraction hierarchy, and more than 60,000 words are assigned at its leaves; its nominal
part contains about 45,000 words. Five classes are allocated at the next level from the root node.
5.2 Feature Selection and Parameter Estimation 
We conduct the feature selection procedure of section 3.3 and the parameter estimation procedure
of section 3.2 under the following conditions: i) we limit the noun class generalization level of each
feature to those which are above level 5 from the root node in the thesaurus; ii) since verbs are
independent of each other in our model learning framework, we collect the verb-noun collocations of
each verb into a separate training data set and conduct the model learning procedure for each verb separately.
For each verb, the size of the training data set is about 200 to 500. The size of the set of
candidate features varies according to the model: 200 to 400 for the independent-case model, 500
to 1,300 for the one-frame and independent-frame (independence parameter $\alpha$ = 0.5/0.9) models, and 650
to 1,550 for the partial-frame model. In the independent-case model, each feature corresponds to a
subcategorization frame with only one case, while in the one-frame/independent-frame/partial-frame
models, each feature corresponds to a subcategorization frame with any number of cases.
This is why the set of candidate features is much smaller in the independent-case model
than in the other models. In the one-frame/independent-frame models, more restrictions are put on
the definition of features than in the partial-frame model, and so the sets of candidate
features are somewhat smaller.
Examples of Selected Features 
For the Japanese verb "kau (buy, incur)", Table 1 shows examples of the selected features for
the independent-frame model (independence parameter $\alpha$ = 0.9). The table shows the first 10 selected
features, as well as the first 5 selected features corresponding to (partial) subcategorization frames with
more than one case. In the tables, each feature is represented as the corresponding
(partial) subcategorization frame, which consists of pairs of a case-marking particle and the noun
class restriction of the case. Each noun class restriction is represented as a Japanese noun class of
the BGH thesaurus. Noun classes of the BGH thesaurus are represented as numerical codes, in which each
digit denotes the choice of the branch in the thesaurus. The classes starting with '11', '12', '13',
'14', and '15' are subordinate to abstract-relations, agents-of-human-activities, human-activities,
products, and natural-objects-and-natural-phenomena, respectively. Each table consists of the order
of the feature, the feature itself (which is represented as a (partial) subcategorization frame), noun
class descriptions or example nouns in the (partial) subcategorization frames, and the number of
the training verb-noun collocations for which the feature function returns true.
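As an illustration of how such numerical codes can be read, the following Python sketch maps a class code to its top-level class and lists the more general classes obtained by truncating the code. It assumes that generalization corresponds to taking prefixes of the code, starting from the two-digit codes listed above; the names `TOP_LEVEL`, `top_level_class`, and `generalizations` are hypothetical.

```python
from typing import List, Optional

# Two-digit prefixes of the nominal part and the classes they are subordinate to.
TOP_LEVEL = {
    "11": "abstract-relations",
    "12": "agents-of-human-activities",
    "13": "human-activities",
    "14": "products",
    "15": "natural-objects-and-natural-phenomena",
}


def top_level_class(code: str) -> Optional[str]:
    """Return the top-level class name for a class code, or None if the prefix is unknown."""
    return TOP_LEVEL.get(code[:2])


def generalizations(code: str) -> List[str]:
    """Return the code's prefixes from the two-digit class down to the code itself."""
    return [code[:n] for n in range(2, len(code) + 1)]
```

For example, the feature `wo(ACC):1404` in Table 1 restricts the accusative case to a class below `14` (products), and `generalizations("1404")` lists the candidate restriction levels `14`, `140`, and `1404` under this prefix assumption.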
Since about 75% of the verb-noun collocations in the training set have only one case-marked
noun, all of the first 10 selected features have only one case in both the independent-frame and
partial-frame models. However, the two models differ in the orders of the first 5 selected features
with more than one case. In the partial-frame model, those 5 features are selected much earlier
than in the independent-frame model. In the partial-frame model, fewer restrictions are put on the
definitions of features than in the independent-frame model. Therefore, in the partial-frame model,
the feature functions corresponding to (partial) subcategorization frames with more than one case
tend to return true for more verb-noun collocations than in the independent-frame model.
5.3 Evaluation of Subcategorization Preference 
5.3.1 Evaluation Method 
We evaluate the performance of the selected features and their estimated parameters in the following
subcategorization preference task. Suppose that the following word sequence represents a verb-final
Japanese sentence with a subordinate clause, where $N_x, \ldots, N_{2k}$ are nouns, $p_x, \ldots, p_{2k}$ are
case-marking post-positional particles, $v_1$ and $v_2$ are verbs, and the first verb $v_1$ is the head verb of the
subordinate clause.
    $N_x$-$p_x$-$N_{11}$-$p_{11}$-$\cdots$-$N_{1l}$-$p_{1l}$-$v_1$-$N_{21}$-$p_{21}$-$\cdots$-$N_{2k}$-$p_{2k}$-$v_2$
We consider the subcategorization ambiguity of the post-positional phrase $N_x$-$p_x$, i.e., whether
$N_x$-$p_x$ is subcategorized for by $v_1$ or $v_2$.
We use held-out verb-noun collocations of the verbs $v_1$ and $v_2$ which are not used in the training.
They are like the verb-noun collocations on the left side below. Next, we generate erroneous
verb-noun collocations of $v_1$ and $v_2$, like those on the right side below, by choosing a case element $p_x : N_x$
at random and moving it from $v_1$ to $v_2$.
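    Correct (left)                                              Erroneous (right)
    [ pred : v_1, p_x : N_x, p_11 : N_11, ..., p_1l : N_1l ]    [ pred : v_1, p_11 : N_11, ..., p_1l : N_1l ]
    [ pred : v_2, p_21 : N_21, ..., p_2k : N_2k ]               [ pred : v_2, p_21 : N_21, ..., p_2k : N_2k, p_x : N_x ]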
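The generation of an erroneous pair can be sketched in Python as below. The sketch assumes a simple representation in which a verb-noun collocation is a verb together with a tuple of (particle, noun) case elements; the `Collocation` class and the `make_erroneous_pair` function are hypothetical names introduced only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

Case = Tuple[str, str]  # (case-marking particle, noun), e.g. ("wo", "kippu")


@dataclass(frozen=True)
class Collocation:
    pred: str                # the verb
    cases: Tuple[Case, ...]  # its case elements


def make_erroneous_pair(c1: Collocation, c2: Collocation,
                        rng: Optional[random.Random] = None
                        ) -> Tuple[Collocation, Collocation]:
    """Move one randomly chosen case element from c1 (verb v1) to c2 (verb v2)."""
    if not c1.cases:
        raise ValueError("c1 has no case element to move")
    rng = rng or random.Random()
    i = rng.randrange(len(c1.cases))
    moved = c1.cases[i]
    e1 = Collocation(c1.pred, c1.cases[:i] + c1.cases[i + 1:])
    e2 = Collocation(c2.pred, c2.cases + (moved,))
    return e1, e2
```

Passing a seeded `random.Random` instance makes the generated test set reproducible; the input collocations are left unchanged because the dataclass is frozen and new tuples are built.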
Then, we compare the products $\phi(t)$ (in equation (26)) of the conditional probabilities of the
constituent verb-noun collocations between the correct and the erroneous pairs, and calculate the
rate of selecting the correct pair. We measure the following three types of precision: i) the precision
$r_b$ of the basic model in section 4.4.1; ii) the precision $r_h$ when incorporating the heuristics in section
4.4.2; iii) the precision $r_c$ of those verb-noun collocations which satisfy the case covering relation
$\preceq_{cv}$ with the set $S$ of active features, i.e., we collect the verb-noun collocations $(v_1, e_{p1})$
and $(v_2, e_{p2})$ of the verbs $v_1$ and $v_2$ which satisfy the case covering relation $(v_1, e_{p1}), (v_2, e_{p2}) \preceq_{cv} S$,
and calculate the precision $r_c$ over them.
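A minimal Python sketch of how the precision $r_h$ might be computed is given below. It assumes that conditions (i) to (iii) of section 4.4.2 compare, in this order, the number of case-covered collocations, the product of probabilities over the covered collocations, and the product over the non-covered ones, so that the preference reduces to a lexicographic comparison of three values. The functions `prob` and `covered` are assumed to be supplied by the caller, and all names are hypothetical.

```python
from typing import Callable, Hashable, Sequence, Tuple

Collocation = Tuple[str, Hashable]  # (v, e_p)
Candidate = Sequence[Collocation]   # the collocations of one candidate analysis


def preference_key(cand: Candidate,
                   prob: Callable[[str, Hashable], float],
                   covered: Callable[[str, Hashable], bool]
                   ) -> Tuple[int, float, float]:
    """Return (number covered, product over covered, product over non-covered)."""
    n_cov, p_cov, p_ncov = 0, 1.0, 1.0
    for v, e_p in cand:
        if covered(v, e_p):
            n_cov += 1
            p_cov *= prob(v, e_p)
        else:
            p_ncov *= prob(v, e_p)
    return n_cov, p_cov, p_ncov


def prefers(t1: Candidate, t2: Candidate, prob, covered) -> bool:
    """True if t1 is preferred to t2; tuple comparison is lexicographic."""
    return preference_key(t1, prob, covered) > preference_key(t2, prob, covered)


def precision(pairs: Sequence[Tuple[Candidate, Candidate]], prob, covered) -> float:
    """Rate of (correct, erroneous) pairs in which the correct candidate is preferred."""
    if not pairs:
        return 0.0
    hits = sum(prefers(correct, wrong, prob, covered) for correct, wrong in pairs)
    return hits / len(pairs)
```

Under the same assumptions, passing a `covered` function that always returns true reduces the comparison to the plain product $\phi(t)$ and hence to the precision $r_b$ of the basic model, while restricting `pairs` to case-covered test events gives $r_c$.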
5.3.2 Results 
Figure 1 (a) to (c) compare the precisions $r_c$ and $r_h$ among the one-frame/independent-frame/partial-frame/independent-case
models. We also compare the changes in the rate of the verb-noun collocations
in the test set which satisfy the case covering relation $\preceq_{cv}$ with the set $S$ of active features.
[Figure 1: Changes in Case-Coverage of Test Data and Precisions of Subcategorization Preference.
Panels: (a) With Heuristics of Case Covering (All Models); (b) Precisions of Case-Covered Events (All Models);
(c) Case-Coverage of Test Data (All Models); (d) Independent-Frame Model ($\alpha$ = 0.9), with curves for
Precision (covered), Precision (heuristic), and Precision (basic). Horizontal axis in all panels: Number of Selected Features.]
For the independent-frame model, we examined two different values of the independence parameter
$\alpha$, i.e., $\alpha = 0.5$ as a weak condition on independence judgment and $\alpha = 0.9$ as a strict condition
on independence judgment. Figure 1 (d) shows the changes of the precisions $r_b$, $r_h$, and $r_c$ as
well as the case-coverage of the test data during the training for the independent-frame model
(independence parameter $\alpha = 0.9$). Both of the precisions $r_c$ and $r_h$ of the independent-frame
model are higher than those of any other model. On the other hand, the case-coverage of the
independent-frame model (as well as that of the one-frame model) is much lower than that of
the partial-frame/independent-case models. The decrease of the case-coverage in the
independent-frame/one-frame models is caused by overfitting to the training data.⁶
In the case of the independent-frame model, the precisions decrease in the order of $r_c$, $r_h$, and
$r_b$. This means that the independent-frame model performs well in the task of subcategorization
preference when the verb-noun collocations satisfy the case covering relation $\preceq_{cv}$ with the set $S$ of
active features. When the verb-noun collocations do not satisfy the case covering relation, we have
to use the heuristics of case covering in section 4.4.2, and then the precision of subcategorization
preference decreases. If we do not care whether the verb-noun collocations satisfy the case covering
relation and do not use the heuristics of case covering, we are using the basic model in
section 4.4.1, which performs worst, as indicated by the precision $r_b$.

⁶ The reason why overfitting to the training data occurs in the independent-frame/one-frame models can be
explained by comparing the effects of the two values of the independence parameter $\alpha$ in the independent-frame model.
When $\alpha$ equals 0.9, both $r_c$ and $r_h$ are slightly higher than when $\alpha$ equals 0.5. In particular, when the number of
selected features is less than 300, $r_c$ is much higher when $\alpha$ equals 0.9 than when $\alpha$ equals 0.5, although the
case-coverage of the test data is much lower. When the condition on independence judgment becomes stricter,
the cases in the training data are judged as dependent on each other more often, and this causes the estimated
model to overfit to the training data. In the case of the independent-frame model, overfitting to the training data seems
to result in higher performance in the subcategorization preference task, although the case-coverage of the test data
becomes lower.
6 Conclusion 
This paper proposed a novel method for learning probabilistic models of the subcategorization
preference of verbs. We proposed to consider the issues of case dependencies and noun class generalization
in a uniform way. We adopted the maximum entropy model learning method and applied it to the
task of model learning of subcategorization preference.⁷ We described the results of the
experiment on learning the models of subcategorization preference from the EDR Japanese bracketed
corpus. We evaluated the performance of the selected features and their estimated parameters in
the subcategorization preference task. In this evaluation task, the independent-frame model with
the independence parameter $\alpha = 0.9$ performed best in the precision when incorporating the
heuristics of case covering, as well as in the precision of case-covered test events. As for further issues,
it is important to improve the case-coverage of the independent-frame model without decreasing
the precision of subcategorization preference. For this purpose, we have already invented a new
feature selection algorithm which meets the above requirement of preserving high case-coverage
with a relatively small number of active features.⁸ We will report the details of applying this new
algorithm to the task of model learning of subcategorization preference in the near future.

References 
Berger, A. L., S. A. Della Pietra, and V. J. Della Pietra. 1996. A maximum entropy approach to natural language
processing. Computational Linguistics, 22(1):39-71.
Black, E. 1993. Towards history-based grammars: Using richer models for probabilistic parsing. In Proceedings of
the 31st Annual Meeting of ACL, pages 31-37.
Collins, M. 1996. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual
Meeting of ACL, pages 184-191.
Della Pietra, S., V. Della Pietra, and J. Lafferty. 1997. Inducing features of random fields. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 19(4):380-393.
EDR (Japan Electronic Dictionary Research Institute, Ltd.), 1995. EDR Electronic Dictionary Technical Guide.
Haruno, M. 1995. Verbal case frame acquisition as data compression. In Proceedings of the 5th International
Workshop on Natural Language Understanding and Logic Programming, pages 45-50.
Li, H. and N. Abe. 1995. Generalizing case frames using a thesaurus and the MDL principle. In Proceedings of
International Conference on Recent Advances in Natural Language Processing, pages 239-248.
Li, H. and N. Abe. 1996. Learning dependencies between case frame slots. In Proceedings of the 16th COLING,
pages 10-15.
Magerman, D. M. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of
ACL, pages 276-283.
NLRI (National Language Research Institute), 1993. Word List by Semantic Principles. Syuei Syuppan. (in
Japanese).
Resnik, P. 1993. Semantic classes and syntactic ambiguity. In Proceedings of the Human Language Technology
Workshop, pages 278-283.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry, volume 15. World Scientific Publishing Company.
Utsuro, T. and Y. Matsumoto. 1997. Learning probabilistic subcategorization preference by identifying case
dependencies and optimal noun class generalization level. In Proceedings of the 5th ANLP, pages 364-371.
