General-to-Specific Model Selection 
for Subcategorization Preference* 
Takehito Utsuro and Takashi Miyata and Yuji Matsumoto 
Graduate School of Information Science, Nara Institute of Science and Technology 
8916-5, Takayama-cho, Ikoma-shi, Nara, 630-0101, JAPAN 
E-mail: utsuro@is, aist-nara, ac. jp, URL: http://cl, aist-nara, ac. jp/-utsuro/ 
Abstract 
This paper proposes a novel method for learning 
probability models of subcategorization preference of 
verbs. We consider the issues of case dependencies 
and noun class generalization in a uniform way by em- 
ploying the maximum entropy modeling method. We 
also propose a new model selection algorithm which 
starts from the most general model and gradually ex- 
amines more specific models. In the experimental 
evaluation, it is shown that both of the case depen- 
dencies and specific sense restriction selected by the 
proposed method contribute to improving the perfor- 
mance in subcategorization preference resolution. 
1 Introduction 
In empirical approaches to parsing, lexi- 
cal/semantic collocation extracted from corpus 
has been proved to be quite useful for ranking 
parses in syntactic analysis. For example, Mager- 
man (1995), Collins (1996), and Charniak (1997) 
proposed statistical parsing models which incor- 
porated lexical/semantic information. In their 
models, syntactic and lexical/semantic features 
are dependent on each other and are combined 
together. This paper also proposes a method 
of utilizing lexical/semantic features for the pur- 
pose of applying them to ranking parses in syn- 
tactic analysis. However, unlike the models of 
Magerman (1995), Collins (1996), and Char- 
niak (1997), we assume that syntactic and lex- 
ical/semantic features are independent. Then, 
we focus on extracting lexical/semantic colloca- 
tional knowledge of verbs which is useful in syn- 
tactic analysis. 
More specifically, we propose a novel method 
for learning a probability model of subcategoriza- 
tion preference of verbs. In general, when learn- 
ing lexical/semantic collocational knowledge of 
verbs from corpus, it is necessary to consider 
the two issues of 1) case dependencies, and 2) 
noun class generalization. When considering 1), 
we have to decide which cases are dependent on 
each other and which cases are optional and in- 
* This research was partially supported by the Ministry 
of Education, Science, Sports and Culture, Japan, Grant- 
in-Aid for Encouragement of Young Scientists, 09780338, 
1998. An extended version of this paper is available from 
the above URL. 
dependent of other cases. When considering 2), 
we have to decide which superordinate class gen- 
erates each observed leaf class in the verb-noun 
collocation. So far, there exist several works 
which worked on these two issues in learning col- 
locational knowledge of verbs and also evaluated 
the results in terms of syntactic disambiguation. 
Resnik (1993) and Li and Abe (1995) studied how 
to find an optimal abstraction level of an argu- 
ment noun in a tree-structured thesaurus. Their 
works are limited to only one argument. Li and 
Abe (1996) also studied a method for learning de- 
pendencies between case slots and reported that 
dependencies were discovered only at the slot- 
level and not at the class-level. 
Compared with these previous works, this pa- 
per proposes to consider the above two issues 
in a uniform way. First, we introduce a model 
of generating a collocation of a verb and argu- 
ment/adjunct nouns (section 2) and then view 
the model as a probability model (section 3). As 
a model learning method, we adopt the max- 
imum entropy model learning method (Della 
Pietra et al., 1997; Berger et al., 1996). Case 
dependencies and noun class generalization are 
represented as features in the maximum entropy 
approach. Features are allowed to have overlap 
and this is quite advantageous when we consider 
case dependencies and noun class generalization 
in parameter estimation. An optimal model is se- 
lected by searching for an optimal set of features, 
i.e, optimal case dependencies and optimal noun 
class generalization levels. As the feature selec- 
tion process, this paper proposes a new feature 
selection algorithm which starts from the most 
general model and gradually examines more spe- 
cific models (section 4). As the model evalua- 
tion criterion during the model search from gen- 
eral to specific ones, we employ the description 
length of the model and guide the search process 
so as to minimize the description length (Ris- 
sanen, 1984). Then, after obtaining a sequence 
of subcategorization preference models which are 
totally ordered from general to specific, we se- 
lect an approximately optimal subcategorization 
preference model according to the accuracy of 
subcategorization preference test. In the exper- 
imental evaluation of performance of subcatego- 
1314 
rization preference, it is shown that both of the 
case dependencies and specific sense restriction 
selected by the proposed method contribute to 
improving the performance in subcategorization 
preference resolution (section 5). 
2 A Model of Generating a Verb-Noun 
Collocation from Subcategorization 
Frame(s) 
This section introduces a model of generating 
a verb-noun collocation from subcategorization 
frame(s). 
2.1 Data Structure 
Verb-Noun Collocation Verb-noun colloca- 
tion is a data structure for the collocation of a 
verb and all of its argument/adjunct nouns. A 
verb-noun collocation e is represented by a fea- 
ture structure which consists of the verb v and 
all the pairs of co-occurring case-markers p and 
thesaurus classes e of case-marked nouns: 
Fred : v 
Pl : cx 
e = . (1) 
Pk : Ck 
We assume that a thesaurus is a tree-structured 
type hierarchy in which each node represents 
a semantic class, and each thesaurus class 
0,..., Ck in a verb-noun collocation is a leaf class 
in the thesaurus. We also introduce ~c as the 
superordinate-subordinate relation of classes in 
a thesaurus: cl ___e c2 means that cl is subordi- 1 
nate to c2. 
Subcategorization Frame A subcategoriza- 
tion frame s is represented by a feature structure 
which consists of a verb v and the pairs of case- 
markers p and sense restriction c of case-marked 
argument/adjunct nouns: 
Fred : v 
pl : cl s = . (2) 
Pl : cl 
Sense restriction cl, • • •, ct of case-marked argu- 
ment/adjunct nouns are represented by classes 
at arbitrary levels of the thesaurus. 
Subsumption Relation We introduce the 
subsumption relation "~s$ of a verb-noun collo- 
1Although we ignore sense ambiguities of case-marked 
nouns in the definitions of this section, in the current 
implementation, we deal with sense ambiguities of case- 
marked nouns by deciding that a class c is superordinate 
to an ambiguous leaf class Cl if c is superordinate to at 
least one of the possible unambiguous classes of Ct. 
cation e and a subcategorization frame s: 
e --sl s iff. for each case-marker Pi in s and its 
noun class csi, there exists the same 
case-marker pi in e and its noun 
class cei is subordinate to c~i, i.e. 
Cei "<c Csi 
The subsumption relation "~s$ is applicable also 
as a subsumption relation of two subcategoriza- 
tion frames. 
2.2 Generating a Verb-Noun Collocation 
from Subcategorization Frame(s) 
Suppose a verb-noun collocation e is given as: 
Fred : v 
Pl : Cel e = . (3) 
Pk : Cek 
Then, let us consider a tuple (sl, ...,sn) of 
partial subcategorization frames which satisfies 
the following requirements: i) the unification 
sl A ... Asn of all the partial subcategorization 
frames has exactly the same case-markers as e 
has as in (4), ii) each semantic class Csi of a 
case-marked noun of the partial subcategoriza- 
tion frames is superordinate to the correspond- 
ing leaf semantic class eei of e as in (5), and iii) 
any pair si and si, (i 7£ i I) do not have common 
case-markers as in (6): 
S 1 A • " " A S n ~- 
wed : v 
Pl : Csl 
Pk : Csk 
csi (i=l,...,k) 
(4) 
pred : v \] J 
vjvj' pij # pi,j, 
si = ' (i,i'=l,..,n, i#i') (6) Pij : Cij 
When a tuple (Sl, ...,sn) satisfies the above 
three requirements, we assume that the tuple (Sl, 
..., sn) can generate the verb-noun collocation e 
and denote as below: 
(~,..., ~.) , e (7) 
As we will describe in section 3.2, we assume that 
the partial subcategorization frames Sl, ..., Sn 
are regarded as events occurring independently 
of each other and each of them is assigned an 
independent parameter. 
2.3 Example 
This section shows how we can incorporate case 
dependencies and noun class generalization into 
the model of generating a verb-noun collocation 
from a tuple of partial subcategorization frames• 
1315 
The Ambiguity of Case Dependencies 
The problem of the ambiguity of case dependen- 
cies is caused by the fact that, only by observing 
each verb-noun collocation in corpus, it is not de- 
cidable which cases are dependent on each other 
and which cases are optional and independent of 
other cases. Consider the following example: 
Example 1 
Kodomo-ga kouen-de juusu-wo nomu. 
child-NOM park-at juice-A CC drink 
(A child drinks juice at the park.) 
The verb-noun collocation is represented as a 
feature structure e below: 
pred : nomu \] 
ga : Cc \] e = wo : cj (8) 
de : cp 
where co, cp, and cj represent the leaf classes 
(in the thesaurus) of the nouns "kodomo(child)", 
"kouen(park)", and '~uusu(juice)'. 
Next, we assume that the concepts "hu- 
man", "place", and "beverage" are superordi- 
hate to "kodomo(child)", "kouen(park)", and 
'~uusu(juice)", respectively, and introduce the 
corresponding classes Chum, Cplc, and Cbe v as sense 
restriction in subcategorization frames. Then, 
according to the dependencies of cases, we can 
consider several patterns of subcategorization 
frames each of which can generate the verb-noun 
collocation e. 
If the three cases "ga(NOM)", "wo(ACC)", 
and "de(at)" are dependent on each other and 
it is not possible to find any division into several 
independent subcategorization frames, e can be 
regarded as generated from a subcategorization 
frame containing all of the three cases: 
ga : Chum :' e (9) 
WO : Cbev 
de : Cptc 
Otherwise, if only the two cases "ga(NOM)" 
and "wo(A CC)" are dependent on each other and 
the "de(at)" case is independent of those two 
cases, e can be regarded as generated from the 
following two subcategorization frames indepen- 
dently: 
ga : Chu m ' de : Cpl c ~ e 
WO : Cbe v 
The Ambiguity of Noun Class Generaliza- 
tion The problem of the ambiguity of noun 
class generalization is caused by the fact that, 
only by observing each verb-noun collocation in 
corpus, it is not decidable which superordinate 
class generates each observed leaf class in the 
verb-noun collocation. Let us again consider Ex- 
ample 1. We assume that the concepts "mam- 
mal" and "liquid" are superordinate to "human" 
and "beverage", respectively, and introduce the 
corresponding classes Cma m and Ctiq. If we addi- 
tionally allow these superordinate classes as sense 
restriction in subcategorization frames, we can 
consider several additional patterns of subcate- 
gorization frames which can generate the verb- 
noun collocation e. 
Suppose that only the two cases "ga(NOM)" 
and "wo(ACC)" are dependent on each other 
and the "de(at)" case is independent of those two 
cases as in the formula (10). Since the leaf class 
cc ("child") can be generated from either Chum 
or cream, and also the leaf class cj ('~uice') can 
be generated from either Cbe v or Cliq, e can be 
regarded as generated according to either of the 
four formulas (10) and (11),~(13): 
ga : Cma m ~ de : Cpl c ) e 
WO : Cbe v 
ga : Chum ' de : Cpl c > 
WO : Cliq 
ga : c .... de : %to , e (13) 
WO : Cliq 
3 Maximum Entropy Modeling of 
Subcategorization Preference 
This section describes how we apply the maxi- 
mum entropy modeling approach of Della Pietra 
et al. (1997) and Berger et al. (1996) to model 
learning of subcategorization preference. 
3.1 Maximum Entropy Modeling 
Given the training sample C of the events (x, y), 
our task is to estimate the conditional probabil- 
ity p(y I x) that, given a context x, the process 
will output y. In order to express certain features 
of the whole event (x, y), a binary-valued indica- 
tor function is introduced and called a feature 
function. Usually, we suppose that there exists a 
large collection F of candidate features, and in- 
clude in the model only a subset S of the full set 
of candidate features .T. We call S the set of ac- 
tive features. Now, we assume that S contains n 
feature functions. For each feature fi(E S), the 
sets Vzi and Vyi indicate the sets of the values 
of x and y for that feature. According to those 
sets, each feature function fi will be defined as 
follows: 
1 ifxE V~iandyEVyi 
fi(x,y) = 0 otherwise 
Then, in the maximum entropy modeling ap- 
proach, the model with the maximum entropy 
is selected among the possible models. With this 
constraint, the conditional probability of the out- 
put y given the context x can be estimated as the 
following p~(y \[ x) of the form of the exponen- 
tial family, where a parameter Ai is introduced 
1316 
for each feature fi. exp(~-'~ )qfi(x,y)) 
p Cy I x) = ~ (14) 
y i 
The parameter values )¢i are estimated by an 
algorithm called Improved Iterative Scaling (IIS) 
algorithm. 
Feature Selection by One-by-one Feature 
Adding The feature selection process pre- 
sented in Della Pietra et al. (1997) and Berger 
et al. (1996) is an incremental procedure that 
builds up S by successively adding features one- 
by-one. It starts with S as empty, and, at each 
step, selects the candidate feature which, when 
adjoined to the set of active features S, pro- 
duces the greatest increase in log-likelihood of 
the training sample. 
3.2 Modeling Subcategorization Prefer- 
ence 
Events In our task of model learning of sub- 
categorization preference, each event (x,y) in 
the training sample is a verb-noun collocation e, 
which is defined in the formula (1). A verb-noun 
collocation e can be divided into two parts: one 
is the verbal part ev containing the verb v while 
the other is the nominal part ep containing all the 
pairs of case-markers p and thesaurus leaf classes 
c of case-marked nouns: 
Pk Ck 
Then, we define the context x of an event (x, y) 
as the verb v and the output y as the nominal part 
& of e, and each event in the training sample is 
denoted as (v, %): 
x = v, y -~ ep 
Features We represent each partial subcatego- 
rization frame as a feature in the maximum en- 
tropy modeling. According to the possible vari- 
ations of case dependencies and noun class gen- 
eralization, we consider every possible patterns 
of subcategorization frames which can generate 
a verb-noun collocation, and then construct the 
full set ~- of candidate features. Next, for the 
given verb-noun collocation e, tuples of partial 
subcategorization frames which can generate e 
are collected into the set SF(e) as below: 
Then, for each partial subcategorization frame 
s, a binary-valued feature function fs(V, ep) is de- 
fined to be true if and only if at least one element 
of the set SF(e) is a tuple (sl,...,s,...,sn) that 
contains s: 1 if 3(sl,...,s,...,sn) 
f,(v, ep) = • SF(e=(\[pred : v\] A %)) 
0 otherwise 
1317 
In the maximum entropy modeling approach, 
each feature is assigned an independent param- 
eter, i.e., each (partial) subcategorization frame 
is assigned an independent parameter. 
Parameter Estimation Suppose that the set 
S(C_ ~') of active features is found by the pro- 
cedure of the next section. Then, the param- 
eters of subcategorization frames are estimated 
according to IIS Algorithm and the conditional 
probability distribution ps(& \[ v) is given as: 
:,~s (15) 
% f~ E S 
4 General-to-Specific Feature Selec- 
tion 
This section describes the new feature selection 
algorithm which utilizes the subsumption rela- 
tion of subcategorization frames. It starts from 
the most general model, i.e., a model with no 
case dependency as well as the most general 
sense restrictions which correspond to the high- 
est classes in the thesaurus. This starting model 
has high coverage of the test data. Then, the al- 
gorithm gradually examines more specific mod- 
els with case dependencies as well as more spe- 
cific sense restrictions which correspond to lower 
classes in the thesaurus. The model search pro- 
cess is guided by a model evaluation criterion. 
4.1 Partially-Ordered Feature Space 
In section 2.1, we introduced subsumption rela- 
tion ~sl of two subcategorization frames. All the 
subcategorization frames are partially ordered 
according to this subsumption relation, and el- 
ements of the set .T of candidate features consti- 
tute a partially ordered feature space. 
Constraint on Active Feature Set 
Throughout the feature selection process, 
we put the following constraint on the active 
feature set S: Case Covering Constraint: for each verb-noun 
collocation in the training set C, each case p (and 
the leaf class marked by p) of e has to be covered 
by at least one feature in S. 
Initial Active Feature Set Initial set So of 
active features is constructed by collecting fea- 
tures which are not subsumed by any other can- 
didate features in ~-: 
So = (fslVfs,(  fs) E ~,s 7~sf S t } (16) 
This constraint on the initial active feature set 
means that each feature in So has only one case 
and the sense restriction of the case is (one of) 
the most general class(es). 
Candidate Non-active Features for Re- 
placement At each step of feature selection, 
one of the active features is replaced with sev- 
eral non-active features. Let G be a set of non- 
active features which have never been active until 
that step. Then, for each active feature fs(E S), 
the set DI, (C ~) of candidate non-active features 
with which fs is replaced has to satisfy the fol- 
lowing two requirements 2 3. 1. Subsumption with s: 
for each element fs' of DI. , 
s' has to be subsumed by s. 2. Upper Bound of 
~: for each element fs, of DI, , 
and for each element ft of G, t does not subsume 
s', i.e., DI, is a subset of the upper bound of 
with respect to the subsumption relation ~sI- 
Among all the possible replacements, the most 
appropriate one is selected according to a model 
evaluation criterion. 
4.2 Model Evaluation Criterion 
As the model evaluation criterion during feature 
selection, we consider the following two types. 
4.2.1 MDL Principle 
The MDL (Minimum Description Length) prin- 
ciple (Rissanen, 1984) is a model selection crite- 
rion. It is designed so as to "select the model that 
has as much fit to a given data as possible and 
that is as simple as possible." The MDL princi- 
ple selects the model that minimizes the follow- 
ing description length l( M, D) of the probability 
model M for the data D: 1N 
l(M,D) = -logLM(D) + ~ MloglO I (17) 
where logLM(D) is the log-likelihood of the 
model M to the data D, NM is the number of 
the parameters in the model 21I, and IDI is the 
size of the data D. 
Description Length of Subcategorization 
Preference Model The description length 
l(ps, £) of the probability model Ps (of (15)) for 
the training data set C is given as below: 4 
l(ps,C) = - ~ logps(% Iv)+ llsIloglCI (18) 
(v,e,.)~ 
2The general-to-specific feature selection considers only 
a small portion of the non-active features as the next can- 
didate for the active feature, while the feature selection by one-by-one feature adding 
considers all the non-active fea- 
tures as the next candidate. Thus, in terms of efficiency, 
the general-to-specific feature selection has an advantage 
over the one-by-one feature adding algorithm, especially 
when the number of the candidate features is large. 
3As long as the case covering constraint is satisfied, the set 
Df, of candidate non-active features with which f, is 
replaced could be an empty set 0. 
4More precisely, we slightly modify the probability 
model ps by multiplying the probability of generating the 
verb-noun collocation e from the (partial) subcategoriza- 
tion frames that correspond to active features evaluating 
to true for e, and then apply the MDL principle to this 
modified model. The probability of generating a verb- 
noun collocation from (partial) subcategorization frames 
is simply estimated as the product of the probabilities 
4.2.2 Subcategorization Preference Test 
using Positive/Negative Examples 
The other type of the model evaluation criterion 
is the performance in the subcategorization pref- 
erence test presented in Utsuro and Matsumoto 
(1997), in which the goodness of the model is 
measured according to how many of the posi- 
tive examples can be judged as more appropriate 
than the negative examples. This subcategoriza- 
tion preference test can be regarded as modeling 
the subcategorization ambiguity of an argument 
noun in a Japanese sentence with more than one 
verbs like the one in Example 2. 
Example 2 
TV-de mouketa shounin-wo mita 
TV-by/on earn money merchant-A CC see 
(If the phrase "TV-de'(by/on TV) modifies the verb 
"mouketa'(earn money), the sentence means that 
"(Somebody) saw a merchant who earned money by 
(selling) TV." On the other hand, if the phrase "TV- 
de"(by/on TV) modifies the verb "mita'(see), the 
sentence means that "On TV, (somebody) saw a mer- 
chant who earned money.") 
Negative examples are artificially generated from 
the positive examples by choosing a case element 
in a positive example of one verb at random and 
moving it to a positive example of another verb. 
Compared with the calculation of the descrip- 
tion length l(ps, C) in (18), the calculation of the 
accuracy of subcategorization preference test re- 
quires comparison of probability values for suffi- 
cient number of positive and negative data and 
its computational cost is much higher than that 
of calculating the description length. There- 
fore, at present, we employ the description length 
l(ps,C) in (18) as the model evaluation crite- 
rion during the general-to-specific feature selec- 
tion procedure, which we will describe in the next 
section in detail. After obtaining a sequence of 
active feature sets (i.e., subcategorization pref- 
erence models) which are totally ordered from 
general to specific, we select an optimal subcate- 
gorization preference model according to the ac- 
curacy of subcategorization preference test, as we 
will describe in section 4.4. 
4.3 Feature Selection Algorithm 
The following gives the details of the general-to- 
specific feature selection algorithm, where the de- 
of generating each leaf-class in the verb-noun collocation 
from the corresponding superordinate class in the subcat- 
egorization frame. With this generation probability, the 
more general the sense restriction of the subcategoriza- 
tion frames is, the less fit the model has to the data, and the 
greater the data description length (the first term of 
(18)) of the model is. Thus, this modification causes the 
feature selection process to be more sensitive to the sense 
restriction of the model. 
1318 
scription length l(ps, g) in (18) is employed as 
the model evaluation criterion: 5 
General-to-Specific Feature Selection 
Input: Training data set E; 
collection ~- of candidate features 
Output: Set `S of active features; 
model Ps incorporating these features 
1. Start with ,S = ,So of the definition (16) and with g =~'-& 
2. Do for each active feature f E `S and every pos- 
sible replacement D I C G: 
Compute the model PSuD/-U} using 
IIS Algorithm. 
Compute the decrease in the descrip- 
tion length of (18). 
3. Check the termination condition s 
4. Select the feature j and its replacement D\] with 
maximum decrease in the description length 
5. S,----SuD\]-{\]}, G~---G-D\] 
6. Compute ps using IIS Algorithm 
7. Go to step 2 
4.4 Selecting a Model with Approx- 
imately Optimal Subcategorization 
Preference Accuracy 
Suppose that we are constructing subcategoriza- 
tion preference models for the verbs Vl,...,Vm. 
By the general-to-specific feature selection algo- 
rithm in the previous section, for each verb vi, 
a totally ordered sequence of ni active feature 
sets Si0,... ,'-"¢ini (i.e., subcategorization prefer- 
ence models) are obtained from the training sam- 
ple g. Then, using another training sample C ~ 
which is different from C and consists of positive 
as well as negative data, a model with optimal 
subcategorization preference accuracy is approx- 
imately selected by the following procedure. Let 
~,..., 7-m denote the current sets of active fea- 
tures for verbs Vl,..., Vm, respectively: 
1. Initially, for each verb vi, set ~ as the most gen- 
eral one `sis of the sequence `sio,..., `sire. 
2. For each verb vi, from the sequence `sn,..., `sire, 
search for an active feature set which gives a 
maximum subcategorization preference accuracy 
for g~, then set Ti as it. 
3. Repeat the same procedure as 2. 
4. Return the current sets ~,..., 7-m as the approx- 
imately optimal active feature sets 'S1,.--,'~r~ 
for verbs Vl,..., vm, respectively. 
5Note that this feature selection algorithm is a hill- 
climbing one and the model selected here may have a de- 
scription length greater than the global minimum. 
6In the present implementation, the feature selection 
process is terminated after the description length of the 
model stops decreasing and then certain number of active 
features are replaced. 
5 Experiment and Evaluation 
5.1 Corpus and Thesaurus 
As the training and test corpus, we used the 
EDR Japanese bracketed corpus (EDR, 1995), 
which contains about 210,000 sentences collected 
from newspaper and magazine articles. We 
used 'Bunrui Goi Hyou'(BGH) (NLRI, 1993) 
as the Japanese thesaurus. BGH has a seven- 
layered abstraction hierarchy and more than 
60,000 words are assigned at the leaves and its 
nominal part contains about 45,000 words. 
5.2 Training/Test Events and Features 
We conduct the model learning experiment under 
the following conditions: i) the noun class gener- 
alization level of each feature is limited to above 
the level 5 from the root node in the thesaurus, 
ii) since verbs are independent of each other in 
our model learning framework, we collect verb- 
noun collocations of one verb into a training data 
set and conduct the model learning procedure for 
each verb separately. 
For the experiment, seven Japanese verbs 7 are 
selected so that the difficulty of the subcatego- 
rization preference test is balanced among verb 
pairs. The number of training events for each 
verb varies from about 300 to 400, while the 
number of candidate features for each verb varies 
from 200 to 1,350. From this data, we construct 
the following three types of data set, each pair 
of which has no common element: i) the training 
data ~: which consists of positive data only, and 
is used for selecting a sequence of active feature 
sets by the general-to-specific feature selection 
algorithm in section 4.3, ii) the training data g' 
which consists of positive and negative data and 
is used in the procedure of section 4.4, and iii) the 
test data C ts which consists of positive and neg- 
ative data and is used for evaluating the selected 
models in terms of the performance of subcate- 
gorization preference test. The sizes of the data 
sets g, g', and g ts are 2,333, 2,100, and 2,100. 
5.3 Results 
Table 1 shows the performance of subcategoriza- 
tion preference test described in section 4.2.2, for 
the approximately optimal models selected by the 
procedure in section 4.4 (the "Optimal" mode\] 
of "General-to-Specific" method), as well as for 
several other models including baseline models. 
Coverage is the rate of test instances which sat- 
isfy the case covering constraint of section 4.1. 
Accuracy is measured with the following heuris- 
tics: i) verb-noun collocations which satisfy the 
r"Agaru (rise)", "kau (buy)", "motoduku (base)", "oujiru (respond)", "sumu (live)", "tigau (differ)", 
and "tsunagaru (connect)". 
1319 
Table 1: Comparison of Coverage and Accuracy 
of Optimal and Other Models (%) 
General-to-Specific 
(Initial) 
(Independent Cases) 
(General Classes) 
(Optimal) 
(MDL) 
One-by-one Feature Adding 
(Optimal) 
Coverage 
84.8 
84.8 
77.5 
75.4 
15.9 
60.8 
Accuracy 
81.3 
82.2 
79.5 
87.1 
70.5 
79.0 
case covering constraint are preferred, it) even 
those verb-noun collocations which do not satisfy 
the case covering constraint are assigned the con- 
ditional probabilities in (15) by neglecting cases 
which are not covered by the model. With these 
heuristics, subcategorization preference can be 
judged for all the test instances, and test set cov- 
erage becomes 100%. 
In Table 1, the "Initial" model is the one 
constructed according to the description in sec- 
tion 4.1, in which cases are independent of each 
other and the sense restriction of each case is 
(one of) the most general class(es). The "Inde- 
pendent Cases" model is the one obtained by re- 
moving all the case dependencies from the "Op- 
timal" model, while the "General Classes" model 
is the one obtained by generalizing all the sense 
restriction of the "Optimal" model to the most 
general classes. The "MDL" model is the one 
with the minimum description length. This is 
for evaluating the effect of the MDL principle in 
the task of subcategorization preference model 
learning. The "Optimal" model of "One-by-one 
Feature Adding" method is the one selected from 
the sequence of one-by-one feature adding in sec- 
tion 3.1 by the procedure in section 4.4. 
The "Optimal" model of 'General-to-Specific" 
method performs best among all the models in 
Table 1. Especially, it outperforms the "Op- 
timal" model of "One-by-one Feature Adding" 
method both in coverage and accuracy. As for 
the size of the optimal model, the average num- 
ber of the active feature set is 126 for "General- 
to-Specific" method and 800 for "One-by-one 
Feature Adding" method. Therefore, general-to- 
specific feature selection algorithm achieves sig- 
nificant improvements over the one-by-one fea- 
ture adding algorithm with much smaller num- 
ber of active features. The "Optimal" model of 
"General-to-Specific" method outperforms both 
the "Independent Cases" and "General Classes" 
models, and thus both of the case dependencies 
and specific sense restriction selected by the pro- 
posed method have much contribution to improv- 
ing the performance in subcategorization prefer- 
1320 
ence test. The "MDL" model performs worse 
than the "Optimal" model, because the features 
of the "MDL" model have much more specific 
sense restriction than those of the "Optimal" 
model, and the coverage of the "MDL" model 
is much lower than that of the "Optimal" model. 
6 Conclusion 
This paper proposed a novel method for learn- 
ing probability models of subcategorization pref- 
erence of verbs. Especially, we proposed a new 
model selection algorithm which starts from the 
most general model and gradually examines more 
specific models. In the experimental evaluation, 
it is shown that both of the case dependencies 
and specific sense restriction selected by the pro- 
posed method contribute to improving the per- 
formance in subcategorization preference resolu- 
tion. As for future works, it is important to eval- 
uate the performance of the learned subcatego- 
rization preference model in the real parsing task. 

References 
A. L. Berger, S. A. Della Pietra, and V. J. Della 
Pietra. 1996. A Maximum Entropy Approach to Nat- 
ural Language Processing. Computational Linguistics, 
22(1):39-71. 
E. Charniak. 1997. Statistical Parsing with a Context- 
free Grammar and Word Statistics. In Proceedings of 
the 14th AAAI, pages 598-603. 
M. Collins. 1996. A New Statistical Parser Based on Bi- 
gram Lexical Dependencies. In Proceedings of the 34th 
Annual Meeting of ACL, pages 184-191. 
S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. 
Inducing Features of Random Fields. IEEE Transac- 
tions on Pattern Analysis and Machine Intelligence, 
19(4):380-393. 
EDR (Japan Electronic Dictionary Research Institute, 
Ltd.). 1995. EDR Electronic Dictionary Technical 
Guide. 
H. Li and N. Abe. 1995. Generalizing Case Frames Using 
a Thesaurus and the MDL Principle. In Proceedings of 
International Conference on Recent Advances in Natu- 
ral Language Processing, pages 239-248. 
H. Li and N. Abe. 1996. Learning Dependencies between 
Case Frame Slots. In Proceedings of the 16th COLING, 
pages 10-15. 
D. M. Magerman. 1995. Statistical Decision-Tree Models 
for Parsing. In Proceedings of the 33rd Annual Meeting 
of A CL, pages 276-283. 
NLRI (National Language Research Institute). 1993. 
Word List by Semantic Principles. Syuei Syuppan. (in 
Japanese). 
P. Resnik. 1993. Semantic Classes and Syntactic Ambigu- 
ity. In Proceedings of the Human Language Technology 
Workshop, pages 278-283. 
J. Rissanen. 1984. Universal Coding, Information, Pre- 
diction, and Estimation. IEEE Transactions on Infor- 
mation Theory, IT-30(4):629-636. 
T. Utsuro and Y. Matsumoto. 1997. Learning Probabilis- 
tic Subcategorization Preference by Identifying Case 
Dependencies and Optimal Noun Class Generalization 
Level. In Proceedings of the 5th ANLP, pages 364-371. 
