Decision Tree Learning Algorithm with Structured 
Attributes: 
Application to Verbal Case Frame Acquisition 
Hideki Tanaka 
NHK Science and Technical Research Laboratories 
Abstract 
The Decision Tree Learning Algorithms 
(DTLAs) are getting keen attention 
from the natural language processing re- 
search comlnunity, and there have been 
a series of attempts to apply them to 
verbal case frame acquisition. However, 
a DTLA cannot handle structured at- 
tributes like nouns, which are classified 
under a thesaurus. In this paper, we 
present a new DTLA that can ratio- 
nally handle the structured attributes. 
In the process of tree generation, the 
algorithm generalizes each attribute op- 
timally using a given thesaurus. We ap- 
ply this algorithm to a bilingual corpus 
and show that it successfiflly learned a 
generalized decision tree for classifying 
the verb "take" and that the tree was 
smaller with more prediction power on 
the open data than the tree learned by 
the conventional DTLA. 
1 Introduction 
The group of Decision Tree Learning Algorithms 
(DTLAs) like CART (Breiman et al., 1984), ID3 
(Quinlan, 1986) and C4.5 (Quinlan, 1993) are 
some of the most widely used algorithms for learn- 
ing the rules for expert systems and has been sue- 
eessfully applied to several areas so far. 
These algorithms are now getting keen atten- 
tion from the natural language processing (NLP) 
research community since the huge text corpus 
is becoming widely available. The most popular 
touchstone for the DTLA in this community is the 
verbal case frame or the translation rules. There 
have already been some attempts, like (Tanaka, 
1994) and (Almuallim et al., 1994). 
The group of DTLAs, however, was origi- 
nally designed to handle "plain" data, whereas 
nouns are "structured" under a thesaurus. Al- 
though handling such "structured attributes" in 
the DTLA was described as a "desirable exten- 
sion" in the book of Quinlan (Quinlan, 1993), the 
1-10-11 Kinuta, Setagaya-ku 
Tokyo, 157, Japan 
tanakah@strl, nhk. or. jp 
value attribute (case) (semantic restriction) object noun 
,.t 
\[¥ I I Q l \[ I Taro Hanako cat dog elephant TV camera 
tsurete-iku \[escort\] hakobu\[carry\] 
class (translation) 
Figure 1: Case Prame Tree Learned by DTLA 
problem has received rather limited attention so 
far (Ahnuallim et el., 1995). 
There have been several attempts to solve tile 
problem in the NLP community, such as (Tanaka, 
1995b), (Almuallim et el., 1995). These attempts, 
however, are not always satisfactory in that the 
handling of the thesaurus is not flexible enough. 
In this paper, we introduce an extended DTLA, 
LASA-1 (inductive earning Algorithm with Struc- 
tured Attributes), which can handle structured 
attributes in an optimmn way. We first present 
an algorithm called T*, which can solve the 
sub-problem for structured attributes and then 
present the whole algorithm of LASA-1. Finally, 
we report an application of our new algorithm to 
verbal case frame acquisition and show its effec- 
tiveness. 
2 The Structured Attribute 
Problem 
Figure 1 shows an example decision tree represent- 
ing acase frame for the verb "take." This decision 
tree was called the case frame tree (Tanaka, 1994) 
and we follow that convention in this paper, too. 
One may recognize that the restrictions in figure 1 
are not semantic categories but are words: this 
tree was learned from table I which contains word 
forms for the values. Although the tree has some 
attractive features mentioned in (Tanaka, 1994), 
it suffers from two problems. 
• weak prediction power 
A case frame tree with word forms does not 
have high prediction power on the open data 
943 
Table 1: Single Attribute Table.,for:. "take" 
Object Noun Japanese Translation 
Taro 
Hanako 
cat 
dog 
elephant 
elephant 
TV 
camera 
tsurete-iku (escort) 
tsurete-iku (escort) 
tsurete-iku (escort) 
tsurete-iku (escort) 
hakobu (carry) 
hakobu (carry) 
hakobu (carry) 
hakobu (carry) 
(the data not used for learning). The nouns 
are the most problematic. There will be 
many unknown nouns in the open data. 
• low legibility 
If we include many different nouns in the 
training data (the data used for learning), 
the obtained tree will have as many branches 
as the number of nouns. The ramified tree is 
hard for humans to understand. 
Introducing a thesaurus or a semantic hierar- 
chy in a case frame tree seems a sound way to 
ameliorate these two problems. We can replace 
the similar nouns in a case fl'ame tree by a proper 
semantic class, which will reduce the size of the 
tree while increasing the prediction power on the 
open data. But how can we introduce a thesaurus 
into the conventional DTLA framework? This is 
exactly the "structured attributes" problem that 
we mentioned in section 1. 
3 The Problem Setting 
3.1 Partial Thesaurus 
The DTLA takes an attribute, value and class ta- 
ble for an input 1 Although the table usually 
includes multiple attributes, the algorithm evalu- 
ates an attribute's goodness as a classifier inde- 
pendently of the rest of the attributes. In other 
words a "single attribute table" as shown in ta- 
ble 1 is the flmdamental unit for the DTLA. This 
table shows an imaginary relationship between an 
object noun of the verb "take" and the Japanese 
translation. We used this table to learn the case 
frame tree in figure 1 and it suffered from the two 
problems. 
Here, we can assume that the word forms of the 
ON are in a thesaurus (We call this thesaurus the 
original thesaurus) and we can extract the rele- 
vant part as in figure 2. We call this tree a partial 
thesaurus T 2. If we replace "Taro" and "Hanako" 
lWe are going to mainly use the terms attribute, 
value, and class for generality. They actually refer to 
the case, restrictions for the case, and the translation 
of the verb respectively in our application. In this 
paper, we use these terms interehangeably. 
2The scores at each node will be explained in sec- 
tion 3.3. 
Taro Hanako cat dog elephant TV camera 
0.197 0.197 0.197 0.197 0.197 0.197 0.197 
tsure tsure tsure tsure hakobu hakobu hakobu 
te-iku te-iku te-iku te-iku hakobu 
Figure 2: Partial Thesaurus T 
Table 2: Notations 1 
T 
I" 
P 
P 
N L(v) 
partial thesaurus 
root node of T 
any node in T, take subscripts i, j 
any node set in T 
set of all nodes in T 
set of leaf under p 
in table 1 by "*human" in T, for example, and as- 
sign the translation "tsurete-iku" to "*human," 
the learned case frame tree will reduce the size 
by one (two leaves in figure 1 are replaced by 
one leaf). If we replace "Taro," "Hanako," "cat," 
"dog" and "elephant" by "*mammal," and assign 
the translation "tsurete- iku" to "*mammal" (The 
majority translation under the node "*mammal" 
in T. We are going to use this "majority rule" 
for the class assignment.), then the learned case 
fl'ame tree will reduce the size by four. But the 
case frame tree will produce two translation er- 
rors ("hakobu" for "elephant") when we classify 
the original table 1. In both cases, the learned 
case frame trees are expected to have reinforced 
prediction power on the open data thanks to the 
semantic classes: the replacement in the table gen- 
eralizes the case frame tree. We want high-level 
generalization but low-level translation errors; but 
how do we achieve this in an optimum way? 
3.2 Unique and Complete Cover 
Generalization 
One factor we have to consider is the possible com- 
binations of the node set in T which we use for 
the generalization of the single attribute table. In 
this paper, we allow to use the node sets which 
cover the word forms in the table uniquely and 
completely. These two requirements are formally 
defined below using the notations in table 2. 
Definition 1: For a given node set P C N, P is 
called the unique cover node set if L(p{)n L(p\]) = 
¢ for Vpi, pj c P and i # j. 
Definition 2: For a given node set P C 
N, P is called the complete cover node set if U ,cv 
L(p ) = L(,'). 
944 
Table 3: Notations 2 
M p' 
D(p') O(p) 
Ci 
f(cO A 
IAI 
It(A) 
total word count in thesaurus 
thesaurus node corresponding to p 
word count under p' 
set of class under p 
class 
frequency of (:i 
set of class 
f(c ) 
entropy of class distri/mtion in A 
'rile node set that satisfy the two definitions is 
called the unique and complete cow'~r (UCC) node 
set and each such node set is denoted by P~ .... 
The set of all UCC node set is denoted by "P. It 
should be noted that if we use only the leaves in T 
for generMization, there will be no actual change 
in the table and this node set is included in 7 ). 
The total nund)er of UCC node sets in a tree is 
generally high. For example, the number of UCC 
node set in a 10 ary tree with the depth of 3 is 
about 1.28 × l0 ~°. We will consider this prol)lem 
in section 4. 
3.3 Goodness of Generalization 
Another factor to consider is the measurement of 
the goodness of a generalization. To evaluate this 
quantitatively, we assign a t)enalty score S(p) to 
each node p in ?1' a.~ 
S(p) = a" C;,~.,~(p) + r':(p), (1) 
where a is a coefficient, Gw~(p ) is the penalty for 
generality ~ , and E(p) is a I)enalty for the induced 
errors by using p. 
The node that ha,s small S(p) is pro, ferabh;. And 
Gv~,n(p ) and E(p) are generally mutually conflict- 
ing: high generality node p (with low Gv,,n(p)) will 
induce many errors resulting in high E(p) and vice 
versa. We measure a generalization's goodness by 
tile total sum of the penalty scores of the nodes 
used for the generalization. There are several pos~ 
sible candidates for the penalty score function and 
we (:hose the formula (2) for this research. 
D(v') IO(V)I H, O, ,, s(p) = log + t vv J (2) 
New notations are listed in table 3 in addition 
to table 2. The second term in formula (2) is the 
"weighted entropy" of the class distribution under 
node p, which coincides Quinlan's criterion (Quin- 
lan, 1993). 
We calculated Gp~n(p) (tile first term of for- 
mula (2)) based on the word numt)er coverage of 
p' in the original thesaurus rather than in the par- 
tim thesaurus, since the original thesaurus usu- 
ally contains many more words than tile partial 
alf p has low generality, it will have high Gp~,~(p). 
Original Thesaurus 
Partial Tesaurus T 
, = - - Z_T ..-T 
oo o '-" M 
Figure 3: Generality Calculation 
thesaurus, and is thus expected to yield a better 
estimate on the generMity of node p. TILe idea 
is shown in figure 3. The coefficient a is rather 
ditlicult to handle and wc will touch oil this is- 
sue ill section 4.3. The figures attached to each 
node in figure 2 are the example penalty scores 
given by formula (2) under the assmnption that 
the T and the original thesaurus are tile same and 
a = 0.0074. 
With these preparations, we now formally ad- 
dress the problem of tlm optimum generMization 
of the singh' attribute tattle. 
The Optimum Attribute Generalization 
Given a tree whose nodes each have a score: 
Find 1~ ...... that has the minimal total sum of 
scores: 
arg rain ~ S(pi) (3) 
I ~,, ,~ ,: Q 7 ) 
Pl G P~ (:,: 
4 The Algorithms 
4.1 The Algorithm T* 
As was mentioned in section 3, the number of 
UCC node set in a tree tends to be gigantic, and 
we should obviously avoid an exllaustive search to 
find the optimum generalization. To do this search 
efficiently, we propose a new algorithm, T*. The 
essence of T* lies in the conversion of the partial 
thesaurus: from a tree T into a directed acyclic 
graph (DAG) T. This makes the problem into 
"the shortest path problem in a graph," to which 
we can apply several efficient algorithms. We use 
the new notations in table 4 in addition to those 
in table 2. 
The Algorithm T* 
Tstar( value, class){ 
extract partial thesaurus T with 
value and class; 
/* conversion of T into a DAG T */ 
assign index numbers (1,..., m) 
to leaves in T from the left; 
add start node s to T with 
index number 0 
and c with index number re+l; 
ror~ach( n ~ N U {s}){ 
extend an arc from n to each 
4This coefficient was fixed cxperimentMly. 
945 
*mammal *anything *instrument 
0.723 1.00 0.127 
beast 
/ I ,,'*human "',,~,,~ 0.586 / / " , / /:/ o.1. --.,,.%<-- / : '.. / 
,' , : , ", : ; : ,I 
1 Taro 2Hanako 3cat 4dog 5elephant 6TV 7camera 
0.197 0.197 0.197 0.197 0.197 0.197 0.197 
Figure 4: Traversal Graph 7" 
Table 4: Notations 3 
Lmi,~(p) leaf with smallest index in L(p) 
Lm~,(p) leaf with biggest index in L(p) 
element in the set H,~ 
defined by (4);} 
delete original edges appeared in T; 
/* search for shortest path in 7" */ 
opt=node_set = find_short(7"); 
return opt_node_set; 
H,, : {xlxeNU{e}, 
Lining(x) - 1 : L~(n)} (4) 
This algorithm first converts T in figure 2 into 
a DAG 7-, as in figure 4. We call this graph a 
traversal 9raph and each path from s to e in the 
traversal graph a traverse. The set of nodes on 
each traverse is called a traversal node set. 
Here we have two propositions related to the 
traversal graph. 
Proposition 1: A traversal graph is a DAG. 
Proposition 2: For any P C N, P is a UCC 
node set if and only if P is a traversal node set. 
Since proposition 2 holds, we can solve the opti- 
mum attribute generalization problem by finding 
the shortest traverse 5 in the traversal graph. By 
applying a shortest path algorithm (Gondran and 
Minoux, 1984) to figure 4, we find the shortest tra- 
verse as (s ~ *human --+ *beast --+ *instrument 
---+ e) arm get the optimally generalized table as 
in table 5 and the generalized decision tree as in 
figure 5. 
4.2 Correctness and Time Complexity 
We will not give a full proof for propositions 1 and 
2 (correctness of T*) because of the limited space, 
but give an intuitive explanation of why the two 
propositions hold. 
5The sum of the scores in the traversM node set is 
minimal. 
Table 5: Optimally Generalized Single Attribute 
Table for "take" 
ON Translation I Freq\[ Error 
*human tsurete-iku 2 0 
*beast tsurete-iku 4 2 
-~instrument hakobu 2 0 
*human *beast *instrument 
tsurete-iku hakobu 
Figure 5: Optimally Generalized Decision Tree for 
"take" 
Let's suppose that we select "*human" in fig- 
ure 2 for a UCC node set P~cc; then we cannot 
include "*mammal" in the P~c~: there will be 
leaf overlap between the two nodes, which vio- 
lates the unique cover. Meanwhile, we have to 
include nodes that govern Lm~(*human)+ 1, i.e. 
"cat," to satisfy the complete cover. In conclu- 
sion, we have to include "cat" or "*beast" in the 
P~, which satisfies formula (4). The T* links all 
such possible nodes with arcs, and the traversal 
node sets can exhaust T'. 
One may easily understand that the traversal 
graph will be a DAG, since formula (4) allows 
an arc between two nodes to be spanned only 
in the direction that increases the index num- 
ber of the leaf. Since proposition 1 holds, the 
time complexity of the T* can be estimated by 
the number of arcs in a traversal graph: there is 
an algorithm for the shortest path problem in an 
acyclic graph which runs with time complexity of 
O(M), where M is the number of arcs (Gondran 
and Minoux, 1984). Then we want to clarify the 
relationship between the number of leaves (data 
amount, denoted by D) and the number of arcs 
in the traversal graph. Unfortunately, the rela- 
tionship between the two quantities varies depend- 
ing on the shape of the tree (partial thesaurus), 
then we consider a practical case: k-ary tree with 
depth d (Tanaka, 1995a). In this case, the number 
of arcs in the traversal graph is given by 
k(k + 1), d d 2 _ k(k+l) 
(k - 1) 2. (5) 
Since the number of leaves D in the present the- 
saurus is k ~ , the first term in formula (5) be- 
~D, showing that T* has O(D) time comes 
complexity in this case. 
Theoretically speaking, when the partial the- 
saurus becomes deep and has few leaves, the time 
complexity will become worse, but this is hardly 
the situation. We can say that T* has approxi- 
mately linear order time complexity in practice. 
946 
4.3 The LASA-1 
The essence of DTLAs lies in the recursive "search 
and division." It searches for the best classifier 
attribute in a given table. It then divides the table 
with values of the attribute. 
The goodness of an attribute is usually mea- 
sm'ed by the following quantities (Quinlan, 1993) 
(The notations are in table 3.). Now let's a~s- 
sume that a table contains a set of class A = 
{Cl,..., c,~}. The DTLA then evaluates the "pu- 
rity" of A in terms of the entropy of the class 
distribution, H(A). 
If an attribute has m different values whicil di- 
vide A into m subsets as A = {BI,... ,J~m}, the 
DTLA evahmtes the "purity after division" by the 
"weighed sum of entropy," WSH(attribute, A). 
WSH(attribute, A) : B~A ~H(B/) (6) 
The DTLA then measures the goodness of the at- 
tribute by 
gain : H(A) - WSH(attribute, A). (7) 
With these processes in mind, we can naturally 
extend the DTLA to handle the structured at- 
tributes while integrating T*. The algorithm is 
listed below. Here we have two functions named 
make'lYee 0 and Wsh 0. The function make~lh'ee0 
executes the recursive "search and division" and 
the Wsh() calculates the weighted sum of entropy. 
T* is integrated in Wsh 0 at the first "if clause." a 
In short, we use T* to optimally generalize the val- 
ues of an attribute at each tree generation step, 
which makes the extension quite natural. 
The LASA-1 
place all classes in input table under 
root; makeTree( root, table); 
makeTree(node, table){ 
A: class set in table; 
find attribute which maximizes 
H(A) - Wsh( attribute, table); 
/* table division part follows*/ } 
Wsh( attribute, table){ 
if(attribute is structured){ 
node_set = Tstar( value, class); 
replace value with node=set; } 
return WSH(attributc, A) (6) } 
We have implemcnted this algorithm as a pack- 
age that we called LASA- 1(inductive Learning 
Algorithm with Structured Attributes). This 
package has many parameter setting options. The 
6Without this clause, the algorithm is just a con- 
ventional DTLA. 
most important one is for parameter a in for- 
mula (2). Since it is not easy to find the best 
value before a trial, we used a heuristic method. 
The one used in the next section was set by the 
following method. 
We put equal emphasis on the two terms in 
formula (2) and fixed a so that the traverse via 
the root node of Tand the traverse via leaves 
only would have equal scores. At the beginning, 
LASA-1 calculated the value for each attribute in 
the original table. 
Although this heuristics does not guarantee to 
output the a that has the minimum errors on ()pen 
data, the value was not too far off in our experi- 
ence. 
5 Empirical Evaluation 
5.1 Experiment 
We conducted a case frame tree acquisition exper- 
iment on LASA-1 and the DTLA 7 using part of 
our bilingual corpus for the verb "take." We used 
100 English-Japanese sentence pairs. The pairs 
contained 15 translations (classes) for "take," 
whose occm'rences ranged from 5 to 9. We first 
converted the sentence pairs into an input table 
consisting of the case (attribute), English word 
form (value), and Japanese translation for "take" 
(class). We used 6 cases for attributes s and some 
of these appear in figure 6. 
We used the Japanese "Ruigo-Kokugo-Jiten" 
(Ono, 1985) for the thesaurus. It is a 10-ary tree 
with the depth of 3 or 4. The semantic class at 
each node of the tree was represented by 1 (top 
level) to 4 (lowest level) digits. To link the English 
word forms in the input table to the thesaurus in 
order to extract a partial thesaurus, we used the 
Japanese translations for the English word forms. 
When there was more than one possible semantic 
class for a word form, we gave all of them 9 and 
expanded the input table using all the semantic 
classes. 
We evaluated both algorithms with using the 
10-fold cross validation method(Quinlan, 1993). 
The purity threshold for halting the tree gen- 
eration was experimentally set at 7570 10 for both 
algorithms. 
A part of a case frame tree obtained by LASA- 
1 is shown in figure 6. We can observe that both 
semantic codes and word forms are mixed at the 
7Part of LASA-1 was used as the DTLA. 
Sadverb (DDhl), adverbial particle (Dhl), object 
noun (ONhl), preposition (PNfl), the head of the 
prepositional phrase (PNhl), and subject (SNhl). 
9We basically disambiguated the word senses man- 
ually, and there were not a disastrously large number 
of such cases. 
1°If the total frequency of the majority translation 
exceeds 75% of the total translation frequency, subtree 
generation halts. 
947 
<Root> generalized semantic class 
<Dhl> 0: 
I <ONhl> \[04\] : -~ < (2/0) \[ highway(l) path(l) \] 
<ONhl> \[44\] : ~-~ (9/1) \[ command(l) 
control(6) power(2) \] 
<ONhl> \[45\] • 
<SNhl> \[5\] ~ -~ ~ (1/0) \[ she(l) \] 
<SNhl> \[7\] : ~-~l~-j\[:~ 7~ (1/0) \[ West Germany(I) \] 
I <ONhl> \[9\] : original word forms (occurrence) 
I <SNhl> \[5\] : ~:~j~o "Ck, a < (1/0) \[ rebel(l) \] 
I <SNhl> Delta: -:~b:_ 7~'L7~ (1/0) 
I <SNhl>Shaw: ~Tj~o~ < (1/0) 
"~g word 
(A/B) A: data count, B: classification error 
Figure 6: Case Frame Tree Learned by LASA-1 
Table 6: Classification Results on Open Data(%) 
complete 
incomplete 
total 
leaf size 
LASA(120) 
correct err. 
59.2 20.0 
6.7 14.2 
65.8 34.2 
50.9 
DTLA(100) 
correct err. 
47.0 7.0 
4.0 42.0 
51.0 49.0 
57.9 
same depth of the tree. We can also observe that 
semantically close words are generalized by their 
common semantic code. 
Table 6 shows the percentage of each evaluation 
item. We have 120 open data, not 100, for LASA- 
1, because the data is expanded due to the seman- 
tic ambiguity. The term "incomplete" in the table 
denotes the cases where the tree retrieval stopped 
mid-way because of an "unknown word" in the 
classification. Such cases, however, could some- 
times hit the correct translation since the algo- 
rithm output the most frequent translation under 
the stopped node as the default answer. 
In table 6, we can recognize the sharp decrease 
in incomplete matching rate from 46.0 % (DTLA) 
to 20.8 % (LASA-1). The error rate also de- 
creased from 49.0 % (DTLA) to 34.2 % (LASA-1). 
The average tree size (measured by the number 
of leaves) for DTLA was 57.9, which dropped to 
50.9 for LASA-1. 
These results show that LASA-1 was able to 
satisfy our primary objectives: to solve the two 
problems mentioned in section 3, "weak prediction 
power" and "low legibility." 
5.2 Discussion 
The shape of the decision tree learned by LASA-1 
is sensitive to parameter a and the purity thresh- 
old. There is no guarantee that our method is the 
best, so it would be better to explore for a better 
criterion to decide these values. 
The penalty score !n this research was designed 
so that we get the maximum generalization if the 
error term in formula (2) stays constant. As a 
result, the subtrees in the deep part are highly 
generalized. In those parts, the data is sparse and 
the high-level generalization is questionable from 
a linguistic viewpoint. Some elaboration in the 
penalty function might be required. 
6 Conclusion 
We have proposed a decision tree learning al- 
gorithm (inductive Learning Algorithm with the 
Structured Attributes: LASA-1) that optimally 
handles the structured attributes. We applied 
LASA-1 to bilingual (English and Japanese) data 
and showed that it successfully leaned the gener- 
alized decision tree to classify the Japanese trans- 
lation for "take." The LASA-1 package still has 
some unmentioned features like the handling of 
the words unknown to the thesaurus and differ- 
ent a parameter setting. We would like to report 
those features at another opportunity after further 
experiments. 
References 
Hussein Almuallim, Yasuhiro Akiba, and Take- 
fumi Yamazaki. 1994. Two methods for learn- 
ing alt-j/e translation rules from examples and 
a semantic hierarchy. In Proc. of COLING9~, 
volume 1, pages 57-63. 
Hussein Almuallim, Yasuhiro Akiba, and Shigeo 
Kaneda. 1995. On handling tree-structured 
attributes in decision tree learning. In Proe. 
of 12th International Conference on Machine 
Learning, pages 12-20. 
Leo Breiman, Jerome H. Friedman, Richard A. 
Olshen, and Charles J. Stone. 1984. Classifica- 
tion and Regression Trees. Chapman & Hall. 
Michel Gondran and Michel Minoux. 1984. 
Graphs and Algorithms. John Wiley ~z Sons. 
Susumu Ono. 1985. Ruigo-kokugo-jiten. 
Kadokawa Shoten. 
John Ross Quinlan. 1986. Induction of decision 
trees. Machine Learning, 1:81-106. 
John Ross Quinlan. 1993. C~.5: Programs for 
Machine Learning. Morgan Kaufmann. 
Hideki Tanaka. 1994. Verbal case frame acqui- 
sition from a bilingual corpus: Gradual knowl- 
edge acquisition. In Proc. of COLINGg~, vol- 
ume 2, pages 727-731. 
Hideki Tanaka. 1995a. A linear-time algorithm 
for optimal generalization of language data. 
Technical Report NLC-95-07, IECE. 
Hideki Tanaka. 1995b. Statistical learning of 
"case frame tree" for translating english verbs. 
Natural Language Processing, 2(3):49-72. 
948 
