File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/c96-2159_metho.xml
Size: 21,434 bytes
Last Modified: 2025-10-06 14:14:19
Decision Tree Learning Algorithm with Structured Attributes: Application to Verbal Case Frame Acquisition

2 The Structured Attribute Problem

Figure 1 shows an example decision tree representing a case frame for the verb "take." This decision tree was called the case frame tree (Tanaka, 1994), and we follow that convention in this paper, too.

One may recognize that the restrictions in figure 1 are not semantic categories but words: this tree was learned from table 1, which contains word forms as the values. Although the tree has some attractive features mentioned in (Tanaka, 1994), it suffers from two problems.

* Weak prediction power: a case frame tree built from word forms does not have high prediction power on the open data (the data not used for learning). The nouns are the most problematic, since the open data will contain many unknown nouns.

* Low legibility: if we include many different nouns in the training data (the data used for learning), the obtained tree will have as many branches as there are nouns. Such a ramified tree is hard for humans to understand.

Introducing a thesaurus, or semantic hierarchy, into a case frame tree seems a sound way to ameliorate these two problems. We can replace similar nouns in a case frame tree by a proper semantic class, which will reduce the size of the tree while increasing the prediction power on the open data. But how can we introduce a thesaurus into the conventional DTLA framework? This is exactly the "structured attributes" problem mentioned in section 1.

3 The Problem Setting

3.1 Partial Thesaurus

The DTLA takes an attribute-value-class table as its input.[1] Although the table usually includes multiple attributes, the algorithm evaluates an attribute's goodness as a classifier independently of the rest of the attributes. In other words, a "single attribute table" as shown in table 1 is the fundamental unit for the DTLA. This table shows an imaginary relationship between the object noun of the verb "take" and the Japanese translation. We used this table to learn the case frame tree in figure 1, and it suffered from the two problems above.

[1] We mainly use the terms attribute, value, and class for generality. In our application they actually refer to the case, the restrictions on the case, and the translation of the verb, respectively. In this paper, we use these terms interchangeably.
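To make the setting concrete, the following fragment sketches such a single attribute table as data: the attribute is the object-noun case of "take," each value is a word form, and each class is a Japanese translation. The Taro/Hanako/cat/dog/elephant rows follow the running example in the text; the remaining rows and their translations are hypothetical placeholders.

    # A "single attribute table" for the verb "take": (value, class) pairs.
    # The first five rows follow the paper's running example; the last two
    # rows ("bag", "piano") and their translations are hypothetical.
    single_attribute_table = [
        ("Taro",     "tsurete-iku"),
        ("Hanako",   "tsurete-iku"),
        ("cat",      "tsurete-iku"),
        ("dog",      "tsurete-iku"),
        ("elephant", "hakobu"),
        ("bag",      "motte-iku"),   # hypothetical
        ("piano",    "hakobu"),      # hypothetical
    ]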
Here, we can assume that the word forms of the object noun are in a thesaurus (we call this thesaurus the original thesaurus) and that the relevant part can be extracted, as in figure 2. We call this tree a partial thesaurus T. The notation used below is listed in table 2.

Table 2. Notation.
  T       partial thesaurus
  r       root node of T
  p       any node in T (subscripts i, j distinguish nodes)
  P       any node set in T
  N       set of all nodes in T
  L(p)    set of leaves under p

If we replace "Taro" and "Hanako" in table 1 by "*human" in T, for example, and assign the translation "tsurete-iku" to "*human," the learned case frame tree shrinks by one leaf (two leaves in figure 1 are replaced by one). If we replace "Taro," "Hanako," "cat," "dog," and "elephant" by "*mammal," and assign the translation "tsurete-iku" to "*mammal" (the majority translation under the node "*mammal" in T; we use this "majority rule" for class assignment throughout), then the learned case frame tree shrinks by four leaves. But that case frame tree will produce two translation errors ("hakobu" for "elephant") when we classify the original table 1 with it. In both cases, the learned case frame trees are expected to have reinforced prediction power on the open data thanks to the semantic classes: the replacement in the table generalizes the case frame tree. We want high-level generalization with low-level translation errors; but how do we achieve this in an optimum way?

3.2 Unique and Complete Cover Generalization

One factor we have to consider is which combinations of nodes in T may be used to generalize the single attribute table. In this paper, we only allow node sets that cover the word forms in the table uniquely and completely. These two requirements are formally defined below using the notation in table 2.

Definition 1: For a given node set P ⊂ N, P is called a unique cover node set if L(p_i) ∩ L(p_j) = ∅ for all p_i, p_j ∈ P with i ≠ j.

Definition 2: For a given node set P ⊂ N, P is called a complete cover node set if ∪_{p ∈ P} L(p) = L(r).

A node set that satisfies both definitions is called a unique and complete cover (UCC) node set, and each such node set is denoted by P_ucc. The set of all UCC node sets is denoted by 𝒫. Note that if we use only the leaves of T for generalization, there is no actual change in the table; this node set is also included in 𝒫. The total number of UCC node sets in a tree is generally huge. For example, the number of UCC node sets in a 10-ary tree of depth 3 is about 1.28 × 10^30. We will consider this problem in section 4.
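As an illustration of Definitions 1 and 2, the sketch below checks whether a candidate node set is a unique and complete cover of a small partial thesaurus. The tree shape and node names are assumptions loosely based on the paper's running example, not the actual figure 2.

    # Partial thesaurus as parent -> children; leaves carry the word forms.
    # The shape is an assumption loosely modelled on figure 2.
    TREE = {
        "*mammal": ["*human", "*beast"],
        "*human":  ["Taro", "Hanako"],
        "*beast":  ["cat", "dog", "elephant"],
    }

    def leaves(node):
        """L(p): the set of leaves under node p (a leaf is its own L(p))."""
        children = TREE.get(node)
        if not children:
            return {node}
        result = set()
        for child in children:
            result |= leaves(child)
        return result

    def is_ucc(node_set, root="*mammal"):
        """True iff node_set is a unique and complete cover (Definitions 1 and 2)."""
        covered = [leaves(p) for p in node_set]
        union = set().union(*covered) if covered else set()
        unique = len(union) == sum(len(s) for s in covered)   # Def. 1: no leaf overlap
        complete = union == leaves(root)                      # Def. 2: covers L(r)
        return unique and complete

    print(is_ucc({"*human", "*beast"}))                        # True
    print(is_ucc({"*human", "cat", "dog", "elephant"}))        # True
    print(is_ucc({"*human", "*mammal"}))                       # False: leaf overlap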
3.3 Goodness of Generalization

Another factor to consider is how to measure the goodness of a generalization. To evaluate this quantitatively, we assign a penalty score S(p) to each node p in T as

  S(p) = a · G_pen(p) + E(p)    (1)

where a is a coefficient, G_pen(p) is the penalty for the generality of p (a node p with low generality has a high G_pen(p)), and E(p) is the penalty for the errors induced by using p. A node with a small S(p) is preferable. G_pen(p) and E(p) are generally in mutual conflict: a high-generality node p (with a low G_pen(p)) will induce many errors, resulting in a high E(p), and vice versa. We measure a generalization's goodness by the total sum of the penalty scores of the nodes used for the generalization. There are several possible candidates for the penalty score function; we chose formula (2) for this research.

  S(p) = a · log(W / W(p')) + (|L(p)| / |L(r)|) · H(C(p))    (2)

The new notation is listed in table 3, in addition to table 2.

Table 3. Additional notation.
  W        total word count in the thesaurus
  p'       thesaurus node corresponding to p
  W(p')    word count under p'
  C(p)     set of classes under p
  f(c_i)   frequency of class c_i
  A        set of classes
  H(A)     entropy of the class distribution in A

The second term of formula (2) is the "weighted entropy" of the class distribution under node p, which coincides with Quinlan's criterion (Quinlan, 1993). We calculated G_pen(p) (the first term of formula (2)) from the word-count coverage of p' in the original thesaurus rather than in the partial thesaurus, since the original thesaurus usually contains many more words than the partial thesaurus and is thus expected to yield a better estimate of the generality of node p. The idea is shown in figure 3. The coefficient a is rather difficult to handle, and we will touch on this issue in section 4.3. The figures attached to each node in figure 2 are example penalty scores given by formula (2), under the assumption that T and the original thesaurus are the same and a = 0.0074.

With these preparations, we can now formally state the problem of the optimum generalization of the single attribute table.

The Optimum Attribute Generalization: given a tree whose nodes each have a score, find the P_ucc ∈ 𝒫 that minimizes the total sum of scores Σ_{p ∈ P_ucc} S(p).
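A minimal sketch of how a penalty score in the spirit of formula (2), and the total-score objective, could be computed. The word counts, class frequencies, and thesaurus nodes below are illustrative assumptions; only the coefficient value a = 0.0074 is taken from the paper's example, and the weighting of the entropy term follows the reconstruction of formula (2) given above.

    import math

    A_COEF = 0.0074      # coefficient a (value quoted in the paper's example)
    W_TOTAL = 50000      # W: total word count in the original thesaurus (assumed)

    # W(p'): word counts under the corresponding original-thesaurus nodes (assumed).
    word_count = {"*mammal": 4000, "*human": 1500, "*beast": 2500}
    # Class frequencies under each node of the partial thesaurus (assumed).
    class_freq = {
        "*mammal": {"tsurete-iku": 4, "hakobu": 1},
        "*human":  {"tsurete-iku": 2},
        "*beast":  {"tsurete-iku": 2, "hakobu": 1},
    }
    TOTAL_ROWS = 5       # |L(r)|: number of rows in the single attribute table

    def entropy(freqs):
        """H(.): entropy of a class distribution given as {class: frequency}."""
        n = sum(freqs.values())
        return -sum(f / n * math.log2(f / n) for f in freqs.values() if f)

    def score(p):
        """Penalty score S(p): generality penalty plus weighted class entropy."""
        rows = sum(class_freq[p].values())
        g_pen = math.log2(W_TOTAL / word_count[p])
        return A_COEF * g_pen + (rows / TOTAL_ROWS) * entropy(class_freq[p])

    def total_score(node_set):
        """Objective of the optimum attribute generalization problem."""
        return sum(score(p) for p in node_set)

    print(total_score({"*mammal"}))           # one very general node, impure classes
    print(total_score({"*human", "*beast"}))  # less general nodes, purer classes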
4 The Algorithms

4.1 The Algorithm T*

As mentioned in section 3, the number of UCC node sets in a tree tends to be gigantic, so we should obviously avoid an exhaustive search for the optimum generalization. To perform this search efficiently, we propose a new algorithm, T*. The essence of T* lies in converting the partial thesaurus from a tree T into a directed acyclic graph (DAG) 𝒯. This turns the problem into the shortest path problem in a graph, to which several efficient algorithms can be applied. We use the new notation in table 4 in addition to that in table 2.

Table 4. Additional notation.
  L_min(p)   leaf with the smallest index in L(p)
  L_max(p)   leaf with the biggest index in L(p)

The Algorithm T*

  Tstar(value, class) {
      extract the partial thesaurus T with value and class;
      /* conversion of T into a DAG 𝒯 */
      assign index numbers (1, ..., m) to the leaves of T from the left;
      add a start node s with index number 0 and an end node e with index number m+1;
      foreach (n ∈ N ∪ {s}) {
          extend an arc from n to each element of the set H_n defined by (4);
      }
      delete the original edges of T;
      /* search for the shortest path in 𝒯 */
      ...
  }

Here H_n (formula (4)) is the set of nodes whose leftmost leaf immediately follows the rightmost leaf of n, i.e. the nodes p with L_min(p) = L_max(n) + 1 in terms of leaf indices.

This algorithm first converts T in figure 2 into a DAG 𝒯, as in figure 4. We call this graph a traversal graph, each path from s to e in the traversal graph a traverse, and the set of nodes on a traverse a traversal node set. Two propositions hold for this construction: (1) the traversal graph is a DAG; (2) P is a UCC node set if and only if P is a traversal node set. Since proposition 2 holds, we can solve the optimum attribute generalization problem by finding the shortest traverse in the traversal graph. By applying a shortest path algorithm (Gondran and Minoux, 1984) to figure 4, we find the shortest traverse to be (s → *human → *beast → *instrument → e), and we obtain the optimally generalized table in table 5 and the generalized decision tree in figure 5.
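The sketch below mimics the T* construction on a toy partial thesaurus, assuming node scores are already given: leaves are indexed left to right, every node n is linked to the nodes whose leftmost leaf index is L_max(n) + 1, and the cheapest s-to-e path is found by dynamic programming over the DAG. The tree, the scores, and the resulting node set are illustrative, not the paper's figure 2/figure 4 data.

    from functools import lru_cache

    # Toy partial thesaurus (parent -> children) and assumed penalty scores S(p).
    TREE = {
        "root":   ["*human", "*beast"],
        "*human": ["Taro", "Hanako"],
        "*beast": ["cat", "dog", "elephant"],
    }
    SCORE = {"root": 0.50, "*human": 0.05, "*beast": 0.06,
             "Taro": 0.04, "Hanako": 0.04, "cat": 0.04, "dog": 0.04, "elephant": 0.04}

    def leaves(n):
        kids = TREE.get(n)
        return [n] if not kids else [x for k in kids for x in leaves(k)]

    INDEX = {leaf: i + 1 for i, leaf in enumerate(leaves("root"))}   # leaf indices 1..m
    NODES = list(TREE) + list(INDEX)                                 # internal nodes + leaves
    L_MIN = {n: min(INDEX[x] for x in leaves(n)) for n in NODES}
    L_MAX = {n: max(INDEX[x] for x in leaves(n)) for n in NODES}
    M = len(INDEX)

    def successors(n):
        """H_n: nodes whose leftmost leaf index is L_max(n) + 1 (cf. formula (4))."""
        last = 0 if n == "s" else L_MAX[n]
        if last == M:
            return ["e"]                     # everything is covered: go to the end node
        return [p for p in NODES if L_MIN[p] == last + 1]

    @lru_cache(maxsize=None)
    def best(n):
        """Cheapest cost and path from n to the end node e in the traversal DAG."""
        if n == "e":
            return 0.0, ("e",)
        options = []
        for p in successors(n):
            sub_cost, sub_path = best(p)
            options.append((SCORE.get(p, 0.0) + sub_cost, sub_path))
        cost, path = min(options)
        return cost, (n,) + path

    cost, traverse = best("s")
    print(cost, traverse)   # the inner nodes of the traverse form the optimum UCC node set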
4.2 Correctness and Time Complexity

We will not give a full proof of propositions 1 and 2 (the correctness of T*) because of the limited space, but we give an intuitive explanation of why they hold.

Suppose that we select "*human" in figure 2 for a UCC node set P_ucc; then we cannot also include "*mammal" in P_ucc, since the two nodes would share leaves, violating the unique cover. Meanwhile, to satisfy the complete cover we have to include a node that governs the leaf with index L_max(*human) + 1, i.e. "cat." In conclusion, we have to include "cat" or "*beast" in P_ucc, which is exactly the condition of formula (4). T* links all such possible successor nodes with arcs, so the traversal node sets exhaust 𝒫.

One may also easily see that the traversal graph is a DAG, since formula (4) only allows an arc between two nodes in the direction of increasing leaf index. Since proposition 1 holds, the time complexity of T* can be estimated from the number of arcs in the traversal graph: there is an algorithm for the shortest path problem in an acyclic graph that runs with time complexity O(M), where M is the number of arcs (Gondran and Minoux, 1984). We therefore want to clarify the relationship between the number of leaves (the data amount, denoted by D) and the number of arcs in the traversal graph. Unfortunately, this relationship varies with the shape of the tree (partial thesaurus), so we consider a practical case: a k-ary tree with depth d (Tanaka, 1995a). In this case, the number of arcs in the traversal graph is given by formula (5). Since the number of leaves D in the present thesaurus is k^d, the first term of formula (5) becomes approximately D, showing that T* has O(D) time complexity in this case.

Theoretically speaking, the time complexity becomes worse when the partial thesaurus is deep and has few leaves, but this situation hardly arises. We can say that T* has approximately linear time complexity in practice.

4.3 The LASA-1

The essence of DTLAs lies in the recursive "search and division": the algorithm searches for the best classifier attribute in a given table and then divides the table by the values of that attribute. The goodness of an attribute is usually measured by the following quantities (Quinlan, 1993); the notation is in table 3. Assume that a table contains a set of classes A = {c_1, ..., c_n}. The DTLA evaluates the "purity" of A in terms of the entropy of the class distribution, H(A). If an attribute has m different values, which divide A into m subsets B_1, ..., B_m, the DTLA evaluates the "purity after division" by the weighted sum of entropy, WSH(attribute, A):

  WSH(attribute, A) = Σ_{i=1..m} (|B_i| / |A|) · H(B_i)    (6)

The DTLA then measures the goodness of the attribute by

  gain = H(A) − WSH(attribute, A).    (7)

With these processes in mind, we can naturally extend the DTLA to handle structured attributes by integrating T*. The algorithm is listed below. It has two functions, makeTree() and Wsh(): makeTree() executes the recursive "search and division," and Wsh() calculates the weighted sum of entropy. T* is integrated into Wsh() at the first "if" clause.[6] In short, we use T* to optimally generalize the values of an attribute at each tree-generation step, which makes the extension quite natural.

The LASA-1

  place all classes in the input table under root;
  makeTree(root, table);

  makeTree(node, table) {
      A: class set in table;
      find the attribute which maximizes the gain of formula (7);
      ...
  }

We have implemented this algorithm as a package that we call LASA-1 (inductive Learning Algorithm with Structured Attributes). This package has many parameter-setting options; the most important one is for the parameter a in formula (2). Since it is not easy to find the best value before a trial, we used a heuristic method, and the value used in the next section was set as follows. We put equal emphasis on the two terms of formula (2) and fixed a so that the traverse via the root node of T and the traverse via the leaves only would have equal scores. At the beginning, LASA-1 calculated this value for each attribute in the original table. Although this heuristic does not guarantee the a with the minimum error on open data, the value was not too far off in our experience.

[6] Without this clause, the algorithm is just a conventional DTLA.
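A small sketch of the quantities in formulas (6) and (7) on a toy single attribute table; in LASA-1, Wsh() would first call T* to generalize the attribute's values (its first "if" clause), which is only indicated by a comment here. The table contents are illustrative.

    import math
    from collections import Counter

    # Toy single attribute table: (value, class) pairs; contents are illustrative.
    table = [("Taro", "tsurete-iku"), ("Hanako", "tsurete-iku"),
             ("cat", "tsurete-iku"), ("dog", "tsurete-iku"),
             ("elephant", "hakobu")]

    def entropy(classes):
        """H(A): entropy of a class distribution given as a list of class labels."""
        counts, n = Counter(classes), len(classes)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def wsh(table):
        """Formula (6): weighted sum of entropy after dividing the table by value.
        In LASA-1 the values would first be generalized by T*; here they are used as given."""
        groups = {}
        for value, cls in table:
            groups.setdefault(value, []).append(cls)
        n = len(table)
        return sum(len(g) / n * entropy(g) for g in groups.values())

    def gain(table):
        """Formula (7): H(A) - WSH(attribute, A)."""
        return entropy([cls for _, cls in table]) - wsh(table)

    print(gain(table))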
5 Empirical Evaluation

5.1 Experiment

We conducted a case frame tree acquisition experiment with LASA-1 and the DTLA,[7] using part of our bilingual corpus for the verb "take." We used 100 English-Japanese sentence pairs. The pairs contained 15 translations (classes) for "take," whose occurrences ranged from 5 to 9. We first converted the sentence pairs into an input table consisting of the case (attribute), the English word form (value), and the Japanese translation of "take" (class). We used 6 cases as attributes,[8] and some of these appear in figure 6.

We used the Japanese "Ruigo-Kokugo-Jiten" (Ono, 1985) as the thesaurus. It is a 10-ary tree with a depth of 3 or 4. The semantic class at each node of the tree was represented by 1 (top level) to 4 (lowest level) digits. To link the English word forms in the input table to the thesaurus in order to extract a partial thesaurus, we used the Japanese translations of the English word forms. When there was more than one possible semantic class for a word form, we gave all of them[9] and expanded the input table using all the semantic classes.

We evaluated both algorithms using the 10-fold cross validation method (Quinlan, 1993). The purity threshold for halting tree generation was experimentally set at 75%[10] for both algorithms.

Part of a case frame tree obtained by LASA-1 is shown in figure 6. We can observe that semantic codes and word forms are mixed at the same depth of the tree. We can also observe that semantically close words are generalized by their common semantic code.

Table 6 shows the percentage of each evaluation item. We had 120 open data items, not 100, for LASA-1, because the data is expanded due to semantic ambiguity. The term "incomplete" in the table denotes the cases where the tree retrieval stopped midway because of an "unknown word" during classification. Such cases, however, could sometimes still hit the correct translation, since the algorithm outputs the most frequent translation under the stopped node as the default answer. In table 6, we can recognize the sharp decrease in the incomplete matching rate from 46.0% (DTLA) to 20.8% (LASA-1).

[7] Part of LASA-1 was used as the DTLA.
[8] Adverb (DDhl), adverbial particle (Dhl), object noun (ONhl), preposition (PNfl), the head of the prepositional phrase (PNhl), and subject (SNhl).
[9] We basically disambiguated the word senses manually, and there were not a disastrously large number of such cases.
[10] If the total frequency of the majority translation exceeds 75% of the total translation frequency, subtree generation halts.
The error rate also decreased, from 49.0% (DTLA) to 34.2% (LASA-1). The average tree size (measured by the number of leaves) was 57.9 for the DTLA, which dropped to 50.9 for LASA-1. These results show that LASA-1 was able to satisfy our primary objectives: to solve the two problems identified in section 2, "weak prediction power" and "low legibility."

5.2 Discussion

The shape of the decision tree learned by LASA-1 is sensitive to the parameter a and the purity threshold. There is no guarantee that our method of setting them is the best, so it would be worth exploring better criteria for deciding these values.

The penalty score in this research was designed so that we obtain the maximum generalization when the error term in formula (2) stays constant. As a result, the subtrees in the deep parts of the tree are highly generalized. In those parts the data is sparse, and such high-level generalization is questionable from a linguistic viewpoint. Some elaboration of the penalty function might be required.