<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1015">
  <Title>A Comparative Study of the Application of Different Learning Techniques to Natural Language Interfaces</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Learning Task
</SectionTitle>
    <Paragraph position="0"> Our interface architecture is displayed in Fig. 1. It represents a multilingual database interface for the languages English, German, and Japanese. First, the language of the user input is detected and the input is transferred to the corresponding language-specific morphological and lexical analyzer.</Paragraph>
    <Paragraph position="1"> Morphological and lexical analysis performs the tokenization of the input, i.e. the segmentation into individual words or tokens. This task is not always trivial as in the case of Japanese, which uses no spaces for separating words. As next step the input is transformed into a deep form list (DFL), which indicates for each token its surface form, category, and semantic deep form.</Paragraph>
    <Paragraph position="2"> For database interfaces, unknown values contained in the input possess particular importance for the meaning of a command. Therefore, we treat those unknown values separately in the unknown value list (UVL) analyzer. This module checks the data type of unknown values and looks them up in the database to find out whether they represent identifiers of existing entities. In such a case, the entity type is indicated in the resulting UVL, otherwise we use the data type instead.</Paragraph>
    <Paragraph position="3"> DFL and UVL represent the input to the machine learning (ML) classifier. It assigns a ranked list of command classes to the input sentence according to the learned classification rules. As last step the classifications are used for generating appropri- null ate database commands.</Paragraph>
    <Paragraph position="4"> For the encoding :of the training data we only make use of the semantic deep forms contained in the DFL.</Paragraph>
    <Paragraph position="5"> We use English concepts as deep forms and map them to binary features, i.e. a certain feature equals I if the deep form is a member of the DFL, otherwise it equals 0. For the elements of the UVL we apply a more detailed encoding, which maps the number and the type to binary features. Figure 2 shows an example of the features derived from English, German, and Japanese input sentences for the update of the purchase price for a material.</Paragraph>
    <Paragraph position="6"> Thus, the learning task replaces an elaborate semantic analysis of the user input. The development of the corresponding underlying rule base might require several man-months. The learning task represents a realistic real-life application, which differs from many other problems studied in machine learning research in that it consists of a large number of features and classes. Furthermore, the command classes are often very similar and even for human experts very difficult to distinguish.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Learning Algorithms
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Instance-based Learning
</SectionTitle>
      <Paragraph position="0"> Instance-based approaches represent the learned knowledge simply as collection of training cases or instances. For that purpose they use the same language as for the description of the training data (Quinlan, 1993a). A new case is then classified by finding the instance with the highest similarity and using its class as prediction. Therefore, instance-based algorithms are characterized by a very low training effort. On the other hand, this leads to a high storage requirement because the algorithm has to keep all training cases in memory. Besides this, one has to compare new cases with all existing instances, which results in a high computation cost for classification.</Paragraph>
      <Paragraph position="1"> Different instance-based algorithms Vary in how they assess the similarity (or distance) between two instances. Two very commonly used methods are IB1 (Aha et al., 1991) and IBI-IG (Daelemans and van den Bosch, 1992). Whereas IB1 applies the simple approach of treating all features as equally important, IBI-IG uses the information gain (Quinlan, 1986) of the features as weighting function.</Paragraph>
      <Paragraph position="2"> We have developed an algorithm called BIN-CAT for binary features with class-dependent weighting and asymmetric treatment of the feature values. The similarity between a new case X and a training case Y is calculated according to the following formula: null</Paragraph>
      <Paragraph position="4"> In this formula, n indicates the number of features, Di the number of instances that have value 1 for feature i, and Cy the class of the training case Y.</Paragraph>
      <Paragraph position="5"> The term p(Di, Cy) then denotes the proportion of instances in Di that belong to class Cy. o'(xl,yi), ~Y(a~i, yi), and 5x(zi, yi) are determined as follows:</Paragraph>
      <Paragraph position="7"> so that the second sum in (1) is rated higher for a larger number of occurrences of the ith feature for class Cy whereas the third sum is rated lower.</Paragraph>
      <Paragraph position="8"> This means that if the training case Y contains a certain feature and the new case X does not, then we rate this difference the stronger the more often the feature occurs for class Cy. On the other hand, for features appearing in the new case X but not in Y, the opposite is true.</Paragraph>
      <Paragraph position="9"> Finally, wi represents the weight of feature i. It is calculated by making use of the following formula:</Paragraph>
      <Paragraph position="11"> The term under the summation symbol represents the selectivity of feature i for class j. It equals 1 if either all or none of the cases have value 1 for this feature. In other words, all instances for class j then either possess or do not possess this feature, which makes it a very discriminative characteristic.</Paragraph>
      <Paragraph position="12"> The Other extreme is that p(Di,j) equals 50%. In that case, this feature allows for no prediction of the class and the term under the summation symbol becomes 0.</Paragraph>
      <Paragraph position="13"> We have implemented all above-mentioned algorithms for binary features in ROCK &amp; ROLL in that we store the instances as objects and assign to them the features as ordered lists sorted by the feature numbers. The calculation of the similarity between two cases is then realized as method invocation on the feature list. For example, Fig. 3 shows the ROCK method to compute the distance between two feature lists according to IS1.</Paragraph>
      <Paragraph position="14"> Besides pure instance-based learning we have also developed an algorithm BIN-PRO, which creates a prototype for each class. Those prototypes are then used for the comparison with new cases. This has the big advantage that one does not have to store all the training instances and that the number of required comparisons for classification is reduced to the number of existing classes. As similarity function between a new case X and a certain class C we use the following formula:</Paragraph>
      <Paragraph position="16"> In this formula, we give more emphasis to features f that are present in X in that we multiply them by lOci, the number of instances for class C.</Paragraph>
      <Paragraph position="17"> However, the second sum takes also important features for class C into account that are missing in the new case X. As weighting function wl we use again (3). The implementation in ROCK ~ ROLL is performed by creating an object for each prototype and by invoking the associated method for computing the similarity to a new test case.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Model-based Learning
</SectionTitle>
      <Paragraph position="0"> In contrast to instance-based learning, model-based approaches represent the learned knowledge in a theory language that is richer than the language used for the description of the training data (Quinlan, 1986). Such learning methods construct explicit generalizations of training cases resulting in a large reduction of the size of the stored knowledge base and the cost of testing new test cases.</Paragraph>
      <Paragraph position="1"> In our research we consider the subtypes of decision trees and rule-based learning as well as hybrid approaches between them. The main difference between the various methods for constructing decision trees is the selection of the feature for splitting a node. The following two main categories are distinguished: null * static splitting: selects the best feature for splitting always on the basis of the complete collection of instances, * dynamic splitting: re-evaluates the best feature for splitting for each node based on the current local set of instances.</Paragraph>
      <Paragraph position="2"> (4) Static splitting requires less computational effort because it performs the feature ranking only once for the construction process. However, it entails overhead to keep track of already used features and to eliminate features that provide no proper splitting of the set of instances. Besides that, dynamic splitting methods produce much more compact trees with fewer nodes, leaves, and levels. This results in a sharp reduction of the storage requirement as well as the number of comparisons during classification.</Paragraph>
      <Paragraph position="3"> We have implemented decision trees for static (BS-tree) and dynamic splitting (SO-tree) by using the weighting function (3) as ranking scheme for the splitting criterion. In addition, we have also implemented the IGTree algorithm (Daelemans et al., 1997), which uses the information gain as static splitting criterion, and C~.5 (Quinlan, 1993b), which applies the information gain to dynamic splitting. The decision trees are implemented in  distance(x: featurelist): int begin var ix: int; var iy: int; var fx: int; var fy: int; var dist: int;</Paragraph>
      <Paragraph position="5"> type declaration for feature lists list of features</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ROCK methods
</SectionTitle>
    <Paragraph position="0"> method for calculation of distance to feature list of new instance X persistent class definition visibility method for distance calculation index for instance X index for instance Y feature for instance X feature for instance Y computed distance initialization of index ix initialization of index iy while index/xS that of last feature for X do get feature number for instance X at index ix get feature number for instance Y at index iy if same feature, then increment indices else increment distance if feature number for X smaller than that for Y, then increment index/x else increment index iy if index ix&gt; that of last feature for X, then while index iy~ that of last feature for Y do increment index iy increment distance if index iy&gt; that of last feature for Y, then while index/x~ that of last feature for X do  and by linking the nodes according to the tree structure. The classification of a new case is then simply performed as top-down traversal of the tree starting from the root. Besides this exact search we have also implemented an approximate search method, which allows one incorrect edge along the traversal to find a larger number of similar cases.</Paragraph>
    <Paragraph position="1"> Rule-based learning represents a second large category of model-based techniques. It aims at deriving a set of rules from the instances of the training set. A rule is here defined as a conjunction of literals, which, if satisfied, assigns a class to a new case. For the case of binary features, the literals correspond to feature tests with positive or negative sign. This means that they check whether a new case possesses a certain feature (for positive tests) or not (for negative tests).</Paragraph>
    <Paragraph position="2"> The methods for deriving the rules originate from the field of inductive logic programming (Muggleton, 1992). One of the most prominent algorithms for rule-based learning is FOIL (Quinlan and Cameron-Jones, 1995), which learns for each class a set of rules by applying a separate-and-conquer strategy. The algorithm takes the instances of a certain class as target relation. It iteratively learns a rule and removes those instances from the target relation that are covered by the rule. This is repeated until no in-Winiwarter 8C/ Kambayashi 129 Learning and NL Interfaces</Paragraph>
    <Paragraph position="4"> method for performing feature test persistent class definition visibility method for performing feature test retums true if test is not satisfied, otherwise false test for positive sign get sign of feature test test if sign is positive get feature test if feature is not member of feature list test for negative sign get sign of feature test test if sign is negative get feature test if feature is member of feature list type declaration for rule  stances are left in the target relation. A rule is grown by repeated specialization, adding literals until the rule does not cover any instances of other classes. In other words, the algorithm tries to find rules that possess some positive bindings, i.e. instances that belong to the target relation, but no negative bindings for instances of other classes. Therefore, the reason for adding a literal is to increase the relative proportion of positive bindings.</Paragraph>
    <Paragraph position="5"> As weighting function for selecting the next literal, FOIL uses the information gain. We have implemented FOIL, and besides this, we also use the algorithm BIN-rules with the following weighting function: null w1,,,c = b\]. (b- - bT). * (5) In this formula, s indicates the sign of the feature test. The number of positive (negative) bindings after adding the literal for the test of feature f is written as b~&amp;quot; (57). Finally, b- indicates the number of negative bindings before adding the literal so that b- - b~ calculates the reduction of negative bindings achieved by adding the literal. The weights w1,~,c are calculated as class-dependent weights for class C by making use of the feature weights w! from (3):</Paragraph>
    <Paragraph position="7"> We have implemented the test of rules as deductive ROLL method as shown in Fig. 4. The invocation of the method is a query with the parameter fl for the feature list of the new case. The test returns Winiwarter ~ Kambayashi 130 Learning and NL Interfaces false for those rules that are satisfied by the new case. The result of the query can then be assigned to the set of satisfied rules rs by using the command: rs := \[{l:t}l,-~diffor(!fl)~l~\] ;. As in the case of decision trees, we have developed an approximate test, which tolerates one divergent literal.</Paragraph>
    <Paragraph position="8"> As last group of:model-based algorithms we look at hybrid approaches between decision trees and rule-based learning. There exist two ways in principle to combine the advantages of the two paradigms.</Paragraph>
    <Paragraph position="9"> The first one is to extract rules from a decision tree whereas the second one follows the opposite direction by constructing a decision tree from a rule base. As example of the first type of approach we have implemented C~. 5-R ULES (Quinlan, 1993b), which extracts rules from the decision tree built by C4.5.</Paragraph>
    <Paragraph position="10"> Rules are computed as paths along the traversal from the root to all'leaves. In a second run, rules are pruned by removing redundant literals and rules.</Paragraph>
    <Paragraph position="11"> Regarding the second type of approach, we start from the rule base:produced by BIN-rules and use it for building an SE-tree (Rymon, 1993). SE-trees are a generalization of decision trees in that they allow not only one but several feature tests at one node. Therefore, a much flatter and more compact tree structure is achieved. For the construction of the tree we sort the feature tests of the rules first.</Paragraph>
    <Paragraph position="12"> Starting from a root node, we then construct paths according to the literals of the individual rules. For this process we make use of existing paths as far as possible before creating new branches.</Paragraph>
  </Section>
class="xml-element"></Paper>