<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0704">
  <Title>Finding Representations for Memory-Based Language Learning</Title>
  <Section position="1" start_page="0" end_page="28" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Constructive induction transforms the representation of instances in order to produce a more accurate model of the concept to be learned. For this purpose, a variety of operators has been proposed in the literature, including a Cartesian product operator forming pair-wise higher-order attributes. We study the effect of the Cartesian product operator on memory-based language learning, and demonstrate its effect on generalization accuracy and data compression for a number of linguistic classification tasks, using k-nearest neighbor learning algorithms. These results are compared to a baseline approach of backward sequential elimination of attributes. It is demonstrated that neither approach consistently outperforms the other, and that attribute elimination can be used to derive compact representations for memory-based language learning without noticeable loss of generalization accuracy.</Paragraph>
    <Section position="1" start_page="0" end_page="24" type="sub_section">
      <SectionTitle>
Introduction
</SectionTitle>
      <Paragraph position="0"> Daelemans et al., 1997\]. Successful attribute elimination leads to compact datasets, which possibly increase classification speed. Constructive induction, on the other hand, tries to exploit dependencies between attributes, by combining them into complex attributes that increase accuracy of the classifier. For instance-based algorithms, this approach has been demonstrated to correct invalid independence assumptions made by the algorithm \[Pazzani, 1998\]: e.g., for the Naive Bayes classifier (Duda &amp; Hart, 1973), the unwarranted assumption that in general the various attributes a, = v, are independent, and form a joint probability model for the prediction of the class C: It is a widely held proposition that inductive learning models, such as decision trees \[Quinlan, 1993\] or k-nearest neighbor models \[Aha, Kibler &amp; Albert, 1991\], are heavily dependent upon their representational biases. Both decision tree algorithms and instance-based algorithms have been reported to be vulnerable to irrelevant or noisy attributes in the representation of exemplars, which unnecessarily enlarge the search space for classification \[John, 1997\]. In general, there are two options for dealing with this problem. Attribute elimination (or selection) can be applied in order to find a minimal set of attributes that is maximally informative for the concept to be learned. Attribute elimination can be seen as a radical case of attribute weighting \[Scherf &amp; Brauer, 1997, Aha, 1998\], where attributes are weighted on a binary scale, as either relevant or not; more fine-grained methods of attribute weighting take information-theoretic notions ifito account such as information gain ratio \[Quinlan, 1993.</Paragraph>
      <Paragraph position="2"> Constructive induction thus can be used to invent relationships between attributes that, apart fl'om possibly offering insight into the underlying structure of the learning task, may boost performance of the resulting classifier. Linguistic tasks are sequential by nature, as language processing is a linear process, operating on sequences with a temporal structure (see e.g. Cleeremans (1993) for motivation for the temporal structure of finite-state grammar learning). Learning algorithms like k-nearest neighbor or decision trees abstract away from this linearity, by treating representations as multisets of attribute-value pairs, i.e. permutation-invariant lists. Using these algorithms, constructive induction cannot be used for corrections on the linearity of the learning task, but it can be used to study attribute interaction irrespective of ordering issues.</Paragraph>
      <Paragraph position="3"> In this paper, the use of constructive induction is contrasted with attribute elimination for a set of linguistic learning tasks. The linguistic learning domain appears to be deviant from other symbolic domains in being highly susceptible to editing. It has been noticed \[Daelemans et al., 1999i\] that editing exceptional  instances from linguistic instance bases tends to harm generalization accuracy. In this study, we apply editing on the level of instance representation. The central question is whether it is more preferable to correct linguistic tasks by combining (possibly noisy or irrelevant) attributes, or by finding informative subsets.</Paragraph>
    </Section>
    <Section position="2" start_page="24" end_page="26" type="sub_section">
      <SectionTitle>
Representation Transformations
</SectionTitle>
      <Paragraph position="0"> John (1997) contains presentations of various attribute selection approaches. In Yang &amp; Honovar (1998), a genetic algorithm is used for finding informative attribute subsets, in a neural network setting. Cardie (1996) presents an attribute selection approach to natural language processing (relative pronoun disambiguation) incorporating a small set of linguistic biases (to be determined by experts).</Paragraph>
      <Paragraph position="1"> Many operators have been proposed in the literature for forming new attributes from existing ones. Pagallo &amp; Hauser (1990) propose boolean operators (like conjunction and negation) for forming new attributes in a decision tree setting. Aha (1991) describes IB3-CI, a constructive indiction algorithm for the instance-based classifier IB3. Aiming at reducing similarity between an exemplar and its misclassifying nearest neighbor, IB3-CI uses a conjunctive operator forming an attribute that discriminates between these two. Bloedorn &amp; Michalski (1991) present a wide variety of mathematical and logical operators within the context of the AQ17-DC1 system. A general perspective on constructive induction is sketched in Bloedorn, Michalski &amp; Wnek (1994). Keogh &amp; Pazzani (1999) propose correlation arcs between attributes, augmenting Naive Bayes with a graph structure.</Paragraph>
      <Paragraph position="2"> Pazzani (1998) proposes a Cartesian product operator for joining attributes, and compares its effects on generalization accuracy with those of attribute elimination, for (a.o.) the Naive Bayes and PEBLS (Cost &amp; Salzberg, 1993) classifiers. The Cartesian product operator joins two attributes At and A2 into a new, complex attribute At..42, taking values in the Cartesian product {&lt; a,,aj &gt;1 aie Values(At) ^ aj E Values(A.,_)} (2) where Values(A) is the value set of attribute A. The Cartesian product operator has an intrinsic linear interpretation: two features joined in a Cartesian product form an ordered pair with a precedence relation (the ordered pair &lt; a, b &gt; differs from the ordered pair &lt; b, a &gt;). This linear interpretation vanishes in learning algorithms that do not discern internal structure in attribute values (like standard nearest neighbor).</Paragraph>
      <Paragraph position="3"> Pazzani's backward sequential elimination and joinb~g algorithm (BSEJ) finds the optimal representation transformation by considering each pair of attributes in turn, using leave-one-out cross-validation to determine the effect on generalization accuracy. Attribute joining carries out an implicit but inevitable elimination step: wiping out an attribute being subsumed by a combination. This reduces the dimensionality of the result dataset with one dimension. Following successful joining, the BSEJ algorithm carries out an explicit elimination step, attempting to delete every attribute in turn (including the newly constructed attribute) looking for the optimal candidate using cross-validation.</Paragraph>
      <Paragraph position="4"> The algorithm converges when no more transformations can be found that increase generalization accuracy. This approach is reported to produce significant accuracy gain for Naive Bayes and for PEBLS. Pazzani contrasts BSEJ with a backward sequential ehmination algorithm (BSE, backward sequential elimination, progressively eliminating attributes (and thus reducing dimensionality) until accuracy degrades. He also investigates forward variants of these algorithms, which successively build more complex representations up to convergence. Both for PEBLS and Naive Bayes, attribute joining appears to be superior to elimination, and the backward algorithms perform better than the forward algorithms. For k-nearest neighbor algorithms based on the unweighted overlap metric, BSEJ did not out-perform BSE.</Paragraph>
      <Paragraph position="5"> Conditioning representation transformations oil the performance of the original classifier implements a wrapper approach (John, 1997; Kohavi &amp; John, 1998), which has proven an accurate, powerful method to measure the effects of data transformations on generalization accuracy. The transformation process is wrapped around the classifier, and no transformation is carried out that degrades generalization accuracy.</Paragraph>
      <Paragraph position="6"> In this study, two algorithms, an implementation of BSE and a simplification of the BSEJ algorithm, were wrapped around three types of classifiers: IBI-IG, IB1-IG&amp;MVDM (a classifier related to PEBLS in using MVDM) and IGTREE \[Daelemans et al., 1997\]. All of these classifiers are implemented in the TiMBL package \[Daelemans et al, 1999ii\]. IBI-IG is a k-nearest neighbor algorithm using a weighted overlap metric, where the attributes of instances have their information gain ratio as weight. For instances X and l', distance is computed as</Paragraph>
      <Paragraph position="8"> where 6 is the overlap metric, and w, is the information gain ratio (Quinlan, 1993) of attribute i.</Paragraph>
      <Paragraph position="9"> The PEBLS algorithm can be approximated to a certain extent by combining IBI-IG with the Modified Value Difference Metric (MVDM) of Cost &amp; Salzberg  (1993). The MVDM defines the difference between two values x and y respective to a class C, as</Paragraph>
      <Paragraph position="11"> i.e., it uses the probabilities of the various classes conditioned on the two values to determine overlap.</Paragraph>
      <Paragraph position="12"> Attribute weighting of IBI-IG&amp;MVDM (information gain ratio based) differs from PEBLS: PEBLS uses performance-based weighting based on class predicyion strength, where exemplars are weighted according to an accuracy or reliability ratio.</Paragraph>
      <Paragraph position="13"> IGTREE is a tree-based k-nearest neighbor algorithm, where information gain is used as a heuristic to insert nodes in the tree. For every non-terminal node, a default classification is stored for the path leading to it. Whenever no exact match can be found for an unknown instance to be classified, the default classification associated with the last matching attribute is returned as classification for the instance. Although IGTREE sometimes lags behind IBI-IG in accuracy, it provides for much faster, high quality classifiers.</Paragraph>
      <Paragraph position="14"> An implementation of the BSE algorithm is outlined in figure . It is akin in spirit to the backward elimination algorithm of John (1997). During every pass, it measures the effects on generalization accuracy of eliminating every attribute in turn, only carrying out the one which maximizes accuracy. A simplified version of the BSEJ algorithm called backward sequential joining with information gain ratio (BSJ-IG) is outlined in figure. N! It checks the ~ ordered combinations for N features during each pass, and carries out the one resulting in the maximum gain in accuracy (as a consequence of the permutation im, ariance, the total search space of N! possible combinations can be halved). Any two joined attributes are put on the position with the maximum information gain ratio of both original positions, after which the remaining candidate position is wiped out. Again, as the used classifiers are all permutation-invariant with respect to their representations, this is only a decision procedure to find a target position for the attribute combination; all candidate positions are equivalent target positions.</Paragraph>
      <Paragraph position="15"> Unlike the original BSEJ algorithm, BSJ-IG omits the additional explicit attribute elimination step directly after every attribute joining step, in order to segregate the effects of attribute joining as much as possible from those of attribute elimination.</Paragraph>
      <Paragraph position="16"> Both BSE and BSJ-IG algorithms are hill-climbing algorithms, and, as such, are vulnerable to local lninima. Ties are resolved randomly by both.</Paragraph>
    </Section>
    <Section position="3" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
Experiments
</SectionTitle>
      <Paragraph position="0"> The effects of forming Cartesian product attributes on generalization accuracy and reduction of dimensionality (compression) were compared with those of backward sequential elimination of attributes. The following 7 linguistic datasets were used. STRESS is a selection of secondary stress assignment patterns from the Dutch version of the Celex lexical database \[Baayen, Piepenbrock &amp; van Rijn, 1993\], on the basis of phonemic representations of syllabified words. Attribute values are phonemes. Also derived from Celex is the DIMIN task, a selection of diminutive formation patterns for Dutch. This task consists of assigning Dutch diminutive suffixes'to a noun, based on phonetic properties of (maximally) the last three syllables of the noun. Attribute values are phoneme representations as well as stress markers for the syllables. The WSJ-NPVP set consists of part-of speech tagged Wall Street Journal material (Marcus, Santorini &amp; Marcinkiewicz, 1993), supplemented with syntactic tags indicating noun phrase and verb phrase boundaries (Daelemans et al, 1999iii). wsJ-POS is a fragment of the Wall Street Journal part-of-speech tagged material (Marcus, Santorini and Marcinkiewicz, 1993). Attributes values are parts of speech, which are assigned using a windowing approach, with a window size of 5. INL-POS is a part-of-speech tagging task for Dutch, using tl~e Dutch-Tale tagset \[van der Voort van der Kleij et al., 1994\], attribute values are parts of speech. Using a windowing approach, on the basis of a 7-cell window, part of speech tags are disambiguated. GRAPHON constitutes a grapheme-to-phoneme learning task for English, based on the Celex lexical database. Attribute values are graphemes (single characters), to be classified as phonemes. PP-ATTACH, finally, is a prepositional phrase (PP) attachment task for English.. where PP's are attached to either noun or verb projections, based on lexical context. Attribute values are word forms for verb, the head noun of the following nouu phrase, the preposition of the following PP, and the head noun of the PP-internal noun phrase (like bring attention to problem). The material has been extracted by Ratnaparkhi et al. (1994) from the Penn Treebank Wall Street Journal corpus. Key numerical characteristics of the datasets are summarized in table 1.</Paragraph>
      <Paragraph position="1"> Each of these datasets was subjected to the BSJ-IG and the BSE wrapper algorithms, embedding either the IBI-IG or IGTREE architecture. Both the Naive Bayes and PEBLS classifier investigated by Pazzani (1998) allow for certain frequency tendencies hidden in the data to bear on the classification. This has a smoothing effect on the handling of low-frequency events, which benefit from analogies with more reliable higher-frequency  events. In order to assess the effects of smoothing, the following additional experiments were carried out. Embeddded into BSE and BSJ-IG, the PEBLS approximation IBI-IG with MVDM was applied to three datasets: STRESS, DIMIN and PP-ATTACH, for three values of k (1, 3, 7), the size of the nearest neighbor set. Values for k larger than 1, i.e. non-singleton nearest neighbor sets.</Paragraph>
      <Paragraph position="2"> have been found to reproduce some of the smoothing inherent to statistical back-off models (Daelemans et al.. 1999ii; Zavrel &amp; Daelemans, 1997).</Paragraph>
      <Paragraph position="3"> Generalization accuracy for every attribute joining or elimination step was measured using 10-fold crossvalidation, and significance was measured using a two-tailed paired t-test at the .05 level. All experiments were carried out on a Digital Alpha XL-266 (Linux) and a Sun UltraSPARC-IIi (Solaris). Due to slow performance of the IBI-IG model on certain datasets with the used equipment, IBI-IG experiments with %VSJ-NPVP could not be completed.</Paragraph>
    </Section>
    <Section position="4" start_page="26" end_page="28" type="sub_section">
      <SectionTitle>
Results
</SectionTitle>
      <Paragraph position="0"> The results show, first of all, that the compression rates obtained with BSE (average 34.9%) were consistently higher than those obtained with BSJ-IG (average 28.6%) (table 2).</Paragraph>
      <Paragraph position="1"> Secondly, BSE and BSJ-IG have com~)arable effects on accuracy. BSE generally boosts IGTREE performance to IBI-IG level, and leads to significant accuracy gains for two datasets, STRESS and PP-A.TTACH (tabel 3). BSJ-IG does so for the STRESS set (tabel 4). Neither BSE nor BSJ-IG produce any significant gain in accuracy for the IBI-IG classifier. This generalizes the findings of Pazzani (1998) ibr classifiers based on unweighted overlap metrics to classifiers based oll a weighted overlap metric.</Paragraph>
      <Paragraph position="2"> For the classifier IBI-IG&amp;MVDM. the situation is more complex (table 5). First, for k = 1. BSE and BSJ-IG have comparable accuracy. For the STRESS and PP-ATTACH sets, both algorithms produce significant and comparable accuracy gains. Second, compression by BSE is significantly higher than compression oy BSJ-IG (47.2% vs. 30.6%).</Paragraph>
      <Paragraph position="3"> For the larger values for k (3, 7), BSJ-IG produces significant higher accuracies on the STRESS set, outperforming BSE. Moreover, BSJ-IG yields a compression rate comparable to BSE. BSE compression drops from 47.2% to 27.8%.</Paragraph>
      <Paragraph position="4"> A detailed look at the representations produced by BSE and BSJ-IG reveals the following.</Paragraph>
      <Paragraph position="6"> only agree on wsJ-POS: they both join the same attributes. For the other datasets, there is no overlap at all.</Paragraph>
      <Paragraph position="7"> * (BSE) For the wsJ-POS set, BSE deletes exactly the same two features that are joined by BSJ-IG for IB1-IG and IGTREE. For the DIMIN set, IBI-IG&amp;BSE and IGTREE&amp;BSE delete 4 common features. For STRESS, all features deleted by IBI-IG&amp;BSE are deleted by IGTREE&amp;BSE as well. On the INL-POS set, three common features are deleted. Frequently, BSE was found to delete an attribute joined by BSJIG. null * (IBI-IG&amp;MVDM, BSJ-IG) BSJ-IG produces no overlap for D1MIN for the three different classifiers (k = 1,3,7). For STRESS, the k = 1, k = 3 and k = 7 classifiers join one common pair of attributes. This is the pair consisting of the nucleus and coda of the last syllable, indeed a strong feature for stress assignment (Daelemans, p.c.). For PP-ATTACH, the k = 1, k - 3 and k = 7 classifiers identify attribute 4 (the head noun of the PP-internal noun phrase) for .joining with another attribute. Attribute 4 clearly introduces sparseness in the dataset: it has 5~695 possible values, opposed to maximally 4,405 values for the other attributes. The k = 3 and k = 7 classifiers agree fully here.</Paragraph>
      <Paragraph position="8"> * (IBI-IG&amp;MVDM, BSE) On the DIMIN set, the k = 1 and k = 3 classifiers differ in 1 attribute elimination only. They display no overlap with k = 7, which eliminates entirely other attributes. For STRESS, k = 1 and k = 3 classifiers overlap on 3 attributes. The three classifiers delete 1. common attribute (not the nucleus or coda). For PP, the k = 3 and k = 7 classifters do not eliminate attributes; the k = 1 classifier deletes the attribute 4 (PP-internal head noun), and even the first verb-valued attribute. In doing so, it constitutes a strongly lexicalised model for PP-attachment taking only into account the first head noun and the following preposition.</Paragraph>
      <Paragraph position="9"> BSE produced more overlapping results across classitiers than BSJ-IG. IBI-IG&amp;MVDM with BSJ-IG is the only type of classifier that is able to trap the important interaction between nucleus and coda in the STRESS set. Due to lack of domain knowledge, we cannot be certain that other important interactions have ,lot been  trapped as well; this lies outside the scope of this study. Although firm conclusions cannot be drawn on the basis of three datasets only, the compact and accurate results of the k = 3 and k = 7 classifiers may indicate a tendency for smoothing algorithms to compensate better for eventual non-optimal attribute combinations than for eliminated attributes. This would be in agreement with Pazzani's findings for PEBLS and Naive Baves.</Paragraph>
      <Paragraph position="10"> Frequently, cases were observed where BSE eliminates attributes that were used for joining by BSJ-IG. This indicates that at least some of the advantages of attribute joining originate from implicit attribute elimination rather than combination, which has also been noted by Pazzani (1998): removing an attribute may improve accuracy more than joining it to another attribute. null Conclusions The effects of two representation-changing algorithms on generalization accuracy and data compression were tested for three different types of nearest neighbor classifters, on 7 linguistic learning tasks. As a consequence of the permutation-invariance of the used classifiers and the use of hill-climbing algorithms, a practical sampling of the search space of data transformations was applied. BSE. an attribute elimination algorithm, was found to produce accurate classifiers, with consistently higher data compression rates than BSJ-IG, an attribute joining algorithm. The generalization accuracy of BSE is comparable to that of BSJ-IG.</Paragraph>
      <Paragraph position="11"> Some evidence hints that attribute joining may be more succesful - both for compression and accuracy - for classifiers employing smoothing techniques, e.g.</Paragraph>
      <Paragraph position="12"> PEBLS-Iike algorithms which select a nearest neighbor from a nearest neighbor set using frequency information. This type of classifier was able to trap at least one important attribute interaction in the STRESS domain, offering extended insight in the underlying learning task. Further evidence is needed to confirm this conjecture, and may shed further li6ht on the question whether and how linguistic learning tasks could benefit from attribute interaction. An alternative line of research to be pursued will address cla~ssifier models that allow for linear encoding of linguistic learning tasks: these models will allow investigations into corrections on the linearity of linguistic tasks.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>