<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0628">
  <Title>PP-Attachment: A Committee Machine Approach</Title>
  <Section position="5" start_page="232" end_page="234" type="metho">
    <SectionTitle>
2 A neural network approach to PP-attachment
</SectionTitle>
    <Paragraph position="0"> to PP-attachment The use of classes is fundamental when working with neural networks. Using words alone without their classes in real texts, floods the memory capacity of a neural network. It is well known that the use of words creates huge probabilistic tables. In Mdition, the use of classes successfully deals with problems of invariance related to compositionality and binding that neural networks have \[Sopena, 1996\]. PP attachment can be considered as a classification problem were 4-tuples are classified in two classes: as to whether it is attached to the noun or to the verb \[Sopena et al. 1998\].</Paragraph>
    <Paragraph position="1"> These classes are represented in the output units. When a local representation for classes is used (one class per unit) the output actiwtion of each unit can be interpreted as the Bayesian posterior probability that the pattern in the input belongs to the class represented by this unit. In our case we have two units: one representing the class &amp;quot;attached to noun&amp;quot; and the other the class &amp;quot;attached to verb&amp;quot;. The activitation of these units will represent the respective probability of attachment given the 4-tuple encoded in the input.</Paragraph>
    <Paragraph position="2"> Given the set of words in the 4-tuple we have to determine a way to represent senses and semantic class information. Polysemy represents a problem when using word classes. In order to use class information, two different procedures are possible. The first one consists in presenting all the classes of each sense of each word serially. The second one consists in the simultaneous presentation of all the senses of all the words. In previous works we have found that parallel presentation improve results.</Paragraph>
    <Paragraph position="3"> The parallel procedure has the advantage of detecting in the network classes that are related  to others within the same slot or among different slots.</Paragraph>
    <Paragraph position="4"> Presenting all of the classes simultaneously (including verb classes) allows us to detect complex interactions among them (either the classes of a particular sense or the classes of different senses of a particular word) that cannot be detected in most of the methods used so far. We have been able to detect their existence in our studies on word sense disambiguation currently being carrying out. If we present simultaneously all the classes of all the senses of each word in the 4-tuple we will have a very complex input.</Paragraph>
    <Paragraph position="5"> A system capable of dealing with such an input would be able to select classes (and consequently senses) which are compatible with other ones.</Paragraph>
    <Paragraph position="6"> Finally, and related to the above, most of the statistical methods used in Natural Language Processing are linear. Multilayer feedforward networks are non linear. One of the objectives of experiments is to see if introducing non-linearity improves the results.</Paragraph>
    <Section position="1" start_page="233" end_page="233" type="sub_section">
      <SectionTitle>
2.1 Test and training data
</SectionTitle>
      <Paragraph position="0"> We used the same data set (the complete training set and the complete test set) as \[Ratnaparkhi et. al 1994\] for purposes of comparison. In this data set the 4-tuples of the test and training sets were extracted from Penn Tree-bank Wall Street Journal \[Marcus et al. 1993\]. The test data consisted of 3,097 4-tuples with 20,801 4-tuples for the training data.</Paragraph>
      <Paragraph position="1"> The following process was run over both test and training data: All numbers were replaced by the string</Paragraph>
    </Section>
    <Section position="2" start_page="233" end_page="234" type="sub_section">
      <SectionTitle>
2.2 Codification
</SectionTitle>
      <Paragraph position="0"> The input was divided into eight slots. The first four slots represented 'verb', 'nl', 'prep', and 'n2' respectively. In slots 'nl' and 'n2' each sense of the corresponding noun was encoded using all the classes within the IS-A branch of the Word-Net hierarchy. This was done from the corresponding hierarchy root node to its bottom-most node. In the verb slot, the verb was encoded using the IS-A-WAY-OF branches. Each node in the hierarchy received a local encoding. There was a unit in the input for each node of the WordNet subset. This unit was ON if it represented a semantic class to which one of the senses of the encoded word belonged.</Paragraph>
      <Paragraph position="1"> Using a local representation we needed a unit for each class-synset. The number class-synsets in WordNet is too large for a neural network. In order to reduce the number of input units we did not use WordNet directly, but constructed a new hierarchy (a subset of WordNet) including only the classes that corresponded to the words that belonged to the training and test sets.</Paragraph>
      <Paragraph position="2"> A feedforward neural network can make good use of class information if there is a sufficient number of examples belonging to each class. For that reason we also counted the number of times the different semantic classes appeared in the training and test sets. The hierarchy was pruned taking these statistics into account. Given a threshold h, classes which appeared less than h% were not included. In all the experiments of this paper, we used tree cut thresholds of 2% . Regarding prepositions, only the 36 most fl'equent ones were represented (those found more than 20 times). For those, a local encoding was used.</Paragraph>
      <Paragraph position="3"> The rest of the prepositions were left uncoded.</Paragraph>
      <Paragraph position="4"> The fifth slot represented the prepositions that the verb subcategorized. By representing the prepositions, \[Sopena et al. 1998\] had obtained improved results. The reason for this improvement being that English verbs with semantic similarity may take on different prepositions (for example, accuse with of and blame with for).</Paragraph>
      <Paragraph position="5"> Apart from semantic classes, verbs can also be  classified on the basis of the kind of prepositions they make use of.</Paragraph>
      <Paragraph position="6"> The prepositions that the verbs subcategorize were initially extracted from COMLEX \[Wolff et al., 1995\]. Upon observation that COMMLEX does not consider all the subcategorized prepositions, we complemented COMLEX with information extracted from training data.</Paragraph>
      <Paragraph position="7"> The prepositions of all the 4-tuples assigned to the verb were considered. The distinction between PP adjuncts and PP close-related were not available in the Ratnaparkhi data set. Therefore, we grouped the subcategorized prepositions by their verbs as well as those that govern PP adjuncts. Only the 36 most frequent prepositions were represented.</Paragraph>
      <Paragraph position="8"> The sixth slot represented the prepositions that were governed by 'nl'. Again, only the 36 most frequent prepositions were represented.</Paragraph>
      <Paragraph position="9"> These prepositions were extracted from the 4-tuples of the training data whose attachments were to the noun.</Paragraph>
      <Paragraph position="10"> The next slot represented 15 units for the lexicography verb files of WordNet. WordNet has a large number of verb root nodes, some of which are not frequent. Due to this fact, in some cases the pruning that was carried out on the tree made root nodes disappear. This lead to some of the verbs that belonged to this class not being coded. In order to avoid these cases, we used the names of the WordNet verb lexicographical files to add a new top level in the WordNet verb class hierarchy. Finally, in the last slot there are 2 units to indicate whether or not the N1 or N2 respectively were proper nouns .</Paragraph>
      <Paragraph position="11"> Regarding the output, there were only two units representing whether the PP was attached to the verb or to the noun.</Paragraph>
      <Paragraph position="12"> Feedforward networks with one hidden layer and full interconnectivity between layers were used in all the experiments. The networks were trained with the backpropagation learning algorithm. The activation function was the hyperbolic tangent function. The number of hidden units used was 0, 50 and 100. For all simulations the momentum was 0, and the initial weight range 0.1.</Paragraph>
      <Paragraph position="13"> A validation set was constructed using 12,029 4-tuples extracted from the Brown Corpus.</Paragraph>
      <Paragraph position="14"> In each run the networks were trained for 60 epochs storing the epoch weights with the smallest error regarding the validation set, as well as the weights of the 60th epoch (without the validation set).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="234" end_page="235" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> Table 2 shows the results of 24 training simulations obtained in the test data using 0, 50 and 100 hidden units respectively. We show the best results by the networks acting individually.</Paragraph>
    <Paragraph position="1">  In spite of the high cost in computational time of neural net training, the response time in test mode is up 3 times faster than Backed-Off model. This is shown in Table 2 where the time taken to disambiguate 3,097 4-tuples is given.</Paragraph>
    <Paragraph position="2"> In this problem we had a high level of noise: on one hand the inadequate senses of each word in the 4-tuple. Words in English have a high number of senses thus, in the input, the level of noise (inadequate sense) can reach 5 times that of signal (correct sense). In addition, the Ratnaparkhi data set contains many errors, some of them due to errors originating from the Penn Treebank I. This level of noise deteriorates the generalizing capacity of the neural network.</Paragraph>
    <Paragraph position="3"> There are many methods that permit a neural network to improve its capacity of generalization. For reasons of complexity, the size of the network that we are using places restrictions on the selection of the method. Of the methods that we are testing, committee machines allow us to improve results the most easily.</Paragraph>
    <Section position="1" start_page="235" end_page="235" type="sub_section">
      <SectionTitle>
3.1 Experiments with committees of networks
</SectionTitle>
      <Paragraph position="0"> networks: The performance of a committee machine \[Perrone, Cooper, 1993\], \[Perrone, 1994\] and \[Bishop C., 1995\] can outperform that of the best single network used in isolation.</Paragraph>
      <Paragraph position="1"> As \[Kuncheva et al. 1998\] points out, the process of combining multiple classifiers to achieve higher accuracy is given different names in the literature apart from committee machines: combination, classifier fusion, mixture of experts, consensus aggregation, classifier ensembles, etc. We have applied the following algorithms: average, weighted average, OWA operator, Choquet integral and stacked generalization.</Paragraph>
      <Paragraph position="2">  Suppose we have a set of N trained network models yi(x) where i = 1, ..., N.</Paragraph>
      <Paragraph position="3"> We can then write the mapping function of each network as the desired function t(x) plus an error function \[Bishop C., 1995\]:</Paragraph>
      <Paragraph position="5"> The average sum-of-squares error for model</Paragraph>
      <Paragraph position="7"> The output of the committee is the average of the outputs of the N networks that integrates the committee, in the form</Paragraph>
      <Paragraph position="9"> If we make the assumption that the errors ei(x) have zero mean and are uncorrelated, we have</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="235" end_page="236" type="metho">
    <SectionTitle>
$E_{COM} = \frac{1}{N} E_{AV}$
</SectionTitle>
    <Paragraph position="0"> where ECOM is the average error made by the committee and EAV is the average error made by the networks acting individually.</Paragraph>
    <Paragraph position="1"> In general, the errors ei(x) are highly correlated but even then it is easy to show that</Paragraph>
    <Paragraph position="3"> As some members of the committee will invariably give better results than others, it is of interest to give more weight to some of the members than to others taking the form:</Paragraph>
    <Paragraph position="5"> here wi is based on the error of the validation and learning set.</Paragraph>
    <Paragraph position="6">  where {a(1),-.-,a(n)} is a permutation of {1,..-,n) such that ya(i- 1) &gt;_ ya(i) for all i=2,...,n.</Paragraph>
    <Paragraph position="7"> The OWA operator permits weighting the values in relation to their ordering. Results are show in Tables 3, 4 and 5.  The fuzzy integral introduced by \[Choquet G, 1954\] and the associated fuzzy measures, provide a useful way for aggregation information. A fuzzy measure u defined on the measurable space (X,X) is a set function</Paragraph>
    <Paragraph position="9"> (X,X,u) is said to be a fuzzy measurable space.</Paragraph>
    <Paragraph position="10">  If u is a fuzzy measure on X, then the Choquet integral of a function f : X --+ R with respect to</Paragraph>
    <Paragraph position="12"> where f(y~(i)) indicates that the indices have been permuted so that</Paragraph>
    <Paragraph position="14"> One characteristic property of Choquet integrals is monotonicity, i.e., increases of the input lead to higher integral values. Results are shown in Table 6 and Table 7.</Paragraph>
    <Section position="1" start_page="236" end_page="236" type="sub_section">
      <SectionTitle>
3.2 Stacked generalization
</SectionTitle>
      <Paragraph position="0"> \[Wolpert, 1992\] provides one way of combining trained networks which partitions the data set in order to find an overall system which usually improves generalization. The idea is to train the level-0 networks first and then examine their behavior when generalizing. This provides a new training set which is used to train the level-1 network. The inputs consist of the outputs of all the level-0 networks, and the target value is the corresponding target value from the original full data set. Our experiments using this method did not give improved results (85.35%).</Paragraph>
      <Paragraph position="1"> Net 1 Net 2 Net 3 Net 4 Choquet</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>