<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0717"> <Title>I- I Incorporating Knowledge in Natural Language Learning: A Case Study</Title> <Section position="4" start_page="122" end_page="122" type="metho"> <SectionTitle> 3 The SNOW Approach </SectionTitle> <Paragraph position="0"> The SNOW architecture is a network of threshold gates. Nodes in the first layer of the network represent the input features; target nodes are represented by nodes in the second layer. Links from the first to the second layer have weights; each target node is thus defined as a (linear) function of the lower level nodes.</Paragraph> <Paragraph position="1"> For example, in PPA, the two target nodes represent n and v attachments. Each target node can be thought of as an autonomous subnetwork, although they all feed from the same input. The subnetworks are sparse in that a target node needs not be connected to all nodes in the input layer. For example, it is not connected to input nodes (features) that were never active with it in the same example, or it may disconnect itself from some of the irrelevant inputs while training.</Paragraph> <Paragraph position="2"> Learning in SNOW proceeds in an on-line fashion I. Every example is treated autonomously by each target subnetwork, viewed as a positive example of a few subnetworks and a negative example for the others. In PPA, examples labeled n (v, resp.) are treated as positive for the n (v) target node and as negative for the v (n) target node. Thus, every example is used once by all the nodes to refine their definition, and then discarded. At prediction time, given an input which activates a subset of the input nodes, each subnetwork evaluates the total activity it receives. Subnetworks compete on determining the final prediction; the one which produces the highest activity gets to determine the prediction.</Paragraph> <Paragraph position="3"> In general, a target node in the SNOW architecture is represented by a collection of subnetworks, which we call a cloud, but in the application described here we have used cloud size of I so this will not be discussed here.</Paragraph> <Paragraph position="4"> The Winnow local mistake-driven learning algorithm (Littlestone, 1988) is used at each target node to learn its dependence on the input nodes. Winnow updates the weight on the links in a multiplicative fashion. We do not supply the details of the algorithm and just note that it can he implemented in 1 In the experimental study we do not update the network while testing.</Paragraph> <Paragraph position="5"> such a way that the update time of the algorithm depends on the number of active features in the example rather than the total number of features in the domain. The sparse architecture along with the representation of each example as a list of active features is reminiscent of infinite attribute models of Winnow (Blum, 1992).</Paragraph> <Paragraph position="6"> Theoretical analysis has shown that multiplicative update algorithms, like Winnow, have exceptionally good behavior in the presence of irrelevant attributes, noise, and even a target function changing in time (Littlestone, 1988; Littlestone and Warmuth, 1994; Herbster and Warmuth, 1995). In particular, Winnow was shown to learn efficiently any linear threshold function (Littlestone, 1988), with a mistake bound that depends on the margin between positive and negative examples. 
<Paragraph position="6"> Even when there are only two target nodes and the cloud size is 1, the behavior of SNOW differs from that of pure Winnow. While each of the target nodes is learned using a positive Winnow algorithm, a winner-take-all policy is used to determine the prediction. Thus, we use the learning algorithm here in a more complex way than as a mere discriminator.</Paragraph> <Paragraph position="7"> One reason is that the SNOW architecture, influenced by the Neuroidal system (Valiant, 1994), is being used in a system developed for the purpose of learning knowledge representations for natural language understanding tasks, and is being evaluated on a variety of tasks for which the node allocation process is of importance.</Paragraph> <Paragraph position="8"> We have experimented extensively with various architectures of SNOW on the PPA problem but can present only a small part of these experiments in this paper. The best performance, across several parameter sets and data sets, is achieved with a full architecture. In this case we initially link a target node to all features which occur in the training data (with a constant initial weight), and only then start training.</Paragraph> <Paragraph position="9"> Training in SNOW is nonetheless always done in an on-line fashion: each example is used only once for updating the weights, and only if a mistake was made on it.</Paragraph> </Section> <Section position="5" start_page="122" end_page="125" type="metho"> <SectionTitle> 4 Incorporating Semantic Knowledge </SectionTitle> <Paragraph position="0"> In this section we describe the effect of incorporating semantic knowledge on learning PPA with SNOW.</Paragraph> <Paragraph position="1"> The information sources are briefly described in Sec. 4.1, the experimental results are reported in Sec. 4.2, and results with random classes, used as a control set, are presented in Sec. 4.3.</Paragraph> <Paragraph position="2"> Winnow has three parameters: a threshold θ and two update parameters, a promotion parameter α > 1 and a demotion parameter 0 < β < 1. The experiments reported here were made using the full SNOW architecture, with β = 0.85, α = 1/β, θ = 1, and all the weights initialized to 0.1.</Paragraph> <Section position="1" start_page="123" end_page="124" type="sub_section"> <SectionTitle> 4.1 Semantic Data Sources </SectionTitle> <Paragraph position="0"> The semantic data sources specify for each noun a set of semantic classes. These classes result from a general linguistic study and are therefore not biased towards presenting data in the context of PPA. In addition, the vocabularies covered by the semantic data sources overlap the vocabulary of our training and test data only partially.</Paragraph> <Paragraph position="1"> Table 1 shows a summary of the class data. The knowledge sources which were incorporated are: WordNet (WN): WordNet 1.6 noun class information was used at various granularity levels. At the highest level, denoted WN1, nouns are classified according to their synsets. The lower levels are obtained by successively applying the hypernym relation defined in WordNet. Thus, WN2 is obtained by replacing each WN1 synset with the set of hypernyms to which it points, WN3 by performing a similar process on the WN2 hypernyms, etc. We have used WN1, WN5, WN10, and WN15; Table 1 lists properties of these datasets.</Paragraph>
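As an illustration of this WN-k construction, the following sketch derives the class set of a noun at a given level using NLTK's WordNet interface. It is only an approximation: NLTK ships a modern WordNet release rather than the WordNet 1.6 data used in the paper, and the treatment of synsets that have no further hypernyms is our assumption.

    from nltk.corpus import wordnet as wn

    def wn_classes(noun, level):
        """WN1 = the noun's synsets; each further level replaces every synset
        with the set of hypernyms to which it points."""
        classes = set(wn.synsets(noun, pos=wn.NOUN))
        for _ in range(level - 1):
            parents = {hyper for s in classes for hyper in s.hypernyms()}
            if not parents:   # assumption: stop once the top of the hierarchy is reached
                break
            classes = parents
        return {s.name() for s in classes}

    # coarser and coarser classes for one noun
    for k in (1, 5, 10, 15):
        print("WN%d" % k, sorted(wn_classes("observatory", k)))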
<Paragraph position="2"> CoreLex (CL): The CoreLex database (Buitelaar, 1998) was derived from WordNet as part of linguistic research attempting to provide a unified approach to the systematic polysemy and underspecification of nouns. Systematic polysemy is the phenomenon of word senses that are systematically related and therefore predictable over classes of lexical items.</Paragraph> <Paragraph position="3"> The thesis behind this database is that acknowledging the systematic nature of polysemy allows one to structure ontologies for lexical semantic processing that may help in generating more appropriate interpretations within context. It establishes an ontology and semantic database of 126 semantic types, covering around 40,000 nouns that were derived by an analysis of sense distributions in WordNet.</Paragraph> <Paragraph position="4"> It is clear that with such a coarse-grained ontology a lot of information is lost. The mapping is many-to-one: many words fall into a class due to only one of their senses, and there are cases of incomplete or inaccurate information.</Paragraph> <Paragraph position="5"> For example, observatory falls into the class artifact state; words like dog, lion, and table are missing from the vocabulary.</Paragraph> <Paragraph position="6"> Format Features (FF): These are two classes into which nouns can be classified using simple heuristics. The first consists of numbers (e.g., 1, 2, 100, three, million), and the second contains proper nouns. Each noun beginning with a capital letter was classified as a proper noun, which clearly gives a very crude approximation.</Paragraph> </Section> <Section position="2" start_page="124" end_page="124" type="sub_section"> <SectionTitle> 4.2 Experimental Results </SectionTitle> <Paragraph position="0"> In this section we present the results of incorporating various semantic data sources and their combinations. Since the classes were not compiled specifically for the PPA problem, some of the class information may be irrelevant or even slightly misleading. The results therefore provide an assessment of the relative relevance of each knowledge source.</Paragraph> <Paragraph position="1"> When a noun belongs to a class, one may replace the explicit noun feature by its classes. Using the classes in addition to the original noun (Brill and Resnik, 1994; Resnik, 1992; Resnik, 1995) seems, however, a better strategy. Consider, for example, the feature <prep,indirect-object=n2>. Suppose the noun n2 belongs to two classes c1 and c2. The class information is incorporated by creating two additional features, <prep,indirect-object=c1> and <prep,indirect-object=c2>, thereby enhancing the feature set without losing the original information. As mentioned above, giving up the original feature yielded degraded results.</Paragraph>
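The feature-enhancement strategy just described can be sketched as follows. The feature-string format, the noun_classes lookup table, and the small list of number words are illustrative stand-ins rather than the authors' actual encoding; the format-feature heuristic mirrors the crude FF classification of Sec. 4.1.

    NUMBER_WORDS = {"one", "two", "three", "ten", "hundred", "thousand", "million"}

    def format_features(noun):
        """The FF heuristic: numbers and capitalized words (proper nouns)."""
        if noun.lower() in NUMBER_WORDS or noun.replace(",", "").replace(".", "").isdigit():
            return {"is-number"}
        if noun[:1].isupper():
            return {"is-proper-noun"}
        return set()

    def expand_features(role, noun, noun_classes):
        """Keep the original lexical feature and add one extra feature per class."""
        feats = {"%s=%s" % (role, noun)}
        for c in noun_classes.get(noun, set()) | format_features(noun):
            feats.add("%s=%s" % (role, c))
        return feats

    # hypothetical CoreLex-style lookup
    corelex = {"observatory": {"artifact_state"}}
    print(expand_features("prep,indirect-object", "observatory", corelex))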
<Paragraph position="2"> The results of adding features from a single knowledge source, presented in Table 2, show that FF yielded small improvements over the lemma set, within the noise level; the WN1 synset information caused a slight degradation; and the CL and the other WN knowledge sources resulted in a significant improvement over the lemma case.</Paragraph> <Paragraph position="3"> An important property of the CL class information is that each CL class defines a distinct set of nouns, as each noun belongs to exactly one CL class. The synset (WN1) distribution differs greatly from that of the CL classes: each noun may belong to several synsets, allowing more potential conflicts. That property of the synset distribution gives rise to the performance degradation.</Paragraph> <Paragraph position="4"> Another important difference between the CL and WN1 classes is their granularity. There are around 60,000 synsets, whereas there are only 126 CL classes. The finer synset granularity means that a synset carries less information; thus, the CL classes add richer disjunctions than WN synsets do. The results of CL, WN5, WN10, and WN15 improve over the FF set; these results are within the noise level (cf. Sec. 2). The FF set covers relatively few nouns, hence the improvement it yields is quite small. The WordNet and CL vocabularies include neither words beginning with a capital letter nor numbers; therefore the WN and CL knowledge may be augmented with the FF information without loss of consistency. Nevertheless, since each number word (e.g., &quot;one&quot;, &quot;two&quot;, etc.) belongs to a different synset, augmenting WN1 with a numeric class is not expected to be very effective, because the words &quot;one&quot;, &quot;two&quot;, and &quot;1&quot; will all belong to different classes: synset(one), synset(two), and FF(is-number), respectively. [Table 2 caption: ... to the most common attachment in the training corpus, namely (v). lemma is our basic feature set, as in Sec. 2. The other columns present the prediction accuracy when adding each of our knowledge sources separately.]</Paragraph> <Paragraph position="5"> To assess the numeric class assignment, we examined the words &quot;one&quot;, &quot;two&quot;, &quot;three&quot;, &quot;ten&quot;, &quot;hundred&quot;, and &quot;million&quot;; only CL, WN3, and the subsequent WN knowledge sources assign the same hypernym to these words, therefore only these sources were augmented.</Paragraph> <Paragraph position="6"> The results are presented in Table 3; comparison with Table 2 shows that augmenting with FF knowledge yielded a slight improvement only for the CL set. There may be two explanations for this: (i) the CL classes are more appropriate for the PPA problem than the WN hypernyms, so the FF information fits in with fewer conflicts; (ii) the coverage of CL nouns on the test data is about 70% of that of WN (cf. Table 1), so there are more examples in which the CL and FF classes do not conflict. This issue requires further study.</Paragraph> </Section> <Section position="3" start_page="124" end_page="124" type="sub_section"> <SectionTitle> 4.3 Comparison with Random Classes </SectionTitle> <Paragraph position="0"> Adding semantic class information improved the SNOW learning results. However, adding class information is equivalent to adding disjunctions of the original features and, setting aside the semantic origin of the classes, the mere introduction of disjunctions enriches the knowledge representation and may yield a performance improvement.</Paragraph> <Paragraph position="1"> The motivation for using semantic classes goes, however, beyond this structural information. Nouns which have not appeared in the training data may appear in the test data under a known class; such nouns will thus be handled based on the experience gathered for the class.</Paragraph> <Paragraph position="2"> In this section we attempt to isolate the semantic content of the classes from their disjunctive role. Random classes, which mimic in different respects the structure of the semantic CL classes, were constructed. Comparing the results obtained with these classes to the results obtained with the CL classes shows the influence of the semantic aspect of the CL classes. Only some of the randomization strategies used are described here (a construction sketch follows below): CL200: 200 classes uniformly distributed over the CL nouns.</Paragraph> <Paragraph position="3"> CL126: 126 classes uniformly distributed over the CL nouns; here the number of CL classes is maintained. CL-PERM: a permutation of the CL nouns among their classes; this random structure preserves the original class distribution of CL.</Paragraph>
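A sketch of how such control classes can be generated, assuming the CoreLex data is available as a mapping from each noun to its single CL class; the function names and the fixed seeding are illustrative.

    import random

    def uniform_random_classes(nouns, n_classes, seed=0):
        """CL200 / CL126: assign each noun to one of n_classes uniformly at random."""
        rng = random.Random(seed)
        return {noun: "rand%d" % rng.randrange(n_classes) for noun in nouns}

    def permuted_classes(noun_to_class, seed=0):
        """CL-PERM: permute the nouns among the original classes, which preserves
        the class-size distribution of CL."""
        rng = random.Random(seed)
        nouns = list(noun_to_class)
        donors = nouns[:]
        rng.shuffle(donors)
        return {noun: noun_to_class[donor] for noun, donor in zip(nouns, donors)}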
<Paragraph position="4"> The random class results, shown in Table 4, indicate that indeed some of the gain from using classes may be due to these structural additions. However, the performance improvement introduced by the semantically meaningful CL classification is considerably larger.</Paragraph> </Section> <Section position="4" start_page="124" end_page="125" type="sub_section"> <SectionTitle> 4.4 Comparison with other works </SectionTitle> <Paragraph position="0"> This section presents a comparison of our work with other work on the PPA task. In order to obtain a fair comparison we have tested our system on the complete data set, including the preposition of (cf. Sec. 2). The results are compared with a maximum-entropy method (Ratnaparkhi et al., 1994), transformation-based learning (TBL; Brill and Resnik, 1994), an instantiation of back-off estimation (Collins and Brooks, 1995), and a memory-based method (Zavrel et al., 1997). All of these works used the same training and test data. Table 5 presents the comparison.</Paragraph> <Paragraph position="1"> In all cases, the quoted figures are the best results obtained by the authors, with the exception of the Brill and Resnik (1994) result, which was obtained by Zavrel et al. (1997) using the same method; originally, TBL was evaluated by Brill and Resnik (1994) on a smaller data set.</Paragraph> <Paragraph position="2"> Although all systems have used the same data, they have not used similar feature sets. Both Collins and Brooks (1995) and Zavrel et al. (1997) have enhanced the feature generation in various ways; as described in this paper, this was also done for SNOW.</Paragraph> </Section> </Section> </Paper>