File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-1224_metho.xml
Size: 20,709 bytes
Last Modified: 2025-10-06 14:15:14
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1224"> <Title>Do Not Forget: Full Memory in Memory-Based Learning of Word Pronunciation *</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The word-pronunciation data </SectionTitle> <Paragraph position="0"> Converting written words to stressed phonemic transcription, i.e., word pronunciation, is a well-known benchmark task in machine learning (Stanfill and Waltz, 1986; Sejnowski and Rosenberg, 1987; Shavlik, Mooney, and Towell, 1991; Dietterich, Hild, and Baklri, 1990; Wolpert, 1990). We define the task as the conversion of fixed-sized instances representing parts of words to a class representing the phoneme and the stress marker of the instance's middle letter. To genexate the instances, windowing is used (Sejnowski and Rosenberg, 1987). Table I displays example instances and their classifications generated on the basis of the sample word booking. Classificatious, i.e., phonemes with stress markers (henceforth PSs), are denoted by composite labels. For exampie, the first instance in Table 1, -_book, maps to dass labd /b/l, denoting a/b/ which is the first phoneme of a syllable receiving primary stress. In this study, we chose a fixed window width of seven letters, which offers sufficient context information for adequate performance, though extension of the window decreases ambiguity within the data set (Van den Bosch, 1997).</Paragraph> <Paragraph position="1"> The task, henceforth referred to as Qs (Graphemephoneme conversion and stress assignment) is similar to the NBTTALK task presented by Sejnowski and Rosenberg (1986), but is performed on a laxger corpus, of 77,565 English word-pronunciation pairs, extracted from the cBr.Bx lexical data base (Burnage, 1990). Converted into fixed-sized instance, the van den Bosch and Daelemans 196 Memory-Based Learning of Word Pronunciation</Paragraph> <Paragraph position="3"/> <Paragraph position="5"> task from the word booking.</Paragraph> <Paragraph position="6"> full instance base representing the as task contains 675,745 instances. The task features 159 classes (combined phonemes and stress markers). The coding of the output as 159 atomic ('local') classes combining grapheme-phoneme conversion and stress assignment is one out of many types of output coding (Shavlik, Mooney, and Towel\], 1991), e.g., distributed bit coding using articulatory features (Sejnowski and Rosenberg, 1987), error-correcting output coding (Diettefich, Hild, and Bakid, 1990), or split discrete coding of gmpheme-phoneme conversion and stress assignment (Van den Bosch, 1997).</Paragraph> <Paragraph position="7"> While these studies point at back-propagation learning (Rumelhart, Hinton, and Williams, 1986), using distributed output code, as the better petformer as compared to ID3 (Quinlan, 1986), a symbolic inductive-learning decision tree algorithm (Dietterich, Hild, and Bakid, 1990; Shavllk, Mooney, and Towel\], 1991), unless IV3 was equipped with error-correcting output codes and additional manual tweaks (Dietterich, Hild, and Bakiri, 1990). Systematic experiments with the data also used in this paper have indicated that both back-propagation and decision-tree learning (using either distributed or atomic output coding) ate consistently and significantly outperformed by memory-based learning of gmpheme-phoneme conversion, stress assignment, and the combination of the two (Van den Bosch, 1997), using atomic output coding. 
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Algorithm and experimental setup </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Memory-based learning in IB1-IG </SectionTitle>
<Paragraph position="0"> In the experiments reported here, we employ IB1-IG (Daelemans and Van den Bosch, 1992; Daelemans, Van den Bosch, and Weijters, 1997b), which has been demonstrated to perform adequately, and significantly better than eager-learning algorithms, on the GS task (Van den Bosch, 1997). IB1-IG constructs an instance base during learning. An instance in the instance base consists of a fixed-length vector of n feature-value pairs (here, n = 7), an information field containing the classification of that particular feature-value vector, and an information field containing the occurrences of the instance with its classification in the full training set. The latter information field thus enables the storage of instance types rather than the more extensive storage of identical instance tokens. After the instance base is built, new (test) instances are classified by matching them to all instance types in the instance base, and by calculating with each match the distance between the new instance X and the memory instance type Y, Δ(X, Y), using the function given in Eq. 1:</Paragraph>
<Paragraph position="1"> $$\Delta(X, Y) = \sum_{i=1}^{n} W(f_i)\,\delta(x_i, y_i) \quad (1)$$ </Paragraph>
<Paragraph position="2"> where W(f_i) is the weight of the ith feature, and δ(x_i, y_i) is the distance between the values of the ith feature in the instances X and Y. When the values of the instance features are symbolic, as with the GS task (i.e., feature values are letters), a simple distance function is used (Eq. 2):</Paragraph>
<Paragraph position="3"> $$\delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases} \quad (2)$$ </Paragraph>
<Paragraph position="4"> The classification of the memory instance type Y with the smallest Δ(X, Y) is then taken as the classification of X. This procedure is also known as 1-NN, i.e., a search for the single nearest neighbour, the simplest variant of k-NN (Devijver and Kittler, 1982).</Paragraph>
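The following sketch illustrates how Eqs. 1 and 2 and the 1-NN step could be realised; it is an illustration, not the authors' implementation. `instance_base` is assumed to be a list of (feature-value tuple, class) instance types and `weights` a list of per-feature information-gain values (Eqs. 3-5 below); the frequency-based tie resolution described later in this section is omitted.

```python
def delta(xi, yi):
    """Eq. 2: overlap distance between two symbolic feature values (letters)."""
    return 0.0 if xi == yi else 1.0

def distance(x, y, weights):
    """Eq. 1: information-gain weighted distance between instances x and y."""
    return sum(w * delta(xi, yi) for w, xi, yi in zip(weights, x, y))

def classify_1nn(x, instance_base, weights):
    """1-NN: return the class of the stored instance type nearest to x."""
    best_features, best_class = min(instance_base,
                                    key=lambda entry: distance(x, entry[0], weights))
    return best_class
```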
<Paragraph position="5"> The weighting function of IB1-IG, W(f_i), represents the information gain of feature f_i. Weighting features in k-NN classifiers such as IB1-IG is an active field of research (cf. (Wettschereck, 1995; Wettschereck, Aha, and Mohri, 1997) for comprehensive overviews and discussion). Information gain is a function from information theory also used in ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993). The information gain of a feature expresses its relative relevance compared to the other features when performing the mapping from input to classification.</Paragraph>
<Paragraph position="6"> The idea behind computing the information gain of features is to interpret the training set as an information source capable of generating a number of messages (i.e., classifications) with a certain probability. The information entropy H of such an information source can be compared in turn, for each of the features characterising the instances (let n equal the number of features), to the average information entropy of the information source when the values of those features are known.</Paragraph>
<Paragraph position="7"> Data-base information entropy H(D) is equal to the number of bits of information needed to know the classification given an instance. It is computed by equation 3, where p_i (the probability of classification i) is estimated by its relative frequency in the training set.</Paragraph>
<Paragraph position="8"> $$H(D) = -\sum_{i} p_i \log_2 p_i \quad (3)$$ </Paragraph>
<Paragraph position="9"> To determine the information gain of each of the n features f_1 ... f_n, we compute the average information entropy for each feature and subtract it from the information entropy of the data base. To compute the average information entropy for a feature f_i, given in equation 4, we take the average information entropy of the data base restricted to each possible value for the feature. The expression D_{[f_i=v_j]} refers to those patterns in the data base that have value v_j for feature f_i, j is the number of possible values of f_i, and V is the set of possible values for feature f_i. Finally, |D| is the number of patterns in the (sub) data base.</Paragraph>
<Paragraph position="10"> $$H(D_{[f_i]}) = \sum_{v_j \in V} \frac{|D_{[f_i=v_j]}|}{|D|}\, H(D_{[f_i=v_j]}) \quad (4)$$ </Paragraph>
<Paragraph position="11"> Information gain of feature f_i is then obtained by equation 5.</Paragraph>
<Paragraph position="12"> $$W(f_i) = H(D) - H(D_{[f_i]}) \quad (5)$$ </Paragraph>
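A rough Python rendering of Eqs. 3-5, computing the information-gain weight of one feature. The names `data`, `entropy`, and `information_gain` are illustrative, and `data` is assumed to be a list of (feature-value tuple, class) training instances; this is a sketch, not the original implementation.

```python
from collections import Counter
from math import log2

def entropy(classes):
    """Eq. 3: H(D) = -sum_i p_i log2 p_i over the class distribution."""
    total = len(classes)
    return -sum((c / total) * log2(c / total) for c in Counter(classes).values())

def information_gain(data, i):
    """Eqs. 4-5: W(f_i) = H(D) minus the value-weighted average entropy for feature i."""
    h_d = entropy([cls for _, cls in data])
    by_value = {}
    for features, cls in data:
        by_value.setdefault(features[i], []).append(cls)
    h_d_fi = sum(len(sub) / len(data) * entropy(sub) for sub in by_value.values())
    return h_d - h_d_fi

# weights = [information_gain(data, i) for i in range(7)]  # one weight per window position
```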
<Paragraph position="13"> Using the weighting function W(f_i) acknowledges the fact that for some tasks, such as the current GS task, some features are far more relevant (important) than other features. Using it, instances that match on a feature with a relatively high information gain are regarded as less distant (more alike) than instances that match on a feature with a lower information gain.</Paragraph>
<Paragraph position="14"> Finding a nearest neighbour to a test instance may result in two or more candidate nearest-neighbour instance types at an identical distance to the test instance, yet associated with different classes. The implementation of IB1-IG used here handles such cases in the following way. First, IB1-IG selects the class with the highest occurrence within the merged set of classes of the best-matching instance types. In case of occurrence ties, the classification is selected that has the highest overall occurrence in the training set (Daelemans, Van den Bosch, and Weijters, 1997b).</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Setup </SectionTitle>
<Paragraph position="0"> We performed a series of experiments in which IB1-IG is applied to the GS data set, systematically edited according to each of the three tested criteria (plus the baseline random criterion) described in the next section. We performed the following global procedure: 1. We partitioned the full GS data set into a training set of 608,228 instances (90% of the full data set) and a test set of 67,517 instances (10%). For use with IB1-IG, which stores instance types rather than instance tokens, the data set was reduced to contain 222,601 instance types (i.e., unique combinations of feature-value vectors and their classifications), with frequency information.</Paragraph>
<Paragraph position="1"> 2. For each exceptionality criterion (i.e., typicality, class prediction strength, friendly-neighbourhood size, and random selection), (a) we created four edited instance bases by removing 1%, 2%, 5%, and 10% of the most exceptional instance types (according to the criterion) from the training set, respectively; (b) for each of these increasingly edited training sets, we performed one experiment in which IB1-IG was trained on the edited training set, and tested on the original unedited test set.</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Three estimations of exceptionality </SectionTitle>
<Paragraph position="0"> We investigate three methods for estimating the (degree of) exceptionality of instance types: typicality, class prediction strength, and friendly-neighbourhood size.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Typicality </SectionTitle>
<Paragraph position="0"> In its common meaning, &quot;typicality&quot; denotes roughly the opposite of exceptionality; atypicality can be said to be a synonym of exceptionality. We adopt a definition from Zhang (1992), who proposes a typicality function that computes the typicality of instance types by taking both their feature values and their classifications into account.</Paragraph>
<Paragraph position="1"> He adopts the notions of intra-concept similarity and inter-concept similarity (Rosch and Mervis, 1975) to do this. First, Zhang introduces a distance function similar to Equation 1, in which W(f_i) = 1.0 for all features (i.e., flat Euclidean distance rather than information-gain weighted distance), in which the distance between two instances X and Y is normalised by dividing the summed squared distance by n, the number of features, and in which δ(x_i, y_i) is given as Equation 2. The normalised distance function used by Zhang is given in Equation 6.</Paragraph>
<Paragraph position="2"> $$\Delta(X, Y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \delta(x_i, y_i)^2} \quad (6)$$ </Paragraph>
<Paragraph position="3"> The intra-concept similarity of instance X with classification C is its similarity (i.e., 1 - distance) with all instances in the data set with the same classification C; this subset is referred to as X's family, Fam(X). Equation 7 gives the intra-concept similarity function Intra(X) (|Fam(X)| being the number of instances in X's family, and Fam(X)^i the ith instance in that family).</Paragraph>
<Paragraph position="4"> $$Intra(X) = \frac{1}{|Fam(X)|} \sum_{i=1}^{|Fam(X)|} \left(1 - \Delta(X, Fam(X)^i)\right) \quad (7)$$ </Paragraph>
<Paragraph position="5"> All remaining instances belong to the subset of unrelated instances, Unr(X). The inter-concept similarity of an instance X, Inter(X), is given in Equation 8 (with |Unr(X)| being the number of instances unrelated to X, and Unr(X)^i the ith instance in that subset).</Paragraph>
<Paragraph position="6"> $$Inter(X) = \frac{1}{|Unr(X)|} \sum_{i=1}^{|Unr(X)|} \left(1 - \Delta(X, Unr(X)^i)\right) \quad (8)$$ </Paragraph>
<Paragraph position="7"> The typicality of an instance X, Typ(X), is the quotient of X's intra-concept similarity and X's inter-concept similarity, as given in Equation 9.</Paragraph>
<Paragraph position="8"> $$Typ(X) = \frac{Intra(X)}{Inter(X)} \quad (9)$$ </Paragraph>
<Paragraph position="9"> An instance type is typical when its intra-concept similarity is larger than its inter-concept similarity, which results in a typicality larger than 1.</Paragraph>
<Paragraph position="10"> An instance type is atypical when its intra-concept similarity is smaller than its inter-concept similarity, which results in a typicality between 0 and 1.</Paragraph>
<Paragraph position="11"> Around typicality value 1, instances cannot be sensibly called typical or atypical; Zhang (1992) refers to such instances as boundary instances.</Paragraph>
<Paragraph position="12"> In our experiments, we compute the typicality of all instance types in the training set, order them on their typicality, and remove 1%, 2%, 5%, and 10% of the instance types with the lowest typicality, i.e., the most atypical instance types.</Paragraph>
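The sketch below follows Eqs. 6-9 directly. It assumes instance types are tuples of symbolic feature values, each with a single class, and that both Fam(X) and Unr(X) are non-empty; the function names are illustrative rather than the original code.

```python
from math import sqrt

def zhang_distance(x, y):
    """Eq. 6: unweighted overlap distance, normalised by the number of features n."""
    n = len(x)
    return sqrt(sum(0.0 if xi == yi else 1.0 for xi, yi in zip(x, y)) / n)

def typicality(x, c, data):
    """Eq. 9: intra-concept similarity (Eq. 7) over inter-concept similarity (Eq. 8)."""
    family = [y for y, cy in data if cy == c]        # Fam(X): same classification
    unrelated = [y for y, cy in data if cy != c]     # Unr(X): all remaining instances
    intra = sum(1.0 - zhang_distance(x, y) for y in family) / len(family)
    inter = sum(1.0 - zhang_distance(x, y) for y in unrelated) / len(unrelated)
    return intra / inter

# Typicality < 1: atypical; > 1: typical; near 1: boundary instance.
```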
<Paragraph position="13"> In addition to these four experiments, we performed an additional eight experiments using the same percentages, editing on the basis of (i) instance types' typicality (by ordering them in reverse order) and (ii) their indifference towards typicality or atypicality (i.e., the closeness of their typicality to 1.0, by ordering them on the absolute value of their typicality minus 1.0). The experiments with removing typical and boundary instance types provide interesting comparisons with the more intuitive editing of atypical instance types.</Paragraph>
<Paragraph position="14"> Table 2 provides examples of four atypical, boundary, and typical instance types found in the training set. Globally speaking, (i) the set of atypical instances tends to contain foreign spellings of loan words; (ii) there is no clear characteristic of boundary instances; and (iii) 'certain' pronunciations, i.e., instance types with high typicality values, often involve instance types of which the middle letters are at the beginning of words or immediately following a hyphen, or high-frequency instance types, or instance types mapping to a low-frequency class that always occurs with a certain spelling (class frequency is not accounted for in Zhang's metric).</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Class-prediction strength </SectionTitle>
<Paragraph position="0"> A second estimate of exceptionality is to measure how well an instance type predicts the class of all instance types within the training set (including itself). Several functions for computing class-prediction strength have been proposed, e.g., as a criterion for removing instances in memory-based (k-NN) learning algorithms, such as IB3 (Aha, Kibler, and Albert, 1991) (cf. earlier work on edited k-NN (Wilson, 1972; Voisin and Devijver, 1987)), or for weighting instances in the EACH algorithm (Salzberg, 1990; Cost and Salzberg, 1993). We chose to implement the straightforward class-prediction strength function as proposed in (Salzberg, 1990) in two steps. First, we count (a) the number of times that the instance type is the nearest neighbour of another instance type, and (b) the number of occurrences in which, when the instance type is a nearest neighbour of another instance type, the classes of the two instances match. Second, the instance's class-prediction strength is computed by taking the ratio of (b) over (a). An instance type with class-prediction strength 1.0 is a perfect predictor of its own class; a class-prediction strength of 0.0 indicates that the instance type is a bad predictor of classes of other instances, presumably indicating that the instance type is exceptional.</Paragraph>
<Paragraph position="1"> We computed the class-prediction strength of all instance types in the training set, ordered the instance types according to their strengths, and created edited training sets with 1%, 2%, 5%, and 10% of the instance types with the lowest class-prediction strength removed, respectively. In Table 3, four sample instance types are displayed which have class-prediction strength 0.0, i.e., the lowest possible strength. They are never a correct nearest-neighbour match, since they all have higher-frequency counterpart types with the same feature values.</Paragraph>
<Paragraph position="2"> For example, the letter sequence _algo occurs in two types, one associated with the pronunciation /'æ/ (viz., primary-stressed /æ/, or 1æ in our labelling), as in algorithm and algorithms; the other associated with the pronunciation /,æ/ (viz., secondary-stressed /æ/, or 2æ), as in algorithmic. The latter instance type occurs less frequently than the former, which is the reason that the class of the former is preferred over the latter. Thus, an ambiguous type with a minority class (a minority ambiguity) can never be a correct predictor, not even for itself, when using IB1-IG as a classifier, which always prefers high frequency over low frequency in case of ties.</Paragraph>
<Paragraph position="3"> [Table 3: Sample instance types with the lowest possible class-prediction strength (cps), 0.0.]</Paragraph> </Section>
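A sketch of the two-step counting described above. The helper `nearest` is a hypothetical function, assumed to return the index of the instance type that IB1-IG would select as instance i's nearest neighbour (including its frequency-based tie resolution); types that are never selected as anyone's nearest neighbour are given no strength value here. This is an illustration, not the authors' implementation.

```python
def class_prediction_strength(data, nearest):
    """cps (Salzberg, 1990): per type, class matches over times it acted as nearest neighbour."""
    times_nearest = [0] * len(data)   # (a) how often type j is selected as a nearest neighbour
    matches = [0] * len(data)         # (b) how often its class matched in those cases
    for i, (_, cls) in enumerate(data):
        j = nearest(i, data)          # hypothetical IB1-IG nearest-neighbour lookup
        times_nearest[j] += 1
        if data[j][1] == cls:
            matches[j] += 1
    return [m / t if t else None for m, t in zip(matches, times_nearest)]

# Types with cps 0.0 (e.g. minority ambiguities) are the first removed when editing.
```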
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Friendly-neighbourhood size </SectionTitle>
<Paragraph position="0"> A third estimate of the exceptionality of instance types is counting by how many nearest neighbours of the same class an instance type is surrounded in instance space. Given a training set of instance types, for each instance type a ranking can be made of all of its nearest neighbours, ordered by their distance to the instance type. The number of nearest-neighbour instance types in this ranking with the same class, henceforth referred to as the friendly-neighbourhood size, may range between 0 and the total number of instance types of the same class. When the friendly neighbourhood is empty, the instance type only has nearest neighbours of different classes. The argumentation for regarding a small friendly neighbourhood as an indication of an instance type's exceptionality follows the same argumentation as used with class-prediction strength: when an instance type has nearest neighbours of different classes, it is, vice versa, a bad predictor for those classes. Thus, the smaller an instance type's friendly neighbourhood, the more it could be regarded as exceptional.</Paragraph>
<Paragraph position="1"> To illustrate the computation of friendly-neighbourhood size, Table 4 lists four examples of possible nearest-neighbour rankings (truncated at ten nearest neighbours) with their respective numbers of friendly neighbours. The table shows that the number of friendly neighbours is the number of similarly-labelled instances counted from left to right in the ranking, until a dissimilarly-labelled instance occurs.</Paragraph>
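A sketch of the friendly-neighbourhood count as illustrated by Table 4. The parameter `dist` is assumed to be a distance function such as Eq. 1, and the handling of exact distance ties is left to the sort order, so this approximates the described procedure rather than reproducing the authors' code.

```python
def friendly_neighbourhood_size(i, data, dist):
    """Count leading same-class neighbours in instance i's nearest-neighbour ranking."""
    x, cls = data[i]
    ranking = sorted((j for j in range(len(data)) if j != i),
                     key=lambda j: dist(x, data[j][0]))
    fns = 0
    for j in ranking:
        if data[j][1] != cls:
            break                     # first dissimilarly-labelled neighbour ends the count
        fns += 1
    return fns
```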
<Paragraph position="2"> Friendly-neighbourhood size and class-prediction strength are related functions, but differ in their treatment of class ambiguity. As stated above, instance types may receive a class-prediction strength of 0.0 when they are minority ambiguities. Counting a friendly neighbourhood does not take class ambiguity into account; each of a set of ambiguous types necessarily has no friendly neighbours, since they are each other's nearest neighbours with different classes. Thus, friendly-neighbourhood size does not discriminate between minority and majority ambiguities. In Table 5, four sample instance types are listed with friendly-neighbourhood size 0.</Paragraph>
<Paragraph position="3"> [Table 5: Sample instance types (feature values, class) with the lowest possible friendly-neighbourhood size (fns), 0, i.e., no friendly neighbours.]</Paragraph>
<Paragraph position="4"> While some of these instance types without friendly neighbours in the training set (perhaps with friendly neighbours in the test set) are minority ambiguities (e.g., __edib 2~), others are majority ambiguities (e.g., __edib 1~), while others are not ambiguous at all but simply have a nearest neighbour at some distance with a different class (e.g., soiree_ 0z).</Paragraph> </Section> </Section> </Paper>