<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1223"> <Title>Modularity in Inductively-Learned Word Pronunciation Systems *</Title> <Section position="3" start_page="0" end_page="7" type="intro"> <SectionTitle> 2 Algorithm, Data, Methodology 2.1 Algorithm: IGTREE </SectionTitle> <Paragraph position="0"> IGTREE (Daelemans, Van den Bosch, and Weijters, 1997) is a top-down induction of decision trees (TDIDT) algorithm (Breiman et al., 1984; Quinlan, 1993). TDIDT is a widely-used method in supervised machine learning (Mitchell, 1997). IGTREE is designed as an optimised approximation of the instance-based learning algorithm IB1-IG (Daelemans and Van den Bosch, 1992; Daelemans, Van den Bosch, and Weijters, 1997). In IGTREE, information gain is used as a guiding function to compress a data base of instances of a certain task into a decision tree. (IGTREE can function with any feature-weighting method, such as gain ratio (Quinlan, 1993); for all experiments reported here, information gain was used.) Instances are stored in the tree as paths of connected nodes ending in leaves which contain classification information. Nodes are connected via arcs denoting feature values. Information gain is used in IGTREE to determine the order in which feature values are added as arcs to the tree. Information gain is a function from information theory, and is used similarly in ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993).</Paragraph> <Paragraph position="1"> The idea behind computing the information gain of features is to interpret the training set (i.e., the set of task instances for which all classifications are given and which are used for training the learning algorithm) as an information source capable of generating a number of messages (i.e., classifications) with a certain probability. The information entropy H of such an information source can then be compared, for each of the features characterising the instances (let n equal the number of features), to the average information entropy of the information source when the value of that feature is known.</Paragraph> <Paragraph position="2"> Data-base information entropy H(D) is equal to the number of bits of information needed to know the classification given an instance. It is computed by equation 1, where p_i (the probability of classification i) is estimated by its relative frequency in the training set.</Paragraph> <Paragraph position="3"> H(D) = - \sum_{i} p_i \log_2 p_i \qquad (1) </Paragraph> <Paragraph position="4"> To determine the information gain of each of the n features f_1 ... f_n, we compute the average information entropy for each feature and subtract it from the information entropy of the data base. To compute the average information entropy for a feature f_i, given in equation 2, we take the weighted average information entropy of the data base restricted to each possible value for that feature. The expression D_{[f_i = v_j]} refers to those patterns in the data base that have value v_j for feature f_i, where j ranges over the possible values of f_i, and V is the set of possible values of feature f_i. Finally, |D| is the number of patterns in the (sub) data base.</Paragraph> <Paragraph position="6"> H(D_{[f_i]}) = \sum_{v_j \in V} H(D_{[f_i = v_j]}) \times \frac{|D_{[f_i = v_j]}|}{|D|} \qquad (2) </Paragraph> <Paragraph position="8"> The information gain of feature f_i is then obtained by equation 3.</Paragraph> <Paragraph position="9"> G(f_i) = H(D) - H(D_{[f_i]}) \qquad (3) </Paragraph>
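To make the computation concrete, the following is a minimal sketch of equations 1-3 (Python; not the authors' implementation, and the function names and the representation of instances as tuples of feature values are assumptions made for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    # H(D), equation 1: bits needed to know the classification,
    # with p_i estimated by its relative frequency in the training set.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(instances, labels, i):
    # G(f_i) = H(D) - H(D[f_i]), equations 2 and 3.
    partitions = {}
    for inst, lab in zip(instances, labels):
        partitions.setdefault(inst[i], []).append(lab)   # D[f_i = v_j]
    total = len(labels)
    weighted = sum(entropy(part) * len(part) / total for part in partitions.values())
    return entropy(labels) - weighted

Ordering the features by decreasing information gain then fixes the level at which their values are added as arcs to the tree.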
<Paragraph position="10"> In IGTREE, feature-value information is stored in the decision tree on arcs. The first feature values, stored as arcs connected to the tree's top node, are those representing the values of the feature with the highest information gain, followed at the second level of the tree by the values of the feature with the second-highest information gain, etc., until the classification information represented by a path is unambiguous. Knowing the value of the most important feature may already uniquely identify a classification, in which case the other feature values of that instance need not be stored in the tree. Alternatively, it may be necessary for disambiguation to store a long path in the tree.</Paragraph> <Paragraph position="11"> Apart from storing uniquely identified class labels at leaves, IGTREE stores at each non-terminal node information on the most probable classification given the path so far. The most probable classification is the most frequently occurring classification in the subset of instances being compressed in the path being expanded. Storing the most probable class at non-terminal nodes is essential when processing new instances. Processing a new instance involves traversing the tree by matching the feature values of the test instance with arcs of the tree, in the order of the features' information gain. Traversal ends when (i) a leaf is reached or when (ii) matching a feature value with an arc fails. In case (i), the classification stored at the leaf is taken as output. In case (ii), we instead use the most probable classification stored at the most recently visited non-terminal node.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Data Acquisition and Preprocessing </SectionTitle> <Paragraph position="0"> The resource of word-pronunciation instances used in our experiments is the CELEX lexical data base of English (Burnage, 1990). All items in the CELEX data bases contain hyphenated spelling, syllabified and stressed phonemic transcriptions, and detailed morphological analyses. We extracted from the English data base of CELEX all the above information, resulting in a data base containing 77,565 unique items (word forms with syllabified, stressed pronunciations and morphological segmentations).</Paragraph> <Paragraph position="1"> For use in experiments with learning algorithms, the data is preprocessed to derive fixed-size instances. In the experiments reported in this paper, different morpho-phonological (sub)tasks are investigated; for each (sub)task, an instance base (training set) is constructed containing instances produced by windowing (Sejnowski and Rosenberg, 1987) and attaching to each instance the classification appropriate for the (sub)task under investigation. Table 1 displays example instances derived from the sample word booking. With this method, an instance base of 675,745 instances is built for each (sub)task.</Paragraph>
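As an illustration of the windowing step, the sketch below (Python; not the authors' code) derives fixed-size instances for a single word. The window width of seven letters (three letters of left context, the focus letter, three letters of right context) and the underscore padding are assumptions, based on the seven features mentioned in Subsection 2.3 and the example instances in Table 1:

def window_instances(word, classes, width=7, pad="_"):
    # Build one fixed-size instance per letter: the letters around the
    # focus position plus the task-specific classification of that letter.
    half = width // 2
    padded = pad * half + word + pad * half
    return [(tuple(padded[i:i + width]), label)
            for i, label in enumerate(classes)]

# Morphological segmentation (M) for 'booking' = 'book' + 'ing':
# the first letters of 'book' and of 'ing' get class '1', all others '0'.
instances = window_instances("booking", ["1", "0", "0", "0", "1", "0", "0"])

Running this over every word in the lexicon, with the classification field of the (sub)task at hand, yields instance bases of the kind described above.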
<Paragraph position="2"> In the table, six classification fields are shown, one of which is a composite field; each field refers to one of the (sub)tasks investigated here. M stands for morphological decomposition: determine whether a letter is the initial letter of a morpheme (class '1') or not (class '0'). A is graphemic parsing: determine whether a letter is the first or only letter of a grapheme (class '1') or not (class '0'); a grapheme is a cluster of one or more letters mapping to a single phoneme. G is grapheme-phoneme conversion: determine the phonemic mapping of the middle letter.</Paragraph> <Paragraph position="3"> Y is syllabification: determine whether the middle phoneme is syllable-initial. S is stress assignment: determine the stress level of the middle phoneme.</Paragraph> <Paragraph position="4"> Finally, GS is integrated grapheme-phoneme conversion and stress assignment. The example instances in Table 1 show that each (sub)task is phrased as a classification task on the basis of windows of letters or phonemes (the stress assignment task S is investigated with both letters and phonemes as input).</Paragraph> <Paragraph position="5"> Each window represents a snapshot of a part of a word or phonemic transcription, and is labelled by the classification associated with the middle letter of the window. For example, the first letter-window instance __book is linked with label '1' for the morphological segmentation task (M), since the middle letter b is the first letter of the morpheme book; the other instance labelled with morphological-segmentation class '1' is the instance with i in the middle, since i is the first letter of the (inflectional) morpheme ing. Classifications may either be binary ('1' or '0') for the segmentation tasks (M, A, and Y), or have more values, such as 62 possible phonemes (G), three stress markers (primary, secondary, or no stress; S), or a combination of these classes (159 combined phonemes and stress markers, GS).</Paragraph> </Section> <Section position="2" start_page="0" end_page="7" type="sub_section"> <SectionTitle> 2.3 Methodology </SectionTitle> <Paragraph position="0"> Our empirical study focuses on measuring the ability of the IGTREE learning algorithm to use the knowledge accumulated during learning for the classification of new, unseen instances of the same (sub)task, i.e., we measure its generalisation accuracy. Weiss and Kulikowski (1991) describe n-fold cross-validation (cv) as a methodology for measuring generalisation accuracy.</Paragraph> <Paragraph position="2"> [Table 1: example instances derived from the sample word booking, with left context, focus letter, right context, and classification fields for the (sub)tasks investigated, viz. M, A, G, Y, S, and GS; the table itself is not reproduced here.]</Paragraph> <Paragraph position="6"> For our experiments with IGTREE, we set up 10-fold cv experiments consisting of five steps. (i) On the basis of a data set, n partitionings of the data set are generated, each into one training set containing ((n-1)/n)th of the data set and one test set containing (1/n)th of the data set. For each partitioning, the following three steps are repeated: (ii) information-gain values for all (seven) features are computed on the basis of the training set (cf. Subsection 2.1); (iii) IGTREE is applied to the training set, yielding an induced decision tree (cf. Subsection 2.1); (iv) the tree is tested by letting it classify all instances in the test set, which results in a percentage of incorrectly classified test instances. (v) When each of the n folds has produced an error percentage on test material, a mean generalisation error of the learned model is computed. Weiss and Kulikowski (1991) argue that by using n-fold cv, preferably with n ≥ 10, one can obtain a good estimate of the true generalisation error of a learning algorithm given an instance base.</Paragraph> <Paragraph position="7"> Mean results can be employed further in significance tests. In our experiments, n = 10, and one-tailed t-tests are performed.</Paragraph> </Section> </Section> </Paper>
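The five-step 10-fold cv procedure of Subsection 2.3 can be sketched as follows (Python; `train_fn` and `classify_fn` are hypothetical placeholders for IGTREE induction, including the information-gain computation, and for tree-based classification; the shuffled fold construction is an assumption, not the authors' exact partitioning):

import random

def cross_validate(instances, labels, train_fn, classify_fn, n=10, seed=1):
    data = list(zip(instances, labels))
    random.Random(seed).shuffle(data)
    folds = [data[i::n] for i in range(n)]               # (i) n partitionings
    errors = []
    for i in range(n):
        test = folds[i]                                   # (1/n)th of the data
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        train_X, train_y = zip(*train)
        model = train_fn(train_X, train_y)                # (ii)-(iii) induce the tree
        wrong = sum(1 for x, y in test if classify_fn(model, x) != y)
        errors.append(100.0 * wrong / len(test))          # (iv) error on the test fold
    return sum(errors) / n                                # (v) mean generalisation error

The per-fold error percentages collected in step (iv) are also the values over which the one-tailed t-tests mentioned above are computed.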