<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1223"> <Title>Modularity in Inductively-Learned Word Pronunciation Systems *</Title> <Section position="4" start_page="7" end_page="7" type="metho"> <SectionTitle> 3 Three word-pronunciation architectures </SectionTitle> <Paragraph position="0"> Our experiments are grouped in three series, each involving the application of IGTREE to a particular word-pronunciation system. The architectures of these systems are displayed in Figure 1. In the following subsections, each system is introduced, an outline is given of the experiments performed on the system, and the results are briefly discussed.</Paragraph> <Section position="1" start_page="7" end_page="7" type="sub_section"> <SectionTitle> 3.1 M-A-G-Y-S </SectionTitle> <Paragraph position="0"> The architecture of the M-A-G-Y-S system is inspired by SOUND1 (Hunnicutt, 1976; Hunnicutt, 1980), the word-pronunciation subsystem of the MITALK text-to-speech system (Allen, Hunnicutt, and Klatt, 1987). When the MITALK system is faced with an unknown word, SOUND1 produces on the basis of that word a phonemic transcription with stress markers (Allen, Hunnicutt, and Klatt, 1987). van den Bosch, Weijters and Daelemans 188 This word-pronunciation process is divided into the following five processing components: 1. morphological segmentation, which we implement as the module referred to as M; 2. graphemic parsing, module A; 3. grapheme-phoneme conversion, module G; 4. syllabification, module Y; 5. stress assignment, module S.</Paragraph> <Paragraph position="1"> The architecture of the M-A-G-Y-S system is visualised in the left of Figure 1. It can be seen that the representations include direct output from previous modules, as well as representations from earlier modules.
For example, the S module takes as input the syllable boundaries generated by the Y module, but also the phoneme string generated by the G module, and the morpheme boundaries generated by the M module.</Paragraph> <Paragraph position="2"> M-A-G-Y-S is put to the test by applying IGTREE in 10-fold cv experiments to the five subtasks, connecting the modules after training, and measuring the combined score on correctly classified phonemes and stress markers, which is the desired output of the word-pronunciation system. An individual module can be trained on data from CELEX directly as input, but this method ignores the fact that modules in a working modular system can be expected to generate some amount of error. When one module generates an error, the subsequent module receives this error as input, assumes it is correct, and may generate another error. In a five-module system, this type of cascading error may seriously hamper generalisation accuracy. To counteract this potential disadvantage, modules can also be trained on the output of previous modules. Modules cannot be expected to learn to repair completely random, irregular errors, but whenever a previous module makes consistent errors on a specific input, this may be recognised by the subsequent module. Having detected a consistent error, the subsequent module is then able to repair the error and continue with successful processing.</Paragraph> <Paragraph position="4"> (Figure 2 caption:) Incorrectly classified test instances by IGTREE on the five subtasks M, A, G, Y, and S, and on phonemes and stress markers jointly (PS).</Paragraph> <Paragraph position="5"> Earlier experiments performed on the tasks investigated in this paper have shown that classification errors on test instances are indeed consistently and significantly decreased when modules are trained on the output of previous modules rather than on data extracted directly from CELEX (Van den Bosch, 1997).
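This cascade-training regime can be sketched in a few lines. The `MemorizingModule` learner below is a hypothetical stand-in for IGTREE, and all names are illustrative; the point is only that module B is fitted on module A's own (error-prone) predictions rather than on gold-standard input:

```python
# Sketch of training a two-module cascade "adaptively": module B is trained
# on the *predicted* output of module A, so B can learn to repair A's
# consistent errors. MemorizingModule is a toy learner, not IGTREE.
from collections import Counter, defaultdict

class MemorizingModule:
    """Minimal learner: memorise input -> most frequent class."""
    def fit(self, inputs, targets):
        table = defaultdict(Counter)
        for x, y in zip(inputs, targets):
            table[x][y] += 1
        self.table = {x: c.most_common(1)[0][0] for x, c in table.items()}
        self.default = Counter(targets).most_common(1)[0][0]
        return self

    def predict(self, inputs):
        return [self.table.get(x, self.default) for x in inputs]

def train_cascade_adaptive(xs, gold_a, gold_b):
    """Train A on gold data, then train B on A's actual output."""
    mod_a = MemorizingModule().fit(xs, gold_a)
    a_out = mod_a.predict(xs)                      # may contain consistent errors
    mod_b = MemorizingModule().fit(a_out, gold_b)  # B adapts to A's errors
    return mod_a, mod_b
```

Because B is trained on what A actually produces, a consistent error in A's output becomes just another input pattern that B can map to the correct final class.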
Therefore, we train the M-A-G-Y-S system, with IGTREE, by training the modules of the system on the output of preceding modules.</Paragraph> <Paragraph position="6"> We henceforth refer to this type of training as adaptive training, referring to the adaptation of a module to the errors of a preceding module.</Paragraph> <Paragraph position="7"> Figure 2 displays the results obtained with IGTREE under the adaptive variant of M-A-G-Y-S. The figure shows all percentages (displayed above the bars; error bars on top of the main bars indicate standard deviations) of incorrectly classified instances for each of the five subtasks, and a joint error on incorrectly classified phonemes with stress markers, which is the desired output of the system. The latter classification error, labelled PS in Figure 2, regards classification of an instance as incorrect if either or both of the phoneme and stress marker is incorrect. The figure shows that the joint error on phonemes and stress markers is 10.59% of test instances, on average. Computed in terms of transcribed words, only 35.89% of all test words are converted to stressed phonemic transcriptions flawlessly. The joint error is lower than the sum of the errors on the G subtask and the S subtask, 12.95%, suggesting that about 20% of the incorrectly classified test instances involve an incorrect classification of both the phoneme and the stress marker.</Paragraph> </Section> <Section position="2" start_page="7" end_page="7" type="sub_section"> <SectionTitle> 3.2 M-G-S </SectionTitle> <Paragraph position="0"> The subtasks of graphemic parsing (A) and grapheme-phoneme conversion (G) are clearly related.
While A attempts to parse a letter string into graphemes, G converts graphemes to phonemes.</Paragraph> <Paragraph position="1"> Although they are performed independently in M-A-G-Y-S, they can be integrated easily when the class-'1' instances of the A task are mapped to their associated phoneme rather than '1', and the class-'0' instances are mapped to a phonemic null, /-/, rather than '0' (cf. Table 1). This task integration is also used in the NETTALK model (Sejnowski and Rosenberg, 1987). A similar argument can be made for integrating the syllabification and stress assignment modules into a single stress-assignment module. Stress markers, in our definition of the stress-assignment subtask, are placed solely on the positions which are also marked as syllable boundaries (i.e., on syllable-initial phonemes). Removing the syllabification subtask makes finding those syllable boundaries which are relevant for stress assignment an integrated part of stress assignment. Syllabification (Y) and stress assignment (S) can thus be integrated in a single stress-assignment module S. (Figure 3 caption:) Percentage of incorrectly classified test instances by IGTREE on the three subtasks M, G, and S, and on phonemes and stress markers jointly (PS). (Figure 4 caption:) Percentage of incorrectly classified test instances by IGTREE on the GS task, as well as on phonemes and stress assignments computed separately.</Paragraph> <Paragraph position="2"> When both pairs of modules are reduced to single modules, the three-module system M-G-S is obtained. Figure 1 displays the architecture of the M-G-S system in the middle.
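The integration of A into G amounts to a relabelling of the A task's training material. A minimal sketch, in which the function name and the example segmentation of booking (b+oo+k+i+ng) are our illustrative assumptions, while the '1'/'0' classes and the phonemic null /-/ follow the conventions above:

```python
# Sketch of folding graphemic parsing (A) into grapheme-phoneme conversion
# (G): each letter classified '1' (grapheme-initial) gets its grapheme's
# phoneme as class label; each letter classified '0' gets the phonemic null.
PHONEMIC_NULL = "-"

def integrate_a_into_g(a_labels, phonemes):
    """a_labels: one '1'/'0' label per letter; phonemes: one phoneme per
    grapheme, in order. Returns one class label per letter."""
    out, it = [], iter(phonemes)
    for a in a_labels:
        out.append(next(it) if a == "1" else PHONEMIC_NULL)
    return out
```

For booking, with illustrative phoneme symbols, the seven letter positions b-o-o-k-i-n-g receive the labels b, u, -, k, I, N, -: the A-task structure survives as the placement of phonemic nulls.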
Experiments on this system are performed analogously to the experiments with the M-A-G-Y-S system; Figure 3 displays the average percentages of generalisation errors generated by IGTREE on the three subtasks and phonemes and stress markers jointly (the error bar labelled PS).</Paragraph> <Paragraph position="3"> Removing graphemic parsing (A) and syllabification (Y) as explicit in-between modules yields better accuracies on the grapheme-phoneme conversion (G) and stress assignment (S) subtasks than in the M-A-G-Y-S system. Both differences are significant; for G, (t(19) = 43.70, p < 0.001), and for S (t(19) = 32.00, p < 0.001). The joint accuracy on phonemes and stress markers is also significantly better in the M-G-S system than in the M-A-G-Y-S system (t(19) = 37.50, p < 0.001). Different from M-A-G-Y-S, the sum of the errors on phonemes and stress markers, 8.09%, is hardly more than the joint error on PSs, 7.86%: there is hardly an overlap in instances with incorrectly classified phonemes and stress markers. The percentage of flawlessly processed test words is 44.89%, which is markedly better than the 35.89% of M-A-G-Y-S.</Paragraph> </Section> <Section position="3" start_page="7" end_page="7" type="sub_section"> <SectionTitle> 3.3 GS </SectionTitle> <Paragraph position="0"> GS is a single-module system in which only one classification task is performed in one pass.
The GS task integrates grapheme-phoneme conversion and stress assignment: to classify letter windows as corresponding to a phoneme with a stress marker (PS).</Paragraph> <Paragraph position="1"> In the GS system, a PS can be either (i) a phoneme or a phonemic null with stress marker '0', or (ii) a phoneme with stress marker '1' (i.e., the first phoneme of a syllable receiving primary stress), or (iii) a phoneme with stress marker '2' (i.e., the first phoneme of a syllable receiving secondary stress).</Paragraph> <Paragraph position="2"> The simple architecture of GS, which does not reflect any linguistic expert knowledge about decompositions of the word-pronunciation task, is visualised as the rightmost architecture in Figure 1. It only assumes the presence of letters at the input, and phonemes and stress markers at the output. Table 1 displays example instance PS classifications generated on the basis of the word booking. The phonemes with stress markers (PSs) are denoted by composite labels. For example, the first instance in Table 1, __book, maps to class label /b/1, denoting a /b/ which is the first phoneme of a syllable receiving primary stress.</Paragraph> <Paragraph position="3"> The experiments with GS were performed with the same data set of word pronunciations as used with M-A-G-Y-S and M-G-S. The number of PS classes (i.e., all possible combinations of phonemes and stress markers) occurring in this data base of tasks is 159.</Paragraph> <Paragraph position="4"> Figure 4 displays the generalisation errors in terms of incorrectly classified test instances. The figure also displays the percentage of classification errors made on phonemes and stress markers computed separately.</Paragraph> <Paragraph position="5"> IGTREE yields significantly better generalisation accuracy on phonemes and stress markers, both jointly and independently.
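The fixed-width letter-window encoding of Table 1 above can be sketched as follows. The window width (7), the padding symbol '_', and the flat PS label strings such as 'b/1' are illustrative assumptions; only the windowing scheme itself follows the text:

```python
# Sketch of fixed-width letter-window instance generation for the GS task,
# after the "booking" example: one instance per letter, labelled with that
# letter's composite phoneme-with-stress (PS) class.
def make_instances(word, ps_labels, width=7):
    """Pair each letter's surrounding window with its PS class label."""
    half = width // 2
    padded = "_" * half + word + "_" * half
    return [(padded[i:i + width], ps) for i, ps in enumerate(ps_labels)]
```

For booking this yields seven instances, the first being the window ___book with class label b/1 (the word-initial /b/ of a primarily stressed syllable).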
In terms of PSs, the accuracy on GS is significantly better than that of M-G-S.</Paragraph> <Paragraph position="7"> The accuracy on flawlessly transcribed test words, 59.38%, is also considerably better than that of the modular systems.</Paragraph> <Paragraph position="9"> (Figure caption:) Numbers of nodes needed for the trees of the subtasks specified by their labels.</Paragraph> <Paragraph position="10"> Compared to accuracies reported in related research on learning English word pronunciation (Sejnowski and Rosenberg, 1987; Wolpert, 1990; Dietterich, Hild, and Bakiri, 1995; Yvon, 1996) and on general quality demands of text-to-speech applications, an error of 3.79% on phonemes and 30.62% on words can be considered adequate, though still not excellent (Yvon, 1996; Van den Bosch, 1997).</Paragraph> </Section> </Section> <Section position="5" start_page="7" end_page="7" type="metho"> <SectionTitle> 4 Comparisons of M-A-G-Y-S, M-G-S, and GS </SectionTitle> <Paragraph position="0"> We have given significance results showing that, under our experimental conditions and using IGTREE as the learning algorithm, optimal generalisation accuracy on word pronunciation is obtained with GS, the system that does not incorporate any explicit decomposition of the word-pronunciation task. In this section we perform two additional comparisons of the three systems. First, we compare the sizes of the trees constructed by IGTREE on the three systems; second, we analyse the positive and negative effects of learning the subtasks in their specific systems' context.</Paragraph> <Paragraph position="1"> Tree sizes An advantage of using fewer or no decompositions, in terms of computational efficiency, is the total amount of memory needed for storing the trees.
Although the application of IGTREE generally results in small trees that fit well inside small computer memories (for our modular (sub)tasks, tree sizes vary from 64,821 nodes for the M-modules to 153,678 nodes for the G-module in M-A-G-Y-S, occupying 453,747 to 1,075,746 bytes of memory), keeping five trees in memory would not be a desirable feature for a system optimised on memory use. Figure 5 displays the summed number of nodes for each of the four IGTREE-trained systems under the adaptive variant.</Paragraph> <Paragraph position="2"> Each bar is divided into compartments indicating the number of nodes in the trees generated for each of the modular subtasks.</Paragraph> <Paragraph position="3"> Figure 5 shows that the model with the best generalisation accuracy, GS, is also the model taking up the smallest number of nodes. The number of nodes in the single GS tree, 111,062, is not only smaller than the sum of the numbers of nodes needed for the G and S modules in the M-G-S system (204,345 nodes); it is even smaller than the single tree constructed for the G subtask in the M-G-S system (125,182 nodes).</Paragraph> <Paragraph position="4"> A minor difference in tree size can be seen between the trees built for the G-module in the M-G-S system, 125,182 nodes, and the G-module in the M-A-G-Y-S system, 153,678 nodes. A similar difference can be seen for the S-modules, taking up 79,163 nodes in the M-G-S system, and 96,998 nodes in the M-A-G-Y-S system. The size of the trees built for modules appears to increase when the module is preceded by more modules, which suggests that IGTREE is faced with a more complex task, including potentially erroneous output from more modules, when building a tree for a module further down a sequence of modules.
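As a rough illustration of why such trees stay compact, the following is a much-simplified sketch of IGTREE-style induction, not the authors' implementation: features are tested in one fixed order (passed in explicitly here, whereas the real algorithm ranks them by information gain and additionally prunes children that agree with their parent's default), and every node stores a default class so that retrieval can back off when an unseen feature value is encountered:

```python
# Simplified IGTREE-style oblivious decision tree: one feature per level,
# a default (most frequent) class at every node, and default back-off
# at retrieval time for unseen feature values.
from collections import Counter

def build_igtree(instances, classes, feature_order, depth=0):
    node = {"default": Counter(classes).most_common(1)[0][0], "arcs": {}}
    if len(set(classes)) == 1 or depth == len(feature_order):
        return node  # homogeneous subset or features exhausted: leaf
    f = feature_order[depth]
    groups = {}
    for inst, cls in zip(instances, classes):
        groups.setdefault(inst[f], ([], []))
        groups[inst[f]][0].append(inst)
        groups[inst[f]][1].append(cls)
    for value, (sub_i, sub_c) in groups.items():
        node["arcs"][value] = build_igtree(sub_i, sub_c, feature_order, depth + 1)
    return node

def classify(tree, instance, feature_order, depth=0):
    if not tree["arcs"] or depth == len(feature_order):
        return tree["default"]
    child = tree["arcs"].get(instance[feature_order[depth]])
    # back off to this node's default class on an unseen value
    return classify(child, instance, feature_order, depth + 1) if child else tree["default"]
```

Because homogeneous subsets terminate expansion early and unseen values fall back on stored defaults, the tree needs far fewer nodes than a full instance table.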
Utility effects The particular sequence of the five modules as in the M-A-G-Y-S system reflects a number of assumptions on the utility of using output from one subtask as input to another subtask. Morphological knowledge is useful as input to grapheme-phoneme conversion (e.g., to avoid pronouncing ph in loophole as /f/, or red in barred as /red/); graphemic parsing is useful as input to grapheme-phoneme conversion (e.g., to avoid the pronunciation of gh in through); etc.</Paragraph> <Paragraph position="5"> Thus, feeding the output of a module A into a subsequent module B implies that one expects to perform better on module B with A's input than without.</Paragraph> <Paragraph position="6"> The accuracy results obtained with the modules of the M-A-G-Y-S, M-G-S, and GS systems can serve as tests for their respective underlying utility assumptions, when they are compared to the accuracies obtained with their subtasks learned in isolation.</Paragraph> <Paragraph position="7"> To measure the utility effects of including the outputs of modules as inputs to other modules, we performed the following experiments: 1. We applied IGTREE in 10-fold cv experiments to each of the five subtasks M, A, G, Y, and S, only using letters (with the M, A, G, and S subtasks) or phonemes (with the Y and the S subtasks) as input, and their respective classification as output (cf. Table 1). The input is directly extracted from CELEX. These experiments provide the baseline score for each subtask, and are referred to as the isolated experiments.</Paragraph> <Paragraph position="8"> 2. We applied IGTREE in 10-fold cv experiments to all subtasks of the M-A-G-Y-S, M-G-S, and GS systems, training and testing on input extracted directly from CELEX. The results from these experiments reflect what would be the accuracy of the tasks (M, A, G, Y, and S) as modules or partial tasks in the M-A-G-Y-S, M-G-S, and GS systems.
These results reflect the accuracy of the modular systems when each module would perform perfectly flawlessly. We refer to these experiments as the ideal experiments. (Table 2 caption:) For each module, in each system, the utility of training the module with ideal data (middle) and actual, modular data under the adaptive variant (right) is compared against the accuracy obtained with learning the subtasks in isolation (left). Accuracies are given as percentages of incorrectly classified test instances. With the results of these experiments we measure, for each subtask in each of the three systems, the utility effect of including the output of preceding modules, for the ideal case (with input straight from CELEX) as well as for the actual case (with input from preceding modules). A utility effect is the difference between IGTREE's generalisation error on the subtask in modular context (either ideal or actual) and its accuracy on the same subtask in isolation.</Paragraph> <Paragraph position="9"> Table 2 lists all computed utility effects.</Paragraph> <Paragraph position="10"> For the case of the M-A-G-Y-S system, it can be seen that the only large utility effect, even in the ideal case, could be obtained with the stress-assignment subtask. In the isolated case, the input consists of phonemes; in the M-A-G-Y-S system, the input contains morpheme boundaries, phonemes, and syllable boundaries. The ideal positive effect on the S module of 5.29% fewer errors turns out to be a positive effect of 2.68% in the actual system. The latter positive effect is outweighed by a rather large negative utility effect on the grapheme-phoneme conversion task of -3.95%.
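The utility-effect bookkeeping itself is a simple subtraction. Under our reading of the definition (positive values meaning the modular context reduces errors), a sketch using the stress-assignment figures quoted in this section, 4.71% in isolation with letters versus 3.97% in the GS context:

```python
# Sketch of the utility-effect computation: how much a subtask's error rate
# changes when it is learned in modular context rather than in isolation.
def utility_effect(error_isolated_pct, error_in_context_pct):
    """Both arguments are percentages of incorrectly classified test
    instances; a positive result means the modular context helps."""
    return error_isolated_pct - error_in_context_pct
```

With the stress-assignment figures above this gives the 0.74% improvement reported for GS; a negative result, as for grapheme-phoneme conversion in the actual M-A-G-Y-S system, means the modular context hurts.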
Neither the A nor the Y subtask profits from morphological boundaries as input, even in the ideal case; in the actual M-A-G-Y-S system, the utility effect of including morphological boundaries from M and phonemes from G in the syllabification module Y is markedly negative: -2.16%.</Paragraph> <Paragraph position="11"> In the M-G-S system, the utility effects are generally less negative than in the M-A-G-Y-S system. There is a small utility effect in the ideal case with including morphological boundaries as input to grapheme-phoneme conversion; in the actual M-G-S system, the utility effect is negative (-0.27%). The stress-assignment module benefits from including morphological boundaries and phonemes in its input, both in the ideal case and in the actual M-G-S system.</Paragraph> <Paragraph position="12"> The GS system does not contain separate modules, but it is possible to compare the errors made on phonemes and stress assignments separately to the results obtained on the subtasks learned in isolation. Grapheme-phoneme conversion is learned with almost the same accuracy when learned in isolation as when learned as a partial task of the GS task. Learning the grapheme-phoneme task, IGTREE is neither helped nor hampered significantly by learning stress assignment simultaneously. There is a positive utility effect in learning stress assignment, however. When stress assignment is learned in isolation with letters as input, IGTREE classifies 4.71% of test instances incorrectly, on average. (This is a lower error than obtained with learning stress assignment on the basis of phonemes, indicating that stress assignment should take letters as input rather than phonemes.)
When the stress-assignment task is learned along with grapheme-phoneme conversion in the GS system, a marked improvement is obtained: 0.74% fewer classification errors are made.</Paragraph> <Paragraph position="13"> Summarising, comparing the accuracies on modular subtasks to the accuracies on their isolated counterpart tasks shows only a few positive utility effects in the actual systems, all obtained with stress assignment. The largest utility effect is found on the stress-assignment subtask of M-G-S. However, this positive utility effect does not lead to optimal accuracy on the S subtask; in the GS system, stress assignment is performed with letters as input, yielding the best accuracy on stress assignment in our investigations, viz. 3.97% incorrectly classified test instances.</Paragraph> </Section> <Section position="6" start_page="7" end_page="7" type="metho"> <SectionTitle> 5 Related work </SectionTitle> <Paragraph position="0"> The classical NETTALK paper by Sejnowski and Rosenberg (1987) can be seen as a primary source of inspiration for the present study; it has been so for a considerable amount of related work. Although it has been criticised for being vague and presumptuous and for presenting generalisation accuracies that can be improved easily with other learning methods (Stanfill and Waltz, 1986; Wolpert, 1990; Weijters, 1991; Yvon, 1996), it was the first paper to investigate grapheme-phoneme conversion as an interesting application for general-purpose learning algorithms. However, few reports have been made on the joint accuracies on stress markers and phonemes in work on the NETTALK data. To our knowledge, only Shavlik, Mooney, and Towell (1991) and Dietterich, Hild, and Bakiri (1995) provide such reports.
In terms of incorrectly processed test instances, Shavlik, Mooney, and Towell (1991) obtain better performance with the back-propagation algorithm trained on distributed output (27.7% errors) than with the ID3 (Quinlan, 1986) decision-tree algorithm (34.7% errors), both trained and tested on small non-overlapping sets of about 1,000 instances. Dietterich, Hild, and Bakiri (1995) report similar errors on similarly-sized training and test sets (29.1% for BP and 34.4% for ID3); with a larger training set of 19,003 words from the NETTALK data and an input encoding fifteen letters, previous phoneme and stress classifications, some domain-specific features, and error-correcting output codes, ID3 generates 8.6% errors on test instances (Dietterich, Hild, and Bakiri, 1995), which does not compare favourably to the results obtained with the NETTALK-like GS task (a valid comparison cannot be made; the data employed in the current study contain considerably more instances).</Paragraph> <Paragraph position="1"> An interesting counterargument against the representation of the word-pronunciation task using fixed-size windows, put forward by Yvon (1996), is that an inductive-learning approach to grapheme-phoneme conversion should be based on associating variable-length chunks of letters with variable-length chunks of phonemes. The chunk-based approach is shown to be applicable, with adequate accuracy, to several corpora, including corpora of French word pronunciations and, as mentioned above, the NETTALK data (Yvon, 1996). Experiments on other (larger) corpora, comparing both approaches, would be needed to analyse their differences empirically.</Paragraph> </Section> </Paper>