File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/93/e93-1007_intro.xml
Size: 15,992 bytes
Last Modified: 2025-10-06 14:05:21
<?xml version="1.0" standalone="yes"?> <Paper uid="E93-1007"> <Title>Data-Oriented Methods for Grapheme-to-Phoneme Conversion</Title> <Section position="3" start_page="45" end_page="48" type="intro"> <SectionTitle> 2 Data-Oriented Text-to-speech Conversion </SectionTitle> <Paragraph position="0"> The algorithms we have applied in our research are similarity-based and data-oriented. The phonemisation problem is interpreted as a classification task. Given a target grapheme and its context, the corresponding phoneme should be predicted. The algorithms we used to learn this task are supervised and data-intensive in the sense that a large number of examples is provided of input representations with their correct category (in this case a phonetic transcription). Within asupervised, similarity-based approach, the degree in which abstractions are extracted from the examples may be different, as may be the time when abstractions are created: during training in aggressive abstraction, during performance in lazy learning. For grapheme-to-phoneme conversion, we claim a data-intensive, lazy learning approach is appropriate to capture the intricate interactions between regularities, subregularities, and exceptions that characterise the domain.</Paragraph> <Section position="1" start_page="45" end_page="46" type="sub_section"> <SectionTitle> 2.1 Training and Test Set Encoding </SectionTitle> <Paragraph position="0"> Training and test set were randomly selected from a Dutch text-to-speech vocabulary data base. From the 70,000 word dataset, 20,000 were randomly selected and randomly divided into 18,500 training words and 1,500 test words. In both sets, each graphemic word is accompanied by its pronunciation in the form of a string of phonemes. In cases where phonemes correspond to grapheme clusters (i.e. there is an alignment problem of grapheme strings with their corresponding phoneme strings), as is the case in, e.g., <schoenen> (shoes)/sXuno/, one grapheme of that cluster is algorithmically mapped to the phoneme, and the remaining graphemes are mapped to phonetic nulls, represented by hyphens.</Paragraph> <Paragraph position="1"> In the example of <schoenen>, this phonetic null insertion results in the following alignment:</Paragraph> <Paragraph position="3"> To provide a learning system with sufficient information about the phonemisation task, context information must be added. In the models described below, this is done by using graphemic windows (compare \[Sejnowski and l~senberg, 1987\]), i.e. fixed-length parts of words in which one grapheme is mapped to a phoneme; the other graphemes serve as context. For example, using a window with one left context grapheme and two right context graphemes (from here on written as '1-1-2'), the application of this window on the word < boek> (book), pronounced as/buk/, would result in the four pattern-category pairs of Table 1.</Paragraph> <Paragraph position="4"> coding on the word < boek > (book). Underscores represent spaces, a hyphen represents a phonetic null. This approach implies that dependencies stretching longer than the length of the graphemic window cannot be learned.</Paragraph> </Section> <Section position="2" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 2.2 Instance-Based Learning </SectionTitle> <Paragraph position="0"> As an example of a lazy learning approach, we experimented with Instance-Based Learning (IBL, \[Aha et al., 1991\]). IBL is a framework and methodology for incremental supervised machine learning. 
<Section position="2" start_page="46" end_page="46" type="sub_section"> <SectionTitle> 2.2 Instance-Based Learning </SectionTitle>
<Paragraph position="0"> As an example of a lazy learning approach, we experimented with Instance-Based Learning (IBL, \[Aha et al., 1991\]). IBL is a framework and methodology for incremental supervised machine learning. The distinguishing feature of IBL is that no explicit abstractions are constructed on the basis of the training examples during the training phase. A selection of the training items themselves is used to classify new inputs. IBL shares with Memory-Based Reasoning (MBR, \[Stanfill and Waltz, 1986\]) and Case-Based Reasoning (CBR, \[Riesbeck and Schank, 1989\]) the hypothesis that much of intelligent behaviour is based on the immediate use of stored episodes of earlier experience rather than on the use of explicitly constructed abstractions extracted from this experience (e.g. in the form of rules or decision trees). In the present context of learning linguistic mappings, the hypothesis would be that much of language behaviour is based on this type of memory-based processing rather than on rule-based processing. In linguistics, a similar emphasis on analogy to stored examples instead of explicit but inaccessible rules is present in the work of, among others, \[Derwing and Skousen, 1989\]. IBL is inspired to some extent by psychological research on exemplar-based categorisation (as opposed to classical and probabilistic categorisation, \[Smith and Medin, 1981\]). Finally, as far as algorithms are concerned, IBL finds its inspiration in statistical pattern recognition, especially the rich research tradition on the nearest-neighbour decision rule (see e.g. \[Devijver and Kittler, 1982\] for an overview).</Paragraph>
<Paragraph position="1"> The main data structure in our version of IBL is the exemplar, a memory structure representing the following information about each pattern: (i) its distribution over the different categories (training patterns may be ambiguous between different categories, so the memory structure should keep track of how many times each category was assigned to a particular pattern); (ii) its category, which is simply the category with the highest frequency in the distribution of a pattern, or a random selection to break a tie; (iii) other bookkeeping information (performance data, frequency of the pattern in the training set, etc.). Training. For each training pattern, it is checked whether an exemplar for it is already present in memory. If this is the case, the frequency of its category is incremented in the distribution field of the corresponding memory structure. If the new training item has not yet been stored in memory, a new memory structure is created. In learning linguistic mappings (a noisy domain), learning in IBL is often helped by forgetting poorly performing or unrepresentative training items. In this research a simple technique was used to prune memory: each new training item is first classified using the memory structures already present. If it is categorised correctly, it is skipped.</Paragraph>
<Paragraph position="2"> We have also experimented with more elaborate storage-saving techniques (based on prototypicality and performance of training patterns), but the results are preliminary and will not be reported here.</Paragraph>
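A minimal sketch of such an exemplar memory and the training regime just described (illustrative class and method names, not the authors' implementation; a plain feature-overlap similarity stands in for the weighted matching discussed below):

```python
from collections import Counter, defaultdict

class ExemplarMemory:
    """IBL-style exemplar memory: each stored pattern keeps a frequency
    distribution over the categories it was observed with."""

    def __init__(self):
        self.distributions = defaultdict(Counter)   # pattern -> Counter of categories

    def category(self, pattern):
        # most frequent category for a stored pattern (ties broken arbitrarily
        # here; the paper breaks ties randomly)
        return self.distributions[pattern].most_common(1)[0][0]

    def train(self, pattern, category):
        # pruning: a new item that the current memory already classifies
        # correctly is skipped
        if self.distributions and self.classify(pattern) == category:
            return
        self.distributions[pattern][category] += 1   # creates the exemplar if needed

    def classify(self, pattern, weights=None):
        if pattern in self.distributions:            # exact match: use its distribution
            return self.category(pattern)
        w = weights or [1.0] * len(pattern)          # equal weights unless IG weights given
        # nearest neighbour: stored pattern with the largest (weighted) feature overlap
        best = max(self.distributions,
                   key=lambda s: sum(wi for wi, a, b in zip(w, s, pattern) if a == b))
        return self.category(best)
```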
<Paragraph position="3"> Testing. If the test pattern is present in memory, the category with the highest frequency associated with it is used. If it is not in memory, all memory items are sorted according to the similarity of their pattern to the test pattern. The (most frequent) category of the highest-ranking exemplar is then predicted as the category of the test pattern. When using a Euclidean distance metric (geometrical distance between two patterns in pattern space), all features are interpreted as being equally important. But this is of course not necessarily the case. We extended the basic IBL algorithm proposed by \[Aha et al., 1991\] with a technique for assigning a different importance to each feature. Our approach to the problem of weighting the relative importance of features is based on the concept of Information Gain (IG), also used in learning inductive decision trees \[Quinlan, 1986\], and first introduced (as far as we know) in IBL in \[Daelemans and Van den Bosch, 1992\] in the context of a syllable segmentation task. The idea is to interpret the training set as an information source capable of generating a number of messages (the different categories) with a certain probability. For each feature, the information entropy of this source is compared to the average information entropy of the source when the value of that feature is known. The difference is the IG value for that feature. The (normalised) IG value is used as a weight for that feature during similarity matching. Figure 2 shows the pattern of information-gain values for the different positions in the 2-1-3 grapheme window. Unsurprisingly, the target grapheme is most informative, and context features become less informative the further they are removed from the target. We also found that right context is more informative than left context (compare \[Weijters, 1991\]).</Paragraph>
</Section>
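The IG weighting can be computed roughly as in the following sketch (the function name is ours, and normalising by the sum of the gains is our own assumption, since the paper does not specify the normalisation). The resulting weight vector can be passed to the classify method sketched above.

```python
import math
from collections import Counter, defaultdict

def information_gain(patterns, categories):
    """IG per feature position: entropy of the category distribution minus the
    average entropy that remains once the value at that position is known.
    `patterns` is a list of equal-length window strings, `categories` the
    corresponding phonemes."""
    def entropy(counter):
        total = sum(counter.values())
        return -sum((n / total) * math.log2(n / total) for n in counter.values())

    base = entropy(Counter(categories))
    gains = []
    for pos in range(len(patterns[0])):
        # group the categories by the value observed at this feature position
        by_value = defaultdict(Counter)
        for pattern, cat in zip(patterns, categories):
            by_value[pattern[pos]][cat] += 1
        remainder = sum(sum(c.values()) / len(patterns) * entropy(c)
                        for c in by_value.values())
        gains.append(base - remainder)
    total = sum(gains)
    return [g / total for g in gains]   # normalised, used as feature weights
```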
<Section position="3" start_page="46" end_page="48" type="sub_section"> <SectionTitle> 2.3 Table Lookup with Defaults </SectionTitle>
<Paragraph position="0"> Our table lookup model can be seen as a link between straightforward lexical lookup and similarity-based reasoning. Lexical lookup of word-pronunciation pairs has various disadvantages, an important one being that this approach only works for the words that are stored in the lexicon and not for new words.</Paragraph>
<Paragraph position="1"> Without the possibility of manipulating graphemic strings smaller than whole words, there is no way that lexical lookup can provide generalisations on the basis of which new words can be transliterated.</Paragraph>
<Paragraph position="2"> The table lookup model presented here takes as its training set a text-to-speech lexicon, but solves the problems of lacking generalisation power and efficiency by compressing it into a text-to-speech lookup table. The main strategy behind the model is to dynamically determine which left and right contexts are minimally sufficient to map a grapheme to the correct phoneme with absolute certainty (here, 'absolute certainty' only expresses the fact that the correspondence is unambiguous in the training set of the model).</Paragraph>
<Paragraph position="3"> The context needed to disambiguate a grapheme-to-phoneme mapping can be of very different width. Extreme examples in Dutch are, on the one hand, the c-cedille, present in a small number of loan words (e.g., <reçu>) and always pronounced as /s/ regardless of left or right context, and, on the other hand, the <e>, which can map to various phonemes (e.g., /@/, /E/, /e/) in various contexts. For example, the disambiguation of the pronunciation of the final <e> in words ending in <-ster> (either star or a female profession suffix) sometimes involves taking into account large left contexts, as in the examples <venster> (window) and <dienster> (servant), in which the final <e> is pronounced /@/, versus <morgenster> (morning star), in which the final <e> is pronounced /E/. To disambiguate between these three cases, it is necessary to go back five positions in these words to find the first grapheme which the words do not have in common.</Paragraph>
<Paragraph position="4"> Table Construction. The algorithm starts by searching for all unambiguous one-to-one grapheme-phoneme mappings, and storing these mappings (patterns) in the lookup table, more specifically in the 0-1-0 subtable. The few unambiguous 0-1-0 patterns in our training set include the <ç>-/s/ case mentioned earlier. The next step of the algorithm is to extend the width of the graphemic window by one character. We chose to start by extending the window on the right (i.e., a 0-1-1 window) because, as also reflected earlier in the Information Gain metric used in the IBL model, right context appears to contain slightly more valuable information than the equivalent left context (this asymmetry led us to a new, more generic and domain-independent conceptualisation and implementation of the table lookup method, in which context features are ordered according to their information gain and patterns are stored in a single trie instead of in separate tables). The algorithm then searches for all certain 0-1-1 patterns to store in the 0-1-1 subtable. Compression is achieved because extensions of unambiguous patterns in the 0-1-0 subtable do not have to be stored in the 0-1-1 subtable. This procedure of extending the window and storing all certain patterns that have not been stored earlier is then repeated (extending 0-1-1 to 1-1-1, then to 1-1-2, etc.), and stops when the whole training corpus is compressed in the lookup table and all grapheme-phoneme mappings in the corpus are supplied with sufficient left and right contexts. The model evaluated below is calculated up to the 5-1-5 window.</Paragraph>
<Paragraph position="5"> At that point, the lookup table covers 99.5% of all grapheme-phoneme mappings in the training set. As a measure of the amount of compression, in number of bytes, the size of the set of linked tables (including the default table discussed below) is 5.8% of the size of the part of the lexicon used as training set.</Paragraph>
<Paragraph position="6"> Figure 3 displays the magnitudes of the subtables.</Paragraph>
<Paragraph position="7"> It can clearly be seen that most ambiguity is resolved with relatively small contexts. The majority of the ambiguity in the training set is already resolved at the 2-1-2 subtable, after which further extension of the window width gradually decreases the number of stored patterns (i.e., resolved ambiguities).</Paragraph>
<Paragraph position="8"> Retrieval. The pronunciation of a word can be retrieved by taking each grapheme of that word separately and searching the lookup table for a matching graphemic pattern. First, the grapheme is looked up in the 0-1-0 subtable. If it does not match any graphemic pattern stored in that table, the single grapheme pattern is extended to a 0-1-1 pattern. This procedure is then repeated until a matching pattern with a minimal context is found, returning a 'certain' grapheme-phoneme mapping. After all graphemes have been processed this way, the phonemic mappings are concatenated to form the pronunciation of the word.</Paragraph>
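Table construction and retrieval as described above might be sketched as follows. The names (build_subtables, lookup) and the exact window-growing schedule are illustrative assumptions; the default tables used as a final fallback are described in the next paragraphs.

```python
from collections import defaultdict

def window_schedule(max_context=5):
    """Window widths in construction order: 0-1-0, 0-1-1, 1-1-1, 1-1-2, ... 5-1-5."""
    schedule, left, right = [(0, 0)], 0, 0
    while left < max_context or right < max_context:
        if right <= left:
            right += 1          # extend on the right first
        else:
            left += 1
        schedule.append((left, right))
    return schedule

def context_pattern(word, i, left, right, pad="_"):
    padded = pad * left + word + pad * right
    return padded[i:i + left + 1 + right]

def build_subtables(aligned_lexicon, max_context=5):
    """aligned_lexicon: list of (graphemes, phonemes) pairs, nulls as '-'."""
    subtables = {}              # (left, right) -> {pattern: phoneme}
    resolved = set()            # (word index, position) pairs already disambiguated
    for left, right in window_schedule(max_context):
        # collect the phonemes observed for each still-ambiguous pattern
        seen = defaultdict(set)
        for w, (word, phon) in enumerate(aligned_lexicon):
            for i in range(len(word)):
                if (w, i) not in resolved:
                    seen[context_pattern(word, i, left, right)].add(phon[i])
        # store only patterns that are unambiguous in the training set
        table = {p: phons.pop() for p, phons in seen.items() if len(phons) == 1}
        subtables[(left, right)] = table
        for w, (word, phon) in enumerate(aligned_lexicon):
            for i in range(len(word)):
                if context_pattern(word, i, left, right) in table:
                    resolved.add((w, i))
    return subtables

def lookup(word, subtables, max_context=5):
    """Retrieve a pronunciation grapheme by grapheme, growing the context."""
    phonemes = []
    for i in range(len(word)):
        for left, right in window_schedule(max_context):
            match = subtables[(left, right)].get(context_pattern(word, i, left, right))
            if match is not None:
                phonemes.append(match)      # 'certain' mapping with minimal context
                break
        else:
            phonemes.append(None)           # no certain mapping: use the default tables
    return "".join(p for p in phonemes if p and p != "-")
```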
<Paragraph position="9"> An example of retrieving the pronunciation of a word by table lookup is given in Table 2.</Paragraph>
<Paragraph position="10"> Table 2: Table lookup retrieval of the pronunciation of <aanbieding> (offer). Each row contains an unambiguous pattern with minimal context found by the lookup algorithm. Underscores represent spaces.</Paragraph>
<Paragraph position="11"> As this example illustrates, the contexts needed for disambiguating between output categories are generally small.</Paragraph>
<Paragraph position="12"> In the case of unseen test words containing grapheme patterns not present in the training set, the lookup algorithm will not be able to retrieve that specific mapping. This problem is handled in our model by adding to the lookup table a second table, which contains all graphemic patterns of a fixed window width (1-1-1) occurring in the training set, coupled with their most frequently occurring (default) phonemic mapping. Whenever lookup table retrieval fails and a match can be found between the test pattern and a 1-1-1 default pattern, this default table provides a 'best guess' which in many cases still turns out to be correct. To cover those cases where no match can be found between the test pattern and the 1-1-1 default patterns, a small third default table is added to the model, containing for each grapheme its most frequently occurring phonemic mapping regardless of context (0-1-0), returning a 'final guess'. It is important to see that generalisation in this approach arises from two different mechanisms: (i) the fact that spellings of different words contain identical grapheme substrings, and (ii) the default tables, which reflect the probabilities of mappings in the training set. More sophisticated reasoning methods can be used instead of the default tables: at present we are investigating the consequences of substituting case-based reasoning, as implemented in IBL, for the present default tables.</Paragraph>
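Layering the default tables on top of the lookup sketch above could look as follows; again an illustrative sketch with assumed names, reusing context_pattern and window_schedule from the previous code block.

```python
from collections import Counter, defaultdict

def build_default_tables(aligned_lexicon):
    """Most frequent phoneme per 1-1-1 pattern and per single grapheme (0-1-0)."""
    by_111, by_010 = defaultdict(Counter), defaultdict(Counter)
    for word, phon in aligned_lexicon:
        for i in range(len(word)):
            by_111[context_pattern(word, i, 1, 1)][phon[i]] += 1
            by_010[word[i]][phon[i]] += 1

    def most_frequent(counts):
        return counts.most_common(1)[0][0]

    return ({p: most_frequent(c) for p, c in by_111.items()},
            {g: most_frequent(c) for g, c in by_010.items()})

def phoneme_with_defaults(word, i, subtables, default_111, default_010):
    for left, right in window_schedule():
        match = subtables[(left, right)].get(context_pattern(word, i, left, right))
        if match is not None:
            return match                                  # 'certain' mapping
    best_guess = default_111.get(context_pattern(word, i, 1, 1))
    if best_guess is not None:
        return best_guess                                 # 1-1-1 'best guess'
    return default_010.get(word[i], "-")                  # 0-1-0 'final guess'
```

</Section> </Section> </Paper>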