File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/h94-1050_metho.xml
Size: 14,131 bytes
Last Modified: 2025-10-06 14:13:49
<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1050"> <Title>Weighted Rational Transductions and their Application to Human Language Processing</Title> <Section position="4" start_page="264" end_page="265" type="metho"> <SectionTitle> 3. Speech Recognition </SectionTitle> <Paragraph position="0"> In our first application, we elaborate on how to describe a speech recognizer as a transduction cascade. Recall we decompose the problem into a language, O, of acoustic observation sequences, a transduction, A, from acoustic observation sequences to phone sequences, a transduction, D, from phone sequences to word sequences and a weighted language, M, specifying the language model (see Figure 1). Each of these can be represented as a finite-state automaton (to some approximation). null The trivial automaton for the acoustic observation language, O, is defined for a given utterance as depicted in Figure 2a.</Paragraph> <Paragraph position="1"> Each state represents a fixed point in time ti, and each transition has a label, oi, drawn from a finite alphabet that quantizes the acoustic waveform between adjacent time points and is assigned probability 1.0.</Paragraph> <Paragraph position="2"> The automaton for the acoustic observation sequence to phone sequence transduction, A, is defined in terms of phone models.</Paragraph> <Paragraph position="3"> A phone model is defined as a transducer from a subsequence of acoustic observation labels to a specific phone, and assigns to each subsequence a likelihood that the specified phone produced it. Thus, different paths through a phone model correspond to different acoustic realizations of the phone.</Paragraph> <Paragraph position="4"> Figure 2b depicts a common topology for such a phone model.</Paragraph> <Paragraph position="5"> A is then defined as the closure of the sum ofthephone models.</Paragraph> <Paragraph position="6"> The automaton for the phone sequence to word sequence transduction, D, is defined similarly to that for A. We define a word model as a transducer from a subsequence of phone labels to a specific word, which assigns to each subsequence a likelihood that the specified word produced it. Thus, different paths through a word model correspond to different phonetic realizations of the word. Figure 2c depicts a common topology for such a word model. D is then defined as the closure of the sum of the phone models.</Paragraph> <Paragraph position="7"> Finally, the language model, M, is commonly an N-gram model, encodable as a WFSA. Combining these automata, (0 o A o D o M)(w) is thus an automaton that assigns a probability to each word sequence, and the highest-probability path through that automaton estimates the most likely word sequence for the given utterance.</Paragraph> <Paragraph position="8"> The finite-state modeling for speech recognition that we have just described is hardly novel. In fact, it is equivalent to that presented in \[12\], in the sense that it generates the same weighted language. However, the transduction cascade approach presented here allows one to view the computations in new ways.</Paragraph> <Paragraph position="9"> For instance, because composition, o, is associative, we see that the computation of max,o(O o A o D o M)(w) can be organized in several ways. 
A conventional integrated-search speech recognizer computes max_w(O o (A o D o M))(w).</Paragraph> <Paragraph position="10"> In other words, the phone, word, and language models are, in effect, compiled together into one large transducer which is then applied to the input observation sequence [12]. On the other hand, one can use a more modular, staged computation, max_w(((O o A) o D) o M)(w). In other words, first the acoustic observations are transduced into a phone lattice represented as an automaton labeled by phones (phone recognition). This lattice is in turn transduced into a word lattice (word recognition), which is then joined with the language model (language model application) [13].</Paragraph> <Paragraph position="11"> The best approach may depend on the specific task, which determines the size of intermediate results and whether finite-state minimization is fruitful. By having a general package to manipulate these automata, we have been able to experiment with various alternatives. For many tasks, the complete network, O o A o D o M, is too large to compute explicitly, regardless of the order in which the operations are applied. The solution usually taken is to interleave the best-path computation with the composition operations and to retain only a portion of the intermediate results by discarding unpromising paths.</Paragraph> <Paragraph position="12"> So far, our presentation has used context-independent phone models. In other words, the likelihoods assigned by a phone model in A assumed conditional independence from neighboring phones. However, it has been shown that context-dependent phone models, which model a phone in the context of its adjacent phones, are very effective for improving recognition performance [14].</Paragraph> <Paragraph position="13"> We can include context-dependent models, such as triphone models, in our presentation by expanding our 'atomic models' in A to one for every phone in a distinct triphonic context.</Paragraph> <Paragraph position="14"> Each model will have the same form as in Figure 2b, but will have different likelihoods for the different contexts. We could also try to specify D directly in terms of the new units, but this is problematic. First, even if each word in D had only one phonetic realization, we could not directly substitute its spelling in terms of context-dependent units, since the cross-word units must be specified (because of the closure operation). In this case, a common approach is either to use left (right) context-independent units at word starts (ends), or to build a fully context-dependent lexicon but have special computations that ensure the correct models are used at word junctures. In either case, this disallows the use of phonetic networks as in Figure 2c.</Paragraph> <Paragraph position="15"> There is, however, a natural solution to these problems using a finite-state transduction. We leave D as defined before, but interpose a new transduction, C, between A and D, to convert between context-dependent and context-independent units.
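Anticipating the construction spelled out in the next paragraph, the following minimal sketch builds the arcs of such a context-conversion transducer for a toy phone inventory: one state per biphone and one arc per context-dependent model. Boundary (utterance-initial and -final) contexts, arc weights, and the triphone naming convention used here are illustrative assumptions.

```python
# A minimal sketch of building a context-conversion transducer C from a toy
# triphone inventory: for each context-dependent model of phone pi_c in the
# context of pi_l and pi_r, add an arc from the state for biphone (pi_l, pi_c)
# to the state for biphone (pi_c, pi_r), reading the context-dependent unit
# and writing a context-independent phone. Boundary handling is omitted.

from itertools import product

def build_context_transducer(phones):
    """Return arcs of C as (src_state, input_label, output_label, dst_state)."""
    arcs = []
    for pi_l, pi_c, pi_r in product(phones, repeat=3):
        cd_model = f"{pi_l}-{pi_c}+{pi_r}"        # hypothetical triphone-model name
        src = (pi_l, pi_c)                        # state q_lc for biphone pi_l pi_c
        dst = (pi_c, pi_r)                        # state q_cr for biphone pi_c pi_r
        arcs.append((src, cd_model, pi_r, dst))   # input gamma, output pi_r
    return arcs

for arc in build_context_transducer(["ae", "t", "k"])[:3]:
    print(arc)
```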
In other words, we now compute max_w(O o A o C o D o M)(w).</Paragraph> <Paragraph position="16"> The form of C for triphonic models is depicted in Figure 2d.</Paragraph> <Paragraph position="17"> For each context-dependent phone model, γ, which corresponds to the (context-independent) phone π_c in the context of π_l and π_r, there is a state q_lc in C for the biphone π_l π_c, a state q_cr for π_c π_r, and a transition from q_lc to q_cr with input label γ and output label π_r. We have constructed such a transducer and have been able to easily convert context-independent phonetic networks into context-dependent networks for certain tasks. In those cases, we can implement full context dependency with no special-purpose computations.</Paragraph> </Section> <Section position="5" start_page="265" end_page="266" type="metho"> <SectionTitle> 4. Chinese Text Segmentation </SectionTitle> <Paragraph position="0"> Our second application is to text processing, namely the tokenization of Chinese text into words, and the assignment of pronunciations to those words. In Chinese orthography, most characters represent (monosyllabic) morphemes, and, as in English, words may consist of one or more morphemes.</Paragraph> <Paragraph position="1"> Given that Chinese does not use whitespace to delimit words, it is necessary to 'reconstruct' the grouping of characters into words. For example, we want to say that the sentence 日文章魚怎麼說 &quot;How do you say octopus in Japanese?&quot; consists of four words, namely 日文 ri4-wen2 'Japanese', 章魚 zhang1-yu2 'octopus', 怎麼 zen3-mo 'how', and 說 shuo1 'say'. The problem with this sentence is that 日 ri4 is also a word (e.g., a common abbreviation for Japan), as are 文章 wen2-zhang1 'essay' and 魚 yu2 'fish', so there is not a unique segmentation.</Paragraph> <Paragraph position="2"> The task of segmenting and pronouncing Chinese text is naturally thought of as a transduction problem. The Chinese dictionary (see footnote 3) is represented as a WFST D. The input alphabet is the set of Chinese characters, and the output alphabet is the union of the set of Mandarin syllables with the set of part-of-speech labels. A given word is represented as a sequence of character-to-syllable transitions, terminated in an ε-to-part-of-speech transition weighted by an estimate of the negative log probability of the word. For instance, the word 章魚 'octopus' would be represented as the sequence of transductions 章:zhang1/0.0 魚:yu2/0.0 ε:noun/13.18. A dictionary in this form can easily be minimized using standard algorithms.</Paragraph> <Paragraph position="3"> An input sentence is represented as an unweighted acceptor S, with characters as transition labels. Segmentation is then accomplished by finding the lowest-weight string in S o D*. The result is a string with the words delimited by part-of-speech labels and marked with their pronunciation. For the example at hand, the best path is the correct segmentation, mapping the input sequence 日文章魚怎麼說 to the sequence ri4 wen2 noun zhang1 yu2 noun zen3 mo adv shuo1 verb.</Paragraph> <Paragraph position="4"> As is the case with English, no Chinese dictionary covers all of the words that one will encounter in Chinese text. For example, many words that are derived via productive morphological processes are generally not to be found in the dictionary. One such case in Chinese involves words derived via the nominal plural affix 們 -men.
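Returning briefly to the segmentation step above: finding the lowest-weight string in S o D* amounts to covering the input characters with weighted dictionary entries at minimum total cost. The following minimal sketch computes the same result by dynamic programming over a toy lexicon; apart from the cost 13.18 quoted above for 章魚, all entries, costs, and output labels are illustrative assumptions.

```python
# A minimal sketch equivalent to taking the lowest-weight string of S o D*:
# cover the character string with dictionary words at minimum total cost.
# Costs are negative log probabilities; the lexicon below is a toy stand-in
# for the weighted dictionary D (only the 13.18 cost comes from the text).

def segment(chars, lexicon):
    """lexicon: dict word -> (cost, output); return the cheapest segmentation."""
    n = len(chars)
    best = [(float("inf"), [])] * (n + 1)   # best[i]: cheapest cover of chars[:i]
    best[0] = (0.0, [])
    for i in range(n):
        cost_i, out_i = best[i]
        if cost_i == float("inf"):
            continue
        for word, (cost, output) in lexicon.items():
            j = i + len(word)
            if chars[i:j] == word and cost_i + cost < best[j][0]:
                best[j] = (cost_i + cost, out_i + [output])
    return best[n]

lexicon = {
    "日文": (8.11, "ri4 wen2 noun"),
    "日":   (9.50, "ri4 noun"),
    "文章": (7.92, "wen2 zhang1 noun"),
    "章魚": (13.18, "zhang1 yu2 noun"),
    "魚":   (9.10, "yu2 noun"),
    "怎麼": (6.20, "zen3 mo adv"),
    "說":   (5.75, "shuo1 verb"),
}
print(segment("日文章魚怎麼說", lexicon))   # picks the four-word analysis
```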
While some words in 們 will be found in the dictionary (e.g., 他們 ta1-men 'they'; 人們 ren2-men 'people'), many attested instances will not: for example, 將們 jiang4-men '(military) generals', 青蛙們 qing1-wa1-men 'frogs'. Given that the basic dictionary is represented as a finite-state automaton, it is a simple matter to augment the model just described with standard techniques from finite-state morphology ([15, 16], inter alia). [Footnote 3: We are currently using the 'Behavior Chinese-English Electronic Dictionary', Copyright Number 112366, from Behavior Design Corporation, R.O.C.; we also wish to thank United Informatics, Inc., R.O.C. for providing us with the Chinese text corpus that we used in estimating lexical probabilities. Finally, we thank Dr. Jyun-Sheng Chang for kindly providing us with Chinese personal name corpora.]</Paragraph> <Paragraph position="5"> For instance, we can represent the fact that 們 attaches to nouns by allowing ε-transitions from the final states of noun entries to the initial state of a sub-transducer containing 們. However, for our purposes it is not sufficient merely to represent the morphological decomposition of (say) plural nouns, since we also want to estimate the cost of the resulting words. For derived words that occur in our corpus we can estimate these costs as we would the costs for an underived dictionary entry. So, 將們 jiang4-men '(military) generals' occurs and we estimate its cost at 15.02; we include this word by allowing an ε-transition between 將 and 們, with a cost chosen so that the entire analysis of 將們 ends up with a cost of 15.02.</Paragraph> <Paragraph position="6"> For non-occurring possible plural forms (e.g., 南瓜們 nan2-gua1-men 'pumpkins') we use the Good-Turing estimate (e.g., [17]), whereby the aggregate probability of previously unseen members of a construction is estimated as N1/N, where N is the total number of observed tokens and N1 is the number of types observed only once; again, we arrange the automaton so that noun entries may transition to 們, and the cost of the whole (previously unseen) construction comes out with the value derived from the Good-Turing estimate.</Paragraph> <Paragraph position="8"> Another large class of words that are generally not to be found in the dictionary are Chinese personal names: only famous names like 周恩來 'Zhou Enlai' can reasonably be expected to be in a dictionary, and even many of these are missing. Full Chinese personal names are formally simple, being always of the form FAMILY+GIVEN. The FAMILY name set is restricted: there are a few hundred single-character FAMILY names, and about ten double-character ones. Given names are most commonly two characters long, occasionally one character long: there are thus four possible name types. The difficulty is that GIVEN names can consist, in principle, of any character or pair of characters, so the possible GIVEN names are limited only by the total number of characters, though some characters are certainly far more likely than others. For a sequence of characters that is a possible name, we wish to assign a probability to that sequence qua name. We use a variant of an estimate proposed in [18].
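Before turning to that estimate, here is a minimal sketch of the two cost calculations just described for -men plurals: choosing the cost of the transition into the 們 sub-transducer so that an attested plural's analysis sums to its corpus-estimated cost, and reserving the Good-Turing mass N1/N for unseen plurals. The counts, the stem cost, and the use of natural logarithms below are assumptions for illustration; the paper does not state the log base.

```python
# A minimal sketch of the -men cost bookkeeping, with costs taken to be
# negative natural-log probabilities (an assumption) and toy counts.

import math

def good_turing_unseen_mass(token_count, singleton_types):
    """Aggregate probability reserved for previously unseen members of a
    construction: N1 / N, where N is the total number of observed tokens
    and N1 is the number of types observed exactly once."""
    return singleton_types / token_count

def affix_transition_cost(target_word_cost, stem_cost):
    """Cost placed on the epsilon transition from a noun entry into the -men
    sub-transducer, chosen so that the whole derived analysis sums to the
    word's estimated cost (e.g., 15.02 for jiang4-men in the text)."""
    return target_word_cost - stem_cost

# Attested plural: make stem cost plus the -men transition sum to 15.02.
print(affix_transition_cost(target_word_cost=15.02, stem_cost=10.50))   # 4.52 (toy stem cost)

# Unseen plurals share the Good-Turing mass N1/N, expressed as a cost.
p_unseen = good_turing_unseen_mass(token_count=250_000, singleton_types=3_100)
print(-math.log(p_unseen))   # cost assigned to the unseen-plural construction
```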
Given a potential name of the form F1 G1 G2, where F1 is a legal FAMILY name and G1 and G2 are Chinese characters, we estimate the probability of that name as the product of the probability of finding any name in text; the probability of F1 as a FAMILY name; the probability of the first character of a double GIVEN name being G1; the probability of the second character of a double GIVEN name being G2; and the probability of a name of the form SINGLE-FAMILY+DOUBLE-GIVEN. The first probability is estimated from a count of names in a text database, whereas the last four probabilities are estimated from a large list of personal names. This model is easily incorporated into the segmenter by building a transducer restricting the names to the four licit types, with costs on the transitions for any particular name summing to an estimate of the cost of that name. This transducer is then summed with the transducer implementing the dictionary and morphological rules, and the transitive closure of the resulting transducer is computed.</Paragraph> </Section> </Paper>
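A minimal sketch of the name estimate just described: the five independently estimated terms are multiplied and the product converted to a cost, which the name transducer's transitions would sum to along the path for that name. All probability tables and values below are hypothetical placeholders, and natural logarithms are an assumption.

```python
# A minimal sketch of the SINGLE-FAMILY + DOUBLE-GIVEN name-probability
# estimate. The probability tables are hypothetical placeholders; in the
# paper they are estimated from a text database and a personal-name list.

import math

p_any_name  = 0.005                      # P(a name occurs at this point in text)
p_family    = {"周": 0.02, "王": 0.09}    # P(character as a FAMILY name)
p_given_1st = {"恩": 0.004}               # P(char as first char of a double GIVEN name)
p_given_2nd = {"來": 0.003}               # P(char as second char of a double GIVEN name)
p_name_type = {"SINGLE-FAMILY+DOUBLE-GIVEN": 0.55}

def name_cost(f1, g1, g2):
    """Negative log of the product of the five terms; this is the total cost
    the name transducer's transitions would sum to for the name f1 g1 g2."""
    prob = (p_any_name
            * p_family[f1]
            * p_given_1st[g1]
            * p_given_2nd[g2]
            * p_name_type["SINGLE-FAMILY+DOUBLE-GIVEN"])
    return -math.log(prob)

print(name_cost("周", "恩", "來"))
```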