<?xml version="1.0" standalone="yes"?> <Paper uid="P01-1068"> <Title>Multi-Class Composite N-gram Language Model for Spoken Language Processing Using Multiple Word Clusters</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Word N-grams have been widely used as a statistical language model for language processing.</Paragraph>
<Paragraph position="1"> Word N-grams are models that give the transition probability of the next word from the previous N-1 word sequence, based on a statistical analysis of a huge text corpus. Though word N-grams are more effective and flexible than rule-based grammatical constraints in many cases, their performance strongly depends on the size of the training data, since they are statistical models.</Paragraph>
<Paragraph position="2"> In word N-grams, the accuracy of word prediction increases with the order N, but the number of word transition combinations also increases exponentially. Moreover, the size of the training data required for reliable transition probability values increases dramatically as well. This is a critical problem for spoken language, for which it is difficult to collect enough training data to build a reliable model. Class N-grams have been proposed as a solution to this problem.</Paragraph>
<Paragraph position="3"> In class N-grams, multiple words are mapped to one word class, and the transition probabilities from word to word are approximated by the probabilities from word class to word class. The performance and model size of class N-grams strongly depend on the definition of the word classes. In fact, the performance of class N-grams based on part-of-speech (POS) word classes is usually considerably lower than that of word N-grams. Effective word class definitions are therefore required for high performance in class N-grams.</Paragraph>
<Paragraph position="4"> In this paper, Multi-Class assignment is proposed for effective word class definitions. A word class is used to represent word connectivity, i.e. which words will appear in a neighboring position with what probability. In Multi-Class assignment, the word connectivity in each position of the N-gram is regarded as a different attribute, and multiple classes corresponding to these attributes are assigned to each word. For the word clustering of each Multi-Class, a method is used in which word classes are formed automatically and statistically from a corpus, without using a priori knowledge such as POS information. Furthermore, by introducing higher order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 N-gram Language Models Based on Multiple Word Classes </SectionTitle>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Class N-grams </SectionTitle>
<Paragraph position="0"> Word N-grams are models that statistically give the transition probability of the next word from the previous N-1 word sequence. This transition probability is given by the following formula:
\[ p(w_i \mid w_{i-N+1} \cdots w_{i-1}) \qquad (1) \]</Paragraph>
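As a concrete illustration of formula (1), the following minimal sketch (not taken from the paper; the toy corpus, function names and sentence-boundary symbols are invented for this example) estimates word 2-gram probabilities by relative frequency.

```python
# Illustrative sketch: maximum-likelihood word N-gram estimation on a toy corpus.
from collections import defaultdict

def train_word_ngram(sentences, n=2):
    """Count n-gram and (n-1)-gram history frequencies."""
    ngram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for words in sentences:
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            ngram_counts[history + (padded[i],)] += 1
            history_counts[history] += 1
    return ngram_counts, history_counts

def word_ngram_prob(word, history, ngram_counts, history_counts):
    """p(w_i | w_{i-N+1} ... w_{i-1}) estimated by relative frequency."""
    history = tuple(history)
    if history_counts[history] == 0:
        return 0.0
    return ngram_counts[history + (word,)] / history_counts[history]

corpus = [["he", "buys", "a", "book"], ["she", "buys", "an", "apple"]]
ngrams, histories = train_word_ngram(corpus, n=2)
print(word_ngram_prob("a", ["buys"], ngrams, histories))   # 0.5
print(word_ngram_prob("an", ["buys"], ngrams, histories))  # 0.5
```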
<Paragraph position="2"> In word N-grams, accurate word prediction can be expected, since a word-dependent, unique connectivity from word to word can be represented. On the other hand, the number of estimated parameters, i.e. the number of combinations of word transitions, is V^N (where V is the vocabulary size). Since the parameters increase exponentially with N, reliable estimation of each word transition probability is difficult for a large N.</Paragraph>
<Paragraph position="5"> Class N-grams have been proposed to resolve the problem that a huge number of parameters is required in word N-grams. In class N-grams, the transition probability of the next word from the previous N-1 word sequence is given by the following formula:
\[ p(w_i \mid w_{i-N+1} \cdots w_{i-1}) \simeq p(w_i \mid c_i)\, p(c_i \mid c_{i-N+1} \cdots c_{i-1}) \qquad (2) \]
where c_i is the word class to which word w_i belongs. However, the accuracy of word prediction will be lower than that of word N-grams trained on a sufficient amount of data, since the word-dependent, unique connectivity attribute is lost in the approximation by word classes.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Problems in the Definition of Word Classes </SectionTitle>
<Paragraph position="0"> In class N-grams, word classes are used to represent the connectivity between words. In the conventional word class definition, the connectivity to the following words and the connectivity to the preceding words are treated as the same neighboring characteristic, without distinction. Therefore, only words that have the same connectivity to both the following and the preceding words can belong to the same word class, and this word class definition cannot represent the word connectivity attribute efficiently. Take &quot;a&quot; and &quot;an&quot; as an example. Both are classified by POS as indefinite articles and are assigned to the same word class. In this case, information about the difference in their connectivity to the following word is lost. On the other hand, assigning different classes to the two words loses the information that they share the same connectivity to the preceding word. This directional distinction is quite crucial for languages with inflection, such as French and Japanese.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Multi-Class and Multi-Class N-grams </SectionTitle>
<Paragraph position="0"> As in the previous example of &quot;a&quot; and &quot;an&quot;, the following and preceding word connectivities are not always the same. Let us consider the case in which the connectivity to the preceding words differs from the connectivity to the following words. Multiple word classes are assigned to each word to represent the following and the preceding word connectivity separately. As the connectivity to the words preceding &quot;a&quot; and &quot;an&quot; is the same, it is efficient to assign both words to the same class for representing the preceding word connectivity, while assigning them different classes for representing the following word connectivity. Applying these word class definitions to formula (2) gives the following formula:
\[ p(w_i \mid w_{i-N+1} \cdots w_{i-1}) \simeq p(w_i \mid c_t(w_i))\, p(c_t(w_i) \mid c_{N-1}(w_{i-N+1}) \cdots c_1(w_{i-1})) \qquad (3) \]
where c_t(w_i) is the target class of word w_i and c_n(w) represents the word class of word w in the n-th position of the conditional word sequence.</Paragraph>
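The following toy sketch (not from the paper; the class labels, probability values and function name are invented) shows how a Multi-Class 2-gram in the sense of formula (3) can be evaluated when each word carries one class for its role as the predicted word and another for its role as the conditioning word, so that "a" and "an" share a target class while keeping distinct conditioning classes.

```python
# Illustrative sketch: Multi-Class 2-gram lookup with two class labels per word.
# target_class is used when the word is predicted (preceding-word connectivity);
# history_class is used when the word conditions the next word (following-word
# connectivity).  All tables below are toy values.
target_class = {"a": "ART", "an": "ART", "book": "NOUN", "apple": "NOUN"}
history_class = {"a": "ART_CONS", "an": "ART_VOWEL", "book": "NOUN_H", "apple": "NOUN_H"}

p_word_given_class = {("a", "ART"): 0.6, ("an", "ART"): 0.4,
                      ("book", "NOUN"): 0.5, ("apple", "NOUN"): 0.5}
p_class_transition = {("ART_CONS", "NOUN"): 0.9, ("ART_VOWEL", "NOUN"): 0.9}

def multiclass_bigram(word, prev_word):
    """p(w_i | w_{i-1}) ~ p(w_i | c_t(w_i)) * p(c_t(w_i) | c_1(w_{i-1}))."""
    emit = p_word_given_class.get((word, target_class[word]), 0.0)
    trans = p_class_transition.get((history_class[prev_word], target_class[word]), 0.0)
    return emit * trans

print(multiclass_bigram("book", "a"))    # 0.5 * 0.9 = 0.45
print(multiclass_bigram("apple", "an"))  # 0.5 * 0.9 = 0.45
```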
<Paragraph position="2"> We call this multiple word class definition a Multi-Class. Similarly, we call class N-grams based on the Multi-Class, Multi-Class N-grams (Yamamoto and Sagisaka, 1999).</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Automatic Extraction of Word Clusters </SectionTitle>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Word Clustering for Multi-Class 2-grams </SectionTitle>
<Paragraph position="0"> For word clustering in class N-grams, POS information is sometimes used. Though POS information can be applied to words that do not appear in the corpus, it is not always an optimal word classification for N-grams, since POS information does not accurately represent the statistical word connectivity characteristics. Better word clustering should therefore be based on word connectivity, reflecting the neighboring characteristics observed in the corpus. In this paper, vectors are used to represent word neighboring characteristics. The elements of the vectors are the smoothed forward or backward word 2-gram probabilities to the clustering target word, and we consider that word pairs with a small distance between their vectors also have similar word neighboring characteristics (Brown et al., 1992) (Bai et al., 1998). In this method, the same vector is assigned to words that do not appear in the corpus, and the same word cluster will therefore be assigned to these words. To avoid excessively rough clustering over different POS, we cluster the words under the condition that only words with the same POS can belong to the same cluster. Parts-of-speech that have the same connectivity in each Multi-Class are merged. For example, if different parts-of-speech are assigned to &quot;a&quot; and &quot;an&quot;, these parts-of-speech are regarded as the same for the preceding word cluster. Word clustering is performed in the following manner (a sketch of this procedure is given after the list).</Paragraph>
<Paragraph position="1"> 1. Assign one unique class to each word as the initial state. 2. Make the vectors for each class:
\[ v_t(c) = [\, p_t(w_1 \mid c),\ p_t(w_2 \mid c),\ \ldots,\ p_t(w_V \mid c)\, ], \quad v_f(c) = [\, p_f(w_1 \mid c),\ p_f(w_2 \mid c),\ \ldots,\ p_f(w_V \mid c)\, ] \]
where p_t is the value of the probability of the succeeding class-word 2-gram or word 2-gram, while p_f is the same for the preceding one.</Paragraph>
<Paragraph position="2"> 3. Merge two classes. We choose the two classes whose merge results in the lowest rise of the dispersion weighted with the 1-gram probability, and merge them:
\[ \Delta U = \sum_{w} p(w)\, d\big(v(c_{new}(w)),\, v(w)\big) - \sum_{w} p(w)\, d\big(v(c_{old}(w)),\, v(w)\big) \]
where d(v_x, v_y) represents the square of the Euclidean distance between vectors v_x and v_y, c_{old} represents the classes before merging, c_{new} represents the classes after merging, and v denotes v_t or v_f depending on which Multi-Class is being clustered. 4. Repeat step 2 until the number of classes is reduced to the desired number.</Paragraph>
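A minimal sketch of this greedy merging procedure, assuming the per-word vectors and unigram weights have already been computed; the data structures, the per-word POS tags and the merge loop below are simplified toy choices and are not the paper's implementation.

```python
# Illustrative sketch: greedy bottom-up class merging that picks, at each step,
# the same-POS pair of classes whose merge gives the smallest rise in the
# unigram-weighted dispersion around the class centroid.
import numpy as np

def cluster_words(vectors, unigram, pos, n_classes):
    # vectors: word -> np.array, unigram: word -> p(w), pos: word -> POS tag
    classes = {w: {w} for w in vectors}   # start with one class per word

    def centroid(members):
        ws = np.array([unigram[w] for w in members])
        vs = np.array([vectors[w] for w in members])
        return (ws[:, None] * vs).sum(axis=0) / ws.sum()

    def dispersion(members):
        c = centroid(members)
        return sum(unigram[w] * np.sum((vectors[w] - c) ** 2) for w in members)

    while len(classes) > n_classes:
        best = None
        for a in classes:
            for b in classes:
                if a >= b or pos[a] != pos[b]:   # same-POS constraint
                    continue
                merged = classes[a] | classes[b]
                rise = dispersion(merged) - dispersion(classes[a]) - dispersion(classes[b])
                if best is None or rise < best[0]:
                    best = (rise, a, b)
        if best is None:      # no mergeable pair left
            break
        _, a, b = best
        classes[a] = classes[a] | classes.pop(b)
    return classes

vecs = {"a": np.array([0.9, 0.1]), "an": np.array([0.85, 0.15]), "book": np.array([0.1, 0.9])}
uni = {"a": 0.4, "an": 0.1, "book": 0.5}
tags = {"a": "ART", "an": "ART", "book": "NOUN"}
print(cluster_words(vecs, uni, tags, n_classes=2))   # {'a': {'a', 'an'}, 'book': {'book'}}
```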
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Word Clustering for Multi-Class 3-grams </SectionTitle>
<Paragraph position="0"> To apply the multiple clustering for 2-grams to 3-grams, the clustering target in the conditional part is extended from the single word in 2-grams to a word pair. The number of clustering targets for the preceding class increases to V^2 from V in 2-grams, and the length of the vector for the succeeding class also increases to V^2. Therefore, efficient word clustering is needed to keep the reliability of the 3-grams after clustering and a reasonable calculation cost.</Paragraph>
<Paragraph position="1"> To avoid losing reliability due to the data sparseness of the word pair in the history of 3-grams, an approximation using distance-2 2-grams is employed. The validity of this approximation is based on a report that the combination of word 2-grams and distance-2 2-grams by the maximum entropy method gives a good approximation of word 3-grams (Zhang et al., 1999). The vector for clustering is given by the following equation:
\[ v_{d2}(w) = [\, p_{d2}(w_1 \mid w),\ p_{d2}(w_2 \mid w),\ \ldots,\ p_{d2}(w_V \mid w)\, ] \]
where p_{d2}(y | x) is the distance-2 2-gram value from word x to word y. The POS constraints for clustering are the same as in the clustering for preceding words.</Paragraph> </Section> </Section>
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Multi-Class Composite N-grams </SectionTitle>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Multi-Class Composite 2-grams Introducing Variable Length Word Sequences </SectionTitle>
<Paragraph position="0"> Let us consider the condition in which only the word sequence (A, B, C) has sufficient frequency in the sequence (X, A, B, C, D). In this case, the value of the word 2-gram p(B | A) can be used as a reliable value for the estimation of word B, as the frequency of the sequence (A, B) is sufficient. The value of the word 3-gram p(C | A, B) can be used for the estimation of word C for the same reason. For the estimation of words A and D, it is reasonable to use the value of the class 2-gram, since the value of the word N-gram is unreliable (note that the frequencies of the word sequences (X, A) and (C, D) are insufficient). Based on this idea, the transition probability of the word sequence (A, B, C, D) from word X is given by the following equation in the Multi-Class 2-gram:
\[ p(A, B, C, D \mid X) = p(A \mid c_t(A))\, p(c_t(A) \mid c_1(X)) \cdot p(B \mid A) \cdot p(C \mid A, B) \cdot p(D \mid c_t(D))\, p(c_t(D) \mid c_1(C)) \qquad (9) \]</Paragraph>
<Paragraph position="2"> When the word succession A+B+C is introduced as a single lexical entry for the variable length word sequence (A, B, C), equation (9) can be changed exactly to the following equation (Deligne and Bimbot, 1995) (Masataki et al., 1996):
\[ p(A, B, C, D \mid X) = p(A{+}B{+}C \mid c_t(A{+}B{+}C))\, p(c_t(A{+}B{+}C) \mid c_1(X)) \cdot p(D \mid c_t(D))\, p(c_t(D) \mid c_1(A{+}B{+}C)) \qquad (10) \]</Paragraph>
<Paragraph position="4"> Here, we find the following properties. The preceding word connectivity of the word succession A+B+C is the same as the connectivity of word A, the first word of A+B+C, and the following connectivity is the same as that of the last word C. With these assignments, no new cluster is required, whereas conventional class N-grams would require a new cluster for the new word succession:
\[ c_t(A{+}B{+}C) = c_t(A) \qquad (11) \]
\[ c_1(A{+}B{+}C) = c_1(C) \qquad (12) \]</Paragraph>
<Paragraph position="6"> Applying these relations to equation (10), the following equation is obtained:
\[ p(A, B, C, D \mid X) = p(A{+}B{+}C \mid c_t(A))\, p(c_t(A) \mid c_1(X)) \cdot p(D \mid c_t(D))\, p(c_t(D) \mid c_1(C)) \qquad (13) \]</Paragraph>
<Paragraph position="8"> Equation (13) means that if the frequency of an N-word sequence is sufficient, we can partially introduce higher order word N-grams through an N-length word succession, while maintaining the reliability of the estimated probability and the formation of the Multi-Class 2-gram. We call Multi-Class 2-grams in which higher order word N-grams are partially introduced through word successions Multi-Class Composite 2-grams. In addition, equation (13) shows that the number of parameters does not increase very much when frequent word successions are added to the lexicon: only a 1-gram of the word succession A+B+C has to be added to the conventional N-gram parameters.</Paragraph>
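A small sketch (not from the paper) of how a frequent word triple can be registered as a single word-succession token whose classes are copied from its first and last words, as in equations (11) and (12); the example words, class labels and helper function are invented.

```python
# Illustrative sketch: a word succession as a single lexicon entry that reuses
# existing classes, so no new cluster is created for it.
SUCCESSIONS = {("in", "front", "of"): "in+front+of"}

target_class = {"in": "PREP", "front": "NOUN", "of": "PREP", "stand": "VERB", "me": "PRON"}
history_class = {"in": "PREP_H", "front": "NOUN_H", "of": "PREP_H", "stand": "VERB_H", "me": "PRON_H"}

# c_t(A+B+C) = c_t(A): target class of the first word
target_class["in+front+of"] = target_class["in"]
# c_1(A+B+C) = c_1(C): history class of the last word
history_class["in+front+of"] = history_class["of"]

def tokenize_with_successions(words):
    """Greedily replace known word successions by their single-token entries."""
    out, i = [], 0
    while i < len(words):
        for span, token in SUCCESSIONS.items():
            if tuple(words[i:i + len(span)]) == span:
                out.append(token)
                i += len(span)
                break
        else:
            out.append(words[i])
            i += 1
    return out

print(tokenize_with_successions(["stand", "in", "front", "of", "me"]))
# ['stand', 'in+front+of', 'me']
```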
<Paragraph position="9"> Multi-Class Composite 2-grams are created in the following manner (a sketch of this procedure follows the list). 1. Assign a Multi-Class 2-gram as the initial state. 2. Find a word pair whose frequency is above the threshold.</Paragraph>
<Paragraph position="10"> 3. Create a new word succession entry for the frequent word pair and add it to the lexicon. The preceding class of the word succession is the same as the preceding class of the first word in the pair, and its following connectivity class is the same as the following class of the last word in it.</Paragraph>
<Paragraph position="12"> 4. Replace the frequent word pair in the training data with the word succession, and recalculate the frequencies of the word and word succession pairs. The summation of the probabilities is therefore always kept equal to 1.</Paragraph>
<Paragraph position="13"> 5. Repeat step 2, including the newly added word successions, until no more word pairs are found.</Paragraph>
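A compact sketch of this construction loop, under simplifying assumptions: it merges one frequent pair per pass, joins words with "+", and runs on a toy corpus; it is illustrative rather than the paper's implementation.

```python
# Illustrative sketch: iteratively introduce frequent word pairs as word
# successions and rewrite the training data, as in steps 1-5 above.
from collections import Counter

def build_successions(corpus, threshold):
    """corpus: list of token lists; returns (rewritten corpus, succession set)."""
    successions = set()
    while True:
        pair_counts = Counter()
        for sent in corpus:
            pair_counts.update(zip(sent, sent[1:]))
        frequent = [p for p, c in pair_counts.items() if c >= threshold]
        if not frequent:
            break
        pair = max(frequent, key=pair_counts.get)   # most frequent pair this pass
        token = "+".join(pair)
        successions.add(token)
        # replace the pair in the training data and recount on the next pass,
        # so probabilities over the rewritten corpus still sum to one
        new_corpus = []
        for sent in corpus:
            out, i = [], 0
            while i < len(sent):
                if tuple(sent[i:i + 2]) == pair:
                    out.append(token)
                    i += 2
                else:
                    out.append(sent[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus, successions

corpus = [["i", "would", "like", "to", "go"], ["i", "would", "like", "to", "eat"]]
rewritten, succ = build_successions(corpus, threshold=2)
print(rewritten[0])   # ['i+would+like+to', 'go']
```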
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Extension to Multi-Class Composite 3-grams </SectionTitle>
<Paragraph position="0"> Next, we put the word succession into the formulation of Multi-Class 3-grams. The transition probability to the word sequence (A, B, C, D, E, F) from the word pair (X, Y), with the words A, B, C and D grouped into the word succession A+B+C+D, is given by the following equation:
\[ p(A, B, C, D, E, F \mid X, Y) = p(A{+}B{+}C{+}D \mid c_t(A{+}B{+}C{+}D))\, p(c_t(A{+}B{+}C{+}D) \mid c_2(X)\, c_1(Y)) \cdot p(E \mid c_t(E))\, p(c_t(E) \mid c_1(A{+}B{+}C{+}D)) \cdot p(F \mid c_t(F))\, p(c_t(F) \mid c_2(A{+}B{+}C{+}D)\, c_1(E)) \qquad (14) \]</Paragraph>
<Paragraph position="1"> Here, the Multi-Classes for the word succession A+B+C+D are assigned as follows:
\[ c_t(A{+}B{+}C{+}D) = c_t(A) \qquad (15) \]
\[ c_2(A{+}B{+}C{+}D) = c_2(D) \qquad (16) \]
\[ c_1(A{+}B{+}C{+}D) = c_2(C)\, c_1(D) \qquad (17) \]</Paragraph>
<Paragraph position="2"> In equation (17), please notice that a class sequence (not a single class) is assigned to the preceding class of the word succession. The class sequence consists of the preceding class of the last word of the word succession and the pre-preceding class of the second word from the last. Applying these class assignments to equation (14) gives the following equation:
\[ p(A, B, C, D, E, F \mid X, Y) = p(A{+}B{+}C{+}D \mid c_t(A))\, p(c_t(A) \mid c_2(X)\, c_1(Y)) \cdot p(E \mid c_t(E))\, p(c_t(E) \mid c_2(C)\, c_1(D)) \cdot p(F \mid c_t(F))\, p(c_t(F) \mid c_2(D)\, c_1(E)) \qquad (18) \]</Paragraph>
<Paragraph position="3"> In the above formulation, the parameter increase from the Multi-Class 3-gram is p(A+B+C+D | c_t(A)). After expanding this term, the following equation is given:
\[ p(A, B, C, D, E, F \mid X, Y) = p(A \mid c_t(A))\, p(c_t(A) \mid c_2(X)\, c_1(Y)) \cdot p(B \mid A) \cdot p(C \mid A, B) \cdot p(D \mid A, B, C) \cdot p(E \mid c_t(E))\, p(c_t(E) \mid c_2(C)\, c_1(D)) \cdot p(F \mid c_t(F))\, p(c_t(F) \mid c_2(D)\, c_1(E)) \qquad (19) \]</Paragraph>
<Paragraph position="4"> In equation (19), the words other than B are estimated by the same or more accurate models than Multi-Class 3-grams (Multi-Class 3-grams for words A, E and F, and a word 3-gram and word 4-gram for words C and D). However, for word B a word 2-gram is used instead of the Multi-Class 3-gram, though its accuracy is lower than that of the Multi-Class 3-gram. To prevent this decrease in the accuracy of estimation, the following process is introduced.</Paragraph>
<Paragraph position="5"> First, the 3-gram entry for predicting word E, whose history contains the word succession A+B+C+D, is removed. After this deletion, back-off smoothing is applied to this entry (equation (20)). Next, we assign to the back-off parameter in equation (20) a value chosen to correct the decrease in the accuracy of the estimation of word B (equation (21)).</Paragraph>
<Paragraph position="6"> After this assignment, the probabilities of words B and E are locally incorrect. However, the total probability is correct, since the back-off parameter is used to correct the decrease in the accuracy of the estimation of word B. In fact, applying equations (20) and (21) to equation (14) according to the above definitions gives an equation in which the probability for word B is changed from a word 2-gram to a class 3-gram.</Paragraph>
<Paragraph position="7"> In the above process, only two kinds of parameters are additionally used. One is the word 1-gram of the word succession, p(A+B+C+D). The other is the word 2-gram of the first two words of the word succession. The number of combinations of the first two words of the word successions is at most the number of word successions. Therefore, the number of additional parameters in the Multi-Class Composite 3-gram is at most twice the number of introduced word successions.</Paragraph> </Section> </Section>
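As a worked check of the parameter count stated above (one word 1-gram per introduced succession plus at most one word 2-gram for its first two words), the following toy computation assumes three word successions; the succession strings are invented for this example.

```python
# Illustrative sketch: counting the additional parameters of the
# Multi-Class Composite 3-gram for a toy set of word successions.
successions = ["in+front+of", "as+soon+as", "a+lot+of"]

first_two_bigrams = set()
for succ in successions:
    first_two_bigrams.add(tuple(succ.split("+")[:2]))

added_entries = len(successions) + len(first_two_bigrams)   # 1-grams + 2-grams
print(added_entries, "<=", 2 * len(successions))            # 6 <= 6
```
</Paper>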