<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1015">
  <Title>SPEECH RECOGNITION USING A STOCHASTIC LANGUAGE MODEL INTEGRATING LOCAL AND GLOBAL CONSTRAINTS</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SPEECH RECOGNITION USING A STOCHASTIC LANGUAGE
MODEL INTEGRATING LOCAL AND GLOBAL CONSTRAINTS
Ryosuke Isotani, Shoichi Matsunaga
ATR Interpreting Telecommunications Research Laboratories
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> In this paper, we propose a new stochastic language model that integrates local and global constraints effectively and describe a speechrecognition system basedon it. Theproposedlanguagemodel uses the dependencies within adjacent words as local constraints in the same way as conventional word N-gram models. To capture the global constraints between non-contiguous words, we take into account the sequence of the function words and that of the content words which are expected to represent, respectively, the syntactic and semantic relationships between words. Furthermore, we show that assuming an independence between local- and global constraints, the number of parameters to be estimated and stored is greatly reduced.</Paragraph>
    <Paragraph position="1"> The proposed language model is incorporated into a speech recognizer based on the time-synchronous Viterbi decoding algorithm, and compared with the word bigram model and trigram model. The proposed model gives a better recognition rate than the bigram model, though slightly worse than the trigram model, with only twice as many parameters as the bigram model.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> At present, word N-gram models \[1\], especially bigram (N = 2).or trigram (N = 3) models, are recognized as effective and are widely used as language models for speech recognition. Such models, however, represent only local constraints within a few successive words and lack the ability to capture global or long distance dependencies between words. They might represent global constraints if N were set at a larger value, but it is not only computationally impractical but also inefficient because dependencies between non-contiguous words are often independent of the contents and length of the word string between them. In addition, estimating so many parameters from a finite number of text corpora would result in sparseness of data.</Paragraph>
    <Paragraph position="1"> Recently some papers treat long distance factors. In the long distance bigrams by Huang et al. \[2\] a linear combination of distance-d bigrams is used. All the preceding words in a window of fixed length are considered, and bigram probabilities are estimated for each distance d between words respectively. The extended bigram model by Wright et al. \[3\] uses a single word selected for each word according to a statistical measure as its &amp;quot;parent.&amp;quot; The extended bigrams are insensitive to the distance between the word and its parent, but this model does not utilize multiple information. The trigger pairs described in \[4, 5\] also represent relationships between non-contiguous words. They are also extracted automatically and insensitive to the distance. The way of combining the evidence from trigger pairs with local constraints (&amp;quot;the static model&amp;quot; in their term) is also given. But this approach has the disadvantage that it is computationaUy expensive. Another approach is a tree-based model \[6\], which automatically generates a binary decision tree from training data. Although it could also extract similar dependencies by setting binary questions appropriately, it has the same disadvantage as the trigger-based model.</Paragraph>
    <Paragraph position="2"> We therefore proposed a new language model based on function word N-grams 1 and content word N-grams \[7\]. Global constraints are captured effectively without significantly increasing computational cost nor number of parameters by utilizing simple linguistic tags. Function wordN-grams are mainly intended for syntactic constraints, while content word N-grams are for semantic ones. We already showed their effectiveness for Japanese speech recognition by applying them to sentence candidate selection from phrase lattices obtained by a phrase speech recognizer. We also gave a method to combine these global constraints with local constraints similar to conventional bigrams, and demonstrated that it improves performance. null In this paper, we extend and modify this model so that it can be incorporated directly into the search process in continuous speech recognition based on the time-synchronous Viterbi decoding algorithm. The new model uses the conventional word N-grams for local constraints with N being a small value, and uses function- and content word N-grams as global constraints, where N can again be small. These constraints are treated statistically in a unified manner. A similar approach is found in \[8\], where, to compute a word probability, the headwords of the two phrases immediately preceding the word are used as well as the last two words. Our model is different from this method in that the former also takes function words into consideration, and treats function words and content words separately in computing the probability to extract more effective syntactic and semantic information, respectively.</Paragraph>
    <Paragraph position="3"> In the following sections, we first explain the proposed language model, where we also show that the number of parameters can be reduced by assuming an independence between local- and global constraints. Then we describe how it is incorporated into the time-synchronous Viterbi decoding algorithm. Finally, results of speaker-dependent sentence recognition experiments are presented, where our model is compared with the word bigram and trigram models in the viewpoints of number of parameters, perplexity, and recognition rate.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="89" type="metho">
    <SectionTitle>
2. LANGUAGE MODELING
</SectionTitle>
    <Paragraph position="0"> Linguistic constraints between words in a sentence include syntactic ones and semantic ones. The syntactic constraints are often specified by the relationships between the cases of the words or phrases. Con- null the conference (CM) the 2nd from the 5th to Kyoto in (The conference will be held in Kyoto from the 2nd to the 5th.) Soredewa / tourokuyoushi -o / ookuri -itashi -masu.</Paragraph>
    <Paragraph position="1"> then the registration form (CM) send (aux. v.) (aux. v.) (Then I will send you the registration form.)  (CM: case marker, aux. v.: auxiliary verb) kaisaisare -masu.</Paragraph>
    <Paragraph position="2"> be held (aux. v.) sequently, they are expected to be reflected in the sequence of the cases of the words or phrases. Taking notice that case information is mainly conveyed by function words in Japanese, we consider function word sequences to capture syntactic constraints while ignoring content words in the sentences. On the contrary, semantic information is mostly contained in the content words. Accordingly the idea of content word sequences is also introduced to extract semantic constraints.</Paragraph>
    <Paragraph position="3"> After briefly explaining the roles of the function words and content words in Japanese sentences, we will propose a new model, model I, as an extension of the conventionalN-gram model. In this model, the relationships between function words and between content words are taken into consideration only implicitly. Then by making some assumptions, model II will be derived as an approximation of model I. Model II uses the probabilities of function word N-grams and content word N-grams directly and may be easier to grasp intuitively. 2.1. Function Words and Content Words in Japanese A common Japanese sentence consists of phrases (&amp;quot;bunsetsu&amp;quot;), each of which typically has one content word and optional function words. Figure 1 shows examples of Japanese sentences. In the figure, &amp;quot;f' represents a phrase separator. Words after &amp;quot;-&amp;quot; in a phrase are function words and all others are content words 2. The corresponding English words are given in the figure. Content words include nouns, verbs, adjectives, adverbs, etc. Function words are particles and auxiliary verbs. Japanese particles include case markers such as &amp;quot;ga&amp;quot; (subjective case marker), &amp;quot;o&amp;quot; (objective case marker) as well as words such as &amp;quot;kara (from)&amp;quot; or &amp;quot;de (in).&amp;quot; Every word in a sentence is classified either as a content word or as a function word.</Paragraph>
    <Paragraph position="4"> Paying attention only to function words and ignoring content words in sentences, &amp;quot;kara (from)&amp;quot; often comes before &amp;quot;made (to)&amp;quot; while &amp;quot;ga&amp;quot;s (subjective case markers) rarely appear in succession in a sentence. Thus, a sequence of function words is expected to reflect the syntactic constraints of a sentence. If we consider the content word sequence instead, such words as &amp;quot;sanka (participate)&amp;quot; or &amp;quot;happyou (give a presentation)&amp;quot; appear more frequently than words such as &amp;quot;okuru (send)&amp;quot; after &amp;quot;kaigi (conference).&amp;quot; On the other hand, after &amp;quot;youshi (form),&amp;quot; &amp;quot;okuru (send)&amp;quot; comes more frequently. Like these examples, a sequence of content words in a sentence is expected to be constrained by semantic relationships between words.</Paragraph>
    <Paragraph position="5"> These kinds of constraints can be described statistically. To acquire these global constraints, the proposed language model makes use of 2 These marks are for explanation only and never appearin actual Japanese text.</Paragraph>
    <Paragraph position="6"> the N-gram probabilities of both function words and content words.</Paragraph>
    <Paragraph position="7">  2.2. Proposed Language Model I Suppose a sentence S consists of a word siring wl, w2,..., wn, and denote a substring Wl, w2,..., wi as w\[. Then the probability of the sentence S is written as</Paragraph>
    <Paragraph position="9"> In conventional word N-gram models, each term of the right hand side of expression (2) is approximated as the probability given for a single word based on the final N - 1 words preceding it. In the bigram model, for example, the foll6wing approximation is adopted:</Paragraph>
    <Paragraph position="11"> The proposed model is an extension of the N-gram model and utilizes the global constraints represented by function- and content word N-grams as well. For simplicity, only a single preceding word is taken into account, both for global and local relationships. Letfi and ci denote the last function word and the last content word in the substring w{, respectively. The probability of a word wi given w{ -1 is, takingfi_l and ci-i into consideration as well as wi-l, represented approximately as follows:</Paragraph>
    <Paragraph position="13"> We refer to the model based on equation (5) as &amp;quot;proposed model I.&amp;quot; Figure 2 shows how the word dependencies are taken into account in  &amp;quot;7&amp;quot;',-. c f c f f c c f c: content word . .~J r~ J / 7 f: function word  this model. The probability of each word in a sentence is determined by the pleceding content- and function-word pair. If content words and function words appear alternately, this model reduces to the trigram model. But when, for example, a function word is preceded by more than one content word, the most recent function word is used to predict it instead of the last word but two (wi-2). 2.3. Proposed Model H -- R, eduction of the Number of Parameters The following two assumptions are introduced as an approximation to reduce the number of parameters:  1. Mutual information between wi and wi-1 is independent of fi-1 if wi-1 is a content word, and independent of ci-i ff wi-1 is a function word, i.e., the following approximations hold;</Paragraph>
    <Paragraph position="15"> if Wi-1 is a function word.</Paragraph>
    <Paragraph position="16"> 2. The appearance of a content word and that of a function word are mutually independent when they are located non-contiguously in a sentence, i.e.,</Paragraph>
    <Paragraph position="18"> if wi-1 and wi are content words, and</Paragraph>
    <Paragraph position="20"> if wi-~ and Wi are function words.</Paragraph>
    <Paragraph position="21"> From these approximations, expression (5) is rewritten as</Paragraph>
    <Paragraph position="23"> wi-l: function word, wi: content word (= ci)</Paragraph>
    <Paragraph position="25"> where PL and PC represent the probabilities of local and global constraints between words. To be more exact, Pc(f i) is the probability that the i-th word isfl knowing that it is a function word, and PG(fi \[f~-l) is the probability that the i-th word isfi given that the most recent function word isfi-1 and also knowing that the i-th word is a function word. Pc(ci) and PG(CilCi-l) are explained in the same way. In other words, Pc(') denotes a probability in the function (or content) word sequences obtained by extracting only function (or content) words from sentences. Notice should be taken that PG(') is used only when two function (or content) words appear non-contiguously. We refer to the model based on equation (10) as &amp;quot;proposed model II.&amp;quot; This approximate equation shows that the probabilities of words in a sentence are expressed as the product of word bigram probabilities and function word (or content word) bigram probabilities, which describe local and global linguistic constraints, respectively. The term word bigram probabilities (local constraints) cf cff c cf function word bigram / content word bigram  as the compensation for the probability of word wi being multiplied twice.</Paragraph>
    <Paragraph position="26"> Figure 3 shows how the word dependencies are taken into account in this model. The probability of each word is determined from the word immediately before it, and also from the preceding word of the same category (function word or content word) ff the category of the word immediately before it is different from that of the current word. The first corresponds to the word bigram probability and the latter corresponds to the function word (or content word) bigram probability, which are computed independently. It is easy to extend this model so as to use a word trigram model or a function word (content word) trigram model.</Paragraph>
    <Paragraph position="27"> The decomposition of probabilities greatly reduces the number of parameters to be estimated. The number of parameters in each model is summarized in Table 1, where V, Vc, Vf is the vocabulary size, the number of content words, and the number of function words, respectively (V = Vc + Vf ). The word trigram model and the proposed model I has O(V 3) parameters, while the proposed model II has only O(I/2) parameters, which is comparable to the word bigram model.</Paragraph>
  </Section>
  <Section position="5" start_page="89" end_page="90" type="metho">
    <SectionTitle>
3. APPLICATION TO SPEECH
RECOGNITION
</SectionTitle>
    <Paragraph position="0"> Since, like N-gram models, the proposed language models are Markov models, they can easily be incorporated into a speech recognition system based on the time-synchronous Viterbi decoding algorithm. They could also be used in rescoring for N-best hypotheses, but it would bring some loss of information.</Paragraph>
    <Paragraph position="1"> Figure 4 shows the network representation of the language model.</Paragraph>
    <Paragraph position="2"> Symbols c i, c i, c k represent content words, andf t, f m, f n represent function words. Each node of the network is a Markov state  The proposed model was compared with the word bigram and tri-gram models in their perplexities for test sentences and in sentence recognition rates. As for the proposed model I, only perplexity was calculated. The ratios of the numbers of parameters were also calculated based on Table 1.</Paragraph>
    <Paragraph position="3"> corresponding to a word pair of either (Ci-l,fi-1) or (fi-1, Ci-1), and each arc is a transition corresponding to a word wl. In the case of the trigram model, each node would correspond to a word pair (wi-2, wi-1 ). Each arc is assigned with a probability value according to equation (5) (model I) or (10) (model H). The number of nodes is 2VcVf and the total number of arcs is 2VcVf V for both model I and model H. In the case of the trigam model, they would be V z and V 3, respectively.</Paragraph>
    <Paragraph position="4"> Ordinary time-synchronous Viterbi decoding controlled by this network is possible. As the numbers of nodes and arcs are still huge although reduced compared with the trigram model, a beam search is necessary in the decoding process.</Paragraph>
  </Section>
class="xml-element"></Paper>