<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1064">
  <Title>A Phonotactic Language Model for Spoken Language Identification</Title>
  <Section position="3" start_page="515" end_page="516" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Formal evaluations conducted by the National Institute of Science and Technology (NIST) in recent years demonstrated that the most successful approach to LID used the phonotactic content of the voice signal to discriminate between a set of languages (Singer et al., 2003). We briefly discuss previous work cast in the formalism mentioned above: tokenization, statistical language modeling, and language identification. A typical LID system is illustrated in Figure 1 (Zissman, 1996), where language dependent voice tokenizers (VT) and language models (LM) are deployed in the Parallel PRLM architecture, or P-PRLM.</Paragraph>
    <Paragraph position="1"> Figure 1. L monolingual phoneme recognition front-ends are used in parallel to tokenize the input utterance, which is analyzed by LMs to predict the spoken language</Paragraph>
    <Section position="1" start_page="515" end_page="515" type="sub_section">
      <SectionTitle>
2.1 Voice Tokenization
</SectionTitle>
      <Paragraph position="0"> A voice tokenizer is a speech recognizer that converts a spoken document into a sequence of tokens. As illustrated in Figure 2, a token can be of different sizes, ranging from a speech feature frame, to a phoneme, to a lexical word. A token is defined to describe a distinct acoustic/phonetic activity. In early research, low level spectral  http://www.nist.gov/speech/tests/ frames, which are assumed to be independent of each other, were used as a set of prototypical spectra for each language (Sugiyama, 1991). By adopting hidden Markov models, people moved beyond low-level spectral analysis towards modeling a frame sequence into a larger unit such as a phoneme and even a lexical word.</Paragraph>
      <Paragraph position="1"> Since the lexical word is language specific, the phoneme becomes the natural choice when building a language-independent voice tokenization front-end. Previous studies show that parallel language-dependent phoneme tokenizers effectively serve as the tokenization front-ends with P-PRLM being the typical example. However, a language-independent phoneme set has not been explored yet experimentally. In this paper, we would like to explore the potential of voice tokenization using a unified phoneme set.</Paragraph>
      <Paragraph position="2"> Figure 2 Tokenization at different resolutions</Paragraph>
    </Section>
    <Section position="2" start_page="515" end_page="516" type="sub_section">
      <SectionTitle>
2.2 n-gram Language Model
</SectionTitle>
      <Paragraph position="0"> With the sequence of tokens, we are able to estimate an n-gram language model (LM) from the statistics. It is generally agreed that phonotactics, i.e. the rules governing the phone/phonemes sequences admissible in a language, carry more language discriminative information than the phonemes themselves. An n-gram LM over the tokens describes well n-local phonotactics among neighboring tokens. While some systems model the phonotactics at the frame level (Torres-Carrasquillo et al., 2002), others have proposed P-PRLM. The latter has become one of the most promising solutions so far (Zissman, 1996).</Paragraph>
      <Paragraph position="1"> A variety of cues can be used by humans and machines to distinguish one language from another.</Paragraph>
      <Paragraph position="2"> These cues include phonology, prosody, morphology, and syntax in the context of an utterance.</Paragraph>
      <Paragraph position="4"> However, global phonotactic cues at the level of utterance or spoken document remains unexplored in previous work. In this paper, we pay special attention to it. A spoken language always contains a set of high frequency function words, prefixes, and suffixes, which are realized as phonetic token sub-strings in the spoken document. Individually, those substrings may be shared across languages. However, the pattern of their co-occurrences discriminates one language from another.</Paragraph>
      <Paragraph position="5"> Perceptual experiments have shown (Muthusamy, 1994) that with adequate training, human listeners' language identification ability increases when given longer excerpts of speech. Experiments have also shown that increased exposure to each language and longer training sessions improve listeners' language identification performance. Although it is not entirely clear how human listeners make use of the high-order phonotactic/prosodic cues present in longer spans of a spoken document, strong evidence shows that phonotactics over larger context provides valuable LID cues beyond n-gram, which will be further attested by our experiments in Section 4.</Paragraph>
    </Section>
    <Section position="3" start_page="516" end_page="516" type="sub_section">
      <SectionTitle>
2.3 Language Classifier
</SectionTitle>
      <Paragraph position="0"> The task of a language classifier is to make good use of the LID cues that are encoded in the model l l to hypothesize from among L languages, L , as the one that is actually spoken in a spoken document O. The LID model</Paragraph>
      <Paragraph position="2"> PRLM refers to extracted information from acoustic model and n-gram LM for language l. We have</Paragraph>
      <Paragraph position="4"> mum-likelihood classifier can be formulated as follows:</Paragraph>
      <Paragraph position="6"> The exact computation in Eq.(1) involves summing over all possible decoding of token sequences T given O. In many implementations, it is approximated by the maximum over all sequences in the sum by finding the most likely token sequence, , for each language l, using the</Paragraph>
      <Paragraph position="8"> Intuitively, individual sounds are heavily shared among different spoken languages due to the common speech production mechanism of humans.</Paragraph>
      <Paragraph position="9"> Thus, the acoustic score has little language discriminative ability. Many experiments (Yan and Barnard, 1995; Zissman, 1996) have further attested that the n-gram LM score provides more language discriminative information than their acoustic counterparts. In Figure 1, the decoding of voice tokenization is governed by the acoustic model</Paragraph>
      <Paragraph position="11"> POTl and a token sequence . The n-gram LM derives the n-local phonotactic score</Paragraph>
      <Paragraph position="13"> Clearly, the n-gram LM suffers the major shortcoming of having not exploited the global phonotactics in the larger context of a spoken utterance.</Paragraph>
      <Paragraph position="14"> Speech recognition researchers have so far chosen to only use n-gram local statistics for primarily pragmatic reasons, as this n-gram is easier to attain.</Paragraph>
      <Paragraph position="15"> In this work, a language independent voice tokenization front-end is proposed, that uses a unified acoustic model</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>