<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1064">
  <Title>A Phonotactic Language Model for Spoken Language Identification</Title>
  <Section position="4" start_page="516" end_page="516" type="metho">
    <SectionTitle>
AM
</SectionTitle>
    <Paragraph position="0"> l instead of multiple language dependent acoustic models</Paragraph>
    <Paragraph position="2"> l is generalized to model both local and global phonotactics.</Paragraph>
  </Section>
  <Section position="5" start_page="516" end_page="519" type="metho">
    <SectionTitle>
3 Bag-of-Sounds Paradigm
</SectionTitle>
    <Paragraph position="0"> The bag-of-sounds concept is analogous to the bag-of-words paradigm originally formulated in the context of information retrieval (IR) and text categorization (TC) (Salton 1971; Berry et al., 1995; Chu-Caroll and Carpenter, 1999). One focus of IR is to extract informative features for document representation. The bag-of-words paradigm represents a document as a vector of counts. It is believed that it is not just the words, but also the co-occurrence of words that distinguish semantic domains of text documents.</Paragraph>
    <Paragraph position="1"> Similarly, it is generally believed in LID that, although the sounds of different spoken languages overlap considerably, the phonotactics differentiates one language from another. Therefore, one can easily draw the analogy between an acoustic token in bag-of-sounds and a word in bag-of-words.</Paragraph>
    <Paragraph position="2"> Unlike words in a text document, the phonotactic information that distinguishes spoken languages is  concealed in the sound waves of spoken languages.</Paragraph>
    <Paragraph position="3"> After transcribing a spoken document into a text like document of tokens, many IR or TC techniques can then be readily applied.</Paragraph>
    <Paragraph position="4"> It is beyond the scope of this paper to discuss what would be a good voice tokenizer. We adopt phoneme size language-independent acoustic tokens to form a unified acoustic vocabulary in our voice tokenizer. Readers are referred to (Ma et al., 2005) for details of acoustic modeling.</Paragraph>
    <Section position="1" start_page="517" end_page="517" type="sub_section">
      <SectionTitle>
3.1 Vector Space Modeling
</SectionTitle>
      <Paragraph position="0"> In human languages, some words invariably occur more frequently than others. One of the most common ways of expressing this idea is known as Zipf's Law (Zipf, 1949). This law states that there is always a set of words which dominates most of the other words of the language in terms of their frequency of use. This is true both of written words and of spoken words. The short-term, or local phonotactics, is devised to describe Zipf's Law.</Paragraph>
      <Paragraph position="1"> The local phonotactic constraints can be typically described by the token n-grams, or phoneme n-grams as in (Ng et al., 2000), which represents short-term statistics such as lexical constraints.</Paragraph>
      <Paragraph position="2"> Suppose that we have a token sequence, t1 t2 t3 t4.</Paragraph>
      <Paragraph position="3"> We derive the unigram statistics from the token sequence itself. We derive the bigram statistics from t1(t2) t2(t3) t3(t4) t4(#) where the token vocabulary is expanded over the token's right context.</Paragraph>
      <Paragraph position="4"> Similarly, we derive the trigram statistics from the</Paragraph>
      <Paragraph position="6"> and right contexts. The # sign is a place holder for free context. In the interest of manageability, we propose to use up to token trigram. In this way, for an acoustic system of Y tokens, we have potentially bigram and Y trigram in the vocabulary.</Paragraph>
      <Paragraph position="8"> Meanwhile, motivated by the ideas of having both short-term and long-term phonotactic statistics, we propose to derive global phonotactics information to account for long-term phonotactics: The global phonotactic constraint is the high-order statistics of n-grams. It represents document level long-term phonotactics such as co-occurrences of n-grams. By representing a spoken document as a count vector of n-grams, also called bag-of-sounds vector, it is possible to explore the relations and higher-order statistics among the diverse n-grams through latent semantic analysis (LSA).</Paragraph>
      <Paragraph position="9"> It is often advantageous to weight the raw counts to refine the contribution of each n-gram to LID. We begin by normalizing the vectors representing the spoken document by making each vector of unit length. Our second weighting is based on the notion that an n-gram that only occurs in a few languages is more discriminative than an n-gram that occurs in nearly every document. We use the inverse-document frequency (idf) weighting scheme (Spark Jones, 1972), in which a word is weighted inversely to the number of documents in which it occurs, by means of () log /()idf w D d w= , where w is a word in the vocabulary of W token n-grams. D is the total number of documents in the training corpus from L languages. Since each language has at least one document in the training corpus, we have D L[?] .</Paragraph>
      <Paragraph position="10"> is the number of documents containing the word w. Letting be the count of word w in document d, we have the weighted count as</Paragraph>
    </Section>
    <Section position="2" start_page="517" end_page="517" type="sub_section">
      <SectionTitle>
3.2 Latent Semantic Analysis
</SectionTitle>
      <Paragraph position="0"> The fundamental idea in LSA is to reduce the dimension of a document vector, W to Q, where QW&lt;&lt; and QD&lt;&lt; , by projecting the problem into the space spanned by the rows of the closest rank-Q matrix to H in the Frobenius norm (Deerwester et al, 1990). Through singular value decomposition (SVD) of H, we construct a modified  With the SVD, we project the D document vectors in H into a reduced space , referred to as Q-space in the rest of this paper. A test document of unknown language ID is mapped to a pseudo-document in the Q-space by matrix</Paragraph>
      <Paragraph position="2"> After SVD, it is straightforward to arrive at a natural metric for the closeness between two spoken documents and in Q-space instead of their original W-dimensional space and .</Paragraph>
      <Paragraph position="4"> g cc indicates the similarity between two vectors, which can be transformed to a distance meas-</Paragraph>
      <Paragraph position="6"> In the forced-choice classification, a test document, supposedly monolingual, is classified into one of the L languages. Note that the test document is unknown to the H matrix. We assume consistency between the test document's intrinsic phonotactic pattern and one of the D patterns, that is extracted from the training data and is presented in the H matrix, so that the SVD matrices still apply to the test document, and Eq.(5) still holds for dimension reduction.</Paragraph>
    </Section>
    <Section position="3" start_page="517" end_page="519" type="sub_section">
      <SectionTitle>
3.3 Bag-of-Sounds Language Classifier
</SectionTitle>
      <Paragraph position="0"> The bag-of-sounds phonotactic LM benefits from several properties of vector space modeling and LSA.</Paragraph>
      <Paragraph position="1">  1) It allows for representing a spoken document as a vector of n-gram features, such as unigram, bigram, trigram, and the mixture of them; 2) It provides a well-defined distance metric for measurement of phonotactic distance between spoken documents; 3) It processes spoken documents in a lower dimensional Q-space, that makes the bag-of- null sounds phonotactic language modeling,</Paragraph>
      <Paragraph position="3"> and classification computationally manageable.</Paragraph>
      <Paragraph position="4"> Suppose we have only one prototypical vector and its projection in the Q-space to represent language l. Applying LSA to the term-document</Paragraph>
      <Paragraph position="6"> Apparently, it is very restrictive for each language to have just one prototypical vector, also referred to as a centroid. The pattern of language distribution is inherently multi-modal, so it is unlikely well fitted by a single vector. One solution to this problem is to span the language space with multiple vectors. Applying LSA to a term-document matrix :HW L'x , where LL assuming each language l is represented by a set of</Paragraph>
      <Paragraph position="8"> Ph , a new classifier, using k-nearest neighboring rule (Duda and Hart, 1973) , is formulated, named k-nearest classifier (KNC):</Paragraph>
      <Paragraph position="10"> for language l , as subset of corpus Ohm , and . To derive the M vectors, we choose to carry out vector quantization (VQ) to partition D  D can then be merged to form a super-document, which is further projected into a Q-space vector . This results in M prototypical centroids . Using KNC, a test vector is compared with M vectors to arrive at the k-nearest neighbors for each language, which can be computationally expensive when M is large.</Paragraph>
      <Paragraph position="12"> Alternatively, one can account for multi-modal distribution through finite mixture model. A mixture model is to represent the M discrete components with soft combination. To extend the KNC into a statistical framework, it is necessary to map our distance metric Eq.(6) into a probability measure. One way is for the distance measure to induce a family of exponential distributions with pertinent marginality constraints. In practice, what we need is a reasonable probability distribution, which sums to one, to act as a lookup table for the distance measure. We here choose to use the empirical multivariate distribution constructed by allocating the total probability mass in proportion to the distances observed with the training data. In short, this reduces the task to a histogram normalization. In this way, we map the distance to a conditional probability distribution</Paragraph>
      <Paragraph position="14"> subject to . Now that we are in the probability domain, techniques such as mixture smoothing can be readily applied to model a language class with finer fitting.</Paragraph>
      <Paragraph position="16"> Let's re-visit the task of L language forced-choice classification. Similar to KNC, suppose we have M centroids in the Q-space for each language l. Each centroid represents a class. The class conditional probability can be described as a linear combination of</Paragraph>
      <Paragraph position="18"> To establish fair comparison with P-PRLM, as shown in Figure 3, we devise our bag-of-sounds classifier to solely use the LM score</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="519" end_page="521" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> This section will experimentally analyze the performance of the proposed bag-of-sounds framework using the 1996 NIST Language Recognition Evaluation (LRE) data. The database was intended to establish a baseline of performance capability for language recognition of conversational telephone speech. The database contains recorded speech of 12 languages: Arabic, English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil and Vietnamese. We use the training set and development set from LDC CallFriend corpus  as the training data. Each conversation is segmented into overlapping sessions of about 30 seconds each, resulting in about 12,000 sessions for each language. The evaluation set consists of 1,492 30-sec sessions, each distributed among the various languages of interest. We treat a 30-sec session as a spoken document in both training and testing. We report error rates (ER) of the 1,492 test trials.</Paragraph>
    <Section position="1" start_page="519" end_page="520" type="sub_section">
      <SectionTitle>
4.1 Effect of Acoustic Vocabulary
</SectionTitle>
      <Paragraph position="0"> The choice of n-gram affects the performance of LID systems. Here we would like to see how a better choice of acoustic vocabulary can help convert a spoken document into a phonotactically discriminative space. There are two parameters that determine the acoustic vocabulary: the choice of acoustic token, and the choice of n-grams. In this paper, the former concerns the size of an acoustic system Y in the unified front-end. It is studied in more details in (Ma et al., 2005). We set Y to 32 in  See http://www.ldc.upenn.edu/. The overlap between 1996 NIST evaluation data and CallFriend database has been removed from training data as suggested in the 2003 NIST LRE  this experiment; the latter decides what features to be included in the vector space. The vector space modeling allows for multiple heterogeneous features in one vector. We introduce three types of acoustic vocabulary (AV) with mixture of token unigram, bigram, and trigram: a) AV1: 32 broad class phonemes as unigram, selected from 12 languages, also referred to as P-ASM as detailed in (Ma et al., 2005) b) AV2: AV1 augmented by 32 bigrams of AV1, amounting to 1,056 tokens  We carry out experiments with KNC classifier of 4,800 centroids. Applying k-nearest-neighboring rule, k is empirically set to 3. The error rates are reported in Table 1 for the experiments over the three AV types. It is found that high-order token n-grams improve LID performance. This reaffirms many previous findings that n-gram phonotactics serves as a valuable cue in LID.</Paragraph>
    </Section>
    <Section position="2" start_page="520" end_page="520" type="sub_section">
      <SectionTitle>
4.2 Effect of Model Size
</SectionTitle>
      <Paragraph position="0"> As discussed in KNC, one would expect to improve the phonotactic model by using more centroids. Let's examine how the number of centroid vectors M affects the performance of KNC. We set the acoustic system size Y to 128, k-nearest to 3, and only use token bigrams in the bag-of-sounds vector. In Table 2, it is not surprising to find that the performance improves as M increases. However, it is not practical to have large M because comparisons need to take place in each test trial.</Paragraph>
      <Paragraph position="1">  To reduce computation, MMC attempts to use less number of mixtures M to represent the phonotactic space. With the smoothing effect of the mixture model, we expect to use less computation to achieve similar performance as KNC. In the experiment reported in Table 3, we find that MMC (M=1,024) achieves 14.9% error rate, which almost equalizes the best result in the KNC experiment (M=12,000) with much less computation.</Paragraph>
    </Section>
    <Section position="3" start_page="520" end_page="521" type="sub_section">
      <SectionTitle>
4.3 Discussion
</SectionTitle>
      <Paragraph position="0"> The bag-of-sounds approach has achieved equal success in both 1996 and 2003 NIST LRE databases. As more results are published on the 1996 NIST LRE database, we choose it as the platform of comparison. In Table 4, we report the performance across different approaches in terms of error rate for a quick comparison. MMC presents a 12.4% ER reduction over the best reported result  (Torres-Carrasquillo et al., 2002).</Paragraph>
      <Paragraph position="1"> It is interesting to note that the bag-of-sounds classifier outperforms its P-PRLM counterpart by a wide margin (14.9% vs 22.0%). This is attributed to the global phonotactic features in</Paragraph>
      <Paragraph position="3"> performance gain in (Torres-Carrasquillo et al., 2002; Singer et al., 2003) was obtained mainly by fusing scores from several classifiers, namely GMM, P-PRLM and SVM, to benefit from both acoustic and language model scores. Noting that the bag-of-sounds classifier in this work solely relies on the LM score, it is believed that fusing with scores from other classifiers will further boost the  Besides the error rate reduction, the bag-of-sounds approach also simplifies the on-line computing procedure over its P-PRLM counterpart. It would be interesting to estimate the on-line computational need of MMC. The cost incurred has two main components: 1) the construction of the  Previous results are also reported in DCF, DET, and equal error rate (EER). Comprehensive benchmarking for bag-of-sounds phonotactic LM will be reported soon.  Results extracted from (Torres-Carrasquillo et al., 2002)  pseudo document vector, as done via Eq.(5); 2) vector comparisons. The computing cost is estimated to be per test trial (Bellegarda, 2000). For typical values of Q, this amounts to less than 0.05 Mflops. While this is more expensive than the usual table look-up in conventional n-gram LM, the performance improvement is able to justify the relatively modest computing overhead.</Paragraph>
      <Paragraph position="4"> LLM'=x</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="521" end_page="521" type="metho">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have proposed a phonotactic LM approach to LID problem. The concept of bag-of-sounds is introduced, for the first time, to model phonotactics present in a spoken language over a larger context.</Paragraph>
    <Paragraph position="1"> With bag-of-sounds phonotactic LM, a spoken document can be treated as a text-like document of acoustic tokens. This way, the well-established LSA technique can be readily applied. This novel approach not only suggests a paradigm shift in LID, but also brings 12.4% error rate reduction over one of the best reported results on the 1996 NIST LRE data. It has proven to be very successful.</Paragraph>
    <Paragraph position="2"> We would like to extend this approach to other spoken document categorization tasks. In monolingual spoken document categorization, we suggest that the semantic domain can be characterized by latent phonotactic features. Thus it is straightforward to extend the proposed bag-of-sounds framework to spoken document categorization.</Paragraph>
  </Section>
class="xml-element"></Paper>