<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1003">
  <Title>Cross-Language Information Retrieval; Cross-Language Unigram Model; Contemporaneous English Articles; Baseline Chinese Acoustic Model; Baseline Chinese Language Model; Chinese Dictionary; ASR Automatic Transcription; English Article Aligned with Mandarin Story; Machine Translation; Statistical Translation Lexicon</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Data Sparseness in Language Modeling
</SectionTitle>
    <Paragraph position="0"> Statistical techniques have been remarkably successful in automatic speech recognition (ASR) and natural language processing (NLP) over the last two decades. This success, however, depends crucially on the availability of accurate and large amounts of suitably annotated training data, and it is difficult to build a usable statistical model in their absence. (This research was supported by the National Science Foundation (Grant Nos. ITR-0225656 and IIS-9982329) and the Office of Naval Research (Contract No. N00014-01-1-0685).)</Paragraph>
    <Paragraph position="1"> Most of the success, therefore, has been witnessed in the so called resource-rich languages. More recently, there has been an increasing interest in languages such as Mandarin and Arabic for ASR and NLP, and data resources are being created for them at considerable cost. The data-resource bottleneck, however, is likely to remain for a majority of the world's languages in the foreseeable future.</Paragraph>
    <Paragraph position="2"> Methods have been proposed to bootstrap acoustic models for ASR in resource-deficient languages by reusing acoustic models from resource-rich languages (Schultz and Waibel, 1998; Byrne et al., 2000). Morphological analyzers, noun-phrase chunkers, POS taggers, etc., have also been developed for resource-deficient languages by exploiting translated or parallel text (Yarowsky et al., 2001). Khudanpur and Kim (2002) recently proposed using cross-lingual information retrieval (CLIR) and machine translation (MT) to improve a statistical language model (LM) in a resource-deficient language by exploiting copious amounts of text available in resource-rich languages. When transcribing a news story in a resource-deficient language, their core idea is to use the first-pass output of a rudimentary ASR system as a CLIR query, identify a contemporaneous English document on that news topic, and then apply MT to obtain a rough translation which, even if not fluent, is adequate to update estimates of word frequencies and the LM vocabulary. They report up to a 28% reduction in perplexity on Chinese text from the Hong Kong News corpus.</Paragraph>
    <Paragraph position="3"> In spite of their considerable success, some shortcomings remain in the method used by Khudanpur and Kim (2002). Specifically, stochastic translation lexicons estimated using the IBM method (Brown et al., 1993) from a fairly large sentence-aligned Chinese-English parallel corpus are used in their approach -- a considerable demand for a resource-deficient language. It is suggested that an easier-to-obtain document-aligned comparable corpus may suffice, but no results are reported. Furthermore, for each Mandarin news story, only the single best-matching English article obtained via CLIR is translated and used for priming the Chinese LM, regardless of the CLIR similarity score; other well-matching English articles are not considered. This issue clearly deserves further attention. Finally, ASR results are not reported in their work, even though their proposed solution is clearly motivated by an ASR task.</Paragraph>
    <Paragraph position="4"> We address these three issues in this paper.</Paragraph>
    <Paragraph position="5"> Section 2 begins, for the sake of completeness, with a review of the cross-lingual story-specific LM proposed by Khudanpur and Kim (2002). A notion of cross-lingual lexical triggers is proposed in Section 3, which overcomes the need for a sentence-aligned parallel corpus for obtaining translation lexicons. After a brief detour to describe topic-dependent LMs in Section 4, a description of the ASR task is provided in Section 5, and ASR results on Mandarin Broadcast News are presented in Section 6. The issue of how many English articles to retrieve and translate into Chinese is resolved by a likelihood-based scheme proposed in Section 6.1.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Cross-Lingual Story-Specific LMs
</SectionTitle>
    <Paragraph position="0"> For the sake of illustration, consider the task of sharpening a Chinese language model for transcribing Mandarin news stories by using a large corpus of contemporaneous English newswire text. Mandarin Chinese is, of course, not resource-deficient for language modeling -- hundreds of millions of words of text are available on-line. However, we choose it for our experiments partly because it is sufficiently different from English to pose a real challenge, and partly because the availability of large text corpora permits us to simulate controlled resource deficiency.</Paragraph>
    <Paragraph position="1"> Let $d^C_1, \ldots, d^C_N$ denote the text of $N$ test stories to be transcribed by an ASR system, and let $d^E_1, \ldots, d^E_N$ denote their corresponding or aligned English newswire articles. Correspondence here does not imply that the English document $d^E_i$ needs to be an exact translation of the Mandarin story $d^C_i$.</Paragraph>
    <Paragraph position="2"> It is quite adequate, for instance, if the two stories report the same news event. This approach is expected to be helpful even when the English document is merely on the same general topic as the Mandarin story, although the closer the content of a pair of articles the better the proposed methods are likely to work. Assume for the time being that a sufficiently good Chinese-English story alignment is given.</Paragraph>
    <Paragraph position="3"> Assume further that we have at our disposal a stochastic translation dictionary -- a probabilistic model of the form $P_T(c \,|\, e)$ -- which provides the probability of the Chinese translation $c \in \mathcal{C}$ of each English word $e \in \mathcal{E}$, where $\mathcal{C}$ and $\mathcal{E}$ respectively denote our Chinese and English vocabularies.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Computing a Cross-Lingual Unigram LM
</SectionTitle>
      <Paragraph position="0"> Let $\hat{P}(e \,|\, d^E_i)$ denote the relative frequency of a word $e$ in the document $d^E_i$. It seems plausible that
$$P_{\text{CL-unigram}}(c \,|\, d^E_i) \;=\; \sum_{e \in \mathcal{E}} P_T(c \,|\, e)\, \hat{P}(e \,|\, d^E_i) \qquad (1)$$
would be a good unigram model for the $i$-th Mandarin story $d^C_i$. We use this cross-lingual unigram statistic to sharpen a statistical Chinese LM used for processing the test story $d^C_i$. One way to do this is via linear interpolation
$$P(c_k \,|\, c_{k-1}, c_{k-2}, d^E_i) \;=\; \lambda\, P_{\text{CL-unigram}}(c_k \,|\, d^E_i) + (1-\lambda)\, P(c_k \,|\, c_{k-1}, c_{k-2}) \qquad (2)$$
of the cross-lingual unigram model (1) with a static trigram model for Chinese, where the interpolation weight $\lambda$ may be chosen off-line to maximize the likelihood of some held-out Mandarin stories. The improvement in (2) is expected from the fact that, unlike the static text from which the Chinese trigram LM is estimated, $d^E_i$ is semantically close to $d^C_i$, and even the adjustment of unigram statistics based on a stochastic translation model may help.</Paragraph>
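      <Paragraph> The following is a minimal Python sketch of how the statistic (1) and the interpolated model (2) could be computed. The data structures -- a token list for the English article, a nested dictionary for the translation lexicon, a callable for the static Chinese trigram LM -- and the interpolation weight shown are illustrative assumptions, not the authors' implementation.</Paragraph>
      <Paragraph>
from collections import Counter

def cl_unigram(english_doc, trans_lex):
    """Cross-lingual unigram (1): P(c|d^E) = sum_e P_T(c|e) * relfreq(e|d^E).

    english_doc: list of English tokens in the aligned article d^E.
    trans_lex:   dict mapping an English word e to {Chinese word c: P_T(c|e)}.
    """
    n = len(english_doc)
    rel_freq = {e: cnt / n for e, cnt in Counter(english_doc).items()}
    p_cl = Counter()
    for e, p_e in rel_freq.items():
        for c, p_c_given_e in trans_lex.get(e, {}).items():
            p_cl[c] += p_c_given_e * p_e
    return p_cl

def interpolated_prob(c, history, p_cl, trigram_lm, lam=0.2):
    """Interpolated model (2); trigram_lm(c, history) is the static Chinese trigram.

    The weight lam shown here is arbitrary; in the text it is tuned on held-out stories.
    """
    return lam * p_cl.get(c, 0.0) + (1.0 - lam) * trigram_lm(c, history)
      </Paragraph>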
      <Paragraph position="5"> Figure 1 (captioned in part &amp;quot;... of a Chinese Language Model using English Text&amp;quot;) shows the data flow in this cross-lingual LM adaptation approach: the output of the first pass of an ASR system is used by a CLIR system to find an English document $d^E_i$, from which the statistic of (1) is computed, and the ASR system uses the LM of (2) in a second pass.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Obtaining Matching English Documents
</SectionTitle>
      <Paragraph position="0"> To illustrate how one may obtain the English document $d^E_i$ to match a Mandarin story $d^C_i$, let us assume that we also have a stochastic reverse-translation lexicon $P_T(e \,|\, c)$. One obtains from the first-pass ASR output, cf. Figure 1, the relative frequency estimate $\hat{P}(c \,|\, d^C_i)$ of the Chinese words $c \in \mathcal{C}$ in $d^C_i$, and uses the translation lexicon $P_T(e \,|\, c)$ to compute, for all $e \in \mathcal{E}$,
$$\hat{P}(e \,|\, d^C_i) \;=\; \sum_{c \in \mathcal{C}} P_T(e \,|\, c)\, \hat{P}(c \,|\, d^C_i), \qquad (3)$$
an English bag-of-words representation of the Mandarin story $d^C_i$ as used in standard vector-based information retrieval. The document with the highest TF-IDF weighted cosine-similarity to $d^C_i$ is selected:
$$d^E_i \;=\; \arg\max_{d^E} \, \text{sim}\big(d^E, \hat{P}(\cdot \,|\, d^C_i)\big).$$
Readers familiar with the information retrieval literature will recognize this to be the standard query-translation approach to CLIR.</Paragraph>
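      <Paragraph> A sketch of this query-translation CLIR step follows. The reverse lexicon, the precomputed TF-IDF vectors for the English collection, and the IDF table are hypothetical inputs standing in for resources the text assumes; this is illustrative, not the authors' retrieval system.</Paragraph>
      <Paragraph>
import math
from collections import Counter

def translate_query(chinese_doc, reverse_lex):
    """Build the English bag-of-words query of (3): P(e|d^C) = sum_c P_T(e|c) * relfreq(c|d^C)."""
    n = len(chinese_doc)
    rel_freq = {c: cnt / n for c, cnt in Counter(chinese_doc).items()}
    query = Counter()
    for c, p_c in rel_freq.items():
        for e, p_e_given_c in reverse_lex.get(c, {}).items():
            query[e] += p_e_given_c * p_c
    return query

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_english_article(chinese_doc, english_tfidf_vectors, reverse_lex, idf):
    """Select the contemporaneous English article with the highest TF-IDF cosine similarity."""
    query = translate_query(chinese_doc, reverse_lex)
    weighted_query = {e: w * idf.get(e, 0.0) for e, w in query.items()}
    return max(english_tfidf_vectors.items(),
               key=lambda item: cosine(weighted_query, item[1]))[0]
      </Paragraph>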
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Obtaining Stochastic Translation Lexicons
</SectionTitle>
      <Paragraph position="0"> The translation lexicons $P_T(c \,|\, e)$ and $P_T(e \,|\, c)$ may be created out of an available electronic translation lexicon, with multiple translations of a word being treated as equally likely. Stemming and other morphological analyses may be applied to increase the vocabulary-coverage of the translation lexicons.</Paragraph>
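      <Paragraph> For instance, a flat bilingual dictionary could be turned into such a uniform stochastic lexicon as in the sketch below; the dictionary format is a hypothetical illustration.</Paragraph>
      <Paragraph>
def uniform_lexicon(bilingual_dict):
    """Turn a plain bilingual dictionary {English word: [Chinese translations]}
    into a stochastic lexicon P_T(c|e) with equally likely translations."""
    return {e: {c: 1.0 / len(cs) for c in cs}
            for e, cs in bilingual_dict.items() if cs}

# Example: two listed translations each receive probability 0.5.
lex = uniform_lexicon({"bank": ["银行", "河岸"]})
      </Paragraph>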
      <Paragraph position="1"> Alternatively, they may be obtained automatically from a parallel corpus of translated and sentence-aligned Chinese-English text using statistical machine translation techniques, such as the publicly available GIZA++ tools (Och and Ney, 2000), as done by Khudanpur and Kim (2002). Unlike standard MT systems, however, we apply the translation models to entire articles, one word at a time, to get a bag of translated words -- cf. (1) and (3).</Paragraph>
      <Paragraph position="2"> Finally, for truly resource-deficient languages, one may obtain a translation lexicon via optical character recognition from a printed bilingual dictionary (cf. Doermann et al. (2002)). This task is arguably easier than obtaining a large LM training corpus.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Cross-Lingual Lexical Triggers
</SectionTitle>
    <Paragraph position="0"> It seems plausible that most of the information one gets from the cross-lingual unigram LM of (1) is in the form of the altered statistics of topic-specific Chinese words, conveyed by the statistics of content-bearing English words in the matching story. The translation lexicon used for obtaining this information, however, is an expensive resource. Yet, if one is only interested in the conditional distribution of Chinese words given some English words, there is no reason to require translation as an intermediate step. In a monolingual setting, the mutual information between lexical pairs co-occurring anywhere within a long &amp;quot;window&amp;quot; of each other has been used to capture statistical dependencies not covered by $n$-gram LMs (Rosenfeld, 1996; Tillmann and Ney, 1997). We use this inspiration to propose the following notion of cross-lingual lexical triggers.</Paragraph>
    <Paragraph position="1"> In a monolingual setting, a pair of words $(a, b)$ is considered a trigger-pair if, given a word-position in a sentence, the occurrence of $a$ in any of the preceding word-positions significantly alters the (conditional) probability that the following word in the sentence is $b$: $a$ is said to trigger $b$. E.g., the occurrence of &amp;quot;either&amp;quot; significantly increases the probability of &amp;quot;or&amp;quot; subsequently in the sentence. The set of preceding word-positions is variably defined to include all words from the beginning of the sentence, paragraph or document, or is limited to a fixed number of preceding words, limited of course by the beginning of the sentence, paragraph or document.</Paragraph>
    <Paragraph position="2"> In the cross-lingual setting, we consider a pair of words $(e, c)$, $e \in \mathcal{E}$ and $c \in \mathcal{C}$, to be a trigger-pair if, given an English-Chinese pair of aligned documents, the occurrence of $e$ in the English document significantly alters the (conditional) probability that the word $c$ appears in the Chinese document: $e$ is said to trigger $c$. It is plausible that translation-pairs will be natural candidates for trigger-pairs. It is, however, not necessary for a trigger-pair to also be a translation-pair. E.g., the occurrence of Belgrade in the English document may trigger the Chinese transliterations of Serbia and Kosovo, and possibly the translations of China, embassy and bomb! By inferring trigger-pairs from a document-aligned corpus of Chinese-English articles, we expect to be able to discover semantically or topically related pairs in addition to translation equivalences.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Identification of Cross-Lingual Triggers
</SectionTitle>
      <Paragraph position="0"> Average mutual information, which measures how much knowing the value of one random variable reduces the uncertainty about another, has been used to identify trigger-pairs. We compute the average mutual information for every English-Chinese word pair $(e, c)$ as follows.</Paragraph>
      <Paragraph position="1"> Let $\{(d^E_i, d^C_i)\}$, $i = 1, \ldots, N$, now be a document-aligned training corpus of English-Chinese article pairs. Let $df(e, c)$ denote the document frequency, i.e., the number of aligned article-pairs, in which $e$ occurs in the English article and $c$ occurs in the Chinese article.</Paragraph>
      <Paragraph position="2"> Let $df(e, \bar{c})$ denote the number of aligned article-pairs in which $e$ occurs in the English article but $c$ does not occur in the Chinese article. Let
$$P(e, c) \;=\; \frac{df(e, c)}{N} \quad\text{and}\quad P(e, \bar{c}) \;=\; \frac{df(e, \bar{c})}{N}.$$</Paragraph>
      <Paragraph position="4"> The quantities $P(\bar{e}, c)$ and $P(\bar{e}, \bar{c})$ are similarly defined. Next let $df(e)$ denote the number of English articles in which $e$ occurs, and define $P(e) = \frac{df(e)}{N}$; the marginals $P(\bar{e})$, $P(c)$ and $P(\bar{c})$ are defined analogously. The average mutual information for the pair $(e, c)$ is then
$$I(e; c) \;=\; \sum_{e' \in \{e, \bar{e}\}} \; \sum_{c' \in \{c, \bar{c}\}} P(e', c') \log \frac{P(e', c')}{P(e')\, P(c')}.$$
We propose to select word pairs with high mutual information as cross-lingual lexical triggers.</Paragraph>
      <Paragraph position="5"> There are $|\mathcal{E}| \times |\mathcal{C}|$ possible English-Chinese word pairs, which may be prohibitively many to search for the pairs with the highest mutual information.</Paragraph>
      <Paragraph position="6"> We filter out infrequent words in each language, say, words appearing fewer than 5 times, then measure $I(e; c)$ for all possible pairs from the remaining words, sort them by $I(e; c)$, and select, say, the top 1 million pairs.</Paragraph>
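      <Paragraph> The trigger-selection procedure of this subsection is sketched below in Python, assuming the aligned corpus is given as a list of (English word set, Chinese word set) pairs and using document frequency as the frequency cutoff; the function and parameter names are illustrative.</Paragraph>
      <Paragraph>
import math
from collections import Counter
from itertools import product

def cross_lingual_triggers(aligned_pairs, min_count=5, top_k=1_000_000):
    """Select the (e, c) pairs with the highest average mutual information I(e; c).

    aligned_pairs: list of (set of English words, set of Chinese words), one per
    aligned article pair; min_count and top_k follow the thresholds quoted in the text.
    """
    n = len(aligned_pairs)
    df_e, df_c, df_ec = Counter(), Counter(), Counter()
    for eng_words, chi_words in aligned_pairs:
        df_e.update(eng_words)
        df_c.update(chi_words)
        df_ec.update(product(eng_words, chi_words))

    def avg_mi(e, c):
        # Joint document-occurrence counts for the four indicator outcomes.
        joint = {(1, 1): df_ec[(e, c)],
                 (1, 0): df_e[e] - df_ec[(e, c)],
                 (0, 1): df_c[c] - df_ec[(e, c)]}
        joint[(0, 0)] = n - sum(joint.values())
        total = 0.0
        for (has_e, has_c), cnt in joint.items():
            if cnt == 0:
                continue
            p_joint = cnt / n
            p_e = (df_e[e] if has_e else n - df_e[e]) / n
            p_c = (df_c[c] if has_c else n - df_c[c]) / n
            total += p_joint * math.log(p_joint / (p_e * p_c))
        return total

    # Only pairs that co-occur in at least one aligned article pair are scored here.
    candidates = [pair for pair in df_ec
                  if df_e[pair[0]] >= min_count and df_c[pair[1]] >= min_count]
    candidates.sort(key=lambda pair: avg_mi(*pair), reverse=True)
    return candidates[:top_k]
      </Paragraph>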
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Estimating Trigger LM Probabilities
</SectionTitle>
      <Paragraph position="0"> Once we have chosen a set of trigger-pairs, the next step is to estimate a probability $P_{\text{Trig}}(c \,|\, e)$ to use in lieu of the translation probability $P_T(c \,|\, e)$ in (1), and a probability $P_{\text{Trig}}(e \,|\, c)$ to use in lieu of $P_T(e \,|\, c)$ in (3).</Paragraph>
      <Paragraph position="1"> Following the maximum likelihood approach proposed by Tillmann and Ney (1997), one could choose the trigger probability $P_{\text{Trig}}(c \,|\, e)$ to be based on the unigram frequency of $c$ among the Chinese word tokens in the subset of aligned documents whose English side contains $e$:
$$P_{\text{Trig}}(c \,|\, e) \;=\; \frac{\sum_{i:\, e \in d^E_i} \#(c;\, d^C_i)}{\sum_{i:\, e \in d^E_i} |d^C_i|}, \qquad (4)$$
where $\#(c;\, d^C_i)$ denotes the number of times $c$ occurs in $d^C_i$ and $|d^C_i|$ the length of $d^C_i$.</Paragraph>
      <Paragraph position="3"> As an ad hoc alternative to (4), we also use
$$P_{\text{Trig}}(c \,|\, e) \;=\; \frac{I(e; c)}{\sum_{c' \in \mathcal{C}} I(e; c')}, \qquad (5)$$
where we set $I(e; c) = 0$ whenever $(e, c)$ is not a trigger-pair, and find it to be somewhat more effective (cf. Section 6.2). Thus (5) is used henceforth in this paper. Analogous to (1), we set
$$P_{\text{Trig-unigram}}(c \,|\, d^E_i) \;=\; \sum_{e \in \mathcal{E}} P_{\text{Trig}}(c \,|\, e)\, \hat{P}(e \,|\, d^E_i) \qquad (6)$$
and, again, we build the interpolated model
$$P(c_k \,|\, c_{k-1}, c_{k-2}, d^E_i) \;=\; \lambda\, P_{\text{Trig-unigram}}(c_k \,|\, d^E_i) + (1-\lambda)\, P(c_k \,|\, c_{k-1}, c_{k-2}). \qquad (7)$$</Paragraph>
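      <Paragraph> A sketch of the MI-normalized trigger probability (5) and the trigger-based unigram statistic (6) is given below; the dictionary of mutual-information values is assumed to be restricted to the selected trigger-pairs. Since (7) has exactly the form of (2), the earlier interpolation sketch applies unchanged.</Paragraph>
      <Paragraph>
from collections import Counter, defaultdict

def trigger_probs(mi):
    """P_Trig(c|e) = I(e;c) / sum_c' I(e;c'), with I(e;c) = 0 for non-trigger pairs (eq. 5).

    mi: dict {(e, c): I(e; c)} restricted to the selected trigger-pairs.
    """
    totals = defaultdict(float)
    for (e, c), i in mi.items():
        totals[e] += i
    p_trig = defaultdict(dict)
    for (e, c), i in mi.items():
        p_trig[e][c] = i / totals[e]
    return p_trig

def trig_unigram(english_doc, p_trig):
    """Trigger-based cross-lingual unigram statistic (6), analogous to (1)."""
    n = len(english_doc)
    rel_freq = {e: cnt / n for e, cnt in Counter(english_doc).items()}
    p = Counter()
    for e, p_e in rel_freq.items():
        for c, p_c_given_e in p_trig.get(e, {}).items():
            p[c] += p_c_given_e * p_e
    return p
      </Paragraph>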
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Topic-Dependent Language Models
</SectionTitle>
    <Paragraph position="0"> The linear interpolation of the story-dependent unigram models (1) and (6) with a story-independent trigram model, as described above, is very reminiscent of monolingual topic-dependent language models (cf. e.g. (Iyer and Ostendorf, 1999)). This motivates us to construct topic-dependent LMs and contrast their performance with these models.</Paragraph>
    <Paragraph position="1"> To this end, we represent each Chinese article in the training corpus by a bag-of-words vector, and cluster the vectors using a standard K-means algorithm. We use random initialization to seed the algorithm, and a standard TF-IDF weighted cosine-similarity as the &amp;quot;metric&amp;quot; for clustering. We perform a few iterations of the K-means algorithm, and deem the resulting clusters as representing different topics. We then use a bag-of-words centroid created from all the articles in a cluster to represent each topic. Topic-dependent trigram LMs, denoted $P_t(c_k \,|\, c_{k-1}, c_{k-2})$, are also computed for each topic, exclusively from the articles in the $t$-th cluster, $1 \le t \le T$.</Paragraph>
    <Paragraph position="2"> Each Mandarin test story is represented by a bag-of-words vector $\hat{P}(c \,|\, d^C_i)$ generated from the first-pass ASR output, and the topic-centroid $t_i$ having the highest TF-IDF weighted cosine-similarity to it is chosen as the topic of $d^C_i$. A topic-dependent LM is then constructed for each story $d^C_i$ as
$$P(c_k \,|\, c_{k-1}, c_{k-2}, d^C_i) \;=\; \lambda\, P_{t_i}(c_k \,|\, c_{k-1}, c_{k-2}) + (1-\lambda)\, P(c_k \,|\, c_{k-1}, c_{k-2}).$$</Paragraph>
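    <Paragraph> A sketch of the clustering and topic-assignment steps is shown below, assuming articles are given as token lists. The use of scikit-learn, the number of topics, and the iteration count are illustrative assumptions; the vectors are L2-normalized by the TF-IDF step, so Euclidean K-means behaves much like the cosine-similarity clustering described in the text.</Paragraph>
    <Paragraph>
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_topics(train_articles, n_topics=50, seed=0):
    """Cluster Chinese training articles (token lists) into topics by K-means on TF-IDF vectors."""
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
    vectors = vectorizer.fit_transform(train_articles)
    kmeans = KMeans(n_clusters=n_topics, init="random", n_init=3,
                    max_iter=10, random_state=seed).fit(vectors)
    return vectorizer, kmeans

def assign_topic(first_pass_tokens, vectorizer, kmeans):
    """Choose the topic whose centroid is closest to the first-pass ASR output."""
    vector = vectorizer.transform([first_pass_tokens])
    return int(kmeans.predict(vector)[0])
    </Paragraph>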
    <Paragraph position="4"> Alternatives to topic-dependent LMs for exploiting long-range dependencies include cache LMs and monolingual lexical triggers; both are unlikely to be as effective in the presence of significant ASR errors.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 ASR Training and Test Corpora
</SectionTitle>
    <Paragraph position="0"> We investigate the use of the techniques described above for improving ASR performance on Mandarin news broadcasts using English newswire texts.</Paragraph>
    <Paragraph position="1"> We have chosen the experimental ASR setup created in the 2000 Johns Hopkins Summer Workshop to study Mandarin pronunciation modeling; extensive details are available in Fung et al. (2000). The acoustic training data (approximately 10 hours) for their ASR system was obtained from the 1997 Mandarin Broadcast News distribution, and context-dependent state-clustered models were estimated using initials and finals as subword units. Two Chinese text corpora and an English corpus are used to estimate LMs in our experiments. A vocabulary $\mathcal{C}$ of 51K Chinese words, used in the ASR system, is also used to segment the training text. This vocabulary gives an OOV rate of 5% on the test data.</Paragraph>
    <Paragraph position="2"> XINHUA: We use the Xinhua News corpus of about 13 million words to represent the scenario when the amount of available LM training text borders on adequate, and estimate a baseline trigram LM for one set of experiments.</Paragraph>
    <Paragraph position="3"> HUB-4NE: We also estimate a trigram model from only the 96K words in the transcriptions used for training acoustic models in our ASR system.</Paragraph>
    <Paragraph position="4"> This corpus represents the scenario when little or no additional text is available to train LMs.</Paragraph>
    <Paragraph position="5"> NAB-TDT: English text contemporaneous with the test data is often easily available. For our test set, described below, we select (from the North American News Text corpus) articles published in 1997 in The Los Angeles Times and The Washington Post, and articles from 1998 in The New York Times and the Associated Press news service (from the TDT-2 corpus). This amounts to a collection of roughly 45,000 articles containing about 30 million words of English text -- a modest collection by CLIR standards.</Paragraph>
    <Paragraph position="6"> Our ASR test set is a subset (cf. Fung et al. (2000)) of the NIST 1997 and 1998 HUB-4NE benchmark tests, containing Mandarin news broadcasts from three sources, for a total of about 9800 words.</Paragraph>
    <Paragraph position="7"> We generate two sets of lattices using the baseline acoustic models and bigram LMs estimated from XINHUA and HUB-4NE. All our LMs are evaluated by rescoring N-best lists extracted from these two sets of lattices. The N-best lists from the XINHUA bigram LM are used in all XINHUA experiments, and those from the HUB-4NE bigram LM in all HUB-4NE experiments. We report both word error rates (WER) and character error rates (CER), the latter being independent of any differences in segmentation between the ASR output and the reference transcriptions.</Paragraph>
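    <Paragraph> The rescoring step is schematically the following; the score combination and the weights are illustrative assumptions, not the workshop system's actual settings.</Paragraph>
    <Paragraph>
def rescore_nbest(nbest, lm_logprob, lm_weight=10.0, word_penalty=0.0):
    """Re-rank an N-best list with a new language model and return the best word string.

    nbest: list of (acoustic_log_likelihood, word_list) hypotheses from a lattice.
    lm_logprob: callable giving the total LM log-probability of a word sequence,
    e.g. computed with the interpolated model of (2) or (7).
    """
    def score(hypothesis):
        acoustic, words = hypothesis
        return acoustic + lm_weight * lm_logprob(words) + word_penalty * len(words)
    return max(nbest, key=score)[1]
    </Paragraph>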
  </Section>
</Paper>