<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2037"> <Title>Selecting relevant text subsets from web-data for building topic specific language models</Title> <Section position="4" start_page="145" end_page="146" type="metho"> <SectionTitle> 3 Incremental Selection </SectionTitle> <Paragraph position="0"> To address the issue of distributional similarity we developed an incremental greedy selection scheme based on relative entropy, which selects a sentence if adding it to the already selected set of sentences reduces the relative entropy with respect to the in-domain data distribution.</Paragraph> <Paragraph position="1"> Let us denote the language model built from in-domain data as P, and let Pinit be a language model for initialization purposes, which we estimate by bagging samples from the same in-domain data. To describe our algorithm we will use unigram probabilities, though the method generalizes to higher-order n-grams as well.</Paragraph> <Paragraph position="2"> Let W(i) be an initial set of counts for the words i in the vocabulary V, initialized using Pinit. We denote the count of word i in the jth sentence sj of the web-data by m_ij. Let n_j = Σ_i m_ij be the number of words in the sentence and N = Σ_i W(i) be the total number of words already selected. The relative entropy of the maximum likelihood estimate of the language model of the selected sentences with respect to the in-domain model P is given by

H(j) = Σ_{i in V} P(i) ln [ P(i) (N + n_j) / (W(i) + m_ij) ]

</Paragraph> <Paragraph position="4"> Direct computation of the R.E. using the above expression for every sentence in the web-data would have a very high computational cost, since O(V) computations per sentence in the web-data are required.
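To make this cost concrete, a direct computation of the criterion could be sketched in Python as follows. The function and variable names (relative_entropy, W, m_j) are illustrative, not from the paper's implementation; the counts W are assumed strictly positive, as guaranteed by initializing them from the bagged model Pinit.

```python
import math

def relative_entropy(P, W, N, m_j, n_j):
    """D(P || Q_j), where Q_j is the maximum likelihood unigram model
    obtained by tentatively adding sentence j to the selection.

    P   : dict mapping word -> in-domain probability
    W   : dict mapping word -> current selected counts (assumed > 0,
          since W is initialized from the bagged model Pinit)
    N   : total count, the sum of W's values
    m_j : dict mapping word -> count of that word in sentence j
    n_j : length of sentence j (sum of m_j's values)

    Iterates over the entire vocabulary: O(V) work per sentence.
    """
    total = N + n_j
    h = 0.0
    for word, p in P.items():
        q = (W[word] + m_j.get(word, 0)) / total
        h += p * math.log(p / q)
    return h
```

Selecting a sentence then amounts to checking whether the value with the sentence's counts included is smaller than the value with empty counts, which is what motivates the sparse reformulation below.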
However, given that m_ij is sparse, we can split the summation H(j) into</Paragraph>

H(j) = H(j-1) + T1 - T2, where
T1 = ln( (N + n_j) / N ) and T2 = Σ_{i: m_ij ≠ 0} P(i) ln( (W(i) + m_ij) / W(i) )

<Paragraph position="6"> Intuitively, the term T1 measures the decrease in probability mass caused by adding n_j more words to the corpus, and the term T2 measures the in-domain distribution P weighted improvement in probability for words with non-zero m_ij.</Paragraph> <Paragraph position="7"> For the R.E. to decrease with the selection of sentence sj we require T1 < T2. To make the selection more refined we can impose the condition T1 + thr(j) < T2, where thr(j) is a function of j. A good choice for thr(j), based on empirical study, is a function that declines at the same rate as the ratio ln((N + n_j)/N) ≈ n_j/N ≈ 1/(kj), where k is the average number of words per sentence.</Paragraph> <Paragraph position="8"> The proposed algorithm is sequential and greedy in nature and can benefit from randomization of the order in which it scans the corpus. We generate permutations of the corpus by scanning through it and randomly swapping sentences. We then run the sequential selection on each permutation and merge the selected sets.</Paragraph> <Paragraph position="9"> The choice of maximum likelihood estimation for the intermediate language models is motivated by the simplification in the entropy calculation, which reduces the cost per sentence from O(V) to O(k). However, maximum likelihood estimation of language models is poor compared to smoothing-based estimation. To balance computation cost and estimation accuracy, we modify the counts W(i) using Kneser-Ney smoothing periodically, after a fixed number of sentences.</Paragraph> </Section> <Section position="5" start_page="146" end_page="147" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> Our experiments were conducted on medical domain data collected for building the English ASR of our English-Persian Speech to Speech translation project (Georgiou et al., 2003).
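Putting the pieces of Section 3 together, the incremental selection loop with the sparse T1/T2 test might be sketched as below. This is a simplified illustration, not the paper's implementation: it assumes a closed vocabulary (every web word appears in W), the constant thr_scale on the 1/(kj) threshold is an invented tuning parameter, it processes a single permutation rather than merging several, and the periodic Kneser-Ney re-smoothing of the counts is omitted.

```python
import math
import random

def select_sentences(P, W, sentences, k_avg, thr_scale=0.01, seed=0):
    """Greedy incremental selection over one random permutation.

    A sentence is kept when T1 + thr(j) < T2, i.e. when adding it
    reduces the relative entropy to the in-domain unigram model P.

    P        : word -> in-domain probability
    W        : word -> initial counts (from Pinit); mutated in place
    k_avg    : average sentence length, so thr declines like 1/(k*j)
    thr_scale: illustrative constant scaling the threshold
    """
    order = list(range(len(sentences)))
    random.Random(seed).shuffle(order)  # one permutation of the corpus
    N = sum(W.values())
    selected = []
    for j, idx in enumerate(order, start=1):
        sent = sentences[idx]
        counts = {}
        for w in sent:
            counts[w] = counts.get(w, 0) + 1
        n_j = len(sent)
        # T1: loss of probability mass from growing the corpus by n_j words.
        T1 = math.log((N + n_j) / N)
        # T2: P-weighted gain, summed only over the words the sentence
        # actually contains -- O(k) work instead of O(V).
        T2 = sum(P.get(w, 0.0) * math.log((W[w] + c) / W[w])
                 for w, c in counts.items())
        thr = thr_scale / (k_avg * j)  # declines at the same rate as 1/(k*j)
        if T1 + thr < T2:
            selected.append(sent)
            for w, c in counts.items():
                W[w] += c
            N += n_j
    return selected
```

Running this over several permutations, merging the selected sets, and periodically re-smoothing W would recover the full procedure described above.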
We have 50K in-domain sentences available for this task. We downloaded around 60GB of data from the web using automatically generated queries, which after filtering and normalization amounts to 150M words. The test set for perplexity evaluations consists of 5000 sentences (35K words) and the heldout set has 2000 sentences (12K words). The test set for word error rate evaluation consisted of 520 utterances. A generic conversational speech language model was built from the WSJ, Fisher and SWB corpora interpolated with the CMU LM. All language models built from web-data and in-domain data were interpolated with this language model, with the interpolation weight determined on the heldout set.</Paragraph> <Paragraph position="1"> We first compare our proposed algorithm against baselines based on perplexity (PPL), BLEU and LPU classification in terms of test set perplexity. As the comparison shows, the proposed algorithm outperforms the rank-and-select schemes with just one tenth of the data. Table 1 shows the test set perplexity with different amounts of initial in-domain data. Table 2 shows the number of sentences selected for the best perplexity on the heldout set by the above schemes.</Paragraph> <Paragraph position="2"> The average relative perplexity reduction is around 6%. In addition to the PPL and WER improvements, we were able to achieve a factor of 5 reduction in the number of estimated language model parameters (bigram+trigram) and a 30% reduction in the vocabulary size.</Paragraph> <Paragraph position="3"> No Web refers to the language model built from just in-domain data with no web-data. All-Web refers to the case where the entire web-data was used.</Paragraph> <Paragraph position="4"> The WER results in Table 3 show that adding data from the web without proper filtering can actually harm the performance of the speech recognition system when the initial in-domain data size increases.
This can be attributed to the large increase in vocabulary size, which increases the acoustic decoder perplexity. The average reduction in WER using the proposed scheme is close to 3% relative. It is interesting to note that for our data selection scheme the perplexity improvements correlate surprisingly well with WER improvements. A plausible explanation is that the perplexity improvements are accompanied by a significant reduction in the number of language model parameters.</Paragraph> </Section> </Paper>