File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/n06-2037_intro.xml
Size: 6,205 bytes
Last Modified: 2025-10-06 14:03:28
<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2037"> <Title>Selecting relevant text subsets from web-data for building topic speci c language models</Title> <Section position="3" start_page="0" end_page="145" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> One of the main challenges in the rapid deployment of NLP applications is the lack of in-domain data required for training statistical models. Language models, especially n-gram based, are key components of most NLP applications, such as speech recognition and machine translation, where they serve as priors in the decoding process. To estimate a n-gram language model we require examples of in-domain transcribed utterances, which in absence of readily available relevant corpora have to be collected manually. This poses severe constraints in terms of both system turnaround time and cost.</Paragraph> <Paragraph position="1"> This led to a growing interest in using the World Wide Web (WWW) as a corpus for NLP (Lapata, 2005; Resnik and Smith, 2003). The web can serve as a good resource for automatically gathering data for building task-speci c language models. Web-pages of interest can be identi ed by generating query terms either manually or automatically from an initial set of in-domain sentences by measures such as TFIDF or Relative Entropy (R.E). These webpages can then be converted to a text corpus (which we will refer to as web-data) by appropriate preprocessing. However text gathered from the web will rarely t the demands or the nature of the domain of interest completely. Even with the best queries and web crawling schemes, both the style and content of the web-data will usually differ signi cantly from the speci c needs. For example, a speech recognition system requires conversational style text whereas most of the data on the web is literary.</Paragraph> <Paragraph position="2"> The mismatch between in-domain data and web-data can be seen as a semi-supervised learning problem. We can model the web-data as a mix of sentences from two classes: in-domain (I) and noise (N) (or out-of-domain). The labels I and N are latent and unknown for the sentences in web-data but we usually have a small number of examples of in-domain examples I. Selecting the right labels for the unlabeled set is important for bene ting from it.</Paragraph> <Paragraph position="3"> Recent research on semi-supervised learning shows that in many cases (Nigam et al., 2000; Zhu, 2005) poor preprocessing of unlabeled data might actually lower the performance of classi ers. We found similar results in our language modeling experiments where the presence of a large set of noisy N examples in training actually lowers the performance slightly in both perplexity and WER terms. Recent literature on building language models from text acquired from the web addresses this issue partly by using various rank-and-select schemes for identifying the set I (Ostendorf et al., 2005; Sethy, 2005; Sarikaya, 2005). However we believe that similar to the question of balance (Zhu, 2005) in semi-supervised learning for classi cation, we need to address the question of distributional similarity while selecting the appropriate utterances for building a language model from noisy data. The subset of sentences from web-data which are selected to build the adaptation language should have a distribution similar to the in-domain data model.</Paragraph> <Paragraph position="4"> To address the issue of distributional similarity we present an incremental algorithm which compares the distribution of the selected set and the in-domain examples by using a relative entropy (R.E) criterion.</Paragraph> <Paragraph position="5"> We will review in section 2 some of the ranking schemes which provide baselines for performance comparison and in section 3 we describe the proposed algorithm. Experimental results are provided in section 4, before we conclude with a summary of this work and directions for the future.</Paragraph> <Paragraph position="6"> 2 Rank and select methods for text cleaning The central idea behind text cleanup schemes in recent literature, on using web-data for language modeling, is to use a scoring function that measures the similarity of each observed sentence in the web-data to the in-domain set and assigns an appropriate score. The subsequent step is to set a threshold in terms of either the minimum score or the number of top scoring sentences. The threshold can usually be xed using a heldout set. Ostendorf (2005) use perplexity from an in-domain n-gram language model as a scoring function. More recently, a modi ed version of the BLEU metric which measures sentence similarity in machine translation has been proposed by Sarikaya (2005) as a scoring function.</Paragraph> <Paragraph position="7"> Instead of explicit ranking and thresholding it is also possible to design a classi er in a learning from positive and unlabeled examples framework (LPU) (Liu et al., 2003). In this system, a subset of the unlabeled set is selected as the negative or the noise set N. A two class classi er is then trained using the in-domain set and the negative set. The classi er is then used to label the sentences in the web-data.</Paragraph> <Paragraph position="8"> The classi er can then be iteratively re ned by using a better and larger subset of the I/N sentences selected in each iteration.</Paragraph> <Paragraph position="9"> Rank ordering schemes do not address the issue of distributional similarity and select many sentences which already have a high probability in the in-domain text. Adapting models on such data has the tendency to skew the distribution even further towards the center. For example, in our doctor-patient interaction task short sentences containing the word 'okay' such as 'okay','yes okay', 'okay okay' were very frequent in the in-domain data. Perplexity or other similarity measures give a high score to all such examples in the web-data boosting the probability of these words even further while other pertinent sentences unseen in the in-domain data such as 'Can you stand up please?' are ranked low and get rejected.</Paragraph> </Section> class="xml-element"></Paper>