<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2037">
<Title>Selecting relevant text subsets from web-data for building topic specific language models</Title>
<Section position="2" start_page="0" end_page="0" type="abstr">
<SectionTitle>Abstract</SectionTitle>
<Paragraph position="0">In this paper we present a scheme for selecting relevant subsets of sentences from a large generic corpus, such as text acquired from the web. A relative entropy (R.E.) based criterion is used to incrementally select sentences whose distribution matches the domain of interest. Experimental results show that the proposed subset selection scheme yields significant improvements in both Word Error Rate (WER) and Perplexity (PPL) over models built from the entire web-corpus, while using just 10% of the data. In addition, incremental data selection enables us to achieve a significant reduction in the vocabulary size as well as the number of n-grams in the adapted language model. To demonstrate the gains from our method, we provide a comparative analysis with a number of methods proposed in the recent language modeling literature for cleaning up text.</Paragraph>
</Section>
</Paper>