<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2037">
  <Title>Selecting relevant text subsets from web-data for building topic specific language models</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> In this paper we present a scheme to select relevant subsets of sentences from a large generic corpus such as text acquired from the web. A relative entropy (R.E.) based criterion is used to incrementally select sentences whose distribution matches the domain of interest. Experimental results show that by using the proposed subset selection scheme we can get significant performance improvement in both Word Error Rate (WER) and Perplexity (PPL) over the models built from the entire web corpus by using just 10% of the data. In addition, incremental data selection enables us to achieve a significant reduction in the vocabulary size as well as the number of n-grams in the adapted language model. To demonstrate the gains from our method we provide a comparative analysis with a number of methods proposed in recent language modeling literature for cleaning up text.</Paragraph>
  </Section>
</Paper>