<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2037"> <Title>Selecting relevant text subsets from web-data for building topic specific language models</Title> <Section position="6" start_page="147" end_page="147" type="concl"> <SectionTitle> 5 Conclusion and Future Work </SectionTitle> <Paragraph position="0"> In this paper we have presented a computationally efficient scheme for selecting a subset of data from an unclean generic corpus, such as data acquired from the web. Our results indicate that with this scheme we can identify small subsets of sentences (about 1/10th of the original corpus) from which we can build language models that are substantially smaller in size and yet perform better in both perplexity and WER terms than models built using the entire corpus. Although our focus in the paper was on web-data, we believe the proposed method can be used for adaptation of topic specific models from large generic corpora.</Paragraph> <Paragraph position="1"> We are currently exploring ways to use multiple bagged in-domain language models for the selection process. Instead of a sequential scan of the corpus, we are exploring the use of rank-and-select methods to give a better search sequence.</Paragraph> </Section> </Paper>