<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1022">
  <Title>Automatic Learning of Language Model Structure</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In spite of novel algorithmic developments and the increased availability of large text corpora, statistical language modeling remains a di cult problem, particularly for languages with rich morphology. Such languages typically exhibit a large number of word types in relation to word tokens in a given text, which leads to high perplexity and a large number of unseen word contexts. As a result, probability estimates are often unreliable, even when using standard smoothing and parameter reduction techniques. Recently, a new language modeling approach, called factored language models (FLMs), has been developed. FLMs are a generalization of standard language models in that they allow a larger set of conditioning variables for predicting the current word. In addition to the preceding words, any number of additional variables can be included (e.g. morphological, syntactic, or semantic word features). Since such features are typically shared across multiple words, they can be used to obtained better smoothed probability estimates when training data is sparse. However, the space of possible models is extremely large, due to many di erent ways of choosing subsets of conditioning word features, backo procedures, and discounting methods. Usually, this space cannot be searched exhaustively, and optimizing models by a knowledge-inspired manual search procedure often leads to suboptimal results since only a small portion of the search space can be explored. In this paper we investigate the possibility of determining the structure of factored language models (i.e. the set of conditioning variables, the backo procedure and the discounting parameters) by a data-driven search procedure, viz. Genetic Algorithms (GAs). We apply this technique to two di erent tasks (language modeling for Arabic and Turkish) and show that GAs lead to better models than either knowledge-inspired manual search or random search. The remainder of this paper is structured as follows: Section 2 describes the details of the factored language modeling approach. The application of GAs to the problem of determining language model structure is explained in Section 3. The corpora used in the present study are described in Section 4 and experiments and results are presented in Section 5.</Paragraph>
    <Paragraph position="1"> Section 6 compares the present study to related work and Section 7 concludes.</Paragraph>
  </Section>
class="xml-element"></Paper>