<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1080">
<Title>A Pylonic Decision-Tree Language Model with Optimal Question Selection</Title>
<Section position="2" start_page="0" end_page="606" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> In many applications, such as automatic speech recognition, machine translation, and spelling correction, a statistical language model (LM) is needed to assign probabilities to sentences.</Paragraph>
<Paragraph position="1"> This probability assignment may be used, e.g., to choose one of many transcriptions hypothesized by the recognizer or to make decisions about capitalization. Without any loss of generality, we consider models that operate left-to-right on the sentences, assigning a probability to the next word given its word history. Specifically, we consider statistical LMs that compute probabilities of the type P(w_n | w_1, w_2, ..., w_{n-1}), where w_i denotes the i-th word in the text.</Paragraph>
<Paragraph position="2"> Even for a small vocabulary, the space of word histories is so large that any attempt to estimate the conditional probabilities for each distinct history from raw frequencies is infeasible. To make the problem manageable, one partitions the word histories into classes C(w_1, w_2, ..., w_{n-1}) and identifies the word probabilities with P(w_n | C(w_1, w_2, ..., w_{n-1})). Such probabilities are easier to estimate, as each class gets significantly more counts from a training corpus. With this setup, building a language model becomes a classification problem: group the word histories into a small number of classes while preserving their predictive power.</Paragraph>
<Paragraph position="3"> Currently popular N-gram models classify the word histories by their last N-1 words.</Paragraph>
<Paragraph position="4"> N varies from 2 to 4, and the trigram model P(w_n | w_{n-2}, w_{n-1}) is commonly used. Although these simple models perform surprisingly well, there is much room for improvement. The approach used in this paper is to classify the histories by means of a decision tree: to cluster word histories w_1, w_2, ..., w_{n-1} for which the distributions of the following word w_n in a training corpus are similar. The decision tree is pylonic in the sense that histories at different nodes in the tree may be recombined in a new node to increase the complexity of questions and to avoid data fragmentation.</Paragraph>
<Paragraph position="5"> The method has been tried before (Bahl et al., 1989), with promising results. In the work presented here we made two major changes relative to the previous attempts: we used an optimal tree-growing algorithm (Chou, 1991) not known at the time of publication of (Bahl et al., 1989), and we replaced the ad hoc clustering of vocabulary items used by Bahl et al. with the data-driven clustering scheme proposed in (Lucassen and Mercer, 1984).</Paragraph>
</Section>
</Paper>
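
A minimal Python sketch of the class-based estimate P(w_n | C(w_1, ..., w_{n-1})) described in the introduction, using the N-gram classing where C is the last N-1 words. The corpus format, padding symbols, and function names are illustrative assumptions, not part of the paper.

    # Relative-frequency estimate of P(w_n | C(history)), where the class
    # C(...) is the last n-1 words of the history (standard N-gram classing).
    # Sentence format and "<s>"/"</s>" padding are assumptions for illustration.
    from collections import Counter, defaultdict

    def train_ngram_lm(sentences, n=3):
        """Count (history-class, next-word) pairs over tokenized sentences."""
        class_counts = Counter()
        pair_counts = defaultdict(Counter)
        for sent in sentences:
            words = ["<s>"] * (n - 1) + sent + ["</s>"]
            for i in range(n - 1, len(words)):
                history_class = tuple(words[i - n + 1 : i])  # C(...) = last n-1 words
                class_counts[history_class] += 1
                pair_counts[history_class][words[i]] += 1
        return class_counts, pair_counts

    def prob(class_counts, pair_counts, history, word, n=3):
        """Raw relative frequency of `word` given C(history); no smoothing."""
        c = tuple((["<s>"] * (n - 1) + list(history))[-(n - 1):])
        if class_counts[c] == 0:
            return 0.0
        return pair_counts[c][word] / class_counts[c]

For example, after training on a list of tokenized sentences, prob(class_counts, pair_counts, ["the", "cat"], "sat") returns the trigram relative frequency of "sat" after the class (the, cat). The sparsity problem the paper raises is visible here: unseen classes get probability zero, which is why histories must be grouped more cleverly than by their surface N-gram.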
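The following sketch illustrates the decision-tree idea of the introduction: a binary question about the history is scored by how much it sharpens (reduces the entropy of) the next-word distribution in the training data. The question form ("is the history word at position `pos` in a set S?") and the greedy gain criterion are illustrative assumptions; the paper instead selects questions optimally via Chou's (1991) algorithm and uses a pylonic structure that can recombine nodes to avoid data fragmentation.

    # Entropy-reduction score for one candidate binary question on a tree node.
    # `samples` is a list of (history, next_word) pairs at that node; the
    # question form and criterion are assumptions for illustration only.
    import math
    from collections import Counter

    def entropy(counter):
        """Shannon entropy (bits) of an empirical next-word distribution."""
        total = sum(counter.values())
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    def split_gain(samples, word_set, pos=-1):
        """Entropy reduction from asking: 'is the history word at `pos` in word_set?'"""
        parent = Counter(w for _, w in samples)
        yes = Counter(w for h, w in samples if h[pos] in word_set)
        no = Counter(w for h, w in samples if h[pos] not in word_set)
        if not yes or not no:
            return 0.0  # degenerate split: one child is empty
        n = len(samples)
        child = (sum(yes.values()) / n) * entropy(yes) \
              + (sum(no.values()) / n) * entropy(no)
        return entropy(parent) - child

Greedily picking the highest-gain question at each node and splitting recursively yields an ordinary decision tree over histories; the pylonic variant differs in that subsets of histories from different nodes may later be merged into a new node.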