<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0714"> <Title>Using Perfect Sampling in Parameter Estimation of a Whole Sentence Maximum Entropy Language Model*</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The language modeling problem may be defined as the problem of calculating the probability of a string, p(w) = p(w1, ..., wn). The probability p(w) is usually calculated via conditional probabilities. The n-gram model is one of the most widely used language models. The power of the n-gram model resides in its simple formulation and its ease of training. On the other hand, n-grams take into account only local information, and important long-distance information contained in the string w1 ... wn cannot be modeled by them. In an attempt to supplement the local information with long-distance information, hybrid models have been proposed, such as (Bellegarda, 1998; Chelba, 1998; Benedí and Sánchez, 2000). * This work has been partially supported by the Spanish CYCIT under contract (TIC98/0423-C06).</Paragraph> <Paragraph position="1"> † Granted by Universidad del Cauca, Popayán (Colombia).</Paragraph> <Paragraph position="2"> The Maximum Entropy principle is an appropriate framework for combining information of a diverse nature from several sources into the same model: the Maximum Entropy model (ME) (Rosenfeld, 1996). The information is incorporated as features, which are subject to constraints. The conditional form of the ME model is:</Paragraph> <Paragraph position="3"> p(y|x) = (1/Z(x)) exp{Σ_{i=1}^{l} λ_i f_i(x, y)} </Paragraph> <Paragraph position="4"> where the λ_i are the parameters to be learned (one for each feature), the f_i are usually characteristic functions associated with the features, and Z(x) = Σ_y exp{Σ_{i=1}^{l} λ_i f_i(x, y)} is the normalization constant. The main advantages of ME are its flexibility (local and global information can be included in the model) and its simplicity. 
The drawbacks are that the parameter estimation is computationally expensive, especially the evaluation of the normalization constant Z(x), and that the grammatical information contained in the sentence is poorly encoded in the conditional framework. This is due to the assumption of independence in the conditional events: in the events in the state space, only a part of the information contained in the sentence influences the calculation of the probability (Ristad, 1998).</Paragraph> </Section> </Paper>
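The conditional ME model described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy features, weights, history, and vocabulary below are hypothetical, chosen only to show how p(y|x) is obtained by exponentiating a weighted feature sum and dividing by the normalization constant Z(x).

```python
import math

def me_conditional_prob(y, x, candidates, features, lambdas):
    """Return p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x),
    with Z(x) = sum over candidate outcomes of the same exponential."""
    def score(outcome):
        # Weighted sum of characteristic (0/1) feature functions, exponentiated.
        return math.exp(sum(l * f(x, outcome) for l, f in zip(lambdas, features)))
    z = sum(score(c) for c in candidates)  # normalization constant Z(x)
    return score(y) / z

# Hypothetical toy setup: predict the next word given a two-word history.
features = [
    lambda x, y: 1.0 if x[-1] == "cat" and y == "sat" else 0.0,  # local, n-gram-like feature
    lambda x, y: 1.0 if "the" in x else 0.0,                     # global feature over the history
]
lambdas = [1.5, 0.3]          # one learned parameter per feature
history = ("the", "cat")
vocab = ["sat", "ran", "dog"]

p_sat = me_conditional_prob("sat", history, vocab, features, lambdas)
```

Note that Z(x) requires a sum over every candidate outcome, which is why its evaluation dominates the cost of parameter estimation, as the drawbacks above point out.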