<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1003">
  <Title>Improvement of a Whole Sentence Maximum Entropy Language Model Using Grammatical Features</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experimental Work
</SectionTitle>
    <Paragraph position="0"> A part of the Wall Street Journal (WSJ) which had been processed in the Penn Treebanck Project (Marcus et al., 1993) was used in the experiments.</Paragraph>
    <Paragraph position="1"> This corpus was automatically labelled and manually checked. There were two kinds of labelling: POStag labelling and syntactic labelling. The POStag vocabulary was composed of 45 labels.</Paragraph>
    <Paragraph position="2"> The syntactic labels are 14. The corpus was divided into sentences according to the bracketing.</Paragraph>
    <Paragraph position="3"> We selected 12 sections of the corpus at random. Six were used as training corpus, three as test set and the other three sections were used as held-out for tuning the smoothing WSME model.</Paragraph>
    <Paragraph position="4"> The sets are described as follow: the training corpus has 11,201 sentences; the test set has 6,350 sentences and the held-out set has 5,796 sentences. null A base-line Katz back-off smoothed trigram model was trained using the CMU-Cambridge statistical Language Modeling Toolkit 4 and used as prior distribution in (3) i.e. a5a59a58 . The vocabulary generated by the trigram model was used as vocabulary of the WSME model. The size of the vocabulary was 19,997 words.</Paragraph>
    <Paragraph position="5">  The estimation of the word-category probability distribution was computed from the training corpus. In order to avoid null values, the unseen events were labeled with a special &amp;quot;unknown&amp;quot; symbol which did not appear in the vocabulary, so that the probabilitie of the unseen envent were positive for all the categories.</Paragraph>
    <Paragraph position="6"> The SCFG had the maximum number of rules which can be composed of 45 terminal symbols (the number of POStags) and 14 non-terminal symbols (the number of syntactic labels). The initial probabilities were randomly generated and three different seeds were tested. However, only one of them is here given that the results were very similar.</Paragraph>
    <Paragraph position="7"> The size of the sample used in the ISS was estimated by means of an experimental procedure and was set at 10,000 elements. The procedure used to generate the sample made use of the &amp;quot;diagnosis of convergence&amp;quot; (Neal, 1993), a method by means of which an inicial portion of each run of the Markov chain of sufficient length is discarded. Thus, the states in the remaining portion come from the desired equilibrium distribution.</Paragraph>
    <Paragraph position="8"> In this work, a discarded portion of 3,000 elements was establiched. Thus in practice, we have to generate 13,000 instances of the Markov chain.</Paragraph>
    <Paragraph position="9"> During the IIS, every sample was tagged using the grammar estimated above, and then the grammatical features were extracted, before combining them with other kinds of features. The adequate number of iterations of the IIS was established experimentally in 13.</Paragraph>
    <Paragraph position="10"> We trained several WSME models using the Perfect Sampling algorithm in the IIS and a different set of features (including the grammatical features) for each model. The different sets of features used in the models were: n-grams (1grams,2-grams,3-grams); triggers; n-grams and grammatical features; triggers and grammatical feautres; n-grams, triggers and grammatical features. null The a41 -gram features,(N), was selected by means of its frequency in the corpus. We select all the unigrams, the bigrams with frequency greater than 5 and the trigrams with frequency greater than 10, in order to mantain the proportion of each type of a41 -gram in the corpus.</Paragraph>
    <Paragraph position="11"> The triggers, (T), were generated using a trig- null models with grammatical features and models without grammatical features for WSME models over part of the WSJ corpus. N means features of n-grams, T means features of Triggers. The perplexity of the trained n-gram model was PP=162.049 ger toolkit developed by Adam Berger 5. The triggers were selected in acordance with de mutual information. The triggers selected were those with mutual information greater than 0.0001.</Paragraph>
    <Paragraph position="12"> The grammatical features, (G), were selected using the parser tree of all the sentences in the training corpus to obtain the sets a127 a6a13a12 a8 and their union a127 as defined in section 3.</Paragraph>
    <Paragraph position="13"> The size of the initial set of features was: 12,023 a41 -grams, 39,428 triggers and 258 gramatical features, in total 51,709 features. At the end of the training procedure, the number of active features was significantly reduced to 4,000 features on average.</Paragraph>
    <Paragraph position="14"> During the training procedure, some of the</Paragraph>
    <Paragraph position="16"> a39 a120 and, so, we smooth the model. We smoothed it using a gaussian prior technique. In the gaussian technique, we assumed that the a102 a26 paramters had a gaussian (normal) prior probability distribution (Chen and Rosenfeld, 1999b) and found the maximum aposteriori parameter distribution. The prior distribution was a102  and we used the held-out data to find the a199</Paragraph>
    <Paragraph position="18"> rameters.</Paragraph>
    <Paragraph position="19"> Table 1 shows the experimental results: the first row represents the set of features used. The second row shows the perplexity of the models without using grammatical features. The third row shows the perplexity of the models using grammatical features and the fourth row shows the improvement in perplexity of each model using grammatical features over the corresponding model without grammatical features. As can be seen in Table 1, all the WSME models performed  better than the a41 -gram model, however that is natural because, in the worst case (if all a102 a26 a10a200a44 ), the WSME models perform like the a41 -gram model.</Paragraph>
    <Paragraph position="20"> In Table 1, we see that all the models using grammatical features perform better than the models that do not use it. Since the training procedure was the same for all the models described and since the only difference between the two kinds of models compared were the grammatical features, then we conclude that the improvement must be due to the inclusion of such features into the set of features. The average percentage of improvement was about 13%.</Paragraph>
    <Paragraph position="21"> Also, although the model N+T performs better than the other model without grammatical features (N,T), it behaves worse than all the models with grammatical features ( N+G improved 2.9% and T+G improvd 5.9% over N+T).</Paragraph>
  </Section>
</Paper>