<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-1035">
<Title>Exploiting Syntactic Structure for Language Modeling</Title>
<Section position="6" start_page="229" end_page="229" type="evalu">
<SectionTitle>5 Experiments</SectionTitle>
<Paragraph position="0">Due to the low speed of the parser -- 200 wds/min at stack depth 10 and log-probability threshold 6.91 nats (1/1000) -- we could carry out the re-estimation technique described in section 3.4 on only 1 Mwds of training data. For convenience we chose to work on the UPenn Treebank corpus. The vocabulary sizes were:
* word vocabulary: 10k, open -- all words outside the vocabulary are mapped to the <unk> token.
The test set comprised 82,430 wds (sections 23-24). The "check" set was used for estimating the interpolation weights and tuning the search parameters; the "development" set was used for gathering/estimating counts; the test set was used strictly for evaluating model performance.</Paragraph>
<Paragraph position="1">Table 1 shows the results of the re-estimation technique presented in section 3.4. We achieved a reduction in test-data perplexity over a deleted interpolation trigram model whose perplexity was 167.14 on the same training-test data; the reduction is statistically significant according to a sign test.</Paragraph>
<Paragraph position="2">[Table 1: DEV-set and TEST-set perplexity at each re-estimation iteration; the table body was lost in extraction.]</Paragraph>
<Paragraph position="3">Simple linear interpolation between our model and the trigram model, $Q(w_{k+1} \mid W_k) = \lambda \cdot P(w_{k+1} \mid w_k, w_{k-1}) + (1 - \lambda) \cdot P(w_{k+1} \mid W_k)$, yielded a further improvement in PPL; the interpolation weight was estimated on check data to be $\lambda = 0.36$.</Paragraph>
<Paragraph position="4">An overall relative reduction of 11% over the trigram model was achieved (i.e., roughly 167.14 × 0.89 ≈ 149 PPL).</Paragraph>
</Section>
</Paper>
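
As a concrete illustration of the interpolation step above, the following is a minimal Python sketch, not taken from the paper: p_trigram and p_slm are hypothetical stand-ins for the deleted interpolation trigram model and the structured language model, and only the weight λ = 0.36 and the standard perplexity definition come from the text.

import math

LAMBDA = 0.36  # interpolation weight estimated on the check data (from the paper)

def interpolated_prob(w, history, p_trigram, p_slm):
    # Q(w_{k+1} | W_k) = lambda * P(w_{k+1} | w_k, w_{k-1})
    #                    + (1 - lambda) * P(w_{k+1} | W_k)
    # The trigram component sees only the last two words of the history;
    # the structured LM component sees the full history.
    return (LAMBDA * p_trigram(w, tuple(history[-2:]))
            + (1.0 - LAMBDA) * p_slm(w, history))

def perplexity(words, prob):
    # PPL = exp(-(1/N) * sum_i log P(w_i | w_1 .. w_{i-1}))
    total_log_prob = sum(math.log(prob(w, words[:i])) for i, w in enumerate(words))
    return math.exp(-total_log_prob / len(words))

Evaluated over the 82,430-word test set, a perplexity of roughly 149 under the interpolated model would correspond to the 11% relative reduction from the trigram baseline of 167.14 reported above.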