<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1032"> <Title>Frequency Estimates for Statistical Word Similarity Measures</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Results and Discussion </SectionTitle> <Paragraph position="0"> The results for the TOEFL questions are presented in figure 3. The best performance found is 81.25% of the questions correctly answered. That result used DR-PMI with a window size of 16-32 words. This is an improvement over the results presented by Landauer and Dumais (1997) using Latent Semantic Analysis, where 64.5% of the questions were answered correctly, and Turney (2001), using pointwise mutual information and document retrieval, where the best result was 73.75%.</Paragraph> <Paragraph position="1"> Although we use a similar method (DR-PMI), the difference between the results presented here and Turney's results may be due to differences in the corpora and differences in the queries. Turney uses Altavista and we used our own crawl of web data. We can not compare the collections since we do not know how Altavista collection is created. As for the queries, we have more control over the queries since we can precisely specify the window size and we also do not know how queries are evaluated in Altavista.</Paragraph> <Paragraph position="2"> PMI performs best overall, regardless of estimates used (DR or W). W-CHI performs up to 80% when using window estimates, outperforming DR-CHI. MI and LL yield exactly the same results (and the same ranking of the alternatives), which suggests that the binomial distribution is a good approximation for word occurrence in text.</Paragraph> <Paragraph position="3"> The results for MI and PMI indicate that, for the two discrete random variables a2 a9 and a2 a14 (and range a34a198a7</Paragraph> <Paragraph position="5"> expectation in the divergence. Recall that the divergence formula has an embedded expectation to be calculated between the joint probability of these two random variables and their independence. The peak of information is exactly where both words co-occur, i.e. when a2 a9a199a7 a21 a9 and a2 a14a200a7 a21 a14 , and not any of the other three possible combinations.</Paragraph> <Paragraph position="6"> Similar trends are seen when using TS1 and no context, as depicted in figure 5. PMI is best overall, and DR-PMI and W-PMI outperform each other with different windows sizes. W-CHI has good performance in small windows sizes. MI and LL yield identical (poor) results, being worst than chance for some window sizes. Turney (2001) also uses this test set without context, achieving 66% peak performance compared with our best performance of 72% (DR-PMI).</Paragraph> <Paragraph position="7"> In the test set TS2 with no context, the trend seen between TOEFL and TS1 is repeated, as shown in figure 8.</Paragraph> <Paragraph position="8"> PMI is best overall but W-CHI performs better than PMI in three cases. DR-CHI performs poorly for small windows sizes. MI and LL also perform poorly in comparison with PMI. The peak performance is 75%, using DR-PMI with a window size of 64.</Paragraph> <Paragraph position="9"> The result are not what we expected when context is used in TS1 and TS2. In TS1, figure 6, only one of the measures, DR-PMIC-1, outperforms the results from non-context measures, having a peak of 80% correct answers. The condition for the best result (one word from context and a window size of 8) is similar to the one used for the best score reported by Turney. 
<Paragraph position="6"> Similar trends are seen when using TS1 with no context, as depicted in figure 5. PMI is best overall, and DR-PMI and W-PMI outperform each other at different window sizes. W-CHI performs well at small window sizes. MI and LL yield identical (poor) results, being worse than chance for some window sizes. Turney (2001) also uses this test set without context, achieving 66% peak performance, compared with our best performance of 72% (DR-PMI).</Paragraph>
<Paragraph position="7"> In the test set TS2 with no context, the trend seen with TOEFL and TS1 is repeated, as shown in figure 8.</Paragraph>
<Paragraph position="8"> PMI is best overall, but W-CHI performs better than PMI in three cases. DR-CHI performs poorly for small window sizes. MI and LL also perform poorly in comparison with PMI. The peak performance is 75%, using DR-PMI with a window size of 64.</Paragraph>
<Paragraph position="9"> The results are not what we expected when context is used in TS1 and TS2. In TS1 (figure 6), only one of the measures, DR-PMIC-1, outperforms the results from the non-contextual measures, with a peak of 80% correct answers. The condition for the best result (one word from context and a window size of 8) is similar to the one used for the best score reported by Turney. L1, AMIC and IRAD perform poorly, worse than chance for some window sizes. One difference in the results is that for DR-PMIC-1 only the best word from the context was used, while the other methods used all words except stopwords.</Paragraph>
<Paragraph position="10"> We examined the context and found that using more words degrades the performance of DR-PMIC at all window sizes; even so, using all words except stopwords, DR-PMIC remains better than any other contextual measure, with 76% correct answers in TS1 (DR-PMIC with a window size of 8).</Paragraph>
<Paragraph position="11"> For TS2, no measure using context was able to perform better than the non-contextual measures. DR-PMIC-1 performs best overall but has worse performance than DR-CP with a window size of 8. In this test set, the performance of DR-CP is better than that of W-CP. L1 performs better than AMIC, but both have poor results, and IRAD is never better than chance. The context in TS2 has more words than in TS1, but the questions seem to be harder, as shown in figure 7. In some of the TS2 questions, the target word or one of the alternatives involves functional words. We also investigated the influence of using more words from the context in TS2, as depicted in figure 12, where the trends seen with TS1 are repeated.</Paragraph>
<Paragraph position="12"> The results on TS1 and TS2 suggest that the available context is not very useful or that it is not being used properly.</Paragraph>
<Paragraph position="13"> Finally, we selected the method that yields the best performance on each test set to analyze the impact of corpus size on performance, as shown in figures 4, 11 and 13. For TS1 we use W-PMI with a window size of 2 (W-PMI2) when no context is used and DR-PMIC-1 with a window size of 8 (DR-PMIC8-1) when context is used. For these measures, very little improvement is seen beyond 500 GBytes (roughly half of the collection size) for DR-PMIC8-1, and no apparent improvement is achieved beyond 300-400 GBytes for W-PMI2. For TS2 we use DR-PMI with a window size of 64 (DR-PMI64) when no context is used, and DR-PMIC-1 with a window size of 64 (DR-PMIC64-1) when context is used. It is clear that for TS2 no substantial improvement in DR-PMI64 or DR-PMIC64-1 is achieved by increasing the corpus size beyond 300-400 GBytes. The most interesting impact of corpus size was on the TOEFL test set using DR-PMI with a window size of 16 (DR-PMI16): using the full corpus is no better than using 5% of it, and the best result, 82.5% correct answers, is achieved when using 85-95% of the corpus.</Paragraph>
</Section>
</Paper>