<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1075">
  <Title>A Nonparametric Method for Extraction of Candidate Phrasal Terms</Title>
  <Section position="5" start_page="609" end_page="610" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Schone and Jurafsky's (2001) study examined the performance of various association metrics on a corpus of 6.7 million words with a cutoff of N=10. The resulting n-gram set had a maximum recall of 2,610 phrasal terms from the WordNet gold standard, and the best figure of merit for any of the association metrics, even with linguistic filtering, was 0.265. On the significantly larger Lexile corpus, N must be set higher (around N=50) to make the results comparable. The statistics were also calculated for N=50, N=10 and N=5 in order to see what the effect of including more (relatively rare) n-grams would be on the overall performance for each statistic. Since many of the statistics are defined without interpolation only for bigrams, and the number of WordNet trigrams at N=50 is very small, the full set of scores was only calculated on the bigram data. For trigrams, in addition to rank ratio and frequency scores, extended pointwise mutual information and true mutual information scores were calculated using the formulas log(Pxyz / (Px Py Pz)) and Pxyz log(Pxyz / (Px Py Pz)). Also, since the standard lexical association metrics cannot be calculated across different n-gram types, results for bigrams and trigrams are presented separately for purposes of comparison.</Paragraph>
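The two trigram formulas above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name trigram_association_scores is hypothetical, probabilities are assumed to be plain relative frequencies over a token list, and the natural logarithm is assumed because the text does not state a base.

```python
import math
from collections import Counter


def trigram_association_scores(tokens):
    """Return {trigram: (extended PMI, true MI)} for a token list.

    Assumptions (not from the paper): probabilities are relative
    frequencies, and the logarithm is the natural log.
    """
    unigrams = Counter(tokens)
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    n_uni = sum(unigrams.values())
    n_tri = sum(trigrams.values())

    scores = {}
    for (x, y, z), count in trigrams.items():
        p_xyz = count / n_tri
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        p_z = unigrams[z] / n_uni
        # Extended pointwise mutual information: log(Pxyz / (Px Py Pz))
        pmi = math.log(p_xyz / (p_x * p_y * p_z))
        # True mutual information term: Pxyz log(Pxyz / (Px Py Pz))
        scores[(x, y, z)] = (pmi, p_xyz * pmi)
    return scores
```

Calling the function on a tokenized corpus yields one pair of scores per distinct trigram; applying a frequency cutoff such as N=50 would amount to discarding trigrams whose count falls below the cutoff before scoring.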
    <Paragraph position="1"> The results are shown in Tables 2-5. Two points should be noted in particular. First, the rank ratio statistic outperformed the other association measures tested across the board. Its best performance, a score of 0.323 in the part-of-speech-filtered condition with N=50, outdistanced the best score in Schone &amp; Jurafsky's study (0.265), and when large numbers of rare bigrams were included, at N=10 and N=5, it continued to outperform the other measures. Second, the results were generally consistent with those reported in the literature, and confirmed Schone &amp; Jurafsky's observation that the information-theoretic measures (such as mutual information and chi-squared) outperform frequency-based measures (such as the T-score and raw frequency.)5</Paragraph>
    <Section position="1" start_page="610" end_page="610" type="sub_section">
      <SectionTitle>
4.1 Discussion
</SectionTitle>
      <Paragraph position="0"> One of the potential strengths of this method is that it allows for a comparison between n-grams of varying lengths. The distribution of scores for the gold standard bigrams and trigrams appears to bear out the hypothesis that the numbers are comparable across n-gram length. Trigrams constitute approximately four percent of the gold standard test set, and appear in roughly the same percentage across the rankings; for instance, they constitute 3.8% of the top 10,000 n-grams ranked by mutual rank ratio. Comparison of trigrams with their component bigrams also seems consistent with this hypothesis; e.g., the bigram Booker T. has a higher mutual rank ratio than the trigram Booker T. Washington, which has a higher rank than the bigram T. Washington.</Paragraph>
      <Paragraph position="1"> These results suggest that it would be worthwhile to examine how well the method succeeds at ranking n-grams of varying lengths, though the limitation of the current evaluation set to bigrams and trigrams prevented a full evaluation of its effectiveness across n-grams of varying length.</Paragraph>
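The check described above (the share of trigrams among the top-ranked n-grams) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: rank_by_score and trigram_share are hypothetical helper names, n-grams are represented as tuples of words, and the scores in toy_scores are invented placeholders chosen only to reproduce the ordering reported for the Booker T. Washington example.

```python
def rank_by_score(scores):
    """Return n-grams ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def trigram_share(ranked, k):
    """Fraction of the top-k ranked n-grams that are trigrams."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for ngram in top if len(ngram) == 3) / len(top)


# Invented placeholder scores (NOT values from the paper); only the
# relative order matches the example in the text.
toy_scores = {
    ("Booker", "T."): 3.0,
    ("Booker", "T.", "Washington"): 2.0,
    ("T.", "Washington"): 1.0,
}

ranked = rank_by_score(toy_scores)
share_top_2 = trigram_share(ranked, 2)
```

On a real ranking, calling trigram_share(ranked, 10000) would give the kind of figure the text compares against the share of trigrams in the gold standard.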
      <Paragraph position="2"> The results of this study appear to support the conclusion that the Mutual Rank Ratio performs notably better than other association measures on this task. Its performance remains superior to that of the next-best measure even when N is set as low as 5 (0.110, compared to 0.073 for Mutual Expectation, 0.063 for true mutual information, and less than 0.05 for all other metrics). While this score is still fairly low, it indicates that the measure performs relatively well even when large numbers of low-probability n-grams are included. An examination of the n-best list for the Mutual Rank Ratio at N=5 supports this contention.</Paragraph>
      <Paragraph position="3"> The top 10 bigrams are:</Paragraph>
    </Section>
  </Section>
</Paper>