<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1008"> <Title>Bootstrapping Deep Lexical Resources: Resources for Courses</Title> <Section position="7" start_page="72" end_page="75" type="evalu"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> We evaluate the component methods over the 5,675 open-class lexical items of the ERG described in Section 2.1 using 10-fold stratified cross-validation.</Paragraph> <Paragraph position="1"> In each case, we calculate the type precision (the proportion of correct hypothesised lexical entries) and type recall (the proportion of gold-standard lexical entries for which we get a correct hit), which we roll together into the type F-score (the harmonic mean of the two) relative to the gold-standard ERG lexicon. We also measure the token accuracy for the lexicon derived from each method, relative to the Redwoods treebank of Verbmobil data associated with the ERG (see Section 2.1).10 The token accuracy represents a weighted version of type precision, relative to the distribution of each lexical item in a representative text sample, and provides a crude approximation of the impact of each DLA method on parser coverage. That is, it gives more credit to a method for having correctly hypothesised a commonly-occurring lexical item than a low-frequency lexical item, and no credit for having correctly identified a lexical item not occurring in the corpus.</Paragraph> <Paragraph position="2"> The overall results are presented in Figure 1, and are then broken down into the four open word classes in Figures 2-5. 
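The type-level metrics and the frequency-weighted token accuracy described above can be sketched as follows. This is a minimal illustration with toy data: the set-based formulation, the example lexical entries, and the treatment of token accuracy as frequency-weighted type precision are our own reading of the description, not code from the paper.

```python
def type_prf(hypothesised, gold):
    """Type precision/recall/F-score over sets of (lexeme, lexical type) entries."""
    correct = hypothesised & gold
    p = len(correct) / len(hypothesised)
    r = len(correct) / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of p and r
    return p, r, f

def token_accuracy(hypothesised, gold, corpus_freq):
    """Frequency-weighted type precision: each correctly hypothesised entry is
    credited by its corpus frequency; entries absent from the corpus get no credit."""
    total = sum(corpus_freq.values())
    hit = sum(corpus_freq.get(entry, 0) for entry in (hypothesised & gold))
    return hit / total

# Toy lexicons (hypothetical entries, not drawn from the ERG gold standard)
gold = {("dog", "n_intr_le"), ("run", "v_np_trans_le"), ("red", "adj_intrans_le")}
hyp = {("dog", "n_intr_le"), ("run", "v_np_trans_le"), ("blue", "adj_intrans_le")}
freq = {("dog", "n_intr_le"): 50, ("run", "v_np_trans_le"): 30,
        ("red", "adj_intrans_le"): 20}

p, r, f = type_prf(hyp, gold)           # p = r = f = 2/3
acc = token_accuracy(hyp, gold, freq)   # (50 + 30) / 100 = 0.8
```

Note how a high-frequency miss ("dog") would hurt token accuracy far more than a rare one, which is exactly the weighting the evaluation is after.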
The baseline method (Base) in each case is a simple majority-class classifier, which generates a unique lexical item for each lexeme pre-identified as belonging to a given word class, according to the following mapping:
Word class: Majority-class lexical type
Noun: n_intr_le
Verb: v_np_trans_le
Adjective: adj_intrans_le
Adverb: adv_int_vp_le
10 Note that the token accuracy is calculated only over the open-class lexical items, not the full ERG lexicon.</Paragraph> <Paragraph position="3"> In each graph, we present the type F-score and token accuracy for each method, and mark the best-performing method in terms of each of these evaluation measures with a star. The results for syntax-based DLA (SPOS, SCHUNK and SPARSE) are based on the BNC in each case. We return to investigate the impact of corpus size on the performance of the syntax-based methods below.</Paragraph> <Paragraph position="4"> Looking first at the combined results over all lexical types (Figure 1), the most successful method in terms of type F-score is syntax-based DLA, with chunker-based preprocessing marginally outperforming tagger- and parser-based preprocessing (type F-score = 0.641). The most successful method in terms of token accuracy is ontology-based DLA (token accuracy = 0.544).</Paragraph> <Paragraph position="5"> The figures for token accuracy require some qualification: ontology-based DLA tends to be liberal in its generation of lexical items, giving rise to over 20% more lexical items than the other methods (7,307 vs. 5,000-6,000) and proportionately low type precision. This correlates with an inherent advantage in terms of token accuracy, which we have no way of balancing out in our token-based evaluation, as the treebank data offers no insight into the true worth of false negative lexical items (i.e. 
we have no way of distinguishing unobserved lexical items which are plain wrong from those which are intuitively correct and could be expected to occur in alternate sets of treebank data). We leave investigation of the impact of these extra lexical items on overall parser performance (in terms of chart complexity and parse selection) as an item for future research.</Paragraph> <Paragraph position="6"> The morphology-based DLA methods performed at around baseline level overall, with character n-grams marginally more successful than derivational morphology in terms of both type F-score and token accuracy.</Paragraph> <Paragraph position="7"> Turning next to the results for the proposed methods over nouns, verbs, adjectives and adverbs (Figures 2-5, respectively), we observe some interesting effects. First, morphology-based DLA hovers around baseline performance for all word classes except adjectives, where character n-grams produce the highest F-score of all methods, and nouns, where derivational morphology seems to aid DLA slightly (providing weak support for our original hypothesis in Section 3.2 relating to deverbal nouns and affixation). [Figure legend: Base = baseline, MCHAR = morphology-based DLA with character n-grams, MDERIV = derivational morphology-based DLA, SPOS = syntax-based DLA with POS tagging, SCHUNK = syntax-based DLA with chunking, SPARSE = syntax-based DLA with dependency parsing, and Ont = ontology-based DLA; caption fragment: acquisition methods over corpora of differing size.] Syntax-based DLA leads to the highest type F-score for nouns, verbs and adverbs, and the highest token accuracy for adjectives and adverbs. 
The differential in results between syntax-based DLA and the other methods is particularly striking for adverbs, with a maximum type F-score of 0.544 (for chunker-based preprocessing) and token accuracy of 0.340 (for tagger-based preprocessing), as compared to baseline figures of 0.471 and 0.017 respectively.</Paragraph> <Paragraph position="8"> There is relatively little separating the three styles of preprocessing in syntax-based DLA, although chunker-based preprocessing tends to have a slight edge in terms of type F-score, and tagger-based preprocessing generally produces the highest token accuracy.11 This suggests that access to a POS tagger for a given language is sufficient to make syntax-based DLA work, and that syntax-based DLA thus has moderately high applicability across languages of different densities.</Paragraph> <Paragraph position="9"> Ontology-based DLA is below baseline in terms of type F-score for all word classes, but results in the highest token accuracy of all methods for nouns and verbs (although this finding must be taken with a grain of salt, as noted above).</Paragraph> <Paragraph position="10"> Another noteworthy feature of Figures 2-5 is the huge variation in absolute performance across the word classes: adjectives are very predictable, with a majority class-based baseline type F-score of 0.832 and token accuracy of 0.847; adverbs, on the other hand, are similar to verbs and nouns in terms of their baseline type F-score (at 0.471), but the adverbs that occur commonly in corpus data appear to belong to less-populated lexical types (as seen in the minuscule baseline token accuracy of 0.017). Nouns appear the hardest to learn in terms of the relative increment in token accuracy over the baseline. 
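The baseline figures these comparisons are measured against come from the majority-class classifier introduced at the start of this section, which maps every lexeme of a word class to that class's single most populous lexical type. A minimal sketch (the function and lexeme examples are our own illustration; the type names follow the baseline table above):

```python
# Majority-class baseline: one fixed lexical type per open word class,
# ignoring all morphological, syntactic and ontological evidence.
MAJORITY_TYPE = {
    "noun": "n_intr_le",
    "verb": "v_np_trans_le",
    "adjective": "adj_intrans_le",
    "adverb": "adv_int_vp_le",
}

def baseline_lexical_entry(lexeme, word_class):
    """Hypothesise a single lexical entry for a lexeme whose open word
    class has been pre-identified."""
    return (lexeme, MAJORITY_TYPE[word_class])

baseline_lexical_entry("quickly", "adverb")  # ("quickly", "adv_int_vp_le")
```

Because every lexeme of a class gets the same type, the baseline's token accuracy depends entirely on how much corpus probability mass the majority type covers, which is why it collapses to 0.017 for adverbs while reaching 0.847 for adjectives.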
Verbs are extremely difficult to get right at the type level, but ontology-based DLA appears highly adept at getting the commonly-occurring lexical items right.</Paragraph> <Paragraph position="11"> To summarise these findings, adverbs seem to benefit the most from syntax-based DLA. Adjectives, on the other hand, can be learned most effectively from simple character n-grams, i.e. similarly-spelled adjectives tend to have similar syntax, a somewhat surprising finding. Nouns are surprisingly hard to learn, but seem to benefit to some degree from corpus data and also ontological similarity. Lastly, verbs pose a challenge to all methods at the type level, but ontology-based DLA seems to be able to correctly predict the commonly-occurring lexical entries.</Paragraph> <Paragraph position="12"> 11This trend was observed across all three corpora, although we do not present the full results here.</Paragraph> <Paragraph position="13"> Finally, we examine the impact of corpus size on the performance of syntax-based DLA with tagger-based preprocessing.12 In Figure 6, we chart the relative change in type F-score and token accuracy across the four word classes as we increase the corpus size (from 0.5m words to 1m and finally 100m words, in the form of the Brown corpus, WSJ corpus and BNC, respectively). For verbs and adjectives, there is almost no change in either type F-score or token accuracy when we increase the corpus size, whereas for nouns, the token accuracy actually drops slightly. For adverbs, on the other hand, the token accuracy jumps from 0.020 to 0.381 when we increase the corpus size from 1m words to 100m words, while the type F-score rises only slightly. It thus seems that large corpora have a considerable impact on DLA for commonly-occurring adverbs, but that for the remaining word classes, it makes little difference whether we have 0.5m or 100m words. 
This can be interpreted either as evidence that modestly-sized corpora are good enough to perform syntax-based DLA over (which would be excellent news for low-density languages!), or alternatively as evidence that, for the simplistic syntax-based DLA methods proposed here, more corpus data is not the route to higher performance.</Paragraph> <Paragraph position="14"> Returning to our original question of the &quot;bang for the buck&quot; associated with individual LRs, there seems to be no simple answer: simple word lists are useful in learning the syntax of adjectives in particular, but offer little in terms of learning the other three word classes. Morphological lexicons with derivational information are moderately advantageous in learning the syntax of nouns, but little else. A POS tagger seems sufficient to carry out syntax-based DLA, and adverbs are the word class which benefits most from larger amounts of corpus data; otherwise, the proposed syntax-based DLA methods do not seem to benefit from larger corpora. Ontologies have the greatest impact on verbs and, to a lesser degree, nouns. Ultimately, this seems to lend weight to a &quot;horses for courses&quot;, or perhaps &quot;resources for courses&quot;, approach to DLA.</Paragraph> <Paragraph position="15"> 12The results for chunker- and parser-based preprocessing are almost identical, and are thus omitted from the paper.</Paragraph> </Section> </Paper>