<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1047"> <Title>Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Our Methodology </SectionTitle> <Paragraph position="0"> The first step in the application of our methodology is the production of a treebank annotated with LFG f-structure information. F-structures are feature structures which represent abstract syntactic information, approximating basic predicate-argument-modifier structures. We utilise the automatic annotation algorithm of (Cahill et al., 2004b) to derive a version of Penn-II in which each node in each tree is annotated with an LFG functional annotation (i.e. an attribute-value structure equation).</Paragraph> <Paragraph position="1"> Trees are traversed top-down, and annotation is driven by categorial, basic configurational, trace and Penn-II functional tag information in local subtrees of mostly depth one (i.e. CFG rules). The annotation procedure depends on locating the head daughter, for which the head-finding scheme of (Magerman, 1994), with some changes and amendments, is used. The head is annotated with the LFG equation ↑=↓. Linguistic generalisations are provided over the left (the prefix) and the right (the suffix) context of the head for each syntactic category occurring as the mother node of such heads. To give a simple example, the rightmost NP to the left of a VP head under an S is likely to be its subject (↑SUBJ=↓), while the leftmost NP to the right of the V head of a VP is most probably its object (↑OBJ=↓). (Cahill et al., 2004b) provide four sets of annotation principles: one for non-coordinate configurations, one for coordinate configurations, one for traces (long distance dependencies) and a final 'catch all and clean up' phase.</Paragraph> <Paragraph position="2"> Distinguishing between arguments and adjuncts is an inherent step in the automatic assignment of functional annotations.</Paragraph> <Paragraph position="3"> The satisfactory treatment of long distance dependencies by the annotation algorithm is imperative for the extraction of accurate semantic forms. The Penn Treebank employs a rich arsenal of traces and empty productions (nodes which do not realise any lexical material) to co-index displaced material with the position where it should be interpreted semantically. The algorithm of (Cahill et al., 2004b) translates the traces into corresponding re-entrancies in the f-structure representation (Figure 1). Passive movement is also captured and expressed at f-structure level using a passive:+ annotation. Once a treebank tree is annotated with feature structure equations by the annotation algorithm, the equations are collected and passed to a constraint solver which produces the f-structures.</Paragraph> <Paragraph position="4"> In order to ensure the quality of the semantic forms extracted by our method, we must first ensure the quality of the f-structure annotations.</Paragraph> <Paragraph position="5"> (Cahill et al., 2004b) measure annotation quality in terms of precision and recall against manually constructed, gold-standard f-structures for 105 randomly selected trees from section 23 of the WSJ section of Penn-II. The algorithm currently achieves an F-score of 96.3% for complete f-structures and 93.6% for preds-only f-structures.[1] ([1] Preds-only measures only paths ending in PRED:VALUE, so features such as number and person are not included.)</Paragraph>
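To make the annotation step concrete, the following minimal Python sketch mimics the two example generalisations above for a local subtree of depth one. The tree encoding, the function name annotate_local_tree and the ASCII rendering of the LFG metavariables ↑ and ↓ as "up" and "down" are our own illustrative assumptions; the actual algorithm of (Cahill et al., 2004b) covers many more configurations and categories.

# Sketch: annotate the daughters of a depth-one subtree (a CFG rule)
# with LFG equations, given the head daughter found by Magerman-style
# head rules. Encodings are illustrative, not the actual implementation.
def annotate_local_tree(mother, daughters, head_index):
    """mother: category of the mother node, e.g. "S";
    daughters: daughter categories, e.g. ["NP", "VP"];
    head_index: position of the head daughter."""
    annotations = ["down in up-ADJUNCT"] * len(daughters)  # default: adjunct
    annotations[head_index] = "up = down"                  # the head gets up=down
    # Rightmost NP to the left of a VP head under S: likely the subject.
    if mother == "S" and daughters[head_index] == "VP":
        for i in range(head_index - 1, -1, -1):
            if daughters[i] == "NP":
                annotations[i] = "up-SUBJ = down"
                break
    # Leftmost NP to the right of the V head of a VP: most probably the object.
    if mother == "VP" and daughters[head_index].startswith("V"):
        for i in range(head_index + 1, len(daughters)):
            if daughters[i] == "NP":
                annotations[i] = "up-OBJ = down"
                break
    return annotations

print(annotate_local_tree("S", ["NP", "VP"], 1))   # ['up-SUBJ = down', 'up = down']
print(annotate_local_tree("VP", ["V", "NP"], 0))   # ['up = down', 'up-OBJ = down']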
<Paragraph position="6"> Our semantic form extraction methodology is based on the procedure of (van Genabith et al., 1999): for each f-structure generated, and for each level of embedding, we determine the local PRED value and collect the subcategorisable grammatical functions present at that level of embedding. Consider the f-structure in Figure 1. From this we recursively extract the following non-empty semantic forms: say([subj,comp]), sign([subj,obj]). In effect, in both (van Genabith et al., 1999) and our approach, semantic forms are reverse-engineered from automatically generated f-structures for treebank trees. We extract the following subcategorisable syntactic functions: SUBJ, OBJ, OBJ2, OBLprep, OBL2prep, COMP, XCOMP and PART. Adjuncts (e.g. ADJ, APP, etc.) are not included in the semantic forms. PART is not a syntactic function in the strict sense, but we capture the relevant co-occurrence patterns of verbs and particles in the semantic forms. Just as OBL includes the prepositional head of the PP, PART includes the actual particle which occurs, e.g. add([subj,obj,part:up]).</Paragraph> <Paragraph position="7"> In the work presented here we substantially extend the approach of (van Genabith et al., 1999) as regards coverage, granularity and evaluation. First, we scale the approach of (van Genabith et al., 1999), which was proof of concept on 100 trees, to the full WSJ section of the Penn-II Treebank. Second, our approach fully reflects long distance dependencies, indicated in terms of traces in the Penn-II Treebank and corresponding re-entrancies at f-structure level. Third, in addition to abstract syntactic function-based subcategorisation frames, we compute frames for syntactic function-CFG category pairs, both for the verbal heads and their arguments, and also generate pure CFG-based subcat frames. Fourth, our method differentiates between frames captured for active and passive constructions. Fifth, our method associates conditional probabilities with frames.</Paragraph> <Paragraph position="8"> In contrast to much of the work reviewed in the previous section, our system is able to produce surface syntactic as well as abstract functional subcategorisation details. To incorporate CFG details into the extracted semantic forms, we add an extra feature to the generated f-structures, the value of which is the syntactic category of the PRED at each level of embedding. Exploiting this information, the extracted semantic form for the verb sign looks as follows: sign(v,[subj(np),obj(np)]).</Paragraph> <Paragraph position="9"> We have also extended the algorithm to deal with passive voice and its effect on subcategorisation behaviour. Consider Figure 2: not taking voice into account, the algorithm extracts an intransitive frame outlaw([subj]) for the transitive outlaw. To correct this, the extraction algorithm uses the feature-value pair passive:+, which appears in the f-structure at the level of embedding of the verb in question, to mark that predicate as occurring in the passive: outlaw([subj],p).</Paragraph>
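The extraction procedure can be pictured as a recursive walk over each f-structure. The following is a minimal Python sketch under the assumption that f-structures are available as nested dictionaries; the encoding and the function name extract_semantic_forms are our own illustrative choices, not those of the actual system.

# Sketch: collect pred([gf,...]) frames at every level of embedding,
# marking passive predicates with ",p" as described above.
SUBCAT_FUNCTIONS = ["subj", "obj", "obj2", "obl", "obl2", "comp", "xcomp", "part"]

def extract_semantic_forms(fstr, forms=None):
    if forms is None:
        forms = []
    if "pred" in fstr:
        args = [gf for gf in SUBCAT_FUNCTIONS if gf in fstr]
        if args:  # only non-empty semantic forms are kept
            suffix = ",p" if fstr.get("passive") == "+" else ""
            forms.append(f"{fstr['pred']}([{','.join(args)}]{suffix})")
    for value in fstr.values():
        if isinstance(value, dict):  # recurse into embedded f-structures
            extract_semantic_forms(value, forms)
    return forms

# Simplified f-structure in the spirit of Figure 1:
fs = {"pred": "say", "subj": {"pred": "official"},
      "comp": {"pred": "sign", "subj": {"pred": "president"},
               "obj": {"pred": "bill"}}}
print(extract_semantic_forms(fs))  # ['say([subj,comp])', 'sign([subj,obj])']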
<Paragraph position="11"> In order to estimate the likelihood of the co-occurrence of a predicate with a particular argument list, we compute conditional probabilities for subcategorisation frames based on the number of token occurrences in the corpus. Given a lemma l and an argument list s, the probability of s given l is estimated as:</Paragraph> <Paragraph position="12"> P(s | l) = count(l, s) / Σ_{i=1..n} count(l, s_i), where s_1, ..., s_n are the argument lists which occur with lemma l in the corpus.</Paragraph> <Paragraph position="13"> We use thresholding to filter possible error judgements by our system. Table 1 shows the attested semantic forms for the verb accept with their associated conditional probabilities. Note that were the distinction between active and passive not taken into account, the intransitive occurrence of accept would have been assigned an unmerited probability.</Paragraph>
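A minimal sketch of this estimation and thresholding step, assuming frame observations are given as (lemma, argument-list) token pairs; the function names and toy data are illustrative.

# Sketch: relative-frequency estimate of P(s | l) with relative thresholding.
from collections import Counter

def frame_probabilities(observations):
    """observations: list of (lemma, frame) token occurrences."""
    lemma_totals = Counter(lemma for lemma, _ in observations)
    pair_counts = Counter(observations)
    return {(l, s): c / lemma_totals[l] for (l, s), c in pair_counts.items()}

def apply_threshold(probs, cutoff=0.01):
    """Disregard frames whose conditional probability is <= cutoff (1% threshold)."""
    return {pair: p for pair, p in probs.items() if p > cutoff}

# Toy token occurrences of the lemma "accept":
obs = [("accept", "[subj,obj]")] * 99 + [("accept", "[subj]")]
print(apply_threshold(frame_probabilities(obs)))
# {('accept', '[subj,obj]'): 0.99} -- the 1% intransitive reading is filtered out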
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> We extract non-empty semantic forms[2] for 3586 verb lemmas and 10969 unique verbal semantic form types (a lemma followed by a non-empty argument list). Including prepositions associated with the OBLs and particles, this number rises to 14348, an average of 4.0 per lemma (Table 2). The number of unique frame types (without lemma) is 38 without specific prepositions and particles, and 577 with them (Table 3). F-structure annotations allow us to distinguish passive and active frames.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 COMLEX Evaluation </SectionTitle> <Paragraph position="0"> We evaluated our induced (verbal) semantic forms against the COMLEX resource. COMLEX defines 138 distinct verb frame types without the inclusion of specific prepositions or particles. The following is a sample entry for the verb reimburse, which gives an indication of its subcategorisation behaviour: (verb :orth "reimburse" :subc ((np-np) (np-pp :pval ("for")) (np))). For example, reimburse can occur with two noun phrases (NP-NP), a noun phrase and a prepositional phrase headed by "for" (NP-PP :PVAL ("for")) or a single noun phrase (NP). Note that the details of the subject noun phrase are not included in COMLEX frames.</Paragraph> <Paragraph position="1"> Each of the complement types which make up the value of the :SUBC feature is associated with a formal frame definition which looks as follows: (vp-frame np-np :cs ((np 2)(np 3)) :gs (:subject 1 :obj 2 :obj2 3) :ex "she asked him his name") The value of the :cs feature is the constituent structure of the subcategorisation frame, which lists the syntactic CF-PSG constituents in sequence. The value of the :gs feature is the grammatical structure, which indicates the functional role played by each of the CF-PSG constituents. The elements of the constituent structure are indexed and referenced in the :gs field. This mapping between constituent structure and functional structure makes the information contained in COMLEX suitable as an evaluation standard for the LFG semantic forms which we induce.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 COMLEX-LFG Mapping </SectionTitle> <Paragraph position="0"> We devised a common format for our induced semantic forms and those contained in COMLEX.</Paragraph> <Paragraph position="1"> This is summarised in Table 4. COMLEX does not distinguish between obliques and objects, so we converted OBJ_i to OBL_i as required. In addition, COMLEX does not explicitly differentiate between COMPs and XCOMPs, but it does encode control information for any COMPs which occur, thus allowing us to deduce the distinction automatically. The manually constructed COMLEX entries provided us with a gold standard against which we evaluated the automatically induced frames for the 2992 (active) verbs that both resources have in common.</Paragraph> <Paragraph position="2"> We use the computed conditional probabilities to set a threshold to filter the selection of semantic forms. As some verbs occur less frequently than others, we felt it was important to use a relative rather than an absolute threshold. For a threshold of 1%, we disregard any frames with a conditional probability of less than or equal to 0.01. We carried out the evaluation in a similar way to (Schulte im Walde, 2002), and the scale of our evaluation is comparable to hers.</Paragraph> <Paragraph position="3"> This allows us to make tentative comparisons between our respective results. The figures shown in Table 5 are the results of three different kinds of evaluation with the threshold set to 1% and 5%. The effect of the threshold increase is clear: Precision goes up for each of the experiments while Recall goes down.</Paragraph> <Paragraph position="4"> For Exp. 1, we excluded prepositional phrases entirely from the comparison, i.e. we assumed that PPs were adjunct material (e.g. [subj,obl:for] becomes [subj]). Our results are better for Precision than for Recall compared to Schulte im Walde (op cit.), who reports Precision of 74.53%, Recall of 69.74% and an F-score of 72.05%.</Paragraph> <Paragraph position="5"> Exp. 2 includes prepositional phrases but does not parameterise them for particular prepositions (e.g. [subj,obl:for] becomes [subj,obl]). While our figures for Recall are again lower, our results for Precision are considerably higher than those of Schulte im Walde (op cit.), who recorded Precision of 60.76%, Recall of 63.91% and an F-score of 62.30%.</Paragraph> <Paragraph position="6"> For Exp. 3, we used semantic forms which contained details of specific prepositions for any subcategorised prepositional phrase. Our Precision figures are again high (in comparison to 65.52% as recorded by (Schulte im Walde, 2002)), but our Recall figures are low. Exp. 3a is identical to Exp. 3, except that semantic forms also include the specific particle associated with each PART.</Paragraph> <Paragraph position="7"> There are a number of possible reasons for our low recall scores for Experiment 3 in Table 5. It is a well-documented fact (Briscoe and Carroll, 1997) that subcategorisation frames (and their frequencies) vary across domains. We have extracted frames from one domain (the WSJ), whereas COMLEX was built using examples from the San Jose Mercury News, the Brown Corpus, several literary works from the Library of America, scientific abstracts from the U.S. Department of Energy, and the WSJ. For this reason, it is likely to contain a greater variety of subcategorisation frames than our induced lexicon.</Paragraph>
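For concreteness, the comparisons underlying Experiments 1-3 can be sketched as follows; the frame encoding, the toy gold-standard data and the function names are our own illustrative assumptions, not the actual evaluation code.

# Sketch: normalise frames per experiment, then score induced frames
# against a gold-standard lexicon with precision, recall and F-score.
def normalise(frame, experiment):
    """frame: sequence of grammatical functions, e.g. ("subj", "obl:for")."""
    if experiment == 1:   # Exp. 1: treat PPs as adjuncts and drop them
        return tuple(gf for gf in frame if not gf.startswith("obl"))
    if experiment == 2:   # Exp. 2: keep PPs, drop the specific prepositions
        return tuple(gf.split(":")[0] for gf in frame)
    return tuple(frame)   # Exp. 3: keep specific prepositions

def evaluate(induced, gold, experiment):
    """induced, gold: dicts mapping lemma -> set of frames."""
    tp = fp = fn = 0
    for lemma in induced.keys() & gold.keys():
        ind = {normalise(f, experiment) for f in induced[lemma]}
        gld = {normalise(f, experiment) for f in gold[lemma]}
        tp += len(ind & gld)
        fp += len(ind - gld)
        fn += len(gld - ind)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

induced = {"reimburse": {("subj", "obj"), ("subj", "obj", "obl:for")}}
gold = {"reimburse": {("subj", "obj"), ("subj", "obj", "obl:for"),
                      ("subj", "obj", "obj2")}}
print(evaluate(induced, gold, 3))  # (1.0, 0.666..., 0.8)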
<Paragraph position="8"> It is also possible that, due to human error, COMLEX contains subcategorisation frames whose validity is in doubt. The aim of the COMLEX project was to construct as complete a set of subcategorisation frames as possible, even for infrequent verbs, and lexicographers were allowed to extrapolate from the citations found, a procedure which is bound to be less certain than the assignment of frames based entirely on attested examples. Our recall figure was particularly low in the case of evaluation using details of prepositions (Experiment 3). This can be accounted for by the fact that COMLEX errs on the side of overgeneration when it comes to preposition assignment. This is particularly true of directional prepositions: a list of 31 of these has been prepared and is assigned in its entirety by default to any verb which can potentially appear with any directional preposition. In a subsequent experiment, we incorporated this list of directional prepositions by default into our semantic form induction process, in the same way as the creators of COMLEX have done. Table 6 shows the results of this experiment. As expected, there is a significant improvement in the Recall figure, which is almost double the figures reported in Table 5 for Experiments 3 and 3a.</Paragraph> <Paragraph position="9"> Table 7 presents the results of our evaluation of the passive semantic forms we extract. It was carried out for the 1422 verbs which occur with passive frames and are shared by the induced lexicon and COMLEX. As COMLEX does not provide explicit passive entries, we applied Lexical Redundancy Rules (Kaplan and Bresnan, 1982) to convert the active COMLEX frames to their passive counterparts automatically. For example, the COMLEX entry see([subj,obj]) is converted to see([subj]). The resulting Precision is very high, a slight increase on that for the active frames. The Recall score drops for passive frames (from 54.7% to 29.3%), in a similar way to that for active frames when prepositional details are included.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Lexical Accession Rates </SectionTitle> <Paragraph position="0"> As well as evaluating the quality of our extracted semantic forms, we also examine the rate at which they are induced. (Charniak, 1996) and (Krotov et al., 1998) observed that treebank grammars (CFGs extracted from treebanks) are very large and grow with the size of the treebank. We were interested in discovering whether the acquisition of lexical material from the same data displays a similar propensity. Figure 3 displays the accession rates for the semantic forms induced by our method for sections 0-24 of the WSJ section of the Penn-II Treebank. When we do not distinguish semantic forms by CFG category, the accession rates both for all semantic forms and for verbal semantic forms alone are smaller than that of the PCFG.</Paragraph> <Paragraph position="1"> We also examined the coverage of our system in a similar way to (Hockenmaier et al., 2002). We extracted a verb-only reference lexicon from Sections 02-21 of the WSJ and subsequently compared this to a test lexicon constructed in the same way from Section 23. Table 8 shows the results of this experiment. 89.89% of the entries in the test lexicon appeared in the reference lexicon.</Paragraph> </Section> </Section> </Paper>