<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-3003">
<Title>Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks</Title>
<Section position="4" start_page="333" end_page="336" type="relat">
<SectionTitle> 3. Related Work </SectionTitle>
<Paragraph position="0"> The encoding of verb subcategorization properties is an essential step in the construction of computational lexicons for tasks such as parsing, generation, and machine translation. Creating such a resource by hand is time consuming and error prone, requires considerable linguistic expertise, and is rarely if ever complete. In addition, a hand-crafted lexicon cannot be easily adapted to specific domains or account for linguistic change. Accordingly, many researchers have attempted to construct lexicons automatically, especially for English. In this section, we discuss approaches to CFG-based subcategorization frame extraction as well as attempts to induce lexical resources which comply with specific linguistic theories or express information in terms of more abstract predicate-argument relations. The evaluation of these approaches is discussed in greater detail in Section 6, in which we compare our results with those reported elsewhere in the literature.</Paragraph>
<Paragraph position="1"> We divide the more general approaches to subcategorization frame acquisition into two groups: those which extract information from raw text and those which use preparsed and hand-corrected treebank data as their input. Typically, in the approaches based on raw text, a number of subcategorization patterns are predefined, a set of verb-subcategorization frame associations is hypothesized from the data, and statistical methods are applied to reliably select hypotheses for the final lexicon.</Paragraph>
<Paragraph position="2"> Brent (1993) relies on morphosyntactic cues in the untagged Brown corpus as indicators of six predefined subcategorization frames. The frames do not include details of specific prepositions. Brent uses hypothesis testing on binomial frequency data to statistically filter the induced frames. Ushioda et al. (1993) run a finite-state NP parser on a POS-tagged corpus to calculate the relative frequency of the same six verb subcategorization classes. The experiment is limited by the fact that all prepositional phrases are treated as adjuncts. Ushioda et al. (1993) employ an additional statistical method based on log-linear models and Bayes' theorem to filter the extra noise introduced by the parser, and were the first to induce relative frequencies for the extracted frames.</Paragraph>
<Paragraph position="3"> Manning (1993) attempts to improve on the approach of Brent (1993) by passing raw text through a stochastic tagger and a finite-state parser (which includes a set of simple rules for subcategorization frame recognition) in order to extract verbs and the constituents with which they co-occur. He assumes 19 different subcategorization frame definitions, and the extracted frames include details of specific prepositions. The extracted frames are noisy as a result of parser errors and so are filtered using the binomial hypothesis test (BHT), following Brent (1993). Applying his technique to approximately four million words of New York Times newswire, Manning acquired 4,900 verb-subcategorization frame pairs for 3,104 verbs, an average of 1.6 frames per verb.</Paragraph>
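<Paragraph> To make the statistical filtering step concrete, the following is a minimal Python sketch of a binomial hypothesis test of the kind applied by Brent (1993) and Manning (1993): a (verb, frame) hypothesis is retained if the observed number of frame cues would be unlikely under the null hypothesis that all co-occurrences are noise. The error rate p_err, the threshold alpha, and the function name are illustrative assumptions, not details taken from the cited work.

from math import comb

def bht_keep(k, n, p_err=0.05, alpha=0.05):
    """Retain a (verb, frame) hypothesis cued k times in n occurrences
    of the verb if that count is unlikely to arise from noise alone.

    p_err is the assumed chance rate of spurious cue co-occurrence;
    both defaults are illustrative, not values from the literature.
    """
    # One-tailed p-value: P(X >= k) for X ~ Binomial(n, p_err).
    p_value = sum(comb(n, m) * p_err ** m * (1.0 - p_err) ** (n - m)
                  for m in range(k, n + 1))
    return p_value < alpha

For example, bht_keep(4, 20) retains a frame cued in 4 of 20 verb occurrences, whereas bht_keep(1, 20) discards one cued only once. </Paragraph>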
<Paragraph position="4"> Briscoe and Carroll (1997) predefine 163 verbal subcategorization frames, obtained by manually merging the classes exemplified in the COMLEX (MacLeod, Grishman, and Meyers 1994) and ANLT (Boguraev et al. 1987) dictionaries and adding around 30 frames found by manual inspection. The frames incorporate control information and details of specific prepositions. Briscoe and Carroll (1997) refine the BHT with a priori information about the probabilities of subcategorization frame membership and use it to filter the induced frames. Recent work by Korhonen (2002) on the filtering phase of this approach uses linguistic verb classes (based on Levin [1993]) to obtain more accurate back-off estimates for hypothesis selection. Carroll and Rooth (1998) use a handwritten head-lexicalized context-free grammar and a text corpus to compute the probability of particular subcategorization patterns. The approach is iterative, with the aim of estimating the distribution of subcategorization frames associated with a particular predicate. They perform a mapping between their frames and those of the OALD, resulting in 15 frame types. These do not contain details of specific prepositions.</Paragraph>
<Paragraph position="5"> More recently, a number of researchers have applied similar techniques to automatically derive lexical resources for languages other than English. Schulte im Walde (2002a, 2002b) uses a head-lexicalized probabilistic context-free grammar similar to that of Carroll and Rooth (1998) to extract subcategorization frames from a large German newspaper corpus from the 1990s. She predefines 38 distinct frame types, each containing at most three arguments and made up of a combination of the following: nominative, dative, and accusative noun phrases; reflexive pronouns; prepositional phrases; expletive es; subordinated nonfinite clauses; subordinated finite clauses; and copula constructions. The frames may optionally contain details of particular prepositional use. Unsupervised training is performed on the corpus, and the resulting probabilistic grammar establishes the relevance of different frame types to a specific lexical head. Because of computing time constraints, Schulte im Walde limits sentence length for grammar training and parsing: sentences of between 5 and 10 words were used to bootstrap the lexicalized grammar model, and sentences of between 5 and 13 words were used for lexicalized training. The result is a subcategorization lexicon for over 14,000 German verbs. The extensive evaluation carried out by Schulte im Walde will be discussed in greater detail in Section 6.</Paragraph>
<Paragraph position="6"> Approaches using treebank-based data as a source for subcategorization information, such as ours, do not predefine the frames to be extracted but rather learn them from the data. Kinyon and Prolo (2002) describe a simple tool which uses fine-grained rules to identify the arguments of verb occurrences in the Penn-II Treebank. This is made possible by manual examination of more than 150 different sequences of syntactic and functional tags in the treebank. Each of these sequences was categorized as a modifier or argument, and arguments were then mapped to traditional syntactic functions. For example, the tag sequence NP-SBJ denotes a mandatory argument, and its syntactic function is subject. In general, argumenthood was preferred over adjuncthood.</Paragraph>
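<Paragraph> A small Python sketch illustrates the flavor of such a rule table; the entries below are hypothetical stand-ins for the more than 150 sequences Kinyon and Prolo examined, and the fallback encodes the stated preference for argumenthood.

# Hypothetical excerpt of a tag-sequence table: each Penn-II tag
# sequence is classified as argument or modifier and, where
# appropriate, mapped to a traditional syntactic function.
TAG_SEQUENCES = {
    "NP-SBJ": ("argument", "subject"),   # mandatory argument: subject
    "NP": ("argument", "object"),
    "PP-CLR": ("argument", "oblique"),   # "closely related" PP
    "ADVP-TMP": ("modifier", None),      # temporal adverbial
}

def classify(tag_sequence):
    # Unlisted sequences default to argumenthood, mirroring the
    # preference reported above.
    return TAG_SEQUENCES.get(tag_sequence, ("argument", None))
</Paragraph>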
<Paragraph position="7"> As Kinyon and Prolo (2002) do not include an evaluation, it is currently impossible to say how effective their technique is. Sarkar and Zeman (2000) present an approach to learning previously unknown frames for Czech from the Prague Dependency Treebank (Hajic 1998). Czech has a freer word order than English, so configurational information cannot be relied upon. In a dependency tree, the set of all dependents of the verb makes up a so-called observed frame, whereas a subcategorization frame contains a subset of the dependents in the observed frame. Finding subcategorization frames therefore involves filtering adjuncts from the observed frame. This is achieved using three different hypothesis tests: the BHT, the log-likelihood ratio, and the t-score. The system learns 137 subcategorization frames from 19,126 sentences for 914 verbs (those which occurred five times or more). Marinov and Hemming (2004) present preliminary work on the automatic extraction of subcategorization frames for Bulgarian from the BulTreeBank (Simov, Popova, and Osenova 2002). Like that of Sarkar and Zeman (2000), Marinov and Hemming's system collects both arguments and adjuncts; it then uses the binomial log-likelihood ratio to filter incorrect frames. The BulTreeBank trees are annotated with HPSG-typed feature structure information and thus contain more detail than the dependency trees. The work done for Bulgarian is small-scale, however, as Marinov and Hemming are working with a preliminary version of the treebank containing 580 sentences.</Paragraph>
<Paragraph position="8"> Work has also been carried out on the extraction of formalism-specific lexical resources from the Penn-II Treebank, in particular for TAG, CCG, and HPSG. As these formalisms are fully lexicalized with an invariant (LTAG and CCG) or limited (HPSG) rule component, the extraction of a lexicon essentially amounts to the creation of a grammar. Chen and Vijay-Shanker (2000) explore a number of related approaches to the extraction of a lexicalized TAG from the Penn-II Treebank with the aim of constructing a statistical model for parsing. The extraction procedure uses a head percolation table, as introduced by Magerman (1995), in combination with a variation of Collins's (1997) approach to differentiating between complements and adjuncts. This results in the construction of a set of lexically anchored elementary trees which make up the TAG in question.</Paragraph>
<Paragraph position="9"> The number of frame types extracted (i.e., elementary trees without a specific lexical anchor) ranged from 2,366 to 8,996. Xia (1999) presents a similar method for the extraction of a TAG from the Penn Treebank. The extraction procedure consists of three steps: First, the bracketing of the trees in the Penn Treebank is corrected and extended based on the approaches of Magerman (1994) and Collins (1997). Then the elementary trees are read off in a straightforward manner. Finally, any invalid elementary trees produced as a result of annotation errors in the treebank are filtered out using linguistic heuristics. The number of frame types extracted by Xia (1999) ranged from 3,014 to 6,099.</Paragraph>
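<Paragraph> The head percolation table underlying these TAG extraction procedures is simple to sketch: for each parent label it records a search direction and a priority list of daughter labels. The Python fragment below is a hypothetical excerpt; Magerman's actual table covers every Penn-II category with longer priority lists.

# Hypothetical head rules: (search direction, daughter priorities).
HEAD_RULES = {
    "S":  ("left-to-right", ["VP", "S"]),
    "VP": ("left-to-right", ["VBD", "VBN", "VBZ", "VB", "VBG", "VP"]),
    "NP": ("right-to-left", ["NN", "NNS", "NNP", "NP"]),
}

def head_child(label, daughter_labels):
    """Return the index of the head daughter of a treebank node."""
    direction, priorities = HEAD_RULES[label]
    order = (range(len(daughter_labels))
             if direction == "left-to-right"
             else range(len(daughter_labels) - 1, -1, -1))
    for wanted in priorities:
        for i in order:
            if daughter_labels[i] == wanted:
                return i
    # No listed label found: default to the first daughter searched.
    return 0 if direction == "left-to-right" else len(daughter_labels) - 1

For instance, head_child("VP", ["ADVP", "VBD", "NP"]) returns 1, selecting the finite verb as the head of the VP. </Paragraph>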
<Paragraph position="10"> Hockenmaier, Bierner, and Baldridge (2004) outline a method for the automatic extraction of a large syntactic CCG lexicon from the Penn-II Treebank. For each tree, the algorithm annotates the nodes with CCG categories in a top-down recursive manner.</Paragraph>
<Paragraph position="11"> The first step is to label each node as a head, complement, or adjunct based on the approaches of Magerman (1994) and Collins (1997). Each node is subsequently assigned the relevant category based on its constituent type and surface configuration.</Paragraph>
<Paragraph position="12"> The algorithm handles "like" coordination and exploits the traces used in the treebank in order to interpret LDDs. Unlike our approach, those of Xia (1999) and Hockenmaier, Bierner, and Baldridge (2004) include a substantial initial correction and clean-up of the Penn-II trees.</Paragraph>
<Paragraph position="13"> Miyao, Ninomiya, and Tsujii (2004) and Nakanishi, Miyao, and Tsujii (2004) describe a methodology for acquiring an English HPSG from the Penn-II Treebank.</Paragraph>
<Paragraph position="14"> Manually defined heuristics are used to automatically annotate each tree in the treebank with partially specified HPSG derivation trees: Head/argument/modifier distinctions are made for each node in the tree based on Magerman (1994) and Collins (1997); the whole tree is then converted to a binary tree; heuristics are applied to deal with phenomena such as LDDs and coordination and to correct some errors in the treebank; and, finally, an HPSG category is assigned to each node in the tree in accordance with its CFG category. In the next phase of the process (externalization), HPSG lexical entries are automatically extracted from the annotated trees through the application of "inverse schemata."</Paragraph>
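<Paragraph> One step of this conversion, the binarization of n-ary treebank nodes, is easy to make concrete. The Python sketch below performs a plain right-branching fold in which intermediate nodes reuse the parent label; this is a simplifying assumption, as the actual procedure binarizes with respect to the head daughter and assigns partially specified HPSG signs to the new nodes.

def binarize(node):
    """Convert an n-ary constituent into a binary tree.

    node is a (label, daughters) pair, where daughters are sub-nodes
    or word strings. Intermediate nodes here simply reuse the parent
    label; the real conversion carries partially specified HPSG signs.
    """
    label, daughters = node
    daughters = [binarize(d) if isinstance(d, tuple) else d
                 for d in daughters]
    while len(daughters) > 2:
        # Fold the two rightmost daughters into one intermediate node.
        daughters = daughters[:-2] + [(label, daughters[-2:])]
    return (label, daughters)

For example, binarize(("VP", ["gave", ("NP", ["him"]), ("NP", ["a", "book"])])) yields a VP whose two daughters are the verb and an intermediate VP node covering the two object NPs. </Paragraph>
</Section></Paper>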