<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1071"> <Title>Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus</Title> <Section position="3" start_page="0" end_page="428" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> This paper presents a method for extracting subcorpora documenting different subcategorization frames for verbs, nouns, and adjectives in the 100 mio. word British National Corpus. The extraction tool consists of a set of batch files for use with the Corpus Query Processor (CQP), which is part of the IMS corpus workbench (cf. Christ 1994a,b).</Paragraph> <Paragraph position="1"> A macroprocessor has been developed that allows the user to specify in a simple input file which subcorpora are to be created for a given lemma.</Paragraph> <Paragraph position="2"> The resulting subcorpora can be used (1) to provide evidence for the subcategorization properties of a given lemma, and to facilitate the selection of corpus lines for lexicographic research, and (2) to determine the frequencies of different syntactic contexts of each lemma.</Paragraph> <Paragraph position="3"> Introduction A number of resources are available for obtaining subcategorization information, i.e. information on the types of syntactic complements associated with valence-bearing predicators (which include verbs, nouns, and adjectives). This information, also referred to as valence information is available both in machine-readable form, as in the COMLEX database (Macleod et al. 1995), and in human-readable dictionaries (e.g. Hornby 1989, Procter 1978, Sinclair 1987). Increasingly, tools are also becoming available for acquiring subcategorization information from corpora, i.e. for inferring the subcategorization frames of a given lemma (e.g. Manning 1993).</Paragraph> <Paragraph position="4"> None of these resources provide immediate access to corpus evidence, nor do they provide information on the relative frequency of the patterns that are listed for a given lemma.</Paragraph> <Paragraph position="5"> There is a need for a tool that can (1) find evidence for subcategorization patterns and (2) determine their frequencies in large corpora: 1. Statistical approaches to NLP rely on information not just on the range of combinatory possibilities of words, but also the relative frequencies of the expected patterns.</Paragraph> <Paragraph position="6"> 2. Dictionaries that list subcategorization frames often list expected patterns, rather than actual ones. Lexicographers and lexicologist need access to the evidence for this information.</Paragraph> <Paragraph position="7"> 3. Frequency information has come to be the focus of much psycholinguistic research on sentence processing (see for example MacDonald 1997). While information on word frequency is readily available (e.g. Francis and Kucera (1982)), there is as yet no easy way of obtaining information from large corpora on the relative frequency of complementation patterns.</Paragraph> <Paragraph position="8"> None of these points argue against the usefulness of the available resources, but they show that there is a gap in the available information. null To address this need, we have developed a tool for extracting evidence for subcategorization patterns from the 100 mio. word British National Corpus (BNC). 
</Section> <Section position="2" start_page="428" end_page="428" type="sub_section"> <SectionTitle> 1.2 Coverage </SectionTitle> <Paragraph position="0"> A list of the verb frames that are currently searchable is given in figure 1 below, along with an example of each pattern. The categories we are using are roughly based on those used in the COMLEX syntactic dictionary (Macleod et al. 1995).</Paragraph> <Paragraph position="1">
intransitive 'worms wiggle'
np 'kiss me'
np_np 'brought her flowers'
np_pp 'replaced it with a new one'
np_Pvping 'prevented him from leaving'
np_pwh 'asked her about what it all meant'
np_vpto 'advised her to go'
np_vping 'kept them laughing'
np_sfin 'told them (that) he was back'
np_wh 'asked him where the money was'
In our queries for nouns and adjectives as targets, we are able to extract prepositional, clausal, infinitival, and gerundial complements. In addition, the tool accommodates searches for compounds and for possessor phrases (my neighbor's addiction to cake, my milk allergy). Even though these categories are not tied to the syntactic subcategorization frames of the target lemmas, they often instantiate semantic arguments, or, more specifically, Frame elements (Fillmore 1982, Baker et al. forthcoming).</Paragraph> </Section> <Section position="3" start_page="428" end_page="429" type="sub_section"> <SectionTitle> 1.3 Method </SectionTitle> <Paragraph position="0"> We start by creating a subcorpus containing all concordance lines for a given lemma. We call this subcorpus a lemma-subcorpus. The extraction of smaller subcorpora from the lemma-subcorpus then proceeds in two stages.</Paragraph> <Paragraph position="1"> During the first stage, syntactic patterns involving 'displaced' arguments (i.e. 'left isolation' or 'movement' phenomena) are extracted, such as passives, tough movement, and constructions involving WH-extraction.</Paragraph> <Paragraph position="2"> The result of this procedure is a set of subcorpora that are homogeneous with respect to major constituent order. Following this, the remainder of the lemma-subcorpus is partitioned into subcorpora based on the subcategorization properties of the lemma in question.</Paragraph>
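<Paragraph> To give a flavour of the first stage, a minimal CQP sketch for simple (non-coordinated) passives of a target verb might look as follows; this is an illustration under stated assumptions (the CLAWS5 tagset of the BNC, a lemma attribute on each token, and cure as a hypothetical target lemma), not the tool's actual query, which is considerably more elaborate:

  [lemma="be"] [pos="XX0"]? [pos="AV0"]* [lemma="cure" & pos="VVN"] within s;

Here [lemma="be"] matches any form of be, the optional [pos="XX0"] token admits negation (not), [pos="AV0"]* allows intervening adverbs, and the final token requires the target lemma tagged as a past participle (VVN); the within s constraint keeps matches inside a single sentence. A query of this shape retrieves lines such as he was not entirely cured of his asthma.</Paragraph>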
<Paragraph position="3"> 1.3.2 Search strategies: positive and negative queries
For the extraction of certain subcategorization patterns, it is not necessary to simulate a parse of all of the constituents. Where an explicit context cue exists, a partial parse suffices. For example, the query given in figure 2 below is used to find [_ NP VPing] patterns (e.g. kept them laughing). Note that the query does not positively identify a noun phrase in the position between the target verb and the gerund.
1.3.3 Searches driven by subcategorization frames
Applying queries like the one for [_ NP VPing] &quot;blindly&quot;, i.e. in the absence of any information on the target lemma, would produce many false hits, since the query also matches gerunds that are not subcategorized.</Paragraph> <Paragraph position="4"> However, the information that the target verb subcategorizes for a gerund dramatically reduces the number of such errors.</Paragraph> <Paragraph position="5"> The same mechanism is used for addressing the problems associated with prepositional phrase attachment. The general principle is that prepositional phrases in certain contexts are considered to be embedded in a preceding noun phrase, unless the user specifies that a given preposition is subcategorized for by the target lemma. For example, the of-phrase in a sequence Verb - NP - of - NP is interpreted as part of the first NP (as in met the president of the company), unless we are dealing with a verb that has a [_NP PPof] subcategorization frame, e.g. cured the president of his asthma. The result of each query is subtracted from the lemma-subcorpus and the remainder is submitted to the next set of queries. As a result, earlier queries pre-empt later queries. For example, concordance lines matching the queries for passives, e.g. he was cured of his asthma, are filtered out early on in the process, so as to avoid their being matched by the queries dealing with (active intransitive) verb + prepositional phrase complements, such as he boasted of his achievements.</Paragraph> <Paragraph position="6"> Another example of this type of preemption concerns the interaction of the query for ditransitive frames (brought her flowers) with later queries for NP complements. A proper name immediately followed by another proper name (e.g. Henry James) is interpreted as a single noun phrase, except when the target lemma subcategorizes for a ditransitive frame.1 An analogous strategy is used for identifying noun compounds. For ditransitives, strings that represent two consecutive noun phrases are queried for first. Note that this method crucially relies on the fact that the subcategorization properties of the target lemma are given as the input to the query process.</Paragraph> <Paragraph position="7"> 1 Inevitably, this strategy fails in some cases, such as &quot;I'm reading Henry James now&quot; (vs. &quot;I read Henry stories&quot;).</Paragraph> </Section> </Section> <Section position="5" start_page="429" end_page="430" type="metho"> <SectionTitle> 2 Examples </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="429" end_page="429" type="sub_section"> <SectionTitle> 2.1 NPs </SectionTitle> <Paragraph position="0"> An example of a complex query expression of the kind we are using is given in figure 3. The expression matches noun phrases like &quot;the three kittens&quot;, &quot;poor Mr. Smith&quot;, &quot;all three&quot;, &quot;blue flowers&quot;, &quot;an unusually large hat&quot;, etc.</Paragraph>
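<Paragraph> Since the expression itself is complex, the following simplified CQP sketch indicates its general shape; it is a hedged approximation assuming the CLAWS5 tagset, not the full expression used by the tool:

  ( [pos="AT0|DT0|DPS"]? [pos="CRD"]* ( [pos="AV0"]? [pos="AJ0|AJC|AJS"] )* [pos="NN0|NN1|NN2"]+
  | [pos="AT0|DT0"]? [pos="AJ0"]* [pos="NP0"]+
  | [pos="DT0"] [pos="CRD"] ) ;

The first disjunct matches common-noun heads with an optional determiner or possessive, cardinals, and possibly adverbially modified adjectives (the three kittens, blue flowers, an unusually large hat); the second matches proper-noun heads with optional premodification (poor Mr. Smith); the third matches determiner-numeral sequences used pronominally (all three).</Paragraph>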
<Paragraph position="2"/> </Section> <Section position="2" start_page="429" end_page="430" type="sub_section"> <SectionTitle> 2.2 Coordinated passives </SectionTitle> <Paragraph position="0"> As an example of a query matching a 'movement' structure, consider the query for coordinated passives, given in figure 4 below. The leftmost column gives the query expression itself, while the other columns show concordance lines found by this query.</Paragraph> <Paragraph position="2"/> </Section> </Section> <Section position="6" start_page="430" end_page="430" type="metho"> <SectionTitle> 3 The macroprocessor </SectionTitle> <Paragraph position="0"> A macroprocessor has been developed that allows the user to specify in a simple input file which subcorpora are to be created for a given lemma.</Paragraph> <Paragraph position="1"> The macroprocessor reads the input file for the target lemma and writes the number of matches for each subcategorization pattern into an output file. A sample input file for the lemma insist is given in figure 5:
pp: (_list_ prepositions) on
ping: (_list_ prepositions) on
pwh: (_list_ prepositions) on
particle: (y/n) n
np_particle: (y/n) n
particle_pp: (y/n) n
particle_wh: (y/n) n
ap: (y/n) n
directquote: (y/n) y
sfin: (y/n) y
sbrst: (y/n) y
figure 5 Input form for macroprocessor</Paragraph> </Section> <Section position="7" start_page="430" end_page="430" type="metho"> <SectionTitle> 4 Output format </SectionTitle> <Paragraph position="0"> The subcorpora can be saved as binary files for further processing in CQP or XKWIC, an interactive corpus query tool (Christ 1994), and as text files. The text files are sorted, usually by the head of the first complement following the target lemma.</Paragraph> </Section> <Section position="8" start_page="430" end_page="431" type="metho"> <SectionTitle> 5 Limitations of the approach </SectionTitle> <Paragraph position="0"> Our tool relies on subcategorization information as its input. Hence it is not capable of automatically learning subcategorization frames, e.g. ones that are missing in dictionaries or omitted in the input file. The tool facilitates the (manual) discovery of evidence for new subcategorization frames, however, as potential complement patterns are saved in separate subcorpora. Indeed, this is one of the ways in which the tool is being used in the context of the FrameNet project.</Paragraph> <Paragraph position="1"> Some of the technical limitations of the existing tools result from the fact that we are working with an unparsed corpus. Thus, many types of 'null' or 'empty' constituents are not recognized by the queries. Ambiguities in prepositional phrase attachment are another major source of errors. For instance, of the concordance lines supposedly instantiating a [_NP PPwith] frame for the verb heal, several in fact contained embedded PPs (e.g. [_NP], as in heal [children with asthma], rather than [_NP PPwith], as in healing [arthritis] [with a crystal ball]). Finally, the search results can only be as accurate as the part-of-speech tags and other annotations in the corpus.</Paragraph> </Section> </Paper>