The Automatic Acquisition of Frequencies of Verb Subcategorization Frames from Tagged Corpora

3 Experiment on Wall Street Journal Corpus

We used the above method in experiments involving a tagged corpus of Wall Street Journal (WSJ) articles, provided by the Penn Treebank project. Our experiment was limited in two senses. First, we treated all prepositional phrases as adjuncts. (It is generally difficult to distinguish complement and adjunct PPs.) Second, we measured the frequencies of only six fixed subcat frames for verbs in non-participle form. (This does not represent an essential shortcoming in the method; we only need additional subcat frame extraction rules to accommodate participles.) We extracted two sets of tagged sentences from the WSJ corpus, each representing 3 MBytes and approximately 300,000 words of text. One set was used as a training corpus, the other as a test corpus. Table 2 gives the list of verb-subcat frame extraction rules obtained (via examination) for the four verbs "expect", "reflect", "tell", and "give", as they occurred in the training corpus. Sample sentences that can be captured by each set of rules are attached to the list. Table 3 shows the result of the hand comparison of the automatically identified verb-subcat frames for "give" and "expect" in the test corpus. The columns give actual frequencies for each verb-subcat frame based on manual review, and the rows give the frequencies as determined automatically by the system. The count in each cell [i, j] gives the number of occurrences of the verb that were assigned the i-th subcat frame by the system and the j-th frame by manual review. The row/column labeled "REST" represents all other subcat frames, encompassing such subcat frames as those involving wh-clauses, verb-particle combinations (such as "give up"), and no complements.

Despite the simplicity of the rules, the frequencies for subcat frames determined under automatic processing are very close to the real distributions. Most of the errors are attributable to errors in the noun phrase parser. For example, 10 out of the 13 errors in the [NP, NP+NP] cell under "give" are due to noun phrase parsing errors such as the misidentification of an N-N sequence (e.g., *"give [NP government officials rights] against the press" vs. "give [NP government officials] [NP rights] against the press").

The six subcat frames are: 1. NP+NP, 2. NP+CL, 3. NP+INF, 4. CL, 5. NP, 6. INF.
Sample sentences captured by each frame's extraction rules:

Frame 1 (NP+NP): "...gives current management enough time to work on..."
Frame 2 (NP+CL): "...tell the people in the hall that..."; "...told him the man would..."
Frame 3 (NP+INF): "...expected the impact from the restructuring to make..."
Frame 4 (CL): "...think that..."; "...thought the company eventually responded..."
Frame 5 (NP): "...saw the man..."; "...which the president of the company wanted..."; but not "...saw him swim...", "...(hotel) in which he stayed...", or "...(gift) which he expected to get..."
Frame 6 (INF): "...expects to gain..."

To measure the total accuracy of the system, we randomly chose 33 verbs from the 300 most frequent verbs in the test corpus (given in Table 4), automatically estimated the subcat frames for each occurrence of these verbs in the test corpus, and compared the results to manually determined subcat frames.

The overall results are quite promising. The total number of occurrences of the 33 verbs in the test corpus (excluding participle forms) is 2,242. Of these, 1,933 were assigned correct subcat frames by the system. (The 'correct'-assignment counts always appear in the diagonal cells of a comparison table such as Table 3.) This indicates an overall accuracy for the method of 86%.

If we exclude the subcat frame "REST" from our statistics, the total number of occurrences of the 33 verbs in one of the six subcat frames is 1,565. Of these, 1,311 were assigned correct subcat frames by the system. This represents 83% accuracy.

For 30 of the 33 verbs, both the first and the second (if any) most frequent subcat frames as determined by the system were correct. For all of the verbs except one ("need"), the most frequent frame was correct.

Figure 1 is a histogram showing the number of verbs within each error-rate zone. In computing the error rate, we divide the total off-diagonal cell counts, excluding the counts in the "REST" column, by the total cell counts, again excluding the "REST" column margin. Thus, the off-diagonal cell counts in the "REST" row, representing instances where one of the six actual subcat frames was misidentified as "REST", are counted as errors. This formula, in general, gives higher error rates than would result from simply dividing the off-diagonal cell counts by the total cell counts.

Overall, the most frequent source of errors, again, was errors in noun phrase boundary detection. The second most frequent source was misidentification of infinitival 'purpose' clauses, as in "he used a crowbar to open the door". "To open the door" is a 'purpose' adjunct modifying either the verb phrase "used a crowbar" or the main clause "he used a crowbar". But such adjuncts are incorrectly judged to be complements of their main verbs by the subcat frame extraction rules in Table 2.
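The error-rate formula above can be made concrete with a minimal sketch, assuming (for illustration only) that the comparison table is held in a NumPy array whose rows are indexed by system output, whose columns are indexed by manual judgement, and whose last row and column are "REST":

```python
import numpy as np

def error_rate(table: np.ndarray) -> float:
    """Error rate over a comparison table (rows: system output,
    columns: manual judgement, last row/column: "REST").

    Off-diagonal counts outside the "REST" column, divided by all
    counts outside the "REST" column margin; misidentifications of
    a real frame as "REST" therefore count as errors."""
    no_rest_col = table[:, :-1]       # exclude the "REST" column margin
    total = no_rest_col.sum()
    correct = np.trace(no_rest_col)   # diagonal cells for the six frames
    return float(total - correct) / float(total)
```

Because the "REST" row is kept while the "REST" column is dropped, this quotient is at least as large as the plain off-diagonal fraction over the full table, as the text notes.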
In formulating the rules, we assumed that a 'purpose' adjunct appears effectively randomly and much less frequently than infinitival complements. This is true for our corpus in general, but some verbs, such as "use" and "need", appear relatively frequently with 'purpose' infinitivals. In addition to errors from parsing and 'purpose' infinitives, we observed several other, less frequent types of errors. These, too, pattern with specific verbs and do not occur randomly across verbs.

4 Statistical Analysis

For most of the verbs in the experiment, our method provides a good measure of subcat frame frequencies. However, some of the verbs seem to appear in syntactic structures that cannot be captured by our inventory of subcat frames. For example, "need" is frequently used in relative clauses without relative pronouns, as in "the last thing they need". Since this kind of relative clause cannot be captured by the rules in Table 2, each occurrence of such a relative clause causes an error in measurement. It is likely that there are many other classes of verbs with distinctive syntactic preferences. If we try to add rules for each such class, it will become increasingly difficult to write rules that affect only the target class and to eliminate undesirable rule interactions.

In the following sections, we describe a statistical method which, based on a set of training samples, enables the system to learn patterns of errors and substantially increase the accuracy of estimated verb-subcat frequencies.

4.1 General Scheme

The method described in Section 2 is wholly deterministic; it depends only on one set of subcat extraction rules, which serve as filters. Instead of treating the system output for each verb token as an estimated subcat frame, we can think of the output as one feature associated with the occurrence of the verb. This single feature can be combined, statistically, with other features in the corpus to yield more accurate characterizations of verb contexts and more accurate subcat-frame frequency estimates. If the other features are capturable via regular-expression rules, they can also be automatically detected in the manner described in Section 2. For example, main verbs in relative clauses without relative pronouns may have a higher probability of having the feature "nnk", i.e., "(NP)(NP)(VERB)".

More formally, let Y be a response variable taking as its value a subcat frame. Let X1, X2, ..., XN be explanatory variables. Each Xi is associated with a feature expressed by one regular expression or by a set of regular expressions. If a feature is expressed by one regular expression R, the value of the feature is 1 if the occurrence of the verb matches R and 0 otherwise. If the feature is expressed by a set of regular expressions, its value is the label of the regular expression that the occurrence of the verb matches.
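The following sketch illustrates how such feature variables might be detected. The context encoding and the regular expressions here are hypothetical stand-ins for illustration, not the actual rules of Table 2:

```python
import re

# Hypothetical encoding: each verb occurrence is represented by a string
# of category symbols for its context, e.g. "nnv" for (NP)(NP)(VERB).
RELATIVE_CLAUSE = re.compile(r"nnv")   # a single regular expression R

FRAME_RULES = [                        # a set of labeled regular expressions
    ("NP+NP", re.compile(r"vnn")),
    ("NP",    re.compile(r"vn")),
]

def binary_feature(context: str) -> int:
    """1 if the occurrence matches R, 0 otherwise."""
    return 1 if RELATIVE_CLAUSE.search(context) else 0

def labeled_feature(context: str) -> str:
    """The label of the regular expression the occurrence matches,
    with "REST" as the catch-all value."""
    for label, pattern in FRAME_RULES:
        if pattern.search(context):
            return label
    return "REST"
```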
The set of regular expressions in Table 2 can therefore be considered to characterize one explanatory variable whose value ranges over (NP+NP) through (REST).

Now, we assume that a training corpus is available in which all verb tokens are given along with their subcat frames. By running our system on the training corpus, we can automatically generate an (N + 1)-dimensional contingency table. Table 3 is an example of a 2-dimensional contingency table with X = <OUTPUT OF SYSTEM> and Y = <REAL OCCURRENCES>. Using loglinear models [Agresti, 1990], we can derive fitted values for each cell in the (N + 1)-dimensional contingency table. In the case of a saturated model, in which all interactions of variables up to (N + 1)-way interactions are included, the raw cell counts are the Maximum Likelihood solution. The fitted values are then used to estimate the subcat frame frequencies of a new corpus as follows.

First, the system is run on the new corpus to obtain an N-dimensional contingency table. This table is considered to be an X1-X2-...-XN marginal table. What we are aiming at is the Y margins, which represent the real subcat frame frequencies of the new corpus. Assuming that the training corpus and the new corpus are homogeneous (e.g., reflecting similar sub-domains or samples of a common domain), we estimate the Y margins using Bayes' theorem on the fitted values of the training corpus as follows:

\[
\hat{n}_{+\cdots+\,j} \;=\; \sum_{i_1,\ldots,i_N} \hat{n}_{i_1 i_2 \cdots i_N +} \,\frac{\hat{M}_{i_1 i_2 \cdots i_N\, j}}{\sum_{k} \hat{M}_{i_1 i_2 \cdots i_N\, k}}
\]

where \(\hat{n}_{i_1 i_2 \cdots i_N +}\) is the cell count of the X1-X2-...-XN marginal table of the new corpus obtained as the system output, and \(\hat{M}_{i_1 i_2 \cdots i_N\, j}\) is the fitted value of the (N + 1)-dimensional contingency table of the training corpus based on a particular loglinear model.

4.2 Lexical Heuristics

The simplest application of the above method is to use a 2-way contingency table, as in Table 3. There are two possibilities to explore in constructing a 2-way contingency table. One is to sum up the cell counts of all the verbs in the training corpus and produce a single (large) general table. The other is to construct a table for each verb. Obviously the former approach is preferable if it works. Unfortunately, such a table is typically too general to be useful; the estimated frequencies based on it are less accurate than the raw system output. This is because the sources of errors, viz., the distribution of off-diagonal cell counts of 2-way contingency tables, differ considerably from verb to verb. The latter approach is problematic if we have to make such a table for each domain. However, if we have a training corpus in one domain, and if the heuristics for each verb extracted from the training corpus are also applicable to other domains, the approach may work.

To test the latter possibility, we constructed a contingency table for the verb that was most problematic (least accurately estimated) among the 33 verbs: "need". Note that we are using the test corpus described in Section 3 as a training corpus here, because we already know both the measured frequency and the hand-judged frequency of "need", which are necessary to construct a contingency table. The total number of occurrences of this verb was 75.
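For the 2-way case used here (a single explanatory variable, the system output, as in Table 3), the estimation formula of Section 4.1 reduces to distributing each system-output count from the new corpus over the true frames in proportion to the corresponding fitted row of the training table. A minimal sketch under that reading, with NumPy arrays as an assumed representation:

```python
import numpy as np

def estimate_y_margins(m_fit: np.ndarray, n_new: np.ndarray) -> np.ndarray:
    """Estimate true-frame frequencies (Y margins) of a new corpus.

    m_fit[i, j]: fitted training count for system output i, true frame j.
    n_new[i]:    count of system output i observed on the new corpus.

    Each n_new[i] is spread over the frames j with weight
    m_fit[i, j] / sum_k m_fit[i, k], i.e. the fitted P(Y = j | X = i)."""
    cond = m_fit / m_fit.sum(axis=1, keepdims=True)  # P(Y = j | X = i)
    return n_new @ cond                              # estimated Y margins
```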
To smooth the table, 0.1 is added to all the cell counts. As new test corpora, we extracted another 300,000 words of tagged text from the WSJ corpus (labeled "W3") and also three sets of 300,000 words of tagged text from the Brown corpus (labeled "B1", "B2", and "B3"), as retagged under the Penn Treebank tagset. All the training and test corpora were reviewed and judged by hand.

Table 5 gives the frequency distributions based on the system output, hand judgement, and statistical analysis. (As before, we take the hand judgement to be the gold standard, the actual frequency of a particular frame.) After the Y margins are statistically estimated, estimated Y values less than 1.0 are truncated to 0. (These are considered to have appeared due to the smoothing.) In all of the test corpora, the method gives very accurate frequency distribution estimates. Big gaps between the automatically measured and manually determined frequencies of "NP" and "REST" are shown to be substantially reduced through the use of statistical estimation. This result is especially encouraging because the heuristics obtained in one domain are shown to be applicable to a considerably different domain. Furthermore, by combining more feature sets and making use of multi-dimensional analysis, we can expect to obtain more accurate estimations.
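The end-to-end computation for this section, with the smoothing and truncation steps included, can be sketched as follows; the cell counts below are placeholders for illustration, not the actual table for "need":

```python
import numpy as np

# Placeholder 2-way table for one verb (rows: system output,
# columns: hand judgement), smoothed by adding 0.1 to every cell.
m_train = np.array([[52.0, 3.0],
                    [6.0, 14.0]]) + 0.1

n_new = np.array([40.0, 35.0])   # system-output counts on a new corpus

cond = m_train / m_train.sum(axis=1, keepdims=True)  # fitted P(Y = j | X = i)
y_hat = n_new @ cond                                 # estimated Y margins
y_hat[y_hat < 1.0] = 0.0   # truncate values below 1.0 (smoothing artifacts)
print(y_hat)
```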