<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1325"> <Title>Statistical Filtering and Subcategorization Frame Acquisition</Title> <Section position="5" start_page="201" end_page="203" type="evalu"> <SectionTitle> 3 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 3.1 Method </SectionTitle> <Paragraph position="0"> To evaluate the different approaches, we took a sample of 10 million words of the BNC corpus (Leech, 1992). We extracted all sentences containing an occurrence of one of fourteen verbs 3. The verbs were chosen at random, subject to the constraint that they exhibited multiple complementation patterns. After the extraction process, we retained 3000 citations, on average, for each verb. The sentences containing these verbs were processed by the SCF acquisition system, and we then applied the three filtering methods described above. We also obtained results for a baseline without any filtering.</Paragraph> <Paragraph position="1"> The results were evaluated against a manual analysis of corpus data. This was obtained by analysing up to a maximum of 300 occurrences of each of the 14 test verbs in the LOB (Garside et al., 1987), Susanne and SEC (Taylor and Knowles, 1988) corpora. We used the same manual analysis as Briscoe and Carroll, i.e. one drawn from the Susanne, LOB and SEC corpora; a manual analysis of the BNC data might produce better results, but since the BNC is a heterogeneous corpus we felt it was reasonable to test the data on a different corpus which is also heterogeneous. Following Briscoe and Carroll (1997), we calculated precision (the percentage of SCFs acquired which were also exemplified in the manual analysis) and recall (the percentage of SCFs exemplified in the manual analysis which were acquired automatically). We also combined precision and recall into a single measure of overall performance using the F measure (Manning and Schütze, 1999):</Paragraph> <Paragraph position="2"> F = 2 · precision · recall / (precision + recall) (5)</Paragraph> </Section> <Section position="2" start_page="201" end_page="201" type="sub_section"> <SectionTitle> 3.2 Results </SectionTitle> <Paragraph position="0"> Table 1 gives the raw results for the 14 verbs using each method. It shows the number of true positives (TP), false positives (FP), and false negatives (FN), as determined according to the manual analysis. The results for high frequency SCFs (above 0.01 relative frequency), medium frequency SCFs (between 0.001 and 0.01) and low frequency SCFs (below 0.001) are listed in the second, third and fourth columns respectively, and the final column gives the total results over all frequency ranges.</Paragraph> <Paragraph position="1"> Table 2 shows precision, recall and the F measure for the 14 verbs. We also provide baseline results for the case where all SCFs are accepted without filtering.</Paragraph> <Paragraph position="2"> The results given in tables 1 and 2 show that the MLE approach outperformed both hypothesis tests. For both BHT and LLR there was an increase in FNs at high frequencies, and an increase in FPs at medium and low frequencies, when compared to MLE. The number of errors was typically larger for LLR than for BHT.</Paragraph> <Paragraph position="3"> The hypothesis tests reduced the number of FNs at medium and low frequencies; however, this was countered by the substantial increase in FPs that they produced. While BHT nearly always acquired the three most frequent SCFs of each verb correctly, LLR tended to reject these.</Paragraph>
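To make the relationship between the counts in table 1 and the scores in table 2 concrete, the short sketch below computes precision, recall and the F measure of equation (5) from TP/FP/FN counts. The method names and the counts shown are illustrative placeholders, not figures taken from the tables.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F measure (equation 5) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Illustrative counts only (not the figures reported in table 1).
counts_by_method = {
    "baseline": {"tp": 200, "fp": 300, "fn": 20},
    "MLE":      {"tp": 180, "fp": 40,  "fn": 40},
    "BHT":      {"tp": 190, "fp": 150, "fn": 30},
    "LLR":      {"tp": 170, "fp": 180, "fn": 50},
}

for method, c in counts_by_method.items():
    p, r, f = precision_recall_f(c["tp"], c["fp"], c["fn"])
    print(f"{method:8s}  precision={p:.2f}  recall={r:.2f}  F={f:.2f}")
```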
<Paragraph position="4"> While the high number of FNs can be explained by reports which have shown LLR to be over-conservative (Ribas, 1995; Pedersen, 1996), the high number of FPs is surprising.</Paragraph> <Paragraph position="5"> Although, in theory, the strength of LLR lies in its suitability for low frequency data, the results displayed in table 1 do not suggest that the method performs better than BHT on low frequency frames.</Paragraph> <Paragraph position="6"> MLE thresholding produced better results than the two statistical tests. Precision improved considerably, showing that the classes occurring in the data with the highest frequency are often correct. Although MLE thresholding clearly makes no attempt to solve the sparse data problem, it performs better than BHT or LLR overall. MLE is not adept at finding low frequency SCFs; the other methods, however, are problematic in that they wrongly accept more than they correctly reject. The baseline of accepting all SCFs obtained high recall at the expense of precision.</Paragraph> </Section> <Section position="3" start_page="201" end_page="203" type="sub_section"> <SectionTitle> 3.3 Discussion </SectionTitle> <Paragraph position="0"> Our results indicate that MLE outperforms both hypothesis tests. There are two explanations for this, and they are jointly responsible for the results.</Paragraph> <Paragraph position="1"> Firstly, the SCF distribution is Zipfian, as are many distributions concerned with natural language (Manning and Schütze, 1999). Figure 1 shows the conditional distribution for the verb find. This unfiltered SCF probability distribution was obtained from 20 million words of BNC data output by the SCF system. The unconditional distribution obtained from the observed distribution of SCFs in the same 20 million words of the BNC is shown in figure 2. The figures plot SCF rank on the X-axis against SCF frequency on the Y-axis, using logarithmic scales. The line indicates the closest Zipf-like power law fit to the data.</Paragraph> <Paragraph position="2"> Secondly, the hypothesis tests make the false assumption (H0) that the unconditional and conditional distributions are correlated.</Paragraph> <Paragraph position="3"> The fact that a significant improvement in performance is obtained by correcting the prior probabilities according to the performance of the system (Briscoe, Carroll and Korhonen, 1997) points to a discrepancy between the unconditional and the conditional distributions. We examined the correlation between the manual analysis for the 14 verbs and the unconditional distribution of verb types over all SCFs estimated from the ANLT, using the Spearman Rank Correlation Coefficient. The results in table 3 show that only a moderate correlation was found when averaged over all verb types.</Paragraph> <Paragraph position="4"> Both LLR and BHT work by comparing the observed value of p(scfi|verbj) to the value expected by chance. They both take the observed value of p(scfi|verbj), i.e. the conditional SCF distribution of the test verb, from the system's output, and they both use an estimate of the unconditional probability distribution p(scfi) to compute the expected probability. They differ in the way that the estimate for the unconditional probability is obtained, and in the way that it is used in hypothesis testing.</Paragraph>
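As an illustration of the kind of comparison both filters perform, the sketch below implements a simplified one-tailed binomial filter in the spirit of the BHT described in the following paragraphs: it computes the probability that a verb would co-occur with scfi at least m times out of n by chance, given an error probability pe, and accepts the frame only when that probability is small. The function names, the example counts, the value of pe and the 0.05 threshold are illustrative assumptions, not settings taken from the paper.

```python
def binomial_tail(m, n, pe):
    """P(m+, n, pe): probability of seeing at least m occurrences in n trials
    when each trial produces the frame by chance with probability pe.
    Computed iteratively to avoid overflow in the binomial coefficients."""
    if m <= 0:
        return 1.0
    pmf = (1 - pe) ** n          # P(X = 0)
    cdf = pmf                    # accumulates P(X <= m - 1)
    for k in range(1, m):
        # P(X = k) from P(X = k - 1) via the binomial recurrence.
        pmf *= (n - k + 1) / k * pe / (1 - pe)
        cdf += pmf
    return 1.0 - cdf

def bht_accept(m, n, pe, alpha=0.05):
    """Accept scfi for a verb if m or more occurrences out of n citations
    would be unlikely (< alpha) under the noise probability pe alone."""
    return binomial_tail(m, n, pe) < alpha

# Illustrative numbers only: 12 putative occurrences of a frame out of
# 3000 citations, with an assumed unconditional/noise estimate pe.
print(bht_accept(m=12, n=3000, pe=0.001))   # True: unlikely to be noise
print(bht_accept(m=2, n=3000, pe=0.001))    # False: compatible with noise
```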
<Paragraph position="5"> For BHT, the null hypothesis is that the observed value of p(scfi|verbj) arose by chance, because of noise in the data. We estimate the probability that the observed value could have arisen by chance using P(m+, n, pe), where pe is calculated using:
* the SCF acquisition system's raw (unfiltered) estimate for the unconditional distribution, which is obtained from the Susanne corpus, and
* the ANLT estimate of the unconditional distribution of a verb not taking scfi, across all SCFs.
For LLR, both the conditional (p1) and unconditional (p2) distributions are estimated from the BNC data. The unconditional probability distribution uses the occurrences of scfi with any verb other than our target.</Paragraph> <Paragraph position="6"> The binomial tests look at one point in the SCF distribution at a time, for a given verb. The expected value is determined using the unconditional distribution, on the assumption that if the null hypothesis is true then this distribution will correlate with the conditional distribution. However, this is rarely the case. Moreover, because of the Zipfian nature of the distributions, the frequency differences at any point can be substantial. In these experiments we used one-tailed tests, because we were looking for cases where there was a positive association between the SCF and the verb; in a two-tailed test the null hypothesis would rarely be accepted, because of the substantial differences between the conditional and unconditional distributions.</Paragraph> <Paragraph position="7"> A large number of false negatives occurred for high frequency SCFs because the probability we compared them to was too high. This probability was estimated from the combination of many verbs genuinely occurring with the frame in question, rather than from an estimate of the background noise from verbs which did not occur with the frame. We did not use an estimate from verbs which do not take the SCF, since this would require a priori knowledge about the very phenomena that we were endeavouring to acquire automatically. For LLR, the unconditional probability estimate (p2) was high simply because the SCF was a common one, rather than because the data was particularly noisy. For BHT, pe was likewise too high, as the SCF was also common in the Susanne data. The ANLT estimate went some way towards compensating for this; thus we obtained fewer false negatives with BHT than with LLR.</Paragraph> <Paragraph position="8"> A large number of false positives occurred for low frequency SCFs because the estimate for p(scfi) was low. This estimate was more readily exceeded by the conditional estimate.</Paragraph> <Paragraph position="9"> For BHT, false positives arose because of the low estimate of p(scfi) (from Susanne) and because the estimate of p(¬scfi) from the ANLT did not compensate enough for this. For LLR, there was no means to compensate for the fact that p2 was lower than p1.</Paragraph> <Paragraph position="10"> In contrast, MLE did not compare two distributions. Simply rejecting the low frequency data produced better results overall, by avoiding the false positives with the low frequency data and the false negatives with the high frequency data.</Paragraph> </Section> </Section> </Paper>
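For completeness, the sketch below shows a binomial log-likelihood ratio statistic of the kind the LLR filter is based on (Dunning's formulation), comparing p1 = p(scfi|verbj) for the target verb with p2, the probability of scfi with any verb other than the target, and applying it in a one-tailed fashion as described above. The function names, the example counts and the 3.84 critical value (chi-squared, one degree of freedom, p = 0.05) are illustrative assumptions rather than the exact settings used in the paper.

```python
from math import log

def log_l(k, n, p):
    """Binomial log-likelihood of k successes out of n with parameter p."""
    # Guard the boundary cases so log(0) is never evaluated.
    if p == 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p == 1.0:
        return 0.0 if k == n else float("-inf")
    return k * log(p) + (n - k) * log(1 - p)

def llr(k1, n1, k2, n2):
    """-2 log lambda for the hypothesis that scfi is no more likely with the
    target verb (k1 out of n1) than with all other verbs (k2 out of n2)."""
    p1 = k1 / n1                  # conditional estimate, p(scfi | verbj)
    p2 = k2 / n2                  # unconditional estimate from other verbs
    p = (k1 + k2) / (n1 + n2)     # pooled estimate under the null hypothesis
    return 2 * (log_l(k1, n1, p1) + log_l(k2, n2, p2)
                - log_l(k1, n1, p) - log_l(k2, n2, p))

def llr_accept(k1, n1, k2, n2, critical=3.84):
    """One-tailed use: accept only when the association is positive (p1 > p2)
    and the statistic exceeds an (assumed) chi-squared critical value."""
    return k1 / n1 > k2 / n2 and llr(k1, n1, k2, n2) > critical

# Illustrative counts only: the frame occurs 12 times in 3000 citations of
# the target verb and 500 times in 1,000,000 citations of all other verbs.
print(llr_accept(12, 3000, 500, 1_000_000))
```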