<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1031"> <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. A Feedback-Augmented Method for Detecting Errors in the Writing of Learners of English</Title> <Section position="7" start_page="245" end_page="247" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="245" end_page="245" type="sub_section"> <SectionTitle> 4.1 Experimental Conditions </SectionTitle> <Paragraph position="0"> A set of essays [9] written by Japanese learners of English was used as the target essays in the experiments. It consisted of 47 essays (3180 words) on the topic of traveling. A native speaker of English who was a professional rewriter of English identified 105 target errors in it.</Paragraph> <Paragraph position="1"> The written part of the British National Corpus (BNC) (Burnard, 1995) was used to learn decision lists. Sentences that the OAK system [10], which was used to extract NPs from the corpus, failed to analyze were excluded. After these operations, the size of the corpus amounted to approximately 80 million words. Hereafter, the corpus will be referred to as the BNC.</Paragraph> <Paragraph position="2"> As another corpus, the English concept explication in the EDR English-Japanese Bilingual dictionary and the EDR corpus (1993) were used; together they will be referred to as the EDR corpus hereafter.</Paragraph> <Paragraph position="3"> Its size amounted to about 3 million words.</Paragraph> <Paragraph position="4"> Performance of the proposed method was evaluated by recall and precision. Recall is defined by \[ \text{Recall} = \frac{\text{No. of target errors detected correctly}}{\text{No. of target errors in the target essays}} \tag{10} \] Precision is defined by \[ \text{Precision} = \frac{\text{No. of target errors detected correctly}}{\text{No. of detected errors}} \tag{11} \]</Paragraph> </Section> <Section position="2" start_page="245" end_page="246" type="sub_section"> <SectionTitle> 4.2 Experimental Procedures </SectionTitle> <Paragraph position="0"> First, decision lists for each target noun in the target essays were learned from the BNC [11]. To extract noun phrases and their head nouns, the OAK system was used. An optimal value for the window size of context was estimated as follows. For 25 nouns shown in (Huddleston and Pullum, 2002) as examples of nouns used as both mass and count nouns, accuracy on the BNC was calculated using ten-fold cross validation. Small, medium, and large window sizes were tried, and the small window size maximized the average accuracy. Following this result, the small window size was selected in the experiments.</Paragraph> <Paragraph position="1"> Second, the target nouns were classified as mass or count by the learned decision lists, and then the target errors were detected by applying the detection rules to the mass count distinction. As preprocessing, spelling errors were corrected using a spell checker. The results of the detection were compared to those of the native speaker of English. From the comparison, recall and precision were calculated.</Paragraph> <Paragraph position="2"> [11] ... general corpora (and also in the feedback corpus in the case of the feedback-augmented method), the target noun is ignored in the error detection procedure.</Paragraph> <Paragraph position="3"> Then, the feedback-augmented method was evaluated on the same target essays. Each target essay in turn was left out, and all the remaining target essays were used as a feedback corpus. The target errors in the left-out essay were detected using the feedback-augmented method. The results of all 47 detections were combined to calculate overall performance. 
This form of feedback corresponds to using revised essays previously written in one class to detect errors in essays on the same topic written in other classes.</Paragraph> <Paragraph position="4"> Finally, the above two methods were compared with their seven variants shown in Table 5. DL in Table 5 refers to the nine decision list based methods (the above two methods and their seven variants). The words in brackets denote the corpora used to learn decision lists; the symbol +FB means that the feedback corpus was simply added to the general corpus. The subscripts FB1 and FB2 indicate that the feedback was done by using Equation (8) and Equation (9), respectively.</Paragraph> <Paragraph position="5"> In addition to the seven variants, two kinds of earlier methods were used for comparison. One was one of the rule-based methods (Kawai et al., 1984). It judges singular head nouns with no determiner to be erroneous, since missing articles are the most common errors in the writing of Japanese learners of English. In the experiments, this was implemented by treating all nouns as count nouns and applying the same detection rules as in the proposed method to the countability.</Paragraph> <Paragraph position="6"> The other was a web-based method (Lapata and Keller, 2005) [12] for generating articles. It retrieves web counts for queries consisting of the two words preceding the NP that the target noun heads, one of the articles ({a/an, the, ∅}), and the core NP, in order to generate articles. All queries are performed as exact matches using quotation marks and submitted to the Google search engine in lower case. For example, in the case of *She is good student., it retrieves web counts for "she is a good student", "she is the good student", and "she is good student". Then, it generates the article that maximizes the web counts. We extended it to make it capable of detecting our target errors. First, the singular/plural distinction was taken into account in the queries (e.g., "she is a good students", "she is the good students", and "she is good students" in addition to the above three queries). The query or queries that maximized the web counts were judged to be correct; the rest were judged to be erroneous. Second, if determiners other than the articles modify head nouns, only the distinction between singular and plural was taken into account (e.g., "he has some book" vs. "he has some books"). In the case of much/many, the target noun in singular form modified by much and that in plural form modified by many were compared (e.g., "he has much furniture" vs. "he has many furnitures"). Finally, some rules were used to detect literal errors. For example, plural head nouns modified by this were judged to be erroneous.</Paragraph> <Paragraph position="7"> [12] There are other statistical methods that can be used for comparison, including Lee (2004) and Minnen (2000). Lapata and Keller (2005) report that the web-based method is the best performing article generation method.</Paragraph> </Section> <Section position="3" start_page="246" end_page="247" type="sub_section"> <SectionTitle> 4.3 Experimental Results and Discussion </SectionTitle> <Paragraph position="0"> Table 5 shows the experimental results. Rule-based and Web-based in Table 5 refer to the rule-based method and the web-based method, respectively. The other symbols are as already explained in Section 4.2.</Paragraph> <Paragraph position="1"> As we can see from Table 5, all the decision list based methods outperform the earlier methods. The rule-based method treated all nouns as count nouns, and thus it did not work well at all on mass nouns. This caused many false-positives and false-negatives. The web-based method suffered greatly from errors other than the target errors, since it implicitly assumed that there were no errors except the target errors. 
Contrary to this assumption, the target essays contained not only the target errors but also other errors, since they were written by Japanese learners of English. This indicates that the queries often contained these other errors when web counts were retrieved. These errors made the web counts useless, and thus the method did not perform well. By contrast, the decision list based methods performed well because they distinguished mass and count nouns by the one word around the target noun that was most likely to be effective according to the log-likelihood ratio [13]; the best performing decision list based method (DL_FB2 (EDR)) is significantly superior to the best performing [14] non-decision list based method (Web-based) in both recall and precision at the 99% confidence level.</Paragraph> <Paragraph position="2"> Table 5 also shows that the feedback-augmented methods benefit from feedback. The only exception is DL_FB1 (BNC). The reason is that the BNC is far larger than the feedback corpus, and thus the feedback did not affect the performance.</Paragraph> <Paragraph position="3"> This also explains why simply adding the feedback corpus to the general corpus achieved little or no improvement, as DL (EDR+FB) and DL (BNC+FB) show. Unlike these, both DL_FB2 (BNC) and DL_FB2 (EDR) benefit from feedback despite the difference in size between the feedback corpus and the general corpus, since the effect of the general corpus is limited to some extent by the log function in Equation (9).</Paragraph> <Paragraph position="4"> Although the experimental results have shown that the feedback-augmented method is effective in detecting the target errors in the writing of Japanese learners of English, even the best performing method (DL_FB2 (EDR)) made 30 false-negatives and 29 false-positives. 
About 70% of the false-negatives were errors that required sources of information other than the mass count distinction to be detected. For example, extra definite articles (e.g., *the traveling) cannot be detected even if the correct mass count distinction is given. Thus, only a little improvement in recall is expected however much feedback corpus data becomes available. On the other hand, most of the false-positives were due to the decision lists themselves. Considering this, it is quite possible that precision will improve as the size of the feedback corpus increases.</Paragraph> <Paragraph position="5"> [13] Indeed, words around the target noun were effective. The default rules were used about 60% and 30% of the time in DL (EDR) and DL (BNC), respectively; when only the default rules were used, DL (EDR) (DL (BNC)) achieved 0.66 (0.56) in recall and 0.58 (0.53) in precision.</Paragraph> <Paragraph position="6"> [14] "Best performing" here means best performing in terms of F-measure.</Paragraph> </Section> </Section> </Paper>