<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2019">
  <Title>An Unsupervised Method for Detecting Grammatical Errors</Title>
  <Section position="4" start_page="143" end_page="145" type="metho">
    <SectionTitle>
3 Experimental Design and Results
</SectionTitle>
    <Paragraph position="0"> ALEK was tested on 20 words. These words were randomly selected from those which met two criteria: (1) They appear in a university word list ('Nation, 1990) as words that a student in a US university will be expected to encounter and (2) there were at least 1,000 sentences containing the word in the TOEFL essay pool.</Paragraph>
    <Paragraph position="1"> To build the usage model for each target word, 10,000 sentences containing it were extracted from the North American News Corpus.</Paragraph>
    <Paragraph position="2"> Preprocessing included detecting sentence boundaries and part-of-speech tagging. As in the development system, the model of general English was based on bigram and trigram frequencies of function words and part-of-speech tags from 30-million words of the San Jose Mercury News.</Paragraph>
    <Paragraph position="3"> For each test word, all of the test sentences were marked by ALEK as either containing an error or not containing an error. The size of the test set for each word ranged from 1,400 to 20,000 with a mean of 8,000 sentences.</Paragraph>
    <Section position="1" start_page="143" end_page="144" type="sub_section">
      <SectionTitle>
3.1 Results
</SectionTitle>
      <Paragraph position="0"> To evaluate the system, for each test word we randomly extracted 125 sentences that ALEK classified as containing no error (C-set) and 125 sentences which it labeled as containing an error (E-set). These 250 sentences were presented to a linguist in a random order for blind evaluation.</Paragraph>
      <Paragraph position="1"> The linguist, who had no part in ALEK's development, marked each usage of the target word as incorrect or correct and in the case of incorrect usage indicated how far from the target one would have to look in order to recognise that there was an error. For example, in the case of &amp;quot;an period&amp;quot; the error occurs at a distance of one word from period. When the error is an omission, as in &amp;quot;lived in Victorian period&amp;quot;, the distance is where the missing word should have appeared. In this case, the missing determiner is 2 positions away from the target. When more than one error occurred, the distance of the one closest to the target was marked.</Paragraph>
      <Paragraph position="2"> Table 3 lists the precision and recall for the 20 test words. The column labelled &amp;quot;Recall&amp;quot; is the proportion of human-judged errors in the 250sentence sample that were detected by ALEK.</Paragraph>
      <Paragraph position="3"> &amp;quot;Total Recall&amp;quot; is an estimate that extrapolates from the human judgements of the sample to the entire test set. We illustrate this with the results for pollution. The human judge marked as incorrect usage 91.2% of the sample from ALEK's E-set and 18.4% of the sample from its C-set. To estimate overall incorrect usage, we computed a weighted mean of these two rates, where the weights reflected the proportion of sentences that were in the E-set and C-set. The E-set contained 8.3% of the pollution sentences and the C-set had the remaining 91.7%. With the human judgements as the gold standard, the estimated overall rate of incorrect usage is (.083</Paragraph>
      <Paragraph position="5"> recall is the proportion of sentences in the E-set times its precision, divided by the overall estimated error rate (.083 x .912) / .245 = .310.</Paragraph>
      <Paragraph position="6">  The precision results vary from word to word. Conclusion and pollution have precision in the low to middle 90's while individual's precision is 57%. Overall, ALEK's predictions are about 78% accurate. The recall is limited in part by the fact that the system only looks at syntactic information, while many of the errors are semantic.</Paragraph>
    </Section>
    <Section position="2" start_page="144" end_page="144" type="sub_section">
      <SectionTitle>
3.2 Analysis of Hits and Misses
</SectionTitle>
      <Paragraph position="0"> Nicholls (1999) identifies four error types: an unnecessary word (*affect to their emotions), a missing word (*opportunity of job.), a word or phrase that needs replacing (*every jobs), a word used in the wrong form (*pollutions). ALEK recognizes all of these types of errors. For closed class words, ALEK identified whether a word was missing, the wrong word was used (choice), and when an extra word was used. Open class words have a fourth error category, form, including inappropriate compounding and verb agreement. During the development stage, we found it useful to add additional error categories. Since TEOFL graders are not supposed to take punctuation into account, punctuation errors were only marked when they caused the judge to &amp;quot;garden path&amp;quot; or initially misinterpret the sentence. Spelling was marked either when a function word was misspelled, causing part-of-speech tagging errors, or when the writer's intent was unclear.</Paragraph>
      <Paragraph position="1"> The distributions of categories for hits and misses, shown in Table 4, are not strikingly different. However, the hits are primarily syntactic in nature while the misses are both semantic (as in open-class:choice) and syntactic (as in closed-class:missing).</Paragraph>
      <Paragraph position="2"> ALEK is sensitive to open-class word confusions (affect vs effect) where the part of speech differs or where the target word is confused with another word (*ln this aspect,...</Paragraph>
      <Paragraph position="3"> instead ofln this respect, ...). In both cases, the system recognizes that the target is in the wrong syntactic environment. Misses can also be syntactic - when the target word is confused with another word but the syntactic environment fails to trigger an error. In addition, ALEK does not recognize semantic errors when the error involves the misuse of an open-class word in  of 200 hits and 200 misses combination with the target (for example, make in &amp;quot;*they make benefits&amp;quot;).</Paragraph>
      <Paragraph position="4"> Closed class words typically are either selected by or agree with a head word. So why are there so many misses, especially with prepositions? The problem is caused in part by polysemy when one sense of the word selects a preposition that another sense does not. When concentrate is used spatially, it selects the preposition in, as &amp;quot;the stores were concentrated in the downtown area&amp;quot;. When it denotes mental activity, it selects the preposition on, as in &amp;quot;Susan concentrated on her studies&amp;quot;. Since ALEK trains on all senses of concentrate, it does not detect the error in &amp;quot;*Susan concentrated in her studies&amp;quot;. Another cause is that adjuncts, especially temporal and locative adverbials, distribute freely in the word-specific corpora, as in &amp;quot;Susan concentrated in her room.&amp;quot; This second problem is more tractable than the polysemy problem - and would involve training the system to recognize certain types of adjuncts.</Paragraph>
    </Section>
    <Section position="3" start_page="144" end_page="145" type="sub_section">
      <SectionTitle>
3.3 Analysis of False Positives
</SectionTitle>
      <Paragraph position="0"> False positives, when ALEK &amp;quot;identifies&amp;quot; an error where none exists, fall into six major categories. The percentage of each false positive type in a random sample of 200 false positives is shown in Table 5.</Paragraph>
      <Paragraph position="1"> Domain mismatch: Mismatch of the newspaper-domain word-specific corpora and essay-domain test corpus. One notable difference is that some TOEFL essay prompts call for the writer's opinion. Consequently,  TOEFL essays often contain first person references, whereas newspaper articles are written in the third person. We need to supplement the word-specific corpora with material that more closely resembles the test corpus.</Paragraph>
      <Paragraph position="2"> Tagger: Incorrect analysis by the part-of-speech tagger. When the part-of-speech tag is wrong, ALEK often recognizes the resulting n-gram as anomalous. Many of these errors are caused by training on the Brown corpus instead of a corpus of essays.</Paragraph>
      <Paragraph position="3"> Syntactic analysis: Errors resulting from using part-of-speech tags instead of supertags or a full parse, which would give syntactic relations between constituents. For example, ALEK false alarms on arguments of ditransitive verbs such as offer and flags as an error &amp;quot;you benefits&amp;quot; in &amp;quot;offers you benefits&amp;quot;.</Paragraph>
      <Paragraph position="4"> Free distribution: Elements that distribute freely, such as adverbs and conjunctions, as well as temporal and locative adverbial phrases, tend to be identified as errors when they occur in some positions.</Paragraph>
      <Paragraph position="5"> Punctuation: Most notably omission of periods and commas. Since these errors are not indicative of one's ability to use the target word, they were not considered as errors unless they caused the judge to misanalyze the sentence.</Paragraph>
      <Paragraph position="6"> Infrequent tags. An undesirable result of our &amp;quot;enriched&amp;quot; tag set is that some tags, e.g., the post-determiner last, occur too infrequently in the corpora to provide reliable statistics.</Paragraph>
      <Paragraph position="7"> Solutions to some of these problems will clearly be more tractable than to others.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="145" end_page="145" type="metho">
    <SectionTitle>
4 Comparison of Results
</SectionTitle>
    <Paragraph position="0"> Comparison of these results to those of other systems is difficult because there is no generally accepted test set or performance baseline. Given this limitation, we compared ALEK's performance to a widely used grammar checker, the one incorporated in Microsoft's Word97. We created files of sentences used for the three development words concentrate, interest, and knowledge, and manually corrected any errors outside the local context around the target before checking them with Word97. The performance for concentrate showed overall precision of 0.89 and recall of 0.07. For interest, precision was 0.85 with recall of 0.11. In sentences containing knowledge, precision was 0.99 and recall was 0.30. Word97 correctly detected the ungrammaticality ofknowledges as well as a knowledge, while it avoided flagging a knowledge of.</Paragraph>
    <Paragraph position="1"> In summary, Word97's precision in error detection is impressive, but the lower recall values indicate that it is responding to fewer error types than does ALEK. In particular, Word97 is not sensitive to inappropriate selection of prepositions for these three words (e.g., *have knowledge on history, *to concentrate at science). Of course, Word97 detects many kinds of errors that ALEK does not.</Paragraph>
    <Paragraph position="2"> Research has been reported on grammar checkers specifically designed for an ESL population. These have been developed by hand, based on small training and test sets. Schneider and McCoy (1998) developed a system tailored to the error productions of American Sign Language signers. This system was tested on 79 sentences containing determiner and agreement errors, and 101 grammatical sentences. We calculate that their precision was 78% with 54% recall. Park, Palmer and Washburn (1997) adapted a categorial grammar to recognize &amp;quot;classes of errors \[that\] dominate&amp;quot; in the nine essays they inspected. This system was tested on eight essays, but precision and recall figures are not reported.</Paragraph>
  </Section>
class="xml-element"></Paper>