<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0902">
  <Title>Developing a new grammar checker for English as a second language</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many word-processing systems today include grammar checkers which can be used to locate various grammatical problems in a text. These tools are clearly aimed at native speakers even if they can be of some help to non-native speakers as well.</Paragraph>
    <Paragraph position="1"> However, non-native speakers make more errors than native speakers and their errors are quite different (Corder 1981). Because of this, they require grammar checkers designed for their specific needs (Granger &amp; Meunier 1994).</Paragraph>
    <Paragraph position="2"> The prototype we have developed is aimed at French native speakers writing in English. From the very start, we worked on the idea that our prototype would be commercialized. In order to find out users' real needs, we first conducted a survey among potential users and experts in the field concerning such issues as coverage and the interface. In addition, we studied which errors needed to be dealt with. To do this, we integrated the information on errors found in published lists of typical learner errors, e.g. Fitikides (1963), with our own corpus of errors obtained from English texts written by French native speakers. Some 27'000 words of text produced over 2'800 errors, which were classified and sorted. We have used this corpus to decide which errors to concentrate on and to evaluate the correction procedures developed. The following two tables give the percentages of errors found in our corpus, broken down by the major categories, followed by the subcategories pertaining to the verb.</Paragraph>
    <Paragraph position="3"> The prototype includes a set of writing aids, a problem word highlighter and a grammar checker. The writing aids include two monolingual and a bilingual dictionary (simulated), a verb conjugator, a small translating module for certain fixed expressions, and a comprehensive on-line grammar. The problem word highlighter is used to show all the potential lexical errors in a text. While we hoped that the grammar checker would cover as many different types of errors as possible, it quickly became clear that certain errors could not be handled satisfactorily with the checker, e.g. using library (based on French librairie) instead of bookstore in a sentence like I need to go to the library in order to buy a present for my father. Instead of flagging every instance of library in a text, something other grammar checkers often do, we developed the problem word highlighter. It allows the user to view all the problematic words at one glance and it offers help, i.e. explanations and examples for each word, on request.</Paragraph>
    <Paragraph position="4"> Potential errors such as false friends, confusions, foreign loans, etc. are tackled by the highlighter. The heart of the prototype is the grammar checker, which we describe below. Further details about the writing aids can be found in Tschumi et al. (1996).</Paragraph>
    <Paragraph position="5"> 2 The grammar checker
Texts written by non-native speakers are more difficult to parse than texts written by native speakers because of the number and types of errors they contain. It is doubtful whether a complete parse of a sentence containing these kinds of errors can be achieved with today's technology (Thurmair 1990).</Paragraph>
    <Paragraph position="6"> To get around this, we chose an approach using island processing. Such a method, similar to the chunking approach described in Abney (1991), makes it possible to extract from the text most of the information needed to detect errors without wasting time and resources on trying to parse an ungrammatical sentence fully. Chunking can be seen as an intermediary step between tagging and parsing, but it can also be used to get around the problem of a full parse when dealing with ill-formed text.</Paragraph>
    <Paragraph position="7"> Once the grammar checker has been activated, the text which is to be checked goes through a number of stages. It is first segmented into sentences and words. The individual words are then looked up in the dictionary, which is an extract of CELEX (see Burnage 1990). It includes all the words that occur in our corpus of texts, with all their possible readings. For example, the word the is listed in our dictionary as a determiner and as an adverb; table has an entry both as a noun and as a verb. In this sense, our dictionary is not a simplified and scaled-down version of a full dictionary, but simply a shorter version. In the next stage, an algorithm based on neural networks (see Bodmer 1994) disambiguates all words which belong to more than one syntactic category. Furthermore, some multi-word units are identified and labeled with a single syntactic category, e.g. a lot of (PRON) [1].</Paragraph>
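    <Paragraph> The lookup stage can be pictured with a minimal sketch. It is only an illustration under stated assumptions, not the prototype's code: the names LEXICON, MULTIWORD_UNITS and look_up are hypothetical, the tags for a, lot, of and books are invented for the example, and the neural-network disambiguation step is left out entirely. Only the readings for the and table and the PRON label for a lot of come from the text above.
<![CDATA[
```python
# Toy lexicon: every word keeps ALL of its possible readings.
# Entries for "the" and "table" follow the paper; the rest are assumptions.
LEXICON = {
    "the": ["DET", "ADV"],
    "table": ["N", "V"],
    "a": ["DET"],
    "lot": ["N"],
    "of": ["PREP"],
    "books": ["N", "V"],
}

# Multi-word units that receive a single syntactic category.
MULTIWORD_UNITS = {("a", "lot", "of"): "PRON"}


def look_up(words):
    """Return (text, readings) pairs, merging known multi-word units."""
    result = []
    i = 0
    while i < len(words):
        for unit, category in MULTIWORD_UNITS.items():
            window = tuple(w.lower() for w in words[i:i + len(unit)])
            if window == unit:
                # The whole unit gets one label instead of one per word.
                result.append((" ".join(words[i:i + len(unit)]), [category]))
                i += len(unit)
                break
        else:
            # Ordinary word: list every reading; unknown words get none.
            result.append((words[i], LEXICON.get(words[i].lower(), [])))
            i += 1
    return result


readings = look_up(["A", "lot", "of", "books"])
```
]]>
 In this sketch, words with more than one reading (such as books) are the ones a later disambiguation step would have to resolve. Merging multi-word units during lookup is a simplification chosen for brevity.</Paragraph>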
    <Paragraph position="8"> After this stage, island parsing can begin. A first step consists in identifying simple noun phrases. On the basis of these NPs, preprocessing automata assemble complex noun phrases and assign features to them whenever possible. In a second step, other preprocessing automata identify the verb group and assign tense, voice and aspect features to it. Finally, in a third step, error detection automata are run, some of which involve interacting with the user. Each of these three steps will be described below.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The preprocessing stage
</SectionTitle>
    <Paragraph position="0"> The noun phrase parser identifies simple non-recursive noun phrases such as Det+Adj+N or N+N. The method used for this process involves an algorithm of the type described in Church (1988) which was trained on a manually marked part of our corpus. The module is thus geared to the particular type of second language text the checker needs to deal with. The resulting information is passed on to a preprocessing module consisting of a number of automata groups. The automata used here (as well as in subsequent modules) are finite-state automata similar to those described in Allen (1987) or Silberztein (1993). Automata of this type are well known for their efficiency and versatility.</Paragraph>
    <Paragraph position="1"> In the preprocessing module, a first set of automata scan the text for noun phrases, identify the head of each NP and assign the features for person and number to it. Other sets of automata then mark temporal noun phrases, e.g. this past week or six months ago. In a similar fashion, some prepositional phrases are given specific features if they denote time or place, e.g. during that period, at the office. Still within the same preprocessing module, some recursive NPs are then assembled into more complex NPs, e.g. the arrival of the first group. Finally, human NPs are identified and given a special feature. This is illustrated in the following automaton:</Paragraph>
    <Paragraph position="3"> Every automaton starts by looking for its anchor, an arc marked by &quot;@&quot; (not necessarily the first arc in the automaton). The above automaton first looks for a noun (inside a noun phrase) which occurs in the list of nouns denoting human beings. If it finds one, it puts the value NP_HUMAN in the register called NP_TYPE, a register which is associated with the NP as a whole.</Paragraph>
    <Paragraph position="4"> The second group of preprocessing automata deals with the verb group.</Paragraph>
    <Paragraph position="5"> Periphrastic verb forms are analyzed for tense, aspect, and phase, and the result stored in registers. The content of these registers can later be called upon by the detection automata. After these two preprocessing stages, the error detection automata can begin their work.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Error detection
</SectionTitle>
    <Paragraph position="0"> Once important islands in the sentence have been identified, it becomes much easier to write detection automata which look for specific types of errors. Because no overall parse is attempted, error detection has to rely on well-described contexts. Such an approach, which reduces overflagging, also has the advantage of describing errors precisely. Errors can thus not only be identified but can also be explained to the user. Suggestions can also be made as to how errors can be corrected.</Paragraph>
    <Paragraph position="1"> One of the detection automata that make use of the NPs which have been previously identified is the automaton used for certain cases of subject-verb agreement (e.g. *He never eat cucumber sandwiches): [1] This part of speech corresponds to CELEX's use of</Paragraph>
    <Paragraph position="3"> This automaton (simplified here) first looks for the content of its last arc, a verb which is not in the third person singular form of the present tense. From there it proceeds leftwards. The arc before the anchoring arc optionally matches an adverb.</Paragraph>
    <Paragraph position="4"> The next arc, the second from the top, matches an NP which has been found to have the features third person singular. If the whole of this detection automaton succeeds, a message appears which suggests that the verb be changed to the third person singular form.</Paragraph>
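    <Paragraph> The right-to-left matching described here can be sketched in a few lines. The sketch rests on assumptions that are not in the original: chunks are represented as dictionaries, the verb feature form with the value pres_non3sg is a hypothetical encoding of "present tense, not third person singular", and the function name find_agreement_errors is invented for the example.
<![CDATA[
```python
def find_agreement_errors(chunks):
    """Return (subject, verb) pairs for a 3sg NP followed by a non-3sg present verb."""
    errors = []
    for i, chunk in enumerate(chunks):
        # Anchor arc: a present-tense verb that is NOT third person singular.
        if chunk["type"] != "V" or chunk.get("form") != "pres_non3sg":
            continue
        j = i - 1
        # Optional arc: skip at most one adverb to the left of the verb.
        if j >= 0 and chunks[j]["type"] == "ADV":
            j -= 1
        # Remaining arc: an NP carrying third person singular features.
        if j >= 0:
            subject = chunks[j]
            if (subject["type"] == "NP"
                    and subject.get("person") == 3
                    and subject.get("number") == "sg"):
                errors.append((subject["text"], chunk["text"]))
    return errors


sentence = [
    {"type": "NP", "text": "He", "person": 3, "number": "sg"},
    {"type": "ADV", "text": "never"},
    {"type": "V", "text": "eat", "form": "pres_non3sg"},
    {"type": "NP", "text": "cucumber sandwiches", "person": 3, "number": "pl"},
]
errors = find_agreement_errors(sentence)
```
]]>
 Each reported pair gives the checker enough information to phrase a message suggesting the third person singular form of the verb.</Paragraph>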
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 User interaction
</SectionTitle>
    <Paragraph position="0"> It is sometimes possible to detect an error without being completely sure about its identity or how to correct it. To deal with some of these cases, we have developed an interactive mode where the user is asked for additional information on the problem in question. The first step is to detect the structure where there might be an error.</Paragraph>
    <Paragraph position="1"> Here again, the preprocessing automata are put to use. For example, the following automaton looks for the erroneous order of constituents in a sentence containing a transitive verb (as in *We saw on the planet all the little Martians).</Paragraph>
    <Paragraph position="3"> If this automaton finds a sequence that contains a transitive verb followed by a prepositional phrase with either the feature PP_TIME or PP_PLACE and that ends in a noun phrase which does not have the feature NP_TIME, then the following question is put to the user: &quot;La séquence &quot;all the little Martians&quot; est-elle l'objet direct du verbe &quot;saw&quot;?&quot; (&quot;Is the sequence &quot;all the little Martians&quot; the direct object of the verb &quot;saw&quot;?&quot;) If the user presses the Yes button, a reordering of the predicate is suggested so as to give V - NP - PP (this can be done automatically). If the No button is pressed, a thank-you message appears instead and the checker moves on to the next error.</Paragraph>
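    <Paragraph> A minimal sketch of this interactive check is given below, under several assumptions that are not part of the original: chunks are dictionaries with hypothetical type, transitive and feature keys, the three constituents are taken to be adjacent, the question is asked in English, and the user's answer is supplied through a callback named ask_user.
<![CDATA[
```python
def check_object_order(chunks, ask_user):
    """Look for V(transitive) + PP(time/place) + NP(not time); suggest V - NP - PP."""
    for i in range(len(chunks) - 2):
        verb, pp, np = chunks[i], chunks[i + 1], chunks[i + 2]
        if (verb["type"] == "V" and verb.get("transitive")
                and pp["type"] == "PP"
                and pp.get("feature") in ("PP_TIME", "PP_PLACE")
                and np["type"] == "NP"
                and np.get("feature") != "NP_TIME"):
            question = 'Is the sequence "{}" the direct object of the verb "{}"?'.format(
                np["text"], verb["text"])
            if ask_user(question):
                # Yes: propose the reordered predicate V - NP - PP.
                return chunks[:i + 1] + [np, pp] + chunks[i + 3:]
    # No match, or the user answered No: nothing to suggest.
    return None


sentence = [
    {"type": "NP", "text": "We"},
    {"type": "V", "text": "saw", "transitive": True},
    {"type": "PP", "text": "on the planet", "feature": "PP_PLACE"},
    {"type": "NP", "text": "all the little Martians"},
]
suggestion = check_object_order(sentence, lambda question: True)
```
]]>
 Passing the question through a callback keeps the detection logic separate from the interface, so the same check can be driven by a dialog box or by a scripted answer.</Paragraph>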
    <Paragraph position="4"> Interaction is also used in cases where one of two possible corrections needs to be chosen. If a French native speaker writes *the more great, it is not clear whether s/he intended to use a comparative or a superlative. This can be determined by interacting with the user and an appropriate correction can then be proposed if need be.</Paragraph>
    <Paragraph position="5"> While developing the automata which include interaction, we took great care not to ask too much of the user. So far we have used fewer than ten different question patterns and the vocabulary has been restricted to terminology familiar to potential users. In addition, we only ask one question per interaction. Interacting with the user can thus be a valuable device during error detection if it is used with caution. Too many interactions, especially if they do not lead to actual error correction, can be annoying. They should therefore be restricted to cases where there is a fair chance of detecting an error and should not be used to flag problematic lexical items such as all the tokens of the word library in a text.</Paragraph>
  </Section>
</Paper>