<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1035">
  <Title>Spelling and Grammar Correction for Danish in SCARRIE</Title>
  <Section position="4" start_page="0" end_page="255" type="metho">
    <SectionTitle>
2 The prototype
</SectionTitle>
    <Paragraph position="0"> The prototype is a system for high-quality proofreading for Danish which has been developed in the context of a collaborative EU project 1. Together with the Danish prototype, the project has also produced similar systems for Swedish and Norwegian, all of them tailored to meet the specific needs of the Scandinavian publishing industry. They all provide writing support in the form of word and grammar checking. The Danish version of the system 2 constitutes a further development of the CORRie prototype (Vosse, 1992; Vosse, 1994), adapted to deal with the Danish language, and to the needs of the project's end users. The system processes text in batch mode and produces an annotated output text where errors are flagged and replacements suggested where possible. Text correction is performed in two steps: first the system deals with spelling errors and typos resulting in invalid words, and then with grammar errors.</Paragraph>
    <Paragraph position="1"> Invalid words are identified on the basis of dictionary lookup. The dictionary presently consists of 251,000 domain-relevant word forms extracted from a collection of 68,000 newspaper articles. A separate idiom list allowing for the identification of multi-word expressions is also available. Among the words not found in the dictionary or the idiom list, those occurring most frequently in the text (where frequency is assessed relative to the length of the text) are taken to be new words or proper names 3. The remaining unknown words are passed on to the compound analysis grammar, which is a set of regular expressions covering the most common types of compound nominals in Danish. This is an important feature, as in Danish compounding is very productive, and compounds are written as single words.</Paragraph>
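    <Paragraph> The lookup cascade described above can be sketched in Python as follows. This is a minimal illustration and not the SCARRIE implementation: the toy word lists, the frequency threshold, the single compound pattern (known words joined directly or by a linking s or e) and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Toy stand-ins (assumptions); the real dictionary holds 251,000 word forms.
DICTIONARY = {"han", "ser", "en", "avis", "artikel", "om", "og"}
IDIOMS = {("i", "dag")}


def is_compound(word, dictionary):
    """Assumed pattern: two or more known words, optionally linked by s or e."""
    known = "|".join(sorted(map(re.escape, dictionary), key=len, reverse=True))
    pattern = re.compile(f"(?:{known})(?:[se]?(?:{known}))+")
    return pattern.fullmatch(word) is not None


def classify_unknown(tokens, dictionary, idioms, min_rel_freq=0.15):
    """Label every token that neither the dictionary nor the idiom list covers."""
    words = [t.lower() for t in tokens]
    covered = set()  # token positions that belong to a matched idiom
    for idiom in idioms:
        n = len(idiom)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == idiom:
                covered.update(range(i, i + n))
    counts = Counter(words)
    labels = {}
    for i, word in enumerate(words):
        if word in dictionary or i in covered:
            continue
        if counts[word] / len(words) >= min_rel_freq:
            labels[word] = "new word or proper name"
        elif is_compound(word, dictionary):
            labels[word] = "compound"
        else:
            labels[word] = "spelling error"
    return labels


text = "Han ser en avisartikel i dag om Scarrie og Scarrie skrvier".split()
print(classify_unknown(text, DICTIONARY, IDIOMS))
```

With the toy data, a token repeated often enough relative to the text length is kept as a new word, a concatenation of known words is accepted as a compound, and whatever remains is left for spelling correction.</Paragraph>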
    <Paragraph position="2"> Words still unknown at this point are taken to be spelling errors. The system flags them as such and tries to suggest a replacement. The algorithm used is based on trigram and triphone analysis (van Berkel and Smedt, 1988), and takes into account the orthographic strings corresponding to the invalid word under consideration and its possible replacement, as well as the phonetic representations of the same two words. Phonetic representations are generated by a set of grapheme-to-phoneme rules (Hansen, 1999), the aim of which is to assign phonetically motivated misspellings and their correct counterparts identical or similar phonetic representations. Then the system tries to identify context-dependent spelling errors. This is done by parsing the text. Parsing results are passed on to a corrector to find replacements for the errors found. The parser is an implementation of the Tomita algorithm with a component for error recognition whose job is to keep track of error weights and feature mismatches as described in (Vosse, 1991). Each input sentence is assigned the analysis with the lowest error weight. If the error is due to a feature mismatch, the offending feature is overridden, and if a dictionary entry satisfying the grammar constraints expressed by the context is found in the dictionary, it is offered as a replacement. If the structure is incomplete, on the other hand, an error message is generated. Finally, if the system identifies an error as a split-up or a run-on, it will suggest either a possible concatenation, or a sequence of valid words into which the misspelt word can be split up.
Footnote 1: ... Sprogteknologi (Denmark), Department of Linguistics at Uppsala University (Sweden), Institutt for lingvistikk og litteraturvitenskab at the University of Bergen (Norway), and Svenska Dagbladet (Sweden). A number of subcontractors also contributed to the project. Subcontractors in Denmark were: Munksgaard International Publishers, Berlingske Tidende, Det Danske Sprog- og Litteraturselskab, and Institut for Almen og Anvendt ...
Footnote 3: ... alternative can be found in the dictionary, to avoid mistaking a consistently misspelt word for a new word.</Paragraph>
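    <Paragraph> The orthographic half of the replacement search can be illustrated with a small Python sketch. It covers character trigrams only: the triphone comparison and the grapheme-to-phoneme rules are left out, and the Dice coefficient, the boundary marker and the function names are assumptions of the example rather than details taken from the cited algorithm.

```python
def trigrams(word):
    """Set of character trigrams, with # marking the word boundaries."""
    padded = f"#{word}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Dice coefficient over trigram sets (an assumed scoring function)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta.intersection(tb)) / (len(ta) + len(tb))


def suggest(invalid_word, dictionary, max_suggestions=3):
    """Rank dictionary words by decreasing trigram similarity to the invalid word."""
    ranked = sorted(dictionary, key=lambda w: (-similarity(invalid_word, w), w))
    return ranked[:max_suggestions]


candidates = ["interessant", "projekter", "projektet"]
print(suggest("projketer", candidates, max_suggestions=2))
```

A phonetic score computed in the same way over triphones could be combined with the orthographic one, so that misspellings that sound like the intended word are ranked higher.</Paragraph>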
  </Section>
  <Section position="5" start_page="255" end_page="256" type="metho">
    <SectionTitle>
3 The errors
</SectionTitle>
    <Paragraph position="0"> To ensure the coverage of relevant error types, a set of parallel unedited and proofread texts provided by the Danish end users has been collected. This text collection consists of newspaper and magazine articles published in 1997 for a total of 270,805 running words. The articles have been collected in their raw version, as well as in the edited version provided by the publisher's own proofreaders. Although not very large in number of words, the corpus consists of excerpts from 450 different articles to ensure a good spread of lexical domains and error types. The corpus has been used to construct test suites for progress evaluation, and also to guide grammar development. The aim set for grammar development was then to enable the system to identify and analyse the grammatical constructions in which errors typically occur, whilst to some extent disregarding the remainder of the text.</Paragraph>
    <Paragraph position="1"> The errors occurring in the corpus have been analysed according to the taxonomy in (Rambell, 1997). Figure 1 shows the distribution of the various error types into the five top-level categories of the taxonomy. As can be seen, grammar errors account for 30% of the errors.</Paragraph>
    <Paragraph position="2"> Of these, 70% fall into one of the following categories (Povlsen, 1998):
* Too many finite verbal forms or missing finite verb
* Errors in nominal phrases:
  - agreement errors,
  - wrong determination,
  - genitive errors,
  - errors concerning pronouns;
* Split-ups and run-ons.</Paragraph>
    <Paragraph position="3"> Another way of grouping the errors is by the kind of parsing failure they generate: they can then be viewed as either feature mismatches, or as structural errors. Agreement errors are typical examples of feature mismatches. In the following nominal phrase, for example: (1) de *interessant projekter (the interesting projects) the error can be formalised as a mismatch between the definiteness of the determiner de (the) and the indefiniteness of the adjective interessant (interesting). Adjectives have in fact both an indefinite and a definite form in Danish. The sentence below, on the other hand, is an example of structural error.</Paragraph>
    <Paragraph position="4"> (2) i sin tid *skabet han skulpturer over atomkraften (during his time wardrobe/created he sculptures about nuclear power) Since the finite verb skabte (created) has been misspelt as skabet (the wardrobe), the syntactic structure corresponding to the sentence is missing a verbal head.</Paragraph>
    <Paragraph position="5"> Run-ons and split-ups are structural errors of a particular kind, having to do with leaves in the syntactic tree. In some cases they can only be detected on the basis of the context, because the misspelt word has the wrong category or bears some other grammatical feature that is incorrect in the context. Examples are given in (3) and (4) below, which like the preceding examples are taken from the project's corpus. In both cases, the error would be a valid word in a different context. More specifically, rigtignok (indeed) is an adverb, whilst rigtig nok (actually correct) is a modified adjective; and inden for (inside) is a preposition, whilst indenfor (indoors) is an adverb. In both examples the correct alternative is indicated in parentheses.</Paragraph>
    <Paragraph position="6"> (3) ... studerede min gruppe *rigtig nok (rigtignok) under temaoverskrifter (studied my group indeed on the basis of topic headings) (4) *indenfor (inden for) de gule mure (inside the yellow walls) Although the system has a facility for identifying and correcting split-ups and run-ons based on a complex interaction between the dictionary, the idiom list, the compound grammar and the syntactic grammar, this facility has not been fully developed yet, and will therefore not be described any further here. More details can be found in (Paggio, 1999).</Paragraph>
  </Section>
  <Section position="6" start_page="256" end_page="258" type="metho">
    <SectionTitle>
4 The grammar
</SectionTitle>
    <Paragraph position="0"> The grammar is an augmented context-free grammar consisting of rewrite rules where symbols are associated with features. Error weights and error messages can also be attached to either rules or single features. The rules are applied by unification, but in cases where one or more features do not unify, the offending features will be overridden.</Paragraph>
    <Paragraph position="1"> In the current version of the grammar, only the structures relevant to the error types we want the system to deal with - in other words nominal phrases and verbal groups - are accounted for in detail. The analysis produced is thus a kind of shallow syntactic analysis where the various sentence constituents are attached under the topmost S node as fragments.</Paragraph>
    <Paragraph position="2"> For example, adjective phrases can be analysed as fragments, as shown in the following rule:</Paragraph>
    <Paragraph position="4"> To indicate that the fragment analysis is not optimal, it is associated with an error weight, as well as an error message to be used for debugging purposes (the message is not visible to the end user). The weight penalises parse trees built by applying the rule. The rule is used e.g.</Paragraph>
    <Paragraph position="5"> to analyse an AP following a copula verb as in: (5) De projekter er ikke interessante.</Paragraph>
    <Paragraph position="6"> (Those projects are not interesting) The main motivation for implementing a grammar based on the idea of fragments was efficiency. Furthermore, the fragment strategy could be implemented very quickly. However, as will be clarified in Section 5, this strategy is sometimes responsible for bad flags.</Paragraph>
    <Section position="1" start_page="257" end_page="258" type="sub_section">
      <SectionTitle>
4.1 Feature mismatches
</SectionTitle>
      <Paragraph position="0"> As an alternative to the fragment analysis, APs can be attached as daughters in NPs. This is of course necessary for the treatment of agreement in NPs, one of the error types targeted in our application. This is shown in the following rule:</Paragraph>
      <Paragraph position="2"> The rule will parse a correct definite NP such  The feature overriding mechanism makes it possible for the system to suggest interessante as the correct replacement in (7), and projekter in (8). Let us see how this is done in more detail for example (7). The parser tries to apply the NP rule to the input string. The rule states that the adjective phrase must be definite (AP (def _ _)). But the dictionary entry corresponding to interessant bears the feature 'indef'. The parser will override this feature and build an NP according to the constraints expressed by the rule. At this point, a new dictionary lookup is performed, and the definite form of the adjective can be suggested as a replacement.</Paragraph>
      <Paragraph position="3"> Weights are used to control rule interaction as well as to establish priorities among features that may have to be overridden. For example in our NP rule, a weight has been attached to the Gender feature in the N node. The weight expresses the fact that it costs more to override gender on the head noun than on the determiner or adjective. The rationale behind this is the fact that if there is a gender mismatch, the parser should not try to find an alternative form of the noun (which does not exist), but if necessary override the gender feature either on the adjective or the determiner.</Paragraph>
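      <Paragraph> The effect of such weights can be illustrated with a small Python sketch. It is not the grammar formalism itself: the weight values, the dictionary-based feature structures and the function names are assumptions, and the candidate analyses are simply enumerated by hand instead of being produced by a parser.

```python
# Hypothetical override weights per (node, feature); overriding gender on the
# head noun costs more than on the determiner or the adjective.
OVERRIDE_WEIGHT = {("Det", "gender"): 1, ("AP", "gender"): 1, ("N", "gender"): 3}


def overrides(required, actual, weights):
    """Return (total weight, overridden features) for one candidate analysis."""
    total, changed = 0, []
    for node, features in required.items():
        for feature, value in features.items():
            if actual[node].get(feature) != value:
                total += weights.get((node, feature), 1)
                changed.append((node, feature, value))
    return total, changed


def best_analysis(candidates, actual, weights):
    """Pick the candidate whose overrides carry the lowest total weight."""
    return min((overrides(c, actual, weights) for c in candidates),
               key=lambda result: result[0])


# *et bil: neuter determiner with a common-gender noun (constructed example).
actual = {"Det": {"gender": "neuter"}, "N": {"gender": "common"}}
candidates = [
    {"Det": {"gender": "common"}, "N": {"gender": "common"}},
    {"Det": {"gender": "neuter"}, "N": {"gender": "neuter"}},
]
print(best_analysis(candidates, actual, OVERRIDE_WEIGHT))
```

Because the noun's gender is the most expensive feature to override, the cheapest analysis changes the determiner, which is the behaviour the weighting is meant to enforce.</Paragraph>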
      <Paragraph position="4"> 4.2 Capturing structural errors in grammar rules
To capture structural errors, the formalism allows the grammar writer to write so-called error rules. The syntax of error rules is very similar to that used in 'normal' rules, the only difference being that an error rule must have an error weight and an error message attached to it. The purpose of the weight is to ensure that error rules are applied only if 'normal' rules are not applicable. The error message can serve two purposes. Depending on whether it is stated as an implicit or an explicit message (i.e. whether it is preceded by a question mark or not), it will appear in the log file where it can be used for debugging purposes, or in the output text as a message to the end user.</Paragraph>
      <Paragraph position="5"> The following is an error rule example.</Paragraph>
      <Paragraph position="6"> A weight of 4 is attached to the rule as a whole, but there are also weights attached to the 'finiteness' feature on the daughters: their function is to make it costly for the system to apply the rule to non-finite forms. In other words, the feature specification 'finite' is made difficult to override to ensure that it is indeed a sequence of finite verbal forms the rule applies to and flags. The rule will for example parse the verbal sequence in the following sentence: (9) Jeg vil *bevarer (bevare) min frihed.</Paragraph>
      <Paragraph position="7"> (*I want keep my freedom) As a result of parsing, the system in this case will not attempt to correct the wrong verbal form, but will issue the error message "Sequence of two finite verbs".</Paragraph>
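      <Paragraph> A rough Python analogue of such an error rule is given here for illustration. The rule encoding, the tag names and the function are assumptions; a flat match over pre-tagged tokens stands in for parsing, so the interplay with 'normal' rules and with the weights on the 'finiteness' feature is not modelled.

```python
# Hypothetical encoding of an error rule: a pattern over daughter categories,
# a rule weight of 4 and an explicit message for the end user.
ERROR_RULE = {
    "pattern": ("V_fin", "V_fin"),
    "weight": 4,
    "message": "Sequence of two finite verbs",
}


def apply_error_rule(tagged, rule):
    """Return (position, weight, message) for every match of the rule pattern."""
    size = len(rule["pattern"])
    hits = []
    for i in range(len(tagged) - size + 1):
        tags = tuple(tag for _, tag in tagged[i:i + size])
        if tags == rule["pattern"]:
            hits.append((i, rule["weight"], rule["message"]))
    return hits


# Example (9), with part-of-speech tags assumed to come from the dictionary.
sentence = [("Jeg", "Pron"), ("vil", "V_fin"), ("bevarer", "V_fin"),
            ("min", "Pron"), ("frihed", "N")]
print(apply_error_rule(sentence, ERROR_RULE))
```

The returned weight would be added to the error weight of the analysis, so that a parse relying on the error rule is chosen only when no cheaper analysis exists.</Paragraph>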
      <Paragraph position="8"> Error rules can thus be used to explicitly describe an error and to issue error messages.</Paragraph>
      <Paragraph position="9"> However, so far we have made very limited use of them, as controlling their interaction with 'normal' rules and with the feature overriding mechanism is not entirely easy. In fact, they are consistently used only to identify incorrect sequences of finite verbal forms or sentences missing a finite verb. This sparse use of error rules is offset by an extensive exploitation of the feature overriding mechanism. This strategy allows us to keep the number of rules in the grammar relatively low, but relies on a careful manual adjustment of the weights attached to the various features in the rules.</Paragraph>
    </Section>
  </Section>
</Paper>