<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1011"> <Title>Learning and Application of Differential Grammars</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Commonly Confused Words </SectionTitle> <Paragraph position="0"> A word processor's grammar checker typically checks grammar in two senses: a pre-/proscriptive sense, which is better characterized as a set of style rules, and a spell-checking-like sense, which is characterized by the use of grammar rules to identify typos, substitutions, omissions and duplications which occur at the word or letter level.</Paragraph> <Paragraph position="1"> The errors we focus on are those 'typos' which result in one word being substituted by another word - both of which are words that would be accepted by the spell-checker. This is closely related to the problems of commonly confused words, homophones and near homophones, but the most important difference between the different types of substitution is in fact whether readers, proofreading their own work or looking at a detected error, will recognize that it is an error.</Paragraph> <Paragraph position="2"> This is not a new problem, either from the point of view of commercial word processors (Johnson, 1992) or Computational Linguistics (Ill, 1983), and two aspects of the problem must be differentiated: we want to find the typos, but we do not want to be overwhelmed with false errors. It is this latter concern which leads people to decide not to use the available grammar checkers - and those who do use them tend to turn off the style rules, which they regard as pure noise, making the whole question of the value of grammar checkers rather controversial (Wampler, 1995).</Paragraph> <Paragraph position="3"> By targeting commonly confused words, rather than the general problem, we have simply to search for contexts which differentiate the words well. Furthermore, we only wish to distinguish words on grammatical grounds - probabilistic methods which rely on semantic associations are simply too sensitive to changes in genre or topic to use as the primary discriminant.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Differential Grammars </SectionTitle> <Paragraph position="0"> A differential grammar is basically a small set of environments that allows us to differentiate between a pair of confused words in all contexts (Kernick, 1996). However, we want to emphasize that the approach need not be limited to word confusion or grammar checking, and that there is no need to limit it to pairs of targets. Conversely, we also want to strengthen the definition slightly, as we want to target syntactic errors. We therefore define a Differential Grammar as: Definition: Differential Grammar A minimal set of syntactically significant environments that differentiate amongst a set of possible targets.</Paragraph> <Paragraph position="1"> However, we do not wish to have to specify the differential grammars or the syntactic environments, but rather wish to learn them. 
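To make the object being learned concrete, the following is a minimal sketch, under our own assumptions, of how a differential grammar might be represented in code: a map from environments to per-target counts, from which relative likelihoods can be read off. All names are illustrative, not the implementation described in this paper.

    from collections import defaultdict

    class DifferentialGrammar:
        """Hypothetical container: environment -> per-target occurrence counts."""

        def __init__(self, targets):
            self.targets = set(targets)  # e.g. {"from", "form"}
            # env is a tuple of context units around the target slot
            self.counts = defaultdict(lambda: defaultdict(int))

        def add(self, env, target):
            assert target in self.targets
            self.counts[env][target] += 1

        def likelihood(self, env, target):
            """Relative frequency of target among all targets seen in env."""
            total = sum(self.counts[env].values())
            return self.counts[env][target] / total if total else None

How such grammars, and the environments they store, can be learned rather than specified is the subject of what follows.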
For the commonly confused word problem (in the more general and inclusive sense that encompasses everything from common typos to near homophones), we thus have three aspects of the problem where we wish to do some kind of learning: 1. identifying pairs of commonly confused words; 2. selecting appropriate syntactically significant environments; 3. deciding when an error has occurred.</Paragraph> <Paragraph position="2"> This involves learning in three different senses, and the programs we present here have performed each kind of learning to varying degrees. First, we want to learn to select appropriate data on which to apply our grammar-building methodology - we want to avoid having to provide positive and negative examples. Second, we want to learn what is syntactically appropriate - we want to avoid having to provide tags, bracketing or parses. Third, we want to learn and dynamically adjust to what the user wants and the target text requires - we want to avoid users having to set parameters.</Paragraph> <Paragraph position="3"> In addition we have some further goals relating to optimization: 4. minimizing the size of the differential grammar; 5. ensuring the significance of the contexts stored; 6. facilitating the users' interactions with the system.</Paragraph> <Paragraph position="4"> 4 Discovery of Confused Word Pairs We consider here the first of our six goals. We assume that certain substitutions are more likely than others, and we do not aim to deal with the general case, which includes arbitrary unlikely substitutions. We distinguish six different types of reasons for substituted-word errors: a. typos: keyboard proximity (knowledge of keyboard used); b. phonos: phonological proximity (phonological features used); c. grammos: grammatical proximity (grammatical features used); d. frequens: frequency disparity (frequency information used); e. foreignish: interlinguistic disparity (not targeted at present); f. idiosyncratic: unknown reason (some system or user confuses the pair).</Paragraph> <Paragraph position="5"> Note that we do not include semantic or style errors - the latter tend to be a result of prescriptive linguists proscribing certain constructs, or maintaining traditional distinctions which are falling out of common use: e.g. the distinction between 'due to' (as meaning 'caused by' but not 'because of') and 'owing to' (as meaning 'because of' but not 'caused by'); rules about prepositions at the end of sentences; split infinitives; deprecated passives; etc.</Paragraph> <Paragraph position="6"> In our first set of experiments (a) has been modeled by simple adjacency on the keyboard, testing in principle all pairs in decreasing order of frequency of the more frequent member. Errors are modeled primarily as systematic displacements of the hand on the keyboard, or substitutions of adjacent characters in the order-1 case. Deletions are handled by treating the empty string '' as being adjacent to all characters, and insertions are handled inversely as deletions. Transpositions and interspersions can be ranked on the number of moved characters, and displacements by the number of substituted characters, but in our experiment we limited both to one. The grammatical errors (c) are somewhat trickier to characterize, but a brute-force first approximation would simply list all derivations from each root, ideally working at the morphological level. 
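As a sketch of how the keyboard-adjacency model (a) could generate candidate confusion pairs, assuming a hypothetical adjacency table (only a QWERTY fragment is shown); '' is adjacent to everything, which models deletions and, inversely, insertions:

    # Fragment of a QWERTY adjacency table; '' neighbours every character,
    # modelling single-character deletions (and, inversely, insertions).
    ADJACENT = {
        "m": {"n", ",", "j", "k", ""},
        "n": {"b", "m", "h", "j", ""},
        "r": {"e", "t", "f", ""},
        "t": {"r", "y", "g", ""},
    }

    def one_key_variants(word):
        """All strings reachable by one adjacent-key substitution or deletion."""
        variants = set()
        for i, ch in enumerate(word):
            for sub in ADJACENT.get(ch, {""}):
                variants.add(word[:i] + sub + word[i + 1:])
        return variants - {word}

    def confusion_candidates(lexicon):
        """Pairs of real words one keyboard slip apart, e.g. ('out', 'our')."""
        words = set(lexicon)
        return {(w, v) for w in words for v in one_key_variants(w) if v in words}

Transpositions would be handled analogously and, as above, both the number of substituted and the number of moved characters would be capped at one.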
In relation to (f) we note that for any confusion pair identified by the commercial grammar checkers on our test texts (true error or false error), we have forced generation of Differential Grammars. We are trying to target exactly the same class of substitution errors. This has two effects: it increases the possibility of false errors and decreases the possibility of missed errors.</Paragraph> <Paragraph position="7"> The frequens (d) are an interesting class (Kernick, 1996) in that we tend to make disproportionately more errors in which one word of the confused pair is very frequent and the other less so. The very frequent words seem to be more easily activated than their near homonyms, and we have a tendency to type the frequens automatically even when it is the less frequent partner that was intended. This is handled in our experiments by our use of the higher of the two frequencies in ranking pairs for grammar generation and evaluation. In addition, we could (but don't) allow more latitude in the search for pairings of frequent words in classes (a), (b) and (c), e.g. by increasing the number of characters or features that might be out of place, substituted or inserted. Instead, we have manually included pairs involving some of the most common words.</Paragraph> <Paragraph position="8"> Two common errors (which I made in typing the last paragraph) are 'out' → 'our' (a) and 'our' → 'are' (b/d), and the single most common error is 'its/it's' - but often the confusion classes are not simply pairs of words, e.g. 'yaw/your/yore/you're' and 'there/their/they're'.</Paragraph> <Paragraph position="9"> We tested around 100 pairs of words generated automatically on the basis of keyboard proximity (a), as well as those proposed manually under (b) to (f). 76 were used in our system and 55 were rejected for lack of either significance or discrimination.</Paragraph> <Paragraph position="10"> 5 Building an efficient Differential Grammar We now refine the concept of a differential grammar and present the specific form we employ. The first thing we need to do is to define what we mean by 'syntactically significant environments'. Basically we collect Ngram statistics for each of the target words, but we reduce the volume of statistics by a form of syntactic preclassification; we then start with a minimal diameter and progressively increase it until a desired degree of certainty and significance is reached.</Paragraph> <Paragraph position="11"> We define an environment and its diameter straightforwardly (again generalizing (Kernick, 1996)): Definition: Environment A sequence of contiguous units which includes the target unit.</Paragraph> <Paragraph position="12"> Definition: Diameter The number of units other than the target which constitute the specified environment of the target unit.</Paragraph> <Paragraph position="13"> We note that we have generalized the definition from the specific focus on words we have here: read 'word' for 'unit' throughout. Also we highlight the fact that there is not a unique environment for each target word, but rather that there are, in general, multiple environments for each possible diameter. Diameter 0 refers to the target word alone. Its frequency relative to the combined frequency of the other members of a confused word set provides a 0th-order likelihood of the word being correct. 
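A minimal sketch, under our own naming assumptions, of how environments of a given diameter could be read off a token sequence, and of the diameter-0 likelihood, which is just the target's relative frequency within its confusion set:

    def environments(tokens, target, diameter):
        """Yield (left, right) context tuples around each occurrence of target.
        Even diameters give one symmetric split; odd diameters give the two
        near-symmetric splits (extra unit on the left or on the right)."""
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            for left_len in sorted({diameter // 2, (diameter + 1) // 2}):
                right_len = diameter - left_len
                left = tuple(tokens[max(0, i - left_len):i])
                right = tuple(tokens[i + 1:i + 1 + right_len])
                yield (left, right)

    def zeroth_order_likelihood(freqs, target, confusion_set):
        """Diameter 0: frequency of target relative to its confusion set."""
        total = sum(freqs[w] for w in confusion_set)
        return freqs[target] / total if total else None

It is exactly this diameter-0 relative frequency whose reliability is questioned next.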
However, because of the phenomenon of frequens, this is a highly unreliable estimate and we therefore eschew it no matter how great the order-0 likelihood of error may be. Nevertheless, some commercial word processors seem to use an order-0 model and flag all occurrences of certain words (Kernick, 1996).</Paragraph> <Paragraph position="14"> We therefore look for significant environments of larger diameter and estimate the probability of each alternative in terms of its relative frequency in that environment. An environment of diameter N-1 corresponds to an N-gram, and environment statistics are therefore derived from N-gram statistics. In practice we limit ourselves to near-symmetric environments with the target word in the centre. This gives us a unique environment for each even diameter, but a pair of environments designated +N (more left context) and -N (more right context) for each odd diameter N.</Paragraph> <Paragraph position="15"> Our algorithm stores statistics only for significant environments and increases the diameter progressively, up to a predefined maximum diameter or until a specified certainty threshold is reached. Note that we overload our term 'environment' to mean not only a specific environment of a target word in the current text, but also the set of environments of the target word in the training corpus having the specified diameter, as summarized in the collected statistics. Our usage will be clear from context, and will vary according to whether we are talking about testing/text or training/corpus (respectively).</Paragraph> <Paragraph position="16"> This approach to the storing of Differential Grammars helps to keep the storage requirements for a given confused word set small, and thus contributes to our goal (4) of minimizing the size of the grammar, without significantly affecting reliability (precision and detection rate).</Paragraph> <Paragraph position="18"> We now discuss the different models of significance we have experimented with, and the issue of combining information from multiple environments. We will illustrate this with examples relating to the common substitution 'from' → 'form', and we will assume that a desired minimal level of significance s has been specified (0.95 in the examples). Note that models which are statistically significant are also likely to be skewed so as to provide good discrimination, but the reverse is not in general true. The more data we have, the more statistically significant a particular likelihood ratio is - so we also want to ensure 'significance' in a more application-oriented sense and only store information which is both significant and discriminative.</Paragraph> <Paragraph position="19"> Since we are training on a large corpus, and allowing for potentially many confusion sets from a huge set of possibilities, we want to eliminate as quickly as possible those pairs which can't possibly be discriminated reliably. For this purpose we introduced a first-order test based on the binary Laplacian estimator, and require that the number of instances N_d of a specific environment of diameter d for the confusion set satisfies N_d > 1/(1 - s) - 2. For our modest 95% significance level, only 18 examples are required. 
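This bound is just the condition under which the Laplace-corrected (add-one) ratio could reach s even in the best case, where every instance in the environment belongs to a single member: (N + 1)/(N + 2) > s exactly when N > 1/(1 - s) - 2. A sketch of the screen, with function names that are our own:

    def laplacian_ratio(n_target, n_other):
        """Binary Laplacian (add-one) estimate of the target's likelihood."""
        return (n_target + 1) / (n_target + n_other + 2)

    def passes_screen(n_instances, s=0.95):
        """Keep an environment only if N_d > 1/(1 - s) - 2, i.e. only if
        even a unanimous environment could reach significance s under the
        Laplacian estimate.  For s = 0.95 the bound evaluates to 18."""
        return n_instances > 1.0 / (1.0 - s) - 2.0

Any environment failing this screen can be discarded before any finer significance test is applied.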
Note that, as previously discussed, we do not ever directly use the 0-diameter environment alone to determine likelihoods, but rather we insist that environments must be more significant than the 0-environment in terms of the Laplacian test.</Paragraph> <Paragraph position="20"> Next we want to ensure that each environment is not only significant in its own right but is significantly different from the next smaller environment. The example of 1105 instances of 'from' and 5 of 'form' being reduced to 47:1 (Kernick, 1996) is significant according to our Laplacian estimate, but the 1 could just as easily have been a 0 or a 2, and thus doesn't improve on the smaller environment.</Paragraph> <Paragraph position="21"> The larger environment is not significantly different according to a likelihood derivative, which considers the rate of change of the ratio (Kernick, 1996), and the difference between the two environments is also not significant by Fisher's exact test (Winston, 1993) (or the closely related G² (Kilgarriff, 1996)), which assesses the probability that the distribution is due to chance.</Paragraph> <Paragraph position="22"> Note that we make no attempt to correct or smooth data, since 0% and 100% likelihoods are not undesirable, and the 'corrections' are as likely to distort as to improve the data (Church and Gale, 1991); rather we discard data which does not reach significance. Our current model also discounts any cases handled by a larger environment (see section 9).</Paragraph> <Paragraph position="23"> This leaves two further issues to discuss. The first is how we determine likelihoods. Normally it is very simple: we take the biggest environment that matches and simply use the relative probability of the target word in the sampled environment. In the case of odd diameters, both a left and a right environment may exist, and these may agree or disagree.</Paragraph> <Paragraph position="24"> Here we use the min operator to combine them: the minimum fits both the case where they tend to agree, where it is simply conservative, and the case where they tend to differ, where, if one side thinks the word is wrong, it is probably not correct. However, this is not always the case.</Paragraph> <Paragraph position="25"> An actual example from our experiment on Usenet text is: "I don't know where you're coming from on this", where the diameter-one environments strongly suggest "coming from" and "form on" respectively. With a larger corpus a significant diameter-two environment for 'from' would emerge to handle this idiom.</Paragraph> <Paragraph position="26"> The second question is how to allow for the biasing effect of frequens, where we tend to make disproportionately more errors in the direction of the more frequent word, since this is also the direction our statistics are pointing. 
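A sketch of this min combination (names are our own); in the Usenet example above, the left environment supports 'from' while the right supports 'form', so the minimum conservatively doubts the correct 'from' - precisely the residual failure case just noted:

    def combine_likelihoods(left_lh, right_lh):
        """Combine the two odd-diameter estimates (+N and -N) with min:
        conservative when the sides agree, and trusting the more
        pessimistic side when they disagree."""
        sides = [p for p in (left_lh, right_lh) if p is not None]
        return min(sides) if sides else None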
The biasing effect of frequens can, however, be reduced by taking larger environments, and this provides support for the intuitively and empirically determined threshold function of (Kernick, 1996), which converts the user-supplied precision (φ) to a discrimination threshold value (θ) which increases as the diameter (d) of the environment reduces, thus giving more credence to the larger-diameter environments: θ = φ / ((1 - φ) · 2d).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Focussing on syntactic contexts </SectionTitle> <Paragraph position="0"> Limiting our environments to a small diameter already biases toward correlations which express syntactic rather than semantic relationships (Finch, 1993), the very phenomenon that is responsible for the success of Ngram models as an alternative to a grammar (Charniak, 1993). Semantic associations are normally found in larger segments, in which the frequency of semantically primed words is higher than expected on the basis of relative frequency in the corpus or the language (this locally increased frequency has recently been dubbed 'clumping' (Church and Gale, 1995), and various measures have been proposed to compensate for it (Kilgarriff, 1996)).</Paragraph> <Paragraph position="1"> Another statistical attribute which is associated with the syntax/semantics distinction is the raw frequency of a word. The most frequent words tend to be syntactic in nature, and will often be function words or closed-class words. The less frequent words tend to be content words or open-class words, like nouns and verbs (Kilgarriff, 1996). Due to the extreme skewing of the frequency distribution, varying inversely with rank according to Zipf's law, the 150 most frequent words of a corpus with a lexicon of 250,000 words cover over half of the corpus (Kernick, 1996).</Paragraph> <Paragraph position="2"> Collecting statistics based on these 150 'eigenwords', almost all of which are function words, gives us a syntactic bias, and we used the standard Unix list /usr/lib/eign in our initial experiments.</Paragraph> <Paragraph position="3"> Furthermore, since the function words tend to act at fairly close quarters, these words are appropriate for smallish environments. However, we don't want our statistics to be restricted to environments composed solely of the 150 eigenwords, and we do not want to have to resort to just bigram statistics collected at various displacements (Finch, 1993). Indeed for function words, for anything but the smallest displacements, our experiments show that such bigram statistics quickly approach corpus/author norms (Kernick, 1996). The obvious step is to introduce an 'open class', denoted by 'O', as a placeholder for the rest of the lexicon. But we can do better than this by finding other useful classes which are easy to discover using our collected statistics. We therefore move our search for syntactic cues to the morphological level. 
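Before the morphological refinement described next, the basic eigenword reduction might look like the following minimal sketch (the word set here is a tiny illustrative fragment of the 150 eigenwords, and the function name is our own):

    # Tiny fragment of the ~150 high-frequency 'eigenwords' (cf. /usr/lib/eign);
    # every other word collapses into the open class 'O'.
    EIGENWORDS = {"the", "a", "an", "of", "to", "in", "is", "on", "from", "for"}

    def to_eigenunits(tokens, target):
        """Map every token except the target to an eigenword or to 'O',
        so that Ngram statistics range over a small, syntactic alphabet."""
        return [t if t == target or t in EIGENWORDS else "O" for t in tokens]

    # e.g. to_eigenunits("I don't know where you're coming from".split(), "from")
    # -> ['O', 'O', 'O', 'O', 'O', 'O', 'from']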
Again, rather than seeking to develop a formal morphology and associate grammatical information with the morphemes, we simply keep additional statistics for words classified by the most common affixes (we use 12 suffixes for English).</Paragraph> <Paragraph position="4"> Note that our residue class now represents the null morph, '-0'.</Paragraph> <Paragraph position="5"> After we include numbers and punctuation we end up with a nominal 172 'eigenunits', but irregular or problematic forms could usefully be added to reduce the noise in these blindly recognized classes; e.g., classes may have multiple syntactic functions ('-s' and the null morph '-0' can indicate a noun or a verb) and/or fortuitous mismatches ('-ed' and '-ing' will accept 'red' and 'ring').</Paragraph> <Paragraph position="6"> Fortunately, such mismatches have a good chance of already being one of the 150 eigenwords (better than 50%) and a low probability of occurrence in any particular slot (a fraction of 1%). Those occurrences which are not systematic simply contribute to the overall noise in the method, whilst those which are systematic actually contribute to the success of the technique! In addition, a broad definition of affix, as a word-initial or word-final sequence, can give us affixes which deviate from the morphological. In our first experiment we use only 12 hand-chosen suffixes, but in subsequent experiments we also split each of these classes according to whether the word had a vowel or a consonant onset, which permits us to ensure that we can deal with the 'a/an' distinction. Later we investigate how affixes may be discovered automatically. Note that (Entwisle and Groves, 1994) use essentially the same crude affix information to achieve complete parsing/validation of English sentences using a (computationally expensive) constraint parser.</Paragraph> <Paragraph position="7"> This now allows us to complete the definition of our syntactic environment: the Ngram information is reduced to eigenunit environments by simply replacing each word other than the target by the first matching eigenunit (eigenwords are checked before affix classes, shortest first).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Interface </SectionTitle> <Paragraph position="0"> In addition to the learning of Differential Grammars using the purely syntactic methods defined above, and testing on 'known good' and 'expected bad' text, we have also paid some attention to the user interface.</Paragraph> <Paragraph position="1"> Two interfaces are available: an Emacs interface, which works just like the spell checker, and a frame-based web interface. We use the likelihoods to colour the words, so that the words which are more likely to be wrong are highlighted more strongly. Also, the environment used to make the decision is highlighted contrastingly. This is useful both for the user and for the developer in evaluating and enhancing the performance of the system.</Paragraph> <Paragraph position="2"> We also allow the user to change the threshold at which notification of potential errors occurs. Normally this is set at a relative probability of 0.75 for the target word relative to the other members of the confusion class - a precision setting of 75% (our results are presented for this setting). If this threshold is exceeded, the most likely replacement is automatically proposed.</Paragraph> </Section> </Paper>