<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1013">
  <Title>DECISION LISTS FOR LEXICAL AMBIGUITY RESOLUTION: Application to Accent Restoration in Spanish and French</Title>
  <Section position="4" start_page="0" end_page="88" type="metho">
    <SectionTitle>
PROBLEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> The general problem considered here is the resolution of lexical ambiguity, both syntactic and semantic, based on properties of the surrounding context.</Paragraph>
    <Paragraph position="1"> Accent restoration is merely an instance of a closely-related class of problems including word-sense disambiguation, word choice selection in machine translation, homograph and homophone disambiguation, and capitalization restoration. The given algorithm may be used to solve each of these problems, and has been applied without modification to the case of homograph disambiguation in speech synthesis (Sproat, Hirschberg and Yarowsky, 1992).</Paragraph>
    <Paragraph position="2"> It may not be immediately apparent to the reader why this set of problems forms a natural class, similar in origin and solvable by a single type of algorithm. In each case it is necessary to disambiguate two or more semantically distinct word-forms which have been conflated into the same representation in some medium.</Paragraph>
    <Paragraph position="3"> In the prototypical instance of this class, word-sense disambiguation, such distinct semantic concepts as river bank, financial bank and to bank an airplane are conflated in ordinary text. Word associations and syntactic patterns are sufficient to identify and label the correct form. In homophone disambiguation, distinct semantic concepts such as ceiling and sealing have also become represented by the same ambiguous form, but in the medium of speech and with similar disambiguating clues.</Paragraph>
    <Paragraph position="4"> Capitalization restoration is a similar problem in that distinct semantic concepts such as AIDS/aids (disease or helpful tools) and Bush/bush (president or shrub) are ambiguous, but in the medium of all-capitalized (or case-free) text, which includes titles and the beginnings of sentences. Note that what was once just a capitalization ambiguity between Prolog (computer language) and prolog (introduction) is becoming a &amp;quot;sense&amp;quot; ambiguity, since the computer language is now often written in lower case; this indicates the fundamental similarity of these problems.</Paragraph>
    <Paragraph position="5"> Accent restoration involves lexical ambiguity, such as between the concepts côte (coast) and côté (side), in textual media where accents are missing. It is traditional in Spanish and French for diacritics to be omitted from capitalized letters. This is particularly a problem in all-capitalized text such as headlines. Accents in on-line text may also be systematically stripped by many computational processes which are not 8-bit clean (such as some e-mail transmissions), and may be routinely omitted by Spanish and French typists in informal computer correspondence.</Paragraph>
    <Paragraph position="6"> Missing accents may create both semantic and syntactic ambiguities, including tense or mood distinctions which may only be resolved by distant temporal markers or non-syntactic cues. The most common accent ambiguity in Spanish is between the endings -o and -ó, as in completo vs. completó. This is a present/preterite tense ambiguity for nearly all -ar verbs, and very often also a part-of-speech ambiguity, as the -o form is frequently a noun as well. The second most common general ambiguity is between the past-subjunctive and future tenses of nearly all -ar verbs (e.g. terminara vs. terminará), both of which are 3rd person singular forms. This is a particularly challenging class and is not readily amenable to traditional part-of-speech tagging algorithms such as local trigram-based taggers. Some purely semantic ambiguities include the nouns secretaria (secretary) vs. secretaría (secretariat), sabana (grassland) vs. sábana (bed sheet), and politica (female politician) vs. política (politics). The distribution of ambiguity types in French is similar. The most common case is between -e and -é, which is both a past participle/present tense ambiguity and often a part-of-speech ambiguity (with nouns and adjectives) as well. Purely semantic ambiguities are more common than in Spanish, and include traité/traite (treaty/draft), marche/marché (step/market), and the côte example mentioned above.</Paragraph>
    <Paragraph position="7"> Accent restoration provides several advantages as a case study for the explication and evaluation of the proposed decision-list algorithm. First, as noted above, it offers a broad spectrum of ambiguity types, both syntactic and semantic, and shows the ability of the algorithm to handle these diverse problems. Second, the correct accent pattern is directly recoverable: unlimited quantities of test material may be constructed by stripping the accents from correctly-accented text and then using the original as a fully objective standard for automatic evaluation. By contrast, in traditional word-sense disambiguation, hand-labeling training and test data is a laborious and subjective task. Third, the task of restoring missing accents and resolving ambiguous forms shows considerable commercial applicability, both as a stand-alone application and as part of the front-end to NLP systems. There is also a large potential commercial market in its use in grammar and spelling correctors, and in aids for inserting the proper diacritics automatically as one types. Thus while accent restoration may not be the prototypical member of the class of lexical-ambiguity resolution problems, it is an especially useful one for describing and evaluating a proposed solution to this class of problems.</Paragraph>
  </Section>
  <Section position="6" start_page="89" end_page="92" type="metho">
    <SectionTitle>
ALGORITHM
</SectionTitle>
    <Paragraph position="0"> Step 1: Identify the Ambiguities in Accent Pattern Most words in Spanish and French exhibit only one accent pattern. Basic corpus analysis will indicate which is the most common pattern for each word, and may be used in conjunction with or independent of dictionaries and other lexical resources.</Paragraph>
    <Paragraph position="1"> The initial step is to take a histogram of a corpus with accents and diacritics retained, and compute a table of accent pattern distributions. For words with multiple accent patterns, steps 2-5 are applied.</Paragraph>
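Step 1 can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation; the toy corpus, the function names, and the NFD-based accent stripping are assumptions of this sketch:

```python
from collections import Counter, defaultdict
import unicodedata

def strip_accents(word):
    """Return the de-accented base form (drop combining marks after NFD)."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def accent_pattern_table(corpus_tokens):
    """Histogram of observed accent patterns, keyed by de-accented form."""
    table = defaultdict(Counter)
    for tok in corpus_tokens:
        table[strip_accents(tok)][tok] += 1
    return table

# Toy accented corpus standing in for the real training text.
tokens = "el registro completo se completó y el lado este".split()
table = accent_pattern_table(tokens)
# Words with more than one observed pattern proceed to steps 2-5.
ambiguous = {base for base, pats in table.items() if len(pats) > 1}
```

Here `ambiguous` contains only `completo`, which surfaces both as completo and completó in the toy corpus; all other words exhibit a single pattern and need no further processing.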
    <Paragraph position="2">  Step 2: Collect Training Contexts For a particular case of accent ambiguity identified above, collect ±k words of context around all occurrences in the corpus, label the concordance line with the observed accent pattern, and then strip the accents from the data. This will yield a training set such as the following:  Pattern Context (1) côté du laisser de cote faute de temps (1) côté appeler l' autre cote de l' atlantique (1) côté passe de notre cote de la frontiere (2) côte vivre sur notre cote ouest toujours verte (2) côte creer sur la cote du labrador des (2) côte travaillaient cote a cote , ils avaient  Baseline accuracy for this data (using the most common pronunciation) is 67%. The training corpora used in this experiment were the Spanish AP Newswire (1991-1993, 49 million words),</Paragraph>
    <Paragraph position="3"> the French Canadian Hansards (1986-1988, 19 million words), and a collection from Le Monde (1 million words).</Paragraph>
    <Paragraph position="4">  Step 3: Measure Collocational Distributions The driving force behind this disambiguation algorithm is the uneven distribution of collocations with respect to the ambiguous token being classified. Certain collocations will indicate one accent pattern, while different collocations will tend to indicate another. The goal of this stage of the algorithm is to measure a large number of collocational distributions to select those which are most useful in identifying the accent pattern of the ambiguous word.</Paragraph>
    <Paragraph position="5"> The following are the initial types of collocations considered (the term collocation is used here in its broad sense, meaning words appearing adjacent to or near each other, literally in the same location; it does not imply only idiomatic or non-compositional associations):  * Word immediately to the right (+1 W) * Word immediately to the left (-1 W) * Word found in ±k word window (±k W) * Pair of words at offsets -2 and -1 * Pair of words at offsets -1 and +1 * Pair of words at offsets +1 and +2  For the two major accent patterns of the French word cote, below is a small sample of these distributions for several types of collocations. This core set of evidence presupposes no language-specific knowledge. However, if additional language resources are available, it may be desirable to include a larger feature set. For example, if lemmatization procedures are available, collocational measures for morphological roots will tend to yield more succinct and generalizable evidence than measuring the distributions for each of the inflected forms. If part-of-speech information is available in a lexicon, it is useful to compute the</Paragraph>
    <Paragraph position="6"> (The optimal value of k is sensitive to the type of ambiguity: semantic or topic-based ambiguities warrant a larger window, k ≈ 20-50, while more local syntactic ambiguities warrant a smaller window, k ≈ 3 or 4.)  distributions for part-of-speech bigrams and trigrams as above. Note that it is not necessary to determine the actual parts-of-speech of words in context; using only the most likely part of speech or a set of all possibilities will produce adequate, if somewhat diluted, distributional evidence. Similarly, it is useful to compute collocational statistics for arbitrary word classes, such as the class WEEKDAY = {domingo, lunes, martes, ...}. Such classes may cover many types of associations, and need not be mutually exclusive.</Paragraph>
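The collocation types enumerated above can be extracted mechanically. The sketch below is illustrative only: the feature labels, the function signature, and the (accent-stripped) WEEKDAY set are assumptions of this example:

```python
WEEKDAYS = {"domingo", "lunes", "martes", "miercoles", "jueves", "viernes", "sabado"}

def collocations(left, right, k=4):
    """Enumerate collocational features for one ambiguous occurrence.
    left/right: accent-stripped context words, nearest-first on each side."""
    feats = []
    if right:
        feats.append(("+1W", right[0]))                 # word to the right
    if left:
        feats.append(("-1W", left[0]))                  # word to the left
    if len(left) >= 2:
        feats.append(("-2W,-1W", (left[1], left[0])))   # pair at -2, -1
    if left and right:
        feats.append(("-1W,+1W", (left[0], right[0])))  # pair at -1, +1
    if len(right) >= 2:
        feats.append(("+1W,+2W", (right[0], right[1]))) # pair at +1, +2
    for w in (left + right)[:2 * k]:
        feats.append(("±kW", w))                        # word in ±k window
        if w in WEEKDAYS:                               # word-class feature
            feats.append(("±kW", "WEEKDAY"))
    return feats
```

Because the word-class feature is emitted alongside the raw word, a single occurrence can contribute several highly non-independent features; the single-best-evidence decision procedure described later makes this harmless.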
    <Paragraph position="7"> For the French experiments, no additional linguistic knowledge or lexical resources were used. The decision lists were trained solely on raw word associations without additional patterns based on part of speech, morphological analysis or word class. Hence the reported performance is representative of what may be achieved with a rapid, inexpensive implementation based strictly on the distributional properties of raw text.</Paragraph>
    <Paragraph position="8"> For the Spanish experiments, a richer set of evidence was utilized. Use of a morphological analyzer (developed by Tzoukermann and Liberman (1990)) allowed distributional measures to be computed for associations of lemmas (morphological roots), improving generalization to different inflected forms not observed in the training data. Also, a basic lexicon with possible parts of speech (augmented by the morphological analyzer) allowed adjacent part-of-speech sequences to be used as disambiguating evidence. A relatively coarse level of analysis (e.g. NOUN, ADJECTIVE, SUBJECT-PRONOUN, ARTICLE, etc.), augmented with independently modeled features representing gender, person, and number, was found to be most effective. However, when a word was listed with multiple parts-of-speech, no relative frequency distribution was available. Such words were given a part-of-speech tag consisting of the union of the possibilities (e.g. ADJECTIVE-NOUN), as in Kupiec (1989). Thus sequences of pure part-of-speech tags were highly reliable, while the potential sources of noise were isolated and modeled separately. In addition, several word classes such as WEEKDAY and MONTH were defined, primarily focusing on time words because so many accent ambiguities involve tense distinctions. To build a full part-of-speech tagger for Spanish would be quite costly (and require special tagged corpora). The current approach uses just the information available in dictionaries, exploiting only that which is useful for the accent restoration task. Were dictionaries not available, a productive approximation could have been made using the associational distributions of suffixes (such as -aba, -aste, -amos) which are often satisfactory indicators of part of speech in morphologically rich languages such as Spanish.</Paragraph>
    <Paragraph position="9"> The use of the word-class and part-of-speech data is illustrated below. Step 4: Sort by Log-Likelihood Ratios The next step is to compute the ratio called the log-likelihood: Abs(Log(Pr(Accent_Pattern_1 | Collocation_i) / Pr(Accent_Pattern_2 | Collocation_i))) The collocations most strongly indicative of a particular pattern will have the largest log-likelihood. Sorting by this value will list the strongest and most reliable evidence first.</Paragraph>
    <Paragraph position="10">  Evidence sorted in the above manner will yield a decision list like the following, highly abbreviated example:  The resulting decision list is used to classify new examples by identifying the highest line in the list that matches the given context and returning the indicated classification. Problems arise when an observed count is 0. Clearly the probability of seeing côté in the context of poisson is not 0, even though no such collocation was observed in the training data. Finding a more accurate probability estimate depends on several factors, including the size of the training sample, the nature of the collocation (adjacent bigrams or wider context), our prior expectation about the similarity of contexts, and the amount of noise in the training data. Several smoothing methods have been explored here, including those discussed in (Gale et al., 1992). In one technique, all observed distributions with the same 0-denominator raw frequency ratio (such as 2/0) are taken collectively, the average agreement rate of these distributions with additional held-out training data is measured, and from this a more realistic estimate of the likelihood ratio (e.g. 1.8/0.2) is computed. However, in the simplest implementation, satisfactory results may be achieved by adding a small constant α to the numerator and denominator, where α is selected empirically to optimize classification performance. For this data, relatively small α (between 0.1 and 0.25) tended to be effective, while noisier training data warrant larger α. Entries marked with † are pruned in Step 5, below.</Paragraph>
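Steps 3-4 (counting collocational distributions, add-α smoothing, and sorting by the absolute log-likelihood ratio) can be sketched for the two-pattern case. The data structures and toy counts here are assumptions of this sketch:

```python
from math import log
from collections import Counter, defaultdict

def build_decision_list(examples, alpha=0.2):
    """examples: (pattern, [features]) pairs over exactly two accent patterns.
    Returns (abs log-likelihood ratio, feature, majority pattern) triples,
    strongest evidence first, with add-alpha smoothing for zero counts."""
    patterns = sorted({p for p, _ in examples})
    assert len(patterns) == 2
    counts = defaultdict(Counter)
    for pat, feats in examples:
        for f in feats:
            counts[f][pat] += 1
    decisions = []
    for f, c in counts.items():
        n1, n2 = c[patterns[0]], c[patterns[1]]
        ll = abs(log((n1 + alpha) / (n2 + alpha)))
        decisions.append((ll, f, patterns[0] if n1 >= n2 else patterns[1]))
    decisions.sort(key=lambda d: d[0], reverse=True)
    return decisions

examples = [("côte", ["la", "ouest"]), ("côte", ["la"]), ("côté", ["le"])]
decision_list = build_decision_list(examples)
```

With α = 0.2, the feature "la" (seen 2/0 for côte) gets the ratio log(2.2/0.2) and lands at the top of the list, ahead of the features seen only once.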
    <Paragraph position="11">  Step 5: Optional Pruning and Interpolation A potentially useful optional procedure is the interpolation of log-likelihood ratios between those computed from the full data set (the global probabilities) and those computed from the residual training data left at a given point in the decision list when all higher-ranked patterns failed to match (i.e. the residual probabilities). The residual probabilities are more relevant, but since the size of the residual training data shrinks at each level in the list, they are often much more poorly estimated (and in many cases there may be no relevant data left in the residual on which to compute the distribution of accent patterns for a given collocation). In contrast, the global probabilities are better estimated but less relevant. A reasonable compromise is to interpolate between the two, where the interpolated estimate is β × global + γ × residual. When the residual probabilities are based on a large training set and are well estimated, γ should dominate, while in cases where the relevant residual is small or non-existent, β should dominate. If always β = 0 and γ = 1 (exclusive use of the residual), the result is a degenerate (strictly right-branching) decision tree with severe sparse data problems. Alternately, if one assumes that likelihood ratios for a given collocation are functionally equivalent at each line of a decision list, then one could exclusively use the global probabilities (always β = 1 and γ = 0). This is clearly the easiest and fastest approach, as probability distributions do not need to be recomputed as the list is constructed.</Paragraph>
    <Paragraph position="12"> Which approach is best? Using only the global probabilities does surprisingly well, and the results cited here are based on this readily replicable procedure. The reason is grounded in the strong tendency of a word to exhibit only one sense or accent pattern per collocation (discussed in Step 6 and (Yarowsky, 1993)). Most classifications are based on an x vs. 0 distribution, and while the magnitude of the log-likelihood ratios may decrease in the residual, they rarely change sign. There are cases where this does happen and it appears that some interpolation helps, but for this problem the relatively small difference in performance does not seem to justify the greatly increased computational cost.</Paragraph>
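The β/γ interpolation can be sketched as below. The weighting heuristic (trusting the residual in proportion to how much residual data survives) is an assumption of this sketch, not a formula from the text:

```python
from math import log

def interpolated_ll(global_counts, residual_counts, alpha=0.2, min_residual=5):
    """Interpolate log-likelihood ratios: beta * global + gamma * residual.
    gamma grows with the amount of residual data; beta = 1 - gamma."""
    g1, g2 = global_counts
    r1, r2 = residual_counts
    gamma = min(1.0, (r1 + r2) / min_residual)  # crude reliability weight
    beta = 1.0 - gamma
    ll_global = log((g1 + alpha) / (g2 + alpha))
    ll_residual = log((r1 + alpha) / (r2 + alpha))
    return beta * ll_global + gamma * ll_residual
```

With no residual data, gamma = 0 and the function reduces to the global estimate, i.e. the β = 1, γ = 0 configuration actually used for the reported results.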
    <Paragraph position="13"> Two kinds of optional pruning can also increase the efficiency of the decision lists. The first handles the problem of &amp;quot;redundancy by subsumption,&amp;quot; which is clearly visible in the example decision lists above (in WEEKDAY and domingo). When lemmas and word-classes precede their member words in the list, the latter will be ignored and can be pruned. If a bigram is unambiguous, probability distributions for dependent trigrams will not even be generated, since they will provide no additional information.</Paragraph>
    <Paragraph position="14"> The second, pruning in a cross-validation phase, compensates for the minimal observed over-modeling of the data. Once a decision list is built it is applied to its own training set plus some held-out cross-validation data (not the test data). Lines in the list which contribute to more incorrect classifications than correct ones are removed. This also indirectly handles problems that may result from the omission of the interpolation step. If space is at a premium, lines which are never used in the cross-validation step may also be pruned. However, useful information is lost here, and words pruned in this way may have contributed to the classification of testing examples. A 3% drop in performance is observed, but an over 90% reduction in space is realized. The optimum pruning strategy is thus subject to cost-benefit analysis. In the results reported below, all pruning except this final space-saving step was utilized.</Paragraph>
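The cross-validation pruning step can be sketched as follows. The list and held-out data layouts are assumptions carried over from the earlier sketches, and classification uses the first matching line, as in Step 7:

```python
from collections import Counter

def prune_decision_list(decision_list, heldout):
    """Remove lines that make more errors than correct calls on held-out data.
    decision_list: (log-likelihood, feature, pattern) triples, sorted;
    heldout: (true pattern, set of features in context) pairs."""
    right, wrong = Counter(), Counter()
    for true_pat, feats in heldout:
        for ll, f, pat in decision_list:  # first matching line wins
            if f in feats:
                (right if pat == true_pat else wrong)[f] += 1
                break
    return [(ll, f, pat) for ll, f, pat in decision_list
            if right[f] >= wrong[f]]

toy_list = [(2.0, "la", "côte"), (1.0, "le", "côté")]
heldout = [("côté", {"la"}), ("côté", {"le"})]
pruned = prune_decision_list(toy_list, heldout)
```

Only lines actually reached (i.e. the highest match in some context) accumulate counts, so a line subsumed by stronger evidence above it is never charged for errors it did not cause.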
    <Paragraph position="15"> Step 6: Train Decision Lists for General</Paragraph>
    <Section position="1" start_page="91" end_page="92" type="sub_section">
      <SectionTitle>
Classes of Ambiguity
</SectionTitle>
      <Paragraph position="0"> For many similar types of ambiguities, such as the Spanish subjunctive/future distinction between -ara and -ará, the decision lists for individual cases will be quite similar and use the same basic evidence for the classification (such as the presence of nearby time adverbials). It is useful to build a general decision list for all -ara/-ará ambiguities. This also tends to improve performance on words for which there is inadequate training data to build a full individual decision list. The process for building this general class disambiguator is basically identical to that described in Steps 2-5 above, except that in Step 2, training contexts are pooled for all individual instances of the class (such as all -ara/-ará ambiguities). It is important to give each individual -ara word roughly equal representation in the training set, however, lest the list model the idiosyncrasies of the most frequent class members rather than identify the shared common features representative of the full class.</Paragraph>
      <Paragraph position="1"> In Spanish, decision lists are trained for the general ambiguity classes including -o/-ó, -e/-é, -ara/-ará, and -aran/-arán. For each ambiguous word belonging to one of these classes, the accuracy of the word-specific decision list is compared with the class-based list. If the class's list performs adequately it is used. Words with idiosyncrasies that are not modeled well by the class's list retain their own word-specific decision list.</Paragraph>
      <Paragraph position="2"> Step 7: Using the Decision Lists Once these decision lists have been created, they may be used in real time to determine the accent pattern for ambiguous words in new contexts.</Paragraph>
      <Paragraph position="3"> At run time, each word encountered in a text is looked up in a table. If the accent pattern is unambiguous, as determined in Step 1, the correct pattern is printed. Ambiguous words have a table of the possible accent patterns and a pointer to a decision list, either for that specific word or its ambiguity class (as determined in Step 6). This given list is searched for the highest ranking match in the word's context, and a classification number is returned, indicating the most likely of the word's accent patterns given the context. If all entries in a decision list fail to match in a particular new context, a final entry called DEFAULT is used; it indicates the most likely accent pattern in cases where nothing matches.</Paragraph>
      <Paragraph position="4"> The question, however, is what to do with the less reliable evidence that may also be present in the target context. The common tradition is to combine the available evidence in a weighted sum or product. This is done by Bayesian classifiers, neural nets, IR-based classifiers and N-gram part-of-speech taggers. The system reported here is unusual in that it does no such combination. Only the single most reliable piece of evidence matched in the target context is used. For example, in a context of cote containing poisson, ports and atlantique, if the adjacent feminine article la cote (the coast) is present, only this best evidence is used and the supporting semantic information is ignored. Note that if the masculine article le cote (the side) were present in a similar maritime context, the most reliable evidence (gender agreement) would override the semantic clues which would otherwise dominate if all evidence were combined.</Paragraph>
      <Paragraph position="5"> If no gender agreement constraint were present in that context, the first matching semantic evidence would be used.</Paragraph>
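The single-best-evidence decision procedure described above amounts to a first-match scan of the sorted list. In this sketch the list entries and their log-likelihood values are invented for illustration; only the control flow reflects the text:

```python
def classify(decision_list, context_feats, default="côte"):
    """Return the pattern of the single highest-ranked matching line.
    All lower-ranked matches are deliberately ignored; the default
    (the most common pattern, as with DEFAULT) covers the no-match case."""
    for ll, feat, pattern in decision_list:  # sorted, most reliable first
        if feat in context_feats:
            return pattern
    return default

decision_list = [
    (4.4, ("-1W", "la"), "côte"),        # gender agreement, most reliable
    (4.2, ("-1W", "le"), "côté"),
    (2.5, ("±kW", "poisson"), "côte"),   # semantic (maritime) clues
    (2.1, ("±kW", "atlantique"), "côte"),
]
# 'le cote' in a maritime context: the article outranks the semantic clues.
feats = {("-1W", "le"), ("±kW", "poisson"), ("±kW", "atlantique")}
```

Because only the first match fires, the weaker maritime features never vote, which is exactly how the gender-agreement evidence overrides them in the example above.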
      <Paragraph position="6"> There are several motivations for this approach. The first is that combining all available evidence rarely produces a different classification than just using the single most reliable evidence, and when these differ it is as likely to hurt as to help. In a study comparing results for 20 words in a binary homograph disambiguation task, based strictly on words in local (±4 word) context, the following differences were observed between an algorithm taking the single best evidence and an otherwise identical algorithm combining all available matching evidence. Of course this behavior does not hold for all classification tasks, but it does seem to be characteristic of lexically-based word classifications. This may be explained by the empirical observation that in most cases, and with high probability, words exhibit only one sense in a given collocation (Yarowsky, 1993). Thus for this type of ambiguity resolution, there is no apparent detriment, and some apparent performance gain, from using only the single most reliable evidence in a classification.</Paragraph>
      <Paragraph position="7"> In cases of disagreement, using the single best evidence outperforms the combination of evidence 65% to 35%. This observed difference is 1.9 standard deviations greater than expected by chance and is statistically significant.</Paragraph>
      <Paragraph position="8"> There are other advantages as well, including run-time efficiency and ease of parallelization. However, the greatest gain comes from the ability to incorporate multiple, non-independent information types in the decision procedure. As noted above, a given word in context (such as Castillos) may match several times in the decision list, once for its parts of speech, lemma, capitalized and capitalization-free forms, and possible word-classes as well. By only using one of these matches, the gross exaggeration of probability from combining all of these non-independent log-likelihoods is avoided. While these dependencies may be modeled and corrected for in Bayesian formalisms, it is difficult and costly to do so. Using only one log-likelihood ratio without combination frees the algorithm to include a wide spectrum of highly non-independent information without additional algorithmic complexity or performance loss.</Paragraph>
    </Section>
  </Section>
</Paper>