<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0304">
  <Title>Accenting unknown words in a specialized language</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> Previous work has addressed text accentuation, with an emphasis on cases where all possible words are assumed to be known (listed in a lexicon). The issue in that case is to disambiguate unaccented words that match several accented word forms in the lexicon, as in the marche/marché examples in the introduction.</Paragraph>
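As an illustration of the lexicon-based setting described above, the following minimal Python sketch indexes accented word forms by their unaccented spelling, so that an unaccented input retrieves every candidate form. The toy lexicon and the names strip_accents and build_index are hypothetical, chosen for illustration only; this is not the method of any of the cited works.

```python
import unicodedata


def strip_accents(word):
    """Return the word with all combining diacritics removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(lexicon):
    """Map each unaccented spelling to the set of accented forms it may stand for."""
    index = {}
    for form in lexicon:
        index.setdefault(strip_accents(form), set()).add(form)
    return index


# Hypothetical toy lexicon (illustrative only).
LEXICON = ["marche", "marché", "cote", "côte", "côté", "table"]
index = build_index(LEXICON)

for unaccented in ("marche", "table", "inconnu"):
    # An empty candidate set means the word is unknown to the lexicon.
    print(unaccented, sorted(index.get(unaccented, set())))
```

In this sketch, an unaccented form with more than one candidate is ambiguous and needs contextual disambiguation, while a form absent from the index corresponds to an 'unknown' word.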
    <Paragraph position="1"> Yarowsky (1999) addresses accent restoration in Spanish and in French, and notes that accent ambiguities can be linked to part-of-speech ambiguities and to semantic ambiguities, both of which context can help to resolve. He proposes three methods to handle these: N-gram tagging, Bayesian classification and decision lists, the last of which obtains the best results. These methods rely on full words, on word suffixes or on parts-of-speech. They are tested on 'the most problematic cases of each ambiguity type', extracted from the Spanish AP Newswire. Agreement with human-accented words reaches 78.4-98.4%, depending on ambiguity type.</Paragraph>
    <Paragraph position="2"> Spriet and El-Beze (1997) use an N-gram model on parts-of-speech. They evaluate this method on a 19,000-word test corpus consisting of news articles and obtain 99.31% accuracy. In this corpus, only 2.6% of the words were unknown, among which 89.5% did not need accents. The resulting error rate (0.3%) accounts for nearly one half of the total error rate, but it is so small that trying to guess the accentuation of unknown words is not worthwhile.</Paragraph>
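The arithmetic behind these figures can be checked with a short Python sketch. It assumes that unknown words are left untouched, so that every unknown word needing an accent counts as one error; the variable names are illustrative.

```python
# Figures reported above for Spriet and El-Beze (1997).
accuracy = 0.9931           # overall accuracy on the test corpus
unknown_rate = 0.026        # share of test words that are unknown
unknown_no_accent = 0.895   # share of unknown words that need no accent

# Assumption: unknown words are left untouched, so each unknown word
# that needs an accent contributes exactly one error.
unknown_error = unknown_rate * (1 - unknown_no_accent)
total_error = 1 - accuracy
share = unknown_error / total_error

print(f"error due to unknown words: {unknown_error:.2%}")
print(f"total error: {total_error:.2%}")
print(f"share of total error due to unknown words: {share:.0%}")
```

Under this assumption, the unknown-word contribution comes to about 0.27% (quoted as 0.3% above) out of a total error of 0.69%.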
    <Paragraph position="3"> The same kind of approach is used in the REACC project (Simard, 1998). Here again, unknown words are left untouched, and they account for one fourth of the errors. We typed the words in Table 1 into the on-line demonstration interface of REACC at www-rali.iro.umontreal.ca/Reacc/: none of these words was accented by the system, although 7 out of 16 do need accentuation.</Paragraph>
    <Paragraph position="4"> When the unaccented words are in the lexicon, the problem can also be addressed as a spelling correction task, using methods such as string edit distance (Levenshtein, 1966), possibly combined with the previous approach (Ruch et al., 2001).</Paragraph>
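To make the spelling-correction view concrete, here is a minimal Python sketch of string edit distance (Levenshtein, 1966) used to pick the closest lexicon entries for an unaccented input. The toy lexicon and the helper closest_entries are hypothetical; the cited systems may weight operations differently or combine the distance with contextual models.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical toy lexicon of accented forms (illustrative only).
LEXICON = ["hôpital", "capital", "hôtel"]


def closest_entries(word, lexicon=LEXICON):
    """Return the lexicon entries at minimal edit distance from word."""
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    best = scored[0][0]
    return [entry for distance, entry in scored if distance == best]


print(closest_entries("hopital"))
```

Replacing an unaccented letter by its accented counterpart costs one substitution here, so the accented form of a known word is typically among the nearest entries. When the word is not in the lexicon at all, the nearest entries are simply other words, which is the limitation discussed next.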
    <Paragraph position="5"> However, these methods have limited power when a word is not in the lexicon. At best, they might say something about accented letters in grammatical affixes which mark contextual, syntactic constraints. We found no specific reference on the accentuation of such 'unknown' words, that is, no method that proposes an accented version of a word when that word is not listed in the lexicon. Indeed, in the above works, the proportion of unknown words is too small for specific steps to be taken to handle them. The situation is quite different in our case, where about one fourth of the words are 'unknown'. Moreover, contextual clues are scarce in our short, often ungrammatical terms.</Paragraph>
    <Paragraph position="6"> We took obvious measures to reduce the number of unknown words: we filtered out the words that can be found in accented lexicons and corpora. But this technique is limited by the size of the corpus that would be necessary for such 'rare' words to occur, and by the limited availability of specialized French lexicons for the medical domain.</Paragraph>
    <Paragraph position="7"> We then designed two methods that can learn accenting rules for the remaining unknown words: (i) adapting a POS-tagging method (Brill, 1995) (section 3.3); (ii) adapting a method designed for learning morphological rules (Theron and Cloete, 1997) (section 3.4).</Paragraph>
  </Section>
</Paper>