<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0304"> <Title>Accenting unknown words in a specialized language</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Accenting unknown words </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Filtering out know words </SectionTitle> <Paragraph position="0"> The French MeSH was briefly presented in the introduction; we work with the 2001 version. The part which was accented and converted into mixed case by the CISMeF team is that of November 2001. As more resources are added to CISMeF on a regular basis, a larger number of these accented terms must now be available. The list of word forms that occur in these accented terms serves as our base lexicon (4861 word forms). We removed from this list the 'words' that contain numbers, those that are shorter than 3 characters (abbreviations), and converted them in lower case. The resulting lexicon includes 4054 words (4047 once unaccented). This lexicon deals with single words. It does not try to register complex terms such as myocardial infarction, but instead breaks them into the two words myocardial and infarction.</Paragraph> <Paragraph position="1"> A word is considered unknown when it is not listed in our lexicon. A first concern is to filter out from subsequent processing words that can be found in larger lexicons. The question is then to find suitable sources of additional words.</Paragraph> <Paragraph position="2"> We used various specialized word lists found on the Web (lexicon on cancer, general medical lexicon) and the ABU lexicon (abu.cnam.fr/DICO), which contains some 300,000 entries for 'general' French. Several corpora provided accented sources for extending this lexicon with some medical words (cardiology, haematology, intensive care, drawn from the current state of the CLEF corpus (Habert et al., 2001), and drug monographs). We also used a word list extracted from the French versions of two other medical terminologies: the International Classification of Diseases (ICD-10) and the Microglossary for Pathology of the Systematized Nomenclature of Medicine (SNOMED). This word list contains 8874 different word forms. The total number of word forms of the final word list was 276 445.</Paragraph> <Paragraph position="3"> After application of this list to the MeSH, 7407 words were still not recognized. We converted these words to lower case, removed those that did not include the letter e, were shorter than 3 letters (mainly acronyms) or contained numbers. The remaining 5188 words, among which those listed in table 1, were submitted to the following procedure.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Representing the context of a letter </SectionTitle> <Paragraph position="0"> The underlying hypotheses of this method are that sufficiently regular rules determine, for most words, which letters are accented, and that the context of occurrence of a letter (its neighboring letters) is a good basis for making accentuation decisions. We attempted to compile these rules by observing the occurrences of eeeee in a reference list of words (the training set, for instance, the part of the French MeSH accented by the CISMeF team). In the following, we shall call pivot letter a letter that is part of the confusion set eeeee (set of letters to discriminate). 
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Representing the context of a letter </SectionTitle> <Paragraph position="0"> The underlying hypotheses of this method are that sufficiently regular rules determine, for most words, which letters are accented, and that the context of occurrence of a letter (its neighboring letters) is a good basis for making accentuation decisions. We attempted to compile these rules by observing the occurrences of e, é, è, ê and ë in a reference list of words (the training set, for instance the part of the French MeSH accented by the CISMeF team). In the following, we shall call pivot letter a letter that is part of the confusion set {e, é, è, ê, ë} (the set of letters to discriminate). An issue is then to find a suitable description of the context of a pivot letter in a word, for instance the letter e in excisee (unaccented form of excisée, excised). We explored and compared two different representation schemes, which underlie two accentuation methods.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Accentuation as contextual tagging </SectionTitle> <Paragraph position="0"> This first method is based on the use of a part-of-speech tagger: Brill's (1995) tagger. We consider each word as a 'string of letters': each letter makes one word, and the sequence of letters of a word makes a sentence. The 'tag' of a letter is the expected accented form of this letter (or the same letter if it is not accented). For instance, for the word endometre (endometer), to be accented as endomètre, the 'tagged sentence' is e/e n/n d/d o/o m/m e/è t/t r/r e/e (in the format of Brill's tagger). The regular procedure of the tagger then learns contextual accentuation rules, the first of which are shown in table 2.</Paragraph> <Paragraph position="1"> Table 2: first contextual accentuation rules learnt on the training set.
(1) e é NEXT2TAG i [e.i =&gt; e/é]
(2) e é NEXT1OR2TAG o [e.?o =&gt; e/é]
(3) e é NEXT1OR2TAG a [e.?a =&gt; e/é]
(4) e é NEXT1OR2WD e [e.?e =&gt; e/é]
(5) e é NEXT2TAG h [e.h =&gt; e/é]
(6) é è NEXTBIGRAM ne [éne =&gt; é/è]
(7) é è NEXTBIGRAM me [éme =&gt; é/è]
(8) é è NEXTBIGRAM tr [étr =&gt; é/è]
(9) é e NEXT1OR2OR3TAG x [é.?.?x =&gt; é/e]
(10) e é NEXT1OR2TAG y [e.?y =&gt; e/é]
(11) e é NEXT2TAG u [e.u =&gt; e/é]
(12) e é SURROUNDTAG t i [téi =&gt; e/é]
(13) é è NEXTBIGRAM se [ése =&gt; é/è]
A rule 'a b C t [pattern]' reads: 'change a into b if test C is true on t'; the bracketed pattern summarizes the triggering context. NEXT2TAG = second next tag, NEXT1OR2TAG = one of the next 2 tags, NEXTBIGRAM = next 2 words, NEXT1OR2OR3TAG = one of the next 3 tags, NEXT1OR2WD = one of the next 2 words, SURROUNDTAG = previous and next tags.</Paragraph> <Paragraph position="2"> Given a new 'sentence', Brill's tagger first assigns each 'word' its most frequent tag: this consists in accenting no e. The contextual rules are then applied and successively correct the current accentuation. For instance, when accenting the word flexion, rule (1) first applies (if e with second next tag = i, change to é) and accentuates the e to yield fléxion (as in ...émie). Rule (9) applies next (if é with one of next three tags = x, change to e) to correct this accentuation before an x, which finally results in flexion. These rules correspond to representations of the contexts of occurrence of a letter. This representation is mixed (left and right contexts can be combined, e.g. in SURROUNDTAG, where both the immediate left and right tags are examined), and can extend to a distance of three letters left and right, but only in restricted combinations.</Paragraph>
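<Paragraph position="3"> As an illustration, the encoding of a word as a 'sentence of letters' and the effect of rules (1) and (9) can be sketched in Perl; this is a toy re-implementation of these two rules only, not Brill's tagger or its rule interpreter:

  # Encode a (plain, accented) word pair as a Brill training 'sentence':
  # each letter is a 'word' whose 'tag' is its accented form.
  use strict;
  use warnings;
  use utf8;
  binmode STDOUT, ':encoding(UTF-8)';

  sub as_tagged_sentence {
      my ($plain, $accented) = @_;
      my @w = split //, $plain;
      my @t = split //, $accented;
      return join ' ', map { "$w[$_]/$t[$_]" } 0 .. $#w;
  }

  print as_tagged_sentence('endometre', 'endomètre'), "\n";
  # e/e n/n d/d o/o m/m e/è t/t r/r e/e

  # Apply rules (1) and (9) to 'flexion': rule (1) accents an e whose
  # second next tag is i; rule (9) undoes the accent before an x.
  my @tags = split //, 'flexion';    # initial tags: no e is accented
  for my $i (0 .. $#tags) {
      next unless $tags[$i] eq 'e';
      $tags[$i] = 'é' if defined $tags[$i + 2] && $tags[$i + 2] eq 'i';  # rule (1)
  }
  for my $i (0 .. $#tags) {
      next unless $tags[$i] eq 'é';
      my $next3 = join '', grep { defined } @tags[$i + 1 .. $i + 3];
      $tags[$i] = 'e' if $next3 =~ /x/;                                  # rule (9)
  }
  print join('', @tags), "\n";   # prints: flexion
</Paragraph>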
</Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Mixed context representation </SectionTitle> <Paragraph position="0"> The 'mixed context' representation used by Theron and Cloete (1997) folds the letters of a word around a pivot letter: it enumerates alternately the next letter on the right, then on the left, until it reaches the word boundaries, which are marked with special symbols (here, ^ for start of word and $ for end of word). Theron and Cloete additionally repeat an out-of-bounds symbol outside the word, whereas we dispense with these marks. For instance, the first e in excisee (excisée, excised) is represented as the mixed context x^cisee$ (right column of the first row of table 3; the left column shows the order in which the letters of the word are enumerated). The next two rows of the table give the mixed context representations of the two other es in the word, es$icxe^ and $esicxe^. This representation caters for contexts of different sizes and facilitates their comparison.</Paragraph> <Paragraph position="1"> Each of these contexts is unaccented (it is meant to be matched with representations of unaccented words) and the original form of the pivot letter is associated with the context as an output (we use the symbol '=' to mark this output). Each context is thus converted into a transducer: the input tape is the mixed context of a pivot letter, and the output tape is the appropriate letter in the confusion set {e, é, è, ê, ë}. The next step is to determine minimal discriminating contexts (figure 1). To obtain them, we join all these transducers (OR operator) by factoring their common prefixes as a trie structure, i.e., a deterministic transducer that exactly represents the training set. We then compute, for each state of this transducer and for each possible output (letter in the confusion set) reachable from this state, the number of paths starting from this state that lead to this output.</Paragraph> <Paragraph position="2"> Figure 1: trie of the training contexts, showing, at each state, the frequency of each possible output.</Paragraph> <Paragraph position="3"> We call a state unambiguous if all the paths from this state lead to the same output. In that case, for our needs, these paths may be replaced with a short-cut to an exit to the common output (see figure 1). This amounts to generalizing the set of contexts by replacing them with a set of minimal discriminating contexts.</Paragraph> <Paragraph position="4"> Given a word that needs to be accented, the first step consists in representing the context of each of its pivot letters. For instance, the single e of the word biologie yields the context $igoloib^. Each context is matched against the transducer in order to find the longest path from the start state that corresponds to a prefix of the context string (here, $igo). If this path leads to an output state, this output provides the proposed accented form of the pivot letter (here, e). If the match terminates earlier, we have an ambiguity: several possible outputs can be reached (e.g., hemorragie matches $ig).</Paragraph> <Paragraph position="5"> We can take absolute frequencies into account to obtain a measure of the support (confidence level) for a given output o from the current state s: how much evidence there is to support this decision. It is computed as the number of contexts of the training set that go through s to an output state labelled with o (see figure 1). The accenting procedure can choose to make a decision only when the support for that decision is above a given threshold. Table 4 shows some minimal discriminating contexts learnt from the accented part of the French MeSH with a high support threshold:
Table 4 (context = output; support; pattern; example):
$igo = e; 65; -ogie; cytologie
$ih = e; 63; -hie; lipoatrophie
$uqit = e; 77; -tique; amélanotique
u = e; 247; -eu-; activateur, calleux
x = e; 68; -ex-; excisée
However, in previous experiments (Zweigenbaum and Grabar, 2002), we tested a range of support thresholds and observed that the gain in precision obtained by raising the support threshold was minor and counterbalanced by a large loss in recall. We therefore do not use this device here and accept any level of support.</Paragraph>
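<Paragraph position="6"> As an illustration, the mixed-context extraction and the trie lookup with support counts can be sketched in Perl as follows. This is a simplified toy version with made-up training data: it decides at the divergence point instead of materializing short-cut exits, and it is not the Tree::Trie-based implementation described in section 3.5:

  use strict;
  use warnings;
  use utf8;
  binmode STDOUT, ':encoding(UTF-8)';

  # Mixed context of the pivot letter at position $i: alternately the next
  # letter on the right, then on the left; ^ and $ mark the word boundaries.
  sub mixed_context {
      my ($word, $i) = @_;
      my @l = split //, $word;
      my ($left, $right) = ($i - 1, $i + 1);
      my ($ldone, $rdone) = (0, 0);
      my @ctx;
      until ($ldone && $rdone) {
          if (!$rdone) {
              if ($right <= $#l) { push @ctx, $l[$right++] }
              else               { push @ctx, '$'; $rdone = 1 }
          }
          if (!$ldone) {
              if ($left >= 0) { push @ctx, $l[$left--] }
              else            { push @ctx, '^'; $ldone = 1 }
          }
      }
      return join '', @ctx;
  }

  # Trie of contexts; each node counts how many training contexts reach it
  # with each output letter (the support of that output at that state).
  my %trie;
  sub insert_context {
      my ($ctx, $output) = @_;
      my $node = \%trie;
      for my $ch (split //, $ctx) {
          $node = $node->{$ch} //= {};
          $node->{'='}{$output}++;
      }
  }

  # Follow the longest matching prefix of $ctx, then decide: choose an
  # output only if its relative frequency reaches $threshold.
  sub accent_pivot {
      my ($ctx, $threshold) = @_;
      my $node = \%trie;
      for my $ch (split //, $ctx) {
          last unless exists $node->{$ch};
          $node = $node->{$ch};
      }
      my $counts = $node->{'='} or return undef;   # no evidence at all
      my $total = 0;
      $total += $_ for values %$counts;
      my ($best) = sort { $counts->{$b} <=> $counts->{$a} } keys %$counts;
      return $counts->{$best} / $total >= $threshold ? $best : undef;
  }

  # Toy training set: final e of two -ogie words, accented é of excisée.
  insert_context(mixed_context('cytologie', 8), 'e');
  insert_context(mixed_context('biologie',  7), 'e');
  insert_context(mixed_context('excisee',   5), 'é');

  # The context of the final e of neurologie diverges from the trained
  # -ogie contexts after $igolo, but all paths from that state output e.
  my $out = accent_pivot(mixed_context('neurologie', 9), 0.9);
  print defined $out ? $out : 'no decision', "\n";   # prints: e
</Paragraph>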
<Paragraph position="7"> Instead, we take into account the relative frequencies of occurrence of the paths that lead to the different outputs, as marked in the trie. A probabilistic majority decision is made on that basis: if one of the competing outputs has a relative frequency above a given threshold, this output is chosen. In the present experiments, we tested two thresholds: 0.9 (90% or more of the examples must support this case; this makes the correct decision for hemorragie) and 1 (only non-ambiguous states lead to a decision: no decision for the first e in hemorragie, which we leave unaccented).</Paragraph> <Paragraph position="8"> Simpler context representations of the same family can also be used. We examined right contexts (a variable-length string of letters to the right of the pivot letter) and left contexts (the same, to the left).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Evaluating the rules </SectionTitle> <Paragraph position="0"> We trained both methods, Brill's and contexts (mixed, left and right), on three training sets: the 4054 words of the accented part of the MeSH, the 54,291 lemmas of the ABU lexicon and the 8874 words of the ICD-SNOMED word list. To check the validity of the rules, we applied them to the accented part of the MeSH. The context method knows when it can make a decision, so we can separate the words that are fully processed (all their es have led to decisions) from the words that are only partially processed and the words that are not processed at all. Let f be the number of fully processed words and c the number of correct accentuations among them: if we decide to only propose an accented form for the words that get fully accented, precision is c/f, and recall relates c to the total number of words submitted. The same measures can be computed for the partially processed and unprocessed words, as well as for the total set of words.</Paragraph> <Paragraph position="1"> We then applied the accentuation rules to the 5188 accentable 'unknown' words of the MeSH. No gold standard is available for these words, so human validation was necessary. We drew from that set a random sample of 260 words (5% of the total), which was reviewed by the CISMeF team. Because of sampling, precision measures must include a confidence interval.</Paragraph> <Paragraph position="2"> We also tested whether the results of several methods can be combined to increase precision. We simply applied a consensus rule (intersection): a word is accepted only if all the methods considered agree on its accentuation.</Paragraph> <Paragraph position="3"> The programs were developed in Perl 5. They include a trie manipulation package which we wrote by extending the Tree::Trie package, available on the Comprehensive Perl Archive Network (www.cpan.org).</Paragraph>
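<Paragraph position="4"> As an illustration of these two evaluation devices, the consensus rule and a confidence interval for a sampled precision can be sketched in Perl as follows; the counts and proposals are made up, and the normal-approximation binomial interval is one standard choice, not necessarily the computation behind the reported results:

  use strict;
  use warnings;

  # Hypothetical proposals of three methods for one word (illustration only).
  my %proposal = (
      brill => 'cytologie',
      mixed => 'cytologie',
      right => 'cytologie',
  );

  # Consensus rule (intersection): accept a word only if all methods agree.
  sub consensus {
      my @forms = @_;
      my %seen = map { $_ => 1 } @forms;
      return (keys %seen == 1) ? $forms[0] : undef;
  }

  # Precision over a validated sample of $n words, $c of them correct,
  # with a 95% confidence interval (normal approximation).
  sub precision_ci {
      my ($c, $n) = @_;
      my $p = $c / $n;
      my $half = 1.96 * sqrt($p * (1 - $p) / $n);
      return ($p, $p - $half, $p + $half);
  }

  my $agreed = consensus(values %proposal);
  print 'consensus: ', (defined $agreed ? $agreed : 'rejected'), "\n";

  my ($p, $lo, $hi) = precision_ci(240, 260);   # made-up counts
  printf "precision = %.3f (95%% CI [%.3f, %.3f])\n", $p, $lo, $hi;
</Paragraph> </Section> </Section> </Paper>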