<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0722">
  <Title>Minimal Commitment and Full Lexical Disambiguation: Balancing Rules and Hidden Markov Models</Title>
  <Section position="4" start_page="111" end_page="112" type="metho">
    <SectionTitle>
3 Methods
</SectionTitle>
    <Paragraph position="0"> To assess the system, we selected a corpus (40000 tokens) drawn equally from three types of documents: reports of surgery, discharge summaries and follow-up notes. This ad hoc corpus is split into 5 sets of comparable size. The first one (set A, 8520 tokens) serves for writing the basic rules of the tagger, while the other sets (set B, 8480 tokens; C, 7447 tokens; D, 7311 tokens; and E, 8242 tokens) are used for assessment and for incremental improvement of the system.</Paragraph>
    <Section position="1" start_page="111" end_page="112" type="sub_section">
      <SectionTitle>
3.1 Lexicon, morphological analysis and guesser
</SectionTitle>
      <Paragraph position="0"> The lexicon, with around 20000 entries, covers the whole of ICD-10 exhaustively. The morphological analyser is morpheme-based (Baud et al., 1998): it maps each inflected surface form of a word to its canonical lexical form, followed by the relevant morphological features. Words absent from the lexicon undergo a two-step guessing process. First, the unknown token is analysed with respect to its component morphemes; if this first stage fails, a last attempt is made to guess the hypothetical MS tags of the token.</Paragraph>
      <Paragraph position="1"> The first stage is based on the assumption that unknown words in medical documents are very likely to belong to the medical jargon; the second supposes that neologisms follow regular inflectional patterns. [For a MULTEXT-like description of the FIPSTAG tagset see Ruch P, 1997: Table de correspondance GRACE/FIPSTAG, available at http://latl.unige.ch/doc/etiquettes.ps] While the two stages are functionally equivalent with regard to morphosyntax, as each provides a set of morpho-syntactic information, they behave radically differently with regard to WS information. For guessing WS categories only the first-stage guesser is relevant, as inflectional patterns are not sufficient for guessing the semantics of a given token. Thus, the ending able very probably characterises an adjective, but does not provide any semantic information on it.</Paragraph>
      <Paragraph position="2"> Let us consider two examples of words absent from the lexicon. First, allomorph: the prefix part, allo, and the suffix part, morph, are listed in the lexicon with all the MS and WS features, so the word is recognized by the first-stage guesser. Second, allocution: it cannot be split into any affix, as cution is not a morpheme, but the ending tion maps to some features (noun, singular) in the second-stage guesser. As the underlying objective of the project is to retrieve documents, the main and most complete information is provided by the first-stage guesser; the second stage is only of interest for MS tagging, as in (Chanod and Tapanainen, 1995).</Paragraph>
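The two-stage process just illustrated with allomorph and allocution can be sketched as follows. This is a minimal sketch under stated assumptions: the tiny morpheme and ending tables (`PREFIXES`, `SUFFIXES`, `ENDINGS`) are illustrative placeholders, not the paper's actual resources, which are far larger.

```python
# Illustrative two-stage guesser for out-of-lexicon words.
# The morpheme and ending tables below are toy assumptions for the example.
PREFIXES = {"allo": "other/different"}                      # prefix -> WS gloss
SUFFIXES = {"morph": {"ms": ("noun", "sg"), "ws": "form"}}  # suffix -> MS + WS
ENDINGS = {"tion": ("noun", "sg"), "able": ("adj", None)}   # ending -> MS only

def guess(token):
    # Stage 1: morpheme-based analysis yields both MS and WS information.
    for i in range(1, len(token)):
        pre, suf = token[:i], token[i:]
        if pre in PREFIXES and suf in SUFFIXES:
            return {"stage": 1, "ms": SUFFIXES[suf]["ms"],
                    "ws": (PREFIXES[pre], SUFFIXES[suf]["ws"])}
    # Stage 2: inflectional endings yield MS tags only, no semantics.
    for end, ms in ENDINGS.items():
        if token.endswith(end):
            return {"stage": 2, "ms": ms, "ws": None}
    return None

print(guess("allomorph"))   # stage 1: full MS + WS features
print(guess("allocution"))  # stage 2: MS only, since "cution" is no morpheme
```

As in the paper, only the first stage produces WS (semantic) information; the second stage falls back to morphosyntax alone.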
      <Paragraph position="3"> Finally (tab. 1), some of the morpho-syntactic features provided by the lemmatizer are expressed in the MS tagset, to be processed by the tagger (tab. 2).</Paragraph>
      <Paragraph position="5"> Table rows (counts, with percentages in parentheses, across the evaluation columns):
Tokens still ambiguous, with GC: 161 (1.9) | 183 (2.5) | 136 (1.9) | 101 (1.2)
Tokens ambiguous, without GC: 9 (0.1) | 2 (0) | 9 (0.1)
Tokens incorrectly tagged: 76 (0.9) | 78 (1.0) | 36 (0.5) | 51 (0.6)</Paragraph>
    </Section>
    <Section position="2" start_page="112" end_page="112" type="sub_section">
      <SectionTitle>
3.2 Studying the ambiguities
</SectionTitle>
      <Paragraph position="0"> Our first investigations aimed at assessing the overall ambiguity of medical texts. We found that 1227 tokens (14.4% of the whole sample [6]) were ambiguous in set A, and 511 tokens (6.0%) were unknown. We initially decided to ignore unknown words, so they were not taken into account in the first assessment (cf.</Paragraph>
      <Paragraph position="1"> Performances). However, some frequent words were missing, so that, together with the MS guesser, adding some lexemes would improve the guessing score. Thus, adding 232 entries to the lexicon and linking it with the Swiss compendium (for drugs and chemicals) brings the unknown-word rate below 3%. This result also includes the pre-processing of patient and physician names (Ruch et al., 2000). Concerning the ambiguities, we found that 5 tokens were responsible for half of the ambiguities, whereas in unrestricted corpora this number seems to be around 16 (Chanod and Tapanainen, 1995).</Paragraph>
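The two statistics above (overall ambiguity rate, and how few token types account for half of the ambiguous occurrences) can be computed with a short routine. This is a sketch under an assumed input representation: each token is paired with the number of candidate tags the lexicon assigns it; the function name and format are hypothetical, not from the paper.

```python
from collections import Counter

def ambiguity_profile(tagged_tokens):
    """tagged_tokens: list of (token, candidate_tag_count) pairs.
    Returns (ambiguity_rate, n_types): the fraction of ambiguous tokens,
    and how many distinct token types cover half of all ambiguous
    occurrences (5 in set A of the paper, vs. ~16 in unrestricted text)."""
    ambiguous = [tok.lower() for tok, n in tagged_tokens if n > 1]
    rate = len(ambiguous) / len(tagged_tokens)
    covered, n_types = 0, 0
    for _, count in Counter(ambiguous).most_common():
        covered += count
        n_types += 1
        if covered >= len(ambiguous) / 2:
            break
    return rate, n_types

# Toy corpus: 8 of 20 tokens ambiguous, and "la" alone covers half of them.
toy = [("la", 2)] * 5 + [("est", 2)] * 3 + [("patient", 1)] * 12
print(ambiguity_profile(toy))
```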
      <Paragraph position="2"> We split set A into 8 subsets of about 1000 tokens in order to write the rules. We wrote around 50 rules (which generated more than 150 operative rules) for the first subset, while for the 8th only 12 rules were necessary to reach a score close to 100% on set A. These rules use intermediate symbols (such as the Kleene star) to ease and improve the rule-writing process; these symbols are replaced when the operative rules are generated.</Paragraph>
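The expansion of rule templates into operative rules (50 templates yielding more than 150 operative rules) can be sketched as follows. This is an illustrative simplification: the wildcard symbol, the feature bundles, and the uniform substitution are assumptions for the example, not the paper's actual symbol inventory.

```python
def expand(template, feature_bundles):
    """Expand a rule template whose '**' wildcard ranges over concrete
    feature bundles into operative rules, one per bundle. Substituting the
    same bundle at every slot mirrors the templates' 'keeps its original
    features' semantics (illustrative; the real symbols include the
    Kleene star and richer feature sets)."""
    return [template.replace("**", bundle) for bundle in feature_bundles]

# One template, three feature bundles -> three operative rules,
# matching the ~3x blow-up reported in the text (50 rules -> 150+).
ops = expand("prop[**];v[**]/nc[**] -> prop[**];v[**]", ["3s", "3p", "12s"])
for rule in ops:
    print(rule)
```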
      <Paragraph position="3"> [6] For comparison, the average ambiguity rate is about 25-30% in unrestricted corpora.</Paragraph>
      <Paragraph position="4"> Here is an example of a rule: prop[**];v[**]/nc[**] --> prop[**];v[**]. This rule says: 'if a token is ambiguous between (/) a verb (v), whatever (**) features it has (3rd or 1st/2nd person, singular or plural), and a common noun (nc), whatever (**) features it has, and such a token is preceded by a personal pronoun (prop), whatever (**) features this pronoun has (3rd or 1st/2nd person), then the ambiguous token can be rewritten as a verb, keeping its original features (**)'.</Paragraph>
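The behaviour of this example rule can be sketched as a pass over ambiguity classes. The data representation (tokens paired with sets of (category, features) readings) is an assumption for illustration; only the rule's logic comes from the text.

```python
def apply_rule(tagged):
    """Sketch of the example rule: a token ambiguous between verb (v) and
    common noun (nc), preceded by a personal pronoun (prop), keeps only
    its verb reading, with its original features."""
    out = []
    prev_cats = None
    for token, tags in tagged:              # tags: set of (category, features)
        cats = {cat for cat, _ in tags}
        if cats == {"v", "nc"} and prev_cats == {"prop"}:
            tags = {(c, f) for c, f in tags if c == "v"}  # minimal commitment
        out.append((token, tags))
        prev_cats = {cat for cat, _ in tags}
    return out

# French "il diffuse": pronoun context resolves diffuse to its verb reading.
sentence = [("il", {("prop", "3s")}),
            ("diffuse", {("v", "3s"), ("nc", "fs")})]
print(apply_rule(sentence))
```

Note that the rule only fires in the pronoun context; elsewhere the ambiguity is deliberately left standing, which is the minimal-commitment policy the paper advocates.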
    </Section>
  </Section>
  <Section position="5" start_page="112" end_page="113" type="metho">
    <SectionTitle>
4 Performances
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="112" end_page="113" type="sub_section">
      <SectionTitle>
4.1 Maximizing the minimal
commitment
</SectionTitle>
      <Paragraph position="0"> Four successive evaluations were conducted (tab. 3); after each session, the necessary rules were added in order to bring the tagging score close to 100%. In parallel, words were entered into the lexicon, and productive endings were added to the MS guesser. The second, third, and fourth evaluations were performed with the MS guesser activated. Let us note that translation phenomena (Paroubek et al., 1998), which turn the lexical category of a word into another one, seem rare in medical texts (only 3 cases were not foreseen in the lexicon).</Paragraph>
      <Paragraph position="1"> A success rate of 98% (tab. 3, evaluation 4) is not a bad result for a tagger, but the main result concerns the error rate: with less than 1% of errors, the system appears particularly minimally committed. [7: Let us note that in assessment 1, the system had ...] Another interesting result concerns the residual ambiguity (tokens still ambiguous, with GC): in set E, at least half of these ambiguities could be handled by writing more rules. However, some of these ambiguities are clearly intractable with such contextual rules and would demand more lexical information, as in le patient présente une douleur abdominale brutale et diffuse (the patient shows an acute and diffuse abdominal pain / *the patient shows an acute abdominal pain and distributes), where diffuse could be an adjective or a verb.</Paragraph>
    </Section>
    <Section position="2" start_page="113" end_page="113" type="sub_section">
      <SectionTitle>
4.2 Maximizing the success rate
</SectionTitle>
      <Paragraph position="0"> A last experiment was made: on set E, which had been disambiguated by the rule-based tagger, we decided to apply two more disambiguation steps in order to handle the residual ambiguity. First, we applied the most frequent tag (MFT) model as a baseline, then the HMM. Both the MFT counts and the HMM transitions are computed on sets B+C+D, tagged manually, but without any manual improvement (bias) of the model.</Paragraph>
      <Paragraph position="1"> Table 4 shows that for the residual ambiguity, i.e. the ambiguity which remained intractable for the rule-based tagger, the HMM provides an interesting disambiguation accuracy.</Paragraph>
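The MFT baseline used above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the training format (word, tag pairs from hand-tagged sets) and the tie-breaking on unseen words are illustrative choices; the paper's HMM second pass, which outperforms this baseline, would replace the per-word lookup with transition-based decoding (e.g. Viterbi).

```python
from collections import Counter, defaultdict

def train_mft(tagged_corpus):
    """Most-frequent-tag baseline: for each word, remember the tag it
    received most often in the hand-tagged training sets."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def mft_disambiguate(model, token, candidates):
    """Resolve residual ambiguity left by the rule-based tagger: pick the
    word's most frequent tag if it is among the remaining candidates;
    otherwise fall back deterministically (an assumption for the sketch)."""
    best = model.get(token.lower())
    return best if best in candidates else sorted(candidates)[0]

model = train_mft([("pain", "nc"), ("pain", "nc"), ("pain", "v")])
print(mft_disambiguate(model, "pain", {"nc", "v"}))
```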
    </Section>
  </Section>
</Paper>