File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-0903_metho.xml

Size: 4,911 bytes

Last Modified: 2025-10-06 14:07:29

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0903">
  <Title>Comparing corpora and lexical ambiguity</Title>
  <Section position="2" start_page="15" end_page="15" type="metho">
    <SectionTitle>
2 Morphological analysers, lexicons and guessers
</SectionTitle>
    <Paragraph position="0"> Lexical ambiguities have two origins: the lexicon, and the guessing stages for unknown tokens. However, all the ambiguities considered in this study are strictly lexical, and so translation phenomena (Tesnière 1959, and Paroubek 1997) are not considered here.</Paragraph>
    <Section position="1" start_page="15" end_page="15" type="sub_section">
      <SectionTitle>
2.1 Medical lexicon
</SectionTitle>
      <Paragraph position="0"> The medical lexicon is tailored to biomedical texts: with about 20,000 lexemes, it exhaustively covers ICD-10. Biomedical language is not only a "big" sublanguage; its morphology is also more complex. This high level of composition (at least compared to general French or English) concerns about 10% of the tokens in clinical patient records, which is why the lexicon also contains about 2,000 affixes. For example, the token iléojéjunostomie is absent from the lexicon; however, this type of token may be recognized via its components (see Levis et al., 1997, for the so-called morphosemantemes): iléo, jéjuno, and stomie.</Paragraph>
    </Section>
    <Section position="2" start_page="15" end_page="15" type="sub_section">
      <SectionTitle>
2.2 Morphological analysis and medical morphology
</SectionTitle>
      <Paragraph position="0"> The morphological analysis associates every surface form with a list of morpho-syntactic (MS) features. When the surface form is not found in the lexicon, it goes through a two-step guessing process: the first level (oracle1) is a more complex morphological analyzer, based on the morphosemantemes, while the second-level guesser (oracle2) attempts to provide a set of MS features by looking at the longest ending (as described in Chanod and Tapanainen, 1995).</Paragraph>
      <Paragraph position="1"> The importance of these two levels is not clear for POS tagging, but becomes manifest when dealing with sense tagging. Let us consider three examples of tokens absent from the lexicon: allomorphiques, allomorphiquement (equivalent to allomorphic and allomorphically in English) and allocution.</Paragraph>
      <Paragraph position="2"> In the first case, the prefix allo and the suffix morphiques are listed in the morphosemantemes database (MDB). In the second case, morphiquement is not listed in the MDB, but ment can be found in it. In these two cases, therefore, oracle1 is able to provide both the MS and the WS information associated with the token. The last example cannot be split into morphemes, as cution is absent from the MDB. Thus, oracle1 is unable to recognize it, and oracle2 is finally applied, providing some MS features based exclusively on the ending. The major role given to oracle1 and to the semantic features it provides is obvious for IR purposes.</Paragraph>
      <Paragraph position="3"> remained ambiguous after disambiguation, the residual ambiguity is therefore about 5.5%. In this sample, and before disambiguation, the number of ambiguous tokens was 150, which means an ambiguity rate of 20%. Thus, even using the same lexicon, the ambiguity rate seems higher for general corpora than for domain-specific ones.</Paragraph>
      <Paragraph position="4"> The final stage transforms some of the lexical features returned by the morphological analysis into a tag-like representation to be processed later by the tagger.</Paragraph>
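      <Paragraph position="5"> The lookup order described in this subsection (lexicon, then oracle1, then oracle2) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the dictionary contents, the feature labels, the entry morphique, and the function names segment and analyse are all hypothetical, and the real analyser may split tokens differently.

```python
# Toy data: every entry and feature label here is an assumption made
# for illustration, not the content of the actual lexicon or MDB.
LEXICON = {"stomie": {"ms": ["noun"]}}
MDB = {  # morphosemanteme database: morpheme -> features
    "allo": {"ws": "other"},
    "morphiques": {"ms": ["adjective"], "ws": "form"},
    "morphique": {"ms": ["adjective"], "ws": "form"},  # assumed entry
    "ment": {"ms": ["adverb"]},
}
ENDINGS = {"tion": ["noun"], "ment": ["adverb"]}  # ending -> MS features


def segment(token, morphemes):
    """Return a list of known morphemes that concatenate to token, or None."""
    if token == "":
        return []
    # Try longer prefixes first, backtracking if the remainder cannot be split.
    for end in range(len(token), 0, -1):
        head = token[:end]
        if head in morphemes:
            rest = segment(token[end:], morphemes)
            if rest is not None:
                return [head] + rest
    return None


def oracle1(token):
    """First-level guesser: succeeds only if the whole token splits into MDB morphemes."""
    parts = segment(token, MDB)
    if parts is None:
        return None
    return {
        "source": "oracle1",
        "parts": parts,
        "ms": MDB[parts[-1]].get("ms", []),  # MS features taken from the last morpheme
        "ws": [MDB[p]["ws"] for p in parts if "ws" in MDB[p]],
    }


def oracle2(token):
    """Second-level guesser: MS features from the longest known ending only."""
    for start in range(len(token)):  # start = 0 tries the longest ending first
        ending = token[start:]
        if ending in ENDINGS:
            return {"source": "oracle2", "ms": ENDINGS[ending]}
    return {"source": "oracle2", "ms": []}


def analyse(token):
    """Lexicon lookup first, then oracle1, then oracle2 as a last resort."""
    if token in LEXICON:
        return dict(LEXICON[token], source="lexicon")
    return oracle1(token) or oracle2(token)


for word in ("allomorphiques", "allomorphiquement", "allocution"):
    print(word, analyse(word))
```

With this toy data, the first two tokens are resolved by oracle1 and carry WS labels, while allocution falls through to oracle2 and receives MS features only, which mirrors the contrast drawn in the three examples.</Paragraph>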
    </Section>
    <Section position="3" start_page="15" end_page="15" type="sub_section">
      <SectionTitle>
2.3 FIPSTAG tagger and lexicon
</SectionTitle>
      <Paragraph position="0"> The FIPSTAG lexicon is a general French lexicon; it therefore contains most well-formed French words. The overall structure of the lexicon is more or less stable, but the content is regularly updated in order to improve the coverage. Currently, the coverage is about 200,000 words with around 30,000 lexical items.</Paragraph>
      <Paragraph position="1"> The lexicon is designed for deep parsing, so that, together with classical morpho-syntactic features, we can also find subcategorization of verbs, semantic features, and some very specific grammatical classes.</Paragraph>
      <Paragraph position="2"> As the system is claimed to be general, it is supposed to handle any unknown word efficiently: the lexical modules supply, in an equiprobable way, all the possible lexical categories (i.e. nouns, verbs, adjectives, and adverbs), as other categories are considered to be exhaustively listed in the lexicon.</Paragraph>
      <Paragraph position="3"> Consequently, the guesser does not rely on any morphological information, and only syntactic principles are applied to choose the relevant features.</Paragraph>
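      <Paragraph position="4"> The equiprobable treatment of unknown words described above can be illustrated with a short sketch. It is not FIPSTAG code: the function name guess_categories, the toy lexicon, and the representation of categories as a weight dictionary are assumptions made for this example, and the later syntactic selection step is not modelled.

```python
from fractions import Fraction

# Open classes proposed for unknown words; closed classes are assumed
# to be exhaustively listed in the lexicon, as the text states.
OPEN_CLASSES = ("noun", "verb", "adjective", "adverb")

# Hypothetical toy lexicon: token -> {category: weight}.
TOY_LEXICON = {"le": {"determiner": Fraction(1)}}


def guess_categories(token, lexicon):
    """Return the lexicon entry if present, else a uniform spread over open classes."""
    if token in lexicon:
        return lexicon[token]
    share = Fraction(1, len(OPEN_CLASSES))
    # No morphological cue is consulted: every open class gets the same weight.
    return {category: share for category in OPEN_CLASSES}


print(guess_categories("le", TOY_LEXICON))
print(guess_categories("allocution", TOY_LEXICON))
```

Because the guess ignores the form of the token, an unknown word such as allocution starts with four candidate categories, and any reduction of that set is left to the syntactic stage.</Paragraph>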
    </Section>
  </Section>
</Paper>