<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1096">
  <Title>Disambiguation of morphological analysis in Bantu languages</Title>
  <Section position="3" start_page="0" end_page="568" type="metho">
    <SectionTitle>
1 Morphological analysis
</SectionTitle>
    <Paragraph position="0"> The morphological analysis of Swahili is carried out by SWATWOL, which is based on the two-level formalism (Koskenniemi 1983). The application of this formalism to Swahili has been under way since 1987, and, having been tested with a corpus of one million words, it has now reached a mature phase, with a recall of 99.8% in average running text and a precision of close to 100%. The performance of SWATWOL corresponds to what is reported for ENGTWOL, the morphological parser of English (Voutilainen et al 1992; Tapanainen and Järvinen 1994), and SWETWOL, the morphological analyzer of Swedish (Karlsson 1992).</Paragraph>
    <Paragraph position="1"> SWATWOL uses a two-level rule system for describing morphophonological variation, as well as a lexicon with 288 sub-lexicons. Unlike in languages with right-branching word formation, where word roots can be grouped together into a root lexicon, here word roots have been divided into several sub-lexicons.</Paragraph>
    <Paragraph position="2"> Because SWATWOL has been described in detail elsewhere (Hurskainen 1992), only a sketchy description of its parts is given here.</Paragraph>
    <Section position="1" start_page="0" end_page="568" type="sub_section">
      <SectionTitle>
1.1 SWATWOL rules
</SectionTitle>
      <Paragraph position="0"> Two-level rules have been written mainly for handling morphophonological processes, which occur principally at morpheme boundaries. Some of these processes also take place in verbal extensions, where the quality of the stem vowel(s) defines the surface form of the suffix. The total number of rules is 18, some of them being combined rules.</Paragraph>
      <Paragraph position="1"> An example of a combined rule:</Paragraph>
      <Paragraph position="3"> Change lexical 'U' to surface 'w' iff there is 'k' on the left and a surface character belonging to the set 'Vo' on the right; or there is 't' on the left and a lexical diacritic '/1' on the right followed by a lexical 'a'.</Paragraph>
    </Section>
    <Section position="2" start_page="568" end_page="568" type="sub_section">
      <SectionTitle>
1.2 SWATWOL lexicon
</SectionTitle>
      <Paragraph position="0"> The SWATWOL lexicon is a tree, where the morphemes of Swahili are located so that each route from the root lexicon leads to a well-formed word-form. The most complicated part of the lexicon is the description of verb-forms, which requires a total of 125 sub-lexicons. For describing verbs, there are a number of consecutive prefix and suffix 'slots', which may or may not be filled by morphemes.</Paragraph>
      <Paragraph position="1"> The verb root is in the middle, and verbal extensions used mainly for derivation are suffixed to the root.</Paragraph>
      <Paragraph position="2"> A noun is composed of a class prefix and root. Noun roots are located in 22 separate sub-lexicons, and access to them is permitted from the corresponding class prefix(es). Adjectives are grouped according to whether they take class prefixes or not. Numerals are also grouped according to the same principle. The lexicon has a total of about 27,000 'words'.</Paragraph>
      <Paragraph position="3"> Here is a simplified example of a sub-lexicon:</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="568" end_page="568" type="metho">
    <SectionTitle>
LEXICON M/MI
</SectionTitle>
    <Paragraph position="0"> mU M/MIr "mU 3/4-SG N"; mi M/MIr "mU 3/4-PL N"; This is a sub-lexicon with the name 'M/MI' containing prefixes of the noun classes 3 and 4. Each entry may have three parts, but only the middle part is compulsory. In the first entry, 'mU' is the lexical representation of a morpheme, and 'M/MIr' is the name of the sub-lexicon where the processing will continue. The third part within quotes is the output string.</Paragraph>
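    <Paragraph> The three-part entries and the continuation from one sub-lexicon to the next can be illustrated with a minimal Python sketch. It assumes a plain dictionary representation and works on lexical forms only, ignoring the two-level rules; the root entry under 'M/MIr' and the end marker '#' are hypothetical and are not taken from SWATWOL.

```python
# Each entry: (lexical form, continuation sub-lexicon, output string).
# "M/MI" follows the example in the text; the root entry under "M/MIr"
# and the end marker "#" are hypothetical, added only for illustration.
LEXICONS = {
    "M/MI": [
        ("mU", "M/MIr", "mU 3/4-SG N"),
        ("mi", "M/MIr", "mU 3/4-PL N"),
    ],
    "M/MIr": [
        ("ti", "#", "ti"),
    ],
}


def analyze(lexical_form, lexicon="M/MI"):
    """Return the output strings of every route that consumes the whole form."""
    if lexicon == "#":
        # End of a route: it is well-formed only if nothing is left over.
        return [[]] if lexical_form == "" else []
    routes = []
    for form, continuation, output in LEXICONS[lexicon]:
        if lexical_form.startswith(form):
            rest = lexical_form[len(form):]
            for tail in analyze(rest, continuation):
                routes.append([output] + tail)
    return routes


readings = [" + ".join(route) for route in analyze("mUti")]
```

In this sketch a form that cannot be consumed completely along some route simply receives no reading.</Paragraph>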
    <Paragraph position="1"> In constructing the lexicon, underspecification of analysis was avoided. Although it may be used for decreasing the number of ambiguous readings (cf. Karlsson 1992), it leaves ambiguity within readings themselves in the form of underspecification, and it has to be resolved later in any case.</Paragraph>
  </Section>
  <Section position="5" start_page="568" end_page="569" type="metho">
    <SectionTitle>
2 Extent of morphological ambiguity
</SectionTitle>
    <Paragraph position="0"> For the purposes of writing and testing disambiguation rules, a corpus of about 10,000 words of prose text was compiled (Corpus 1). The text was analyzed with SWATWOL, and the results in regard to ambiguity are given in Table 1. As can be seen in Table 1, about half of the word-form tokens in Swahili are at least two-ways ambiguous. About one fifth of the tokens are precisely two-ways ambiguous, and the shares of three-ways and four-ways ambiguous tokens are almost equal, about 10% each. The share of five-ways ambiguous tokens is 5.68%, but the number of still more ambiguous tokens decreases drastically. There are word-forms with more than 20 readings, the largest number in the corpus being 60 readings. If we compare these numbers with those in Table 2, we note significant differences and similarities. Table 2 was constructed in exactly the same manner as Table 1, only the source text being different. Whereas in Table 1 a corpus of running text (Corpus 1) was used, in Table 2 the source text was a list of unique word-forms (Corpus 2).</Paragraph>
    <Paragraph position="1"> The number of word-forms with more than one reading is almost equal in both corpora, slightly over 50%. The percentages in Table 2 decrease rather systematically the more readings a word-form has. While there were more four-ways ambiguous word-forms (10.97%) than three-ways ones (9.12%) in Table 1, in Table 2 the numbers are as expected. The only unexpected result is the share of six-ways ambiguous words (3.44%), which is higher than the share of the five-ways ambiguous ones (2.94%). In Corpus 2, the high percentage of four-ways ambiguous readings found in Corpus 1 does not exist.</Paragraph>
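    <Paragraph> A table of this kind could be derived from analyzer output roughly as in the following Python sketch. The function name and the toy input are invented for illustration; the numbers are not the paper's data.

```python
from collections import Counter


def ambiguity_table(readings_per_token):
    """Rows of (N(r), N(t), percent, cumulative percent), sorted by N(r)."""
    counts = Counter(readings_per_token)
    total = len(readings_per_token)
    rows, cumulative = [], 0
    for n_readings in sorted(counts):
        cumulative += counts[n_readings]
        rows.append((
            n_readings,
            counts[n_readings],
            round(100 * counts[n_readings] / total, 2),
            round(100 * cumulative / total, 2),
        ))
    return rows


# Invented toy input: the number of readings of each of ten tokens.
toy = [1, 1, 1, 1, 1, 2, 2, 3, 4, 5]
table = ambiguity_table(toy)
# Share of tokens with more than one reading, in percent.
ambiguous_share = 100 * sum(1 for n in toy if n > 1) / len(toy)
```

Running the same function over a list of unique word-forms instead of running text would give the type-based counterpart of the table.</Paragraph>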
    <Paragraph position="2"> The ambiguity rate in Swahili is somewhat lower than in Swedish (60%, Berg 1978). It seems to correspond to that of English (Voutilainen et al 1992:5), although DeRose (1988) gives somewhat lower figures, 11% for word-form types and 40% for word-form tokens. In Finnish the corresponding figures are still lower, 3.3% for word-form types and 11.2% for word-form tokens (Niemikorpi 1979). [Table 2: ... Swahili list of unique word-forms (Corpus 2). N(r) = number of readings, N(t) = number of word-form tokens, % = percent of the total, cum-% = ...]</Paragraph>
    <Paragraph position="3"> While the reported ambiguity counted from word-form tokens is generally much higher than that counted from word-form types, in Swahili the difference is small. This is due to the fact that in addition to ambiguity found in several of the most common words, verb-forms are typically ambiguous, as are almost half of the nouns.</Paragraph>
    <Paragraph position="4"> Karlsson (1994:23) suggests an inverse correlation between the number of unique word-forms and the rate of ambiguity. Heavily inflecting languages would therefore tend to produce unambiguous word-forms. Swahili does not seem to fully support this hypothesis, although the numbers in Tables 1 and 2 are not directly comparable with the results of other studies. In the Swahili lexicon, underspecification was avoided, which adds to ambiguity.</Paragraph>
  </Section>
  <Section position="6" start_page="569" end_page="570" type="metho">
    <SectionTitle>
3 Disambiguation with Constraint
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="569" end_page="569" type="sub_section">
      <SectionTitle>
Grammar Parser
</SectionTitle>
      <Paragraph position="0"> Morphological disambiguation as well as syntactic mapping is carried out with the Constraint Grammar Parser (CGP). Descriptions of its development phases are found in several publications (e.g. Karlsson 1990; Karlsson 1994a, 1994b; Karlsson et al 1994; Voutilainen et al 1992; Voutilainen and Tapanainen 1993; Tapanainen 1996). Its starting point is that, rather than writing rules that state the conditions necessary for accepting a reading in an ambiguous case, one writes rules that discard a certain reading as illegitimate. The rule system is typically a combination of deletion and selection rules.</Paragraph>
      <Paragraph position="1"> The morphological analyzer SWATWOL was designed to be ideal for further processing with CGP. The output of SWATWOL contains such information as part-of-speech features, features for adjectives, verbs, adverbs, nouns, numerals, and pronouns, as well as information on noun class marking (also zero marking) wherever it occurs, etc. In the present application, syntactic tags are also included in the morphological lexicon as far as the marking can be done unambiguously.</Paragraph>
      <Paragraph position="2"> The syntactic mapping of context-sensitive word-forms is left to the CGP.</Paragraph>
      <Paragraph position="3"> In order to simplify disambiguation, fixed phrases, idioms, multi-word prepositions and non-ambiguous collocations are joined together already in the preprocessing phase of the text (e.g. mbele ya &gt; mbele_ya 'in front of'), and the same constructions are written into the lexicon with corresponding analysis.</Paragraph>
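      <Paragraph> The joining of fixed expressions in preprocessing might look like the following Python sketch. It is only an assumed illustration: the expression list is hypothetical apart from 'mbele ya', and the actual preprocessor is not described in the text.

```python
# Hypothetical list of fixed expressions; only "mbele ya" comes from the text.
MULTIWORD = [("mbele", "ya")]


def join_multiword(tokens, expressions=MULTIWORD):
    """Join listed word sequences into single underscore-linked tokens."""
    by_length = sorted(expressions, key=len, reverse=True)  # longest first
    out, i = [], 0
    while len(tokens) > i:
        for expr in by_length:
            if tuple(tokens[i:i + len(expr)]) == expr:
                out.append("_".join(expr))
                i += len(expr)
                break
        else:
            # No listed expression starts here: keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out


# Invented example sentence used only to exercise the function.
joined = join_multiword("alisimama mbele ya nyumba".split())
```

Trying longer expressions first keeps a short expression from pre-empting a longer one that starts with the same words.</Paragraph>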
    </Section>
    <Section position="2" start_page="569" end_page="570" type="sub_section">
      <SectionTitle>
3.1 Constraint Grammar rule formalism
</SectionTitle>
      <Paragraph position="0"> The subsequent discussion of the Constraint  In DELIMITERS, those tags are listed which mark the boundary of context conditions. If the rule system tries to remove all readings of a cohort, the target listed in the section PREFERRED-TARGET is the one which survives. SETS is a section where groups of tags are defined. Syntactic parsing is carried out with rules located under the heading MAPPINGS. CONSTRAINTS contains constraint rules with the following schema: [WORDFORM] OPERATION (target) [(context condition(s))] WORDFORM can be any surface word-form for which a rule will be written. OPERATION may have two forms: REMOVE and SELECT.</Paragraph>
      <Paragraph position="1"> These are self-explanatory. TARGET defines the concrete morphological tag (or sequence of tags) to which the operation is applied. A target may also be a set, which is defined in the SETS section. If the target is left without parentheses, it is interpreted as a set. CONTEXT CONDITIONS is an optional part, but in most cases a necessary one. In it, the conditions for the application of the rule are defined in detail. Context conditions are defined in relation to the target reading, which has the default position 0. Positive integers refer to the number of words to the right, and negative ones to the left. In context conditions, reference can be made to any of the features or tags found in the unambiguous reading, e.g. (1C ADJ), or in the whole cohort, e.g. (1 ADJ). These references can be made either directly to a tag or indirectly through sets, which are defined in a special section (SETS) of the rule formalism.</Paragraph>
      <Paragraph position="2"> Any context may also be negated by placing the keyword NOT at the beginning of the context clause. It is also possible to refer to more than one context in the same position.</Paragraph>
      <Paragraph position="3"> If there is a need to define further conditions for a reading found by scanning (by using position markers *-1 or *1), the linking mechanism may be used. This can be done by adding the keyword LINK to the context, after which the new context follows. For example, the context condition (*-1 N LINK 1 PRON LINK 1 ADJ) reads: 'there is a noun (N) on the left followed by a pronoun (PRON) followed by an adjective (ADJ)'.</Paragraph>
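      <Paragraph> The interplay of operation, target and positional context conditions can be sketched in Python as follows. This is a simplified model under stated assumptions, not the CGP implementation: a cohort is a list of readings, a reading is a set of tags, the target is a single tag, and a rule that would empty a cohort is simply not applied. The toy rules and tags are invented.

```python
def condition_holds(sentence, index, offset, tag, careful=False):
    """Check a context condition such as (1 ADJ) or, with careful=True, (1C ADJ)."""
    pos = index + offset
    if pos not in range(len(sentence)):
        return False
    cohort = sentence[pos]
    if careful:
        # Careful mode: the cohort must be unambiguous and carry the tag.
        return len(cohort) == 1 and tag in cohort[0]
    return any(tag in reading for reading in cohort)


def apply_rule(sentence, index, operation, target, conditions):
    """Apply SELECT or REMOVE to the cohort at `index` if all conditions hold."""
    if not all(condition_holds(sentence, index, *c) for c in conditions):
        return
    cohort = sentence[index]
    if operation == "SELECT":
        kept = [r for r in cohort if target in r]
    else:  # REMOVE
        kept = [r for r in cohort if target not in r]
    if kept:  # simplification: never leave a cohort without readings
        sentence[index] = kept


# Invented toy input: two cohorts, the first one ambiguous.
sentence = [[{"N"}, {"V"}], [{"ADJ"}]]
apply_rule(sentence, 0, "SELECT", "N", [(1, "ADJ", True)])  # SELECT (N) (1C ADJ)

ambiguous_next = [[{"N"}, {"V"}], [{"ADJ"}, {"ADV"}]]
apply_rule(ambiguous_next, 0, "REMOVE", "V", [(1, "ADJ", True)])  # (1C ADJ) fails
still_two = len(ambiguous_next[0])
apply_rule(ambiguous_next, 0, "REMOVE", "V", [(1, "ADJ")])  # (1 ADJ) holds
```

The second example shows the difference between the two reference types: the careful condition fails while the next cohort is still ambiguous, whereas the plain condition only needs one matching reading.</Paragraph>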
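      <Paragraph> Scanning with linked conditions can be sketched in Python as below. The representation (cohorts as lists of tag sets) and the function names are assumptions made for illustration. In particular, the sketch keeps scanning past a candidate whose linked conditions fail, which may differ from the behaviour of the actual parser.

```python
def has_tag(cohort, tag):
    """True if any reading in the cohort carries the tag."""
    return any(tag in reading for reading in cohort)


def scan_linked(sentence, index, direction, chain):
    """Evaluate a scanning condition with LINKs.

    `chain` is [first_tag, (offset, tag), (offset, tag), ...]: scan from
    `index` in `direction` (-1 left, 1 right) for first_tag, then check each
    linked condition relative to the cohort matched just before it.
    """
    first_tag, links = chain[0], chain[1:]
    pos = index + direction
    while pos in range(len(sentence)):
        if has_tag(sentence[pos], first_tag) and _links_hold(sentence, pos, links):
            return True
        pos += direction
    return False


def _links_hold(sentence, pos, links):
    """Check each linked (offset, tag) pair, moving the anchor as it goes."""
    for offset, tag in links:
        pos += offset
        if pos not in range(len(sentence)) or not has_tag(sentence[pos], tag):
            return False
    return True


# Invented toy sentence: noun, pronoun, adjective, verb.
sentence = [[{"N"}], [{"PRON"}], [{"ADJ"}], [{"V"}]]
# (*-1 N LINK 1 PRON LINK 1 ADJ), evaluated from the verb at position 3.
found = scan_linked(sentence, 3, -1, ["N", (1, "PRON"), (1, "ADJ")])
# A negated context, here NOT (*1 N), is the negation of the same test.
no_noun_to_the_right = not scan_linked(sentence, 3, 1, ["N"])
```

Each LINK moves the point of reference to the cohort just matched, so the offsets in the chain are relative to one another rather than to the target.</Paragraph>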
    </Section>
    <Section position="3" start_page="570" end_page="570" type="sub_section">
      <SectionTitle>
3.2 Order of rules
</SectionTitle>
      <Paragraph position="0"> The algorithm allows a sequential rule order. This can be done by grouping the rules into separate sections. The sequential order of rules within a section does not guarantee that the rules are applied in the order in which they appear. The rules of the first section are applied first. Any number of consecutive sections can be used. There are presently four sections of constraint rules in the rule file. Certain types of rules should be applied first, without giving other, less clearly stated rules the possibility to interfere. Typical of such first-level rules are those where disambiguation is done within a phrase structure. In intermediate sections there are rules which use larger structures for disambiguation. By first disambiguating noun phrases and genitive constructions, the use of otherwise too permissive rules becomes possible, when clear cases are already disambiguated.</Paragraph>
      <Paragraph position="1"> The disambiguation of verb-forms belongs to these middle levels. The risk of wrong interpretations decreases substantially by first disambiguating noun phrases and other smaller units.</Paragraph>
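      <Paragraph> The effect of ordering rules by section can be illustrated with a small Python sketch. It assumes that each section is applied repeatedly until no rule changes anything before the next section starts; this fixed-point loop, the toy rule builder and the tags are assumptions for illustration and are not taken from the CGP.

```python
def run_sections(sentence, sections):
    """Apply rule sections in order; within a section, repeat until stable."""
    for rules in sections:
        changed = True
        while changed:
            # Build a list first so that every rule runs in each pass.
            changed = any([rule(sentence) for rule in rules])
    return sentence


def remove_tag(tag, needs_left=None):
    """Build a toy rule removing readings equal to `tag`, optionally only when
    the cohort to the left is already unambiguous with reading `needs_left`."""
    def rule(sentence):
        changed = False
        for i, cohort in enumerate(sentence):
            if needs_left is not None:
                if i == 0 or sentence[i - 1] != [needs_left]:
                    continue
            kept = [r for r in cohort if r != tag]
            if kept and len(kept) != len(cohort):
                sentence[i] = kept
                changed = True
        return changed
    return rule


safe = remove_tag("V")                         # clear case, applied early
permissive = remove_tag("ADV", needs_left="N")  # relies on earlier results

in_order = run_sections([["N", "V"], ["ADJ", "ADV"]], [[safe], [permissive]])
reversed_order = run_sections([["N", "V"], ["ADJ", "ADV"]], [[permissive], [safe]])
```

With the sections in the intended order both cohorts end up unambiguous; with the order reversed, the permissive rule finds its context still ambiguous and leaves the second cohort untouched.</Paragraph>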
      <Paragraph position="2"> The CGP of Swahili has presently a total of 656 rules in four different sections for disambiguation and 50 rules for syntactic mapping. So far about  The CG rules reduce the number of multiple readings so that optimally only one reading survives. Rule S:816 removes an object reading of the word-form. After that, a selection rule S:1099 is applied.</Paragraph>
      <Paragraph position="4"> Select noun reading of Ncl 1/2-PL if followed immediately by genitive connector belonging to the set NCL-2. This description corresponds to the grammatical rule. Other rules follow the same principle. E.g. the reading 1/2-PL GEN-CON is chosen for the analysis of wa on the basis of the Ncl of the preceding noun. The rule states:</Paragraph>
      <Paragraph position="6"> Select Ncl 1/2-PL of the word 'wa' if in the preceding cohort there is a feature belonging to the set NCL-2.</Paragraph>
      <Paragraph position="7"> Although both washiriki and wa are initially ambiguous, and in the rules the context reference does not extend beyond this pair of words, we get the correct result. This is because in both of the cohorts there is only one reading which refers to the same noun class.</Paragraph>
      <Paragraph position="8"> The word semina is both SG and PL, and the following pronoun zote, which has the PL reading, solves the problem. The word nchi is disambiguated with a rule relying on the Ncl of the following genitive connector (GEN-CON).</Paragraph>
      <Paragraph position="9"> The word katika has four readings. The grammatically correct way of disambiguating it is by referring to the following word.</Paragraph>
      <Paragraph position="10"> "&lt;katika&gt;" SELECT (PREPOS) (1 N OR INF OR PRON) Select the reading PREPOS of "katika" if there is a noun, an infinitive of a verb, or a pronoun in the following cohort.</Paragraph>
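      <Paragraph> A word-form-bound rule with alternative context tags, like the katika rule, can be sketched in Python as follows. The data layout and the function name are assumptions; the non-prepositional readings of katika are shown with placeholder tags X1 to X3 because the text does not list them.

```python
def select_if_next_has(cohorts, wordform, target, alternatives):
    """SELECT `target` in cohorts of `wordform` when the next cohort
    contains a reading with any tag from `alternatives`."""
    for i, (form, readings) in enumerate(cohorts[:-1]):
        if form != wordform:
            continue
        next_readings = cohorts[i + 1][1]
        if any(alternatives.intersection(r) for r in next_readings):
            kept = [r for r in readings if target in r]
            if kept:
                cohorts[i] = (form, kept)
    return cohorts


# Each cohort is (word-form, list of readings); X1..X3 are placeholder tags.
cohorts = [
    ("katika", [{"PREPOS"}, {"X1"}, {"X2"}, {"X3"}]),
    ("nchi", [{"N", "SG"}, {"N", "PL"}]),
]
select_if_next_has(cohorts, "katika", "PREPOS", {"N", "INF", "PRON"})
```

The following cohort does not need to be disambiguated first: it is enough that one of its readings carries one of the listed tags.</Paragraph>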
      <Paragraph position="12"/>
    </Section>
  </Section>
</Paper>