<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1037"> <Title>Optimizing disambiguation in Swahili</Title> <Section position="4" start_page="7" end_page="9" type="metho"> <SectionTitle> 2 Maximal morphological and semantic description as precondition </SectionTitle> <Paragraph position="0"> The basic strategy in processing is to make the morphological description as full and detailed as possible. Each string in the text is interpreted, and all possible interpretations of each string are made explicit. Maximal recall and precision are achieved by updating the dictionary from time to time in step with the changing target language. The result of the analysis is a text in which every string has at least one interpretation and no legitimate interpretation is excluded. By target language I mean the kind of text for which the application is intended. It is hardly possible to maintain a dictionary that is optimal for handling all types of domain-specific texts. Although the large size of the dictionary would not be a problem, it would be difficult to handle, for example, words that in one type of text are individual lexemes but in another domain are part of multi-word concepts that should be treated as one unit. In addition to new words, misspellings also cause problems. Some commonly occurring misspellings and non-standard spellings can be encoded into the dictionary so that the word receives a precise interpretation.</Paragraph> <Paragraph position="2"> Without disambiguation, the following interpretations are possible: (a) A fat person, who lives in lakes, has eaten tomatoes. (b) A fat person, who lives in lakes, has eaten grandmothers. (c) A fat person, who lives in breasts, has eaten tomatoes. (d) A fat person, who lives in breasts, has eaten grandmothers. (e) A fat person, who lives in milk, has eaten tomatoes. (f) A fat person, who lives in milk, has eaten grandmothers. (g) A hippo, which lives in lakes, has eaten tomatoes. 
(h) A hippo, which lives in lakes, has eaten grandmothers. (i) A hippo, which lives in breasts, has eaten tomatoes. (j) A hippo, which lives in breasts, has eaten grandmothers. (k) A hippo, which lives in milk, has eaten tomatoes. (l) A hippo, which lives in milk, has eaten grandmothers.</Paragraph> <Paragraph position="3"> The situation would be even worse if &quot;aishiye&quot; with relative marker (GEN-REL 1/2-SG) were missing. This relative verb requires that the preceding referent be animate and thus excludes inanimate alternatives. The subject prefix in the main verb &quot;amekula&quot; also refers to an animate subject. But because the main verb can also stand without an overt subject, this clue is not reliable.</Paragraph> <Paragraph position="4"> When we look for the possible subject in the sentence, we seem to have three candidates. &quot;Kiboko&quot; certainly is one of them, because it is a noun and some of its readings agree with the subject prefix of the main verb. (In this case agreement means something other than morphological agreement. The noun belongs to [...]) With regard to its position, &quot;ziwani&quot; would also qualify, but it is ruled out because it has a locative suffix. Finally, no overt subject would be necessary, in which case the phrase preceding the main verb would be an object dislocated to the left and the sentence would mean, &quot;The grandmother has eaten the hippo/fat person who lives in the lakes/breasts/milk&quot;.</Paragraph> </Section> <Section position="5" start_page="9" end_page="14" type="metho"> <SectionTitle> 3 Disambiguation with linguistic rules </SectionTitle> <Paragraph position="0"> From the analysed sentence we can see that part of the ambiguity is easy to resolve with rules. For example, &quot;kiboko&quot; cannot be an adverbial form (ADV:ki) of &quot;boko&quot; (= in the manner of a gourd), because it is the referent of the following relative verb &quot;aishiye&quot;, which in turn requires the referent to be animate. 
Therefore, the interpretation &quot;whip&quot; and the rarer meanings, &quot;beautiful thing&quot; and &quot;ornamental stitch&quot;, are also ruled out. So we are left with two animate meanings, &quot;fat person&quot; and &quot;hippo&quot;, for which there are no reliable tags available for writing disambiguation rules.</Paragraph> <Paragraph position="1"> One of the three interpretations of &quot;kwenye&quot; (15-SG) can be removed, because no infinitive precedes it. For the word &quot;maziwa&quot;, with three interpretations, there are no grammatical criteria for disambiguation.</Paragraph> <Paragraph position="2"> The interpretations with object marker (OBJ) of &quot;amekula&quot; (has eaten you) can be removed on the basis of the following noun (without locative), which is reliably the real object.</Paragraph> <Paragraph position="3"> For &quot;nyanya&quot; there are no reliable criteria for disambiguation. Because it is in object position and without qualifiers, no clues for disambiguation can be found among agreement markers.</Paragraph> <Paragraph position="4"> Now follows the hard part of disambiguation, for which no reliable linguistic rules can be written. The easiest case is &quot;kwenye&quot;, because the two remaining interpretations represent different phases of the grammaticalization process, and the semantic difference between them is marginal.</Paragraph> <Paragraph position="5"> The preposition &quot;kwenye&quot; is in fact formally a locative (17-SG) form of the relative word &quot;enye&quot; (which has).</Paragraph> <Paragraph position="6"> For &quot;Kiboko&quot; we can make use of the common knowledge that fat persons do not normally live in lakes, or in breasts, or in milk. Therefore, a rule based on the co-occurrence of &quot;kiboko&quot; and &quot;maziwa&quot; with appropriate meanings can be written.</Paragraph> <Paragraph position="7"> The word &quot;maziwa&quot; is even more difficult to disambiguate. 
The word &quot;kiboko&quot; in the sense of hippo can easily co-occur with all three meanings of &quot;maziwa&quot;. Here we have to rely on [...] (A set of words referring to places where a hippo resides can be defined and used in the rule.)</Paragraph> <Paragraph position="8"> It is also possible to write a context-sensitive rule that makes use of the fact that hippos can live in lakes but not in breasts or milk, but such a rule easily becomes too specific.</Paragraph> <Paragraph position="9"> The word &quot;nyanya&quot; in object position is almost impossible to disambiguate elegantly. What is eaten can be one or more tomatoes, as well as one or more grandmothers. It is not rare at all that hippos devour people, although there is no proof that they would be particularly fond of grandmothers. Nobody has heard of fat men eating grandmothers, but those do not come into question in any case, because they do not live in lakes.</Paragraph> <Paragraph position="10"> If we assume that hippos hardly ever eat grandmothers, we can remove the reading that has the tag &quot;grandmother&quot;. We are still left with singular and plural alternatives of tomato. Plural would be more natural, because tomatoes are here treated as a mass rather than as individual fruits.</Paragraph> <Paragraph position="11"> When context-sensitive semantic rules and heuristic rules are applied, the reading is as shown in (3).</Paragraph> <Paragraph position="12"> Although the possibilities for generalisation in semantics are limited, relevant semantic clusters can be found in noun class languages. Even though classes in Swahili are semantically 'pure' only in exceptional cases, class membership often provides sufficient information for disambiguation, either by direct selection or, more often, by exclusion of a reading.</Paragraph> <Paragraph position="13"> The grades of animacy (e.g. 
human, animal, vegetation) are an example of useful semantic groupings, which can be used in generalising disambiguation. Another useful feature, actually belonging to syntax, is the division of verbs into categories according to their argument structure (e.g. SV, SVO, SVOO). Neural networks have been used successfully for identifying clusters of co-occurrence of words and their accompanying tags (Veronis and Ide 1990; Sussna 1993; Resnik 1998a). Results of research carried out with the Self-Organizing Map (Kohonen 1995) on semantic clustering of verbs and their arguments in Swahili are very promising, and useful generalizations have been found (Ng'ang'a 2003).</Paragraph> <Paragraph position="14"> These findings can be encoded into the morphological parser and used in writing semantic disambiguation rules.</Paragraph> <Paragraph position="15"> 6 When means for rule writing fail It sometimes happens that linguistic disambiguation rules cannot be written.</Paragraph> <Paragraph position="16"> Particularly problematic are nouns of class 9/10 in object position without qualifiers, many of which would help in disambiguation. In this noun class the nouns themselves carry no features for determining whether the word is in singular or plural. A detailed survey of about 11,000 occurrences of class 9/10 nouns in object position shows, however, that 97% of them are unambiguously in singular. The remaining 3% divide into 2% that can be either in singular or plural and only 1% where the noun is clearly in plural. These 2% are typically count nouns, which can sometimes be disambiguated if, for example, they are members in a list of nouns. 
Nouns in such a list tend to be either all in singular or all in plural, and often at least one list member belongs to one of the other noun classes, where singular and plural are distinguished.</Paragraph> <Paragraph position="17"> The solution for the nouns of class 9/10 in object position is, therefore, that disambiguation rules are written for the rare plural cases, while singular is the default interpretation.</Paragraph> <Paragraph position="18"> The likelihood of co-occurrence can be established between word pairs, or clusters, and also between words and tags attached to them. Therefore, the full range of information in an analysed corpus can be utilized in establishing relationships.</Paragraph> <Paragraph position="19"> Singular and plural are identical in this class, and it is the biggest class of the language, comprising about 39% of all nouns.</Paragraph> <Paragraph position="20"> 7 Treatment of multi-word concepts and idioms In computational description of a language, multi-word concepts and idioms can be treated as one unit, because in both cases the meaning is based on more than one string in text. If a multi-word concept consists of a collocation or noun phrase, it can be encoded in the tokenizer (4) and the morphological lexicon (5). Such constructions have two forms (SG and PL) at the most.</Paragraph> <Paragraph position="21"> If the concept has a non-finite verb as part of the construction, as is often the case in idioms, the constructions cannot be handled on the surface level. It is possible to handle them with disambiguation rules. Example (6), which is an idiom, shows how each of its constituent parts is interpreted in isolation.</Paragraph> <Paragraph position="22"> With the help of disambiguation rules, the idiom can be identified, although the verb &quot;piga&quot; may have several surface forms, including extended forms. 
The solution adopted here is the following: As a first step we identify the constituent parts of the idiom and describe its structure by a tag, as is shown in (7). The angle brackets (&lt;&gt;&gt;) show that the idiom contains the current word as well as the preceding word and two following words. The meaning of the idiom (&quot;to bribe&quot;) is also attached to this word.</Paragraph> <Paragraph position="23"> Then we mark each of the other constituent parts of the idiom and show their relative location in the structure by using angle brackets, as shown in (8). For example, &quot;nyuma&quot; is the last constituent and all three words before it are part of the idiom. Original glosses of other constituent parts are removed. The verb retains its morphological tags, and a special tag (IDIOM-V) is added to show that it is part of the idiom. Although it would be possible to write disambiguation rules for practically all such cases where sufficient features for rule writing are available, it is sometimes impractical, especially in selecting the right semantic interpretation. In such cases a default reading can be chosen instead. This can be implemented in more than one way, for example by constructing the morphological analyser so that the alternative semantic analyses are in frequency order (9).</Paragraph> <Paragraph position="24"> (9) taa
&quot;taa&quot; N 9/10-SG { lamp , lantern } AR
&quot;taa&quot; N 9/10-SG { discipline , obedience }
&quot;taa&quot; N 9/10-SG { large flat fish , skate } AN
&quot;taa&quot; N 9/10-PL { lamp , lantern } AR
&quot;taa&quot; N 9/10-PL { discipline , obedience }
&quot;taa&quot; N 9/10-PL { large flat fish , skate } AN
The word &quot;taa&quot; gets three semantic interpretations, each in singular and plural. The most obvious gloss (lamp, lantern) is the first in order, and if no rule has chosen any of the other alternatives, this one is chosen as the default case. The choice of other alternatives is controlled by rules as far as possible. 
For example, the animate reading can often be chosen with congruence rules.</Paragraph> </Section> </Paper>