<?xml version="1.0" standalone="yes"?> <Paper uid="J01-1003"> <Title>Machine Learning</Title> <Section position="4" start_page="60" end_page="61" type="metho"> <SectionTitle> 3. The BOAS Project </SectionTitle> <Paragraph position="0"> Boas (Nirenburg 1998; Nirenburg and Raskin 1998) is a semiautomatic knowledge elicitation system that guides a team of two people (a language informant and a programmer) through the process of developing the static knowledge sources required to produce a moderate-quality, broad-coverage MT system from any &quot;low-density&quot; language into English. Boas contains knowledge about human language phenomena and various realizations of these phenomena in a number of specific languages, as well as extensive pedagogical support, making the system a kind of &quot;linguist in a box,&quot; intended to help nonprofessional users with the task. In the spirit of the goal-driven, &quot;demand-side&quot; approach to computational applications of language processing (Nirenburg and Raskin 1999), the process of acquiring this knowledge has been split into two steps: (i) acquiring the descriptive, declarative knowledge about a language and (ii) deriving operational knowledge (content for the processing engines) from this descriptive knowledge.</Paragraph> <Paragraph position="1"> An important goal that we strive to achieve regarding these descriptive and operational pieces of information, be they elicited from human informants or acquired via machine learning, is that they be transparent, human-readable, and, where necessary, human-maintainable and human-extendable, contrary to the opaque and uninterpretable representations acquired by various statistical learning paradigms.</Paragraph> <Paragraph position="2"> Before proceeding any further, we would also like to make explicit the aims and limitations of our approach. 
Our main goal is to significantly expedite the development of morphological analyzers. Since a root word can be associated with a finite number of word forms, one can, with a lot of work, generate a list of word forms with associated morphological features encoded, then use this as a lookup table to analyze word forms in input texts. Since this process is time consuming, expensive, and error-prone, it is something we would like to avoid. We prefer to capture general morphophonological and morphographemic phenomena using sample paradigms as the basis of lexical abstractions. This reduces the acquisition process to assigning citation forms to one of the established paradigms; the automatic generation process described below does the rest of the work.4 This process is still imperfect, as we expect human informants to err in making their paradigm abstractions and to overlook details and exceptions. So, the whole process is an iterative one, with convergence to a wide-coverage analyzer coming slowly at the beginning (where morphological phenomena and lexicon abstractions are being defined and tested), but significantly speeding up once wholesale lexical acquisition starts. Since the generation of the operational content (data files to be used by the morphological analyzer engine) from the elicited descriptions is expected to take only a few minutes, feedback on operational performance can be provided very quickly.</Paragraph> <Paragraph position="3"> Human languages have many diverse morphological phenomena and it is not our intent at this point to have a universal architecture that can accommodate any and all phenomena. Rather, we propose an extensible approach that can accommodate additional functionality in future incarnations of Boas. 
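To make the contrast concrete, the brute-force lookup-table idea mentioned above can be sketched as follows. This is an illustrative toy in Python, not part of Boas; all names and the sample paradigm data are ours.

```python
def build_lookup_table(paradigms, assignments):
    """Expand every citation form through its paradigm's suffix table,
    producing a surface-form -> [(citation form, features)] lookup."""
    table = {}
    for citation, paradigm_name in assignments:
        stem_of, suffixes = paradigms[paradigm_name]
        for suffix, features in suffixes:
            surface = stem_of(citation) + suffix
            table.setdefault(surface, []).append((citation, features))
    return table

# A toy English-like paradigm; "stem_of" would normally strip any
# citation-form markers, here it is simply the identity.
paradigms = {"noun-s": (lambda cit: cit, [("", "Sg."), ("s", "Pl.")])}
table = build_lookup_table(paradigms, [("book", "noun-s"), ("pen", "noun-s")])
```

Even in this toy form, the table grows linearly with the lexicon times the paradigm size, which is why the paper prefers paradigm abstractions plus learned rules over exhaustive enumeration.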
We also intend to limit morphological processing to single tokens and to deal with multitoken phenomena, such as partial or full word reduplications, with additional machinery that we do not discuss here.</Paragraph> </Section> <Section position="5" start_page="61" end_page="66" type="metho"> <SectionTitle> 4. The Elicit-Build-Test Loop </SectionTitle> <Paragraph position="0"> In this paper we concentrate on operational content in the context of building a morphological analyzer. To determine this content, we integrate the information provided by the informant with automatically derived information. The whole process is an iterative one, as illustrated in Figure 1: the elicited information is transformed into the operational data required by the generic morphological analyzer engine and the resulting analyzer is then tested on a test corpus.5,6 Any discrepancies between the output of the analyzer and the test corpus are then analyzed and potential sources of errors are given as feedback to the elicitation process. Currently, this feedback is limited to identifying problems in handling morphographemic processes (such as, for instance, the change of word-final -y to -i when the suffix -est is added).</Paragraph> <Paragraph position="1"> The box in Figure 1 labeled Morphological Analyzer Generation is the main component, which takes in the elicited information and generates a series of regular expressions for describing the morphological lexicon and morphographemic rules. The morphographemic rules describing changes in spelling as a result of affixation operations are induced from the examples provided by using transformation-based learning (Brill 1995; Satta and Henderson 1997). The result is an ordered set of contextual replace or rewrite rules, much like those used in phonology.</Paragraph> <Paragraph position="2"> 4 We use the term citation form to refer to the word form that is used to look up a given inflected form in a dictionary. 
It may be the root or stem form that affixation is applied to, or it may have additional morphological markers to indicate its citation form status.</Paragraph> <Section position="1" start_page="62" end_page="64" type="sub_section"> <SectionTitle> 4.1 Morphological Analyzer Architecture </SectionTitle> <Paragraph position="0"> We adopt the general approach advocated by Karttunen (1994) and build the morphological analyzer as the combination of several finite-state transducers, some of which are constructed directly from the elicited information, and others of which are constructed from the output of the machine learning stage. Since the combination of the transducers is computed at compile time, there are no run-time overheads. The basic architecture of the morphological analyzer is depicted in Figure 2. The analyzer consists of the union of transducers, each of which implements the morphological analysis process for one paradigm. Each transducer is the composition of a number of components. These components (from bottom to top) are described below:</Paragraph> <Paragraph position="1"> </Paragraph> <Paragraph position="2"> The bottom component is an ordered sequence of morphographemic rules that are learned via transformation-based learning from the sample inflectional paradigms provided by the human informant. These rules are then composed into one finite-state transducer (Kaplan and Kay 1994).</Paragraph> <Paragraph position="3"> The citation form and affix lexicon contains the citation forms and the affixes. We currently assume that all affixation is concatenative and that the lexicon is described by a regular expression of the sort [ Prefixes ]* [ CitationForms ] [ Suffixes ].7 7 We currently assume that we have at most one prefix and at most one suffix, but this is not a fundamental limitation. The elicitation of morphotactics for an agglutinating language like Turkish or Finnish requires significantly more sophisticated elicitation machinery. 
Figure 2 General architecture of the morphological analyzer.</Paragraph> <Paragraph position="5"> The morpheme to surfacy feature mapping essentially maps morphemes to feature names but retains some encoding of the surface morpheme. Thus, allomorphs that encode the same feature would be mapped to different surfacy features.</Paragraph> <Paragraph position="6"> The lexical and surfacy constraints specify any conditions to constrain the possibly overgenerating morphotactics of the citation form and morpheme lexicons. These constraints can be encoded using the citation forms and the surfacy features generated by the previous mapping. The use of surfacy features also enables reference to zero morphemes, which otherwise could not be used. For instance, if in some paradigm a certain prefix does not co-occur with a certain suffix, or always occurs with some other suffix, or if a certain citation form in that paradigm has exceptional behavior with respect to one or more of the affixes, or if the affixal allomorph that goes with a certain citation form depends on the properties of the citation form, these are encoded at this level as finite-state constraints.</Paragraph> <Paragraph position="7"> The surfacy feature to feature mapping module maps the surfacy representation of the affixes to symbolic feature names; as a result, no surface information remains except for the citation form. Thus, for instance, allomorphs that encode the same feature and map to different surfacy features now map to the same feature symbol.</Paragraph> <Paragraph position="8"> The feature constraints specify constraints among the symbolic features. They are a different means of constraining morphotactics from the one provided by lexical and surfacy constraints. At this level, one refers to and constrains symbolic morphosyntactic features as opposed to surfacy features. 
This may provide a more natural or convenient abstraction, especially for languages with long-distance morphotactic constraints.</Paragraph> <Paragraph position="9"> These six finite-state transducers are composed to yield a transducer for the paradigm. The union of the transducers for all paradigms produces one (possibly large) transducer for morphological analysis, where surface strings applied at the lower end produce all possible analyses at the upper end.</Paragraph> </Section> <Section position="2" start_page="64" end_page="64" type="sub_section"> <SectionTitle> 4.2 Information Elicited from Human Informants </SectionTitle> <Paragraph position="0"> The Boas environment guides the language informant through a series of questions leading up to paradigm delineation. The informant indicates the parameters for which a given part of speech inflects (e.g., Case, Number), the relevant values for those parameters (e.g., Nominative, Accusative; Singular, Plural), and the licit combinations of parameter values (e.g., Nominative Singular, Nominative Plural). The informant then posits any number of paradigms, whose members are expected to show similar patterns of inflection. It is assumed that all citation forms that belong to the same paradigm take essentially the same set of inflectional affixes (perhaps subject to morphophonological variations). It is expected that the citation forms and/or the affixes may undergo systematic or idiosyncratic morphographemic changes. It is also assumed that certain citation forms in a given paradigm may behave in some exceptional way (for instance, contrary to all other citation forms, a given citation form may not have one of the inflected forms). A paradigm description provides the full inflectional pattern for one characteristic or distinguished citation form and additional examples for any other citation forms whose inflectional forms undergo nonstandard morphographemic changes. 
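A paradigm description of this kind can be modeled minimally as follows. This is a hypothetical data model for illustration only; the names are ours, and Boas's actual encoding is the SGML-like description text shown in the next section.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class InflectionForm:
    form: str        # e.g. "strony"
    features: str    # e.g. "Gen.Sg."

@dataclass
class ParadigmDescription:
    name: str
    primary_citation: str
    primary_forms: List[InflectionForm]   # full pattern for the primary example
    # additional examples: citation form -> its unpredictable inflected forms
    extra_examples: Dict[str, List[InflectionForm]] = field(default_factory=dict)
    lexicon: List[str] = field(default_factory=list)  # other citation forms

p = ParadigmDescription(
    name="noun-paradigm-1",
    primary_citation="strona",
    primary_forms=[InflectionForm("strona", "Nom.Sg."),
                   InflectionForm("strony", "Gen.Sg.")],
    lexicon=["kot"],   # hypothetical extra citation form
)
```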
If necessary, any lexical and feature constraints can be encoded. Currently the provisions we have for such constraints are limited to writing regular expressions (albeit at a much higher level than standard regular expressions); however, capturing such constraints using a more natural language (e.g., Ranta 1998) can be incorporated into future versions.</Paragraph> </Section> <Section position="3" start_page="64" end_page="66" type="sub_section"> <SectionTitle> 4.3 Elicited Descriptive Data </SectionTitle> <Paragraph position="0"> Figure 3 presents the encoding of the information elicited for one paradigm of a Polish morphological analyzer, which will be covered in detail later.8 The data elicited using the user interface component of Boas is converted into a description text file with various components delineated by SGML-like tags. The components in the description are as follows: * The <LANGUAGE-DESCRIPTION... > component lists information about the language and specifies its vowels and consonants, and other orthographic symbols that do not fall into those two groups.</Paragraph> <Paragraph position="1"> Figure 3 Sample paradigm description generated by Boas elicitation.</Paragraph> <Paragraph position="2"> * The <PARADIGM... > component specifies any additional morphosyntactic features that are common to all citation forms in this paradigm. In the example in Figure 3, the paradigm is for masculine nouns. Everything up to the </PARADIGM> tag is part of the descriptive data for the paradigm. This descriptive data consists of a primary example, a series of zero or more additional examples, and the lexicon.</Paragraph> <Paragraph position="3"> The primary example is given between the <PRIMARY-EXAMPLE> and </PRIMARY-EXAMPLE> tags. The description is given as a sequence of one or more inflection groups between <INF-GROUP> and </INF-GROUP> tags. In some instances, a given lexical item can use different citation forms in different inflectional forms. 
For example, one citation form might be used in the present tense and another in the past tense; or one might be used with multisyllable affixes and another with single-syllable affixes. Thus, a given lexical item can have multiple citation forms, each of which gets associated with a mutually exclusive subset of inflectional forms. All the citation forms for a given lexical item, plus all its inflectional forms, are represented in an inflection group. If the association of citation forms with inflectional forms is predictable (as indicated by the language informant), the subsets of inflectional forms are processed separately; if not, we assume that all citation forms can be used in all inflectional forms and hence overgenerate. Manual constraints can later be added, if necessary, to constrain this overgeneration.</Paragraph> <Paragraph position="4"> Additional examples are provided between <EXAMPLE> and </EXAMPLE> tags. Examples contain new citation forms plus any inflectional forms that are not predictable based on the primary example. Each example is considered an inflectional group and is enclosed within the corresponding tags.</Paragraph> <Paragraph position="5"> The citation forms given in the primary example and any additional examples are considered to be a part of the citation form lexicon of the paradigm definition. Any additional citation forms in this paradigm are listed between the <LEXICON> and </LEXICON> tags.</Paragraph> </Section> </Section> <Section position="6" start_page="66" end_page="81" type="metho"> <SectionTitle> 5. Generating the Morphological Analyzer </SectionTitle> <Paragraph position="0"> The morphological analyzer is a finite-state transducer that is actually the union of the transducers for each paradigm definition in the description provided. Thus, the elicited data is processed one paradigm at a time. 
For each paradigm we proceed as follows:</Paragraph> <Paragraph position="2"> 1. The elicited primary citation form and associated inflected forms are processed to find the &quot;best&quot; segmentation of the forms into stem and affixes.9 Although we allow for inflectional forms to have both a prefix and a suffix (one of each), we expect only suffixation to be employed by the inflecting languages with which we are dealing (Sproat 1992).</Paragraph> <Paragraph position="3"> 2. Once the affixes are determined, we segment the inflected forms for the primary example and any additional examples provided, and pair them with the corresponding surface forms. The segmented forms are now based on the citation form plus the affixes (not the stem). The reason is that we expect the morphological analyzer to generate the citation form for further access to lexical databases to be used in the applications. The resulting segmented form-surface form pairs make up the example base of the paradigm.</Paragraph> <Paragraph position="4"> 3. The citation forms given in the primary example, in additional examples, and explicitly in the lexicon definition of the elicited data, along with the mapping from suffix strings to the corresponding morphosyntactic features, are compiled (by our morphological analyzer generating system) into suitable regular expressions, expressed using the regular expression language of the XRCE finite-state tools (Karttunen et al. 1996).</Paragraph> <Paragraph position="5"> 9 The stem is considered to be that part of the citation form onto which affixes are attached, and in our context it has no function except for determining the affix strings.</Paragraph> <Paragraph position="6"> 
4. The example base of the paradigm generated in step 2 is then used by a learning algorithm to generate a sequence of morphographemic rules (Kaplan and Kay 1994) that handle the morphographemic phenomena.</Paragraph> <Paragraph position="8"> 5. The regular expressions for the lexicon in step 3 and the regular expressions for the morphographemic rules induced in step 4 are then compiled into finite-state transducers and combined by composition to generate the finite-state morphological analyzer for the paradigm.</Paragraph> <Paragraph position="9"> 6. The resulting finite-state transducers for each paradigm are then unioned to give the transducer for the complete set of paradigms.</Paragraph> <Section position="1" start_page="67" end_page="68" type="sub_section"> <SectionTitle> 5.1 Determining Segmentation and Affixes </SectionTitle> <Paragraph position="0"> The suffixes and prefixes in a paradigm are determined by segmenting the inflected forms provided for the primary example. This process is complicated by the fact that the citation form may not correspond to the stem: it may contain a morphological indication that it is the citation form. Furthermore, since the language informant provides only a small number of examples, statistically motivated approaches like the one suggested by Theron and Cloete (1997) are not applicable. We have experimented with a number of approaches and have found that the following approach works quite well.</Paragraph> <Paragraph position="1"> Using the notion of description length (Rissanen 1989), we try to find a stem and a set of affixes that account for all the inflected forms of the primary example. Let C = ⟨c_1, c_2, ..., c_c⟩ be the character string for the citation form in the primary example (the c_i are symbols in the alphabet of the language). Let S_k = ⟨c_1, c_2, ..., c_k⟩, 1 ≤ k ≤ c, be a (string) prefix of C of length k. We assume that the stem onto which morphological affixes are attached is S_k for some k. 
The inflectional forms given in the primary example are F_1, F_2, ..., F_l, with each F_j = ⟨f_1^j, f_2^j, ..., f_{l_j}^j⟩ (the f_i^j are symbols in the alphabet of the language and l_j is the length of the jth form). The function ed(v, w) (ed for edit distance), where v and w are strings, measures the minimum number of symbol insertions and deletions (but not substitutions) that can be applied to v to obtain w (Damerau 1964). We define</Paragraph> <Paragraph position="3"> d(S_k) = k + Σ_{j=1..l} ed(S_k, F_j) as a measure of the information needed to account for all the inflected forms. The first term above, k, is the length of the stem. The second term, the summation, measures how many symbols must be inserted and deleted to obtain the inflected forms. The S_k with the minimum d(S_k) is then chosen as the stem S. Creating segmentations based on stem S proceeds as follows. To determine the affixes in each inflected form F_j = ⟨f_1^j, f_2^j, ..., f_{l_j}^j⟩, we compute the projection P_j = ⟨f_b^j, ..., f_e^j⟩ of the stem in F_j, as that substring of F_j whose alignment with S provides the minimum edit distance, that is, P_j = argmin ed(S, ⟨f_b^j, ..., f_e^j⟩) over all substrings ⟨f_b^j, ..., f_e^j⟩ of F_j with 1 ≤ b ≤ e ≤ l_j. Then we select the substring ⟨f_1^j, ..., f_{b-1}^j⟩ of F_j (if it exists) as the prefix and ⟨f_{e+1}^j, ..., f_{l_j}^j⟩ (if it exists) as the suffix. If there are multiple substrings of F_j that give the same (minimum) edit distance when aligned with S, we prefer the longer substring. We then create</Paragraph> <Paragraph position="5"> (⟨f_1^j, ..., f_{b-1}^j⟩ + C + ⟨f_{e+1}^j, ..., f_{l_j}^j⟩, F_j) as an aligned segmented form-surface form pair and add it to the example base that we will use in the learning stage. Note that we now use the citation form C, and not the stem S, as a part of the segmented form.</Paragraph> <Paragraph position="6"> Thus, at the end of the process we generate pairs of inflected forms and their corresponding segmented forms to be used in the derivation of the morphographemic rules. 
These pairs come from both the inflected forms given in the primary example and from any additional examples given.</Paragraph> <Paragraph position="7"> For example, suppose we have the following primary example:

<PRIMARY-EXAMPLE> <PRIMARY-CIT-FORM FORM = &quot;strona&quot;> <INF-GROUP>
<INF-FORM FORM = &quot;strona&quot; FEATURE = &quot;Nom.Sg.&quot;>
<INF-FORM FORM = &quot;stronę&quot; FEATURE = &quot;Acc.Sg.&quot;>
<INF-FORM FORM = &quot;strony&quot; FEATURE = &quot;Gen.Sg.&quot;>
<INF-FORM FORM = &quot;stronie&quot; FEATURE = &quot;Dat.Sg.&quot;>
<INF-FORM FORM = &quot;stronie&quot; FEATURE = &quot;Loc.Sg.&quot;>
<INF-FORM FORM = &quot;stroną&quot; FEATURE = &quot;Instr.Sg.&quot;>
<INF-FORM FORM = &quot;strony&quot; FEATURE = &quot;Nom.Pl.&quot;>
<INF-FORM FORM = &quot;strony&quot; FEATURE = &quot;Acc.Pl.&quot;>
<INF-FORM FORM = &quot;stron&quot; FEATURE = &quot;Gen.Pl.&quot;>
<INF-FORM FORM = &quot;stronom&quot; FEATURE = &quot;Dat.Pl.&quot;>
<INF-FORM FORM = &quot;stronach&quot; FEATURE = &quot;Loc.Pl.&quot;>
<INF-FORM FORM = &quot;stronami&quot; FEATURE = &quot;Instr.Pl.&quot;>
</INF-GROUP> </PRIMARY-EXAMPLE>

For this example, the stems S_k considered are s, st, str, stro, stron, and strona. Table 1 tabulates d(S_k) considering all the unique inflected forms above. It can be seen that d(S_k) is minimum for S_5 = S = stron. We then determine suffixes based on this stem selection. 
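Under the stated definitions (ed counts only insertions and deletions, and d(S_k) = k + Σ_j ed(S_k, F_j)), the stem selection for this example can be reproduced with a short script. This is our illustration of the computation, not the actual Boas code:

```python
def ed(v, w):
    """Minimum number of symbol insertions and deletions (no
    substitutions) turning v into w; equals len(v) + len(w) - 2*LCS."""
    m, n = len(v), len(w)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if v[i] == w[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    return m + n - 2 * lcs[m][n]

def select_stem(citation, forms):
    """Pick the prefix S_k of the citation form that minimizes
    d(S_k) = k + sum_j ed(S_k, F_j)."""
    candidates = [citation[:k] for k in range(1, len(citation) + 1)]
    return min(candidates,
               key=lambda s: len(s) + sum(ed(s, f) for f in forms))

# the unique inflected forms of the primary example
forms = ["strona", "stronę", "strony", "stronie", "stroną",
         "stron", "stronom", "stronach", "stronami"]
stem = select_stem("strona", forms)  # selects "stron", as in the text
```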
The suffixes are then determined under k = 5, where the stem S = stron perfectly aligns with the initial substring stron in each inflected form F_j, with 0 edit distance.</Paragraph> <Paragraph position="8"> The segmented form-surface form pairs in Table 2 are then generated from the alignment of the stem with each surface form.</Paragraph> </Section> <Section position="2" start_page="68" end_page="73" type="sub_section"> <SectionTitle> 5.2 Learning Segmentation and Morphographemic Rules </SectionTitle> <Paragraph position="0"> The citation form and the affix information elicited and extracted by the process described above are used to construct regular expressions for the lexicon component of each paradigm.13

Table 1 Edit distances ed(S_k, F_j) between each candidate stem S_k and each unique inflected form F_j.

Form F_j    s    st    str    stro    stron    strona
strona      5     4      3       2        1         0
stronę      5     4      3       2        1         2
strony      5     4      3       2        1         2
stronie     6     5      4       3        2         3
stroną      5     4      3       2        1         2
stron       4     3      2       1        0         1
stronom     6     5      4       3        2         3
stronach    7     6      5       4        3         2
stronami    7     6      5       4        3         2

The example segmentations are fed into the learning module to induce morphographemic rules.</Paragraph> <Paragraph position="1"> The input to the learning module is a list of pairs of segmented lexical forms and surface forms. The segmented forms contain the citation forms and affixes; the affix boundaries are marked by the + symbol. This list is then processed by a transformation-based learning paradigm (Brill 1995; Satta and Henderson 1997), as illustrated in Figure 4. The basic idea is that we consider the list of segmented words as our input and find transformation rules (expressed as contextual rewrite rules) to incrementally transform this list into the list of surface forms. The transformation we choose at every iteration is the one that makes the list of segmented forms closest to the list of surface forms. The first step in the learning process is an initial alignment of pairs using a standard dynamic programming scheme. 
The only constraints in the alignment are: (i) a + in the segmented lexical form is always aligned with an empty string on the surface side, notated by 0; (ii) a consonant on one side is always aligned with a consonant or 0 on the other side, and likewise for vowels; (iii) the alignment must correspond to the minimum edit distance between the original lexical and surface forms. 13 The result of this process is a script for the XRCE finite-state tool xfst. Large-scale lexicons can be more efficiently compiled by the XRCE tool lexc. We currently do not generate lexc scripts, but it is trivial to do so.</Paragraph> <Paragraph position="2"> From this point on, we will use a simple example from English to clarify our points. Assume that we have the pairs (un+happy+est, unhappiest) and (shop+ed, shopped) in our example base. We align these and determine the total number of &quot;errors&quot; in the segmented forms that we have to fix to make all segmented forms match the corresponding surface forms. The initial alignment produces the aligned pairs:

un+happy+est        shop0+ed
un0happi0est        shopp0ed

with a total of five errors. From each segmented pair we generate rewrite rules of the sort

u -> l || LeftContext _ RightContext ;

where u(pper) is a symbol in the segmented form and l(ower) is a symbol in the surface form. Rules are generated only from those aligned symbol pairs that are different. LeftContext and RightContext are simple regular expressions describing contexts in the segmented side (up to some small length), also taking into account the word boundaries. 
For instance, from the first aligned-pair example, this procedure would generate rules such as y -> i || p p _ + e (depending on the amount of left and right context used). For readability, we will ignore the escape symbol (%) that should precede any special characters (e.g., +) used in these rules.</Paragraph> <Paragraph position="3"> The # symbol denotes a word boundary and is intended to capture any word-initial and word-final phenomena. The segmentation rules (+ -> 0) require at least some minimal left or right context (usually longer than the minimal context for other rules in order to produce more accurate segmentation decisions). We disallow contexts that consist only of a morpheme boundary, as such contexts are usually not informative. It should be noted that these rules transform a segmented form into a surface form (contrary to what may be expected for analysis). This lets us capture situations where multiple segmented forms map to the same surface form, which occurs when the language has morphological ambiguity. Thus, in a reverse lookup, a given surface form may be interpreted in multiple ways, if applicable.</Paragraph> <Paragraph position="4"> Since we have many examples of aligned pairs in our example base, it is likely that a given rule will be generated from many pairs. For instance, if the pairs (stop+ed, stopped) and (trip+ed, tripped) were also in the list, the gemination rule 0 -> p || p _ + e d (along with certain others) will also be generated from these examples. We count how many times a rule is generated and associate this number with the rule as its promise, meaning that it promises to fix this many &quot;errors&quot; if it is selected to apply to the current list of segmented forms.</Paragraph> <Paragraph position="5"> The rules generated as described above refer to specific strings of symbols as left and right contexts. 
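The mismatch-driven rule generation and promise counting just described can be sketched as follows. This is an illustrative reimplementation under our own naming, assuming the alignment step has already inserted the 0 (epsilon) symbols; it is not the original system:

```python
from collections import Counter

def candidate_rules(pairs, max_ctx=2):
    """Tally promise counts for contextual rewrite rules
    (upper, lower, left_context, right_context), proposed at every
    position where an aligned segmented/surface pair disagrees."""
    promises = Counter()
    for seg, srf in pairs:
        # pairs are assumed pre-aligned: equal length, 0 = empty string
        s, t = "#" + seg + "#", "#" + srf + "#"   # word-boundary markers
        for i in range(1, len(s) - 1):
            if s[i] == t[i]:
                continue                 # rules come only from mismatches
            for lc in range(max_ctx + 1):
                for rc in range(max_ctx + 1):
                    left = s[max(0, i - lc):i]
                    right = s[i + 1:i + 1 + rc]
                    promises[(s[i], t[i], left, right)] += 1
    return promises

pairs = [("un+happy+est", "un0happi0est"), ("shop0+ed", "shopp0ed")]
rules = candidate_rules(pairs)
# e.g. the gemination rule 0 -> p with left context "p" and right
# context "+e" is generated once from the shop/shopped pair
```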
It is, however, possible to obtain more generalized rules by classifying the symbols in the alphabet into phonologically relevant groups, like vowels and consonants. The benefit of this approach is that the number of rules thus induced is typically smaller, and more unseen cases can be covered.</Paragraph> <Paragraph position="7"> For instance, in addition to a rule like 0 -> p || p _ + e, the rules</Paragraph> <Paragraph position="8"> 0 -> p || p _ + VOWELS and 0 -> p || CONSONANTS _ + VOWELS can be generated, where symbols such as CONSONANTS and VOWELS stand for regular expressions denoting the union of the relevant symbols in the alphabet. The promise scores of the generalized rules are found by adding the promise scores of the original rules generating them. Generalization substantially increases the number of candidate rules to be considered during each iteration, but this is not a very serious issue, as the number of examples per paradigm is expected to be quite small. The rules thus learned would be the most general set of rules that do not conflict with the evidence in the examples. It is possible to use a more refined set of classes that correspond to subclasses of vowels (e.g., high vowels) and consonants (e.g., fricatives), but these will substantially increase the number of candidate rules at every iteration and will have an impact on the iteration time unless examples are chosen carefully.</Paragraph> <Paragraph position="9"> At each iteration, candidate rules are generated from the current state of the example pairs. The rules generated are then ranked based on their promise scores, with the top rule having the highest promise. Among rules with the same promise score, we rank more general rules higher, with generality being based on context subsumption (i.e., preference goes to rules using shorter contexts and/or referring to classes of symbols, like vowels or consonants). All segmentation rules go to the bottom of the list, though within this group, rules are still ranked based on decreasing promise and context generality. 
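The ranking policy just described can be sketched as a sort key (our reading of the policy, simplified to context length as the generality measure; the original also ranks class-based contexts as more general):

```python
def rank_key(rule):
    """Sort key for candidate rules (upper, lower, left, right, promise):
    segmentation rules (+ -> 0) go last; within each group, higher
    promise first, then shorter (more general) contexts first."""
    upper, lower, left, right, promise = rule
    is_segmentation = (upper == "+" and lower == "0")
    return (is_segmentation, -promise, len(left) + len(right))

rules = [("+", "0", "p", "e", 7),      # segmentation rule, high promise
         ("y", "i", "pp", "+e", 3),
         ("0", "p", "p", "+e", 3),     # same promise, shorter context
         ("0", "p", "op", "+e", 3)]
ranked = sorted(rules, key=rank_key)
# the shorter-context gemination rule outranks its longer-context variant,
# and the segmentation rule sorts to the bottom despite its higher promise
```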
The reasoning for treating the segmentation rules separately and later in the process is that affixation boundaries constitute contexts for all morphographemic changes; therefore they should not be eliminated if there are any (more) morphographemic phenomena to process.</Paragraph> <Paragraph position="10"> Starting with the top-ranked rule, we test each rule on the segmented component of the pairs. A finite-state engine emulates the replace rules to see how much the segmented forms are &quot;fixed.&quot; The first rule that fixes as many &quot;errors&quot; as it promises to fix, and does not generate an interim example base with generation ambiguity, is selected. 16 The issue of generation ambiguity refers to cases where the same segmented forms are paired with distinct surface forms. 17 In such cases, finding a rule that fixes both pairs is not possible, so in choosing rules, we avoid any rules whose tentative application generates an interim example base with such ambiguities. In this way, we can account for all the discrepancies between the surface and segmented forms without falling into a local minimum. Although we do not have a formal proof that this simple heuristic avoids such local-minimum situations, in our experimentation with a large number of cases we have never seen such an instance. 
The complete procedure for rule learning can now be given as follows:

- Align surface and segmented forms in the example base;
- Compute total Error;
- while (Error > 0) {
    - Generate all possible rewrite rules subject to context size limits;
    - Rank rules;
    - while (there are more rules and a rule has not yet been selected) {
        - Tentatively apply the next rule to all the segmented forms;
        - Re-align the resulting segmented forms with the corresponding surface forms to see how many &quot;errors&quot; have been fixed;
        - If the number of errors fixed is equal to what the rule promised to fix AND the result does not have generation ambiguity, select this rule;
    }
    - Commit the changes performed by the rule on the segmented forms to the example base;
    - Reduce Error by the promise score of the selected rule;
}</Paragraph> <Paragraph position="12"> This procedure eventually generates an ordered sequence of two ordered groups of rewrite rules. The first group of rules is for any morphographemic phenomena in the given set of examples, and the second group of rules handles segmentation. All these rules are composed in the order in which they are generated to construct the Morphographemic Rules transducer at the bottom of each paradigm (see Figure 2).</Paragraph> <Paragraph position="13"> 16 Note that a rule may actually introduce unintended errors in other pairs, since context checking is done only on the segmented form side; therefore what a rule delivers may be different than what it promises, as promise scores also depend on the surface side. 17 Consider a state of the example base where some segmented lexical form L is paired with different surface forms S1 and S2, that is, we have pairs (L, S1) and (L, S2) in our example base. 
Any rule that will bring L closer to S1 will also change L of the second pair and potentially make it impossible to bring it closer to S2.</Paragraph> </Section> <Section position="3" start_page="73" end_page="73" type="sub_section"> <SectionTitle> Computational Linguistics Volume 27, Number 1 5.3 Identifying Errors and Providing Feedback </SectionTitle> <Paragraph position="0"> Once the Morphographemic Rules transducers are compiled and composed with the lexicon transducer that is generated automatically from the elicited information, we obtain an analyzer for the paradigm. The analyzer for the paradigm can be tested by using the xfst environment of the XRCE finite-state tools. This environment provides machinery for testing the output of the analyzer by generating all forms involving a specific citation form, a specific morphosyntactic feature, or the like. This kind of testing has proved quite sufficient for our purposes.</Paragraph> <Paragraph position="1"> When the full analyzer is generated by unioning all the analyzers for each paradigm, one can do a more comprehensive test against a test corpus to see what surface forms in the test corpus are not recognized by the generated analyzer. Apart from revealing obvious deficiencies in coverage (e.g., missing citation forms in the lexicon), such testing provides feedback about minor human errors--the failure to cover certain morphographemic phenomena, or the incorrect assignment of citation forms to paradigms, for example.</Paragraph> <Paragraph position="2"> Our approach is as follows: we use the resulting morphological analyzer with an error-tolerant finite-state recognizer engine (Oflazer 1996). Using this engine, we try to find words recognized by the analyzer that are (very) close to a rejected (correct) word in the test corpus, essentially performing a reverse spelling correction.
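A minimal sketch of this reverse spelling correction follows, under the simplifying assumption that the analyzer's vocabulary can be enumerated as a finite list (the actual system walks the analyzer's transducer directly with the error-tolerant recognizer); the word forms used are illustrative.

```python
# For each corpus word the analyzer rejects, look for recognized forms within
# a small edit distance and return the differing regions for inspection.

from difflib import SequenceMatcher

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def near_misses(rejected, recognized, k=2):
    """Recognized forms within edit distance k of a rejected corpus word,
    paired with the non-matching aligned regions."""
    hits = []
    for word in recognized:
        if edit_distance(rejected, word) <= k:
            diffs = [(rejected[i1:i2], word[j1:j2])
                     for op, i1, i2, j1, j2 in
                     SequenceMatcher(None, rejected, word).get_opcodes()
                     if op != "equal"]
            hits.append((word, diffs))
    return hits

# Illustrative case: the analyzer wrongly accepts *wirchie while the corpus
# contains wirchu; the aligned difference (u vs. ie) localizes the problem.
print(near_misses("wirchu", ["wirchie", "dachu"]))
```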
If the rejection is due to a small number of errors (1 or 2), the erroneous words recognized by the recognizer are aligned with the corresponding correct words from the test corpus.</Paragraph> <Paragraph position="3"> These aligned pairs can then be analyzed to see what the problems may be.</Paragraph> </Section> <Section position="4" start_page="73" end_page="75" type="sub_section"> <SectionTitle> 5.4 Applicability to Infixing, Circumfixing, and Agglutinating Languages </SectionTitle> <Paragraph position="0"> The machine learning procedure for inducing rewrite rules is not language dependent.</Paragraph> <Paragraph position="1"> It is applicable to any language whose lexical representation is a concatenation of free and bound morphemes (or portions thereof). All this stage requires is a set of pairs of lexical and surface representations of the examples compiled for the example base.</Paragraph> <Paragraph position="2"> We have tested the rule learning component above on several other languages including Turkish, an agglutinating language, using an example base with lexical forms produced by a variant of the two-level morphology-based finite-state morphological analyzer described in Oflazer (1994). The lexical representation for Turkish also involved meta symbols (such as H for high vowels, D for dentals, etc.), which would be resolved with the appropriate surface symbol by the rules learned. For instance, vowel harmony rules would learn to resolve H as one of ı, i, u, ü in the appropriate context.</Paragraph> <Paragraph position="3"> Furthermore, the version of the rule learning (sub)system used for Turkish also made use of context-bound morphophonological distinctions that are not elicited in Boas, such as high vowels, low unrounded vowels, dentals, etc. The rules generated were the most general set of rules that did not conflict with the example base.
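As an illustration of what such a learned resolution rule does (the code and rule format are ours, not the system's notation), the meta symbol H can be resolved by consulting the last fully specified vowel to its left, following standard Turkish back/front and rounded/unrounded harmony:

```python
# Resolve the high-vowel meta symbol H in a lexical form by vowel harmony.
BACK, ROUND = set("aıou"), set("ouöü")

def resolve_H(lexical):
    out = []
    for ch in lexical:
        if ch == "H":
            prev_vowels = [c for c in out if c in "aeıioöuü"]
            last = prev_vowels[-1] if prev_vowels else "e"
            back, rnd = last in BACK, last in ROUND
            # Pick the high vowel agreeing in backness and rounding.
            ch = {(False, False): "i", (False, True): "ü",
                  (True, False): "ı", (True, True): "u"}[(back, rnd)]
        out.append(ch)
    return "".join(out)

# ev+Hm -> ev+im and okul+Hm -> okul+um; the later segmentation rules then
# drop '+', yielding evim ("my house") and okulum ("my school").
print(resolve_H("ev+Hm"), resolve_H("okul+Hm"))
```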
There were many examples in the example base that involved multiple suffixes, rather than just one as in the inflecting languages we address in this paper. It was quite satisfying to observe that the system could learn rules for dealing with vowel harmony, devoicing, and so on. A caveat is that if there were too many examples and too many morphophonological classes, the number of candidate rules to be tried increased exponentially. This could be alleviated to a certain extent by a careful selection of the example base.</Paragraph> <Paragraph position="4"> Thus, the rule-learning component is applicable to agglutinative, and also to infixing and circumfixing languages, provided there is a proper representation of the lexical and surface forms. However, for infixing languages it could be very problematic to have a linear representation of the infixation, with the lexical root being split in two and the morphotactics picking up the first part, the infix, and the second part. To prevent overgeneration, the infix lexicon might have to be replicated for each root, to enforce the fact that the two parts of the stem go together. 18 The case for circumfixation is simpler since the number of such morphemes is assumed to be much smaller than the number of stems, so the circumfixing morphemes can be split up into two lexicons and treated as a prefix-suffix combination. The co-occurrence restrictions for the respective pairs can then be manually enforced with finite-state constraints that can be added to the lexical and surfacy constraints section of the analyzer (see Figure 2).</Paragraph> <Paragraph position="5"> Thus, in all three cases, learning the rules is not a problem provided the example base is in the requisite linear representation.
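The prefix-suffix treatment of circumfixes can be sketched as follows. Everything here is illustrative: the German-style ge-...-t circumfix is used only as a familiar example, and the dictionary of licensed pairings stands in for the finite-state co-occurrence constraints mentioned above.

```python
# Split each circumfix into a prefix and a suffix lexicon, and reject any
# combination whose two halves are not licensed to co-occur.

CIRCUMFIXES = {("ge", "t")}                 # licensed prefix/suffix pairings
PREFIXES = {p for p, s in CIRCUMFIXES}
SUFFIXES = {s for p, s in CIRCUMFIXES}

def analyze(surface, stems):
    """Return (prefix, stem, suffix) if surface = prefix+stem+suffix with a
    licensed circumfix pairing and a known stem, else None."""
    for p in PREFIXES:
        for s in SUFFIXES:
            if surface.startswith(p) and surface.endswith(s):
                stem = surface[len(p):len(surface) - len(s)]
                if stem in stems and (p, s) in CIRCUMFIXES:
                    return p, stem, s
    return None

print(analyze("gesagt", {"sag"}))   # licensed: ('ge', 'sag', 't')
print(analyze("gesagen", {"sag"}))  # no licensed pairing -> None
```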
On the other hand, this approach as such is inapplicable to languages like Arabic, which have radically different word formation processes (for which a number of other finite-state approaches have been proposed; see, for example, Beesley \[1996\] and Kiraz \[2000\]).</Paragraph> <Paragraph position="6"> In contrast to acquiring the rewrite rules, eliciting the morphotactics and the affix lexicons for an agglutinating language (semi)automatically is a very different process and is yet to be addressed. There are three parts to this problem:</Paragraph> <Paragraph position="8"> (i) determining the boundaries of free and bound morphemes, accounting for any morphographemic variations; (ii) determining the order of morphemes; and (iii) determining the &quot;semantics&quot; of the morphemes, that is, the features they encode.</Paragraph> <Paragraph position="9"> These are complicated by a number of additional issues such as zero morphemes, local and long-distance co-occurrence restrictions (e.g., for allomorph selection), exceptions, productive derivations, circular derivations, and morphemes with the same surface forms but a totally different morphotactic position and function. Also, in languages that have a phenomenon like vowel harmony, such as Turkish, even if all harmonic allomorphs of a certain suffix are somehow automatically grouped into a lexicon without any further abstraction, severe overgeneration would result, unless all the root and suffix lexicons were split or replicated along vowel lines. In such cases, a human informant (who possesses a certain familiarity with morphographemics and issues of overgeneration) may have to resort to manual abstraction of the morpheme representations. Then the process of acquiring the features for inflectional and derivational morphemes could proceed.</Paragraph> <Paragraph position="10"> 6.
Bootstrapping a Polish Analyzer. This section presents a quite extensive example of bootstrapping a morphological analyzer for Polish by iteratively providing examples and testing the morphological analyzer systematically. The idea of this exercise was to have a relatively limited number of paradigms that bunched words showing slight inflectional variations. 19 For reasons of space, the exposition is limited to developing four paradigms, of which one will be covered in detail. 18 This is much like what one encounters when dealing with reduplication in the FS framework. Also note that this is a lexicon issue and not a rule issue. 19 Nonexpert language informants using Boas will be encouraged to split, rather than bunch, paradigms, for the sake of simplicity.</Paragraph> <Paragraph position="11"> The paradigms here cover only a subset of masculine nouns, and do not treat feminine or neuter nouns at all; however, they cover all the problems that would be found in words of those genders.</Paragraph> <Paragraph position="12"> For purposes of testing the learner off-line (i.e., outside the Boas environment), we tried to keep to a minimum the number of inflected forms given for each additional citation form. This was a learner-oriented task and intended to determine how robust the learner could become with a minimum of input. When using the Boas interface, the language informant will not have the option of selectively providing inflected forms. The interface works as follows: the informant gives all forms of the primary example and lists other citation forms that he or she thinks belong to the given paradigm. Having learned rules from the primary example, the learner generates all the inflectional forms for each citation form provided. The informant then corrects all mistakes and the learner relearns the rules.
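The elicit-generate-correct loop just described can be sketched schematically; `learn_rules`, `generate`, and `ask_informant` below are placeholders for the learner and the interface, not Boas APIs, and the round limit is our own safety guard.

```python
# Schematic bootstrapping loop: generate forms for each citation form, collect
# the informant's corrections, relearn, and stop when nothing is corrected.

def bootstrap_paradigm(primary_forms, citation_forms, learn_rules, generate,
                       ask_informant, max_rounds=10):
    """primary_forms: (citation form, features) -> inflected form, fully
    elicited for the primary example; citation_forms: other words the
    informant assigned to this paradigm."""
    example_base = dict(primary_forms)
    rules = learn_rules(example_base)
    for _ in range(max_rounds):                 # guard against non-convergence
        corrections = {}
        for word in citation_forms:
            for features, form in generate(rules, word):
                fixed = ask_informant(word, features, form)
                if fixed != form:               # the informant corrected a form
                    corrections[(word, features)] = fixed
        if not corrections:                     # every generated form accepted
            return rules
        example_base.update(corrections)        # add corrections and relearn
        rules = learn_rules(example_base)
    return rules
```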
So, the informant never has the opportunity to say &quot;Well, I know the learner can't predict the locative singular for this word, so I will supply it overtly from the outset.&quot; The informant will just have to wait for the learner to get the given forms wrong and then correct them. Any other approach would make for a complex interface and would require a sophisticated language informant--not what we are expecting.</Paragraph> <Paragraph position="13"> Polish is a highly inflectional West Slavic language that is written using extended Latin characters (six consonants and three vowels have diacritics). Certain phonemes are written using combinations of letters: e.g., sz, cz, and szcz represent phonetic š, č, and šč, respectively. 20 Polish nominals inflect for seven cases: Nominative (Nom.), Accusative (Acc.), Genitive (Gen.), Dative (Dat.), Locative (Loc.), Instrumental (Instr.), and Vocative (Voc.); and two numbers: Singular (Sg.) and Plural (Pl.). 21 The complexity of Polish declension derives from four sources: (i) certain stem-final consonants mutate during inflection; these are called &quot;alternating&quot; consonants, and are contrasted with so-called &quot;nonalternating&quot; consonants (alternating/nonalternating is a crucial diagnostic for paradigm delineation in Polish); (ii) certain letters are spelled differently depending on whether they are word-final or word-internal (e.g., word-final -ś is written -si- when followed by a vocalic ending); (iii) final-syllable vowels are added/deleted in some (not entirely predictable) words; and (iv) declension is not entirely phonologically driven--semantics and idiosyncrasy affect inflectional endings. The following practical simplifications have been made for testing purposes: Words that are normally capitalized (like names) are not capitalized here.</Paragraph> <Paragraph position="14"> Some inflectional forms that might not be semantically valid (e.g., plurals for collectives) were disregarded.
Thus a bit of overgeneration still remains but can be removed with some additional effort.</Paragraph> </Section> <Section position="5" start_page="75" end_page="76" type="sub_section"> <SectionTitle> 6.1 Paradigm 1 </SectionTitle> <Paragraph position="0"> The process starts with the description of Paradigm 1, which describes alternating inanimate masculine nouns with genitive singular in -u and no vowel shifts. 20 We actually treat these as single symbols during learning. Such symbols are indicated in the description file in a special section that we have omitted in Figure 3. 21 The Vocative case was not included in these tests because it is not expected to occur widely in the journalistic prose for which the system is being built.</Paragraph> <Paragraph position="1"> The following primary example was given in full (the citation form and its table are not reproduced here). Its Loc.Sg. forms show the consonant alternations t -> c, d -> dz, st -> śc, zm -> źm, ł -> l, r -> rz, sł -> śl. Instr.Sg. and Nom.Pl. depend on the final consonant; two velars have an idiosyncratic ending:</Paragraph> </Section> <Section position="6" start_page="76" end_page="78" type="sub_section"> <SectionTitle> Final Consonant(s) </SectionTitle> <Paragraph position="0"> [Table of final-consonant groups: b, p, f, w, m, n, s, z; t, d, st; zm, ł; r, sł; ch.] The following examples were provided in addition to the inflectional forms of the primary example in order to show Loc.Sg. endings and accompanying consonant alternations that could not be predicted based on the primary example:
1. t -> c: akcent (Nom.Sg.), akcencie (Loc.Sg.)
2. d -> dz: wykład (Nom.Sg.), wykładzie (Loc.Sg.)
3. st -> śc: most (Nom.Sg.), moście (Loc.Sg.)
4. zm -> źm: komunizm (Nom.Sg.), komuniźmie (Loc.Sg.)
5. ł -> l: artykuł (Nom.Sg.), artykule (Loc.Sg.)
6. r -> rz: teatr (Nom.Sg.), teatrze (Loc.Sg.)
7. sł -> śl: pomysł (Nom.Sg.), pomyśle (Loc.Sg.)
The following additional examples were provided to show velar peculiarities:
8. g: pociąg (Nom.Sg.), pociągu (Loc.Sg.), pociągiem (Instr.Sg.), pociągi (Nom.Pl.)
22 Strictly speaking, the consonants b, p, f, w, m, n, s, and z alternate as well in the Loc.Sg., since alternating/nonalternating is a phonological distinction, not a graphotactic one. The softening of these consonants is indicated by the -i that precedes the canonical Loc.Sg. ending -e. However, for our purposes it is more straightforward to consider the Loc.Sg. ending for these consonants -ie with no accompanying graphotactic alternation.</Paragraph> <Paragraph position="1"> 9. k: bank (Nom.Sg.), banku (Loc.Sg.), bankiem (Instr.Sg.), banki (Nom.Pl.)</Paragraph> <Paragraph position="2"> 10. ch: dach (Nom.Sg.), dachu (Loc.Sg.)</Paragraph> <Paragraph position="3"> Table 3 summarizes the first three runs for this paradigm, which were sufficient to create a relatively robust set of morphological rules that required only slight amendment and further testing in two additional runs.</Paragraph> <Paragraph position="4"> For this and subsequent such tables we use the following conventions: Key 0 shows the primary citation form and additional citation forms whose inflectional patterns should be fully covered by the rules generated for the primary example. The other key numbers correspond to the additional examples given above. Boldface citation forms under the lexicon column are those for which some additional inflectional examples were given. The citation forms given in plain text are for testing purposes. Oblique cases refer to the Genitive, Dative, Locative, and Instrumental cases.</Paragraph> <Paragraph position="5"> The original assumption for Paradigm 1 was that it would be sufficient to provide one unmutated form (the Nom.Sg.) plus the mutated form (the Loc.Sg.) for words ending in mutating consonants.
This led to overgeneralization of the alternation; therefore, another unmutated form had to be added as a &quot;control.&quot; Adding the Nom.Pl. forms fixed most oblique forms for all the words, but it left the Instr.Sg. mutated. This appears to be because the inflectional ending for the Loc.Sg. (which mutates) and the Instr.Sg. (which does not) both begin in -e for the words in question. Adding the Instr.Sg. overtly counters overgeneralization of the alternation. The source of the velar errors is not immediately evident.</Paragraph> <Paragraph position="6"> Supplementary testing was carried out after the above-mentioned words were all correct. Correct forms were produced for all new words showing consonant mutations and velar peculiarities: samolot, przykład, pretekst, podział, kolor, dług, lek, gmach. One error for a nonmutating word (in Key 0) occurred. This word, herb, ends in a different consonant than the primary example and produced the wrong Loc.Sg. form. This was later added overtly and more words with other nonmutating consonants (postęp, puf, gniew, film, opis, raz) were tested; all were covered correctly.</Paragraph> </Section> <Section position="7" start_page="78" end_page="79" type="sub_section"> <SectionTitle> 6.2 Paradigm 2 </SectionTitle> <Paragraph position="0"> The paradigm implemented next was Paradigm 2: alternating inanimate masculine nouns with genitive singular in -u and vowel shifts. The following primary example for the citation form grób was given in full. This paradigm is just like Paradigm 1, except that there are vowel shifts that are not entirely graphotactically predictable; therefore, words showing these shifts must be classed separately. The vowel shifts occur in all inflectional forms except the Nom.Sg. and the Acc.Sg., which are identical.
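Such stem-internal shifts are exactly what a character-level alignment of lexical and surface forms exposes to the learner. A sketch using Python's standard difflib (the system's own alignment procedure is different); the grób and sen pairs are drawn from this paradigm's examples:

```python
# Expose non-identical aligned regions between a lexical form (with the '+'
# morpheme boundary) and the corresponding surface form.

from difflib import SequenceMatcher

def alternations(lexical, surface):
    """Non-identical aligned regions between a lexical and a surface form."""
    sm = SequenceMatcher(None, lexical, surface)
    return [(lexical[i1:i2], surface[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

print(alternations("grób+u", "grobu"))   # ó -> o shift plus '+' deletion
print(alternations("sen+u", "snu"))      # e -> ∅ (vowel deletion) plus '+'
```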
The following vowel shifts occurred in the cases we considered (∅ indicates vowel deletion): ó -> o and e -> ∅.</Paragraph> <Paragraph position="1"> Based on the experience of Paradigm 1, the Instr.Sg. forms for all words with consonant alternation were provided as examples at the outset to avoid the overgeneralization of the alternation. The velar peculiarities are still in effect and must be dealt with explicitly.</Paragraph> <Paragraph position="2"> The following examples were given to exemplify vowel shifts with an unmutating consonant:
1. e -> ∅ shift with n: sen (Nom.Sg.), śnie (Loc.Sg.)
Further examples, not reproduced here, were employed to show vowel shifts in combination with various consonant alternations in the Loc.Sg. forms. At the end of the first run for this paradigm only one of the eight groups above was covered completely. All vowel shifts for all groups came out right. However, the Nom.Pl. and Acc.Pl. endings were incorrectly generalized as -i instead of -y, probably because two &quot;exceptional&quot; velar examples (in -i) were provided in contrast to one &quot;regular&quot; nonvelar example (in -y). Adding the Nom.Pl. forms of three nonvelar words fixed this error. The results for velars were perfect except for the loss of z in 10 of 12 forms of obowiązek. Adding the Nom.Pl. form obowiązki fixed this. For stół and dół, the consonant alternation was incorrectly extended to Gen.Sg. Adding the Gen.Sg. form of stół fixed this error for both words.
At the end of the second run, all groups were correctly learned.</Paragraph> <Paragraph position="3"> Supplementary testing after the above-mentioned words were correct included the words nawóz, dochód, pozór, rozbiór, gród, rozchód, naród, wtorek, kierunek; all forms were correct.</Paragraph> </Section> <Section position="8" start_page="79" end_page="80" type="sub_section"> <SectionTitle> 6.3 Paradigm 3 </SectionTitle> <Paragraph position="0"> Paradigm 3 contains alternating &quot;man&quot; nouns--that is, masculine nouns referring to human men. The following primary example for the citation form pasierb was given in full. In this paradigm, all of the consonant alternations encountered above are still in effect and some word-final consonants undergo additional alternations in the Nom.Pl. The velar peculiarities remain in effect. One additional complication in this paradigm is that there may be multiple Nom.Pl. forms for a given citation form (e.g., pasierbowie and pasierbi are both acceptable Nom.Pl. forms for pasierb). Furthermore, -i/-y are allomorphs in complementary distribution (i.e., the second Nom.Pl. form in this paradigm is realized with -y for certain word-final consonants).</Paragraph> <Paragraph position="1"> Since the analyzer needs only to analyze (and not generate) forms, there is no need to split this paradigm into five different ones to account for each Nom.Pl. possibility: -owie, -owie/-i, -i, -owie/-y, -y. We simply permit overgeneration, allowing each word to have two Nom.Pl. forms: the correct one of the -i/-y allomorphs and -owie.
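A toy illustration of why this overgeneration is harmless in the analysis direction (the names and data structures here are ours): every stem is allowed all three endings, so an overgenerated form would be accepted if it ever occurred in text, but no attested form is misanalyzed.

```python
# Analysis-direction lexicon that deliberately overgenerates Nom.Pl. endings.

def analyses(surface, stems):
    """All (stem, 'Nom.Pl.') readings licensed by the overgenerating endings."""
    out = []
    for stem in stems:
        for ending in ("owie", "i", "y"):
            if surface == stem + ending:
                out.append((stem, "Nom.Pl."))
    return out

stems = {"pasierb"}
print(analyses("pasierbowie", stems))  # attested form -> analyzed
print(analyses("pasierbi", stems))     # attested form -> analyzed
print(analyses("pasierby", stems))     # overgenerated form, never in text
```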
Further, since the analyzer has no way to predict which of the -i/-y allomorphs is used with a given word-final consonant, explicit examples of each word-final consonant must be provided.</Paragraph> <Paragraph position="2"> These considerations lead to splitting the citation forms for this paradigm into 14 groups, which represent the primary example plus 13 inflectional groups added as supplementary examples. The Nom.Sg., Loc.Sg., and both (or applicable) Nom.Pl. forms were provided for all groups apart from the primary example. After the first run, 13 of 14 groups were correctly covered. The remaining group was handled correctly in two additional runs: two more inflectional forms of the example in word-final r had to be provided to counter overgeneralization of the r -> rz alternation. Supplementary testing after the above-mentioned words were correct included the citation forms drab, piastun, kasztelan, faraon, wójt, mnich, biedak, norweg, włoch. The following errors were encountered: norweg got the Acc.Sg./Gen.Sg. form *norweda instead of norwega.</Paragraph> <Paragraph position="3"> Adding the correct Acc.Sg. form fixed this problem.</Paragraph> <Paragraph position="4"> włoch got the Nom.Pl. form *włoci instead of włosi. This form was added overtly.</Paragraph> <Paragraph position="5"> mnich got the Nom.Pl. form *mnici instead of mnisi. This form was added overtly.</Paragraph> <Paragraph position="6"> After these final additions, włoch and mnich ended up with the Acc.Sg./Gen.Sg. forms *włosa and *mnisa instead of włocha and mnicha (i.e., the alternation was overgeneralized again). Overtly adding the correct Acc.Sg.
form włocha solved this problem for both words and all forms were now correct.</Paragraph> </Section> <Section position="9" start_page="80" end_page="81" type="sub_section"> <SectionTitle> 6.4 Paradigm 4 </SectionTitle> <Paragraph position="0"> Paradigm 4 was for nonalternating inanimate masculine nouns with genitive singular in -a and no vowel shifts. The following declension for bicz was provided as the primary example. A spelling rule of Polish comes into play in this paradigm: letters that take a diacritic word-finally or when followed by a consonant are spelled with no diacritic plus an -i when followed by a vowel. For instance: ń+u -> niu, ń+owi -> niowi, ć+u -> ciu, ć+owi -> ciowi. Some, but not all, word-final letters in this paradigm have diacritics. In addition, in this paradigm, Gen.Pl. endings depend on the final consonant: they can be -ów (for j, ch, szcz), -i (for l, ń, ć) or -y (for cz, sz, rz, ż). In many instances, more than one form is possible, but this test covers only the most common form for each stem-final consonant.</Paragraph> <Paragraph position="1"> The citation forms in this paradigm broke down into 10 groups based on the final consonant. The Nom.Sg., Gen.Pl., and Instr.Pl. forms were provided for the 9 groups (the tenth is the primary example, for which all forms were provided). Eight of the 10 groups were handled correctly after the first run. The spelling rule related to -i required some extra forms to be learned correctly. Otherwise, everything came out as predicted. Supplementary testing included the citation forms klawisz, bąbel, strumień, łach, cyrkularz; all inflectional forms were produced correctly.</Paragraph> </Section> </Section></Paper>