<?xml version="1.0" standalone="yes"?>
<Paper uid="C88-1043">
  <Title>A Finite State Approach to German Verb Morphology</Title>
  <Section position="5" start_page="0" end_page="212" type="metho">
    <SectionTitle>
3 A Parallel Rewriting Variant of FSTs with
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="212" type="sub_section">
      <SectionTitle>
Feature Unification
</SectionTitle>
      <Paragraph position="0"> Although Koskenniemi's machinery works in parallel with respect to the rules, rewriting is still performed in a sequential manner: each word form is processed letter by letter (or morpheme by morpheme) such that all replacements are done one at a time. Certainly this model does not depend on the processing direction from left to right, but at any time during processing it focusses on only one symbol on the input tape.</Paragraph>
      <Paragraph position="1"> It is precisely in this respect that our approach, based on a suggestion by Kay, differs from Koskenniemi's. Our work grew out of discussions with M. Kay, for which the first author had the opportunity during a research stay at CSLI, Stanford, in summer 1985. Without his help our investigations would not have been possible. In our system, rewriting is performed over complete surface words, not letters or morphemes. There is no translation from lexical to surface strings, because there is only one level, the level of surface strings. Rewriting is defined by rules satisfying the scheme Pattern → Replacement, where both Pattern and Replacement are strings that may contain the wild card character "?", which matches exactly one (and the same) letter. Let a, b, w1, w2 ∈ Σ*, where Σ is an alphabet. For all w1, w2 the rule a → b, with Pattern = a and Replacement = b, rewrites w1 a w2 to w1 b w2. It should be noted that only one occurrence of the Pattern is rewritten. Furthermore, it can be specified whether the search is to be conducted from left to right or vice versa. Hence, it is possible to perform rewriting in parallel, in contrast to Koskenniemi's sequential mode.</Paragraph>
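      <Paragraph><![CDATA[ The rule scheme described above can be sketched in Python as follows. This is only an illustration, not the authors' TLC-LISP code: the names match_at, rewrite_once and from_left are hypothetical, and the sketch assumes that every "?" in Pattern must bind the same single letter and that a "?" in Replacement is filled with that letter.

```python
def match_at(word, pos, pattern):
    """Return (True, letter) if pattern matches word at pos.

    letter is the single letter bound to '?', or None if the pattern
    contains no wild card (assumed reading of "one and the same letter").
    """
    if pos + len(pattern) > len(word):
        return False, None
    bound = None
    for offset, p in enumerate(pattern):
        c = word[pos + offset]
        if p == "?":
            if bound is None:
                bound = c          # first wild card binds the letter
            elif bound != c:
                return False, None  # later wild cards must agree
        elif p != c:
            return False, None
    return True, bound


def rewrite_once(word, pattern, replacement, from_left=True):
    """Rewrite exactly one occurrence of pattern; None if nothing matches."""
    positions = range(len(word) - len(pattern) + 1)
    if not from_left:
        positions = reversed(positions)
    for pos in positions:
        ok, bound = match_at(word, pos, pattern)
        if ok:
            filled = replacement if bound is None else replacement.replace("?", bound)
            return word[:pos] + filled + word[pos + len(pattern):]
    return None
```

 For instance, rewrite_once("<kommen>", "en>", ">") strips the infinitive ending in a single step over the whole word, and the from_left flag selects which occurrence is rewritten when the pattern matches more than once.
]]></Paragraph>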
      <Paragraph position="2"> The rules are attached to the edges of an FST; hence the application order of the rules is determined by the sequence of admissible transitions. Conflicts arising from the fact that at a given state the patterns of several rules match are resolved by the strategy described in sec. 5.</Paragraph>
      <Paragraph position="3">  Matching of the left hand side of a rule is only one condition for a successful transition. The second condition is that the list of morphosyntactic features of the actual item can be successfully unified with the feature list attached to the respective edge of the automaton.</Paragraph>
      <Paragraph position="4"> The required unification procedure realizes a slightly extended version of the well known term unification algorithm. The objects to be unified are not lists of functors and arguments in a fixed order with fixed lengths, but sets of attributes (named arguments) of arbitrary length. The argument values, however, are restricted to atomic objects, and are therefore not allowed to be attribute lists themselves (as is the case with the recursively defined functional structure datatype in unification grammars).</Paragraph>
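      <Paragraph><![CDATA[ A minimal Python sketch of unification over attribute sets with atomic values is given below. It is a simplification under stated assumptions: feature lists are modelled as dictionaries, an attribute missing on one side is treated as unconstrained, and variables (which full term unification would handle) are left out. The function name unify is hypothetical.

```python
def unify(item_features, edge_features):
    """Unify two attribute sets whose values are atomic.

    Returns the merged attribute set, or None when some attribute
    carries two different atomic values (unification failure).
    """
    result = dict(item_features)
    for attr, value in edge_features.items():
        if attr in result and result[attr] != value:
            return None            # clash on an atomic value
        result[attr] = value       # new attributes are simply added
    return result
```

 Because the values are atomic, no recursion into nested structures is needed; order and length of the attribute sets do not matter.
]]></Paragraph>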
      <Paragraph position="5"> Example 1: Note that words are delimited by angle brackets such that affixes can be substituted for the empty string at the beginning or end of a word.</Paragraph>
      <Paragraph position="6">  This automaton fragment generates "&lt;kam&gt;" with the feature list ((tempus imperf) (group 1) (num sing) (mode indic) (pers 1)) from the infinitive form "&lt;kommen&gt;".</Paragraph>
      <Paragraph position="7"> Currently, there is no compiler which generates an automaton from a given set of rules like the one by Karttunen et al. /1987/ [3], i.e. the automaton has to be coded manually.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="212" end_page="212" type="metho">
    <SectionTitle>
4 The Lexicon
</SectionTitle>
    <Paragraph position="0"> In order to achieve fast access and to avoid redundancy wherever possible, the lexicon is realized as a letter tree with annotated feature lists for terminal nodes.</Paragraph>
    <Paragraph position="1">  Example 2: A section of the letter-tree lexicon containing "wagen", "wiegen", and "wägen".</Paragraph>
    <Paragraph position="2"> ((\w (\a (\g (\e (\n (\/ ((group 2))))))) (\i (\e (\g (\e (\n (\/ ((group 4)))))))) (\ä (\g (\e (\n (\/ ((group 3)))))))))</Paragraph>
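    <Paragraph><![CDATA[ The letter tree of Example 2 can be sketched in Python with nested dictionaries. This is an illustrative reconstruction, not the original LISP data structure; the helper names insert and lookup are hypothetical, and the END key merely mirrors the \/ terminal node of the example.

```python
END = "/"  # terminal marker, mirroring the \/ node in Example 2


def insert(tree, word, features):
    """Add word to the letter tree, sharing existing prefixes."""
    node = tree
    for letter in word:
        node = node.setdefault(letter, {})
    node[END] = features


def lookup(tree, word):
    """Return the feature list stored for word, or None if absent."""
    node = tree
    for letter in word:
        node = node.get(letter)
        if node is None:
            return None
    return node.get(END)


lexicon = {}
insert(lexicon, "wagen", {"group": 2})
insert(lexicon, "wiegen", {"group": 4})
insert(lexicon, "wägen", {"group": 3})
```

 All three verbs share the single root node for "w", which is how the tree avoids redundancy; lookup cost grows with word length, not with lexicon size.
]]></Paragraph>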
  </Section>
  <Section position="7" start_page="212" end_page="212" type="metho">
    <SectionTitle>
5 The Control Strategy
</SectionTitle>
    <Paragraph position="0"> Our implementation follows Kay's suggestion in that processing (analysis as well as generation) is done in two essential steps. First, along a path beginning at a start state, the feature unifications attached to all applicable rules are performed until a final state is reached. The search strategy is depth-first, i.e. at each state the first applicable rewriting rule in the list of transitions is selected.</Paragraph>
    <Paragraph position="1"> In a second phase, such a successful path is traced back to its origin with simultaneous execution of the corresponding rewriting rules.</Paragraph>
    <Paragraph position="2"> For rewriting, a device called an exclusion list is employed, which allows several distinct rules to be combined into one unit (it has been omitted in example 1 for the sake of simplicity). This adds a further restriction to transitions: a transition is blocked if the pattern of the corresponding rule matches, but is contained in the exclusion list.</Paragraph>
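    <Paragraph><![CDATA[ The two-phase control strategy can be sketched in Python as follows. Everything here is an assumption made for illustration: the tiny automaton FST is hypothetical (the original figure is not available), patterns are literal strings without wild cards, pattern applicability in the first phase is tested against the unmodified input word, and the exclusion list is left out. The names unify, find_path and run are hypothetical as well.

```python
def unify(a, b):
    """Merge two attribute sets; None on a clash of atomic values."""
    out = dict(a)
    for k, v in b.items():
        if out.get(k, v) != v:
            return None
        out[k] = v
    return out


def find_path(fst, finals, state, word, feats, path=()):
    """Phase 1: depth-first search, trying edges in their listed order."""
    if state in finals:
        return feats, list(path)
    for pattern, replacement, edge_feats, target in fst.get(state, []):
        merged = unify(feats, edge_feats)
        if pattern in word and merged is not None:
            found = find_path(fst, finals, target, word, merged,
                              path + ((pattern, replacement),))
            if found is not None:
                return found       # first successful path wins
    return None


def run(fst, finals, start, word, feats):
    found = find_path(fst, finals, start, word, feats)
    if found is None:
        return None
    final_feats, path = found
    # Phase 2: trace the path back to its origin, rewriting on the way.
    for pattern, replacement in reversed(path):
        word = word.replace(pattern, replacement, 1)
    return word, final_feats


# Hypothetical fragment: state -> list of (pattern, replacement, features, target)
FST = {
    0: [("omm", "am", {"tempus": "imperf", "group": 1}, 1)],
    1: [("en>", ">", {"num": "sing", "pers": 1, "mode": "indic"}, 2)],
}
```

 Separating the phases means that no string is rewritten until a complete, feature-consistent path is known, so a failed unification deep in the automaton costs no wasted rewriting.
]]></Paragraph>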
  </Section>
  <Section position="8" start_page="212" end_page="213" type="metho">
    <SectionTitle>
6 German Verb Morphology
</SectionTitle>
    <Paragraph position="0"> In German, inflected verb forms exist for the four tense/mode combinations: present tense/indicative, present tense/conjunctive, past tense/indicative and past tense/conjunctive. Furthermore, there are two participles (present and perfect) and two imperative forms derived from the infinitive verb stem. This adds up to 29 possibly different forms per verb.</Paragraph>
    <Paragraph position="1"> With respect to inflection, German verbs can be divided into three classes: regular "weak" verbs ("schwache Verben"), "strong" verbs ("starke Verben") and irregular verbs.</Paragraph>
    <Paragraph position="2"> Inflection of weak verbs is done by simply adding a suffix to the stem. In the special case of the past participle the prefix "ge-" is added too. This class can easily be handled by existing algorithms, like the one of Kay described above.</Paragraph>
    <Paragraph position="3"> Inflection of strong verbs is also done by adding suffixes to the stem; these differ slightly from the ones used with weak verbs. In addition to the change of the ending, the stem itself may vary, too.</Paragraph>
    <Paragraph position="4"> In most cases it is not the whole stem that changes, but only one special vowel in the stem, the stem vowel.</Paragraph>
    <Paragraph position="5"> This change introduces the problems that make an extension of the existing algorithm necessary.</Paragraph>
    <Paragraph position="6"> In most cases irregular verbs can be treated like regular strong verbs with the exception of some special forms.</Paragraph>
    <Paragraph position="7"> Example 3: "sein" (engl.: to be) To conjugate the verb in past tense ("ich war", "du warst", ...), conjugate "war-" as a regular strong verb.</Paragraph>
    <Paragraph position="8"> The following fourteen graphemes can be stem vowels: "a", "e", "i", "o", "u", "ä", "ö", "ü", "ei", "ai", "au", "äu", "eu" and "ie". When conjugating a verb, the stem vowel may change up to six times, as the following example demonstrates:</Paragraph>
    <Paragraph position="9">
Form                                   Infl. Type   Grammatical Description
ich helfe                              (1)          present 1st person
du hilfst                              (2)          present 2nd person
er half                                (3)          past tense
er hülfe (rarely used, but correct)    (4)          past tense conj.
er hälfe                               (4)          past tense conj.
er hat geholfen                        (5)          past participle
</Paragraph>
    <Paragraph position="10"> This gives rise to the combinatorial explosion of 14^6 possible series of stem vowels for each verb conjugation ("paradigm"). Only a small number of those are actually used in the language, but even this number is too big to be handled easily by one of the described algorithms.</Paragraph>
    <Paragraph position="11"> 7 Hard Problems in German Verb Inflection
The following problems are hard to solve with any one of the existing algorithms:
* How can the stem vowel be located? This may be difficult, especially when compound verbs such as "beherzigen" are to be analyzed.
* Given an inflected verb form, how can we find the infinitive stem from which this form is derived? Example: "wöge": "wagen"? or "wiegen"? or "wägen"?
* How can the lexicon be kept small, i.e. can we get around adding all the possible changes of the stem to the lexicon?
The general idea behind our solution is to build a "shell" around Kay's generic two-state-morphology scheme which takes care of the special stem vowel problems in German verbs. The core of this scheme, which is the rewriting-rule algorithm, remains unchanged and adds all appropriate affixes to the stem. This leads to an algorithm that can generate all forms of any German verb, even of a prefixed verb, and analyze these forms as well.
One important part of the extended algorithm is a matrix called the stem-vowel table, which contains all the information about the vowel series occurring in the conjugations of one verb. After some compression and combination of related series, the size of the table is 40*5 lists of characters. This matrix is organized in the following manner: there are five columns corresponding to the five cases of stem vowel change in example 3. Each entry in a column is a list of characters; mostly this list has length one. (The fourth element of the list corresponding to the verb "helfen" would have the two elements "ü" and "ä".)</Paragraph>
    <Paragraph position="12"> The rows list all the possible combinations of vowel change that occur in the present use of the language.</Paragraph>
    <Paragraph position="13"> The shell consists of five basic parts (listed in the order in which they are called when the algorithm generates forms):  1. A routine for locating the stem vowel and replacing it by a generic symbol; it is realized by a simple function.</Paragraph>
    <Paragraph position="14"> 2. An algorithm that separates prefixes from the stem when a compound verb is to be analyzed. It also strips off the infinitive ending. This is done by a simple lookup in the prefix table.</Paragraph>
    <Paragraph position="15"> 3. A lexicon module which also adds some default information to the grammatical information obtained from the lexicon entry.</Paragraph>
    <Paragraph position="16"> Irregular and strong verbs get a group number added to the feature list. The prefix, if one is found, is compared with the list of permissible prefixes in the lexicon.</Paragraph>
    <Paragraph position="17"> 4. The core of the algorithm uses an automaton and rewriting rules to modify the affixes of the verb. In the course of unification new attributes are added to the feature list. In particular, if the verb is strong or irregular, information about the stem vowel is added to the list. The new information contains an offset into the stem vowel table.</Paragraph>
    <Paragraph position="18"> 5. The generic symbol is replaced by the stem vowel indicated by the feature list using a single rewriting rule. The new vowel is looked up in the table, which is indexed by two values in the feature list, namely the group number of the verb (which is either defaulted or part of the lexical information), and a column number, which is added by the automaton.</Paragraph>
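    <Paragraph><![CDATA[ The last of the five steps above, the table lookup, can be sketched in Python as follows. The table contents are assumptions: only one row is shown, its vowels follow the "helfen" forms given earlier, and the group number 7 is invented for illustration. The generic symbol "X" and the names STEM_VOWEL_TABLE and fill_stem_vowel are hypothetical too.

```python
# One hypothetical row; the table described in the text has 40 rows and 5 columns.
# Each entry is a list of characters, mostly of length one.
STEM_VOWEL_TABLE = {
    7: [["e"], ["i"], ["a"], ["ü", "ä"], ["o"]],  # vowel series of "helfen"
}
PLACEHOLDER = "X"  # assumed generic symbol standing in for the stem vowel


def fill_stem_vowel(form, features):
    """Replace the generic symbol by every vowel the table allows.

    The row is selected by the group number, the column by the column
    number that the automaton added to the feature list (1-based).
    """
    row = STEM_VOWEL_TABLE[features["group"]]
    vowels = row[features["column"] - 1]
    return [form.replace(PLACEHOLDER, v, 1) for v in vowels]
```

 With this row, column 4 yields two alternative forms, which matches the remark that a table entry may hold more than one character.
]]></Paragraph>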
  </Section>
  <Section position="9" start_page="213" end_page="213" type="metho">
    <SectionTitle>
8 Further Enhancements to Keep the Analysis of Verbs Fast
</SectionTitle>
    <Paragraph position="0"> The main problem with the analysis of German verb forms is to find the infinitive stem belonging to the inflected stem. As soon as this stem is found, the search tree can be pruned considerably. This is because the lexicon information of the infinitive form may restrict the possible unifications when stepping from one state of the automaton to another one.</Paragraph>
    <Paragraph position="1"> This problem has been solved in the following way. Given an inflected form with a possibly changed stem vowel, we can at least find the position of the actual stem vowel. We can also strip off the ending and the prefix, if one exists (e.g. "erwöge" [infinitive: "erwägen"] → "wXg-"). This leads to a rather peculiar structure for the lexicon. The lexicon mainly contains verb infinitives in an encoded form. The stem vowel of the infinitive is replaced by a place holder, and the stem vowel itself is added to the end of the form, separated from the stem by a hyphen. Stem vowels consisting of more than one character are encoded as a single symbol.</Paragraph>
    <Paragraph position="2">  The analysis is thereby simplified. Immediately after preprocessing the form we can reduce the possible candidates for the related infinitive to the subtree below the hyphen. This special encoding has the side effect that the number of nodes of the lexicon tree is reduced when many similar forms are added to the lexicon.</Paragraph>
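    <Paragraph><![CDATA[ The lexicon encoding described above can be sketched in Python as follows. This is a sketch under several assumptions: the prefix and ending have already been stripped, the stem vowel is taken to be the first vowel grapheme in the stem (the text notes that locating it can be harder), the one-symbol codes in CODES are invented, and the "subtree below the hyphen" is modelled as a nested dictionary instead of a letter tree. The names encode and candidates are hypothetical.

```python
import re

# The fourteen stem-vowel graphemes, multi-character ones listed first
# so that the regular expression prefers them over their first letter.
GRAPHEMES = ["äu", "ei", "ai", "au", "eu", "ie",
             "a", "e", "i", "o", "u", "ä", "ö", "ü"]
VOWEL_RE = re.compile("|".join(GRAPHEMES))
# Hypothetical one-symbol codes for multi-character stem vowels.
CODES = {"ei": "1", "ai": "2", "au": "3", "äu": "4", "eu": "5", "ie": "6"}


def encode(stem):
    """Replace the first vowel grapheme by 'X'; return (key, vowel code)."""
    m = VOWEL_RE.search(stem)
    if m is None:
        raise ValueError("no stem vowel found")
    vowel = m.group(0)
    key = stem[:m.start()] + "X" + stem[m.end():] + "-"
    return key, CODES.get(vowel, vowel)


def candidates(lexicon, inflected_stem):
    """Infinitive stem vowels stored below the key of an inflected stem."""
    key, _ = encode(inflected_stem)
    return sorted(lexicon.get(key, {}))


lexicon = {}
for stem, group in [("wag", 2), ("wieg", 4), ("wäg", 3)]:
    key, vowel = encode(stem)
    lexicon.setdefault(key, {})[vowel] = {"group": group}
```

 In this sketch the inflected stem "wög" maps to the same key "wXg-" as all three infinitives, so the search is confined to three candidates regardless of how large the rest of the lexicon is.
]]></Paragraph>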
    <Paragraph position="3"> 9 Constraints on the Lexicon
Three other classes of verbs have to be considered, if we want to find the stem of any German verb easily:
1. Verbs which change the stem at places other than the stem vowel.</Paragraph>
    <Paragraph position="4"> 2. Verbs with an infinitive ending in "-ern" or "-eln". These verbs omit in some cases the "e" which belongs to the stem (!).
3. Verbs with the ending "-ssen" or "-ßen". For these verbs the "ss" and "ß" have to be exchanged in some forms.
For (1), all the changed stems are added to the lexicon together with the grammatical information that restricts their use to the permissible forms, which results in about 75 new entries for the lexicon. The verbs in (2) and (3) are encoded in a special way. The encoding has no side effects on the rest of the algorithm; it only adds some transitions to the automaton (cf. /Paulus 1986/ [9]).</Paragraph>
  </Section>
  <Section position="10" start_page="213" end_page="213" type="metho">
    <SectionTitle>
10 Further Extensions
</SectionTitle>
    <Paragraph position="0"> The algorithm as implemented can handle all cases of prefixed verbs, even the cases where the prefix is separated from the verb for some forms (e.g. "er kam an").</Paragraph>
    <Paragraph position="1"> The prefixes are added to the lexical information of the infinitive form. Thus an extra prefix requires only a little extra storage for the lexicon. The analysis mode checks whether the prefix is allowable or not.</Paragraph>
    <Paragraph position="2"> Finally, the algorithm also takes care of the transitive and intransitive use of a verb, if this affects the way the verb is inflected (e.g. "er schrak", "er erschreckte mich").</Paragraph>
  </Section>
  <Section position="11" start_page="213" end_page="213" type="metho">
    <SectionTitle>
11 Practical Experience
</SectionTitle>
    <Paragraph position="0"> The complete system for analysis and generation including all of the mentioned extensions has been implemented in TLC-LISP on a PC.</Paragraph>
    <Paragraph position="1"> The lexicon contains all irregular and strong verbs with their prefixes, and many other verbs, without running into memory limitations.</Paragraph>
    <Paragraph position="2"> In a first try the German lexicon was built in a straightforward way (as shown in example 2) and all the inflection was done using rewriting rules only. Comparison with the extended algorithm showed a runtime improvement of more than 75 percent. In absolute figures, analysis takes less than 1 second per verb form; the present version of the program consists of non-optimized compiled LISP code. French and Spanish verbs can be handled directly by the kernel algorithm without the described extensions.</Paragraph>
  </Section>
</Paper>