<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-4195">
  <Title>BROAD COVERAGE AUTOMATIC MORPHOLOGICAL SEGMENTATION OF GERMAN WORDS</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 THE LINGUISTIC
KNOWLEDGE
</SectionTitle>
    <Paragraph position="0"> The main linguistic knowledge sources of the system are the morph dictionary, which contains information about the morph class(es) each morph belongs to, and the word syntax.</Paragraph>
    <Paragraph position="1"> We developed a classification scheme for German morphs and a suitable word syntax.</Paragraph>
    <Paragraph position="2"> The significant step to our current version was the classification of an extensive German morph list based on about 9,000 morphs compiled by the Institut für deutsche Sprache in Mannheim (Germany). We merged these morphs with an experimental list of about 2,200 morphs which we used in the former versions of our system. Additionally, we increased the resulting list to almost 11,000 entries by adding many foreign morphs.</Paragraph>
    <Paragraph position="3"> It turned out that for the manual development of the syntax the formalism of finite state networks is easier to handle than a right linear regular grammar. So we first represented the syntax with a finite state network, which was finally translated into a functionally equivalent right linear grammar.</Paragraph>
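    The translation from a finite state network to a right linear grammar can be illustrated with a minimal sketch. It is not the authors' implementation: the state names, arc labels, and the tiny example network are hypothetical, and the sketch assumes that the start state is not itself a final state. Each arc from state A to state B with label x becomes the rule A -> x B, and an additional rule A -> x is emitted whenever B is final.

    ```python
    # Toy sketch: convert a finite state network into a right linear grammar.
    # All state names, labels, and the example network are hypothetical.

    def network_to_grammar(arcs, final_states):
        """Return right linear rules as (lhs, terminal, rhs) triples.

        Each arc (source, label, target) yields the rule source -> label target.
        If the target is a final state, the rule source -> label is added too
        (rhs is None), so that a derivation can stop there.
        """
        rules = []
        for source, label, target in arcs:
            rules.append((source, label, target))
            if target in final_states:
                rules.append((source, label, None))
        return rules

    # Hypothetical network: optional prefix, stem, inflexional ending.
    ARCS = [
        ("START", "prefix", "AFTER_PREFIX"),
        ("START", "stem", "AFTER_STEM"),
        ("AFTER_PREFIX", "stem", "AFTER_STEM"),
        ("AFTER_STEM", "ending", "END"),
    ]
    RULES = network_to_grammar(ARCS, final_states={"END"})

    for lhs, terminal, rhs in RULES:
        print(lhs, "->", terminal, rhs or "")
    ```

    Because every rule has at most one nonterminal, and only at the right edge, the resulting grammar accepts the same label sequences as the network.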
    <Paragraph position="4">  So far, we have successively developed and tested four classification schemes, each with a new, better syntax. We describe the third and fourth schemes, which are of current interest (cf. Table 1).</Paragraph>
    <Paragraph position="5"> The substructures of the entire transition net dealing with the word classes verb, adjective, noun etc. will be called verb net, adjective net, noun net etc. We should stress that these substructures are not independent automata with separate input. Nevertheless we call them nets; parts of these nets will be called subnets. Although the word formation of the different word classes is not fully distinct and does share some substructures, it was not possible to design the entire net in such a way that the nets for these word classes physically share some subnets. Instead, physical copies of common subnets had to be created for each occurrence of such a subnet in each of the nets. The reason is that we used a finite state network to represent the word syntax. This formalism does not allow one common subnet to be activated from different points and control then to return to the appropriate activation point.</Paragraph>
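    The need for physical copies can be made concrete with a small sketch. It is only an illustration under stated assumptions: the subnet, the state names, and the helper embed are hypothetical and do not come from the system described here. Since a plain finite state network has no call and return mechanism, each use of a common subnet gets its own renamed interior states.

    ```python
    # Toy sketch: a shared subnet must be copied for every place it is used.
    # The subnet, state names, and the helper embed are hypothetical.

    # A subnet that accepts one or two prefixes between its "in" and "out" states.
    PREFIX_SUBNET = [
        ("in", "prefix", "mid"),
        ("mid", "prefix", "out"),
        ("in", "prefix", "out"),
    ]

    def embed(subnet, tag, entry, exit_state):
        """Copy a subnet and wire it between entry and exit_state.

        Interior states are renamed with tag so that copies never share a state;
        a shared state would let a path enter from one net and leave into another.
        """
        boundary = {"in": entry, "out": exit_state}

        def state(name):
            return boundary.get(name, tag + "." + name)

        return [(state(s), label, state(t)) for s, label, t in subnet]

    verb_net = embed(PREFIX_SUBNET, "verb", "V0", "V1")
    noun_net = embed(PREFIX_SUBNET, "noun", "N0", "N1")
    print(verb_net)
    print(noun_net)
    ```

    The two copies describe the same prefix behaviour but have no state in common, which is the price of staying within the finite state formalism.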
    <Paragraph position="6"> We will limit the following description to the nets for those word classes with productivity in word formation.</Paragraph>
    <Paragraph position="7"> Verbs Our Verb Net is responsible for the segmentation of finite verbs. Those of its subnets containing stem-labelled arcs refer to different combinations of mood, tense, and weak versus strong verb stems. Each of these combinations demands specific inflexional endings. Weak stems are tense-invariant, whereas strong stems can vary according to tense by vowel gradation (Umlaut or Ablaut). As a consequence, the classification of strong verb stems is oriented towards their suitability for certain tenses. For example, &lt;=ging&gt; is an imperfect tense form of &lt;=geh%en&gt; (engl.</Paragraph>
    <Paragraph position="8"> to go). In our morph dictionary the two morphs &lt;geh&gt; and &lt;ging&gt; are two independent entries, each with its own tense-oriented classification. Weak stems are classified according to prefixation and derivation needs. We took into account three groups of derivational suffixes: 1) &lt;-el&gt;, &lt;-er&gt;, 2) &lt;-ig&gt;, &lt;-lich&gt;, and 3) &lt;-ier&gt;. In the area of verb prefixes, one problem solved in our current version was to avoid the splitting of particular prefixes, e.g. *&lt;+her+unter=geh%en&gt; alongside the correct segmentation, which is &lt;+herunter=geh%en&gt; (engl.: to go down). [Table 1: Overview of the Extent of the Syntaxes and Classification Systems Developed]</Paragraph>
    <Paragraph position="9"> In German, each infinitive can take the role of a noun, and each participle can do the same after being inflected. As a consequence, the part of our transition net related to non-finite verbs is integrated into the Noun Net.</Paragraph>
    <Paragraph position="10"> The set of verb stem classes had to be expanded for our current version to implement composition restrictions concerning verb stems as parts of nouns. For example, we had to cope with missegmentations such as</Paragraph>
    <Paragraph position="12"> the correct segmentation &lt;+Er=find%er=schon%ung&gt; (engl. careful treatment of inventors). At least two restrictions exist: Firstly, the verb stem &lt;find&gt; is not allowed before a noun (which the word *&lt;Erschonung&gt; would be, if it existed) but, e.g., the originally identically classified verb stem &lt;bind&gt; is, as in &lt;Bindfaden&gt; (engl.</Paragraph>
    <Paragraph position="13"> string). Secondly, the morph &lt;er&gt; is not a suitable prefix for the verb stem &lt;schon&gt; but is, e.g., for the originally identically classified verb stem &lt;schein&gt;, leading to &lt;erscheinen&gt; (engl. to appear). Verb-stem-related restrictions like these, which we implemented in our system by adding morph classifications to the existing ones, are only relevant for nouns. In the first case mentioned above, this is obvious. The second restriction does not concern finite verbs, because missegmentations only occur when the morph &lt;er&gt; is positioned between two stems. At the beginning of a word, the morph &lt;er&gt; can be seen as a prefix without any restrictions.</Paragraph>
    <Paragraph position="14"> Adjectives The adjective net consists of three subnets, each representing a possible way of adjectival derivation in German.</Paragraph>
    <Paragraph position="15"> 1. Simple adjectives like &lt;schnell&gt; (engl.: fast), &lt;schön&gt; (engl.: beautiful) etc.</Paragraph>
    <Paragraph position="16"> These stems can be compared and inflected. Some stems occur only in a certain degree of comparison, like &lt;bess&gt; (stem of engl.: better), &lt;bes&gt; (stem of engl.: best). They have obligatory comparative or superlative suffixes, while the corresponding stems of the positive degree must not be followed by these suffixes, e.g. &lt;gut&gt; (engl.: good). [Actes de COLING-92, Nantes, 23-28 août 1992; page 1220; Proc. of COLING-92, Nantes, Aug. 23-28, 1992]</Paragraph>
    <Paragraph position="17">  2. Adjectives derived from verbs or verbal stems like &lt;+be=geh%bar&gt; (engl.: passable) 3. Adjectives derived from nouns. Example:</Paragraph>
    <Paragraph position="19"> As a peculiarity of German word formation, a past participle may be compared and inflected like an adjective stem. Example: The past participle of &lt;gelingen&gt; (engl.: to succeed) has the comparative forms &lt; +ge-</Paragraph>
    <Paragraph position="21"> may be translated as "successful, more successful, most successful".</Paragraph>
    <Paragraph position="22"> Roughly speaking, the concept of the adjective net is to allow an adjective stem to be substituted by more complex constructions, like the ones described above. Special subnets exist for adverbs and for adjectives with non-German stems. The latter is needed for marking foreign suffixes as in &lt;=parall_el&gt;, because these suffixes attract the word accent, which in German causes a vowel to be pronounced long.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Nouns
</SectionTitle>
      <Paragraph position="0"> A very productive feature of German word formation in the area of nouns is composition: New nouns may be formed by concatenation of lexical morphs, optionally interspersed with prefixes, suffixes, and infixes. In our noun net, this feature is modelled by loops over lexical morphs which can be left by inflexion modules to reach a final state and which cross infix modules (including zero-infix), prefix modules, and (derivational) suffix modules.</Paragraph>
      <Paragraph position="1"> We treat inflexional suffixes occurring in compound nouns between lexical morphs in the same way as infixes.</Paragraph>
      <Paragraph position="2"> Noun stems are classified according to the features umlaut, etymology (German vs not German), obligatory affix, inflexional suffix, and composition needs.</Paragraph>
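      The loop over lexical morphs can be sketched as follows. This is a deliberately reduced illustration, not the noun net itself: the stem, infix, and ending lists are hypothetical toy data, prefix and derivational suffix modules are omitted, and no morph classes are checked. It shows only how a compound is consumed by repeating stem arcs, crossing an infix module (possibly the zero infix) between stems, and leaving the loop through an inflexion module.

      ```python
      # Toy sketch of the noun-net loop; all morph lists are hypothetical.
      STEMS = {"arbeit", "zeit", "haus", "tor"}
      INFIXES = {"", "s", "es", "n"}    # "" models the zero infix
      ENDINGS = {"", "en", "es", "s"}   # "" models the absence of an ending

      def segment_noun(word):
          """Return all segmentations of word as lists of (kind, morph) pairs."""
          results = []

          def walk(rest, path):
              for stem in sorted(STEMS):
                  if not rest.startswith(stem):
                      continue
                  tail = rest[len(stem):]
                  here = path + [("stem", stem)]
                  # Leave the loop through an inflexion module to the final state.
                  if tail in ENDINGS:
                      results.append(here + ([("ending", tail)] if tail else []))
                  # Or cross an infix module and loop back for another stem.
                  for infix in sorted(INFIXES):
                      if tail.startswith(infix) and len(tail) > len(infix):
                          step = [("infix", infix)] if infix else []
                          walk(tail[len(infix):], here + step)

          walk(word.lower(), [])
          return results

      print(segment_noun("Arbeitszeit"))
      print(segment_noun("Haustor"))
      ```

      A word for which the function returns an empty list has no path to a final state, which corresponds to a rejection rather than a forced segmentation.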
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Foreign Words
</SectionTitle>
      <Paragraph position="0"> Each of the described nets contains subnets dealing with the formation of foreign words which are involved in German word formation, e.g. &lt;=mum.if iz%ier%en&gt; (engl. to mummify), &lt;=Bas is&gt; (engl. basis), &lt;=Port~ier&gt; (French; engl. porter).</Paragraph>
      <Paragraph position="1"> Foreign words without connection to German word formation, as well as names, are not intended to be segmented by our system. So an unsegmented word is not necessarily a system failure but can be a required rejection (cf.</Paragraph>
    </Section>
  </Section>
</Paper>