<?xml version="1.0" standalone="yes"?> <Paper uid="W98-0609"> <Title>Automatic Collection and Analysis of German Compounds</Title> <Section position="3" start_page="0" end_page="61" type="metho"> <SectionTitle> 2. The challenge of compounds </SectionTitle> <Paragraph position="0"> In general, the analyst cannot know whether a given language forms its compounds with fully inflected words or with stems (that is, inflected words minus the inflectional suffix), but the latter is by far the most common pattern. The challenge, then, is to determine whether an analysis of the non-compound words in a corpus will give rise to a sufficient inventory of stems (in the correct surface form, so to speak) so that actual compounds found in the corpus can be identified as the concatenation of two such stems, possibly separated by a linker element chosen from a very small inventory. At the same time, it is critical that the analysis not over-recognize compounds, that is, that it not &quot;recognize&quot; compounds that are not there. This error will typically arise if there exist true stems that are homographs of suffixes, or of subparts of suffixes. We have labelled this problem the Schweinerei problem (from Schweinerei &quot;mess&quot; [lit., pig + erei nominal derivational suffix]) because the word can be misanalyzed as a compound incorporating the linker er and the</Paragraph> <Section position="1" start_page="0" end_page="61" type="sub_section"> <SectionTitle> Right Element Ei &quot;egg&quot;. </SectionTitle> <Paragraph position="0"> In addition, the challenge of identifying compounds raises the question of whether there is a clear distinction to be drawn (in German, and in other languages as well) between a (prefix + stem) structure and a compound (stem + stem) structure. 
Duden 1995, for example, characterizes one use of Haupt &quot;head&quot; as a prefix (e.g., in Hauptstadt &quot;capital&quot;), based, presumably, on the semantic bleaching that often accompanies long-standing use of a word in various compounds. English has similar uses of the stem head, with cases ranging from head teacher, written with a space and in which the element head contributes a very clear meaning even though it has almost nothing to do with the original sense of head, all the way to headline, where the meaning of the word is barely, if at all, decomposable into two parts. In our work we have employed the definition of affix that is integrated into our automatic morphological analyzer, which is the following: after establishing a tentative set of candidate affixes, the analyzer identifies, for each given stem, the set of affixes that occurs with it (a distinct set of prefixes and suffixes). If exactly the same set (of two or more suffixes) is used by two or more stems, then that set of affixes is &quot;approved&quot;, and the affixes are definitively identified as affixes (rather than as compounds, for example).</Paragraph> </Section> </Section> <Section position="4" start_page="61" end_page="61" type="metho"> <SectionTitle> 3. 
The challenge of German compounds </SectionTitle> <Paragraph position="0"> Compounding in German is common, ranging from the very frequent formation of compound nouns to the less common but also productive formation of compound verbs and adjectives. 2 Multisegmented compounds, such as Anwendungsprogrammschnittstelle &quot;applications program interface&quot;, can be viewed as recursively applied binary compounds.</Paragraph> <Paragraph position="2"> We will refer to the element on the left of such a binary structure as the Left Element, the element on the right as the Right Element, and the sequence of linking characters used to join the Left Element and Right Element as the Linker. 3</Paragraph> </Section> <Section position="5" start_page="61" end_page="62" type="metho"> <SectionTitle> 2 See Duden 1995 3 We use this linguistically neutral terminology in </SectionTitle> <Paragraph position="0"> order to emphasize the automatic, concatenative nature of the text processing described here. In general, for noun-noun compounds, Left Element, In our example, the Linker s joins Anwendung and Programm, whilst the null Linker joins Anwendungsprogramm and Schnittstelle.</Paragraph> <Paragraph position="1"> In German, the Linkers are e, es, en, er, n, ens, ns, s, and the zero morpheme null. In general, the Left Element, Linker, and Right Element are simply concatenated (Bewegung &quot;movement&quot; + s + Achse &quot;axis&quot; = Bewegungsachse &quot;axis of rotation&quot;), although the Left Element is occasionally umlauted (Huhn &quot;hen&quot; + er + Ei &quot;egg&quot; = Hühnerei &quot;hen's egg&quot;). 4 A hyphen can be used to emphasize the point of linkage between the Left Element + Linker and the Right Element. This effectively doubles the number of Linkers we consider, i.e. we add (e-, es-, en-, er-, n-, ens-, ns-, s-, and -) to our list. 
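As an illustration only, the Linker inventory just listed can be written down in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the work described here: the names LINKERS, HYPHENATED_LINKERS, ALL_LINKERS and join_compound are hypothetical, stems are written in lower case, and umlauting of the Left Element is not modelled.

```python
# Illustrative sketch only; every name below is hypothetical.
# The nine Linkers listed in the text; "" stands for the null Linker.
LINKERS = ("e", "es", "en", "er", "n", "ens", "ns", "s", "")

# Each Linker also has a hyphenated variant, which doubles the inventory.
HYPHENATED_LINKERS = tuple(linker + "-" for linker in LINKERS)

ALL_LINKERS = LINKERS + HYPHENATED_LINKERS


def join_compound(left: str, linker: str, right: str) -> str:
    """Concatenate Left Element, Linker and Right Element (lower-case stems)."""
    if linker not in ALL_LINKERS:
        raise ValueError(f"unknown Linker: {linker!r}")
    return left + linker + right


print(len(ALL_LINKERS))
print(join_compound("bewegung", "s", "achse"))
```

The second call reproduces the Bewegung + s + Achse example from the text in lower-case form.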
Duden 1995 reports that the hyphen is prescribed if the Left Element is an abbreviation and is generally present if the Left Element is a proper name; otherwise, it is generally employed to improve readability or to emphasize the individual components of the compound. Our actual results confirm some of these guidelines but also yield data that seem not to be covered by them. The leading hyphenated Left Elements in our data, for example, are (in order): US-, Tang-, and Ballett-. Ballett is neither an abbreviation nor a proper name, nor does it seem to lead to especially unreadable compounds; nevertheless, it is near the top of the list.</Paragraph> <Paragraph position="2"> If the Left Element ends in the suffix -e or -en, this suffix is sometimes dropped (Schule &quot;school&quot; + Kind &quot;child&quot; = Schulkind &quot;school-age child&quot;). 5 But there is another view of compounding in which no subtraction occurs.</Paragraph> <Paragraph position="3"> Rather, the form without the -e or -en (e.g.</Paragraph> <Paragraph position="4"> Linker, and Right Element correspond to the German terms Bestimmungswort, Fugenelement, and Grundwort, or to the English terms determinant, connecting morpheme, and head.</Paragraph> <Paragraph position="5"> 4 Umlauting of the Left Element (e.g.</Paragraph> <Paragraph position="6"> Land+Spiel=Länderspiel) can occur in conjunction with the null linker, the Linker e, and the Linker er. In these cases, the resulting form coincides orthographically with the plural form, but is not necessarily semantically motivated as a plural; see e.g. Duden 1995.</Paragraph> <Paragraph position="7"> 5 Žepić 1970, borrowing from Charles Hockett, refers to these as subtractive morphs.</Paragraph> <Paragraph position="8"> schul) is the stem. 6 Our corpus processing returns such suffixless stems. Furthermore, the stems returned by corpus processing can contain umlauts. 
In our task of automatically assigning a Linker distribution to lexicalized nouns, we simply have to add the -e or -en suffix and/or deumlaut the stem to find the lexicalized noun for which we wish to determine a distribution of Linkers (schul -> schule; länd -> land).</Paragraph> <Paragraph position="9"> In general, the choice of a Linker (as well as umlauting and desuffixing) is determined by the Left Element. 7 Part-of-speech combinations of the Left Element and Right Element include noun-noun, noun-verb, verb-noun, adjective-noun, noun-adjective, etc. In this paper we are only concerned with noun-noun compounds, i.e. ones whose Left Element and Right Element are both lexicalized nouns. Non-nominal Left Elements exhibit fairly trivial Linker distributions. 8 Previous studies of automatic treatment of German compounds have not dealt with the treatment of the Linker element. Geutner 1995 describes the effect on a speech recognition system of recognizing compounding in German as a productive and significant process.</Paragraph> <Paragraph position="10"> He notes that treatment of compounds removes a substantial part of the persistent out-of-vocabulary (OOV) problem, compounding being a major reason why OOV is more significant in German than in English. Berton et al. 1996 also describe work 6 This view is strongly linguistically motivated. Recognizing schul as a stem, for example, illustrates the relationship between Schule and schulen.</Paragraph> <Paragraph position="11"> Similarly, treating fried as a stem motivates Frieden, friedlich, befriedigen, etc.</Paragraph> </Section> <Section position="6" start_page="62" end_page="62" type="metho"> <SectionTitle> 7 Some Left Elements govern multiple linking </SectionTitle> <Paragraph position="0"> sequences. Consider, for example, Tag-e-buch &quot;day + book = diary&quot; vs. Tag-es-themen &quot;day + topics = news items&quot;, which share the Left Element Tag &quot;day&quot;. 
This is why we wish to calculate a Linker distribution, not just a single Linker, for each noun used as a Left Element.</Paragraph> <Paragraph position="1"> 8 For verbs, the bare stem, i.e. the form without the infinitival -(e)n suffix, is used with the null Linker, e.g. sprechen + Stunde = Sprechstunde. Adjectives are generally used as Left Elements in their uninflected positive form (Rotkehlchen) and occasionally in the superlative form (see e.g. Duden 1995).</Paragraph> <Paragraph position="2"> aimed at improving OOV responses of a speech recognition system by allowing the language model to include compounds. Results of that experiment showed that in the context of speech recognition, the addition of compounding (along with the removal of the compounds from the lexicon) could decrease the performance of the system, especially in the case where the compound was of high frequency, and the case where one of the compounds was phonologically short.</Paragraph> <Paragraph position="3"> Our goals were formulated in the context of a system which must be equally robust in analysis and generation; furthermore, we set out to obtain information that could be placed in our lexicon, but the analysis of compounds that we used did not need to be performed in real time together with a user's speech or keyboard input. On the other hand, we set quite stringent targets for the correctness of the materials that we obtain.</Paragraph> </Section> <Section position="7" start_page="62" end_page="63" type="metho"> <SectionTitle> 4. Linker distributions </SectionTitle> <Paragraph position="0"> To overcome the out-of-vocabulary problem, German natural language processing systems must accommodate compounds. Encoding in the lexicon, for each noun, a statistical distribution of the Linkers governed by that noun when it is used as a Left Element provides the requisite lexical support. 
9 This information is critical for the generation of compound words and can increase the precision of compound analysis. We believe that this lexical approach is preferable to a rule-driven one both for computational efficiency and because the rules governing the selection of a Linker are tempered by such wide-ranging factors as gender, word length, phonology, diachrony, and dialectal variation 10 and are fraught with exceptions.</Paragraph> <Paragraph position="1"> Our broad-coverage German natural language processing system includes a lexicon with over 140,000 entries, including approximately 100,000 nouns, none of which contained Linker distribution information prior to our 9 For example, if in an examined corpus the noun Staat were used 96 times with the Linker s and 12 times with the Linker en, we would calculate the distribution ( p(-s)=0.89; p(-en)=0.11 ).</Paragraph> <Paragraph position="2"> undertaking. Our goal was to identify stems and suffixes in a large German corpus, then post-process the results to yield Linker distributions for a large number of nouns in our lexicon. This goal was largely met. Both the stem/suffix identification and the subsequent post-processing were implemented to run fully automatically, so that the process can be applied to an arbitrarily large corpus, yielding distributions for a maximal number of lexicalized nouns.</Paragraph> </Section> <Section position="8" start_page="63" end_page="63" type="metho"> <SectionTitle> 5. 
Procedures </SectionTitle> <Paragraph position="0"> We now summarize the steps involved in first morphologically processing a corpus to detect stems and suffixes, then using the stem/suffix information to find compounds, and finally post-processing the compound list to calculate Linker distributions for the nouns used as Left Elements.</Paragraph> <Paragraph position="1"> Since the object of our inquiry has been noun-noun compounds, and since German nouns are capitalized, we restricted our processing to words in the corpus beginning with a capital letter. We therefore first applied our automatic morphological analyzer to the first 300,000 capitalized words in Microsoft's Encarta, an encyclopedia, to establish a list of 8,426 noun stems. These are identified by first automatically extracting the productive suffixes in the corpus; 74 were identified, in frequency dominated by the top six suffixes (en, e, er, s, ung, n; see Table 1). 11 When the algorithm identifies two distinct words as composed of the same stem followed by different suffixes, it accepts that stem as legitimate. For example, the string beobacht (stem for &quot;watch&quot;) is identified as a stem because it appears in the corpus with the following five suffixes: -en/-er/-ers/-ung/-ungen. In addition, if a potential stem occurs as a free-standing word, we consider that to count as an appearance of the stem with a null suffix. For example, the stem Alaska &quot;Alaska&quot; appears with 11 We note that four &quot;suffixes&quot; identified by this procedure are in fact from compounds: -land, -szentrum, -produktion, and -sgebiet. Given our algorithm for determining suffixes, it follows that such errors will occur less often as we move to larger corpora. In addition, these spurious suffixes are also classified as stems.</Paragraph> <Paragraph position="2"> three &quot;suffixes&quot;: -s, -n, and Null. 
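The stem acceptance rule described above (a string counts as a stem once it is seen with two different suffixes, a free-standing occurrence counting as the null suffix) can be sketched as follows. This is a simplified illustration under stated assumptions and not our analyzer: the function name find_stems, the short SUFFIXES tuple and the toy word list are hypothetical, the suffix list is taken as given rather than discovered from the corpus, and words are lower-cased.

```python
# Illustrative sketch only; names, suffix list and word list are hypothetical.
from collections import defaultdict

# A small, assumed subset of the suffix inventory.
SUFFIXES = ("en", "e", "er", "ers", "s", "ung", "ungen", "n")


def find_stems(words, suffixes=SUFFIXES):
    """Map each accepted stem to the set of suffixes it was seen with.

    A stem is accepted when it occurs with at least two distinct
    suffixes; a free-standing word counts as stem + null suffix ("").
    """
    seen = defaultdict(set)
    vocabulary = set(words)
    for word in vocabulary:
        seen[word].add("")  # free-standing occurrence = null suffix
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                seen[word[: len(word) - len(suffix)]].add(suffix)
    return {stem: found for stem, found in seen.items() if len(found) >= 2}


toy_words = [
    "alaska", "alaskas", "alaskan",
    "beobachter", "beobachters", "beobachtung", "beobachtungen",
]
stems = find_stems(toy_words)
print(sorted(stems["alaska"]))
print(sorted(stems["beobacht"]))
```

On this toy list the sketch accepts alaska (null, -n, -s) and beobacht (with four of its suffixes), and it also accepts beobachter and beobachtung, which illustrates how derived or compound forms can themselves enter the stem list.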
Thus any freestanding word which also appears with at least one (independently determined) suffix counts as a stem for our purposes. See Table 2.</Paragraph> <Paragraph position="3"> Table 2 illustrates the fact that this procedure includes in our list of stems noun compounds that are found in the corpus with more than one suffix. This is not a problem, and in fact is desirable, because, as we noted above, compounds are frequently recursively composed out of pieces which are themselves compounds.</Paragraph> <Paragraph position="4"> With this list of stems in hand, we revisit the original corpus, now checking each entry for the possibility of one or more parses as compounds.</Paragraph> <Paragraph position="5"> Given the set of linkers (established in advance, as we have noted), we can very simply review each word to see if it can be parsed as the concatenation of an item from the list of stems + one of the linkers + another item from the list of stems + one of the 74 recognized suffixes. All forms that can be so parsed are added to a list of compounds found; in our corpus, we found 5522 compounds, based on 3866 distinct Left Element stems. For each distinct Left Element stem, we produce a record of the form: ( Left Stem, Linker { Exemplar_1, Exemplar_2, ..., Exemplar_n } ) where each Exemplar is the Right Element of a compound, and is itself of the form ( Stem + Suffix ).</Paragraph> <Paragraph position="6"> Next, the compounds are filtered so that they only include unambiguous noun-noun compounds. This filtering process is described in the following section. Finally, the filtered set of data is used to calculate a distribution of Linker governance for each surviving Left Stem.</Paragraph> </Section> <Section position="9" start_page="63" end_page="65" type="metho"> <SectionTitle> 6. 
Filtering </SectionTitle> <Paragraph position="0"> In a compound such as Anwendungsprogramme (anwendung + s + programm + e), we call a (Left Stem + Linker) pair such as (anwendung + s) a candidate, while a (Right Stem + Suffix) pair like (programm + e) is called an exemplar.</Paragraph> <Paragraph position="1"> Thus, our set of compounds is logically of the form:</Paragraph> <Paragraph position="3"> For example, if the corpus contains Anwendungsprogramm &quot;applications program&quot; and Anwendungsprogramme &quot;applications programs&quot;, then we would have the item</Paragraph> <Paragraph position="5"> Since our specific goal is to produce Linker distribution information for nouns used as the Left Element in noun-noun compounds, we must now filter this raw data so that we end up with candidates and associated exemplars that are unambiguously involved in noun-noun compounding. This filtering process is now described.</Paragraph> <Paragraph position="6"> In order to calculate meaningful linker distributions, the raw data must first be passed through a series of simple filters.</Paragraph> <Paragraph position="7"> Step 1. Left stems which are not the stems of lexicalized nouns are excluded. The stem and the lexicalized word may differ with regard to umlauting, and in addition the lexicalized word may contain the -e/-en suffix. For example, the left stems schul and land correspond to the lexical entries Schule and Land, and are thus not excluded. But this step does properly exclude e.g. the candidate ab + null, since ab is not a noun, thereby ruling out compound parses of words like Abzug and Abbildung.</Paragraph> <Paragraph position="8"> Step 2. Left stems with multiple parts of speech are excluded. For example, gut can be an adjective (&quot;good&quot;) or a noun (&quot;property&quot;). Since German compounds can be built with e.g. 
a verb or adjective as the Left Element, we cannot automatically determine whether a compound starting with the Left Element gut is combining the adjective or the noun. We therefore eliminate the candidate gut + null. 12 A special instance of excluding multiple parts of speech is the case of verb stems. When a verb is used as the Left Element of a compound, the verb stem, i.e. the infinitive without the final (e)n, is used. This leads to a number of ambiguous Left Elements such as blut (noun Blut = &quot;blood&quot;; verb bluten = &quot;bleed&quot;) and block (noun Block = &quot;block&quot;; verb blocken = &quot;block&quot;), which are excluded, since it cannot be automatically determined whether the compounding is based upon the verb stem or the homographic noun.</Paragraph> <Paragraph position="9"> Step 3. Cases in which the division between the Left Stem and the Linker is ambiguous are 12 These, and other ambiguous cases, are logged to a file for possible later manual review.</Paragraph> <Paragraph position="10"> excluded. For example, the candidate mark &quot;mark&quot; + en, with exemplars such as Weltmeister+schaft &quot;world championship&quot; and nam+e &quot;name&quot;, is excluded, since there is an alternate division: marke &quot;brand&quot; + n. 13 Step 4. Combinations of Left Stem and Linker in which the final character of the Left Stem and the initial character of the Linker are identical are excluded.</Paragraph> <Paragraph position="11"> This is for phonological reasons, and applies both to vowels and consonants. Thus, the candidate boden + n with the exemplar ester + null is properly rejected, as is industrie &quot;industry&quot; + er, with exemplars like (zeugnisse, null). 14 These first four filters remove invalid and/or ambiguous candidates; next, a few more filters are applied to remove invalid and/or ambiguous exemplars. 
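The two purely string-based candidate checks, Steps 3 and 4, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than our implementation: the function names and the toy stem set are hypothetical, only the unhyphenated Linkers are considered, and Step 3 is applied symmetrically, so that both competing divisions are flagged.

```python
# Illustrative sketch only; function names and the toy stem set are hypothetical.
LINKERS = ("e", "es", "en", "er", "n", "ens", "ns", "s", "")


def has_alternate_division(stem, linker, noun_stems):
    """Step 3: can stem+linker also be split as another stem + another Linker?"""
    surface = stem + linker
    for other in LINKERS:
        if other == linker or not surface.endswith(other):
            continue
        alternate_stem = surface[: len(surface) - len(other)]
        if alternate_stem in noun_stems:
            return True
    return False


def shares_boundary_character(stem, linker):
    """Step 4: does the Left Stem end in the character that begins the Linker?"""
    return bool(stem) and bool(linker) and stem[-1] == linker[0]


def keep_candidate(stem, linker, noun_stems):
    """A candidate survives only if neither check fires."""
    return not (
        has_alternate_division(stem, linker, noun_stems)
        or shares_boundary_character(stem, linker)
    )


toy_stems = {"mark", "marke", "boden", "industrie", "anwendung"}
print(keep_candidate("mark", "en", toy_stems))       # rejected by Step 3
print(keep_candidate("boden", "n", toy_stems))       # rejected by Step 4
print(keep_candidate("anwendung", "s", toy_stems))   # survives both checks
```

Steps 1 and 2 are not sketched here because they depend on lexicon lookups (part of speech, lexicalized noun entries) rather than on the strings alone.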
If this filtering of exemplars results in a candidate being left with no valid exemplars, then the candidate is of course removed from the list.</Paragraph> <Paragraph position="12"> Step 5. Exemplars whose stem is not a lexicalized noun are excluded. This is a reasonable filtering step, since we are interested in noun-noun compounds. The exemplar bella + null (associated with the candidate Ara &quot;parrot&quot; + null), derived from the compound Arabella, for example, is excluded in this step.</Paragraph> <Paragraph position="13"> Step 6. Exemplars in which the division between the Stem and the Suffix is ambiguous are excluded. For example, the exemplar kamm &quot;comb&quot; + er (associated e.g. with the candidate architekt &quot;architect&quot; + en) is ambiguous with the exemplar kammer &quot;chamber&quot; + null, and is therefore excluded.</Paragraph> <Paragraph position="14"> Step 7. Cases in which the division between the Linker and the Stem of the exemplar is ambiguous are excluded. Consider the candidate Abfall &quot;trash&quot; + er, associated with the exemplar fassung 13 In this example, the alternate division is the linguistically motivated one.</Paragraph> <Paragraph position="15"> 14 The proper parse of the compound Industrieerzeugnisse is Industrie+null+erzeugnis+se &quot;industry products&quot;, not Industrie+er+zeugnis+se &quot;*industry certificates&quot;. Similarly, Bodennester is parsed Boden+null+nest+er &quot;ground nests&quot;, not Boden+n+ester+null &quot;ground ester&quot;. Note that excluding the candidates industrie+er and boden+n does not affect the candidates industrie+null and boden+null.</Paragraph> <Paragraph position="16"> &quot;fixture&quot; + null. The exemplar is excluded, since there is an alternate division of linker and stem: abfall &quot;trash&quot; + null, with the exemplar erfassung &quot;acquisition&quot; + null. Another example of this kind of ambiguity is Blut-s-tau vs. 
Blut-stau, that is, Blut &quot;blood&quot; + s associated with Tau &quot;dew&quot; + null as against Blut &quot;blood&quot; + null associated with Stau &quot;congestion&quot; + null.</Paragraph> <Paragraph position="17"> Step 8. Cases in which the entire compound, i.e. candidate plus exemplar, is lexicalized are excluded. For example, there is a candidate Ara &quot;parrot&quot; + null associated with the exemplar Rat &quot;council&quot; + null. The exemplar is excluded, however, since the candidate plus the exemplar yields Ararat &quot;Ararat&quot;, which is lexicalized. A small amount of noise survives the filtering process. For example, the Linker ns is improperly included in the linker distribution of the noun Ar, based on the proper noun Arnsberg, which resembles a compound noun: Ar-ns-berg.</Paragraph> <Paragraph position="18"> This minimal amount of noise is further reduced by thresholding: any candidate (Left Element + Linker) for which there is only one remaining exemplar does not contribute to the distribution. After this final filtering, the surviving (Left Element + Linker) candidates and their associated surviving exemplars are used to calculate linker distributions for each Left Element.</Paragraph> <Paragraph position="19"> Of the 8,49_6 candidates entering the filtering and thresholding process, 1361 survive.</Paragraph> <Paragraph position="20"> Of these, 20 share a common Left Element with another candidate 15; thus we are able to calculate a Linker distribution for 1341 lexicalized nouns.</Paragraph> </Section> <Section position="10" start_page="65" end_page="65" type="metho"> <SectionTitle> 7. Linker Distributions </SectionTitle> <Paragraph position="0"> The filtering described in the previous section yields a set of reliable candidates and exemplars for noun-noun compounding. 
For example,</Paragraph> <Paragraph position="2"> process.</Paragraph> <Paragraph position="3"> Based on these vetted candidates and exemplars, we now calculate a Linker governance distribution for lexicalized nouns used as the Left Element of a noun-noun compound.</Paragraph> <Paragraph position="4"> 15 For example, the candidates Stand+null and Stand+es share the Left Stem Stand.</Paragraph> <Paragraph position="5"> First, within each set of exemplars associated with a given candidate, we collapse exemplars that share a common stem. In our example, the exemplar (programm + e) is removed, since the exemplar (programm + null) is also associated with the candidate (anwendung + s).</Paragraph> <Paragraph position="6"> Next, for each Left Stem, we simply tally the total number T of exemplars associated with that Left Stem. Then, for each Linker associated with the Left Stem, we calculate its probability by tallying the number of exemplars associated with the candidate (Left Stem + Linker), then dividing by T.</Paragraph> <Paragraph position="7"> We wish to incorporate this data into our lexicon as follows. For each noun entry N, derive the distribution D(N) of Linkers governed by N. 16 For example, for the entry Staat, the distribution ( en = 0.11; s = 0.89 ) is calculated.</Paragraph> </Section> </Paper>