<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1065">
<Title>Disambiguation of Morphological Structure using a PCFG</Title>
<Section position="4" start_page="516" end_page="517" type="metho">
<SectionTitle> 3 SMOR </SectionTitle>
<Paragraph position="0"> SMOR (Schmid et al., 2004) is a German FST-based morphological analyzer which covers inflection, compounding, and both prefix and suffix derivation. It builds on earlier work reported in (Schiller, 1996) and (Schmid et al., 2001).</Paragraph>
<Paragraph position="1"> SMOR uses features to represent derivation constraints. German derivational suffixes select their base in terms of the part of speech, the stem type (derivation or compounding stem), the origin (native, classical, foreign), and the structure (simplex, compound, prefix derivation, suffix derivation) of the stem with which they combine. This information is encoded with features. The German derivational suffix -lich, for example, combines with a simplex derivation stem of a native noun to form an adjective. The feature constraints of -lich are therefore (1) part of speech = NN, (2) stem type = deriv, (3) origin = native, and (4) structure = simplex.</Paragraph>
</Section>
<Section position="5" start_page="517" end_page="517" type="metho">
<SectionTitle> 4 The Grammar </SectionTitle>
<Paragraph position="0"> The grammar used by the morphological disambiguator has a small set of rather general categories for prefixes (P), suffixes (S), uninflected base stems (B), uninflected base suffixes (SB), inflectional endings (F), and other morphemes (W). There is only one rule each for compounding, prefix derivation, and suffix derivation, and two rules for stem and suffix inflection. Additional rules introduce the start symbol TOP and generate special word forms such as hyphenated (Thomas-Mann-Strasse) or truncated words (Vor-). Overall, the base grammar has 13 rules. Inflection is always attached low in order to avoid spurious ambiguities. The part of speech is encoded as a feature.</Paragraph>
<Paragraph position="1"> Like SMOR, the grammar encodes derivation constraints with features. Number, gender, and case are not encoded. Ambiguities in the agreement features are therefore not reflected in the parses which the grammar generates. This allows us to abstract away from this type of ambiguity, which cannot be resolved without contextual information. If some application requires agreement information, it has to be reinserted after disambiguation.</Paragraph>
<Paragraph position="2"> The feature grammar is compiled into a context-free grammar with 1973 rules. In order to reduce the grammar size, the features for origin and complexity were not compiled out. Figure 3 shows a compounding rule (building a noun base stem from a noun compounding stem and a noun base stem), a suffix derivation rule (building an adjective base stem from a noun derivation stem and a derivation suffix), a prefix derivation rule (prefixing a verbal compounding stem), and two inflection rules (for the inflection of a noun and a nominal derivation suffix, respectively) from the resulting grammar. The quote symbol marks the head of a rule.</Paragraph>
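The following Python sketch (an illustration only, not the implementation used in the paper) shows how a feature-annotated rule such as the suffix derivation with -lich might be compiled out into plain context-free rules; the function names, category spellings, and feature inventory are hypothetical, while the four constraint values for -lich follow the description in Section 3.

from itertools import product

# Hypothetical feature inventory, loosely following the four constraints
# described for the suffix -lich (part of speech, stem type, origin, structure).
FEATURES = {
    "pos": ["NN", "ADJ", "V"],
    "stemtype": ["deriv", "compound", "base"],
    "origin": ["native", "classical", "foreign"],
    "structure": ["simplex", "compound", "prefderiv", "suffderiv"],
}

def compile_out(lhs, rhs_head, rhs_rest, constraints):
    """Expand one feature-annotated rule into plain CFG rules by enumerating
    every value of the features left unconstrained (a toy version of
    compiling a feature grammar into a context-free grammar)."""
    free = [f for f in FEATURES if f not in constraints]
    rules = []
    for values in product(*(FEATURES[f] for f in free)):
        feats = {**constraints, **dict(zip(free, values))}
        cat = rhs_head + "." + ".".join(feats[f] for f in FEATURES)
        rules.append((lhs, (cat,) + rhs_rest))
    return rules

# Suffix derivation with -lich: all four features are constrained, so exactly
# one context-free rule results; every unconstrained feature would multiply
# the number of compiled rules.
print(compile_out("B.ADJ", "W", ("S.lich",),
                  {"pos": "NN", "stemtype": "deriv",
                   "origin": "native", "structure": "simplex"}))

Leaving the origin and complexity features uncompiled, as described above, limits exactly this kind of multiplicative blow-up in the number of rules.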
<Paragraph position="3"> The parser retrieves the categories of the morphemes from a lexicon which also contains information about the standard form of a morpheme.</Paragraph>
<Paragraph position="4"> The representation of the morphemes returned by the FST-based word splitter is close to the surface form. Only the capitalization is taken over from the standard form. The adjective ursächlich (causal), for instance, is split into Ursäch and lich. The lexicon assigns to Ursäch the category W.NN.deriv and the standard form Ursache (cause).</Paragraph>
</Section>
<Section position="6" start_page="517" end_page="517" type="metho">
<SectionTitle> 5 PCFG Training </SectionTitle>
<Paragraph position="0"> PCFG training normally requires manually annotated training data. Because a treebank of German morphological analyses was not available, we decided to try unsupervised training using the Inside-Outside algorithm (Lari and Young, 1990).</Paragraph>
<Paragraph position="1"> We worked with unlexicalized as well as head-lexicalized PCFGs (Carroll and Rooth, 1998; Charniak, 1997). The lexicalized models used the standard form of the morphemes (see the previous section) as heads.</Paragraph>
<Paragraph position="2"> The word list from a German 300-million-word newspaper corpus was used as training data. Of the 3.2 million tokens in the word list, SMOR successfully analyzed 2.3 million, which were used in the experiment. Training was either type-based (with each word form having the same weight) or token-based (with weights proportional to the frequency). We experimented with uniform and non-uniform initial distributions. In the uniform model, each rule had an initial frequency of 1 from which the probabilities were estimated. In the non-uniform model, the frequency of two classes of rules was increased to 1000. The first class consists of the rules which expand the start symbol TOP to an adjective or adverb, leading to a preference for these word classes over other word classes, in particular verbs. The second class is formed by the rules generating inflectional endings, which induces a preference for simpler analyses.</Paragraph>
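As a rough illustration of this setup (not the authors' code; the rule inventory, function names, and toy counts are hypothetical, while the boost value of 1000 follows the description above), the following Python sketch shows the non-uniform initialization of the rule probabilities and the difference between type-based and token-based weighting of the training word forms.

from collections import defaultdict

def initial_probabilities(rules, boosted, boost=1000.0):
    """Assign an initial frequency of 1 to every rule, raise it to `boost`
    for the favoured rules (TOP expansions to adjectives/adverbs and the
    rules generating inflectional endings), and normalize per left-hand
    side to obtain the starting distribution for Inside-Outside training."""
    freq = {r: (boost if r in boosted else 1.0) for r in rules}
    total = defaultdict(float)
    for (lhs, _), f in freq.items():
        total[lhs] += f
    return {(lhs, rhs): f / total[lhs] for (lhs, rhs), f in freq.items()}

def training_weights(word_counts, token_based):
    """Type-based training gives every analyzed word form the same weight;
    token-based training weights it by its corpus frequency."""
    return {w: (float(c) if token_based else 1.0) for w, c in word_counts.items()}

# Toy rule set with hypothetical category names.
rules = [("TOP", ("ADJ",)), ("TOP", ("ADV",)), ("TOP", ("NN",)), ("TOP", ("V",)),
         ("NN", ("B", "F")), ("F", ("<nn-infl>",))]
boosted = {("TOP", ("ADJ",)), ("TOP", ("ADV",)), ("F", ("<nn-infl>",))}

p0 = initial_probabilities(rules, boosted)                        # non-uniform start model
w_type = training_weights({"ursächlich": 17}, token_based=False)  # toy count; weight 1.0
w_token = training_weights({"ursächlich": 17}, token_based=True)  # toy count; weight 17.0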
</Section>
<Section position="7" start_page="517" end_page="518" type="metho">
<SectionTitle> 6 Test Data </SectionTitle>
<Paragraph position="0"> The test data was extracted from a corpus of the German newspaper Die Zeit which was not part of the training corpus. We prepared two different test corpora. The first test corpus (data1) consisted of 425 words extracted from a randomly selected part of the corpus. We only extracted words containing at least one letter which were ambiguous (ignoring ambiguities in number, gender, and case), which were nouns, verbs, or adjectives, and which did not occur at the beginning of a sentence. Duplicates were retained. The words were parsed and manually disambiguated. We looked at the context of a word where this was necessary for disambiguation. Words without a correct analysis were deleted.</Paragraph>
<Paragraph position="1"> In order to obtain more information on the types of ambiguity and their frequency, 200 words were manually classified with respect to the class of the ambiguity. The following results were obtained:
* 39 words (25%) were ambiguous between an adjective and a verb, like gerecht - "just" (adjective) vs. the past participle of rechen (to rake).
* 28 words (18%) were ambiguous between a noun and a proper name, like Mann - "man" vs. Thomas Mann.
* 19 words were ambiguous between an adjective and an adverb, like gerade - "straight" vs. "just" (adverb).
* 14 words (9%) showed a complex ambiguity involving derivation and compounding, like the word Überlieferung (tradition), which is either a nominalization of the prefix verb überliefern (to bequeath) or a compound of the stems über (over) and Lieferung (delivery).
* 13 words (8%) were compounds which were ambiguous between a left-branching and a right-branching structure, like Weltrekordhöhe (world record height).
* In 10 words (5%), there was an ambiguity between an adjective and a proper name or noun stem, as in Höchstleistung (maximum performance), where höchst can be derived from the proper name Höchst (a German city) or the superlative höchst (highest).
* 6 words (3%) showed a systematic ambiguity between an adjective and a noun caused by adding the suffix -er to a city name, like Moskauer - "related to Moskau" vs. "person from Moskau".
* Another 6 words were ambiguous between two different noun stems, like Halle, which is either the singular form of Halle (hall) or the plural form of Hall (reverberation).
Overall, 50% of the ambiguities involved a part-of-speech ambiguity.</Paragraph>
<Paragraph position="2"> The second set of test data (data2) was designed to contain only infrequent words which were not ambiguous with respect to part of speech. It was extracted from the same newspaper corpus. Here, we excluded words which were (1) sentence-initial (in order to avoid problems with capitalized words), (2) not analyzed by SMOR, (3) ambiguous with respect to part of speech, (4) from closed word classes, or (5) simplex words. Furthermore, we extracted only words with more than one simplest analysis, in order to make the test data more challenging. The extracted words were sorted by frequency and a block of 1000 word forms was randomly selected from the lower frequency range. All of them had occurred 4 times. We focused on rare words because frequent words are better disambiguated manually and stored in a table (see the discussion in the introduction).</Paragraph>
<Paragraph position="3"> The 1000 selected word forms were parsed and manually disambiguated. 193 problematic words were deleted from the evaluation set because either (1) no analysis was correct (e.g. Elsevier, which was not analyzed as a proper name), (2) there was a true ambiguity (e.g. Rottweiler, which is either a dog breed or a person from the city of Rottweil), (3) the lemma was not unique (Drehtür (revolving door) could be lemmatized to Drehtür or Drehtüre with no difference in meaning), or (4) several analyses were equivalent. The disambiguation was often difficult. Even among the words retained in the test set, there were many that we were not fully sure about. An example is the compound Natureisbahn ("natural ice rink"), which we decided to analyze as Natur-Eisbahn (nature ice-rink) rather than Natureis-Bahn (nature-ice rink).</Paragraph>
</Section>
</Paper>