<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3229">
  <Title>A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Why TnT?
</SectionTitle>
    <Paragraph position="0"> Readers may wonder why we chose to use TnT, which was not designed for Slavic languages. The short answer is that it is convenient and successful, but the following two sections address the issue in rather more detail.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The encoding of lexical information in TnT
</SectionTitle>
      <Paragraph position="0"> TnT records some lexical information in the emission probabilities of its second order Markov Model. Since Russian and Czech do not use the same words we cannot use this information (at least not directly) to tag Russian. Given this, the move from Czech to Russian involves a loss of detailed lexical information. Therefore we implemented a morphological analyzer for Russian, the output of which we use to provide surrogate emission probabilities for the TnT tagger (Brants, 2000). The details are described below in section 4.2.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The modelling of word order in TnT
</SectionTitle>
      <Paragraph position="0"> Both Russian and Czech have relatively free word order, so it may seem an odd choice to use a Markov model (MM) tagger. Why should second order MM be able to capture useful facts about such languages? Firstly, even if a language has the potential for free word order, it may still turn out that there are recurring patterns in the progressions of parts-of-speech attested in a training corpus. Secondly, n-gram models including MM have indeed been shown to be successful for various Slavic languages, e.g., Czech (HajiVc et al., 2001) or Slovene (DVzeroski et al., 2000); although not as much as for English. This shows that the transitional information captured by the second-order MM from a Czech or Slovene corpus is useful for Czech or Slovene.2 The present paper shows that transitional information acquired from Czech is also useful for Russian.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Russian versus Czech
</SectionTitle>
    <Paragraph position="0"> A deep comparative analysis of Czech and Russian is far beyond the scope of this paper. However, we would like to mention just a number of the most important facts. Both languages are Slavic (Czech is West Slavonic, Russian is East Slavonic). Both have extensive morphology whose role is important in determining the grammatical functions of phrases.</Paragraph>
    <Paragraph position="1"> In both languages, the main verb agrees in person and number with subject; adjectives agree in gender, number and case with nouns. Both languages are free constituent order languages. The word order in a sentence is determined mainly by discourse.</Paragraph>
    <Paragraph position="2"> It turns out that the word order in Czech and Russian is very similar. For instance, old information mostly precedes new information. The &amp;quot;neutral&amp;quot; order in the two languages is Subject-Verb-Object. Here is a parallel Czech-Russian example from our development corpus:  'It was a bright cold day in April, and the clocks were striking thirteen.' [from Orwell's '1984'] Of course, not all utterances are so similar. Section 5.4 briefly mentions how to improve the utility of the corpus by eradicating some of the systematic differences.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Realization
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The tag system
</SectionTitle>
      <Paragraph position="0"> We adopted the Czech tag system (HajiVc, 2000) for Russian. Every tag is represented as a string of 15 symbols each corresponding to one morphological category. For example, the word vidjela is assigned the tag VpFS- - -XR-AA- - -, because it is a verb (V), past participle (p), feminine (F), singular (S), does not distinguish case (-), possessive gender (-), possessive number (-), can be any person (X), is past tense (R), is not gradable (-), affirmative (A), active voice (A), and does not have any stylistic variants (the final hyphen).</Paragraph>
      <Paragraph position="1"> No. Description Abbr. No. of values  The tagset used for Czech (4290+ tags) is larger than the tagset we use for Russian (about 900 tags). There is a good theoretical reason for this choice - Russian morphological categories usually have fewer values (e.g., 6 cases in Russian vs. 7 in Czech; Czech often has formal and colloquial variants of the same morpheme); but there is also an immediate practical reason - the Czech tag system is very elaborate and specifically devised to serve multiple needs, while our tagset is designed solely to capture the core of Russian morphology, as we need it for our primary purpose of demonstrating the portability and feasibility of our technique. Still, our tagset is much larger than the Penn Treebank tagset, which uses only 36 non-punctuation tags (Marcus et al., 1993).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Morphological analysis
</SectionTitle>
      <Paragraph position="0"> In this section we describe our approach to a resource-light encoding of salient facts about the Russian lexicon. Our techniques are not as radical as previously explored unsupervised methods (Goldsmith, 2001; Yarowsky and Wicentowski, 2000), but are designed to be feasible for languages for which serious morphological expertise is unavailable to us. We use a paradigm-based morphology that avoids the need to explicitly create a large lexicon. The price that we pay for this is overgeneration. Most of these analyses look very implausible to a Russian speaker, but significantly increasing the precision would be at the cost of greater development time than our resource-light approach is able to commit. We wish our work to be portable at least to other Slavic languages, for which we assume that elaborate morphological analyzers will not be available. We do use two simple pre-processing methods to decrease the ambiguity of the results handed to the tagger - longest ending filtering and an automatically acquired lexicon of stems. These were easy to implement and surprisingly effective.</Paragraph>
      <Paragraph position="1"> Our analyzer captures just a few textbook facts about the Russian morphology (Wade, 1992), excluding the majority of exceptions and including information about 4 declension classes of nouns, 3 conjugation classes of verbs. In total our database contains 80 paradigms. A paradigm is a set of endings and POS tags that can go with a particular set of stems. Thus, for example, the paradigm in Table 3 is a set of inflections that go with the masculine stems ending on the &amp;quot;hard&amp;quot; consonants, e.g., slon 'elephant', stol 'table'.</Paragraph>
      <Paragraph position="2"> Unlike the traditional notions of stem and ending, for us a stem is the part of the word that does not change within its paradigm, and the ending is the part of the word that follows such a stem. For example, the forms of the verb moVc' 'can.INF': mogu '1sg', moVzeVs' '2sg', moVzet '3sg', etc. are analyzed as</Paragraph>
      <Paragraph position="4"> the stem mo followed by the endings gu, VzeVs', Vzet. A more linguistically oriented analysis would involve the endings u, eVs', et and phonological alternations in the stem. All stem internal variations are treated as suppletion.3 Unlike the morphological analyzers that exist for Russian (Segalovich and Titov, 2000; Segalovich, 2003; Segalovich and Maslov, 1989; Kovalev, 2002; Mikheev and Liubushkina, 1995; Yablonsky, 1999; Segalovich, 2003; Kovalev, 2002, among others) (Segalovich, 2003; Kovalev, 2002; Mikheev and Liubushkina, 1995; Yablonsky, 1999, among others), our analyzer does not rely on a substantial manually created lexicon. This is in keeping with our aim of being resource-light. When analyzing a word, the system first checks a list of monomorphemic closed-class words and then segments the word into all possible prefix-stem-ending triples.4 The result has quite good coverage (95.4%), but the average ambiguity is very high (10.9 tags/token), and even higher for open class words. We therefore have two strategies for reducing ambiguity.</Paragraph>
      <Paragraph position="5">  The first approach to ambiguity reduction is based on a simple heuristic - the correct ending is usually one of the longest candidate endings. In English, it would mean that if a word is analyzed either as having a zero ending or an -ing ending, we would consider only the latter; obviously, in the vast majority of cases that would be the correct analysis. In addition, we specify that a few long but very rare endings should not be included in the maximum length calculation (e.g., 2nd person pl. imperative).</Paragraph>
      <Paragraph position="6"> 3We do in fact have a very similar analysis, the analyzer's run-time representation of the paradigms is automatically produced from a more compact and linguistically attractive specification of the paradigms. It is possible to specify the basic paradigms and then specify the subparadigms, exceptions and paradigms involving phonological changes by referring to them.</Paragraph>
      <Paragraph position="7"> 4Currently, we consider only two inflectional prefixes - negative ne and superlative nai.</Paragraph>
      <Paragraph position="8">  The second approach uses a large raw corpus5 to generate an open class lexicon of possible stems with their paradigms. In this paper, we can only sketch the method, for more details see (Hana and Feldman, to appear). It is based on the idea that open-class lemmata are likely to occur in more than one form. First, we run the morphological analyzer on the text (without any filtering), then we add to the lexicon those entries that occurred with at least a certain number of distinct forms and cover the highest number of forms. If we encounter the word talking, using the information about paradigms, we can assume that it is either the -ing form of the lemma talk or that it is a monomorphemic word (such as sibling). Based on this single form we cannot really say more. However, if we also encounter the forms talk, talks and talked, the former analysis seems more probable; and therefore, it seems reasonable to include the lemma talk as a verb into the lexicon. If we encountered also talkings, talkinged and talkinging, we would include both lemmata talk and talking as verbs.</Paragraph>
      <Paragraph position="9"> Obviously, morphological analysis based on such a lexicon overgenerates, but it overgenerates much less than if based on the endings alone. For example, for the word form partii of the lemma partija 'party', our analysis gives 8 possibilities - the 5 correct ones (noun fem sg gen/dat/loc sg and pl nom/acc) and 3 incorrect ones (noun masc sg loc, pl nom, and noun neut pl acc; note that only gender is incorrect). Analysis based on endings alone would allow 20 possibilities - 15 of them incorrect (including adjectives and an imperative).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Tagging
</SectionTitle>
      <Paragraph position="0"> We use the TnT tagger (Brants, 2000), an implementation of the Viterbi algorithm for second order Markov models. We train the transition probabilities on Czech (1.5M tokens of the Prague Dependency Treebank (B'emov'a et al., 1999)). We obtain surrogate emission probabilities by running our morphological analyzer, then assuming a uniform distribution over the resulting emissions.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML