<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3229">
  <Title>A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Corpora
</SectionTitle>
      <Paragraph position="0"> For evaluation purposes, we selected and morphologically annotated (by hand) a small portion of the Russian translation of Orwell's '1984' (available from slaviska.uu.se/ryska/corpus.html).</Paragraph>
      <Paragraph position="1"> This corpus contains 4011 tokens and 1858 types. For development, we used another part of '1984'. Since we want to work with minimal language resources, the development corpus is intentionally small - 1788 tokens. We used it to test our hypotheses and tune the parameters of our tools.</Paragraph>
      <Paragraph position="2"> In the following sections, we discuss our experiments and report the results. Note that we do not report results for tag positions 13 and 14, since these positions are unused and are therefore always trivially correct.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Morphological analysis
</SectionTitle>
      <Paragraph position="0"> As can be seen from Table 4, morphological analysis without any filters gives good recall (although on a non-fiction text it would probably be lower), but also very high average ambiguity. Both filters - the longest-ending filter (LEF) and the automatically acquired lexicon - reduce the ambiguity significantly; the former produces a considerable drop in recall, while the latter retains high recall. However, we do best if we first attempt lexicon lookup and then apply LEF to the words not found. This keeps recall reasonably high while decreasing ambiguity. As expected, performance increases with the size of the unannotated Russian corpus used to generate the lexicon. All subsequent experimental results were obtained using this best filter combination, i.e., the combination of the lexicon based on the 1M-word corpus and LEF.</Paragraph>
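      <Paragraph> As a rough illustration of this filter cascade, the following sketch (with purely illustrative names; the analyzer and lexicon interfaces are assumptions, not our actual code) first restricts the analyses of a known word to those confirmed by the automatically acquired lexicon, and falls back to the longest-ending filter for unknown words:
        def analyze_with_filters(word, analyzer, lexicon):
            # Sketch of the best filter combination: lexicon lookup first,
            # longest-ending filter (LEF) for words not found in the lexicon.
            # `analyzer(word)` is assumed to return a set of (tag, ending_length)
            # pairs; `lexicon` maps words from the raw Russian corpus to tag sets.
            analyses = analyzer(word)
            if not analyses:
                return set()
            if word in lexicon:
                confirmed = {tag for tag, _ in analyses if tag in lexicon[word]}
                if confirmed:
                    return confirmed
            # LEF: keep only analyses obtained with the longest matching ending.
            longest = max(length for _, length in analyses)
            return {tag for tag, length in analyses if length == longest}
      </Paragraph>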
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Tagging
</SectionTitle>
      <Paragraph position="0"> Table 7 summarizes the results of our taggers on the test data. Our baseline is produced by the morphological analyzer without any filters, followed by a tagger that randomly selects a tag among those offered by the analyzer. The direct-full tag column shows the result of the TnT tagger with transition probabilities obtained directly from the Czech corpus and emissions based on the morphological analyzer with the best filters.</Paragraph>
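      <Paragraph> For concreteness, the baseline can be sketched as follows (an illustrative fragment, not our implementation): each token receives a tag drawn uniformly at random from the analyses offered by the unfiltered morphological analyzer.
        import random

        def baseline_tag(sentence, analyzer):
            # Baseline: random choice among the tags offered by the
            # morphological analyzer, with no filters and no context.
            return [random.choice(sorted(analyzer(word))) for word in sentence]
      </Paragraph>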
      <Paragraph position="1"> To further improve the results, we used two techniques: (i) we modified the training corpus to remove some systematic differences between Czech and Russian (5.4); (ii) we trained batteries of taggers on subtags to address the data sparsity problem (5.5 and 5.6).</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Russification
</SectionTitle>
      <Paragraph position="0"> We experimented with "russified" models: we trained the TnT tagger on the Czech corpus with modifications that made the structure of the training data look more like Russian. For example, plural adjectives and participles in Russian, unlike in Czech, do not distinguish gender.</Paragraph>
      <Paragraph position="1">  In addition, reflexive verbs in Czech are formed by a verb followed by a reflexive clitic, whereas in Russian reflexivization is an affixation process. Even though the auxiliaries and the copula are forms of the same verb byt' 'to be' in both Russian and Czech, the use of this verb differs in the two languages. For example, Russian does not use an auxiliary to form the past tense. It also does not use the present-tense copula, except for emphasis, but it does use forms of byt' in some other constructions, such as the past passive.</Paragraph>
      <Paragraph position="2"> We implemented a number of simple "russifications". The combination of random omission of the verb byt', omission of the reflexive clitics, and the negation transformation gave us the best results on the development corpus, improving the overall result from 68.0% to 69.4%. We admit we expected a larger improvement.</Paragraph>
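      <Paragraph> The transformations themselves are straightforward corpus rewrites. A minimal sketch of the first two (the lemma, clitic forms, omission probability and data layout below are illustrative assumptions; the negation transformation is not shown):
        import random

        BYT_LEMMA = "být"                  # Czech 'to be' (auxiliary/copula)
        REFLEXIVE_CLITICS = {"se", "si"}   # Czech reflexive clitics

        def russify(sentence, p_drop_byt=0.5, rng=random):
            # `sentence` is a list of (form, lemma, tag) triples from the
            # Czech training corpus; the returned sentence looks more Russian.
            out = []
            for form, lemma, tag in sentence:
                if lemma == BYT_LEMMA and rng.random() >= 1.0 - p_drop_byt:
                    continue               # randomly omit forms of byt'
                if form.lower() in REFLEXIVE_CLITICS:
                    continue               # drop reflexive clitics
                out.append((form, lemma, tag))
            return out
      </Paragraph>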
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.5 Sub-taggers
</SectionTitle>
      <Paragraph position="0"> One of the problems when tagging with a large tagset is data sparsity; with 1000 tags there are 1000^3 potential trigrams. It is very unlikely that a naturally occurring corpus will contain all the acceptable tag combinations with sufficient frequency to reliably distinguish them from the unacceptable combinations. However, not all morphological attributes are useful for predicting the attributes of the succeeding word (e.g., tense is not really useful for case). We therefore tried to train the tagger on individual components of the full tag, in the hope that each sub-tagger would be able to learn what it needs for prediction. This move has the additional benefit of making the tagset of each such tagger smaller and thus reducing data sparsity. We focused on six positions - POS (P), SubPOS (S), gender (g), number (n), case (c) and person (p). The selection of the slots is based on our linguistic intuition - for example, it is reasonable to assume that information about part-of-speech and the agreement features (gnc) of the previous words should help in predicting the same slots of the current word, or that information about part-of-speech, case and person should assist in determining person. On the other hand, the combination of tense and case is prima facie unlikely to be of much use for prediction. Indeed, most of our expectations have been met. The performance of some of the models on the development corpus is summarized in Table 5. The bold numbers indicate that the tagger outperforms the full-tag tagger. As can be seen, the taggers trained on individual positions are worse than the full-tag tagger on these positions. This confirms that a smaller tagset does not necessarily imply that tagging is easier; see (Elworthy, 1995) for more discussion of this interesting relation. Similarly, there is no improvement from combinations of unrelated slots - case and tense (ct) or gender and negation (ga). However, combinations of (detailed) part-of-speech with various agreement features (e.g., Snc) outperform the full-tag tagger on at least some of the slots.</Paragraph>
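      <Paragraph> Training a sub-tagger amounts to projecting every full positional tag onto the selected slots and training an ordinary n-gram tagger on the projected corpus. A sketch of the projection step (the slot indices follow the standard positional tagset and are given here only for illustration):
        SLOTS = {"P": 0, "S": 1, "g": 2, "n": 3, "c": 4, "p": 7}  # 0-based positions

        def project_tag(full_tag, slot_names):
            # e.g. project_tag(tag, "Snc") keeps SubPOS, number and case.
            return "".join(full_tag[SLOTS[s]] for s in slot_names)

        def project_corpus(tagged_corpus, slot_names):
            # Rewrite a corpus of (word, full_tag) pairs with sub-tags; a
            # separate TnT-style tagger is then trained on each projection.
            return [[(word, project_tag(tag, slot_names)) for word, tag in sent]
                    for sent in tagged_corpus]
      </Paragraph>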
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.6 Combining Sub-taggers
</SectionTitle>
      <Paragraph position="0"> We now need to put the sub-tags back together to produce estimates of the correct full tags. We cannot simply combine the values offered by the best taggers for each slot, because that could yield illegal tags (e.g., nouns in past tense). Instead we select the best tag from those offered by our morphological analyzer using the following formula:</Paragraph>
      <Paragraph position="1"> best(w) = argmax_{t in MA(w)} val(t),   where   val(t) = (1/|K|) * sum_{k in K} votes_k(t_k) / N_k   (6)</Paragraph>
      <Paragraph position="2"> Here MA(w) is the set of tags offered by the morphological analyzer for the word w, K is the set of slots, t_k is the value of tag t at slot k, votes_k(v) is the number of taggers on slot k voting for value v, and N_k is the total number of taggers on slot k. That means that the best tag is the tag that receives the highest average percentage of votes across its slots. If we cared about certain slots more than about others, we could weight the slots in the val function.</Paragraph>
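      <Paragraph> In code, the voting step can be sketched as follows (data layout and names are illustrative): every candidate tag offered by the morphological analyzer is scored by the average, over the slots, of the fraction of that slot's sub-taggers voting for the tag's value, and the highest-scoring candidate wins.
        def combine(word, analyzer, slot_votes):
            # `analyzer(word)` returns the candidate full tags for the word;
            # `slot_votes[k]` lists the values predicted for slot k by the
            # sub-taggers covering that slot (illustrative interfaces).
            def val(tag):
                score = 0.0
                for k, votes in slot_votes.items():
                    score += sum(v == tag[k] for v in votes) / len(votes)
                return score / len(slot_votes)
            return max(analyzer(word), key=val)
      </Paragraph>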
      <Paragraph position="3"> We ran several experiments; the results of three of them are summarized in Table 6. All of them work better than the full-tag tagger. One ('all') uses all available sub-taggers; another ('best 1') uses the best tagger for each slot (so the voting in Formula 6 reduces to finding the closest legal tag). The best result is obtained by the third tagger ('best 3'), which uses the three best taggers for each of the Pgcp slots and the best tagger for the rest. We selected this tagger to tag the test corpus, for which the results are summarized in Table 7.</Paragraph>
      <Paragraph position="4"> Table 8: Sample analysis (columns: Russian | Gloss | Correct | Xerox | Ours; a blank Xerox or Ours cell means the tagger's output matches the correct tag).
člen | member | noun nom | gen |
partii | party | noun gen | obl |
po | prep | prep obl | | acc
vozmožnosti | possibility | noun obl | acc |
staralsja | tried | vfin | |
nje | not | ptcl | |
govorit' | to-speak | vinf | |
ni | nor | ptcl | |
o | about | prep obl | |
Bratstvje | Brotherhood | noun obl | |
, | | cm | |
ni | nor | ptcl | |
o | about | prep obl | |
knigje | book | noun obl | |
Errors | | | 3 | 1
'Neither the Brotherhood nor the book was a subject that any ordinary Party member would mention if there was a way of avoiding it.' [Orwell: '1984']</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.7 Comparison with Xerox tagger
</SectionTitle>
      <Paragraph position="0"> A tagger for Russian is part of the Xerox language tools. We could not perform a detailed evaluation since the tool is not freely available. We used the online demo version of Xerox's Disambiguator to tag a few sentences and compared the results with the results of our tagger. The Xerox tagset is much smaller than ours: it uses 63 tags, collapsing some cases and not distinguishing gender, number, person, tense, etc. (However, it uses different tags for different punctuation marks, while we have one tag for all punctuation.) For the comparison, we translated our tagset to theirs. On 201 tokens of the testing corpus, the Xerox tagger achieved an accuracy of 82%, while our tagger obtained 88%, i.e., a 33% relative reduction in error rate (from 18% errors down to 12%). A sample analysis is in Table 8.</Paragraph>
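      <Paragraph> The comparison itself is simple to compute once both outputs are mapped into the coarse tagset; a minimal sketch (the mapping function is hypothetical, only the evaluation step is shown):
        def coarse_accuracy(gold, predicted, to_coarse):
            # Accuracy after mapping both gold and predicted tags into a
            # coarser tagset via the user-supplied `to_coarse` function.
            correct = sum(to_coarse(g) == to_coarse(p) for g, p in zip(gold, predicted))
            return correct / len(gold)

        def relative_error_reduction(acc_baseline, acc_system):
            # e.g. relative_error_reduction(0.82, 0.88) is roughly 0.33.
            return ((1 - acc_baseline) - (1 - acc_system)) / (1 - acc_baseline)
      </Paragraph>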
    </Section>
    <Section position="8" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.8 Comparison with Czech taggers
</SectionTitle>
      <Paragraph position="0"> The numbers we obtain are significantly worse than the numbers reported for Czech (Hajič et al., 2001) (95.16% accuracy); however, they use an extensive manually created morphological lexicon (200K+ entries), which gives 100.0% recall on their testing data. Moreover, they train and test their taggers on the same language.</Paragraph>
    </Section>
  </Section>
</Paper>