<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2033">
  <Title>POS tagger combinations on Hungarian text</Title>
  <Section position="3" start_page="0" end_page="191" type="intro">
    <SectionTitle>
1 Introduction and related works
</SectionTitle>
    <Paragraph position="0"> Part-of-speech (POS) tagging is perhaps one of the most basic tasks in natural language processing. In this paper we review the current state of the art in Hungarian POS tagging, and investigate the possibilities of improving the results of the taggers by applying classifier combination techniques.</Paragraph>
    <Paragraph position="1"> We used a Transformation Based Learner (TBL) as the base tagger for combination experiments, and the learner algorithm determines the set of applicable combination schemes. From this set we chose two algorithms called Bagging and Adaboost.M1.</Paragraph>
    <Paragraph position="2"> In the next subsection, the most important published results of the last few years in Hungarian POS tagging are summarized.</Paragraph>
    <Paragraph position="3"> The TBL tagger is described in detail in Section 2, then the corpora and the data sets we used for our investigations are presented in Section ??. The classifier combination and details about the implementation of this technique are described in Section 3. After that, the results of the boosted tagger are presented in Section 4. Lastly, some conclusions about the effectiveness and efficiency of our boosting approach are made in the final section.</Paragraph>
    <Section position="1" start_page="0" end_page="191" type="sub_section">
      <SectionTitle>
1.1 POS Tagging of Hungarian Texts
</SectionTitle>
      <Paragraph position="0"> Standard POS tagging methods were applied to Hungarian as soon as the first annotated corpora appeared that were big enough to serve as a training database for various methods. The TELRI corpus (Dimitrova et al., 1998) was the first corpus that was used for testing different POS tagging methods.</Paragraph>
      <Paragraph position="1"> This corpus contains approximately 80,000 words. Later, as the Hungarian National Corpus (Váradi, 2002) and the Manually Annotated Hungarian Corpus (the Szeged Corpus) (Alexin et al., 2003) became available, an opportunity was provided to test the results on bigger corpora (153M and 1.2M words, respectively). In recent years several authors have published many useful POS tagging results in Hungarian. It is generally believed that, owing to the fairly free word order and the agglutinative property of the Hungarian language, Hungarian poses more special problems than the Indo-European languages do. However, the latest results are comparable to results achieved in English and other well-studied languages.</Paragraph>
      <Paragraph position="2"> Fruitful approaches for Hungarian POS tagging are Hidden Markov Models, Transformation Based Learning and rule-based learning methods.</Paragraph>
      <Paragraph position="3"> One of the most common POS tagging approaches is to build a tagger based on Hidden Markov Models (HMM). Tufiş (Tufiş et al., 2000) reported good results with the Tri-grams and Tags (TnT) tagger (Brants, 2000). A slightly better version of TnT was employed by Oravecz (Oravecz and Dienes, 2002), and it achieved excellent results. In their paper, Oravecz and Dienes (Oravecz and Dienes, 2002) argue that regardless of the rich morphology and relatively free word order, the POS tagging of Hungarian with HMM methods is possible and effective once one is able to handle the data sparsity problem. They used a modified version of TnT that was supported by an external morphological analyzer. In this way the trigram tagger was able to make better guesses about the unseen words and therefore to get better results. An example of the results achieved by this trigram tagger is presented in the first row of Table 1. Besides the statistical methods, another approach is rule-based learning. A valuable feature of the rule-based methods is that the rules these methods work with are usually more intelligible to humans than the parameters of statistical methods. For Hungarian, a few such approaches are available in the literature.</Paragraph>
      <Paragraph position="4"> In a comprehensive investigation, Horváth et al. (Horváth et al., 1999) applied five different machine learning methods to Hungarian POS tagging. They tested C4.5, PHM, RIBL, Progol and AGLEARN (Alexin et al., 1999) methods on the TELRI corpus.</Paragraph>
      <Paragraph position="5"> The results of C4.5 and the best tagger found in this investigation (RIBL) are presented in the second and third rows of Table 1.</Paragraph>
      <Paragraph position="6"> [Table 1: ... Hungarian POS taggers.]</Paragraph>
      <Paragraph position="7"> Hócza (Hócza et al., 2003) used a different rule generalization method called RGLearn.</Paragraph>
      <Paragraph position="8"> Row 4 of Table 1 shows the test results of that tagger. Transformation Based Learning is a rule-based method that we will discuss in depth in Section 2. Megyesi (Megyesi, 1999) and Kuba et al. (Kuba et al., 2004) produced results with TBL taggers that are given in Table 1, in rows 5 and 6, respectively. Kuba et al. (Kuba et al., 2004) performed experiments with combinations of various tagger methods. The combinations outperformed their component taggers in almost every case. However, different combinations proved best on different test sets, so no conclusion could be drawn about the best combination. The combined tagger that performed best on the largest test set is shown in row 7 of Table 1.</Paragraph>
    </Section>
  </Section>
</Paper>