<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0706">
  <Title>Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew</Title>
  <Section position="9" start_page="44" end_page="45" type="concl">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> Developing a word segmenter and POS tagger for Hebrew with fewer than 30K annotated words for training is a challenging task, especially given the morphological complexity and high degree of ambiguity in Hebrew. For comparison, in English a baseline model that selects the most frequent POS tag for each word achieves an accuracy of around 90% (Charniak et al., 1993). In Hebrew, however, we found that a parallel baseline model achieves only 84% using the available corpus.</Paragraph>
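    <Paragraph> As an illustration of the baseline idea described above, the following is a minimal Python sketch of a most-frequent-tag tagger. It is not the implementation used in this work: the function names (train_baseline, tag_baseline, accuracy), the toy English data, the tag names, and the fallback to the globally most frequent tag for unseen words are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_words):
    """Learn, for each word, the tag it carries most often in training."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in tagged_words:
        per_word[word][tag] += 1
        all_tags[tag] += 1
    lexicon = {word: tags.most_common(1)[0][0] for word, tags in per_word.items()}
    # Assumption: unseen words receive the globally most frequent tag.
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag_baseline(words, lexicon, default_tag):
    """Tag each word independently, ignoring its context."""
    return [lexicon.get(word, default_tag) for word in words]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Hypothetical toy data, for illustration only.
train_tokens = [
    ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
    ("the", "DET"), ("run", "NOUN"), ("run", "VERB"), ("run", "VERB"),
]
lexicon, default_tag = train_baseline(train_tokens)
predicted = tag_baseline(["the", "run", "cat"], lexicon, default_tag)
score = accuracy(predicted, ["DET", "NOUN", "NOUN"])
```

    On a real corpus, train_tokens and the evaluation data would be replaced by the annotated training and test sets; the fallback for unseen words is one simple choice among several.</Paragraph>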
    <Paragraph position="1"> The architecture proposed in this paper addresses, in a number of ways, the severe sparseness problems that arise. First, the M+h model, which was found to perform best, is based on morpheme-level tokenization, which suffers less from data sparseness than word-level tokenization, and makes use of multi-morpheme nonterminals only in specific cases where this was found to be valuable. The number of nonterminal types found in the corpus for this model is 49 (including 11 types of punctuation marks), which is much closer to the morpheme-level model (39 types) than to the word-level model (205 types).</Paragraph>
    <Paragraph position="2"> Second, the bootstrapping method we present exploits additional resources, such as a morphological analyzer and an untagged corpus, to improve the lexical probabilities, which suffer from data sparseness the most. The improved lexical model contributes 1.5% to the tagging accuracy and 0.6% to the segmentation accuracy (compared with using the basic lexical model), making it a crucial component of our system.</Paragraph>
    <Paragraph position="3"> Among the few other tools available for POS tagging and morphological disambiguation in Hebrew, the only one that is freely available for extensive training and evaluation as performed in this paper is Segal's (Segal, 2000; see Section 2.2). Comparing our best architecture to the Segal tagger under the same experimental setting shows an improvement of 1.5% in segmentation accuracy and 4.5% in tagging accuracy over Segal's results.</Paragraph>
    <Paragraph position="4"> Moving on to Arabic, in a setting comparable to (Diab et al., 2004), in which the correct segmentation is given, our tagger achieves accuracy per morpheme of 94.9%. This result is close to the result reported by Diab et al., although our result was achieved using a much smaller annotated corpus.</Paragraph>
    <Paragraph position="5"> We therefore believe that future work may benefit from applying our model, or variations thereof, to Arabic and other Semitic languages.</Paragraph>
    <Paragraph position="6"> One of the main sources of tagging errors in our model is the limited coverage of the morphological analyzer.</Paragraph>
    <Paragraph position="7"> The analyzer misses the correct analysis of 3.78% of the test words. Hence, the upper bound for the accuracy of the disambiguator is 96.22%. Increasing the coverage while maintaining the quality of the proposed analyses (avoiding over-generation as much as possible) is crucial for improving the tagging results. It should also be mentioned that a new version of the Hebrew treebank, now containing approximately 5,000 sentences, was released after the current work was completed. We believe that the additional annotated data will allow us to refine our model, both in terms of accuracy and in terms of coverage, by expanding the tag set with additional morpho-syntactic features like gender and number, which are prevalent in Hebrew and other Semitic languages.</Paragraph>
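    <Paragraph> The relation between analyzer coverage and the accuracy upper bound stated above (3.78% of test words missed, hence at most 96.22% accuracy) can be sketched in Python as follows. This is an illustrative sketch only: the function names and the toy analyses are hypothetical, and the real analyzer output format is not assumed here.

```python
def analyzer_coverage(gold_analyses, candidate_sets):
    """Fraction of words whose gold analysis is among the analyzer's candidates."""
    if len(gold_analyses) != len(candidate_sets):
        raise ValueError("each word needs one gold analysis and one candidate set")
    covered = sum(
        1 for gold, candidates in zip(gold_analyses, candidate_sets)
        if gold in candidates
    )
    return covered / len(gold_analyses)


def upper_bound_from_miss_rate(miss_rate_percent):
    """A disambiguator can only pick an analysis that the analyzer proposes,
    so its accuracy cannot exceed 100% minus the miss rate."""
    return 100.0 - miss_rate_percent


# Hypothetical toy example: the analyzer misses the gold analysis of one word in four.
gold = ["NOUN", "VERB", "PREP+NOUN", "ADJ"]
candidates = [{"NOUN", "VERB"}, {"VERB"}, {"NOUN"}, {"ADJ", "NOUN"}]
toy_coverage = analyzer_coverage(gold, candidates)

# The figure reported in the text: 3.78% of the test words are missed.
reported_bound = upper_bound_from_miss_rate(3.78)
```

    The bound holds only under the assumption that the disambiguator never outputs an analysis outside the analyzer's candidate set.</Paragraph>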
  </Section>
</Paper>