<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0706">
  <Title>Choosing an Optimal Architecture for Segmentation and POS-Tagging of Modern Hebrew</Title>
  <Section position="4" start_page="0" end_page="39" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Texts in Semitic languages like Modern Hebrew (henceforth Hebrew) and Modern Standard Arabic (henceforth Arabic), are based on writing systems that allow the concatenation of different lexical units, called morphemes. Morphemes may belong to various Part-of-Speech (POS) classes, and their concatenation forms textual units delimited by white space, which are commonly referred to as words. Hence, the task of POS tagging for Semitic languages consists of a segmentation subtask and a classification subtask. Crucially, words can be segmented into different alternative morpheme sequences, where in each segmentation morphemes may be ambiguous in terms of their POS tag. This results in a high level of overall ambiguity, aggravated by the lack of vocalization in modern Semitic texts.</Paragraph>
    <Paragraph position="1"> One crucial problem concerning POS tagging of Semitic languages is how to adapt existing methods in the best way, and which architectural choices have to be made in light of the limited availability of annotated corpora (especially for Hebrew). This paper outlines some alternative architectures for POS tagging of Hebrew text, and studies them empirically.</Paragraph>
    <Paragraph position="2"> This leads to some general conclusions about the optimal architecture for disambiguating Hebrew, and (reasonably) other Semitic languages as well. The choice of tokenization level has major consequences for the implementation using HMMs, the sparseness of the statistics, the balance of the Markov condi- null tioning, and the possible loss of information. The paper reports on extensive experiments for comparing different architectures and studying the effects of this choice on the overall result. Our best result is on par with the best reported POS tagging results for Arabic, despite the much smaller size of our annotated corpus.</Paragraph>
    <Paragraph position="3"> The paper is structured as follows. Section 2 defines the task of POS tagging in Hebrew, describes the existing corpora and discusses existing related work. Section 3 concentrates on defining the different levels of tokenization, specifies the details of the probabilistic framework that the tagger employs, and describes the techniques used for smoothing the probability estimates. Section 4 compares the different levels of tokenization empirically, discusses their limitations, and proposes an improved model, which outperforms both of the initial models. Finally, section 5 discusses the conclusions of our study for segmentation and POS tagging of Hebrew in particular, and Semitic languages in general.</Paragraph>
  </Section>
  <Section position="5" start_page="39" end_page="39" type="metho">
    <SectionTitle>
2 Task definition, corpora and related work
</SectionTitle>
    <Paragraph position="0"> work Words in Hebrew texts, similar to words in Arabic and other Semitic languages, consist of a stem and optional prefixes and suffixes. Prefixes include conjunctions, prepositions, complementizers and the definiteness marker (in a strict well-defined order). Suffixes include inflectional suffixes (denoting gender, number, person and tense), pronominal complements with verbs and prepositions, and possessive pronouns with nouns.</Paragraph>
    <Paragraph position="1"> By the term word segmentation we henceforth refer to identifying the prefixes, the stem and suffixes of the word. By POS tag disambiguation we mean the assignment of a proper POS tag to each of these morphemes.</Paragraph>
    <Paragraph position="2"> In defining the task of segmentation and POS tagging, we ignore part of the information that is usually found in Hebrew morphological analyses. The internal morphological structure of stems is not analyzed, and the POS tag assigned to stems includes no information about their root, template/pattern, inflectional features and suffixes. Only pronominal complement suffixes on verbs and prepositions are identified as separate morphemes. The construct state/absolute,1 and the existence of a possessive suffix are identified using the POS tag assigned to the stem, and not as a separate segment or feature.</Paragraph>
    <Paragraph position="3"> Some of these conventions are illustrated by the segmentation and POS tagging of the word wfnpgfnw (&amp;quot;and that we met&amp;quot;, pronounced ve-she-nifgashnu):2 w/CC: conjunction f /COM: complementizer npgfnw/VB: verb Our segmentation and POS tagging conform with the annotation scheme used in the Hebrew Treebank (Sima'an et al., 2001), described next.</Paragraph>
    <Section position="1" start_page="39" end_page="39" type="sub_section">
      <SectionTitle>
2.1 Available corpora
</SectionTitle>
      <Paragraph position="0"> The Hebrew Treebank (Sima'an et al., 2001) consists of syntactically annotated sentences taken from articles from the Ha'aretz daily newspaper. We extracted from the treebank a mapping from each word to its analysis as a sequence of POS tagged morphemes. The treebank version used in the current work contains 57 articles, which amount to 1,892 sentences, 35,848 words, and 48,332 morphemes.</Paragraph>
      <Paragraph position="1"> In addition to the manually tagged corpus, we have access to an untagged corpus containing 337,651 words, also originating from Ha'aretz newspaper.</Paragraph>
      <Paragraph position="2"> The tag set, containing 28 categories, was obtained from the full morphological tagging by removing the gender, number, person and tense features. This tag set was used for training the POS tagger. In the evaluation of the results, however, we perform a further grouping of some POS tags, leading to a reduced POS tag set of 21 categories. The tag set and the grouping scheme are shown below: {NN}, {NN-H}, {NNT}, {NNP}, {PRP,AGR}, {JJ}, {JJT}, {RB,MOD}, {RBR}, {VB,AUX}, {VB-M}, {IN,COM,REL}, {CC}, {QW}, {HAM}, {WDT,DT}, {CD,CDT}, {AT}, {H}, {POS}, {ZVL}.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="39" end_page="40" type="metho">
    <SectionTitle>
2.2 Related work on Hebrew and Arabic
</SectionTitle>
    <Paragraph position="0"> Due to the lack of substantial tagged corpora, most previous corpus-based work on Hebrew focus on the  development of techniques for learning probabilities from large unannotated corpora. The candidate analyses for each word were usually obtained from a morphological analyzer.</Paragraph>
    <Paragraph position="1"> Levinger et al. (1995) propose a method for choosing a most probable analysis for Hebrew words using an unannotated corpus, where each analysis consists of the lemma and a set of morphological features. They estimate the relative frequencies of the possible analyses for a given word w by defining a set of &amp;quot;similar words&amp;quot; SW(A) for each possible analysis A of w. Each word wprime in SW(A) corresponds to an analysis Aprime which differs from A in exactly one feature. Since each set is expected to contain different words, it is possible to approximate the frequency of the different analyses using the average frequency of the words in each set, estimated from the untagged corpus.</Paragraph>
    <Paragraph position="2"> Carmel and Maarek (1999) follow Levinger et al. in estimating context independent probabilities from an untagged corpus. Their algorithm learns frequencies of morphological patterns (combinations of morphological features) from the unambiguous words in the corpus.</Paragraph>
    <Paragraph position="3"> Several works aimed at improving the &amp;quot;similar words&amp;quot; method by considering the context of the word. Levinger (1992) adds a short context filter that enforces grammatical constraints and rules out impossible analyses. Segal's (2000) system includes, in addition to a somewhat different implementation of &amp;quot;similar words&amp;quot;, two additional components: correction rules `a la Brill (1995), and a rudimentary deterministic syntactic parser.</Paragraph>
    <Paragraph position="4"> Using HMMs for POS tagging and segmenting Hebrew was previously discussed in (Adler, 2001).</Paragraph>
    <Paragraph position="5"> The HMM in Adler's work is trained on an untagged corpus, using the Baum-Welch algorithm (Baum, 1972). Adler suggests various methods for performing both tagging and segmentation, most notable are (a) The usage of word-level tags, which uniquely determine the segmentation and the tag of each morpheme, and (b) The usage of a two-dimensional Markov model with morpheme-level tags. Only the first method (word-level tags) was tested, resulting in an accuracy of 82%. In the present paper, both word-level tagging and morpheme-level tagging are evaluated.</Paragraph>
    <Paragraph position="6"> Moving on to Arabic, Lee et al. (2003) describe a word segmentation system for Arabic that uses an n-gram language model over morphemes. They start with a seed segmenter, based on a language model and a stem vocabulary derived from a manually segmented corpus. The seed segmenter is improved iteratively by applying a bootstrapping scheme to a large unsegmented corpus. Their system achieves accuracy of 97.1% (per word).</Paragraph>
    <Paragraph position="7"> Diab et al. (2004) use Support Vector Machines (SVMs) for the tasks of word segmentation and POS tagging (and also Base Phrase Chunking). For segmentation, they report precision of 99.09% and recall of 99.15%, when measuring morphemes that were correctly identified. For tagging, Diab et al.</Paragraph>
    <Paragraph position="8"> report accuracy of 95.49%, with a tag set of 24 POS tags. Tagging was applied to segmented words, using the &amp;quot;gold&amp;quot; segmentation from the annotated corpus (Mona Diab, p.c.).</Paragraph>
  </Section>
  <Section position="7" start_page="40" end_page="42" type="metho">
    <SectionTitle>
3 Architectures for POS tagging Semitic languages
</SectionTitle>
    <Paragraph position="0"> languages Our segmentation and POS tagging system consists of a morphological analyzer that assigns a set of possible candidate analyses to each word, and a disambiguator that selects from this set a single preferred analysis per word. Each candidate analysis consists of a segmentation of the word into morphemes, and a POS tag assignment to these morphemes. In this section we concentrate on the architectural decisions in devising an optimal disambiguator, given a morphological analyzer for Hebrew (or another Semitic language).</Paragraph>
    <Section position="1" start_page="40" end_page="41" type="sub_section">
      <SectionTitle>
3.1 Defining the input/output
</SectionTitle>
      <Paragraph position="0"> An initial crucial decision in building a disambiguator for a Semitic text concerns the &amp;quot;tokenization&amp;quot; of the input sentence: what constitutes a terminal (i.e., input) symbol. Unlike English POS tagging, where the terminals are usually assumed to be words (delimited by white spaces), in Semitic texts there are two reasonable options for fixing the kind of terminal symbols, which directly define the corresponding kind of nonterminal (i.e., output) symbols: Words (W): The terminals are words as they appear in the text. In this case a nonterminal a that is assigned to a word w consists of a sequence of POS tags, each assigned to a mor- null pheme of w, delimited with a special segmentation symbol. We henceforth refer to such complex nonterminals as analyses. For instance, the analysis IN-H-NN for the Hebrew word bbit uniquely encodes the segmentation b-h-bit.</Paragraph>
      <Paragraph position="1"> In Hebrew, this unique encoding of the segmentation by the sequence of POS tags in the analysis is a general property: given a word w and a complex nonterminal a = [t1 ...tp] for w, it is possible to extend a back to a full analysis ~a = [(m1,t1)...(mp,tp)], which includes the morphemes m1 ...mp that make out w. This is done by finding a match for a in Analyses(w), the set of possible analyses of w. Except for very rare cases, this match is unique.</Paragraph>
      <Paragraph position="2"> Morphemes (M): In this case the nonterminals are the usual POS tags, and the segmentation is given by the input morpheme sequence. Note that information about how morphemes are joined into words is lost in this case.</Paragraph>
      <Paragraph position="3"> Having described the main input-output options for the disambiguator, we move on to describing the probabilistic framework that underlies their workings. null</Paragraph>
    </Section>
    <Section position="2" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
3.2 The probabilistic framework
</SectionTitle>
      <Paragraph position="0"> Let wk1 be the input sentence, a sequence of words w1 ...wk. If tokenization is per word, then the disambiguator aims at finding the nonterminal sequence ak1 that has the highest joint probability with the given sentence wk1:</Paragraph>
      <Paragraph position="2"> This setting is the standard formulation of probabilistic tagging for languages like English.</Paragraph>
      <Paragraph position="3"> If tokenization is per morpheme, the disambiguator aims at finding a combination of a segmentation mn1 and a tagging tn1 for mn1, such that their joint probability with the given sentence, wk1, is maximized: null</Paragraph>
      <Paragraph position="5"> where ANALYSES(wk1) is the set of possible analyses for the input sentence wk1 (output by the morphological analyzer). Note that n can be different from k, and may vary for different segmentations. The original sentence can be uniquely recovered from the segmentation and the tagging.</Paragraph>
      <Paragraph position="6"> Since all the &lt;mn1,tn1&gt; pairs that are the input for the disambiguator were derived from wk1, we have</Paragraph>
      <Paragraph position="8"> formula that applies to both word tokenization and</Paragraph>
      <Paragraph position="10"> In Formula (4) en1 represents either a sequence of words or a sequence of morphemes, depending on the level of tokenization, and An1 are the respective nonterminals - either POS tags or word-level analyses. Thus, the disambiguator aims at finding the most probable &lt;terminal sequence, nonterminal sequence&gt; for the given sentence, where in the case of word-tokenization there is only one possible terminal sequence for the sentence.</Paragraph>
    </Section>
    <Section position="3" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
3.3 HMM probabilistic model
</SectionTitle>
      <Paragraph position="0"> The actual probabilistic model used in this work for estimating P(en1,An1) is based on Hidden Markov Models (HMMs). HMMs underly many successful POS taggers , e.g. (Church, 1988; Charniak et al., 1993).</Paragraph>
      <Paragraph position="1"> For a k-th order Markov model (k = 1 or k = 2), we rewrite (4) as:</Paragraph>
      <Paragraph position="3"> For reasons of data sparseness, actual models we use work with k = 2 for the morpheme level tokenization, and with k = 1 for the word level tokenization.  For these models, two kinds of probabilities need to be estimated: P(ei  |Ai) (lexical model) and P(Ai |Ai[?]k,...,Ai[?]1) (language model). Because the only manually POS tagged corpus that was available to us for training the HMM was relatively small (less than 4% of the Wall Street Journal (WSJ) portion of the Penn treebank), it is inevitable that major effort must be dedicated to alleviating the sparseness problems that arise. For smoothing the nonterminal language model probabilities we employ the standard backoff smoothing method of Katz (1987).</Paragraph>
      <Paragraph position="4"> Naturally, the relative frequency estimates of the lexical model suffer from more severe data-sparseness than the estimates for the language model. On average, 31.3% of the test words do not appear in the training corpus. Our smoothing method for the lexical probabilities is described next.</Paragraph>
    </Section>
    <Section position="4" start_page="42" end_page="42" type="sub_section">
      <SectionTitle>
3.4 Bootstrapping a better lexical model
</SectionTitle>
      <Paragraph position="0"> For the sake of exposition, we assume word-level tokenization for the rest of this subsection. The method used for the morpheme-level tagger is very similar.</Paragraph>
      <Paragraph position="1"> The smoothing of the lexical probability of a word w given an analysis a, i.e., P(w  |a) = P(w,a)P(a) , is accomplished by smoothing the joint probability P(w,a) only, i.e., we do not smooth P(a).3 To smooth P(w,a), we use a linear interpolation of the relative frequency estimates from the annotated training corpus (denoted rftr(w,a)) together with estimates obtained by unsupervised estimation from a large unannotated corpus (denoted emauto(w,a)):</Paragraph>
      <Paragraph position="3"> where l is an interpolation factor, experimentally set to 0.85.</Paragraph>
      <Paragraph position="4"> Our unsupervised estimation method can be viewed as a single iteration of the Baum-Welch (Forward-Backward) estimation algorithm (Baum, 1972) with minor differences. We apply this method to the untagged corpus of 340K words. Our method starts out from a naively smoothed relative fre-</Paragraph>
      <Paragraph position="6"> quency lexical model in our POS tagger:</Paragraph>
      <Paragraph position="8"> Where ftr(w) is the occurrence frequency of w in the training corpus, and p0 is a constant set experimentally to 10[?]10. We denote the tagger that employs a smoothed language model and the lexical model PLM0 by the probability distribution Pbasic (over analyses, i.e., morpheme-tag sequences).</Paragraph>
      <Paragraph position="9"> In the unsupervised algorithm, the model Pbasic is used to induce a distribution of alternative analyses (morpheme-tag sequences) for each of the sentences in the untagged corpus; we limit the number of alternative analyses per sentence to 300. This way we transform the untagged corpus into a &amp;quot;corpus&amp;quot; containing weighted analyses (i.e., morpheme-tag sequences). This corpus is then used to calculate the updated lexical model probabilities using maximum-likelihood estimation. Adding the test sentences to the untagged corpus ensures non-zero probabilities for the test words.</Paragraph>
    </Section>
    <Section position="5" start_page="42" end_page="42" type="sub_section">
      <SectionTitle>
3.5 Implementation4
</SectionTitle>
      <Paragraph position="0"> The set of candidate analyses was obtained from Segal's morphological analyzer (Segal, 2000). The analyzer's dictionary contains 17,544 base forms that can be inflected. After this dictionary was extended with the tagged training corpus, it recognizes 96.14% of the words in the test set.5 For each train/test split of the corpus, we only use the training data for enhancing the dictionary. We used SRILM (Stolcke, 2002) for constructing language models, and for disambiguation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>