<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3246"> <Title>Learning Hebrew Roots: Machine Learning with Linguistic Constraints</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Linguistic background </SectionTitle> <Paragraph position="0"> In this section we refer to Hebrew only, although much of the description is valid for other Semitic languages as well. As an example of root-and-pattern morphology, consider the Hebrew roots g.d.l, k.t.b and r.$.m and the patterns haCCaCa, hitCaCCut and miCCaC, where the 'C's indicate the slots. When the roots combine with these patterns the resulting lexemes are hagdala, hitgadlut, migdal, haktaba, hitkatbut, miktab, har$ama, hitra$mut, mir$am, respectively. After the root combines with the pattern, some morpho-phonological alternations take place, which may be non-trivial: for example, the hitCaCCut pattern triggers assimilation when the first consonant of the root is t or d: thus, d.r.$+hitCaCCut yields hiddar$ut. The same pattern triggers metathesis when the first radical is s or $: s.d.r+hitCaCCut yields histadrut rather than the expected *hitsadrut. Semi-vowels such as w or y in the root are frequently combined with the vowels of the pattern, so that q.w.m+haCCaCa yields haqama, etc. Frequently, root consonants such as w or y are altogether missing from the resulting form.</Paragraph> <Paragraph position="1"> These matters are further complicated by two factors: first, the standard Hebrew orthography leaves most of the vowels unspecified. It does not explicate a and e vowels, does not distinguish between o and u vowels and leaves many of the i vowels unspecified. Furthermore, the single letter w is used both for the vowels o and u and for the consonant v, whereas i is similarly used both for the vowel i and for the consonant y. On top of that, the script dictates that many particles, including four of the most frequent prepositions, the definite article, the coordinating conjunction and some subordinating conjunctions, all attach to the words which immediately follow them. Thus, a form such as mhgr can be read as a lexeme ("immigrant"), as m-hgr "from Hagar" or even as m-h-gr "from the foreigner". Note that there is no deterministic way to tell whether the first m of the form is part of the pattern, the root or a prefixing particle (the preposition m "from").</Paragraph> <Paragraph position="2"> The Hebrew script has 22 letters, all of which can be considered consonants. The number of tri-consonantal roots is thus theoretically bounded by 22³, although several phonological constraints limit this number to a much smaller value. For example, while roots whose second and third radicals are identical abound in Semitic languages, roots whose first and second radicals are identical are extremely rare (see McCarthy (1981) for a theoretical explanation). To estimate the number of roots in Hebrew we compiled a list of roots from two sources: a dictionary (Even-Shoshan, 1993) and the verb paradigm tables of Zdaqa (1974). The union of these yields a list of 2152 roots (see footnote 3). While most Hebrew roots are regular, many belong to weak paradigms, which means that root consonants undergo changes in some patterns. Examples include i or n as the first root consonant, w or i as the second, i as the third and roots whose second and third consonants are identical. For example, consider the pattern hCCCh. Regular roots such as p.s.q yield forms such as hpsqh.
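Schematically, the regular combination of a root with a pattern can be sketched as follows; this toy Python rendering (our own, not from the paper) models only the plain interdigitation of radicals into pattern slots, with none of the alternations or weak-paradigm deletions described above:

    def interdigitate(root, pattern):
        """Fill the C slots of a pattern with the radicals of a regular root.

        The root is given as dot-separated radicals, e.g. "p.s.q"; the
        pattern uses 'C' for radical slots, e.g. "hCCCh" or "miCCaC".
        """
        radicals = iter(root.split("."))
        return "".join(next(radicals) if ch == "C" else ch for ch in pattern)

    assert interdigitate("p.s.q", "hCCCh") == "hpsqh"
    assert interdigitate("g.d.l", "miCCaC") == "migdal"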
However, the irregular roots n.p.l, i.c.g, q.w.m and g.n.n in this pattern yield the seemingly similar forms hplh, hcgh, hqmh and hgnh, respectively. Note that in the first and second examples, the first radical (n or i) is missing, in the third the second radical (w) is omitted and in the last example one of the two identical radicals is omitted. Consequently, a form such as hC1C2h can have any of the roots n.C1.C2, C1.w.C2, C1.i.C2, C1.C2.C2 and even, in some cases, i.C1.C2.</Paragraph> <Paragraph position="3"> While the Hebrew script is highly ambiguous, ambiguity is somewhat reduced for the task we consider here, as many of the possible lexemes of a given form share the same root. Still, in order to correctly identify the root of a given word, context must be taken into consideration. For example, the form $mnh has more than a dozen readings, including the adjective "fat" (feminine singular), which has the root $.m.n, and the verb "count", whose root is m.n.i, preceded by a subordinating conjunction. In the experiments we describe below we ignore context completely, so our results are handicapped by design.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Data and methodology </SectionTitle> <Paragraph position="0"> We take a machine learning approach to the problem of determining the root of a given word. For training and testing, a Hebrew linguist manually tagged a corpus of 15,000 words (a set of newspaper articles). Of these, only 9752 were annotated; the reason for the gap is that some Hebrew words, mainly borrowed but also some frequent words such as prepositions, do not have roots; we further eliminated 168 roots with more than three consonants and were left with 5242 annotated word types, exhibiting 1043 different roots. Table 1 shows the distribution of word types according to root ambiguity.</Paragraph> <Paragraph position="1"> Footnote 3: Only tri-consonantal roots are counted. Ornan (2003) mentions 3407 roots, whereas the number of roots in Arabic is estimated to be 10,000 (Darwish, 2002).</Paragraph> <Paragraph position="2"> Table 1: distribution of word types according to root ambiguity. Number of roots: 1 / 2 / 3 / 4; number of words: 4886 / 335 / 18 / 3. [Table 2, whose body was lost in extraction, gave the distribution of the 5242 word types in our corpus according to root type, where Ci is the i-th radical (note that some roots may belong to more than one group).]</Paragraph> <Paragraph position="4"> To ensure statistical reliability, in all the experiments discussed in the sequel (unless otherwise mentioned) we performed 10-fold cross-validation runs for every classification task during evaluation. We also divided the test corpus into two sets: a development set of 4800 words and a held-out set of 442 words. Only the development set was used for parameter tuning. A given example is a word type with all its (manually tagged) possible roots.</Paragraph> <Paragraph position="5"> In the experiments we describe below, our system produces one or more root candidates for each example. For each example, we define tp as the number of candidates correctly produced by the system; fp as the number of candidates which are not correct roots; and fn as the number of correct roots the system did not produce. As usual, we define precision as tp/(tp+fp) and recall as tp/(tp+fn); we then compute f-measure for each example (with α = 0.5) and (macro-)average to obtain the system's overall f-measure. To estimate the difficulty of this task, we asked six human subjects to perform it.
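Before turning to the human study, the per-example scoring just defined can be rendered as a minimal sketch (our own; with α = 0.5 the formula reduces to the familiar balanced F1):

    def example_f_measure(produced, gold, alpha=0.5):
        """Per-example f-measure; macro-averaging these values over all
        examples gives the system's overall f-measure."""
        produced, gold = set(produced), set(gold)
        tp = len(produced & gold)       # candidates correctly produced
        if tp == 0:
            return 0.0
        precision = tp / len(produced)  # tp / (tp + fp)
        recall = tp / len(gold)         # tp / (tp + fn)
        return precision * recall / (alpha * recall + (1 - alpha) * precision)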
Subjects were asked to identify all the possible roots of all the words in a list of 200 words (without context), randomly chosen from the test corpus. All subjects were computer science graduates, native Hebrew speakers with no linguistic background. The average precision of humans on this task is 83.52%, and with recall at 80.27%, f-measure is 81.86%. Two main reasons for the low performance of humans are the lack of context and the ambiguity of some of the weak paradigms.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 A machine learning approach </SectionTitle> <Paragraph position="0"> To establish a baseline, we first performed two experiments with simple, baseline classifiers. In all the experiments described in this paper we use SNoW (Roth, 1998) as the learning environment, with winnow as the update rule (using perceptron yielded comparable results). SNoW is a multi-class classifier that is specifically tailored for learning in domains in which the potential number of information sources (features) taking part in decisions is very large, of which NLP is a principal example. It works by learning a sparse network of linear functions over a pre-defined or incrementally learned feature space. SNoW has already been used successfully as the learning vehicle in a large collection of natural language related tasks, including POS tagging, shallow parsing, information extraction tasks, etc., and compared favorably with other classifiers (Roth, 1998; Punyakanok and Roth, 2001; Florian, 2002).</Paragraph> <Paragraph position="1"> Typically, SNoW is used as a classifier, and predicts using a winner-take-all mechanism over the activation values of the target classes. However, in addition to the prediction, it provides a reliable confidence level in the prediction, which enables its use in an inference algorithm that combines predictors to produce a coherent inference.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Feature types </SectionTitle> <Paragraph position="0"> All the experiments we describe in this work share the same features and differ only in the target classifiers. The features that are used to characterize a word are both grammatical and statistical: * Location of letters (e.g., the third letter of the word is b). We limit word length to 20, thus obtaining 440 features of this type (recall that the size of the alphabet is 22).</Paragraph> <Paragraph position="1"> * Bigrams of letters, independently of their location (e.g., the substring gd occurs in the word). This yields 484 features.</Paragraph> <Paragraph position="2"> * Prefixes (e.g., the word is prefixed by k$h "when the"). We have 292 features of this type, corresponding to 17 prefixes and sequences thereof.</Paragraph> <Paragraph position="3"> * Suffixes (e.g., the word ends with im, a plural suffix). There are 26 such features.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Direct prediction </SectionTitle> <Paragraph position="0"> In the first of the two experiments, referred to as Experiment A, we trained a classifier to learn roots as a single unit. The two obvious drawbacks of this approach are the large set of targets and the sparseness of the training data.
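Before quantifying these drawbacks, the feature extraction of section 4.1, which all experiments share, can be sketched as follows (a sketch only: the prefix and suffix inventories are passed in by the caller, and the feature-name scheme is ours):

    def extract_features(word, prefixes, suffixes, max_len=20):
        """Binary features of section 4.1: letter locations, letter bigrams,
        prefixes and suffixes."""
        feats = set()
        # location of letters: up to 20 positions x 22 letters = 440 features
        for i, ch in enumerate(word[:max_len]):
            feats.add(f"pos={i}:{ch}")
        # location-independent letter bigrams: 22 x 22 = 484 features
        for a, b in zip(word, word[1:]):
            feats.add(f"bigram={a}{b}")
        # prefixes and sequences thereof: 292 features
        feats.update(f"prefix={p}" for p in prefixes if word.startswith(p))
        # suffixes: 26 features
        feats.update(f"suffix={s}" for s in suffixes if word.endswith(s))
        return feats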
Of course, defining a multi-class classification task with 2152 targets, when only half of them are manifested in the training corpus, does not leave much hope for ever learning to identify the missing targets.</Paragraph> <Paragraph position="1"> In Experiment A, the macro-average precision of ten-fold cross-validation runs of this classification problem is 45.72%; recall is 44.37%, yielding an f-score of 45.03%. In order to demonstrate the inadequacy of this method, we repeated the same experiment with a different organization of the training data. We chose 30 roots and collected all their occurrences in the corpus into a test file. We then trained the classifier on the remainder of the corpus and tested on the test file. As expected, the accuracy was close to 0%.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Decoupling the problem </SectionTitle> <Paragraph position="0"> In the second experiment, referred to as Experiment B, we separated the problem into three different tasks. We trained three classifiers to learn each of the root consonants in isolation and then combined the results in the straightforward way (a conjunction of the decisions of the three classifiers). This is still a multi-class classification but the number of targets in every classification task is only 22 (the number of letters in the Hebrew alphabet) and data sparseness is no longer a problem.</Paragraph> <Paragraph position="1"> As we show below, each classifier achieves much better generalization, but the clear limitation of this method is that it completely ignores interdependencies between different targets: the decision on the first radical is completely independent of the decision on the second and the third.</Paragraph> <Paragraph position="2"> We observed a difference between recognizing the first and third radicals and recognizing the second one, as can be seen in table 3. [Table 3: per-classifier accuracy in identifying the correct radical; table body lost in extraction.] These results correspond well to our linguistic intuitions: the most difficult cases for humans are those in which the second radical is w or i, and those where the second and the third consonants are identical. Combining the three classifiers using logical conjunction yields an f-measure of 52.84%. Here, repeating the same experiment with the organization of the corpus such that testing is done on unseen roots yielded 18.1% accuracy.</Paragraph> <Paragraph position="3"> To demonstrate the difficulty of the problem, we conducted yet another experiment. Here, we trained the system as above but we tested it on different words whose roots were known to be in the training set. The results of experiment A here were 46.35%, whereas experiment B was accurate in 57.66% of the cases. Evidently, even when testing only on previously seen roots, both naïve methods are unsuccessful (although method A here outperforms method B).</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Combining interdependent classifiers </SectionTitle> <Paragraph position="0"> Evidently, simple combination of the results of the three classifiers leaves much room for improvement. Therefore we explore other ways for combining these results. We can rely on the fact that SNoW provides insight into the decisions of the classifiers - it lists not only the selected target, but rather all candidates, with an associated confidence measure.
As it turns out, the correct radical is among SNoW's top-n candidates with high accuracy, as the data in table 3 reveal.</Paragraph> <Paragraph position="1"> This observation calls for a different way of combining the results of the classifiers which takes into account not only the first candidate but also others, along with their confidence scores.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 HMM combination </SectionTitle> <Paragraph position="0"> We considered several ways, e.g., via HMMs, of appealing to the sequential nature of the task (C1 followed by C2, followed by C3). Not surprisingly, direct applications of HMMs are too weak to provide satisfactory results, as suggested by the following discussion. The approach we eventually opted for uses the predictive power of the classifiers to estimate more accurate state probabilities.</Paragraph> <Paragraph position="1"> Given the sequential nature of the data and the fact that our classifier returns a distribution over the possible outcomes for each radical, a natural approach is to combine SNoW's outcomes via a Markovian approach. Variations of this approach are used in the context of several NLP problems, including POS tagging (Schütze and Singer, 1994), shallow parsing (Punyakanok and Roth, 2001) and named entity recognition (Tjong Kim Sang and De Meulder, 2003).</Paragraph> <Paragraph position="2"> Formally, we assume that the confidence supplied by the classifier is the probability of a state (radical, c) given the observation o (the word), P(c|o). This information can be used in the HMM framework by applying Bayes rule to compute P(o|c) = P(c|o)P(o) / P(c), where P(o) and P(c) are the probabilities of observing o and being at c, respectively. That is, instead of estimating the observation probability P(o|c) directly from training data, we compute it from the classifiers' output. Omitting details (see Punyakanok and Roth (2001)), we can now combine the predictions of the classifiers by finding the most likely root for a given observation, as r* = argmax_{c1c2c3} P(c1c2c3 | o, θ), where θ is a Markov model that, in this case, can be easily learned from the supervised data. Clearly, given the short root and the relatively small number of values of ci that are supported by the outcomes of SNoW, there is no need to use dynamic programming here and a direct computation is possible.</Paragraph> <Paragraph position="7"> However, perhaps not surprisingly given the difficulty of the problem, this model turns out to be too simplistic. In fact, performance deteriorated. We conjecture that the static probabilities (the model) are too biased and cause the system to abandon good choices obtained from SNoW in favor of worse candidates whose global behavior is better.</Paragraph> <Paragraph position="8"> For example, the root &.b.d was correctly generated by SNoW as the best candidate for the word &obdim, but since P(C3 = b|C2 = b), which is 0.1, is higher than P(C3 = d|C2 = b), which is 0.04, the root &.b.b was produced instead. Note that in the above example the root &.b.b cannot possibly be the correct root of &obdim since no pattern in Hebrew contains the letter d, which must therefore be part of the root.
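The direct computation described above can be sketched as follows; the probability tables are hypothetical inputs, with the Markov model estimated from the training data:

    import itertools

    def most_likely_root(conf, prior, trans):
        """Combine per-radical confidences with a first-order Markov model.
        For length-3 roots, direct enumeration replaces dynamic programming.

        conf[i]: dict radical -> P(c_i | o), the classifier's confidence
        prior:   dict radical -> P(c), state priors from training data
        trans:   dict (prev, cur) -> P(cur | prev), learned transitions
        """
        best, best_score = None, 0.0
        for c1, c2, c3 in itertools.product(conf[0], conf[1], conf[2]):
            # Bayes rule: P(o | c) is proportional to P(c | o) / P(c),
            # since P(o) is constant for a fixed observed word
            emission = (conf[0][c1] / prior[c1]) * (conf[1][c2] / prior[c2]) \
                       * (conf[2][c3] / prior[c3])
            chain = prior[c1] * trans.get((c1, c2), 0.0) * trans.get((c2, c3), 0.0)
            if emission * chain > best_score:
                best, best_score = (c1, c2, c3), emission * chain
        return best

On the &obdim example, a large transition probability P(C3 = b | C2 = b) can override the classifiers' correct preference for d, which is precisely the failure mode described above.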
It is this kind of observation that motivates the addition of linguistic knowledge as a vehicle for combining the results of the classifiers.</Paragraph> <Paragraph position="9"> An alternative approach, which we intend to investigate in the future, is the introduction of higher-level classifiers which take into account interactions between the radicals (Punyakanok and Roth, 2001).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Adding linguistic constraints </SectionTitle> <Paragraph position="0"> The experiments discussed in section 4 are completely devoid of linguistic knowledge. In particular, experiment B inherently assumes that any sequence of three consonants can be the root of a given word. This is obviously not the case: with very few exceptions, all radicals must be present in any inflected form (in fact, only w, i, n and, in an exceptional case, l can be deleted when roots combine with patterns). We therefore trained the classifiers to consider as targets only letters that occurred in the observed word, plus w, i, n and l, rather than any of the alphabet letters. The average number of targets is now 7.2 for the first radical, 5.7 for the second and 5.2 for the third (compared to 22 each in the previous setup).</Paragraph> <Paragraph position="1"> In this model, known as the sequential model (Even-Zohar and Roth, 2001), SNoW's performance improved slightly, as can be seen in table 4 (compare to table 3). [Table 4: per-classifier accuracy in identifying the correct radical under the sequential model; table body lost in extraction.] Combining the results in the straightforward way yields an f-measure of 58.89%, a small improvement over the 52.84% performance of the basic method. This new result should be considered the baseline. In what follows we always employ the sequential model for training and testing the classifiers, using the same constraints.</Paragraph> <Paragraph position="2"> However, we employ more linguistic knowledge for a more sophisticated combination of the classifiers.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Combining classifiers using linguistic knowledge </SectionTitle> <Paragraph position="0"> SNoW provides a ranking on all possible roots. We now describe the use of linguistic constraints to re-rank this list. We implemented a function which uses knowledge pertaining to word-formation processes in Hebrew in order to estimate the likelihood of a given candidate being the root of a given word. The function effectively classifies the candidate roots into one of three classes: good candidates, which are likely to be the root of the word; bad candidates, which are highly unlikely; and average cases.</Paragraph> <Paragraph position="1"> The decision of the function is based on the observation that when a root is regular it either occurs in a word consecutively or with a single w or i between any two of its radicals. The scoring function checks, given a root and a word, whether this is the case. Furthermore, the suffix of the word, after matching the root, must be a valid Hebrew suffix (there is only a small number of such suffixes in Hebrew). If both conditions hold, the scoring function returns a high value. Then, the function checks if the root is an unlikely candidate for the given word.</Paragraph> <Paragraph position="2"> For example, if the root is regular its consonants must occur in the word in the same order they occur in the root. If this is not the case, the function returns a low value.
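A schematic rendering of these checks follows (ours, not the paper's implementation); the suffix inventory is an illustrative fragment, the deletable-radical set anticipates section 5.2's w, i, n and l, and the returned labels stand in for the empirically chosen values discussed below:

    import re

    HIGH, MIDDLE, LOW = "high", "middle", "low"  # placeholders; see below
    VALID_SUFFIXES = {"", "h", "im", "wt", "m", "tm", "nw"}  # fragment only
    DELETABLE = set("winl")  # radicals that may be missing from a form

    def likelihood_class(root, word):
        """Classify a candidate root for a word as high/middle/low."""
        c1, c2, c3 = root.split(".")
        # a regular root occurs consecutively, or with a single w or i
        # between radicals, and must leave behind a valid Hebrew suffix
        m = re.search(re.escape(c1) + "[wi]?" + re.escape(c2) + "[wi]?" + re.escape(c3), word)
        if m and word[m.end():] in VALID_SUFFIXES:
            return HIGH
        # the non-deletable radicals must at least occur in order
        required = [c for c in (c1, c2, c3) if c not in DELETABLE]
        if not re.search(".*".join(re.escape(c) for c in required), word):
            return LOW
        return MIDDLE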
We also make use in this function of our pre-compiled list of roots. A root candidate which does not occur in the list is assigned the low score. In all other cases, a middle value is returned.</Paragraph> <Paragraph position="3"> The actual values that the function returns were chosen empirically by counting the number of occurrences of each class in the training data. For example, "good" candidates make up 74.26% of the data, hence the value the function returns for "good" roots is set to 0.7426. Similarly, the middle value is set to 0.2416 and the low value to 0.0155.</Paragraph> <Paragraph position="4"> As an example, consider hipltm, whose root is n.p.l (note that the first n is missing in this form). Here, the correct candidate will be assigned the middle score while p.l.t and l.t.m will score high.</Paragraph> <Paragraph position="6"> In addition to the scoring function we implemented a simple edit distance function which returns, for a given root and a given word, the inverse of the edit distance between the two. For example, for hipltm, the (correct) root n.p.l scores 1/4 whereas p.l.t scores 1/3.</Paragraph> <Paragraph position="7"> We then run SNoW on the test data and rank the results of the three classifiers globally, where the order is determined by the product of the confidence measures of the three classifiers. This induces an order on roots, which are combinations of the decisions of three independent classifiers. Each candidate root is assigned three scores: the product of the confidence measures of the three classifiers; the result of the scoring function; and the inverse edit distance between the candidate and the observed word. We rank the candidates according to the product of the three scores (i.e., we give each score an equal weight in the final ranking).</Paragraph> <Paragraph position="8"> In order to determine which of the candidates to produce for each example, we experimented with two methods. First, the system produced the top-i candidates for a fixed value of i. The results on the development set are given in table 5.</Paragraph> <Paragraph position="9"> Obviously, since most words have only one root, precision drops dramatically when the system produces more than one candidate. This calls for a better threshold, facilitating a non-fixed number of outputs for each example. We observed that in the "difficult" examples, the top ranking candidates are assigned close scores, whereas in the easier cases, the top candidate is usually scored much higher than the next one. We therefore decided to produce all those candidates whose scores are not much lower than the score of the top ranking candidate. The drop in the score, d, was determined empirically on the development set. The results are listed in table 6, where d varies from 0.1 to 1 (d is actually computed on the log of the actual score, to avoid underflow).</Paragraph> <Paragraph position="10"> These results show that choosing d = 0.4 produces the highest f-measure. With this value for d, results for the held-out data are presented in table 7. The results clearly demonstrate the added benefit of the linguistic knowledge. In fact, our results are slightly better than average human performance, which is repeated in the table for comparison. Interestingly, even when testing the system on a set of roots which do not occur in the training corpus (see section 4), we obtain an f-score of 65.60%.
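A sketch of this final ranking and thresholding step (the candidate-tuple layout and names are ours):

    import math

    def select_roots(candidates, d=0.4):
        """Rank candidates by the product of their three scores and return
        every root whose log-score is within d of the top-ranked one.

        candidates: iterable of (root, confidence_product,
                    linguistic_score, inverse_edit_distance) tuples
        """
        ranked = sorted(
            ((math.log(conf * ling * edit), root)
             for root, conf, ling, edit in candidates),
            reverse=True,
        )
        top_score = ranked[0][0]
        return [root for score, root in ranked if top_score - score <= d]

For the easy cases, where the top candidate's score towers over the rest, this reduces to producing a single root, while genuinely ambiguous words yield several candidates.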
This result on roots unseen in training demonstrates the robustness of our method.</Paragraph> <Paragraph position="11"> It must be noted that the scoring function alone is not a function for extracting roots from Hebrew words. First, it only scores a given root candidate against a given word, rather than yielding a root given a word. While we could have used it exhaustively on all possible roots in this case, in a general setting of a number of classifiers the number of classes might be too high for this solution to be practical. Second, the function only produces three different values; when given a number of candidate roots it may return more than one root with the highest score. In the extreme case, when called with all 22³ potential roots, it returns on average more than 11 candidates which score highest (and hence are ranked equally).</Paragraph> <Paragraph position="12"> Similarly, the additional linguistic knowledge is not merely eliminating illegitimate roots from the ranking produced by SNoW. Using the linguistic constraints encoded in the scoring function only to eliminate roots, while maintaining the ranking proposed by SNoW, yields much lower accuracy.</Paragraph> <Paragraph position="13"> Clearly, our linguistically motivated scoring does more than elimination, and actually re-ranks the roots. It is only the combination of the classifiers with the linguistically motivated scoring function which boosts the performance on this task.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Error analysis </SectionTitle> <Paragraph position="0"> Looking at the questionnaires filled in by our subjects (section 3), it is obvious that humans have problems identifying the correct roots in two general cases: when the root paradigm is weak (i.e., when the root is irregular) and when the word can be read in more than one way and the subject chooses only one (presumably, the most prominent one). Our system suffers from similar problems: first, its performance on the regular paradigms is far superior to its overall performance; second, it sometimes cannot distinguish between several roots which are in principle possible, but only one of which happens to be the correct one.</Paragraph> <Paragraph position="1"> To demonstrate the first point, we evaluated the performance of the system on a different organization of the data. We tested separately words whose roots are all regular vs. words all of whose roots are irregular. We also tested words which have at least one regular root (mixed). The results are presented in table 8, and clearly demonstrate the difficulty of the system on the weak paradigms, compared to almost 95% on the easier, regular roots. [Table 8: system performance on different cases; table body lost in extraction.]</Paragraph> <Paragraph position="3"> A more refined analysis reveals differences between the various weak paradigms. Table 9 lists f-measure for words whose roots are irregular, classified by paradigm. As can be seen, the system has great difficulty in the cases of C2 = C3 and C3 = i.</Paragraph> <Paragraph position="4"> Finally, we took a closer look at some of the errors, and in particular at cases where the system produces several roots where fewer (usually only one) are correct. Such cases include, for example, the word hkwtrt ("the title"), whose root is the regular k.t.r; but the system produces, in addition, also w.t.r, mistaking the k for a prefix.
Errors of this kind are the most difficult to cope with.</Paragraph> <Paragraph position="7"> However, in many cases the system's errors are relatively easy to overcome. Consider, for example, the word hmtndbim ("the volunteers") whose root is the irregular n.d.b. Our system produces as many as five possible roots for this word: n.d.b, i.t.d, d.w.b, i.h.d, i.d.d. Clearly some of these could be eliminated. For example, i.t.d should not be produced, because if this were the root, nothing could explain the presence of the b in the word; i.h.d should be excluded because of the location of the h. Similar phenomena abound in the errors the system makes; they indicate that a more careful design of the scoring function can yield still better results, and this is the direction we intend to pursue in the future.</Paragraph> </Section> </Section> </Paper>