<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-3004">
  <Title>Alon Itai, Technion; Uzzi Ornan, Technion</Title>
  <Section position="5" start_page="387" end_page="387" type="metho">
    <SectionTitle>
3. The word order in Hebrew is rather free.
4. Our Approach
</SectionTitle>
    <Paragraph position="0"> The purpose of this paper is to suggest a new approach to deal with the above-mentioned problem. This approach provides highly useful data that can be used by systems for automatic, unsupervised morphological tagging of Hebrew texts. In order to justify and motivate our approach, we must first make the following conjecture: Although the Hebrew language is highly ambiguous morphologically, it seems that in many cases a native speaker of the language can accurately &amp;quot;guess&amp;quot; the right analysis of a word, without even being exposed to the concrete context in which it appears. The accuracy can even be enhanced if the native speaker is told from which sublanguage the ambiguous word was taken.</Paragraph>
    <Paragraph position="1"> If this conjecture is true, we can now suggest a simple strategy for automatic tagging of Hebrew texts: For each ambiguous word, find the morpho-lexical probabilities of each possible analysis. If any of these analyses is substantially more frequent than the others, choose it as the right analysis.</Paragraph>
    <Paragraph position="2"> As we have already noted, by saying morpho-lexical probabilities we mean the probability that a given analysis is the right analysis of a word, independently of the context in which it appears. It should be emphasized that these morpho-lexical probabilities can not only be used rather naively in the above-mentioned strategy, but can also be incorporated into other systems that exploit higher-level knowledge (syntactic, semantic, etc.). Such a system, which uses the morpho-lexical probabilities together with syntactic knowledge, is described in Levinger (1992).</Paragraph>
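If the conjecture holds, the strategy amounts to an argmax with a dominance check. A minimal Python sketch; the function name, the threshold value (0.8), and the analysis labels are illustrative assumptions, not the paper's:

```python
# A minimal sketch of the naive strategy: pick the analysis whose
# morpho-lexical probability dominates. The threshold value and the
# analysis labels are illustrative assumptions.

def naive_tag(probabilities, threshold=0.8):
    """probabilities: dict mapping each possible analysis of a word to
    its morpho-lexical probability. Returns the dominant analysis, or
    None when no analysis is frequent enough to be chosen outright."""
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] >= threshold:
        return best
    return None  # defer to higher-level (syntactic, semantic) knowledge

# With probabilities 0.09, 0.90, and 0.01, the second analysis wins:
print(naive_tag({"analysis-1": 0.09, "analysis-2": 0.90, "analysis-3": 0.01}))
```

When no analysis dominates, the probabilities are still useful as one knowledge source among several, as the text notes.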
  </Section>
  <Section position="6" start_page="387" end_page="388" type="metho">
    <SectionTitle>
5. Acquiring the Probabilities
</SectionTitle>
    <Paragraph position="0"> Adopting this approach leaves us with the problem of finding the morpho-lexical probabilities for the different analyses of every ambiguous word in the language. Since we use a large corpus for this purpose, the morpho-lexical probabilities we acquire must be considered relative to this specific training corpus.</Paragraph>
    <Paragraph position="1"> One way to acquire morpho-lexical probabilities from a corpus is to use a large tagged corpus. Given a corpus in which every word is tagged with its right analysis, we can find the morpho-lexical probabilities as reflected in the corpus. This is done by simply counting for each analysis the number of times that it was the right analysis, and using these counters to calculate the probability of each analysis being the right one. The main drawback of this solution is the need for a very large tagged corpus. No such corpus exists for modern Hebrew. Moreover, for such a solution a separate tagged corpus is required for each domain. The method we are about to present saves us the laborious effort of tagging a large corpus, and enables us to find a good approximation to the morpho-lexical probabilities by learning about them from an untagged corpus.</Paragraph>
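The counting procedure over a tagged corpus can be sketched as follows. The representation of the corpus as (word, right-analysis) pairs is an assumption made for illustration, and the toy data reuses the word AWLM discussed later in the paper, with made-up counts:

```python
from collections import Counter, defaultdict

def probabilities_from_tagged_corpus(tagged_tokens):
    """tagged_tokens: iterable of (word, right_analysis) pairs, as they
    would appear in a manually tagged corpus. Returns, for each word,
    a dict mapping each observed analysis to its relative frequency."""
    counts = defaultdict(Counter)
    for word, analysis in tagged_tokens:
        counts[word][analysis] += 1
    probs = {}
    for word, ctr in counts.items():
        total = sum(ctr.values())
        probs[word] = {a: n / total for a, n in ctr.items()}
    return probs

# Toy tagged corpus: AWLM tagged three times as 'but', once as 'a hall'.
corpus = [("AWLM", "but"), ("AWLM", "but"), ("AWLM", "but"), ("AWLM", "a hall")]
print(probabilities_from_tagged_corpus(corpus)["AWLM"])
# {'but': 0.75, 'a hall': 0.25}
```

The drawback the text identifies is precisely that the input to this routine does not exist for modern Hebrew.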
    <Paragraph position="2"> Using this method, one can easily move to a new domain by applying the method to a new untagged corpus suited to this new domain.</Paragraph>
    <Paragraph position="3">  Moshe Levinger et al. Learning Morpho-Lexical Probabilities This might seem, at first sight, an impossible mission. When we see the word HQPH in an untagged corpus we cannot automatically decide which of its possible readings is the right one. The key idea is to shift each of the analyses of an ambiguous word in such a way that they all become distinguishable. To be more specific, for each possible analysis (lexical entry + the morphological information), we define a set of words that we call Similar Words (SW). An element in this set is another word form of the same lexical entry that has similar morphological attributes to the given analysis. These words are assumed similar to the analysis in the sense that we expect them to have approximately the same frequency in the language as the analysis they belong to.</Paragraph>
    <Paragraph position="4"> A reasonable assumption of this kind would be, for instance, to say that the masculine form of a verb in a certain tense in Hebrew is expected to have approximately the same frequency as the feminine form of the same verb, in the same tense. This assumption holds for most of the Hebrew verbs, since all Hebrew nouns (and not only animate ones) have the gender attribute.4 To see a concrete example, consider the word R^H and one of its analyses: the verb 'to see', masculine, singular, third person, past tense. A similar word for this analysis is RATH, feminine, singular, third person, past tense.</Paragraph>
    <Paragraph position="5"> The choice of which words should be included in the SW set of a given analysis is determined by a set of pre-defined rules based on the intuition of a native speaker. Nevertheless, the elements in the SW sets are not determined for each analysis separately, but rather are generated automatically, for each analysis, by changing the contents of one or several morphological attributes in the morphological analysis.</Paragraph>
    <Paragraph position="6"> In the previous example the elements are generated by changing the contents of the gender attribute in the morphological analysis, while keeping all the other attributes unchanged.</Paragraph>
    <Paragraph position="7"> The set of rules used by the algorithm for automatic generation of SW sets for each analysis in the language is of a heuristic nature. For the problem in Hebrew, a set of ten rules5 was sufficient for the generation of SW sets for all the possible morphological analyses in Hebrew. If we wish to move to some other domain in Hebrew, we should be able to use the same set of rules, but with a suitable training corpus. Hence, the set of rules is language-dependent but not domain-dependent.</Paragraph>
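The attribute-substitution idea behind the rules can be sketched as follows. The dictionary representation of an analysis and the rule format are assumptions made for illustration; the paper's actual rules drive a morphological generator that produces real word forms:

```python
# A sketch of SW-set generation by attribute substitution: a rule
# replaces the contents of one or more morphological attributes while
# keeping all other attributes unchanged. The dict layout is an
# illustrative assumption.

def apply_rule(analysis, substitutions):
    """Return variant analyses with the given attributes replaced,
    keeping all other attributes unchanged."""
    variants = []
    for attribute, new_value in substitutions:
        if analysis.get(attribute) != new_value:
            variant = dict(analysis)
            variant[attribute] = new_value
            variants.append(variant)
    return variants

# The verb reading of R^H: masculine, singular, third person, past tense.
analysis = {"lexeme": "R^H", "pos": "verb", "gender": "masculine",
            "number": "singular", "person": "third", "tense": "past"}

# A gender-flip rule yields the analysis of the similar word RATH.
for variant in apply_rule(analysis, [("gender", "feminine")]):
    print(variant["gender"])  # feminine
```

Because a rule only manipulates attributes, the same rule applies unchanged in any domain, which is why the rule set is language-dependent but not domain-dependent.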
    <Paragraph position="8"> To clarify this point, consider the word MCBY&amp;, which has the following two morphological analyses:</Paragraph>
    <Paragraph position="10"> 1. The verb HCBY&amp;, masculine, singular, present tense ('indicates' or 'votes').</Paragraph>
    <Paragraph position="11"> 2. The noun MCBY&amp; ('a pointer').</Paragraph>
    <Paragraph position="12"> The set of rules defined for Hebrew would enable us to observe that in the domain of daily newspaper articles, the first analysis probably has a high morpho-lexical probability while the second analysis has a very low probability. Using the same set of rules, we should be able to deduce for a domain of articles dealing with computer languages that the second analysis is probably much more frequent than the first one. Whenever we wish to apply our method to some other language that has a similar 4 This assumption does not hold for a small number of verbs that take as a subject only animate nouns with a specific gender, such as YLDH ('she gave birth').</Paragraph>
  </Section>
  <Section position="7" start_page="388" end_page="389" type="metho">
    <SectionTitle>
5 See Appendix B for the list of the rules used for Hebrew.
</SectionTitle>
    <Paragraph position="0"> Computational Linguistics Volume 21, Number 3 ambiguity problem, all we need to do is define a new set of rules for generation of SW sets in that other language.</Paragraph>
    <Paragraph position="1"> By choosing the elements in the SW set carefully so that they meet the requirement of similarity, we can study the frequency of an analysis from the frequencies of the elements in its SW set. Note that we should choose the words for the SW sets such that they are morphologically unambiguous. We assume that this is the case in the following examples, and will return to this issue in the next two sections.</Paragraph>
    <Paragraph position="2"> To illustrate the whole process, let us reconsider the ambiguous word HQPH and its three different analyses. The SW sets for each analysis are as follows:</Paragraph>
    <Paragraph position="4"> HQPM (masculine 'their perimeter'), HQPN (feminine 'their perimeter') }.</Paragraph>
    <Paragraph position="5"> Given the SW set of each analysis we can now find in the corpus how many times each word appears, calculate the expected frequency of each analysis, and get the desired probabilities by normalizing the frequency distribution.</Paragraph>
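In the simplest case, where no similar word is shared between SW sets, the counters turn into probabilities by plain normalization. A sketch, using the fictitious counters 18, 180, and 2 discussed below:

```python
# Plain normalization of per-analysis counters into probabilities.
# This covers only the simple case with no overlap between SW sets;
# the overlapping case needs the iterative algorithm of Section 6.

def normalize(counters):
    """counters: dict mapping each analysis to the (average) counter of
    its SW set. Returns the normalized probability distribution."""
    total = sum(counters.values())
    return {analysis: n / total for analysis, n in counters.items()}

# The fictitious counters for the three analyses of HQPH (18, 180, 2):
print(normalize({1: 18, 2: 180, 3: 2}))  # {1: 0.09, 2: 0.9, 3: 0.01}
```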
    <Paragraph position="6"> Had our similarity assumption been totally correct, namely, that each word in the SW set appears exactly the same number of times as the related analysis, we would have expected to get a neat situation such as the following (assuming that the ambiguous word HQPH appears 200 times in the corpus): 6</Paragraph>
    <Paragraph position="8"> These counters suggest that if we manually tagged the 200 occurrences of the string HQPH in the corpus, we would find that the first analysis of HQPH is the right one 18 times out of the 200 times that the word appears in the corpus, that the second analysis is the right one 180 times, and that the third analysis is the right analysis only twice.</Paragraph>
    <Paragraph position="9"> Using these counters we can relate the following morpho-lexical probabilities to the three analyses of HQPH: 0.09, 0.90, 0.01, respectively. These probabilities must be considered an approximation to the real morpho-lexical probabilities, for the following reasons:</Paragraph>
    <Paragraph position="11"> 1. The words in the SW set are only expected to appear approximately the same number of times as the analysis they represent.</Paragraph>
    <Paragraph position="12"> 2. The reliability of the probabilities we acquire using our method depends on the number of times the ambiguous word appears in the corpus</Paragraph>
  </Section>
  <Section position="8" start_page="389" end_page="390" type="metho">
    <SectionTitle>
6 The numbers in this example are fictitious. They were chosen in order to clarify our point.
</SectionTitle>
    <Paragraph position="0"> (which is really the size of the sample we use to calculate the morpho-lexical probabilities).</Paragraph>
    <Paragraph position="1"> In the corpus we worked with, the word HQPH appeared 202 times, and the numbers of occurrences of the words in its SW sets were as follows:</Paragraph>
    <Paragraph position="3"> By applying the algorithm of the next section to these counters, we can calculate the desired probabilities.</Paragraph>
  </Section>
  <Section position="9" start_page="390" end_page="396" type="metho">
    <SectionTitle>
6. The Algorithm
</SectionTitle>
    <Paragraph position="0"> Our algorithm has to handle the frequently occurring case in which a certain word appears in more than one SW set. In that case, we would like to consider the counter of such a word appropriately. The algorithm takes care of this problem and works as follows: Initially we assume that the proportions between the different analyses are equal.</Paragraph>
    <Paragraph position="1"> For each analysis we compute its average number of occurrences, by summing the counters of all the words in the SW set and dividing this sum by the SW size. Note that at this stage we also include the ambiguous word in each of the SW sets.7 If a word appears in several SW sets, we calculate its contribution to the total sum according to the proportions between all those sets, using the proportions calculated in the previous iteration.</Paragraph>
    <Paragraph position="2"> Calculate the new proportions between the different analyses by computing the proportions between the average number of occurrences of each analysis.</Paragraph>
    <Paragraph position="3"> This process is iterated until the new proportions calculated are sufficiently close to the proportions calculated in the previous iteration. Finally, the proportions are normalized to obtain probabilities.</Paragraph>
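The iterative procedure above can be sketched as follows. The data layout, the parameter names, and the toy counters are assumptions made for illustration; the paper's authoritative formulation is the pseudo-code of Figure 1:

```python
# A sketch of the iterative averaging algorithm: shared words split
# their counters among the SW sets they belong to, in proportion to
# the previous iteration's estimates. Data layout, epsilon, and the
# toy example below are illustrative assumptions.

def morpho_lexical_probs(sw_sets, counters, epsilon=0.001, max_iter=100):
    """sw_sets: dict analysis -> list of similar words (the ambiguous
    word itself included in every set). counters: dict word -> corpus
    count. Returns dict analysis -> probability."""
    analyses = list(sw_sets)
    # membership: for each word, the analyses whose SW set contains it
    member = {}
    for a in analyses:
        for w in sw_sets[a]:
            member.setdefault(w, []).append(a)
    probs = {a: 1.0 / len(analyses) for a in analyses}  # equal start
    for _ in range(max_iter):
        averages = {}
        for a in analyses:
            total = 0.0
            for w in sw_sets[a]:
                # a shared word contributes in proportion to the
                # current estimates of the sets it belongs to
                share = probs[a] / sum(probs[b] for b in member[w])
                total += counters.get(w, 0) * share
            averages[a] = total / len(sw_sets[a])
        norm = sum(averages.values())
        new_probs = {a: averages[a] / norm for a in analyses}
        delta = max(abs(new_probs[a] - probs[a]) for a in analyses)
        probs = new_probs
        if epsilon > delta:   # sufficiently close: stop iterating
            break
    return probs

# Toy data: the ambiguous word W belongs to both SW sets.
sw = {"a1": ["W", "X"], "a2": ["W", "Y"]}
counts = {"W": 100, "X": 90, "Y": 10}
p = morpho_lexical_probs(sw, counts)
print(round(p["a1"], 2))  # converges near 0.9
```

Because each shared word's counter is split according to the previous iteration's proportions, frequent unambiguous similar words gradually pull probability mass toward their own analysis.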
    <Paragraph position="4"> A formal description of the algorithm, written in pseudo-code, is given in Figure 1. 7 This is done mainly in order to handle cases where a certain analysis has an empty SW set, since it does not have naturally similar words. The third example in the next section serves to clarify this point.</Paragraph>
    <Paragraph position="6"/>
    <Paragraph position="8"> Although this method for acquiring morpho-lexical probabilities gives very good results for many ambiguous words, as will be shown in Section 8, we detected two types of inherently problematic cases:</Paragraph>
    <Paragraph position="10"> 1. Because of the high degree of morphological ambiguity in Hebrew, some of the words in the SW sets may also be ambiguous. As long as the other possible analyses of such a word are not too frequent, it only slightly affects the final probabilities. Otherwise, we might get wrong results by erroneously crediting the high number of occurrences of such a word9 to one of the analyses. For this reason, we try to construct the SW sets from as many suitable elements as possible, in order to be able to detect &amp;quot;misleading&amp;quot; words of this sort.</Paragraph>
    <Paragraph position="11"> 2. Occasionally, the SW sets defined for two different analyses are actually the same. Thus, a differentiation between those two analyses cannot be made using our method.</Paragraph>
    <Paragraph position="12"> Another potentially problematic case is the coverage problem, which arises whenever we do not have enough data in the corpus for disambiguation of a certain word (see a discussion of this problem in Dagan, Itai, and Schwall \[1991\]). This problem was found to occur very rarely--for only 3% of the ambiguous words in our test texts were the counters found in the corpus smaller than 20. We expect this percentage would be even smaller had we used a larger training corpus. For such words, we simply ignored the data and arbitrarily gave a uniform probability to all their analyses.
7. Examples
Several aspects of the algorithm described in the previous section can be better understood by looking at some clarifying examples. To see an example of the convergence of the algorithm, consider the neat situation described in Section 5 for the word HQPH:</Paragraph>
    <Paragraph position="14"> For these sets and counters and for ε = 0.001, the algorithm converges after 10 iterations. The probabilities for each iteration are given below:  9 For technical reasons, we cannot decide whether a given word is ambiguous or not when we automatically generate the words for the SW sets. See Section 7 for more details.  In this example the similarity assumption holds, and the words in the SW sets (excluding the word HQPH itself) are also unambiguous. This need not hold in other situations.</Paragraph>
    <Paragraph position="15"> As we have pointed out already, for technical reasons we have not been able to apply the morphological analyzer to the words in the SW sets, and thus we have not been able to automatically observe that a given similar word is ambiguous by itself. The problem stems from the fact that we have been able to use the morphological analyzer on personal computers only, while both the corpus and the program that automatically generates the SW sets for each analysis could be used only on our mainframe computer. Given this, the morphological analyzer was only used in order to obtain the input files for the disambiguation project.</Paragraph>
    <Paragraph position="16"> Nonetheless, the fact that ambiguous words in the SW sets cannot be automatically identified does not affect the quality of the probabilities obtained by our method for most ambiguous words.10 To see the reason for this, consider the word XWD$ and its two analyses:</Paragraph>
    <Paragraph position="18"> Both XWD$H and XWD$W (SW2) are ambiguous words. Still, since the counters for these two words are substantially smaller than the counter for the word HXWD$ (SW1), the probabilities calculated according to these counters can be considered as a reasonable approximation for the real morpho-lexical probabilities. The algorithm, applied to these sets and counters, yielded the following probabilities: P1 = 0.961,</Paragraph>
    <Paragraph position="20"> This kind of situation is not unique to the word XWD$. Similar situations occur for many other ambiguous words in Hebrew. Hence, not having the ability to identify ambiguous words in the SW sets has a meaningful effect on the quality of the probabilities only in cases where some similar word is ambiguous and its other analysis is frequent in the language. In such cases the analysis that this word belongs to is assigned a higher probability than its real morpho-lexical probability. We use the term misleading words for such ambiguous similar words.</Paragraph>
    <Paragraph position="21"> A partial solution for such cases was implemented in the revised algorithm we used for morpho-lexical probabilities calculation. In this revised version we automatically identified similar words as misleading words by looking at the counters of all the similar words in a given SW set. A word was considered misleading if its counter was at least five times greater than that of any other word in the set. This solution was not applicable in cases where all the similar words in a given SW set were misleading words.</Paragraph>
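The misleading-word heuristic of the revised algorithm can be sketched directly. The set contents and the counters in the example are hypothetical, chosen only to show the five-times test firing on a word like HMWNH:

```python
# A sketch of the misleading-word heuristic: a similar word is flagged
# when its counter is at least five times greater than that of every
# other word in its SW set. The example data is hypothetical.

def misleading_words(sw_set, counters, factor=5):
    """Return the similar words whose counters dwarf all others in the
    set; empty when the set has fewer than two words."""
    flagged = []
    for w in sw_set:
        others = [counters.get(v, 0) for v in sw_set if v != w]
        if others and counters.get(w, 0) >= factor * max(others):
            flagged.append(w)
    return flagged

# Hypothetical SW set in which HMWNH dwarfs the other similar word:
print(misleading_words(["HMWNH", "MWNYM"], {"HMWNH": 500, "MWNYM": 40}))
# ['HMWNH']
```

As the text notes, the heuristic has nothing to compare against when every word in a set is misleading, so it is only a partial solution.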
    <Paragraph position="22"> 10 In our test sample of 53 words, the probabilities were significantly affected by this phenomenon in only three cases.  The need to add the original ambiguous word to all the SW sets of its analyses can be made clear by the following example. Consider the word AT and its sets and counters, as found in our training corpus:  1. The direct object particle for definite nouns, AT.</Paragraph>
    <Paragraph position="23"> SW 1 = { AT = 197,501 } 2. The feminine, singular, second person, nominal personal pronoun AT (feminine 'you').</Paragraph>
    <Paragraph position="25"> The key point here is that the particle AT has no natural similar word.11 Yet, from the above counters we should be able to deduce that the first analysis has a very high morpho-lexical probability. This is because the ambiguous word AT is very frequent in the corpus, while the counters in the SW sets for the second and third analyses indicate that these analyses are not the &amp;quot;reason&amp;quot; for the high frequency of AT in the corpus.</Paragraph>
    <Paragraph position="26"> Adding the ambiguous word to all the SW sets allows the algorithm to take this fact into account. Applying the algorithm on the above sets and counters yields the following morpho-lexical probabilities: P1 = 0.9954, P2 = 0.0045, P3 = 0.0001.</Paragraph>
    <Paragraph position="27"> 8. Evaluating the Probabilities
Before we evaluate the quality of the approximated probabilities that can be acquired using our method, we would like to start with a definition of three terms that will be used in this section:
Morpho-Lexical Probabilities Estimated from a Training Corpus
Given a large corpus in Hebrew, the morpho-lexical probabilities of a given word are the probabilities of its analyses as calculated by manually tagging all the occurrences of the given word in the corpus. We will use the abbreviation morpho-lexical probabilities to denote this term.</Paragraph>
    <Paragraph position="28"> Morpho-Lexical Probabilities Estimated over a Test-Corpus
In order to avoid the laborious effort needed for the manual tagging of all the occurrences of an ambiguous word in a large corpus, we estimate the morpho-lexical probabilities by calculating them from a relatively small corpus. The abbreviation test-corpus probabilities will be used for this term.</Paragraph>
    <Paragraph position="29"> Approximated Probabilities
Given an ambiguous word, the approximated probabilities of the word are the probabilities calculated using the method described in this paper.</Paragraph>
    <Paragraph position="30"> The approximated probabilities obtained by our method were evaluated by comparing these probabilities with test-corpus probabilities obtained by manual tagging of a relatively small corpus. Since the approximation we acquire depends on the corpus we have been using--texts taken from the Hebrew newspaper Ha'aretz12--we have to calculate the test-corpus probabilities from texts taken from the same source. For this purpose we used a small corpus consisting of more than 500,000 word-tokens taken from the same newspaper.</Paragraph>
    <Paragraph position="31"> For our experiment we picked from this small corpus two kinds of test groups. Test-group1 consisted of 30 ambiguous word-types chosen randomly from all the ambiguous word types appearing more than 100 times in the corpus. For the second test group, test-group2, we randomly picked a short text from the corpus from which we extracted all the ambiguous word-tokens appearing at least 30 times in the small corpus. This test group consisted of 23 words.</Paragraph>
    <Paragraph position="32"> These two test groups are of a different nature. Test-group1 consists only of very frequent word types in Hebrew, but the test-corpus probabilities for these word types can be viewed as a reliable estimate of the morpho-lexical probabilities. The word-tokens in test-group2 better represent the typical ambiguous word in the language, but their test-corpus probabilities were calculated from a relatively small sample of tagged words.</Paragraph>
    <Paragraph position="33"> For each word in these test groups, we extracted from the small corpus all the sentences in which the ambiguous word appears. We then manually tagged each ambiguous word and found for each one of its analyses how many times it was the right analysis. For example, the word AWLM (taken from test-group1) has the following two morphological analyses:  1. The particle AWLM ('but').</Paragraph>
    <Paragraph position="34"> 2. The noun AWLM ('a hall').</Paragraph>
    <Paragraph position="35">  The word AWLM appeared 236 times in the small corpus. By manually tagging all the relevant sentences we found that the first analysis, 'but,' was the right analysis 232 times, and the second analysis, 'a hall,' was the right analysis only 4 times. Given these numbers we can calculate the relative weights of these two analyses: 232/236, 4/236 and the test-corpus probabilities: 0.983, 0.017, respectively. In the same way, using the small corpus we found the test-corpus probability, Ptest, for each of the analyses in the test groups.</Paragraph>
    <Paragraph position="36"> Table 2 shows the test-corpus probabilities and the approximated probabilities for five representative ambiguous words from our test groups. In this table the approximation for the probabilities of the first three words is very good, while the approximation for the fourth word is quantitatively poor but still succeeds in identifying the first analysis of LPNY ('before') as the dominant analysis. As for the fifth word, the approximation we got is totally incorrect. At the end of this section we shall identify some cases for which our method fails to find a reasonable approximation for the morpho-lexical probabilities of an ambiguous word.</Paragraph>
    <Paragraph position="37"> In order to evaluate the quality of the approximation we got by our method, we should compare the approximated probabilities for the words in these test groups with the test-corpus probabilities we found.</Paragraph>
    <Paragraph position="38"> When we tried to make a quantitative comparison using statistical methods, we found that for many analyses Papp &amp;quot;looks&amp;quot; like a good approximation for Ptest, but from a statistical point of view the approximation is not satisfactory. The main reason for this is that the words in the SW set of a given analysis can be considered similar in their frequency to the analysis only from a qualitative point of view, and not from a quantitative one. Thus, the comparison we describe in what follows serves to evaluate the quality of the approximated probabilities.</Paragraph>
    <Paragraph position="39"> Motivated by the way we use the morpho-lexical probabilities for morphological disambiguation, we can divide the probability of an analysis into three categories:</Paragraph>
    <Paragraph position="41"> Very high probability An analysis with a probability from this category is the dominant analysis of the ambiguous word and thus, given that we cannot use any other source of information to disambiguate the given word, we would like to select the dominant analysis as the right analysis.</Paragraph>
    <Paragraph position="42"> Very low probability Given no other information, an analysis with a very low probability should be treated as a wrong analysis.</Paragraph>
    <Paragraph position="43"> All other probabilities An analysis with probability of this sort should not be selected as wrong/right analysis solely according to its morpho-lexical probability.</Paragraph>
    <Paragraph position="44"> Formally, the mapping from the probability of an analysis to its category is done using two thresholds, upper threshold and lower threshold, as follows:</Paragraph>
    <Paragraph position="46"/>
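The mapping can be reconstructed from the description: category 1 above the upper threshold, category 2 below the lower threshold, and category 3 otherwise. A sketch, with the default thresholds taken from the "good approximation" setting of Section 8; the handling of exact boundary values is an assumption:

```python
# Category mapping via two thresholds. Defaults follow the "good
# approximation" setting (0.20 / 0.80); whether boundary values fall
# into categories 1 and 2, rather than 3, is an assumption here.

def cat(p, lower=0.20, upper=0.80):
    """Map a morpho-lexical probability to category 1 (very high),
    2 (very low), or 3 (all other probabilities)."""
    if p >= upper:
        return 1   # dominant analysis: select it
    if lower >= p:
        return 2   # treat as a wrong analysis
    return 3       # decide using other knowledge sources

# The AWLM probabilities from this section fall into categories 1 and 2:
print(cat(0.983), cat(0.017))  # 1 2
```

A word then counts as well approximated when cat(Ptest) equals cat(Papp) for all of its analyses, under the threshold pair in force.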
  </Section>
  <Section position="10" start_page="396" end_page="398" type="metho">
    <SectionTitle>
3 otherwise
</SectionTitle>
    <Paragraph position="0"> The quality of the approximated probabilities we acquire using our method is now measured by examining the proportion of words for which the estimated category for each of their analyses agrees with the category defined by the approximated probabilities. The results of this comparison for the two test groups we used are shown in Table 3 and Table 4. In these tables we divide the words into three groups according to the quality of the approximation found for them: 1. Words with good approximation--words for which CAT(Ptest) = CAT(Papp) holds for all their analyses, using: lower threshold = 0.20, and upper threshold = 0.80. (The first three words in Table 2 belong to this category).</Paragraph>
    <Paragraph position="1"> 2. Words with reasonable approximation--words that do not fall into the previous category, but CAT(Ptest) = CAT(Papp) holds for all their analyses, using: lower threshold = 0.35, and upper threshold = 0.65. (The fourth word in Table 2 belongs to this category).</Paragraph>
    <Paragraph position="2"> 3. Words with incorrect approximation--the words whose approximation is neither good nor reasonable. (The fifth word in Table 2 belongs to this category).</Paragraph>
    <Paragraph position="3"> From these tables we can see that our method yielded an incorrect approximation for only 5 words out of the 53 words in the test groups (9.5%). By looking closely at these words, we can identify two reasons for failure:</Paragraph>
    <Paragraph position="5"> 1. Ambiguity of a word in the SW set of a given analysis. This may affect the probabilities calculated for this analysis. To see this, consider the word MWNH (test-group2), one analysis of which is the noun MWNH ('a counter'). By manually tagging all the occurrences of MWNH in our small corpus, we found that the above-mentioned analysis is extremely rare--its relative weight is 0/44. As for the approximated probability of this analysis, its SW set contains a single word: HMWNH ('the counter'), the definite form of the same noun. The word HMWNH is very frequent in our corpus, and for that reason the approximated probability found for this analysis is very high: 0.894. The mismatch between Ptest and Papp in this case is due to the fact that HMWNH is a misleading word--an ambiguous word, one analysis of which, H + the present form of MNH ('numbered'), is a frequent idiom in Hebrew ('which numbers').</Paragraph>
    <Paragraph position="6"> 2. Our method may also yield an incorrect approximation for analyses where the similarity assumption we use between the frequency of an analysis and the frequency of the words in its SW set does not hold. An example of this is the word $&amp;H (test-group2), and one of its analyses, the noun $&amp;H ('an hour'). The approximated probability for this analysis is calculated by looking at the frequency of the similar word H$&amp;H ('the hour'). Unfortunately, the similarity assumption does not hold in this case, since the indefinite form of $&amp;H is much more frequent in Hebrew than the definite form of the word. For this reason,13 the approximated probability for this analysis (0.376) is substantially lower than its test-corpus probability (0.847).</Paragraph>
  </Section>
  <Section position="11" start_page="398" end_page="400" type="metho">
    <SectionTitle>
9. Morphological Disambiguation
</SectionTitle>
    <Paragraph position="0"> In the previous section we compared the approximated probabilities obtained by our method to the probabilities found by manually tagging a small corpus. We found that the acquired probabilities are truly a good approximation for the morpho-lexical probabilities. In this section we describe an experiment that was conducted in order to test the effectiveness of the morpho-lexical probabilities for morphological disambiguation in Hebrew.</Paragraph>
    <Paragraph position="1"> Following are the main components in our project that were used in order to conduct the experiment:  1. A robust morphological analyzer for Hebrew that gives for each word in the language all its possible analyses. The input for our project is supplied by this module.</Paragraph>
    <Paragraph position="2"> 2. An interactive program for manually tagging Hebrew texts. It was created in order to rapidly tag large texts and was used to mark the right analysis for each ambiguous word in order to be used later to evaluate the performance of our method.</Paragraph>
    <Paragraph position="3"> 3. Untagged Hebrew corpus. Because Hebrew corpora, tagged or untagged, are not publicly available, we had to build a Hebrew corpus especially for this project. This corpus consists of 11 million word-tokens taken from the daily newspaper Ha'aretz.</Paragraph>
    <Paragraph position="4"> 4. A hash table that stores all the words in the corpus. Each word is accompanied by a counter indicating how many times it appears in the corpus. Since this is the only information we extract from the corpus, our algorithm needs only this hash table and is therefore very efficient.
5. A morphological generator for Hebrew that was written especially for this project. The SW sets for every analysis are generated using this module. For technical reasons, we were not able to use the morphological analyzer at this stage, and thus we could not identify ambiguous words in the SW sets.</Paragraph>
    <Paragraph position="5"> 6. An implementation of the iterative algorithm that calculates the probabilities.</Paragraph>
    <Paragraph position="6"> 13 The indefinite form of $&amp;H appears in many Hebrew idioms, e.g., LPY $&amp;H ('for the time being'), B^WTH $&amp;H ('at the same time'), etc.</Paragraph>
    <Paragraph position="7"> Computational Linguistics Volume 21, Number 3

7. A simple selection algorithm that reduces the level of morphological ambiguity using the probabilities obtained from the corpus. The algorithm uses two thresholds, an upper threshold and a lower threshold, which serve to choose the right analysis or to rule out wrong analyses, respectively.</Paragraph>
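The two-threshold selection step can be sketched as follows. This is a minimal illustration of the idea rather than the authors' implementation; the threshold values, the fallback behavior, and the analysis labels are all assumptions made for the example.

```python
def select_analyses(probs, upper=0.9, lower=0.05):
    """Two-threshold selection over {analysis: probability}.
    If some analysis reaches `upper`, choose it outright (full
    disambiguation); otherwise rule out analyses whose
    probability falls below `lower`."""
    best = max(probs, key=probs.get)
    if probs[best] >= upper:
        return [best]
    remaining = [a for a, p in probs.items() if p >= lower]
    # Never reject everything: keep the most probable analysis.
    return remaining or [best]

# Hypothetical probabilities for a 3-way ambiguous word:
print(select_analyses({"noun": 0.93, "verb": 0.05, "prep+noun": 0.02}))
print(select_analyses({"noun": 0.60, "verb": 0.37, "prep+noun": 0.03}))
```

The first call fully disambiguates; the second merely prunes the least likely analysis, leaving two candidates for a higher-level module.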
    <Paragraph position="8"> A set of 21 articles was selected in order to test the performance of the method. Since the morpho-lexical probabilities we use are calculated from a large Hebrew corpus (representing a certain Hebrew sublanguage), these 21 texts were randomly selected from texts belonging to the same sublanguage. The total number of word-tokens in these test texts was 3,400, out of which nearly 50% were morphologically ambiguous.</Paragraph>
    <Paragraph position="9"> The reason for testing the method on only a relatively small set of test texts is that no tagged Hebrew corpus is currently available for a more powerful evaluation. The need to manually tag the texts used for evaluation limited the number of words in the test texts. Nevertheless, we believe that the results obtained for this restricted set of texts give a fairly good indication of the success of the method on larger texts as well.</Paragraph>
    <Paragraph position="10"> We tested the performance of the method on the test texts from two different perspectives. First, we used the probabilities only for ambiguous words that can be fully disambiguated; in this case a single analysis can be selected as the right analysis. The performance of the method for full disambiguation is measured by the recall parameter, which is defined as follows:

Recall = no. of correctly assigned words / no. of ambiguous words

In addition to this parameter we present two further performance parameters: applicability and precision. We believe that these parameters are relevant for the particular naive method described in the current section, because the morpho-lexical probabilities are not meant to be used alone for disambiguation, but rather to serve as one information source in a system that combines several linguistic sources for disambiguation. These parameters are defined as follows:

Precision = no. of correctly assigned words / no. of fully disambiguated words

Applicability = no. of fully disambiguated words / no. of ambiguous words

The results obtained for full disambiguation are shown in Table 5. However, the morpho-lexical probabilities can also be used to reduce the ambiguity level in the text. The performance of the method in this sense is much more interesting and important, since it examines more accurately the quality of the probabilities as data for other, more sophisticated systems that use higher levels of information. In this experiment we test the performance of the morpho-lexical probabilities on the task of analysis assignment. Here, one or more analyses of an ambiguous word are recognized as wrong and hence rejected; the right analysis should be one of the remaining analyses. The three parameters used for evaluation are as follows:

Recall = no. of correct right assignments / no. of ambiguous words</Paragraph>
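The full-disambiguation parameters are simple ratios over three counts. A minimal sketch (the counts below are invented for illustration and are not the paper's results):

```python
def full_disambiguation_scores(n_ambiguous, n_disambiguated, n_correct):
    """Recall, precision, and applicability for the
    full-disambiguation setting."""
    recall = n_correct / n_ambiguous           # correct / ambiguous
    precision = n_correct / n_disambiguated    # correct / fully disambiguated
    applicability = n_disambiguated / n_ambiguous
    return recall, precision, applicability

# Hypothetical counts: 1,000 ambiguous words, 800 fully
# disambiguated by the thresholds, 720 of those correct.
r, p, a = full_disambiguation_scores(1000, 800, 720)
```

Note that recall = precision × applicability, so reporting all three is redundant but makes the trade-off explicit.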
    <Paragraph position="12"> Precision = no. of correct assignments / no. of remaining analyses

Fallout = no. of incorrect assignments / no. of wrong analyses

The results are shown in Table 6. In another experiment we examined 891 words with more than two analyses. Table 7 shows how our algorithm reduced the ambiguity of these words.</Paragraph>
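Given the gold analysis and the set of analyses remaining after pruning for each ambiguous word, the recall and fallout parameters of the analysis-assignment task can be computed directly. The sketch below is illustrative only; the tuple layout and sample data are assumptions, not taken from the paper.

```python
def assignment_scores(items):
    """items: list of (gold_analysis, all_analyses, remaining_analyses)
    per ambiguous word, after the pruning step."""
    n_ambiguous = len(items)
    # Recall: words whose right analysis survived the pruning.
    n_correct = sum(gold in remaining for gold, _all, remaining in items)
    # Wrong analyses: every analysis except the gold one.
    n_wrong = sum(len(analyses) - 1 for _, analyses, _ in items)
    # Incorrect assignments: wrong analyses that were kept.
    n_incorrect = sum(len([a for a in remaining if a != gold])
                      for gold, _, remaining in items)
    recall = n_correct / n_ambiguous
    fallout = n_incorrect / n_wrong
    return recall, fallout

# Hypothetical data: two ambiguous words and their pruned sets.
items = [
    ("noun", ["noun", "verb", "prep+noun"], ["noun", "verb"]),
    ("verb", ["noun", "verb"], ["noun"]),
]
recall, fallout = assignment_scores(items)
```

Low fallout means the pruning rarely keeps wrong analyses, which is what matters when the output feeds a higher-level disambiguation module.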
    <Paragraph position="13"> These results demonstrate the effectiveness of morpho-lexical probabilities in reducing the ambiguity level in Hebrew text, and it seems that such information, combined with other approaches to morphological disambiguation in Hebrew, brings us very close to a practical solution to this problem.</Paragraph>
  </Section>
</Paper>