File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/95/j95-3004_abstr.xml

Size: 7,369 bytes

Last Modified: 2025-10-06 13:48:22

<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-3004">
  <Title>Alon Itai (Technion), Uzzi Ornan (Technion)</Title>
  <Section position="2" start_page="0" end_page="384" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> This paper addresses the problem of morphological disambiguation in Hebrew by extracting statistical information from an untagged corpus. Yet, the primary point is not to propose a method for morphological disambiguation per se, but rather to suggest a method to compute morpho-lexical probabilities to be used as a linguistic source for morphological disambiguation. Let us start with a few definitions and terminology that will be used throughout this paper.</Paragraph>
    <Paragraph position="1"> We consider written languages, and for the purpose of this paper, a word is a string of letters delimited by spaces or punctuation. Given a language $L$ and a word $w \in L$, we can find (manually, or automatically with a morphological analyzer for $L$) all the possible morphological analyses of $w$. If $w$ has $k$ different analyses, we denote them $A_1, \ldots, A_k$. A word is morphologically ambiguous if $k \geq 2$. The number and character of the analyses depend on the language model. We have used the definitions of the automatic morphological analyzer developed at the IBM Scientific Center, Haifa, Israel (Bentur, Angel, and Segev 1992).</Paragraph>
    <Paragraph position="2"> Given a text $T$ with $n$ words, $w_1, \ldots, w_n$, for each morphologically ambiguous word $w_i \in T$ with $k$ analyses $A_1, \ldots, A_k$, there is one analysis $A_r \in \{A_1, \ldots, A_k\}$ that is the right analysis, while all the other $k - 1$ analyses of $w_i$ are wrong analyses. The same word $w_i$ in a different text may, of course, have a different right analysis; thus, right and wrong are meaningful only with respect to the context in which $w_i$ appears.</Paragraph>
    <Paragraph position="3"> © 1995 Association for Computational Linguistics. Computational Linguistics, Volume 21, Number 3.</Paragraph>
    <Paragraph position="4"> Morphological disambiguation of a text $T$ is done by indicating, for each ambiguous word in $T$, which of its different analyses is the right one. At present, this can be done manually by a speaker of the language, and hopefully in the future it will be done automatically by a computer program. When dealing with automatic disambiguation of a text, it is sometimes useful to reduce its ambiguity level. A reduction of the ambiguity level of an ambiguous word $w$ with $k$ morphological analyses $A_1, \ldots, A_k$ occurs when it is possible to select from $A_1, \ldots, A_k$ a proper subset of $l$ analyses, $1 \le l &lt; k$, such that the right analysis of $w$ is one of these $l$ analyses. In the case where $l = 1$, we say that the word $w$ is fully disambiguated.</Paragraph>
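As an illustration of the definition above, the following minimal Python sketch checks whether a selected set of analyses is a valid reduction of a word's ambiguity level. The function name `classify_reduction` and the analysis labels are hypothetical and do not come from the paper; only the conditions (a non-empty proper subset that still contains the right analysis) follow the text.

```python
def classify_reduction(analyses, selected, right_analysis):
    """Classify a candidate reduction of a word's ambiguity level."""
    all_analyses = set(analyses)   # A_1, ..., A_k
    kept = set(selected)           # the l analyses that were selected
    k, l = len(all_analyses), len(kept)

    # The selection must be a non-empty proper subset: between 1 and k - 1 analyses.
    is_proper_subset = kept.issubset(all_analyses) and k > l >= 1
    if not is_proper_subset or right_analysis not in kept:
        return "not a valid reduction"
    return "fully disambiguated" if l == 1 else "reduced"


# Hypothetical analysis labels for one ambiguous word with k = 3.
analyses = ["noun", "verb", "preposition+noun"]
print(classify_reduction(analyses, ["noun", "verb"], "noun"))  # reduced
print(classify_reduction(analyses, ["noun"], "noun"))          # fully disambiguated
print(classify_reduction(analyses, ["verb"], "noun"))          # not a valid reduction
```

The last call shows why a reduction is only useful when the right analysis survives: discarding it leaves a smaller but wrong set of candidates.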
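    <Paragraph position="5"> Since this paper suggests a method for morphological disambiguation using probabilities, the notion of morpho-lexical probabilities is also required. Our model of the language is based on a large fixed Hebrew corpus. For a word $w$ with $k$ analyses $A_1, \ldots, A_k$, the morpho-lexical probability of $A_i$ is the estimate of the conditional probability $P(A_i \mid w)$ from the given corpus, i.e.,</Paragraph>
    <Paragraph position="6"> $$P_i = P(A_i \mid w) = \frac{\text{number of occurrences of } w \text{ in the corpus whose right analysis is } A_i}{\text{number of occurrences of } w \text{ in the corpus}}$$</Paragraph>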
    <Paragraph position="7"> Note that $P_i$ is the probability that $A_i$ is the right analysis of $w$ independently of the context in which $w$ appears. Since the word $w$ has exactly $k$ different analyses, $\sum_{i=1}^{k} P_i = \sum_{i=1}^{k} P(A_i \mid w) = 1$.</Paragraph>
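To make the definition concrete, here is a minimal Python sketch that estimates morpho-lexical probabilities by relative frequency. It assumes a small hand-tagged sample in which every token is paired with its right analysis; that sample, the function name `morpho_lexical_probabilities`, and the labels are hypothetical. The paper's own contribution is to approximate these values from an untagged corpus, which this sketch does not attempt.

```python
from collections import Counter, defaultdict


def morpho_lexical_probabilities(tagged_tokens):
    """Estimate P(A_i | w) as a relative frequency over (word, right analysis) pairs."""
    counts = defaultdict(Counter)
    for word, analysis in tagged_tokens:
        counts[word][analysis] += 1

    probabilities = {}
    for word, analysis_counts in counts.items():
        total = sum(analysis_counts.values())  # occurrences of the word
        probabilities[word] = {a: n / total for a, n in analysis_counts.items()}
    return probabilities


# Hypothetical tagged sample: word "w" occurs four times, word "v" once.
sample = [("w", "A1"), ("w", "A1"), ("w", "A2"), ("w", "A1"), ("v", "B1")]
probs = morpho_lexical_probabilities(sample)
print(probs["w"])                # {'A1': 0.75, 'A2': 0.25}
print(sum(probs["w"].values()))  # 1.0
```

Because each occurrence of a word contributes to exactly one analysis, the estimates for a word sum to 1 by construction, matching the identity stated above.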
    <Paragraph position="8"> For reasons that will be elaborated in Section 2, our problem is most acute in Hebrew and some other languages (e.g., Arabic), though ambiguity problems of a similar nature occur in other languages. One such problem is sense disambiguation. In the context of machine translation, Dagan and Itai (Dagan, Itai, and Schwall 1991; Dagan and Itai 1994) used corpora in the target language to resolve ambiguities in the source language. Yarowsky (1992) proposed a method for sense disambiguation using wide contexts. Part-of-speech tagging, that is, deciding the correct part of speech in the current context of the sentence, has received major attention. Most successful methods have followed speech recognition systems (Jelinek, Mercer, and Roukos 1992) and used large corpora to deduce the probability of each part of speech in the current context (usually the two previous words, i.e., trigrams). These methods have reported performance in the range of 95-99% &quot;correct&quot; by word (DeRose 1988; Cutting et al. 1992; Jelinek, Mercer, and Roukos 1992; Kupiec 1992). (The difference in performance is due to different evaluation methods, different tag sets, and different corpora.) See Church (1992) for a survey.</Paragraph>
    <Paragraph position="9"> Our work did not use the trigram model: the relatively free word order of Hebrew made it less promising, and in some cases the competing analyses belong to the same part-of-speech category, so tagging for part of speech alone would not solve our problem. Note that a single morphological analysis may correspond to several senses. Even though each sense may have different behavior patterns, in practice this did not present a problem for our program.</Paragraph>
    <Paragraph position="10"> The rest of this paper is organized as follows. Sections 2 through 4 include a description of the morphological ambiguity problem in Hebrew, followed by the claim that knowing the morpho-lexical probabilities of an ambiguous word can be very effective for automatic morphological disambiguation in Hebrew.</Paragraph>
    <Paragraph position="11">  Moshe Levinger et al.</Paragraph>
    <Paragraph position="12"> Table 1 The dimension of morphological ambiguity in Hebrew.</Paragraph>
    <Section position="1" start_page="384" end_page="384" type="sub_section">
      <SectionTitle>
Learning Morpho-Lexical Probabilities
</SectionTitle>
      <Paragraph position="0">
No. of analyses        1       2      3      4      5      6
No. of word tokens     17,551  9,876  6,401  2,760  1,309  493
%                      45.1    25.4   16.5   7.1    3.37   1.27

No. of analyses        7     8     9     10    11     12     13
No. of word tokens     337   134   10    18    1      3      5
%                      0.87  0.34  0.02  0.05  0.002  0.007  0.01
</Paragraph>
      <Paragraph position="1"> Then, in Sections 5 and 6, we present the key idea of this paper: how to acquire a good approximation of the morpho-lexical probabilities from an untagged corpus. Using this method we can find, for each ambiguous word $w$ with $k$ analyses $A_1, \ldots, A_k$, probabilities $\bar{P}_1, \ldots, \bar{P}_k$ that approximate the morpho-lexical probabilities $P_1, \ldots, P_k$.</Paragraph>
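The percentages in Table 1 follow directly from the token counts. The short Python sketch below recomputes them and derives the share of word tokens that have two or more analyses. The variable names are illustrative only, and the counts are copied from the table as given.

```python
# Word-token counts from Table 1, keyed by number of analyses.
tokens_by_analyses = {
    1: 17551, 2: 9876, 3: 6401, 4: 2760, 5: 1309, 6: 493, 7: 337,
    8: 134, 9: 10, 10: 18, 11: 1, 12: 3, 13: 5,
}

total_tokens = sum(tokens_by_analyses.values())

# Percentage of tokens for each number of analyses.
percent = {k: 100 * n / total_tokens for k, n in tokens_by_analyses.items()}

# A token is morphologically ambiguous when it has at least two analyses.
ambiguous_tokens = sum(n for k, n in tokens_by_analyses.items() if k >= 2)
ambiguous_share = 100 * ambiguous_tokens / total_tokens

print(total_tokens)               # 38898
print(round(percent[1], 1))       # 45.1
print(round(ambiguous_share, 1))  # 54.9
```

Under these counts, more than half of the word tokens in the sample are morphologically ambiguous.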
      <Paragraph position="2"> In Section 7 we clarify some subtle aspects of the algorithm presented in Section 6 by looking at its application to several ambiguous words in Hebrew. A description of an experiment that serves to evaluate the approximated morpho-lexical probabilities calculated using an untagged corpus will be given in Section 8.</Paragraph>
      <Paragraph position="3"> Finally, in Section 9, a simple strategy for morphological disambiguation in Hebrew using morpho-lexical probabilities will be described. This simple strategy was used in an experiment conducted in order to test the significance of the morpho-lexical probabilities as a basis for morphological disambiguation in Hebrew. The experiment shows that using our method we can significantly reduce the level of ambiguity in a Hebrew text.</Paragraph>
    </Section>
  </Section>
</Paper>