But Dictionaries Are Data Too

ABSTRACT

Although empiricist approaches to machine translation depend vitally on data in the form of large bilingual corpora, bilingual dictionaries are also a source of information. We show how to model at least a part of the information contained in a bilingual dictionary so that we can treat a bilingual dictionary and a bilingual corpus as two facets of a unified collection of data from which to extract values for the parameters of a probabilistic machine translation system. We give an algorithm for obtaining maximum likelihood estimates of the parameters of a probabilistic model from this combined data, and we show how these parameters are affected by inclusion of the dictionary for some sample words.

There is a sharp dichotomy today between rationalist and empiricist approaches to machine translation: rationalist systems are based on information cajoled fact by reluctant fact from the minds of human experts; empiricist systems are based on information gathered wholesale from data. The data most readily digested by our translation system is from bilingual corpora, but bilingual dictionaries are data too, and in this paper we show how to weave information from them into the fabric of our statistical model of the translation process.

When a lexicographer creates an entry in a bilingual dictionary, he describes in one language the meaning and use of a word from another language. Often, he includes a list of simple translations. For example, the entry for disingenuousness in the HarperCollins Robert French Dictionary [1] lists the translations déloyauté, manque de sincérité, and fourberie. In constructing such a list, the lexicographer gathers, either through introspection or extrospection, instances in which disingenuousness has been used in various ways and records those of the different translations that he deems of sufficient importance. Although a dictionary is more than just a collection of lists, we will concentrate here on that portion of it that is made up of lists.

We formalize an intuitive account of lexicographic behavior as follows. We imagine that a lexicographer, when constructing an entry for the English word or phrase e, first chooses a random size s, and then selects at random a sample of s instances of the use of e, each with its French translation. We imagine, further, that he includes in his entry for e a list consisting of all of the translations that occur at least once in his random sample. The probability that he will, in this way, obtain the list f_1, ..., f_m is

    \Pr(f_1, \ldots, f_m \mid e) = \sum_{s} \Pr(s \mid e) \sum_{\substack{s_1, \ldots, s_m \ge 1 \\ s_1 + \cdots + s_m = s}} \binom{s}{s_1 \cdots s_m} \prod_{i=1}^{m} \Pr(f_i \mid e)^{s_i},    (1)

where \Pr(f_i \mid e) is the probability from our statistical model that the phrase f_i occurs as a translation of e, and \Pr(s \mid e) is the probability that the lexicographer chooses to sample s instances of e. The multinomial coefficient is defined by

    \binom{s}{s_1 \cdots s_k} = \frac{s!}{s_1! \cdots s_k!},    (2)

and satisfies the recursion

    \binom{s}{s_1 \cdots s_k} = \binom{s}{s_k} \binom{s - s_k}{s_1 \cdots s_{k-1}},    (3)

where \binom{s}{t} is the usual binomial coefficient. In general, the sum in Equation (1) cannot be evaluated in closed form, but we can organize an efficient calculation of it as follows. Let

    a_s(f_1, \ldots, f_m \mid e) = \sum_{\substack{s_1, \ldots, s_m \ge 1 \\ s_1 + \cdots + s_m = s}} \binom{s}{s_1 \cdots s_m} \prod_{i=1}^{m} \Pr(f_i \mid e)^{s_i},    (4)

so that

    \Pr(f_1, \ldots, f_m \mid e) = \sum_{s} \Pr(s \mid e) \, a_s(f_1, \ldots, f_m \mid e).    (5)

Using Equation (3), it is easy to show that

    a_s(f_1, \ldots, f_m \mid e) = \sum_{t=1}^{s-m+1} \binom{s}{t} \Pr(f_m \mid e)^{t} \, a_{s-t}(f_1, \ldots, f_{m-1} \mid e),    (6)

and therefore we can compute \Pr(f_1, \ldots, f_m \mid e) in time proportional to s^2 m. By judicious use of thresholds, even this can be substantially reduced.
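The recursion in Equation (6) translates directly into a small dynamic program. The Python sketch below is ours, not the paper's; the function name, argument layout, and the truncation bound s_max (a stand-in for the paper's "judicious use of thresholds") are assumptions made for illustration.

```python
from math import comb

def entry_probability(p_f, p_s, s_max):
    """Pr(f_1,...,f_m | e) via the dynamic program of Equations (4)-(6).

    p_f   : list of model probabilities Pr(f_i | e), one per listed translation
    p_s   : callable s -> Pr(s | e), the sample-size distribution
    s_max : truncation point for the outer sum over s (a hypothetical threshold)
    """
    m = len(p_f)
    # a[s] holds a_s(. | e) for the first j translations considered so far.
    # Base case j = 1: the whole sample of size s must be f_1, so a_s = Pr(f_1|e)^s.
    a = [0.0] + [p_f[0] ** s for s in range(1, s_max + 1)]
    for j in range(1, m):
        b = [0.0] * (s_max + 1)
        for s in range(j + 1, s_max + 1):  # need >= 1 instance of each of j+1 phrases
            # Equation (6): peel off the t instances of the newest phrase.
            b[s] = sum(comb(s, t) * p_f[j] ** t * a[s - t]
                       for t in range(1, s - j + 1))
        a = b
    # Equation (5): mix over the sample-size distribution.
    return sum(p_s(s) * a[s] for s in range(m, s_max + 1))
```

Each pass over j extends the table from the first j translations to the first j + 1, so the total work is O(s_max^2 · m), matching the bound above.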
In the special case that \Pr(s \mid e) is a Poisson distribution with mean \lambda(e),

    \Pr(s \mid e) = e^{-\lambda(e)} \frac{\lambda(e)^s}{s!},    (7)

the sum in Equation (5) can be evaluated in closed form:

    \Pr(f_1, \ldots, f_m \mid e) = e^{-\lambda(e)} \prod_{i=1}^{m} \left( e^{\lambda(e) \Pr(f_i \mid e)} - 1 \right).    (8)

This is the form that we will assume throughout the remainder of the paper because of its simplicity. Notice that in this case, the probability of an entry is a product of factors, one for each of the translations that it contains.
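To see the product form concretely, here is a minimal sketch of Equation (8) together with a numerical cross-check against the dynamic program in the earlier sketch; the function name and the inputs lam and p_f are invented for the example.

```python
from math import exp, factorial

def poisson_entry_probability(p_f, lam):
    """Equation (8): probability of an entry listing exactly the translations
    with model probabilities p_f, when the sample size is Poisson with mean lam."""
    prob = exp(-lam)
    for p in p_f:
        prob *= exp(lam * p) - 1.0   # one factor per listed translation
    return prob

# Cross-check against entry_probability() from the earlier sketch, using a
# truncated Poisson for Pr(s | e); the two agree up to the truncation error.
lam, p_f = 5.0, [0.4, 0.2, 0.1]      # hypothetical values for illustration
poisson = lambda s: exp(-lam) * lam ** s / factorial(s)
print(poisson_entry_probability(p_f, lam))        # closed form, Equation (8)
print(entry_probability(p_f, poisson, s_max=60))  # dynamic program, Eqs. (4)-(6)
```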
The series f_1, ..., f_m represents the translations of e that are included in the dictionary. We call this set of translations D_e. Because we ignore everything about the dictionary except for these lists, a complete dictionary is just a collection of D_e's, one for each of the English phrases that has an entry. We treat each of these entries as independent and write the probability of the entire dictionary as

    \Pr(D) = \prod_{e} \Pr(D_e \mid e),    (9)

the product here running over all entries.

Equation (9) gives the probability of the dictionary in terms of the probabilities of the entries that make it up. The probabilities of these entries in turn are given by Equation (8) in terms of the probabilities, p(f|e), of individual French phrases given individual English phrases. Combining these two equations, we can write

    \Pr(D) = \prod_{e} e^{-\lambda(e)} \prod_{f \in D_e} \left( e^{\lambda(e) p(f \mid e)} - 1 \right).    (10)

We take p(f|e) to be given by the statistical model described in detail by Brown et al. [2]. Their model has a set of translation probabilities, t(f|e), giving for each French word f and each English word e the probability that f will appear as (part of) a translation of e; a set of fertility probabilities, n(φ|e), giving for each integer φ and each English word e the probability that e will be translated as a phrase containing φ French words; and a set of distortion probabilities governing the placement of French words in the translation of an English phrase. They show how to estimate these parameters so as to maximize the probability,

    \Pr(H) = \prod_{(e, f) \in H} p(f \mid e),    (11)

of a collection H of pairs of aligned translations (e, f).

Let Θ represent the complete set of parameters of the model of Brown et al. [2], and let θ represent any one of these parameters. We extend the method of Brown et al. to develop a scheme for estimating Θ so as to maximize the joint probability of the corpus and the dictionary, Pr_Θ(H, D). We assume that Pr_Θ(H, D) = Pr_Θ(H) Pr_Θ(D). In general, it is possible only to find local maxima of Pr_Θ(H, D) as a function of Θ, which we can do by applying the EM algorithm [3, 4]. The EM algorithm adjusts an initial estimate of Θ in a series of iterations. Each iteration consists of an estimation step, in which a count is determined for each parameter, followed by a maximization step, in which each parameter is replaced by a value proportional to its count. The count c_θ for a parameter θ is defined by

    c_\theta = \theta \, \frac{\partial}{\partial \theta} \log \Pr_\Theta(H, D).    (12)

Because we assume that H and D are independent, we can write c_θ as the sum of a count for H and a count for D:

    c_\theta = c_\theta(H) + c_\theta(D).    (13)

The corpus count is a sum of counts, one for each translation in the corpus.

The dictionary count is also a sum of counts, but with each count weighted by a factor μ(e, f), which we call the effective multiplicity of the translation. Thus,

    c_\theta(D) = \sum_{e} \sum_{f \in D_e} \mu(e, f) \, c_\theta(e, f),    (14)

where c_θ(e, f) is the count obtained from the single translation pair (e, f). The effective multiplicity is just the expected number of times that our lexicographer observed the translation (e, f), given the dictionary and the corpus. In terms of the a priori multiplicity, μ_0(e, f) = λ(e) p(f|e), it is

    \mu(e, f) = \frac{\mu_0(e, f)}{1 - e^{-\mu_0(e, f)}}.    (15)

Figure 1 shows the effective multiplicity as a function of the a priori multiplicity. For small values of μ_0(e, f), μ(e, f) is approximately equal to 1 + μ_0(e, f)/2. For very large values, μ_0(e, f) and μ(e, f) are approximately equal. Thus, if we expect a priori that the lexicographer will see the translation (e, f) very many times, then the effective multiplicity will be nearly equal to this number; but even if we expect a priori that he will scarcely ever see a translation, the effective multiplicity for it cannot fall below 1. This is reasonable because, in our model of the dictionary construction process, we assume that nothing can get into the dictionary unless it is seen at least once by the lexicographer.
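Equation (15) is a one-liner in code; the short sketch below (ours, with invented sample values of μ_0) illustrates the two regimes just described.

```python
from math import expm1

def effective_multiplicity(mu0):
    """Equation (15): expected number of times the lexicographer saw (e, f),
    given that the translation made it into his entry."""
    if mu0 == 0.0:
        return 1.0               # limiting value as mu0 -> 0
    return mu0 / -expm1(-mu0)    # mu0 / (1 - exp(-mu0)), computed stably

# Small mu0: close to 1 + mu0/2.  Large mu0: close to mu0 itself.
for mu0 in (0.01, 0.1, 1.0, 10.0):   # hypothetical a priori multiplicities
    print(f"mu0 = {mu0:5.2f}   mu = {effective_multiplicity(mu0):.4f}   "
          f"1 + mu0/2 = {1 + mu0 / 2:.4f}")
```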