<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1016"> <Title>Determining Recurrent Sound Correspondences by Inducing Translation Models</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Models of translational equivalence </SectionTitle> <Paragraph position="0"> In statistical machine translation, a translation model approximates the probability that two sentences are mutual translations by computing the product of the probabilities that each word in the target sentence is a translation of some source-language word. A model of translational equivalence that determines the word translation probabilities can be induced from bitexts. The difficulty lies in the fact that the mapping, or alignment, of words between two parts of a bitext is not known in advance.</Paragraph> <Paragraph position="1"> Algorithms for word alignment in bitexts aim at discovering word pairs that are mutual translations.</Paragraph> <Paragraph position="2"> A straightforward approach is to estimate the likelihood that words are mutual translations by computing a similarity function based on a co-occurrence statistic, such as mutual information, the Dice coefficient, or the $\chi^{2}$ test. The underlying assumption is that the association scores for different word pairs are independent of each other.</Paragraph> <Paragraph position="3"> Melamed (2000) shows that the assumption of independence leads to invalid word associations, and proposes an algorithm for inducing models of translational equivalence that outperform the models that are based solely on co-occurrence counts. His models employ the one-to-one assumption, which formalizes the observation that most words in bitexts are translated to a single word in the corresponding sentence. The algorithm, which is related to the expectation-maximization (EM) algorithm, iteratively re-estimates the likelihood scores, which represent the probability that two word types are mutual translations. 
In the first step, the scores are initialized according to the $G^{2}$ statistic (Dunning, 1993). Next, the likelihood scores are used to induce a set of one-to-one links between word tokens in the bitext. The links are determined by a greedy competitive linking algorithm, which links the pairs with the highest likelihood scores first.</Paragraph> <Paragraph position="4"> After the linking is completed, the link counts are used to re-estimate the likelihood scores, which in turn are applied to find a new set of links. The process is repeated until the translation model converges to the desired degree.</Paragraph> <Paragraph position="5"> Melamed presents three translation-model estimation methods. Method A re-estimates the likelihood scores as the logarithm of the probability of jointly generating the pair of words u and v: $\mathit{score}_A(u,v) = \log \frac{\mathit{links}(u,v)}{\sum_{u',v'} \mathit{links}(u',v')}$, where $\mathit{links}(u,v)$ denotes the number of links induced between u and v. Note that the co-occurrence counts of u and v are not used for the re-estimation. In Method B, an explicit noise model with auxiliary parameters $\lambda^{+}$ and $\lambda^{-}$ is constructed in order to improve the estimation of likelihood scores. $\lambda^{+}$ is the probability that a link is induced between two co-occurring words that are mutual translations, while</Paragraph> <Paragraph position="7"> $\lambda^{-}$ is the probability that a link is induced between two co-occurring words that are not mutual translations. Ideally, $\lambda^{+}$ should be close to one and $\lambda^{-}$ should be close to zero. The actual values of the two parameters are obtained by maximum likelihood estimation. Let $\mathit{cooc}(u,v)$ be the number of co-occurrences of u and v. The score function is defined as: $\mathit{score}_B(u,v) = \log \frac{B(\mathit{links}(u,v) \mid \mathit{cooc}(u,v), \lambda^{+})}{B(\mathit{links}(u,v) \mid \mathit{cooc}(u,v), \lambda^{-})}$, where $B(k \mid n, p)$ denotes the probability of k being generated from a binomial distribution with parameters n and p.</Paragraph> <Paragraph position="8"> In Method C, bitext tokens are divided into classes, such as content words, function words, punctuation, etc., with the aim of producing more accurate translation models. 
The auxiliary parameters are estimated separately for each class. Thanks to its generality and symmetry, Melamed's parameter estimation process can be adapted to the problem of determining recurrent sound correspondences. The main idea is to induce a model of sound correspondence in a bilingual wordlist, in the same way as one induces a model of translational equivalence among words in a parallel corpus. After the model has converged, phoneme pairs with the highest likelihood scores represent the most likely correspondences. While there are strong similarities between the task of estimating translational equivalence of words and the task of determining recurrent correspondences of sounds, a number of important modifications to Melamed's original algorithm are necessary in order to make it applicable to the latter task. The modifications include the method of finding a good alignment, the handling of null links, and the method of computing the alignment score.</Paragraph> <Paragraph position="9"> For the task at hand, I employ a different method of aligning the segments in two corresponding sequences. In sentence translation, the alignment links frequently cross and it is not unusual for two words in different parts of sentences to correspond. In contrast, the processes that lead to link intersection in diachronic phonology, such as metathesis, are quite sporadic. The introduction of the no-crossing-links constraint on alignments not only leads to a dramatic reduction of the search space, but also makes it possible to replace the approximate competitive-linking algorithm of Melamed with a variant of the well-known dynamic programming algorithm (Wagner and Fischer, 1974; Kondrak, 2000), which computes the optimal alignment between two strings in polynomial time.</Paragraph> <Paragraph position="10"> Null links in statistical machine translation are induced for words on one side of the bitext that have no clear counterparts on the other side of the bitext. 
Melamed's algorithm explicitly calculates the likelihood scores of null links for every word type occurring in a bitext. In diachronic phonology, phonological processes that lead to insertion or deletion of segments usually operate on individual words rather than on particular sounds across the language. Therefore, I model insertion and deletion by employing a constant indel penalty for unlinked segments.</Paragraph> <Paragraph position="11"> The alignment score between two words is computed by summing the number of induced links, and applying an indel penalty for each unlinked segment, with the exception of the segments beyond the rightmost link. The exception reflects the relative instability of word endings in the course of linguistic evolution. In order to avoid inducing links that are unlikely to represent recurrent sound correspondences, only pairs whose likelihood scores exceed a set threshold are linked. All correspondences above the threshold are considered to be equally valid. In the cases where more than one best alignment is found, each link is assigned a weight that is its average over the entire set of best alignments (for example, a link present in only one of two competing alignments receives a weight of 0.5).</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Implementation </SectionTitle> <Paragraph position="0"> The method described above has been implemented as a C++ program, named CORDI, which will soon be made publicly available. The program takes as input a bilingual wordlist and produces an ordered list of correspondences. A model for a 200-pair list usually converges after 3-5 iterations, which takes only a few seconds on a Sparc workstation. The user can choose between Methods A, B, and C, described in Section 3, and an additional Method D. 
In Method C, phonemes are divided into two classes: non-syllabic (consonants and glides), and syllabic (vowels); links between phonemes belonging to different classes are not induced. Method D differs from Method C in that the syllabic phonemes do not participate in any links.</Paragraph> <Paragraph position="1"> Adjustable parameters include the indel penalty ratio d and the minimum-strength correspondence threshold t. The parameter d fixes the ratio between the negative indel weight and the positive weight assigned to every induced link. (A lower ratio causes the program to be more adventurous in positing sparse links.) The parameter t controls the tradeoff between reliability and the number of links. In Method A, the value of t is the minimum number of phoneme links that have to be induced for the correspondence to be valid. In Methods B, C, and D, the value of t implies a likelihood score threshold of</Paragraph> <Paragraph position="3"> $t \cdot \log(\lambda^{+} / \lambda^{-})$, which is the score achieved by a pair of phonemes that have t links out of t co-occurrences. In the experiments reported in Section 6, d was set to 0.15, and t was set to 1 (sufficient to reject all non-recurring correspondences). In Method D, where the lack of vowel links causes the linking constraints to be weaker, a higher value of t = 3 was used. These parameter values were optimized on the development set described below.</Paragraph> </Section> </Paper>
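<!--
The relation between the threshold t and the likelihood score in Methods B, C, and D can be checked numerically. The Python sketch below assumes that the Method B score is the log-ratio of two binomial likelihoods, one using the link rate for true correspondences and one using the link rate for noise. The function names and the parameter values 0.9 and 0.05 are hypothetical and are not taken from CORDI. Under that assumption, a pair with t links out of t co-occurrences scores t times log(lam_plus / lam_minus).

```python
import math


def log_binomial(k, n, p):
    """Log-probability of k successes in n trials with success rate p.

    Assumes p lies strictly between 0 and 1.
    """
    return (math.log(math.comb(n, k))
            + k * math.log(p)
            + (n - k) * math.log(1.0 - p))


def score_b(links, cooc, lam_plus, lam_minus):
    """Assumed Method B score: log-ratio of two binomial likelihoods."""
    return (log_binomial(links, cooc, lam_plus)
            - log_binomial(links, cooc, lam_minus))


def threshold(t, lam_plus, lam_minus):
    """Score of a pair that has t links out of t co-occurrences."""
    return score_b(t, t, lam_plus, lam_minus)


# Hypothetical parameter values, chosen only for illustration.
LAM_PLUS, LAM_MINUS = 0.9, 0.05

if __name__ == "__main__":
    for t in (1, 3):
        print(t, round(threshold(t, LAM_PLUS, LAM_MINUS), 3))
```

With these assumptions, raising t from 1 to 3 triples the score threshold, which matches the stricter setting reported for Method D. A pair that is linked only once in many co-occurrences falls well below either threshold.
-->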