<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1016"> <Title>Determining Recurrent Sound Correspondences by Inducing Translation Models</Title> <Section position="6" start_page="2" end_page="2" type="evalu"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 6.1 The data for experiments </SectionTitle> <Paragraph position="0"> The experiments in this section were performed using a well-known list of 200 basic meanings that are considered universal and relatively resistant to lexical replacement (Swadesh, 1952). The Swadesh 200-word lists are widely used in linguistics and have been compiled for a large number of languages. The development set consisted of three 200-word list pairs adapted from the Comparative Indoeuropean Data Corpus (Dyen et al., 1992). The corpus contains the 200-word lists for a number of Indoeuropean languages, together with cognation judgments made by the renowned historical linguist Isidore Dyen. Unfortunately, the words are represented in the Roman alphabet without any diacritical marks, which makes them unsuitable for automatic phonetic analysis. The Polish-Russian, Spanish-Romanian, and Italian-Serbocroatian pairs were selected because they represent three different levels of relatedness (73.5%, 58.5%, and 25.3% of cognate pairs, respectively), and also because they have relatively transparent grapheme-to-phoneme conversion rules. They were transcribed into a phonetic notation by means of Perl scripts and then stemmed and corrected manually.</Paragraph> <Paragraph position="1"> The test set consisted of five 200-word lists representing English, German, French, Latin, and Albanian, compiled by Kessler (2001). As the lists contain rich phonetic and morphological information, the stemmed forms were automatically converted from the XML format with virtually no extra processing. 
The word pairs classified by Kessler as doubtful cognates were assumed to be unrelated.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 6.2 Determination of correspondences in word pairs </SectionTitle> <Paragraph position="0"> Experiments show that CORDI has little difficulty in determining correspondences given a set of cognate pairs (Kondrak, 2002). However, the assumption that a set of identified cognates is already available as the input for the program is not very plausible. The very existence of a reliable set of cognate pairs implies that the languages in question have already been thoroughly analyzed and that the sound correspondences are known. A more realistic input requirement is a list of word pairs from two languages such that the corresponding words have the same, well-defined meaning. Determining correspondences in a list of synonyms is clearly a more challenging task than extracting them from a list of reliable cognates because the non-cognate pairs introduce noise into the data. Note that Melamed's original algorithm is designed to operate on aligned sentences that are guaranteed to be mutual translations. [Table caption fragment: ... by CORDI in noisy synonym data.]</Paragraph> <Paragraph position="1"> In order to test CORDI's ability to determine correspondences in noisy data, Method D was applied to the 200-word lists for English and Latin. Only 29% of the word pairs are actually cognate; the remaining 71% are unrelated lexemes. The top ten correspondences discovered by the program are shown in Table 2. Remarkably, all but one are valid. 
In contrast, only four of the top ten phoneme matchings picked up by the χ² statistic are valid correspondences (the validity judgements are my own).</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 6.3 Identification of cognates in word pairs </SectionTitle> <Paragraph position="0"> The quality of correspondences produced by CORDI is difficult to validate, quantify, and compare with the results of alternative approaches.</Paragraph> <Paragraph position="1"> However, it is possible to evaluate the correspondences indirectly by using them to identify cognates. The likelihood of cognation of a pair of words increases with the number of correspondences that they contain. Since CORDI explicitly posits correspondence links between words, the likelihood of cognation can be estimated by simply dividing the number of induced links by the length of the words that are being compared. A minimum-length parameter can be set in order to avoid computing cognation estimates for very short words, which tend to be unreliable.</Paragraph> <Paragraph position="2"> The evaluation method for cognate identification algorithms adopted in this section is to apply them to a bilingual wordlist and order the pairs according to their scores (refer to Table 3). The ranking is then evaluated against a gold standard by computing the n-point average precision, a generalization of the 11-point average precision, where n is the total number of cognate pairs in the list. The n-point average precision is obtained by taking the average of n precision values that are calculated for each point in the list where we find a cognate pair:</Paragraph> <Paragraph position="4"> i/r_i, i = 1, ..., n, where i is the number of the cognate pair counting from the top of the list produced by the algorithm, and r_i is the rank of this cognate pair among all word pairs. The n-point precision of the ranking in Table 3 is (1.0 + 0.66)/2 = 0.83. 
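The cognation score and the n-point average precision described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function names, the boolean-list input format, and the use of the average of the two word lengths in the cognation score (the paper says only "the length of the words that are being compared") are assumptions.

```python
def cognation_score(num_links, len1, len2):
    """Estimate the likelihood of cognation: the number of correspondence
    links induced between the two words, divided by their average length.
    (Averaging the two lengths is an assumption of this sketch.)"""
    return num_links / ((len1 + len2) / 2)

def n_point_average_precision(is_cognate):
    """n-point average precision of a ranked list of word pairs.
    is_cognate[k] is True iff the pair at rank k+1 is a gold-standard
    cognate pair.  For the i-th cognate (counting from the top of the
    list) at overall rank r_i, the precision at that point is i / r_i;
    the result is the mean of these n values."""
    precisions = []
    cognates_seen = 0
    for rank, cog in enumerate(is_cognate, start=1):
        if cog:
            cognates_seen += 1                       # this is i
            precisions.append(cognates_seen / rank)  # i / r_i
    return sum(precisions) / len(precisions) if precisions else 0.0

# The worked example from the text: two cognate pairs, ranked 1st and
# 3rd among all word pairs, give (1/1 + 2/3) / 2 = 0.83.
print(round(n_point_average_precision([True, False, True]), 2))  # 0.83
```

A random ordering would, on average, yield an n-point precision near the proportion of cognate pairs in the list, which is the baseline against which the rankings are judged.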
The expected n-point precision of a program that randomly orders word pairs is close to the proportion of cognate pairs in the list.</Paragraph> <Paragraph position="5"> [Table caption fragment: ... by methods A, B, C, and D on the development set.] The cognation judgments from the Comparative Indoeuropean Data Corpus served as the gold standard. All four methods proposed in this paper, as well as other cognate identification programs, were uniformly applied to the test set representing five Indoeuropean languages. Apart from the English-German and the French-Latin pairs, all remaining language pairs are quite challenging for a cognate identification program. In many cases, the gold-standard cognate judgments distill the findings of decades of linguistic research. In fact, for some of those pairs, Kessler finds it difficult to show by statistical techniques that the surface regularities are unlikely to be due to chance. Nevertheless, in order to avoid making subjective choices, CORDI was evaluated on all possible language pairs in Kessler's set.</Paragraph> <Paragraph position="6"> Two programs mentioned in Section 2, COGNATE and JAKARTA, were also applied to the test set. The source code of JAKARTA was obtained directly from the author and slightly modified according to his instructions in order to make it recognize additional phonemes. Word pairs were ordered according to the confidence scores in the case of COGNATE, and according to the edit distances in the case of JAKARTA. Since the other two programs do not impose any length constraints on words, the minimum-length parameter was not used in the experiments described here.</Paragraph> <Paragraph position="7"> The results on the test set are shown in Table 5. The best result for each language pair is underlined. The performance of COGNATE and JAKARTA is quite similar, even though they represent two radically different approaches to cognate identification. On average, methods B, C, and D outperform both comparison programs. 
On closely related languages, Method B, with its relatively unconstrained linking, achieves the highest precision. Method D, which considers only consonants, is the best on fairly remote languages, where vowel correspondences tend to be weak. The only exception is the extremely difficult Albanian-English pair, where the relative ordering of the methods seems to be accidental. As expected, Method A is outperformed by the methods that employ an explicit noise model. However, in spite of its extra complexity, Method C is not consistently better than Method B, perhaps because of its inability to detect important vowel-consonant correspondences, such as the ones between French nasal vowels and Latin /n/.</Paragraph> </Section> </Section> </Paper>