<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0805">
  <Title>Revealing Phonological Similarities between Related Languages from Automatically Generated Parallel Corpora</Title>
  <Section position="3" start_page="33" end_page="34" type="metho">
    <SectionTitle>
2 Previous Research
</SectionTitle>
    <Paragraph position="0"> Some approaches to revealing sound correspondences require clean data whereas other methods can deal with noisy input. Cahill and Tiberius (2002) use a manually compiled cognate list of Dutch, English and German cognates and extract cross-linguistic phoneme correspondences. The results1 contain the counts of a certain German phoneme and their possible English and Dutch counterparts.</Paragraph>
    <Paragraph position="1"> The method presented in Kondrak (2003), however, can deal with noisy bilingual word lists. He generates sound correspondences of various Algonquian languages. His algorithm considers them as possible candidates if their likelihood scores lie above a certain minimum-strength threshold. The candidates are evaluated against manually compiled sound correspondences. The algorithm is able to judge  whether a bilingual phoneme pair is a possible sound correspondence. Another interesting generative model can be found in Knight and Graehl (1998).</Paragraph>
    <Paragraph position="2"> They train weighted finite-state transducers with the EM algorithm which are applied to automatically transliterating Japanese words - originated from English - back to English. In our approach, we aim at discovering similar correspondences between bilingual data represented in the classes. The classes can be used to assess how likely a bilingual sound correspondence is.</Paragraph>
    <Paragraph position="3"> 3 Generation of two parallel Corpora In this section, we describe the resources used for our clustering algorithm. We take advantage of two on-line bilingual orthographic dictionaries2 and the monolingual pronunciation dictionaries (Baayen et al., 1993) in CELEX to automatically build two bilingual pronunciation dictionaries.</Paragraph>
    <Paragraph position="4"> In a first step, we extract from the German-Dutch orthographic dictionary 72,037 word pairs and from the German-English dictionary 155,317. Figures 1 and 2 (1st table) display a fragment of the extracted orthographic word pairs. Note that we only allow one possible translation, namely the first one.</Paragraph>
    <Paragraph position="5"> In a next step, we automatically look up the pronunciation of the German, Dutch and English words in the monolingual part of CELEX. A word pair is considered for further analysis if the pronunciation of both words is found in CELEX. For instance, the first half of the word pair Hausflur-huisgang (corridor) does occur in the German part of CELEX but the second half is not contained within the Dutch part. Thus, this word pair is discarded. However, the words Haus-huis-house are found in all three mono-lingual pronunciation dictionaries and are used for further analysis. Note that the transcription and syllabification of the words are defined in CELEX.</Paragraph>
    <Paragraph position="6"> The result is a list of 44,415 transcribed German-Dutch word pairs and a list of 63,297 transcribed German-English word pairs. Figures 1 and 2 (2nd table) show the result of the look-up procedure.</Paragraph>
    <Paragraph position="7"> For instance, [&amp;quot;haus]3-[&amp;quot;hUIs] is the transcription of Haus-huis in the German-Dutch dictionary, while  Orthographic lexicon Transcribed lexicon Bilingual pronunciation dictionary Onsets Nuclei Codas</Paragraph>
    <Paragraph position="9"> H&amp;quot;auser huizen [&amp;quot;hOy][z@r] [hUI][z@] [&amp;quot;hOy][z@r] [hUI][z@] h h Oy UI NOP NOP Haus huis [&amp;quot;haus] [&amp;quot;hUIs] [&amp;quot;haus] [&amp;quot;hUIs] z z @ @ r NOP Hausflur huisgang = [&amp;quot;haus][flu:r] huisgang = - - = h h au UI s s Haut huid [&amp;quot;haut] [&amp;quot;hUIt] [&amp;quot;haut] [&amp;quot;hUIt] h h au UI t t Hautarzt huidarts [haut][&amp;quot;a:rtst] [hUId][Arts] [haut][&amp;quot;a:rtst] [hUId][Arts] h h au UI t d</Paragraph>
    <Paragraph position="11"> scribed lexicon - the bilingual dictionary - to the final bilingual onset, nucleus and coda lists ( left to right) Orthographic lexicon Transcribed lexicon Bilingual pronunciation dictionary Onsets Nuclei Codas</Paragraph>
    <Paragraph position="13"> H&amp;quot;auser houses [&amp;quot;hOy][z@r] [&amp;quot;haU][zIz] [&amp;quot;hOy][z@r] [&amp;quot;haU][zIz] h h Oy aU NOP NOP Haus house [&amp;quot;haus] [haUs] [&amp;quot;haus] [haUs] z z @ I r z Hausflur corridor = [&amp;quot;haus][flu:r] [&amp;quot;kO][rI][dO:rstar] = - - = h h au aU s s Haut skin [&amp;quot;haut] [skIn] [&amp;quot;haut] [skIn] h sk au I t n Hautarzt dermatologist [haut][&amp;quot;a:rtst] [d3:][m@][&amp;quot;tO]-</Paragraph>
    <Paragraph position="15"> scribed lexicon - the bilingual dictionary - to the final bilingual onset, nucleus and coda lists ( left to right) [&amp;quot;haus]-[haUs] is the transcription of Haus-house in the German-English part.</Paragraph>
    <Paragraph position="16"> We aim at revealing phonological relationships between German-Dutch and German-English word pairs on the phonemic level, hence, we need something similar to an alignment procedure on the syllable level. Thus, we first extract only those word pairs which contain the same number of syllables.</Paragraph>
    <Paragraph position="17"> The underlying assumption is that words with a historically related stem often preserve their syllable structure. The only exception is that we do not use all inflectional paradigms of verbs to gain more data because they are often a reason for uneven syllable numbers (e.g., the past tense German suffix /tete/ is in Dutch /te/ or /de/). Hautarzt-huidarts would be chosen both made up of two syllables; however, Hautarzt-dermatologist will be dismissed as the German word consists of two syllables whereas the English word comprises five syllables. Figures 1 and 2 (3rd table) show the remaining items after this filtering process. We split each syllable within the bilingual word lists into onset, nucleus and coda.</Paragraph>
    <Paragraph position="18"> All consonants to the left of the vowel are considered the onset. The consonants to the right of the vowel represent the coda. Empty onsets and codas are replaced by the word [NOP]. After this processing step, each word pair consists of the same number of onsets, nuclei and codas.</Paragraph>
    <Paragraph position="19"> The final step is to extract a list of German-Dutch and German-English phoneme pairs. It is easy to extract the bilingual onset, nucleus and coda pairs from the transcribed word pairs (fourth table of Figures 1 and 2). For instance, we extract the onset pair [h][h], the nucleus pair [au]-[UI] and the coda pair [s]-[s] from the German-Dutch word pair [&amp;quot;haus]-[&amp;quot;hUIs]. With the described method, we obtain from the remaining 21,212 German-Dutch and 13,067 German-English words, 59,819 German-Dutch and 35,847 German-English onset, nucleus and coda pairs.</Paragraph>
  </Section>
  <Section position="4" start_page="34" end_page="35" type="metho">
    <SectionTitle>
4 Phonological Clustering
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the unsupervised clustering method used for clustering of phonological units. Three- and five-dimensional EM-based clustering has been applied to monolingual phonological data (M&amp;quot;uller et al., 2000) and two-dimensional clustering to syntax (Rooth et al., 1999). In our approach, we apply two-dimensional clustering to reveal classes of bilingual sound correspondences.</Paragraph>
    <Paragraph position="1"> The method is well-known but the application of probabilistic clustering to bilingual phonological data allows a new view on bilingual phonological  processes. We choose EM-based clustering as we need a technique which provides probabilities to deal with noise in the training data. The two main parts of EM-based clustering are (i) the induction of a smooth probability model over the data, and (ii) the automatic discovery of class structure in the data. We aim to derive a probability distribution p(y) on bilingual phonological units y from a large sample (p(c) denotes the class probability, p(ysource|c) is the probability of a phoneme of the source language given class c, and p(ytarget|c) is the probability of a phoneme of the target language given class c).</Paragraph>
    <Paragraph position="3"> The re-estimation formulas are given in (Rooth et al., 1999) and our training regime dealing with the free parameters (e.g. the number of |c |of classes) is described in Sections 4.1 and 4.2. The output of our clustering algorithm are classes with their class number, class probability and a list of class members with their probabilities.</Paragraph>
    <Paragraph position="4"> class 2 0.069</Paragraph>
    <Paragraph position="6"> The above table comes from our German-Dutch experiments and shows Class # 2 with its probability of 6.9%, the German onsets in the left column (e.g., [t] appears in this class with the probability of 63.3%, [ts] with 14.4% and [s] with 5.5%) and the Dutch onsets in the right column ([t] appears in this class with the probability of 76.4% and [d] with 12.8%).</Paragraph>
    <Paragraph position="7"> The examples presented in this paper are fragments of the full classes showing only those units with the highest probabilities.</Paragraph>
    <Section position="1" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
4.1 Experiments with German-Dutch data
</SectionTitle>
      <Paragraph position="0"> We use the 59,819 onset, nucleus and coda pairs as training material for our unsupervised training.</Paragraph>
      <Paragraph position="1"> Unsupervised methods require the variation of all free parameters to search for the optimal model.</Paragraph>
      <Paragraph position="2"> There are three different parameters which have to be varied: the initial start parameters, the number of classes and the number of re-estimation steps.</Paragraph>
      <Paragraph position="3"> Thus, we experiment with 10 different start parameters, 6 different numbers of classes (5, 10, 15, 20, 25 and 304) and 20 steps of re-estimation. Our training regime yields 1,200 onset, 1,200 coda and 1,000 nucleus models.</Paragraph>
    </Section>
    <Section position="2" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
4.2 Experiments with German-English data
</SectionTitle>
      <Paragraph position="0"> Our training material is slightly smaller for German-English than for German-Dutch. We derive 35,847 onset, nucleus and coda pairs for training. The reduced training set is due to the structure of words which is less similar for German-English words than for German-Dutch words leading to words with unequal syllable numbers. We used the same training regime as in Section 4.1, yielding the same number of models.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="35" end_page="38" type="metho">
    <SectionTitle>
5 Similarity scores of the syllable parts
</SectionTitle>
    <Paragraph position="0"> We apply our models to a translation task. The main idea is to take a German phoneme and to predict the most probable Dutch and English counterpart.</Paragraph>
    <Paragraph position="1"> Hence, we extract 808 German-Dutch and 738 German-English cognate pairs from a cognate database5, consisting of 836 entries. As for the training data, we extract those pairs that consist of the same number of syllables because our current models are restricted to sound correspondences and do not allow the deletion of syllables. We split our corpus into two parts by putting the words with an even line number in the development database and the words with an uneven line number in the gold standard database. The development set and the gold standard corpus consist of 404 transcribed words for the German to Dutch translation task and of 369 transcribed words for the German to English translation task.</Paragraph>
    <Paragraph position="2"> The task is then to predict the translation of German onsets to Dutch onsets taken from German-Dutch cognate pairs, e.g. the models should predict from the German word durch ([dUrx]) (through), the Dutch word door ([do:r]). If the phoneme correspondence, [d]:[d], is predicted, the similarity score of the onset model increases. The nucleus score increases if the nucleus model predicts [U]:[o:] and the coda score increases if the coda model predicts [rx]:[r].</Paragraph>
    <Paragraph position="3"> We assess all our onset, nucleus and coda models  nates indicating that German is closer related to Dutch than to English.</Paragraph>
    <Paragraph position="4"> by measuring the most probable phoneme translations of the cognates from our development set. We choose the models with the highest onset, nucleus and coda scores. Only the models with the highest scores (for onset, nucleus and coda prediction) are applied to the gold standard to avoid tuning to the development set. Using this procedure shows how our models perform on new data. We apply our scoring procedure to both language pairs.</Paragraph>
    <Paragraph position="5"> Table 1 shows the results of our best models by measuring the onset, nucleus and coda translation scores on our gold standard. The results point out that the prediction of the onset is easier than predicting the nucleus or the coda. We achieve an onset similarity score of 80.7% for the German to Dutch task and 69.6% for the German to English task. Although the set of possible nuclei is smaller than the set of onsets and codas, the prediction of the nuclei is much harder. The nucleus similarity score decreases to 50.7% and to 17.1% for German-English respectively. Codas seem to be slightly easier to predict than nuclei leading to a coda similarity score of 52.2% for German-Dutch and to 28.7% for GermanEnglish. null The comparison of the similarity scores from the translation tasks of the two language pairs indicates that predicting the phonological correspondences from German to Dutch is much easier than from German to English. These results supply statistical evidence that German is historically more closely related to Dutch than to English. We do not believe that the difference in the similarity scores are due to the different size of the training corpora but rather to their closer relatedness. Revealing phonological relationships between languages is possible simply because the noisy training data comprise enough related words to learn from them the similar structure of the languages on the syllable-part level.</Paragraph>
    <Paragraph position="6">  In this section, we interpret our classes by manually identifying classes that show typical similarities between the two language pairs. Sometimes, the classes reflect sound changes in historically related stems. Our data is synchronic, and thus it is not possible to directly identify in our classes which sound changes took place (Modern German (G), Modern English (E) and Modern Dutch (NL) did not develop from each other but from a common ancestor). However, we will try to connect the data to ancient languages such as Old High German (OHG), Middle High German (MHG), Old English (OE), Middle Dutch (MNL), Old Dutch (ONL), Proto or West Germanic (PG, WG). Naturally, we can only go back in history as far as it is possible according to the information provided by the following literature: For Dutch, we use de Vries (1997) and the on-line version of Philippa et al. (2004), for English, an etymological dictionary (Harper, 2001) and for German, Burch et al. (1998). We find that certain historic sound changes took place regularly, and thus, the results of these changes can be rediscovered in our synchronic classes. Figure 3 shows the historic relationship between the three languages. A potential learner of a related language does not have to be aware of the historic links between languages but he/she can implicitly exploit the similarities such as the ones discovered in the classes.</Paragraph>
    <Paragraph position="7"> The relationship of words from different languages can be caused by different processes: some words are simply borrowed from another language and adapted to a new language. Papagei-papegaai  (parrot) is borrowed from Arabic and adapted to German and Dutch phonetics, where the /g/ is pronounced in German as a voiced velar plosive and in Dutch as an unvoiced velar fricative.</Paragraph>
    <Paragraph position="8"> Other language changes are due to phonology; e.g., the Old English word [mus] (PG: muHs) was subject to diphthongization and changed to mouse ([maUs]) in Modern English. A similar process took place in German and Dutch, where the same word changed to the German word Maus (MHG: m^us) and to the Dutch word muis (MNL: muus).</Paragraph>
    <Paragraph position="9"> On the synchronic level, we find [au] and [aU] in the same class of a German-English model and [au] and [UI] in a German-Dutch model. There are also other phonological processes which apply to the nuclei, such as monophthongization, raising, lowering, backing and fronting. Other phonological processes can be observed in conjunction with consonants, such as assimilation, dissimilation, deletion and insertion. Some of the above mentioned phonological processes are the underlying processes of the subsequent described classes.</Paragraph>
    <Section position="1" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
6.1 German-Dutch classes
</SectionTitle>
      <Paragraph position="0"> According to our similarity scores presented in Section 5, the best onset model comprises 30 classes, the nucleus model 25 classes and the coda model 30 classes. We manually search for classes, which show interesting sound correspondences.</Paragraph>
      <Paragraph position="1">  The German part of class # 20 reflects Grimm's first law which states that a West Germanic [p] is often realized as a [pf] in German. The underlying phonological process is that sounds are inserted in a certain context. The onsets of the Middle High German words phat (E: path) and phert (E: horse, L: paraver-eredus) became the affricate [pf] in Modern German. In contrast to German, Dutch preserved the simple onsets from the original word form, as in paard (E: horse, MNL: peert) and pad (E: path, MNL: pat).</Paragraph>
      <Paragraph position="2">  are more complex than the onsets in German. From the Old High German word sc^af (E: sheep) the onset /sc/ is assimilated in Modern German to [S] whereas the Dutch onset [sx] preserves the complex consonant cluster from the West Germanic word skaepan (E: sheep, MNL: scaep).</Paragraph>
      <Paragraph position="3">  man short high back vowel /U/ can be often transformed to the Dutch low back vowel /O/. The underlying processes are that the Dutch vowel is sometimes lowered from /i/ to /O/; e.g., the Dutch word gezond (E: healthy, MNL: ghesont, WG: gezwind) comes from the West Germanic word gezwind. In Modern German, the same word changed to gesund (OHG: gisunt).</Paragraph>
      <Paragraph position="4">  tive suffixes /en/, as in Menschen-mensen (E: humans) or laufen-lopen (E: to run), are reduced to a Schwa [@] in Dutch and thus appear in this class with an empty coda [NOP]. It also shows that certain German codas are assimilated by the alveolar sounds /d/ and /s/ from the original bilabial [m] to an apico-alveolar [n], as in Boden (E: ground, MHG: bodem) or in Besen (E: broom, MHG: b&amp;quot;esem, OHG: p&amp;quot;esamo). In Dutch, the words bodem (E: ground, MNL: b-odem, Greek: puthm-en), and bezem (E: broom, MNL: b-esem, WG: besman) kept the /m/.</Paragraph>
      <Paragraph position="5">  Class # 23 comprises complex German codas which are less complex in Dutch. In the German word Arzt (E: doctor, MHG: arz^at), the complex coda [tst] emerges. However in Modern Dutch, arts came from MNL arst or arsate (Latin: archi-ater). We can also find the rule that German codas [Nst] of a 2nd person singular form of a verb are reduced to [Nt] in Dutch as in bringst-brengt (E: bring).</Paragraph>
    </Section>
    <Section position="2" start_page="38" end_page="38" type="sub_section">
      <SectionTitle>
6.2 German-English classes
</SectionTitle>
      <Paragraph position="0"> The best German-English models contain 30 onset classes, 20 nucleus classes, and 10 coda classes.</Paragraph>
      <Paragraph position="1"> Our German-English models are noisier than the German-Dutch ones, which again points at the closer relation between the German and Dutch lexicon. However, when we analyze the 30 onset classes, we find meaningful processes as for German-Dutch.</Paragraph>
      <Paragraph position="2">  preserves the consonant cluster, as in sprechen (E: to speak, OHG: sprehhan, PG: sprekanan). Modern English, however, deleted the /r/ to [sp], as in speak (OE: sprecan). Another regularity can be found: the palato-alveolar [S] in the German onset [Sp] is realized in English as the alveolar [s] in [sp]. Both the German word spinnen and the English word spin come from spinnan (OHG, OE).</Paragraph>
      <Paragraph position="3">  the onset /c/ is realized in German as [ts] and in English as [s] in Akzent-accent (Latin: accentus).  In some loan words, we find that an original /u/ or /o/ becomes in German the long vowel [o:] and in English the diphthong [@U], as in Sofa-sofa (Arabic: suffah) or in Foto-photo (Latin: Phosphorus). The diphthongization in English usually applies to open syllables with the nucleus /o/, as shown in class # 8.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="38" end_page="39" type="metho">
    <SectionTitle>
7 Discussion
</SectionTitle>
    <Paragraph position="0"> We automatically generated two bilingual phonological corpora. The data is classified by using an EM-based clustering algorithm which is new in that respect that this method is applied to bilingual onset, nucleus and coda corpora. The method provides a probability model over bilingual syllable parts which is exploited to measure the similarity between the language pairs German-Dutch and German-English. The method is able to generalize from the data and reduces the noise introduced by the automatic generation process. Highly probable sound correspondences appear in very likely classes with a high probability whereas unlikely sound correspondences receive lower probabilities.</Paragraph>
    <Paragraph position="1"> Our approach differs from other approaches either in the method used or in the different linguistic task.</Paragraph>
    <Paragraph position="2"> Cahill and Tiberius (2002) is based on mere counts of phoneme correspondences; Kondrak (2003) generates Algonquian phoneme correspondences which are possible according to his translation models; Kondrak (2004) measures if two words are possible cognates; and Knight and Graehl (1998) focus on the back-transliteration of Japanese words to English. Thus, we regard our approach as a thematic complement and not as an overlap to former approaches. null The presented approach depends on the available resources. That means that we can only learn those phoneme correspondences which are represented in the bilingual data. Thus, metathesis which applies to onsets and codas can not be directly observed as the syllable parts are modeled separately. In the Dutch word borst (ONL: bructe), the /r/ shifted from the onset to the coda whereas in English and German (breast-Brust), it remained in the onset. We are also  dependent on the CELEX builders, who followed different transcription strategies for the German and Dutch parts. For instance, elisions occur in the Dutch lexicon but not in the German part. The coda consonant /t/ in lucht (air) disappears in the Dutch word luchtdruk (E: air pressure), [&amp;quot;lUG][drUk], but not in the German word Luftdruck, [lUft][drUk].</Paragraph>
    <Paragraph position="3"> We assume that the similarity scores of the syllable parts might be sharpened by increasing the size of the databases. A first possibility is to take the first transcribed translation and not the first translation in general. As often the first translation is not contained in the pronunciation dictionary.</Paragraph>
    <Paragraph position="4"> Our current data generation process also introduces unrelated word pairs such as Haut-skin ([haut]-[skIn]). However, it is very unlikely that related words do not include similar phonemes. Thus, this word pair should be excluded. Exploiting this knowledge could lead to cleaner input data.</Paragraph>
  </Section>
class="xml-element"></Paper>