<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0906"> <Title>A Computational Approach to Deciphering Unknown Scripts</Title> <Section position="3" start_page="0" end_page="37" type="metho"> <SectionTitle> 2 Writing Systems </SectionTitle> <Paragraph position="0"> To decipher unknown scripts, is useful to understand the nature of known scripts, both ancient and modern. Scholars often classify scripts into three categories: (1) alphabetic, (2) syllabic, disk is six inches wide, double-sided, and is the earliest known document printed with a form of movable type.</Paragraph> <Paragraph position="1"> .~ and (3) log6graphic (Sampson, 1985).</Paragraph> <Paragraph position="2"> Alphabetic systems attempt to represent single sounds with single characters, though no system is &quot;perfect.&quot; For example, Semitic alphabets have no characters for vowel sounds. And even highly regular writing systems like Spanish have plenty of spelling variation, as we shall see later.</Paragraph> <Paragraph position="3"> Syllabic systems have characters for entire syllables, such as &quot;ba&quot; and &quot;shu.&quot; Both Linear B and Mayan are primarily syllabic, as is Japanese kana. The Phaistos Disk from Crete (see Figure 1) is thought to be syllabic, because of the number of distinct characters present.</Paragraph> <Paragraph position="4"> Finally, logographic systems have characters for entire words. Chinese is often cited as a typical example.</Paragraph> <Paragraph position="5"> Unfortunately, actual scripts do not fall neatly into one category or another (DeFrancis, 1989; Sproat, forthcoming). Written Japanese will contain syllabic kana, alphabetic roomaji, and logographic kanji characters all in the same document. Chinese characters actually have a phonetic component, and words are often composed of more than one character. Irregular English writing is neither purely alphabetic nor purely logographic; it is sometimes called morphophonemic. Ancient writing is also mixed, and archaeologists frequently observe radical writing changes in a single language over time.</Paragraph> </Section> <Section position="4" start_page="37" end_page="37" type="metho"> <SectionTitle> 3 Experimental Framework </SectionTitle> <Paragraph position="0"> In this paper, we do not decipher any ancient scripts. Rather, we develop algorithms and apply them to the &quot;decipherment&quot; of known, modern scripts. We pretend to be ignorant of the connection between sound and writing. Once our algorithms have come up with a proposed phonetic decipherment of a given document, we route the sound sequence to a speech synthesizer. If a native speaker can understand the speech and make sense of it, then we consider the decipherment a success. (Note that the native speaker need not even be literate, theoretically). We experiment with modern writing systems that span the categories described above.</Paragraph> <Paragraph position="1"> We are interested in the following questions: * Can automatic techniques decipher an unknown script? If so, how accurately? * What quantity of written text is needed for successful decipherment? (this may be * quite limited by circumstances) * What knowledge of the spoken language is needed? Can it to be extracted automatically from available resources? What quantity of resources? * Are some writing systems easier to decipher than others? Are there systematic differences among alphabetic, syllabic, and logographic systems? * Are word separators necessary or helpful? 
* Can automatic techniques be robust against language evolution (e.g., modern versus ancient forms of a language)?
* Can automatic techniques identify the language behind a script as a precursor to deciphering it?</Paragraph> </Section>
<Section position="5" start_page="37" end_page="41" type="metho"> <SectionTitle> 4 Alphabetic Writing (Spanish) </SectionTitle>
<Paragraph position="0"> Five hundred years ago, Spaniards invaded Mayan lands, burning documents and effectively eliminating everyone who could read and write. (Modern Spaniards will be quick to point out that most of the work along those lines had already been carried out by the Aztecs.) Mayan hieroglyphs remained uninterpreted for many centuries. We imagine that if the Mayans had invaded Spain, 20th-century Mayan scholars might be deciphering ancient Spanish documents instead.</Paragraph>
<Paragraph position="2"> We begin with an analysis of Spanish writing, i.e., the rules by which spoken sounds are rendered as written characters. The task of decipherment will be to re-invent these rules and apply them to written documents in reverse. First, it is necessary to settle on the basic inventory of sounds and characters. Characters are easy; we simply tabulate the distinct ones observed in text. For sounds, we need something that will serve as reasonable input to a speech synthesizer. We use a Spanish-relevant subset of the International Phonetic Alphabet (IPA), which seeks to capture all sounds in all languages. Actually, we use an ASCII version of the IPA called SAMPA (Speech Assessment Methods Phonetic Alphabet), originally developed under ESPRIT project 1541. There is also a public-domain Castilian speech synthesizer (called Mbrola) for the Spanish SAMPA sound set. Figure 2 shows the sound and character inventories.</Paragraph>
Figure 2: Spanish sounds (rough English equivalents in parentheses) and characters. Sounds: B, D, G, J (ny as in canyon), L (y as in yarn), T (th as in thin), a, b, d, e, f, g, i, k, l, m, n, o, p, r, rr (trilled), s, t, tS (ch as in chin), u, x (h as in hat). Characters: ñ, á, é, í, ó, ú, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z.
<Paragraph position="3"> Now to spelling rules. Spanish writing is clearly not a one-for-one proposition; for example, silence can produce a character (h). Moreover, there are ambiguities. The sound L (English y-sound) may be written as either ll or y. The sound i may also produce the character y, so the pronunciation of y varies according to context. The sound rr (trilled r) is written rr in the middle of a word and r at the beginning of a word.</Paragraph>
Figure 3: A sample of Spanish spelling rules. The left-hand side of each rule contains a Spanish sound (and possible conditions), while the right-hand side contains zero or more Spanish characters.
<Paragraph position="7"> Figure 3 shows a sample set of Spanish spelling rules. We formalized these rules computationally in a finite-state transducer (Pereira and Riley, 1997). The transducer is bidirectional. Given a specific sound sequence, we can extract all possible character sequences, and vice versa. It turns out that while there are many ways to write a given Spanish sound sequence with these rules, it is fairly clear how to pronounce a written sequence. In our decipherment experiment, we blithely ignore many of the complications just described, and pretend that Spanish writing is, in fact, a one-for-one proposition. That is, to write down a sound sequence, one replaces each sound with a single character. We do allow ambiguity, however: a given sound may produce one character sometimes, and another character other times.</Paragraph>
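To make this simplified model concrete, here is a minimal sketch in Python of the one-for-one writing assumption with ambiguity. The table of sounds, characters, and probabilities is invented for illustration; it is not the inventory of Figure 2 or any values learned in our experiments.

```python
import random

# A toy version of the simplified writing model: each sound emits exactly
# one character, but the choice of character may be ambiguous.  All
# probability values below are made up for illustration.
SPELLING = {
    "B": {"b": 0.6, "v": 0.4},            # sound B may be written b or v
    "a": {"a": 1.0},
    "k": {"c": 0.7, "k": 0.2, "q": 0.1},  # sound k has several spellings
    "s": {"s": 0.9, "z": 0.1},
    "i": {"i": 0.9, "y": 0.1},            # sound i occasionally written y
}

def write(sounds):
    """Sample one possible written form of a sound sequence under the
    one-for-one assumption (exactly one character per sound)."""
    chars = []
    for s in sounds:
        options = SPELLING[s]
        chars.append(random.choices(list(options),
                                    weights=list(options.values()))[0])
    return "".join(chars)

print(write(["B", "a", "k", "a", "s"]))  # e.g. 'bacas' or 'vakaz'
```

Decipherment amounts to recovering a table like this, without ever seeing it, from a written document and statistics of the spoken language alone.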
<Paragraph position="8"> Decipherment is driven by knowledge about the spoken language. In the case of archaeological decipherment, this knowledge may include vocabulary, grammar, and meaning. We use simpler data. We collect frequencies of sound triples in spoken Spanish. If we know that the triple "t l k" is less frequent than "a s t," then we should ultimately prefer a decipherment that contains the latter instead of the former, all other things being equal.</Paragraph>
<Paragraph position="9"> This leads naturally into a statistical approach to decipherment. Our goal is to settle on a sound-to-character scheme that somehow maximizes the probability of the observed written document. Like many linguistic problems, this one can be formalized in the noisy-channel framework. Our sound-triple frequencies can be turned into conditional probabilities such as P(t | a s). We can estimate the probability of a sound sequence as the product of such local probabilities: P(s1 ... sn) ≈ P(s1) * P(s2 | s1) * P(s3 | s1 s2) * ... * P(sn | sn-2 sn-1).</Paragraph>
<Paragraph position="11"> A specific sound-to-character scheme can be represented as a set of conditional probabilities such as P(v | B). Read this as "the probability that Spanish sound B is written with character v." We can estimate the conditional probability of an entire character sequence given an entire sound sequence as a product of such probabilities: P(c1 ... cn | s1 ... sn) ≈ P(c1 | s1) * P(c2 | s2) * ... * P(cn | sn).</Paragraph>
<Paragraph position="13"> Armed with these basic probabilities, we can compute two things. First, the total probability of observing a particular written sequence of characters c1 ... cn: P(c1 ... cn) = Σ over all s1 ... sn of P(s1 ... sn) * P(c1 ... cn | s1 ... sn).</Paragraph>
<Paragraph position="15"> And second, we can compute the most probable phonetic decipherment s1 ... sn of a particular written sequence of characters c1 ... cn. This will be the one that maximizes P(s1 ... sn | c1 ... cn), or equivalently, P(s1 ... sn) * P(c1 ... cn | s1 ... sn). Of course, the sound-to-character probabilities are not given to us. We want to assign values that maximize P(c1 ... cn). These same values can then be used to decipher.</Paragraph>
<Paragraph position="18"> We adapt the EM algorithm (Dempster et al., 1977) for decipherment, starting with a uniform probability over P(character | sound). That is, any sound will produce any character with probability 0.0333. The algorithm successively refines these probabilities, and it is guaranteed to increase P(c1 ... cn) at each iteration. EM requires us to consider an exponential number of decipherments at each iteration, but this can be done efficiently with a dynamic-programming implementation (Baum, 1972). The training scheme is illustrated in Figure 4.</Paragraph>
Figure 4: We first train a phonetic model on phonetic data. We then combine the phonetic model with a generic (uniform) spelling model to create a probabilistic generator of character sequences. Given a particular character sequence (the "ancient document"), the EM algorithm searches for adjustments to the spelling model that will increase the probability of that character sequence; only the sound-to-character parameter values are allowed to change.
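To make the training scheme concrete, here is a minimal sketch of the EM loop in Python. For brevity it uses a sound-pair (bigram) model in place of the sound-triple model described above, and the inventories, probabilities, and "document" are invented placeholders; a full implementation would also include a Viterbi pass to read off the most probable decipherment.

```python
from collections import defaultdict

# Hypothetical inventories and a made-up written document; the real
# experiment uses the Spanish sounds and characters of Figure 2 and the
# first page of Don Quixote.
SOUNDS = ["a", "s", "t", "k"]
CHARS = ["a", "s", "t", "c"]
doc = list("astascata")  # stand-in for the character sequence c1 ... cn

# Fixed spoken-language model, estimated beforehand from phonetic data.
# This sketch uses sound pairs rather than triples; values are invented.
P_init = {s: 1.0 / len(SOUNDS) for s in SOUNDS}
P_sound = {
    "a": {"a": 0.1, "s": 0.4, "t": 0.3, "k": 0.2},
    "s": {"a": 0.5, "s": 0.1, "t": 0.3, "k": 0.1},
    "t": {"a": 0.6, "s": 0.1, "t": 0.1, "k": 0.2},
    "k": {"a": 0.7, "s": 0.1, "t": 0.1, "k": 0.1},
}

# Spelling model P(character | sound) to be learned; start uniform.
P_spell = {s: {c: 1.0 / len(CHARS) for c in CHARS} for s in SOUNDS}

n = len(doc)
for iteration in range(15):
    # E-step: forward-backward recursions sum over all sound-sequence
    # decipherments of the document under the current parameters.
    alpha = [defaultdict(float) for _ in range(n)]
    beta = [defaultdict(float) for _ in range(n)]
    for s in SOUNDS:
        alpha[0][s] = P_init[s] * P_spell[s][doc[0]]
        beta[n - 1][s] = 1.0
    for i in range(1, n):
        for s in SOUNDS:
            alpha[i][s] = P_spell[s][doc[i]] * sum(
                alpha[i - 1][p] * P_sound[p][s] for p in SOUNDS)
    for i in range(n - 2, -1, -1):
        for s in SOUNDS:
            beta[i][s] = sum(P_sound[s][q] * P_spell[q][doc[i + 1]] *
                             beta[i + 1][q] for q in SOUNDS)
    total = sum(alpha[n - 1][s] for s in SOUNDS)  # P(c1 ... cn)

    # Expected counts of (sound, character) pairs under the current model.
    counts = {s: defaultdict(float) for s in SOUNDS}
    for i in range(n):
        for s in SOUNDS:
            counts[s][doc[i]] += alpha[i][s] * beta[i][s] / total

    # M-step: re-estimate the spelling model from the expected counts.
    for s in SOUNDS:
        norm = sum(counts[s].values()) or 1.0
        for c in CHARS:
            P_spell[s][c] = counts[s][c] / norm

    print(f"iteration {iteration + 1}: P(document) = {total:.4g}")
```

The forward-backward recursions are the dynamic program that lets each E-step sum over an exponential number of candidate decipherments in time linear in the document length.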
<Paragraph position="19"> In our experiment, we use the first page of the novel Don Quixote as our "ancient" Spanish document c1 ... cn. To get phonetic data, we might tape-record modern Spanish speakers and transcribe the recorded speech into the IPA alphabet. Or we might use documents written in an alternate, known script, if any existed. In this work, we take a short cut by reverse-engineering a set of medical Spanish documents, using the finite-state transducer described above, to obtain a long phonetic training sequence.</Paragraph>
<Paragraph position="21"> At each EM iteration, we extract the most probable decipherment and synthesize it into audible form. At iteration 0, with uniform probabilities, the result is pure babble. At iteration 1, Spanish speakers report that "it sounds like someone speaking Spanish, but using words I don't know." At iteration 15, the decipherment can be readily understood.</Paragraph>
<Paragraph position="22"> (Recordings can be accessed on the World Wide Web at http://www.isi.edu/natural-language/mt/decipher.html.)</Paragraph>
<Paragraph position="25"> If we reverse-engineer Don Quixote, we can obtain a gold-standard phonetic decipherment. Our automatic decipherment correctly identifies 96% of the sounds. Incorrect or dropped sounds are due to our naive one-for-one model, and not to weak algorithms or small corpora. For example, "de la Mancha" is deciphered as "d e l a m a n T i a" even though the characters ch really represent the single sound tS rather than the two sounds T i.</Paragraph>
<Paragraph position="26"> Figure 5 shows how performance changes at each EM iteration. It shows three curves. The worst-performing curve reflects the accuracy of the most probable decipherment using the formula above, i.e., the one that maximizes P(s1 ... sn) * P(c1 ... cn | s1 ... sn). We find that it is better to ignore the P(s1 ... sn) factor altogether, because while the learned sound-to-character probabilities are fairly good, they are still somewhat unsure, and this leaves room for the phonetic model to overrule them incorrectly.</Paragraph>
Figure 5: As we increase the number of EM iterations, we see an improvement in decipherment performance (measured in terms of correctly generated phonemes). The best result is obtained by weighting the learned spelling model more highly than the sound model, i.e., by choosing a phonetic decoding s1 ... sn for character sequence c1 ... cn that maximizes P(s1 ... sn) * P(c1 ... cn | s1 ... sn)^3.
<Paragraph position="29"> However, the P(s1 ... sn) model does have useful things to contribute. Our best performance, shown in the highest curve, is obtained by weighting the learned sound-to-character probabilities more highly, i.e., by maximizing P(s1 ... sn) * P(c1 ... cn | s1 ... sn)^3 (a small scoring sketch appears at the end of this section).</Paragraph>
<Paragraph position="30"> We performed some alternate experiments. Using phoneme pairs instead of triples is workable--it results in a drop from 96% accuracy to 92%. Our main experiment uses word separators; removing these degrades performance. For example, it becomes more difficult to tell whether the r character should be trilled or not. In our experiments with Japanese and Chinese, described next, we did not use word separators, as these languages are usually written without them.</Paragraph>
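The scoring sketch referenced above: under the weighted combination, a candidate decipherment s1 ... sn for a character sequence c1 ... cn is ranked as follows. This is a minimal illustration in Python; sound_logprob and spell_logprob are placeholder model functions, not part of our actual implementation.

```python
def decipherment_score(sounds, chars, sound_logprob, spell_logprob,
                       channel_weight=3.0):
    """Log-score of a candidate decipherment s1 ... sn for characters
    c1 ... cn under P(s1 ... sn) * P(c1 ... cn | s1 ... sn)^channel_weight.
    sound_logprob and spell_logprob are placeholder model functions."""
    score = sound_logprob(sounds)                      # spoken-language model
    for c, s in zip(chars, sounds):                    # one-for-one channel
        score += channel_weight * spell_logprob(c, s)  # weighted spelling term
    return score
```

Raising channel_weight lets the learned spelling model dominate the fixed sound model; the cubed setting corresponds to the best-performing curve described above.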
</Section>
<Section position="6" start_page="41" end_page="41" type="metho"> <SectionTitle> 5 Syllabic writing (Japanese Kana) </SectionTitle>
<Paragraph position="0"> The phonology of Japanese adheres strongly to a consonant-vowel-consonant-vowel structure, which makes it quite amenable to syllabic writing. Indeed, the Japanese have devised a kana syllabary consisting of about 80 symbols. There is a symbol for ka, another for ko, etc. Thus, the writing of a sound like K depends on its phonetic context. Modern Japanese is not written in pure kana; it employs a mix of alphabetic, syllabic, and logographic writing devices. However, we will use pure kana as a stand-in for a wide range of syllabic writing systems, as Japanese data is readily available. We obtain kana text sequences from the Kyoto Treebank, and we obtain sound sequences by using the finite-state transducer described in (Knight and Graehl, 1998).</Paragraph>
<Paragraph position="1"> As with Spanish, we build a spoken language model based on sound-triple frequencies. The sound-to-kana model is more complex. We assume that each kana token is produced by a sequence of one, two, or three sounds. Using knowledge of syllabic writing in general, plus an analysis of Japanese sound patterns, we restrict those sequences to be (1) consonant-vowel, (2) vowel-only, (3) consonant-no-vowel, and (4) consonant-semivowel-vowel (a small segmentation sketch appears at the end of this section). For initial experiments, we mapped "small kana" onto their large versions, even though this leads to some incorrect learning targets, such as KIYO instead of KYO. We implement the sound and sound-to-kana models in a large finite-state machine and use EM to learn individual weights such as P(ka-kana | SH Y U). Unlike the Spanish case, we entertain phonetic hypotheses of various lengths for any given character sequence.</Paragraph>
<Paragraph position="2"> Deciphering 200 sentences of kana text yields 99% phoneme accuracy. We render the sounds imperfectly (yet inexpensively) through our public-domain Spanish synthesizer. The result is comprehensible to a Japanese speaker.</Paragraph>
<Paragraph position="3"> We also experimented with deciphering smaller documents. 100 sentences yields 97.5% accuracy; 50 sentences yields 96.2% accuracy; 20 sentences yields 82.2% accuracy; five sentences yields 48.5% accuracy. If we were to give the sound sequence model some knowledge about words or grammar, the accuracy would likely not fall off as quickly.</Paragraph>
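The segmentation sketch referenced above: under the four patterns, a given sound sequence can be chopped into kana-sized chunks in several competing ways, and these alternatives are what the EM training must weigh against one another. The sound classes and the example below are illustrative placeholders; the actual system encodes these alternatives in a finite-state machine rather than by explicit enumeration.

```python
# Illustrative sound classes; the real inventory and the mapping to
# actual kana symbols are more detailed than this sketch.
VOWELS = {"a", "i", "u", "e", "o"}
SEMIVOWELS = {"y", "w"}
CONSONANTS = {"k", "g", "s", "sh", "z", "t", "d", "n", "h", "b", "p", "m", "r"}

def chunks(sounds):
    """Kana-sized chunks that could begin the sound sequence: vowel only,
    lone consonant, consonant-vowel, or consonant-semivowel-vowel."""
    out = []
    if sounds and sounds[0] in VOWELS:
        out.append(sounds[:1])                        # vowel-only
    if sounds and sounds[0] in CONSONANTS:
        out.append(sounds[:1])                        # consonant with no vowel
        if len(sounds) > 1 and sounds[1] in VOWELS:
            out.append(sounds[:2])                    # consonant-vowel
        if (len(sounds) > 2 and sounds[1] in SEMIVOWELS
                and sounds[2] in VOWELS):
            out.append(sounds[:3])                    # consonant-semivowel-vowel
    return out

def segmentations(sounds):
    """All ways of grouping a sound sequence into kana-sized chunks; these
    are the competing hypotheses the EM training must weigh."""
    if not sounds:
        return [[]]
    results = []
    for chunk in chunks(sounds):
        for rest in segmentations(sounds[len(chunk):]):
            results.append([chunk] + rest)
    return results

print(segmentations(["k", "y", "o", "t", "o"]))
# -> [[['k', 'y', 'o'], ['t'], ['o']], [['k', 'y', 'o'], ['t', 'o']]]
```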
</Section>
<Section position="7" start_page="41" end_page="42" type="metho"> <SectionTitle> 6 "Logographic" writing (Chinese) </SectionTitle>
<Paragraph position="0"> As we mentioned in Section 2, Chinese characters have internal phonetic components, and written Chinese does not really have a different character for every word, so it is not really logographic. However, it is representative of writing systems whose distinct characters are measured in the thousands, as opposed to 20-50 for alphabets and 40-90 for syllabaries. This creates several difficulties for decipherment:
* computational complexity--our decipherment algorithm runs in time roughly cubic in the number of known sound triples.</Paragraph>
<Paragraph position="1"> * very rare characters--if we only see a character once, the context may not be rich enough for us to guess its sound.</Paragraph>
<Paragraph position="2"> * sparse sound-triple data--the decipherment of a written text is likely to include novel sound triples.</Paragraph>
<Paragraph position="3"> We created spoken language data for Chinese by automatically (if imperfectly) pronouncing Chinese text. We are indebted to Richard Sproat for running our documents through the text-to-speech system at Bell Labs.</Paragraph>
<Paragraph position="4"> We created sound-pair frequencies over the resulting set of 1177 distinct syllables, represented in pinyin format, suitable for synthesizing speech. We attempted to decipher a written document of 1900 phrases and sentences, containing 2113 distinct characters and no word separators.</Paragraph>
<Paragraph position="5"> Our result was 22% syllable accuracy, after 20 EM iterations. We may compare this to a baseline strategy of guessing the pinyin sound de0 (English "of") for every character, which yields 3.2% accuracy. This shows a considerable improvement, but the speech is not comprehensible. Due to computational limits, we had to (1) eliminate all pinyin pairs that occurred fewer than five times, and (2) prevent our decoder from proposing any novel pinyin pairs. Because our gold-standard decipherment contained many rare sounds and novel pairs, these computational limits severely impaired accuracy.</Paragraph> </Section> </Paper>