<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1005">
  <Title>Translating Names and Technical Terms in Arabic Text</Title>
  <Section position="3" start_page="34" end_page="35" type="relat">
    <SectionTitle>
2 Previous Work
</SectionTitle>
    <Paragraph position="0"> (Arbabi et al., 1994) developed an algorithm at IBM for the automatic forward transliteration of Arabic personal names into the Roman alphabet. Using a hybrid neural network and knowledge-based system approach, this program first inserts the appropriate missing vowels into the Arabic name, then converts the name into a phonetic representation, and maps this representation into one or more possible Roman spellings of the name. The Roman spellings may also vary across languages (Sharifin English corresponds to Chgrife in French). However, they do not deal with back-transliteration.</Paragraph>
    <Paragraph position="1"> (Knight and Graehl, 1997) describe a back-transliteration system for Japanese. It comprises a generative model of how an English phrase becomes Japanese:  1. An English phrase is written.</Paragraph>
    <Paragraph position="2"> 2. A translator pronounces it in English.</Paragraph>
    <Paragraph position="3"> 3. The pronunciation is modified to Japanese sound inventory.</Paragraph>
    <Paragraph position="4"> fit the 4. The sounds are converted into the Japanese katakana alphabet.</Paragraph>
    <Paragraph position="5"> 5. Katakana is written.</Paragraph>
    <Paragraph position="6">  They build statistical models for each of these five processes. A given model describes a mapping between sequences of type A and sequences of type B. The model assigns a numerical score to any particular sequence pair a and b, also called the probability of b given a, or P(b\]a). The result is a bidirectional translator: given a particular Japanese string, they compute the n most likely English translations. Fortunately, there are techniques for coordinating solutions to sub-problems like the five above, and for using generative models in the reverse direction. These techniques rely on probabilities and Bayes' Rule.</Paragraph>
    <Paragraph position="7"> For a rough idea of how this works, suppose we built an English phrase generator that produces word sequences according to some probability distribution P(w). And suppose we built an English pronouncer that takes a word sequence and assigns it a set of pronunciations, again probabilistically, according to some P(elw ). Given a pronunciation e, we may want to search for the word sequence w that  maximizes P(w\[e). Bayes' Rule lets us equivalently maximize P(w) * P(e\]w), exactly the two distributions just modeled.</Paragraph>
    <Paragraph position="8"> Extending this notion, (Knight and Graehl, 1997) built five probability distributions: 1. P(w) - generates written English word sequences. null 2. P(e\]w) - pronounces English word sequences. 3. P(jle) - converts English sounds into Japanese sounds.</Paragraph>
    <Paragraph position="9"> 4. P(klj ) - converts Japanese sounds to katakana writing.</Paragraph>
    <Paragraph position="10"> 5. P(o\[k) - introduces misspellings caused by optical character recognition (OCR).</Paragraph>
    <Paragraph position="11"> Given a Japanese string o they can find the English word sequence w that maximizes the sum over all e, j, and k, of</Paragraph>
    <Paragraph position="13"> These models were constructed automatically from data like text corpora and dictionaries. The most interesting model is P(jle), which turns English sound sequences into Japanese sound sequences, e.g., S AH K ER (soccer) into s a kk a a.</Paragraph>
    <Paragraph position="14"> Following (Pereira and Riley, 1997), P(w) is implemented in a weighted finite-state acceptor (WFSA) and the other distributions in weighted finite-state transducers (WFSTs). A WFSA is a state/transition diagram with we.ights and symbols on the transitions, making some output sequences more likely than others. A WFST is a WFSA with a pair of symbols on each transition, one input and one output. Inputs and outputs may include the empty string. Also following (Pereira and Riley, 1997), there is a general composition algorithm for constructing an integrated model P(xlz ) from models P(x\]y) and P(y\]z). They use this to combine an observed Japanese string with each of the models in turn. The result is a large WFSA containing all possible English translations, the best of which can be extracted by graph-search algorithms.</Paragraph>
  </Section>
class="xml-element"></Paper>