<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1067">
  <Title>A Syntax-based Statistical Translation Model</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 An Example
</SectionTitle>
      <Paragraph position="0"> We first introduce our translation model with an example. Section 2.2 will describe the model more formally. We assume that an English parse tree is fed into a noisy channel and that it is translated to a Japanese sentence.1 1The parse tree is flattened to work well with the model. See Section 3.1 for details.</Paragraph>
      <Paragraph position="1"> Figure 1 shows how the channel works. First, child nodes on each internal node are stochastically reordered. A node with a47 children has a47a49a48 possible reorderings. The probability of taking a specific reordering is given by the model's r-table. Sample model parameters are shown in Table 1. We assume that only the sequence of child node labels influences the reordering. In Figure 1, the top VB node has a child sequence PRP-VB1-VB2. The probability of reordering it into PRP-VB2-VB1 is 0.723 (the second row in the r-table in Table 1). We also reorder VB-TO into TO-VB, and TO-NN into NN-TO, so therefore the probability of the second tree in Figure 1 is a50a52a51a54a53a56a55a58a57a60a59a61a50a52a51a54a53a63a62a65a64a60a59a66a50a52a51a68a67a58a64a58a57a70a69a71a50a52a51a72a62a65a67a73a62 . Next, an extra word is stochastically inserted at each node. A word can be inserted either to the left of the node, to the right of the node, or nowhere. Brown et al. (1993) assumes that there is an invisible NULL word in the input sentence and it generates output words that are distributed into random positions. Here, we instead decide the position on the basis of the nodes of the input parse tree. The insertion probability is determined by the n-table. For simplicity, we split the n-table into two: a table for insert positions and a table for words to be inserted (Table 1). The node's label and its parent's label are used to index the table for insert positions. For example, the PRP node in Figure 1 has parent VB, thus</Paragraph>
      <Paragraph position="3"/>
      <Paragraph position="5"/>
      <Paragraph position="7"> dex. Using this label pair captures, for example, the regularity of inserting case-marker particles.</Paragraph>
      <Paragraph position="8"> When we decide which word to insert, no conditioning variable is used. That is, a function word like ga is just as likely to be inserted in one place as any other. In Figure 1, we inserted four words (ha, no, ga and desu) to create the third tree. The top VB node, two TO nodes, and the NN node inserted nothing. Therefore, the probability of obtaining the third tree given the second tree is</Paragraph>
      <Paragraph position="10"> Finally, we apply the translate operation to each leaf. We assume that this operation is dependent only on the word itself and that no context is consulted.2 The model's t-table specifies the probability for all cases. Suppose we obtained the translations shown in the fourth tree of Figure 1.</Paragraph>
      <Paragraph position="11"> The probability of the translate operation here is a50a52a51a68a64a58a165a58a55a178a59a66a50a52a51a68a64a56a50a58a50a176a59a61a50a52a51a172a50a170a57a58a67a60a59a61a50a52a51a68a57a58a57a58a57a60a59a65a168a56a51a172a50a58a50a58a50a179a69a71a50a52a51a172a50a180a168a181a50a170a67 . The total probability of the reorder, insert and translate operations in this example is a50a52a51a72a62a65a67a73a62a49a59 3.498e-9 a59a173a50a52a51a172a50a180a168a181a50a170a67a49a69 1.828e-11. Note that there 2When a TM is used in machine translation, the TM's role is to provide a list of possible translations, and a language model addresses the context. See (Berger et al., 1996).</Paragraph>
      <Paragraph position="12"> are many other combinations of such operations that yield the same Japanese sentence. Therefore, the probability of the Japanese sentence given the English parse tree is the sum of all these probabilities. null We actually obtained the probability tables (Table 1) from a corpus of about two thousand pairs of English parse trees and Japanese sentences, completely automatically. Section 2.3 and Appendix 4 describe the training algorithm.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Formal Description
</SectionTitle>
      <Paragraph position="0"> This section formally describes our translation model. To make this paper comparable to (Brown et al., 1993), we use English-French notation in this section. We assume that an English parse tree a182 is transformed into a French sentence a183 . Let the English parse tree a182 consist of nodes</Paragraph>
      <Paragraph position="2"> a184a66a187 , and let the output French sentence consist of French words a188 a185 a161a189a188 a186 a161a181a51a181a51a181a51a63a161a189a188a73a190 . Three random variables, a191 , a192 , and a193 are channel operations applied to each node. Insertion a191 is an operation that inserts a French word just before or after the node. The insertion can be none, left, or right. Also it decides what French word to insert. Reorder a192 is an operation that changes the order of the children of the node. If a node has three children, e.g., there are a57a194a48a70a69a195a164 ways to reorder them. This operation applies only to non-terminal nodes in the tree. Translation a193 is an operation that translates a terminal English leaf word into a French word. This operation applies only to terminal nodes. Note that an English word can be translated into a French NULL word.</Paragraph>
      <Paragraph position="3"> The notation a196a197a69 a160a199a198 a161a201a200a202a161a204a203a205a162 stands for a set of values of a160 a191a206a161a204a192a207a161a189a193a208a162 . a196a66a209a49a69 a160a199a198 a209a210a161a201a200a65a209a211a161a204a203a212a209a199a162 is a set of values of random variables associated with  The probability of getting a French sentence a183 given an English parse tree a182 is</Paragraph>
      <Paragraph position="5"> where Stra163 a213 a163 a182a234a169a201a169 is the sequence of leaf words of a tree transformed by a213 from a182 .</Paragraph>
      <Paragraph position="6"> The probability of having a particular set of values of random variables in a parse tree is</Paragraph>
      <Paragraph position="8"> This is an exact equation. Then, we assume that a transform operation is independent from other transform operations, and the random variables of each node are determined only by the node itself.</Paragraph>
      <Paragraph position="9"> So, we obtain</Paragraph>
      <Paragraph position="11"> The random variables a196a63a209a60a69 a160a199a198 a209a211a161a201a200a173a209a250a161a204a203a212a209a240a162 are assumed to be independent of each other. We also assume that they are dependent on particular features of the node a184 a209 . Then,  where a8 , a9 , and a10 are the relevant features to a191 , a192 , and a193 , respectively. For example, we saw that the parent node label and the node label were used for a8 , and the syntactic category sequence of children was used for a9 . The last line in the above formula introduces a change in notation, meaning that those probabilities are the model pa-</Paragraph>
      <Paragraph position="13"> and a18 are the possible values for a8 , a9 , and a10 , respectively.</Paragraph>
      <Paragraph position="14"> In summary, the probability of getting a French sentence a183 given an English parse tree a182 is  and Pa163 a203 a12a18 a169 , decide the behavior of the translation model, and these are the probabilities we want to estimate from a training corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Automatic Parameter Estimation
</SectionTitle>
      <Paragraph position="0"> To estimate the model parameters, we use the EM algorithm (Dempster et al., 1977). The algorithm iteratively updates the model parameters to maximize the likelihood of the training corpus. First, the model parameters are initialized. We used a uniform distribution, but it can be a distribution taken from other models. For each iteration, the number of events are counted and weighted by the probabilities of the events. The probabilities of events are calculated from the current model parameters. The model parameters are re-estimated based on the counts, and used for the next iteration. In our case, an event is a pair of a value of a random variable (such as a198 , a200 , or a203 ) and a feature value (such as a14 , a16 , or a18 ). A separate counter is used for each event. Therefore, we need the same number of counters, a25 a163a199a198 a161 a14 a169 , a25 a163 a200a205a161 a16 a169 , and a25 a163 a203a52a161 a18 a169 , as the number of entries in the probability tables,  ble combinations, where a12a198a24a12 and a12a200 a12 are the number of possible values for a198 and a200 , respectively (a203 is uniquely decided when a198 and a200 are given for a particular a160 a182a175a161a204a183a73a162 ). Appendix describes an efficient implementation that estimates the probability in polynomial time.3 With this efficient implementation, it took about 50 minutes per iteration on our corpus (about two thousand pairs of English parse trees and Japanese sentences. See the next section).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>