<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1626"> <Title>Distributed Language Modeling for N-best List Re-ranking</Title> <Section position="4" start_page="0" end_page="216" type="metho"> <SectionTitle> 2 N-best list re-ranking </SectionTitle> <Paragraph position="0"> When translating a source language sentence f into English, the SMT decoder first builds a translation lattice over the source words by applying the translation model, and then explores the lattice and searches for an optimal path as the best translation.</Paragraph> <Paragraph position="1"> The decoder uses different models, such as the translation model, n-gram language model, and fertility model, and combines multiple model scores to calculate the objective function value which favors one translation hypothesis over another (Och et al., 2004).</Paragraph> <Paragraph position="2"> Instead of outputting the top hypothesis e^{(1)} based on the decoder model, the decoder can output N (usually N = 1000) alternative hypotheses {e^{(r)} | r = 1,...,N} for one source sentence and rank them according to their model scores.</Paragraph> <Paragraph position="3"> Figure 1 shows an example of the output from an SMT system. In this example, alternative hypothesis e^{(2)} is a better translation than e^{(1)} according to the reference (Ref), although its model score is lower.</Paragraph>
Figure 1: Example N-best list for one source sentence.
Ref: Since the terrorist attacks on the United States in 2001
e(1): since 200 year , the united states after the terrorist attacks in the incident
e(2): since 2001 after the incident of the terrorist attacks on the united states
e(3): since the united states 2001 threats of terrorist attacks after the incident
e(4): since 2001 the terrorist attacks after the incident
e(5): since 200 year , the united states after the terrorist
<Paragraph position="4"> Because SMT models are not perfect, it is unavoidable that a sub-optimal translation is output as the model-best by the decoder. The objective of N-best list re-ranking is then to re-rank the translation hypotheses using features which are not used during decoding, so that better translations can emerge as "optimal" translations. Our experiments (section 5.1) have shown that the oracle-best translation from a typical N-best list could be 6 to 10 BLEU points better than the model-best translation.</Paragraph> <Paragraph position="5"> In this paper we use the distributed language model on very large data to re-rank the N-best list.</Paragraph> <Section position="1" start_page="216" end_page="216" type="sub_section"> <SectionTitle> 2.1 Sentence likelihood </SectionTitle> <Paragraph position="0"> The goal of a language model is to determine the probability, or in general the "likelihood", of a word sequence w_1 ... w_m (w_1^m for short) given some training data. The standard language modeling approach breaks the sentence probability down into:</Paragraph> <Paragraph position="1"> $P(w_1^m) = \prod_{i=1}^{m} P(w_i \mid w_1^{i-1})$ (1) </Paragraph> <Paragraph position="2"> Under the Markov or higher-order Markov process assumption that only the closest n-1 words have real impact on the choice of w_i, equation 1 is approximated as:</Paragraph> <Paragraph position="3"> $P(w_1^m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}^{i-1})$ (2) </Paragraph> <Paragraph position="4"> The probability of a word given its history can be approximated with the maximum likelihood estimate (MLE) without any smoothing:</Paragraph> <Paragraph position="5"> $P(w_i \mid w_{i-n+1}^{i-1}) \approx \frac{\mathrm{count}(w_{i-n+1}^{i})}{\mathrm{count}(w_{i-n+1}^{i-1})}$ (3) </Paragraph>
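To make equations 1-3 concrete, here is a minimal Python sketch (not the paper's code; the function names and the in-memory Counter representation are illustrative assumptions) that collects raw n-gram counts and scores a sentence with the unsmoothed MLE n-gram approximation:

# Minimal sketch of equations 1-3: raw n-gram counts and an
# unsmoothed MLE n-gram sentence log-probability.
import math
from collections import Counter

def collect_counts(corpus_tokens, max_n):
    """Counts of all n-grams up to length max_n, i.e. the statistics behind Eq. 3."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(corpus_tokens) - n + 1):
            counts[tuple(corpus_tokens[i:i + n])] += 1
    counts[()] = len(corpus_tokens)  # empty history, used as the unigram denominator
    return counts

def mle_prob(counts, history, word):
    """Unsmoothed MLE estimate of P(word | history), as in Eq. 3."""
    denom = counts[history]
    return counts[history + (word,)] / denom if denom else 0.0

def sentence_logprob(counts, sentence, n):
    """log P(w_1 ... w_m) under the n-gram (Markov) approximation of Eq. 2."""
    logp = 0.0
    for i, w in enumerate(sentence):
        history = tuple(sentence[max(0, i - n + 1):i])
        p = mle_prob(counts, history, w)
        logp += math.log(p) if p > 0 else float("-inf")
    return logp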
<Paragraph position="6"> In addition to the standard n-gram probability estimation, we propose three sentence likelihood metrics.</Paragraph> <Paragraph position="7"> * L_0: Number of n-grams matched. The simplest metric for sentence likelihood is to count how many n-grams in the sentence can be found in the corpus:</Paragraph> <Paragraph position="8"> $L_0(w_1^m) = \left|\{\, w_i^j : 1 \le i \le j \le m,\ \mathrm{count}_D(w_i^j) > 0 \,\}\right|$ </Paragraph> <Paragraph position="9"> For example, L_0 for the sentence in figure 2 is 52 because 52 of its embedded n-grams have non-zero counts.</Paragraph> <Paragraph position="10"> * L_1^n: Average interpolated n-gram conditional probability:</Paragraph> <Paragraph position="11"> $L_1^n(w_1^m) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{n} \lambda_k P(w_i \mid w_{i-k+1}^{i-1})$ </Paragraph> <Paragraph position="12"> where the conditional probabilities are estimated from the raw n-gram counts (Eq. 3) without any smoothing.</Paragraph> <Paragraph position="13"> $\lambda_k$ is the weight for the k-gram conditional probability, with $\sum_k \lambda_k = 1$.</Paragraph> <Paragraph position="14"> L_1^n is similar to the standard n-gram LM except that the probability is averaged over the words in the sentence to prevent shorter sentences from being favored unfairly.</Paragraph> <Paragraph position="15"> * L_2: Sum of n-grams' non-compositionality. For each matched n-gram, we consider all the possibilities to cut/decompose it into two shorter n-grams; for example, "the terrorist attacks on the united states" could be decomposed into ("the", "terrorist attacks on the united states") or ("the terrorist", "attacks on the united states"), ..., or ("the terrorist attacks on the united", "states"). For each cut, we calculate the point-wise mutual information (PMI) between the two shorter n-grams. The cut with the minimal PMI is the most "natural" cut for this n-gram. The PMI over the natural cut quantifies the non-compositionality I_nc of an n-gram w_i^j.</Paragraph> <Paragraph position="16"> The higher the value of I_nc(w_i^j), the more likely w_i^j is a meaningful constituent; in other words, it is less likely that w_i^j is composed of two shorter n-grams just by chance (Yamamoto and Church, 2001).</Paragraph> <Paragraph position="17"> Define L_2 formally as:</Paragraph> <Paragraph position="18"> $L_2(w_1^m) = \sum_{w_i^j:\ \mathrm{count}_D(w_i^j) > 0} I_{nc}(w_i^j)$, with $I_{nc}(w_i^j) = \min_{i \le k < j} \mathrm{PMI}(w_i^k, w_{k+1}^j)$ </Paragraph> </Section> </Section> <Section position="5" start_page="216" end_page="218" type="metho"> <SectionTitle> 3 Distributed language model </SectionTitle> <Paragraph position="0"> The fundamental information required to calculate the likelihood of a sentence is the frequency of n-grams in the corpus. In conventional LM training, all the counts are collected from the corpus D and saved to disk for probability estimation. When the size of D becomes large and/or n is increased to capture more context, the count file can be too large to be processed.</Paragraph> <Paragraph position="1"> Instead of collecting n-gram counts offline, we index D using a suffix array (Manber and Myers, 1993) and count the occurrences of w_{i-n+1}^i in D on the fly.</Paragraph> <Section position="1" start_page="216" end_page="216" type="sub_section"> <SectionTitle> 3.1 Calculate n-gram frequency using suffix array </SectionTitle> <Paragraph position="0"> For a corpus D with N words, locating all the occurrences of w_{i-n+1}^i takes O(log N). Zhang and Vogel (2005) introduce a search algorithm which locates all the m(m+1)/2 embedded n-grams in a sentence of m words within O(m log N) time. Figure 2 shows the frequencies of all the embedded n-grams in the sentence "since 2001 after the incident of the terrorist attacks on the united states" matched against a 26 million word corpus. For example, the unigram "after" occurs 4.43x10^4 times and the trigram "after the incident" occurs 106 times. The longest n-gram that can be matched is the 8-gram "of the terrorist attacks on the united states", which occurs 7 times in the corpus.</Paragraph> </Section>
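The lookup described in section 3.1 can be sketched as follows. This is an illustrative toy version with a naive suffix-array construction and one binary search per embedded n-gram, whereas the actual implementation of Zhang and Vogel (2005) uses a compact word-id plus suffix-index representation (8 bytes per token) and the O(m log N) embedded-n-gram search; all class and method names here are assumptions.

import bisect

class SuffixArrayCounter:
    """Toy suffix-array index over a tokenized corpus (cf. section 3.1)."""

    def __init__(self, tokens):
        self.tokens = tokens
        # Suffix array: start positions sorted by the suffix they begin.
        # Naive construction for clarity; a real index never materializes the suffixes.
        self.sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def count(self, ngram):
        """Occurrences of `ngram` in the corpus via two binary searches."""
        target = tuple(ngram)
        prefixes = _PrefixView(self.tokens, self.sa, len(target))
        lo = bisect.bisect_left(prefixes, target)
        hi = bisect.bisect_right(prefixes, target)
        return hi - lo

    def embedded_counts(self, sentence):
        """Counts for all m(m+1)/2 embedded n-grams of `sentence`.
        (One search per n-gram here; Zhang and Vogel (2005) do this in O(m log N).)"""
        m = len(sentence)
        return {tuple(sentence[i:j]): self.count(sentence[i:j])
                for i in range(m) for j in range(i + 1, m + 1)}

class _PrefixView:
    """Read-only view of the first n tokens of every suffix, so bisect can search it."""
    def __init__(self, tokens, sa, n):
        self.tokens, self.sa, self.n = tokens, sa, n
    def __len__(self):
        return len(self.sa)
    def __getitem__(self, k):
        start = self.sa[k]
        return tuple(self.tokens[start:start + self.n])

For example, SuffixArrayCounter(corpus_tokens).count(["after", "the", "incident"]) returns the kind of trigram frequency reported in Figure 2.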
<Section position="2" start_page="216" end_page="218" type="sub_section"> <SectionTitle> 3.2 Client/Server paradigm </SectionTitle> <Paragraph position="0"> To load the corpus and its suffix array index into memory, each word token needs 8 bytes. For example, for a corpus of 2.7 billion words, the total memory required is about 22 GB. It is practically impossible to fit such data into the memory of any single machine.</Paragraph> <Paragraph position="1"> To make use of the large amount of data, we developed a distributed client/server architecture for language modeling. Client/server is the most common paradigm of distributed computing at present (Leopold, 2001). The paradigm describes an asymmetric relationship between two types of processes, of which one is the client and the other is the server. The server process manages some resources and offers a service which can be used by other processes. The client is a process that needs the service in order to accomplish its task. It sends a request to the server and asks for the execution of a task that is covered by the service.</Paragraph> <Paragraph position="2"> We split the large corpus D into d non-overlapping chunks. One can easily verify that for any n-gram w_{i-n+1}^i the count of its occurrences in D is the sum of its occurrences in all the chunks:</Paragraph> <Paragraph position="3"> $\mathrm{count}_D(w_{i-n+1}^i) = \sum_{j=1}^{d} \mathrm{count}_{D_j}(w_{i-n+1}^i)$ </Paragraph> <Paragraph position="4"> Each server [3] loads one chunk of the corpus with its suffix array index. The client sends an English sentence w_1 ... w_m to each of the servers and requests the count information of all the n-grams in the sentence. The client collects the count information from all the servers, sums up the counts for each n-gram, and then calculates the likelihood of the sentence.</Paragraph> <Paragraph position="5"> The client communicates with the servers via TCP/IP sockets. In our experiments, we used 150 servers running on 26 computers to serve one client. Multiple clients can be served at the same time if needed. The process of collecting counts and calculating the sentence probabilities takes about 1 to 2 ms for each English sentence (average length 23.5 words). With this architecture, we can easily make use of larger corpora by adding additional data servers. In our experiments, we used all the 2.7 billion words of data in the English Gigaword corpus without any technical difficulties.</Paragraph> <Paragraph position="6"> [3] A server is a special program that provides services to client processes. It runs on a physical computer, but the concept of a server should not be confused with the actual machine that runs it. In practice, one computer usually hosts multiple servers at the same time.</Paragraph>
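A rough sketch of the client side of this architecture, assuming each chunk server exposes an embedded_counts() call like the suffix-array sketch above; the in-process server objects and thread pool stand in for the actual TCP/IP socket protocol and are illustrative only.

# Client-side aggregation over corpus chunk servers (cf. section 3.2).
# `servers` stands in for the ~150 remote processes, each holding one chunk and
# its suffix array; real servers are reached over TCP/IP sockets, not in-process.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def query_server(server, sentence):
    """Ask one chunk server for the counts of all embedded n-grams of `sentence`."""
    return Counter(server.embedded_counts(sentence))

def aggregate_counts(servers, sentence):
    """count_D(.) is the sum of the per-chunk counts count_{D_d}(.)."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        for partial in pool.map(lambda s: query_server(s, sentence), servers):
            total.update(partial)  # Counter addition: sums counts per n-gram
    return total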
Figure 2: Frequencies of all the embedded n-grams in the sentence "since 2001 after the incident of the terrorist attacks on the united states" matched against the corpus.
<SectionTitle> 4 "More data is better data" or "Relevant data is better data" </SectionTitle> <Paragraph position="7"> Although statistical systems usually improve with more data, performance can decrease if the additional data does not fit the test data. There have been debates in the data-driven NLP community as to whether "more data is better data" or "relevant data is better data". For N-best list re-ranking, the question becomes: should we use all the data to re-rank the hypotheses for one source sentence, or select some corpus chunks that are believed to be relevant to this sentence? Various relevance measures are proposed in (Iyer and Ostendorf, 1999), including content-based relevance criteria and style-based criteria. In this paper, we use a very simple relevance metric.</Paragraph> <Paragraph position="8"> Define corpus D_d's relevance to a source sentence f_t as:</Paragraph> <Paragraph position="9"> $R(D_d, f_t) = \frac{\#\{\text{n-grams in the N-best list of } f_t \text{ found in } D_d\}}{\#\{\text{n-grams in the N-best list of } f_t\}}$ (11) </Paragraph> <Paragraph position="10"> R(D_d, f_t) estimates how well a corpus D_d can cover the n-grams in the N-best list of a source sentence. The higher the coverage, the more relevant D_d is.</Paragraph> <Paragraph position="11"> In the distributed LM architecture, the client first sends the N translations of f_t to all the servers. From the returned n-gram matching information, the client calculates R(D_d, f_t) for each server and chooses the most relevant (e.g., 20) servers for f_t. The n-gram counts returned from these relevant servers are summed up for calculating the likelihood of each hypothesis. One could also assign weights to the n-gram counts returned from different servers during the summation so that the relevant data has more impact than the less-relevant data.</Paragraph> </Section> </Section>
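A small sketch of this relevance-based selection; the exact normalization of Eq. 11 is reconstructed from the prose, and all function names and the max_n cutoff are assumptions. Each chunk is scored by the fraction of distinct n-grams from the N-best list that it covers, and only the counts from the top-k chunks are summed.

from collections import Counter

def nbest_ngrams(hypotheses, max_n=8):
    """All distinct n-grams (up to max_n words) occurring in the N-best list."""
    grams = set()
    for hyp in hypotheses:
        toks = hyp.split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                grams.add(tuple(toks[i:i + n]))
    return grams

def relevance(chunk_counts, grams):
    """Coverage-style relevance R(D_d, f_t): fraction of N-best n-grams seen in the chunk."""
    covered = sum(1 for g in grams if chunk_counts[g] > 0)
    return covered / len(grams) if grams else 0.0

def select_and_sum(per_chunk_counts, hypotheses, k=20):
    """Keep the k most relevant chunks for this sentence and sum their n-gram counts."""
    grams = nbest_ngrams(hypotheses)
    ranked = sorted(per_chunk_counts, key=lambda c: relevance(c, grams), reverse=True)
    total = Counter()
    for chunk in ranked[:k]:
        total.update(chunk)
    return total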
<Section position="6" start_page="218" end_page="220" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> We used the N-best list generated by the Hiero SMT system (Chiang, 2005). Hiero is a statistical phrase-based translation model that uses hierarchical phrases. The decoder uses a trigram language model trained with modified Kneser-Ney smoothing (Kneser and Ney, 1995) on a 200 million word corpus. The 1000-best list was generated for 919 sentences from the MT03 Chinese-English evaluation set.</Paragraph> <Paragraph position="1"> All the data from the English Gigaword corpus plus the English side of the Chinese-English bilingual data available from LDC are used. The 2.97 billion words of data are split into 150 chunks, each of which has about 20 million words. The original order is kept so that each chunk contains data from the same news source and a certain period of time.</Paragraph> <Paragraph position="2"> For example, chunk Xinhua2003 has all the Xinhua News data from year 2003 and NYT9499 038 has the last 20 million words from the New York Times 1994-1999 corpus. One could split the data into larger (smaller) chunks, which would require fewer (more) servers. We chose 20 million words as the size for each chunk because it can be loaded by our smallest machine and it is a reasonable granularity for selection.</Paragraph> <Paragraph position="3"> In total, 150 corpus information servers run on 26 machines connected by a standard Ethernet LAN. One client sends each English hypothesis translation to all 150 servers and uses the returned information to re-rank. The whole process takes about 600 seconds to finish.</Paragraph> <Paragraph position="4"> We use BLEU scores to measure translation accuracy. A bootstrapping method is used to calculate the 95% confidence intervals for BLEU (Koehn, 2004; Zhang and Vogel, 2004).</Paragraph> <Section position="1" start_page="218" end_page="219" type="sub_section"> <SectionTitle> 5.1 Oracle score of the N-best list </SectionTitle> <Paragraph position="0"> Because of spurious ambiguity, there are only 24,612 unique hypotheses in the 1000-best list, on average 27 per source sentence. This limits the potential of N-best re-ranking. Spurious ambiguity is created by the decoder when two hypotheses generated from different decoding paths are considered different even though they have identical word sequences. For example, "the terrorist attacks on the united states" could be the output of decoding path [the terrorist attacks][on the united states] or [the terrorist attacks on][the united states].</Paragraph> <Paragraph position="1"> We first calculate the oracle score from the N-best list to verify that there are alternative hypotheses better than the model-best translation. The oracle-best translations are created by selecting the hypothesis which has the highest sentence BLEU score for each source sentence. Yet a critical problem with the BLEU score is that it is a function of the entire test set and does not give meaningful scores for single sentences. We followed the approximation described in (Collins et al., 2005) to get around this problem. Given a test set with T sentences, N hypotheses are generated for each source sentence f_t. Denote e_t^{(r)} as the r-th ranked hypothesis for f_t; e_t^{(1)} is the model-best hypothesis for this sentence. The baseline BLEU scores are calculated based on the model-best translation set {e_t^{(1)} | t = 1,...,T}.</Paragraph> <Paragraph position="2"> Define the BLEU sentence-level gain for e_t^{(r)} as:</Paragraph> <Paragraph position="3"> $G_{\mathrm{BLEU}}(e_t^{(r)}) = \mathrm{BLEU}(\{e_1^{(1)}, \dots, e_{t-1}^{(1)}, e_t^{(r)}, e_{t+1}^{(1)}, \dots, e_T^{(1)}\}) - \mathrm{BLEU}(\{e_t^{(1)} \mid t = 1,\dots,T\})$ </Paragraph> <Paragraph position="4"> G_BLEU(e_t^{(r)}) calculates the gain if we replace the model-best hypothesis e_t^{(1)} with e_t^{(r)} for sentence f_t and keep the translations for the rest of the test set untouched.</Paragraph> <Paragraph position="5"> With the estimated sentence-level gain for each hypothesis, we can construct the oracle-best translation set by selecting the hypothesis with the highest BLEU gain for each sentence. The oracle-best BLEU translation set is {e_t^{(r_t^*)} | t = 1,...,T}, where r_t^* = argmax_r G_BLEU(e_t^{(r)}).</Paragraph>
Table 1: BLEU scores of the model-best and oracle-best translations.
<Paragraph position="7"> Table 1 shows the BLEU score of the approximated oracle-best translation. The oracle score is 7 points higher than the model-best score even though there are only 27 unique hypotheses per sentence on average. This confirms our observation that there are indeed better translations in the N-best list.</Paragraph> </Section> <Section position="2" start_page="219" end_page="219" type="sub_section"> <SectionTitle> 5.2 Training standard n-gram LM on large data for comparison </SectionTitle> <Paragraph position="0"> Besides comparing the distributed language model re-ranked translations with the model-best translations, we also want to compare the distributed LM with the standard 3-gram and 4-gram language models on the N-best list re-ranking task.</Paragraph> <Paragraph position="1"> Training a standard n-gram model for a 2.97 billion word corpus is much more complicated and tedious than setting up the distributed LM. Because of the huge size of the corpus, we could only manage to train a test-set-specific n-gram LM for this experiment.</Paragraph> <Paragraph position="2"> First, we split the corpus into smaller chunks and generate n-gram count files for each chunk.</Paragraph> <Paragraph position="3"> Each count file is then sub-sampled to the entries in which all the words are listed in the vocabulary of the N-best list (5,522 word types). We merge all the sub-sampled count files into one and train the standard language model based on it.</Paragraph>
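A minimal sketch of this sub-sampling and merging step; the tab-separated "n-gram count" file format, the file-per-chunk layout, and the function names are assumptions rather than the actual toolkit format.

from collections import Counter

def load_filtered_counts(path, vocab):
    """Keep only count entries whose words all appear in the N-best list vocabulary."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, sep, cnt = line.rstrip("\n").rpartition("\t")
            if not sep:
                continue  # skip malformed lines
            words = tuple(ngram.split())
            if words and all(w in vocab for w in words):
                counts[words] += int(cnt)
    return counts

def merge_chunk_counts(paths, vocab):
    """Merge the sub-sampled count files from all corpus chunks into one table."""
    total = Counter()
    for p in paths:
        total.update(load_filtered_counts(p, vocab))
    return total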
<Paragraph position="4"> We managed to train a 3-gram LM using the 2.97 billion word corpus. The resulting LM requires 2.3 GB of memory to be loaded for the re-ranking experiment. A 4-gram LM for this N-best list is 13 GB in size and cannot fit into memory. We split the N-best list into 9 parts to reduce the vocabulary size of each sub N-best list to around 1000 words. The 4-gram LM tailored for each sub N-best list is around 1.5 to 2 GB in size.</Paragraph> <Paragraph position="5"> Training higher-order standard n-gram LMs with this method requires even more partitions of the N-best list to get smaller vocabularies. When the vocabulary becomes too small, the smoothing can fail and result in unreliable LM probabilities.</Paragraph> <Paragraph position="6"> Adapting the standard n-gram LM for each individual source sentence is almost infeasible given our limited computing resources. Thus we do not have equivalent n-gram LMs to compare with the distributed LM for conditions where the most relevant data chunks are used to re-rank the N-best list for a particular source sentence.</Paragraph> </Section> <Section position="3" start_page="219" end_page="220" type="sub_section"> <SectionTitle> 5.3 Results </SectionTitle> <Paragraph position="0"> Table 2 lists the results of the re-ranking experiments under different conditions. The re-ranked translation improved the BLEU score from 31.44 to 32.64, significantly better than the model-best translation.</Paragraph> <Paragraph position="1"> Different metrics are compared under the same data conditions. L_0, though extremely simple, gives quite nice results on N-best list re-ranking. With only one corpus chunk (the most relevant one) for each source sentence, L_0 improved the BLEU score to 32.22. We suspect that L_0 works well because it is in line with the nature of the BLEU score: BLEU measures the similarity between the translation hypothesis and the human references by counting how many n-grams in the MT output can be found in the references.</Paragraph> <Paragraph position="2"> Instead of assigning weight 1 to all the matched n-grams as in L_0, L_2 weights each n-gram by its non-compositionality. For all data conditions, L_2 consistently gives the best results. The metric family L_1 is close to the standard n-gram LM probability estimation. Because no smoothing is used, the L_1^3 performance (32.00) is slightly worse than the standard 3-gram LM result (32.22). On the other hand, increasing the length of the history in L_1 generally improves the performance.</Paragraph>
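For concreteness, the three likelihood metrics compared here (section 2.1) could be computed from an aggregated n-gram count table (a Counter of token tuples, e.g. as produced by the aggregation sketches above) roughly as follows. This is an illustrative sketch, not the authors' code; the handling of sentence-initial histories and of length-one n-grams in L_2 is an assumption.

import math

def l0(counts, sentence):
    """L0: number of embedded n-grams of the sentence with a non-zero corpus count."""
    m = len(sentence)
    return sum(1 for i in range(m) for j in range(i + 1, m + 1)
               if counts[tuple(sentence[i:j])] > 0)

def l1(counts, sentence, lambdas, corpus_size):
    """L1^n: average interpolated, unsmoothed MLE n-gram conditional probability.
    lambdas[k-1] weights the k-gram probability; sentence-initial histories are
    simply truncated (an assumption of this sketch)."""
    m, total = len(sentence), 0.0
    for i, w in enumerate(sentence):
        for k, lam in enumerate(lambdas, start=1):
            hist = tuple(sentence[max(0, i - k + 1):i])
            denom = counts[hist] if hist else corpus_size
            total += lam * (counts[hist + (w,)] / denom if denom else 0.0)
    return total / m if m else 0.0

def pmi(counts, left, right, corpus_size):
    """Point-wise mutual information between two adjacent n-grams."""
    p_ab = counts[left + right] / corpus_size
    p_a, p_b = counts[left] / corpus_size, counts[right] / corpus_size
    return math.log(p_ab / (p_a * p_b)) if p_ab > 0 and p_a > 0 and p_b > 0 else 0.0

def l2(counts, sentence, corpus_size):
    """L2: sum over matched n-grams of their non-compositionality, i.e. the
    minimal PMI over all binary cuts; length-1 n-grams are skipped here."""
    m, total = len(sentence), 0.0
    for i in range(m):
        for j in range(i + 2, m + 1):
            gram = tuple(sentence[i:j])
            if counts[gram] > 0:
                total += min(pmi(counts, gram[:c], gram[c:], corpus_size)
                             for c in range(1, len(gram)))
    return total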
<Paragraph position="3"> Figure 3 shows the BLEU score of the re-ranked translation when using different numbers of relevant data chunks for each sentence. The selected data chunks may differ for each sentence. For example, the 2 most relevant corpora for sentence 1 are Xinhua2002 and Xinhua2003, while for sentence 2 APW2003A and NYT2002D are more relevant. When we use the most relevant data chunk (about 20 million words) to re-rank the N-best list, 36 chunks of data are used at least once for the 919 different sentences, which accounts for about 720 million words in total. Thus the x-axis in figure 3 should not be interpreted as the total amount of data used but as the number of the most relevant corpora used for each sentence.</Paragraph> <Paragraph position="4"> All three metrics in figure 3 show that using all the data together (150 chunks, 2.97 billion words) does not give better discriminative power than using only some relevant chunks. This supports our argument in section 4 that relevance selection is helpful in N-best list re-ranking. In some cases the re-ranked N-best list has a higher BLEU score after adding a supposedly "less-relevant" corpus chunk and a lower BLEU score after adding a "more-relevant" chunk. This indicates that the relevance measurement (Eq. 11) does not fully reflect the real "relevance" of a data chunk for a sentence. With a better relevance measurement, one could expect further improvements.</Paragraph> </Section> </Section> <Section position="7" start_page="220" end_page="221" type="metho"> <SectionTitle> 6 Related work and discussion </SectionTitle> <Paragraph position="0"> Yamamoto and Church (2001) used suffix arrays to compute the frequency and location of an n-gram in a corpus. The frequencies are used to find "interesting" substrings which have high mutual information.</Paragraph> <Paragraph position="1"> Soricut et al. (2002) build a Finite State Acceptor (FSA) to compactly represent all possible English translations of a source sentence according to the translation model. All sentences in a big monolingual English corpus are then scanned by this FSA, and those accepted by the FSA are considered as possible translations for the source sentence. The corpus is split into hundreds of chunks for parallel processing. All the sentences in one chunk are scanned by the FSA on one processor. Matched sentences from all chunks are then used together as possible translations. The assumption of this work, that possible translations of a source sentence can be found as exact matches in a big monolingual corpus, is weak even for a very large corpus. This method can easily fail to find any possible translation and return zero proposed translations.</Paragraph> <Paragraph position="2"> Kirchhoff and Yang (2005) used a factored 3-gram model and a 4-gram LM (modified KN smoothing) together with seven system scores to re-rank an SMT N-best list. They improved the translation quality of their best baseline (Spanish-English) from BLEU 30.5 to BLEU 31.0.</Paragraph> <Paragraph position="3"> Iyer and Ostendorf (1999) select and weight data to train language models for ASR. The data is selected based on its relevance for a topic or its similarity to data known to be in the same domain as the test data. Each additional document is classified as in-domain or out-of-domain according to cosine distance with TF-IDF term weights, a POS-tag LM, and a 3-gram word LM. n-gram counts from the in-domain and the additionally selected out-of-domain data are then combined with a weighting factor. The combined counts are used to estimate a LM with standard smoothing.</Paragraph> <Paragraph position="4"> Hildebrand et al. (2005) use information retrieval to select relevant data to train adapted translation and language models for an SMT system. Si et al. (2002) use unigram distribution similarity to select the document collection which is most relevant to the query documents. Their work is mainly focused on information retrieval applications.</Paragraph> </Section> </Paper>