<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3601"> <Title>A Syntax-Directed Translator with Extended Domain of Locality</Title> <Section position="7" start_page="5" end_page="6" type="evalu"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> Our experiments are on English-to-Chinese translation, the opposite direction to most of the recent work in SMT. We are not doing the reverse direction at this time partly due to the lack of a sufficiently good parser for Chinese.</Paragraph> <Section position="1" start_page="5" end_page="6" type="sub_section"> <SectionTitle> 6.1 Data Preparation </SectionTitle> <Paragraph position="0"> Our training set is a Chinese-English parallel corpus with 1.95M aligned sentences (28.3M words on the English side). We first word-align them by GIZA++, then parse the English side by a variant of Collins (1999) parser, and finally apply the rule-extraction algorithm of Galley et al. (2004). The resulting rule set has 24.7M xRs rules. We also use the SRI Language Modeling Toolkit (Stolcke, 2002) to train a Chinese trigram model with Knesser-Ney smoothing on the Chinese side of the parallel corpus.</Paragraph> <Paragraph position="1"> Our evaluation data consists of 140 short sentences (< 25 Chinese words) of the Xinhua portion of the NIST 2003 Chinese-to-English evaluation set.</Paragraph> <Paragraph position="2"> Since we are translating in the other direction, we use the first English reference as the source input and the Chinese as the single reference.</Paragraph> <Paragraph position="3"> 2: if cache[e] defined then triangleright this sub-tree visited before? 3: return cache[e] 4: best-0 5: for r[?]Rdo triangleright try each rule r 6: matched, sublist-PATTERNMATCH(t(r),e) triangleright tree pattern matching 7: if matched then triangleright if matched, sublist contains a list of matched subtrees 8: prob-Pr(r) triangleright the probability of rule r 9: for ei[?]sublist do 10: pi,si-TRANSLATE(ei) triangleright recursively solve each sub-problem 11: prob-prob*pi 12: if prob > best then 13: best-prob 14: str-[ximapsto-si]s(r) triangleright plug in the results 15: cache[e]-best, str triangleright caching the best solution for future use 16: return cache[e] triangleright returns the best string with its prob.</Paragraph> </Section> <Section position="2" start_page="6" end_page="6" type="sub_section"> <SectionTitle> 6.2 Initial Results </SectionTitle> <Paragraph position="0"> We implemented our system as follows: for each input sentence, we first run Algorithm 1, which returns the 1-best translation and also builds the derivation forest of all translations for this sentence. Then we extract the top 5000 non-duplicate translated strings from this forest and rescore them with the trigram model and the length penalty.</Paragraph> <Paragraph position="1"> We compared our system with a state-of-the-art phrase-based system Pharaoh (Koehn, 2004) on the evaluation data. Since the target language is Chinese, we report character-based BLEU score instead of word-based to ensure our results are independent of Chinese tokenizations (although our language models are word-based). The BLEU scores are based on single reference and up to 4-gram precisions (r1n4). 
<Paragraph position="5"> Feature weights of both systems are tuned on the same data set. For Pharaoh, we use standard minimum error-rate training (Och, 2003); for our system, since there are only two independent features (as we always fix α = 1), we use a simple grid-based line optimization along the language-model weight axis. For a given language-model weight β, we use binary search to find the best length penalty λ, namely the one that leads to a length ratio closest to 1 against the reference. The results are summarized in Table 1. The rescored translations are better than the 1-best results from the direct model, but still slightly worse than Pharaoh.</Paragraph>
</Section>
</Section>
</Paper>