<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0904"> <Title>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 25-32, Ann Arbor, June 2005. c(c)2005 Association for Computational Linguistics Syntactic Features for Evaluation of Machine Translation</Title> <Section position="4" start_page="27" end_page="30" type="evalu"> <SectionTitle> 3 Experiments </SectionTitle> <Paragraph position="0"> Our testing data contains two parts. One part is a set of 665 English sentences generated by a Chinese-English MT system. And for each MT hypothesis, three reference translations are associated with it.</Paragraph> <Paragraph position="1"> Input: dependency tree T, maximum length N of the headword chain Output: headword chains from length 1 to N for i = 1 to N for every node n in T if i == 1 add n's word to n's 1 word headword chains; else for every direct child c of n for every i-1 words headword chain hc of c newchain = joint(n's word, hc); add newchain to the i words headword chains of n; The human judgments, on a scale of 1 to 5, were collected at the 2003 Johns Hopkins Speech and Language Summer Workshop, which tells the overall quality of the MT hypotheses. The translations were generated by the alignment template system of Och (2003). This testing set is called JHU testing set in this paper. The other set of testing data is from MT evaluation workshop at ACL05. Three sets of human translations (E01, E03, E04) are selected as the references, and the outputs of seven MT systems (E9 E11 E12 E14 E15 E17 E22) are used for testing the performance of our syntactic metrics. Each set of MT translations contains 929 English sentences, each of which is associated with human judgments for its uency and adequacy. The uency and adequacy scores both range from 1 to 5.</Paragraph> <Section position="1" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 3.1 Sentence-level Evaluation </SectionTitle> <Paragraph position="0"> Our syntactic metrics are motivated by a desire to better capture grammaticality in MT evaluation, and thus we are most interested in how well they correlate with human judgments of sentences' uency, rather than the adequacy of the translation. To do this, the syntactic metrics (computed with the Collins (1999) parser) as well as BLEU were used to evaluate hypotheses in the test set from ACL05 MT workshop, which provides both uency and adequacy scores for each sentence, and their Pearson coef cients of correlation with the human uency scores were computed. For BLEU and HWCM, in order to avoid assigning zero scores to individual sentences, when precision for n-grams of a particular length is zero we replace it with an epsilon value of 10[?]3. We choose E14 and E15 as two representative MT systems in the ACL05 MT workshop data set, which have relatively high human scores and low human scores respectively. The results are shown in Table 1 and Table 2, with every metric indexed by the maximum n-gram length or subtree depth. The last row of the each table shows the treekernel-based measures, which have no depth parameter to adjust, but implicitly consider all depths. The results show that in both systems our syntactic metrics all achieve a better performance in the correlation with human judgments of uency. We also notice that with the increasing of the maximum length of n-grams, the correlation of BLEU with human judgments does not necessarily increase, but decreases in most cases. 
<Section position="1" start_page="28" end_page="29" type="sub_section"> <SectionTitle> 3.1 Sentence-level Evaluation </SectionTitle> <Paragraph position="0"> Our syntactic metrics are motivated by a desire to better capture grammaticality in MT evaluation, and thus we are most interested in how well they correlate with human judgments of sentence fluency, rather than with the adequacy of the translation. To this end, the syntactic metrics (computed with the Collins (1999) parser) as well as BLEU were used to evaluate hypotheses in the test set from the ACL05 MT workshop, which provides both fluency and adequacy scores for each sentence, and their Pearson correlation coefficients with the human fluency scores were computed. For BLEU and HWCM, in order to avoid assigning zero scores to individual sentences, when the precision for n-grams of a particular length is zero we replace it with an epsilon value of 10^-3. We chose E14 and E15 as two representative MT systems in the ACL05 MT workshop data set, with relatively high and relatively low human scores, respectively. The results are shown in Table 1 and Table 2, with every metric indexed by the maximum n-gram length or subtree depth. The last row of each table shows the tree-kernel-based measures, which have no depth parameter to adjust but implicitly consider all depths. The results show that for both systems our syntactic metrics all correlate better with human judgments of fluency than BLEU does. We also notice that as the maximum n-gram length increases, the correlation of BLEU with human judgments does not necessarily increase; in most cases it decreases. This is contrary to the argument made for BLEU, which holds that longer n-grams represent a sentence's fluency better than shorter ones. The problem can be explained by the limited coverage of the reference translations. In our experiments, every hypothesis is evaluated against three human translations. Since the three human translations can cover only a small set of possible translations, as the n-gram length increases more and more correct n-grams fail to be found in the references, so the precision of longer n-grams becomes less reliable than that of shorter ones and hurts the final scores. In corpus-level evaluation of an MT system, the sparse-data problem is less serious than in sentence-level evaluation, since the overlapping n-grams of all the sentences and their references are summed up. This is why the traditional BLEU algorithm used for corpus-level evaluation usually sets the maximum n-gram length to 4 or 5. A similar trend can be found in the syntax-tree- and dependency-tree-based metrics, but their rates of decrease are much lower than BLEU's, which indicates that the syntactic metrics are less affected by the sparse-data problem. The poor performance of the tree-kernel-based metrics also confirms our arguments on the sparse-data problem, since the kernel measures implicitly consider the overlap ratios of subtrees of all shapes and are therefore strongly affected by it.</Paragraph> <Paragraph position="1"> Though our syntactic metrics are proposed for evaluating sentence fluency, we are curious how well they do in the overall evaluation of sentences. Thus we also computed each metric's correlation with human overall judgments on E14, E15, and the JHU test set. The overall human score for each sentence in E14 and E15 is computed by summing its fluency score and adequacy score. The results are shown in Table 3, Table 4, and Table 5. The syntactic metrics again achieve competitive correlations in this test; among them HWCM, based on headword chains, performs better than BLEU in the evaluation of E14 and E15 and slightly worse than BLEU on the JHU test set. Just as with the fluency evaluation, HWCM and the other syntactic metrics show more stable performance as the n-gram length (subtree depth) increases.</Paragraph> </Section> <Section position="2" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 3.2 Corpus-level Evaluation </SectionTitle> <Paragraph position="0"> While sentence-level evaluation is useful if we are interested in a confidence measure on MT outputs, corpus-level evaluation is more useful for comparing MT systems and guiding their development. Does higher sentence-level correlation necessarily indicate higher correlation in corpus-level evaluation? To answer this question, we used our syntactic metrics and BLEU to evaluate all the human-scored MT systems (E9, E11, E12, E14, E15, E17, E22) in the ACL05 MT workshop test set, and computed the correlation with human overall judgments. The human judgment for an MT system is estimated by summing up each sentence's human overall score.</Paragraph> <Paragraph position="1"> Table 6 shows the results, indexed by different n-gram lengths and tree depths.</Paragraph>
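As a concrete reading of the evaluation setup in Sections 3.1 and 3.2, the sketch below shows how the two kinds of correlation can be computed: sentence-level correlation compares per-sentence metric scores (with the 10^-3 epsilon substituted for zero n-gram precisions) against per-sentence human scores, while corpus-level correlation compares one aggregate metric score per system against the sum of that system's per-sentence human overall scores. The function names, the smooth() helper, the dictionary layout, and the use of scipy.stats.pearsonr are our own choices; the paper does not specify an implementation.

# Illustrative sketch of the sentence- and corpus-level correlation setups;
# not the authors' code.  scipy.stats.pearsonr computes the Pearson coefficient.
from scipy.stats import pearsonr

EPSILON = 1e-3  # floor used when a sentence-level n-gram/chain precision is zero


def smooth(precision: float) -> float:
    """Replace a zero precision with epsilon before scoring a sentence (Sec. 3.1)."""
    return precision if precision > 0 else EPSILON


def sentence_level_r(metric_scores, human_scores):
    """Pearson correlation between per-sentence metric scores and per-sentence
    human scores (fluency, or fluency + adequacy for the overall judgment)."""
    r, _ = pearsonr(metric_scores, human_scores)
    return r


def corpus_level_r(per_system):
    """Pearson correlation across systems: one automatic score per system vs.
    the sum of that system's per-sentence human overall scores (Sec. 3.2).

    per_system maps a system name (e.g. "E14") to a dict with keys
    "corpus_metric_score" (float) and "human_overall" (list of per-sentence scores).
    """
    systems = sorted(per_system)
    auto = [per_system[s]["corpus_metric_score"] for s in systems]
    human = [sum(per_system[s]["human_overall"]) for s in systems]
    r, _ = pearsonr(auto, human)
    return r


# Hypothetical usage (variable names are placeholders):
# r_e14 = sentence_level_r(hwcm_scores_e14, fluency_scores_e14)

Note that pearsonr is unaffected by adding a constant to either argument, which is relevant to the discussion that follows.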
<Paragraph position="2"> We can see that the corpus-level correlation and the sentence-level correlation do not always correspond. For example, the kernel dependency subtree metric achieves very good performance in corpus-level evaluation but poor performance in sentence-level evaluation. Sentence-level correlation reflects the relative qualities of different hypotheses within an MT system, and gives no information about the relative qualities of different systems. If we uniformly decrease or increase every hypothesis's automatic score within an MT system, the sentence-level correlation with human judgments remains the same (Pearson correlation is unchanged by adding a constant to one variable), but the corpus-level correlation will change. So it is possible to obtain inconsistent corpus-level and sentence-level correlations. From the results, we can see that as the n-gram length increases, the performance of BLEU and HWCM first improves, up to length 5, and then starts to decrease; the optimal n-gram length of 5 corresponds to the usual setting of the BLEU algorithm. This shows that corpus-level evaluation, compared with sentence-level evaluation, is much less sensitive to the sparse-data problem and thus leaves more room for making use of comprehensive evaluation metrics. We speculate that this is why the kernel dependency subtree metric achieves the best performance among all the metrics. We can also see that HWCM and DSTM beat BLEU in most cases and exhibit more stable performance.</Paragraph> <Paragraph position="3"> An example hypothesis that was assigned a high score by HWCM but a low score by BLEU is shown in Table 7. In this particular sentence, the common head-modifier relations aboard-plane and plane-the caused a high headword-chain overlap, but did not appear as common n-grams counted by BLEU. The hypothesis is missing the word "fifth", but was nonetheless assigned a high score by the human judges. This is probably due to its fluency, which HWCM seems to capture better than BLEU.</Paragraph> </Section> </Section> </Paper>