<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1708"> <Title>Automatic Measuring of English Language Proficiency using MT Evaluation Technology</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> BTEC </SectionTitle> <Paragraph position="0"> BTEC was designed to cover expressions for every potential subject in travel conversation.</Paragraph> <Paragraph position="1"> This test set was collected by investigating &quot;phrasebooks&quot; that contain Japanese/English sentence pairs that experts consider useful for tourists traveling abroad. The average sentence length is 8 words. The test set for this experiment consists of 510 sentences from the BTEC corpus.</Paragraph> <Paragraph position="2"> The total number of examinees is 18, and the range of their TOEIC scores is between the SLTA1 consists of 330 sentences in 23 conversations from the ATR bilingual travel conversation database (Takezawa, 1999). The average sentence length is 13 words. This corpus was collected from simulated dialogues between Japanese and English speakers mediated by a professional interpreter. The conversations mainly concern hotel situations, such as reservations and enquiries.</Paragraph> <Paragraph position="3"> The total number of examinees is 29, and the range of their TOEIC scores is between the 300s and 800s. Excluding the 600s, every hundred-point range has 5 examinees.</Paragraph> <Paragraph position="4"> For the automatic evaluation, we collected 16 references for each test sentence. 
One of them is from the English side of the test set, and the remaining 15 were translated by 5 bilinguals (3 references per bilingual).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Experimental Results 4.2.1 Experimental Results of Test Set Selection </SectionTitle> <Paragraph position="0"> Figures 3 and 4 show the correlation between the test-sentence-unit automatic scores and the subjects' TOEIC scores. Here, the automatic score is calculated using Equation 2, 4 or 5. Figure 3 shows the results on BTEC, and Fig. 4 shows the results on SLTA1. In these figures, the ordinate represents the correlation.</Paragraph> <Paragraph position="1"> The filled circles indicate the results using the DP-based automatic evaluation method. The gray circles indicate the results using BLEU.</Paragraph> <Paragraph position="2"> The empty circles indicate the results using NIST. Looking at these figures, we find that the three automatic evaluation methods show a similar tendency. Comparing BTEC and SLTA1, BTEC contains more problematic test sentences. In BTEC, about 20% of the test sentences give a correlation of less than 0. Meanwhile, in the SLTA1, this percentage is about Table 1 shows examples of low-correlated test sentences. As shown in the table, BTEC contains more short and frequently used expressions than does SLTA1. Such expressions are thought to be too easy for testing, which would explain the low correlation. SLTA1 still contains a few sentences of this kind (&quot;Example 1&quot; of SLTA1 in the table). Additionally, there is another factor contributing to the low correlation in SLTA1.</Paragraph> <Paragraph position="3"> The expression in &quot;Example 2&quot; of SLTA1 in the table is not very easy to translate. For this test sentence, several different English translations are possible. 
Thus, the automatic evaluation methods cannot score it correctly because the references do not cover enough of this variety. These results suggest that the method can remove test sentences that are inadequate not only because they are too easy but also because they are difficult to evaluate automatically. Figures 5 and 6 show the relationship between the number of test sentences and correlation. This correlation is calculated between the test-set-unit automatic scores and the subjects' TOEIC scores. Here, the automatic score is calculated using Equation 3, 6 or 7. Figure 5 shows the results on BTEC, and Fig. 6 shows the results on SLTA1.</Paragraph> <Paragraph position="4"> In these figures, the abscissa represents the number of test sentences, i.e., Nsent in Equations 3, 6 and 7, and the ordinate represents the correlation. Definitions of the circles are the same as those in the previous figure. Here, the test sentence selection is based on the correlation shown in Figs. 3 and 4.</Paragraph> <Paragraph position="5"> Comparing Fig. 5 to Fig. 6, in the case of the full test set (510 test sentences for BTEC, 330 test sentences for SLTA1), the correlation of BTEC is lower than that of SLTA1.</Paragraph> <Paragraph position="6"> As we mentioned above, the ratio of low-correlated test sentences in BTEC is higher than that of SLTA1 (see Figs. 3 and 4). This issue is thought to cause the decrease in the correlation shown in Fig. 5. However, by applying the selection based on sentence-unit correlation, these obstructive test sentences can be removed. This permits the selection of small, highly correlated test sets. In these figures, the highest correlations are around 0.95.</Paragraph> <Paragraph position="7"> For the experiments on English proficiency measurement, we carried out a leave-one-out cross validation test. 
The leave-one-out cross validation test is conducted not only for the measurement of English proficiency but also for the test set selection.</Paragraph> <Paragraph position="8"> To evaluate the proficiency measurement by the proposed method, we calculate the standard error of the results of a leave-one-out cross validation test. The following formula is the definition of the standard error.</Paragraph> <Paragraph position="10"> where Nuser is the number of users, Ti is the actual TOEIC score of user i, and Ai is user i's TOEIC score as estimated by the proposed method.</Paragraph> <Paragraph position="11"> Figures 7 and 8 show the relationship between the number of test sentences and the standard error.</Paragraph> <Paragraph position="12"> In these figures, the abscissa represents the number of test sentences, and the ordinate represents the standard error. Definitions of the circles are the same as in the previous figure. Here, the test sentence selection is based on the correlation shown in Figs. 3 and 4.</Paragraph> <Paragraph position="13"> Looking at Figs. 7 and 8, we can observe differences between the standard errors of BTEC and SLTA1. This is thought to be due to the difference in the number of subjects in the experiments (for the leave-one-out cross validation test, 17 subjects with BTEC and 28 subjects with SLTA1). Even though these were closed experiments, the results in Figs. 5 and 6 show an even higher correlation with BTEC than with SLTA1 at the highest point. Therefore, there is room for improvement by increasing the number of subjects with BTEC.</Paragraph> <Paragraph position="14"> In the test using 30 to 60 test sentences in Figs. 7 and 8, the standard errors are much smaller than in the test using the full test set (510 test sentences for BTEC, 330 test sentences for SLTA1). 
These results imply that the test set selection works very well and that it enables precise testing with a smaller test set.</Paragraph> </Section> </Section> </Paper>