<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0210"> <Title>Measuring Non-native Speakers' Proficiency of English by Using a Test with Automatically-Generated Fill-in-the-Blank Questions</Title> <Section position="3" start_page="0" end_page="65" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> English has spread so widely that 1,500 million people, about a quarter of the world's population, speak it, though at most about 400 million speak it as their native language (Crystal, 2003). English education for non-native speakers is therefore of great importance, both now and in the near future.</Paragraph> <Paragraph position="1"> Advances in computer technology have produced electronic tools for language learning, known as Computer-Assisted Language Learning (CALL), and for language testing, known as Computer-Based Testing (CBT) or Computer-Adaptive Testing (CAT). However, no computerized support for producing a test, that is, a collection of questions for evaluating language proficiency, has emerged to date (see the detailed discussion in Section 6).</Paragraph> <Paragraph position="2"> Fill-in-the-Blank Questions (FBQs) are widely used, from the classroom level to far larger scales, to measure people's proficiency in English as a second language. Examples of such tests include TOEFL (Test Of English as a Foreign Language, http://www.ets.org/toefl/) and TOEIC (Test Of English for International Communication, http://www.ets.org/toeic/).</Paragraph> <Paragraph position="3"> A test comprising FBQs has three merits: (1) it is easy for test-takers to input answers; (2) computers can mark the answers, so marking is invariable and objective; and (3) FBQs are suitable for the modern testing theory, Item Response Theory (IRT).</Paragraph> <Paragraph position="4"> Because writing incorrect choices that distract only the non-proficient test-taker is regarded as a highly skilled business (Alderson, 1996), FBQs have been written by human experts. Test construction is therefore time-consuming and expensive. As a result, it is not practical to use up-to-date texts for question writing, nor to tailor questions to individual students.</Paragraph> <Paragraph position="5"> To solve the problems of time and expenditure, this paper proposes a method for generating FBQs using a corpus, a thesaurus, and the Web. Experiments have shown that the proficiency estimated through IRT with the generated FBQs correlates highly with non-native speakers' real proficiency. The system not only provides a quick and inexpensive testing method, but also features the following advantages: (I) It provides "anyone" individually with up-to-date and interesting questions for self-teaching. We have implemented a program that downloads any Web page, such as a news site, and generates questions from it. (II) It enables on-demand testing "anytime and anyplace." We have implemented a system that operates on a mobile phone: questions are generated and pooled on the server, and upon a user's request, questions are downloaded and CAT (Wainer, 2000) is conducted on the phone. The mobile-phone system is scheduled to be deployed in Japan in May of 2005.</Paragraph> <Paragraph position="6"> The remainder of this paper is organized as follows. Section 2 introduces our method for generating FBQs, Section 3 explains how to estimate test-takers' proficiency, and Section 4 presents the experiments that demonstrate the effectiveness of the proposal. Section 5 provides some discussion, and Section 6 explains the differences between our proposal and related work, followed by concluding remarks.</Paragraph>
<Section position="1" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 2 Question Generation Method </SectionTitle> <Paragraph position="0"> We first review what an FBQ is, and then explain our method for producing one.</Paragraph> <Paragraph position="1"> 2.1 Fill-in-the-Blank Question (FBQ). FBQs are one of the most popular types of questions in testing. Figure 1 shows a typical sample, consisting of a partially blanked English sentence and four choices for filling the blank. The tester ordinarily assumes that exactly one choice is correct (in this case, b)) and that the other three choices are incorrect. The latter are often called distracters, because their role is to distract the less proficient test-takers.</Paragraph> <Paragraph position="2"> Figure 1. A sample FBQ: "I only have to _______ my head above water one more week." a) reserve b) keep c) guarantee d) promise (N.B. the correct choice is b) keep).</Paragraph>
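To make merits (1) and (2) concrete, an FBQ is trivially representable as data that a computer can mark objectively. The following minimal Python sketch uses invented field and function names; it illustrates the question format described above, not the authors' implementation.

```python
# Sketch: an FBQ as a small data structure that a computer can mark
# invariably and objectively. Names are invented for illustration.
from dataclasses import dataclass

@dataclass
class FBQ:
    blanked_sentence: str   # e.g. "I only have to ______ my head ..."
    choices: list           # exactly one is correct; the rest distract
    answer: str             # the correct choice

    def mark(self, response: str) -> bool:
        """Marking is an exact match, hence consistent across graders."""
        return response == self.answer

q = FBQ("I only have to ______ my head above water one more week.",
        ["reserve", "keep", "guarantee", "promise"], "keep")
print(q.mark("keep"), q.mark("promise"))  # True False
```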
<Paragraph position="3"> 2.2 Flow of generation. Using the sample question above, the outline of generation is presented below (Figure 2). A seed sentence (in this case, "I only have to keep my head above water one more week.") is input from the designated source, e.g., a corpus or a Web page such as a well-known news site. Selection of the seed sentence (source text) is an important open problem, because the difficulty of the seed text should influence the difficulty of the generated question. As for text difficulty, several measures, known as readability measures and usually defined as functions of sentence length and word frequency, have been proposed, such as Lexile by MetaMetrics (http://www.Lexile.com). In this paper, we used corpora of business and travel conversations, because TOEIC itself is oriented toward business and daily conversation.</Paragraph> <Paragraph position="4"> [a] The seed sentence, a correct English sentence, is decomposed into a sentence with a blank (the blanked sentence) and the correct choice for the blank. The seed sentence is analyzed morphologically by a computer, and the blank position is determined according to the testing knowledge, which tells us what part of the seed sentence should be blanked. In this paper's experiment, the verb of the seed is selected, because the verb is one of the basic types of blanked words in popular FBQs such as those of TOEIC. We thus obtain the blanked sentence "I only have to ______ my head above water one more week." and the correct choice "keep."</Paragraph> <Paragraph position="5"> [b] To be a good distracter, a candidate must maintain the grammatical characteristics of the correct choice and should be similar to it in meaning. (There are aspects other than meaning, such as spelling, pronunciation, and translation; depending on the aspect, lexical information sources other than a thesaurus are needed. The blanked word can likewise be a word of another POS (Part-Of-Speech), for which knowledge from the field of second-language education can be exploited: previous studies on errors in English usage by Japanese native speakers, such as (Izumi and Isahara, 2004), unveiled patterns of errors specific to Japanese, e.g., (1) article selection errors, which result from the fact that there are no articles in Japanese; (2) preposition selection errors, which result from the fact that some Japanese counterparts have broader meanings; and (3) adjective selection errors, which result from mismatches of meaning between Japanese words and their counterparts. Such knowledge may generate questions that are harder for Japanese who study English.) Using a thesaurus, words similar to the correct choice are listed as candidates, e.g., "clear," "guarantee," "promise," "reserve," and "share" for the above "keep." We used an in-house English thesaurus whose hierarchy is based on an off-the-shelf thesaurus for Japanese called Ruigo-Shin-Jiten (Ohno and Hamanishi, 1984). In this thesaurus, the original word "keep" expresses two different concepts: (1) possession-or-disposal, which is shared by the words "clear" and "share," and (2) promise, which is shared by the words "guarantee," "promise," and "reserve." Since this depends on the thesaurus used, some may sense a slight discomfort at these concepts; if a different thesaurus were used, the distracter candidates might differ.</Paragraph> <Paragraph position="6"> [c] Verify (see Section 2.3 for details) the incorrectness of the sentence restored with each candidate; if the restored sentence is not incorrect (in this case, with "clear" and "share"), the candidate is given up.</Paragraph> <Paragraph position="7"> [d] If a sufficient number (in this paper, three) of candidates remain, form a question by randomizing the order of all the choices ("keep," "guarantee," "promise," and "reserve"); otherwise, another seed sentence is input and generation restarts from step [a].</Paragraph> </Section> <Section position="2" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 2.3 Incorrectness Verification </SectionTitle> <Paragraph position="0"> In FBQs, by definition, (1) the blanked sentence restored with the correct choice must be correct, and (2) the blanked sentence restored with a distracter must be incorrect.</Paragraph> <Paragraph position="1"> To generate an FBQ, the incorrectness of the sentence restored with each distracter candidate must therefore be verified, and if the restored sentence is not incorrect, the candidate is rejected.</Paragraph> </Section> <Section position="3" start_page="62" end_page="63" type="sub_section"> <SectionTitle> Zero-Hit Sentence </SectionTitle> <Paragraph position="0"> The Web includes all manner of language data in vast quantities, and these data are easy for everyone to access through a networked computer. Recently, exploitation of the Web for various natural language applications has been rising (Grefenstette, 1999; Turney, 2001; Kilgarriff and Grefenstette, 2003; Tonoike et al., 2004).</Paragraph> <Paragraph position="1"> We also propose a Web-based approach. We dare to assume that if a sentence occurs on the Web, it is correct; otherwise, the sentence is unlikely to be correct, in that no one has written it despite the variety and quantity of data on the Web.</Paragraph> <Paragraph position="2"> Figure 3 illustrates verification based on retrieval from the Web. Here, s(x) is the blanked sentence, s(w) denotes the sentence restored with the word w, and hits(y) represents the number of documents retrieved from the Web for the key y. If hits(s(w)) is small, the sentence restored with the word w is unlikely, and thus the word w should be a good distracter; if hits(s(w)) is large, the restored sentence is likely, and thus the word w is unlikely to be a good distracter and is given up. We used the strongest condition: if hits(s(w)) is zero, the word w is taken to be a good distracter; if hits(s(w)) is not zero, the word w is given up.</Paragraph> <Paragraph position="3"> Figure 3. Verification by retrieval from the Web: the blanked sentence s(x) = "I only have to ____ my head above water one more week." and the hit counts hits(s(w)) of the incorrect-choice candidates.</Paragraph>
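Steps [a] through [d] and the zero-hit filter can be pulled together into a short end-to-end sketch. Everything external is simulated with toy data so the example runs as-is: TOY_THESAURUS and TOY_WEB are hypothetical stand-ins for the in-house thesaurus and the Web, and the morphological analysis is bypassed by passing the verb explicitly; none of this reflects the authors' actual implementation.

```python
# End-to-end sketch of steps [a]-[d] with zero-hit verification.
# The thesaurus and the Web are simulated so the example is runnable.
import random

TOY_THESAURUS = {  # hypothetical stand-in for the in-house thesaurus
    "keep": ["clear", "guarantee", "promise", "reserve", "share"],
}
TOY_WEB = [  # hypothetical stand-in for the Web: pretend these occur
    "i only have to clear my head above water one more week.",
    "i only have to share my head above water one more week.",
]

def hits(query: str) -> int:
    """Stand-in for a search engine: count matching 'pages'."""
    return sum(query.lower() in page for page in TOY_WEB)

def generate_fbq(seed: str, verb: str, needed: int = 3):
    # [a] Decompose the seed into a blanked sentence and correct choice
    #     (a morphological analyzer finds the verb in the real system).
    blanked = seed.replace(verb, "______", 1)
    # [b] List thesaurus words similar to the correct choice.
    candidates = TOY_THESAURUS.get(verb, [])
    # [c] Keep only zero-hit candidates: restored sentences found on
    #     the (toy) Web are judged "not incorrect" and given up.
    distracters = [w for w in candidates
                   if hits(blanked.replace("______", w)) == 0]
    # [d] With enough distracters, shuffle all choices into a question;
    #     otherwise the caller supplies another seed sentence.
    if len(distracters) < needed:
        return None
    choices = distracters[:needed] + [verb]
    random.shuffle(choices)
    return blanked, choices, verb

seed = "I only have to keep my head above water one more week."
print(generate_fbq(seed, "keep"))
# "clear" and "share" are rejected (non-zero hits), leaving
# "guarantee", "promise", and "reserve" as distracters.
```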
<Section position="4" start_page="63" end_page="63" type="sub_section"> <SectionTitle> Retrieval NOT By Sentence </SectionTitle> <Paragraph position="0"> It is often the case that retrieval by the whole sentence does not work. Instead of the sentence, a sequence of words around the blank position, beginning with a content word (or the sentence head) and ending with a content word (or the sentence tail), is passed to the search engine automatically. For the above-mentioned sample, the sequences of words passed to the engine are "I only have to clear my head" and so on.</Paragraph> <Paragraph position="1"> We can use any search engine, though we have been using Google since February 2004; at that point in time, Google covered an enormous four billion pages. Setting aside the convenience provided by an off-the-shelf search engine, a search specialized for this application is possible, although the current implementation is fast enough to automate the generation of FBQs, and the demand to accelerate the search is not strong; rather, the problem of the time needed for test construction has been reduced by our proposal. The throughput depends on the text from which a seed sentence comes and on the network traffic when the Web is accessed; empirically, one FBQ is obtained in 20 seconds on average, and the total number of FBQs generated in a day adds up to over 4,000 on a single computer.</Paragraph> <Paragraph position="2"> The "correct" hits may come from non-native speakers' websites and contain invalid language usage. To increase reliability, we could restrict Google searches to websites with URLs based in English-speaking countries, although we have not done so yet. There is another concern: even if a sentence fragment cannot be located on the Web, this does not necessarily mean that it is illegitimate. Thus, the proposed Web-based verification is not perfect; the point, however, is that even with such limitations, the generated questions are useful for estimating proficiency, as demonstrated in a later section.</Paragraph> </Section> <Section position="5" start_page="63" end_page="63" type="sub_section"> <SectionTitle> 3.1 Item Response Theory (IRT) </SectionTitle> <Paragraph position="0"> IRT is the basis of modern language tests such as TOEIC, and it enables Computerized Adaptive Testing (CAT). Here, we briefly introduce IRT. In IRT, a question is called an item, and test-takers' proficiency is calculated based on their answers to the items of the given test (Embretson, 2000).</Paragraph> <Paragraph position="1"> The basic idea is the item response function, which relates the probability of test-takers answering a particular item correctly to their proficiency. The item response functions are modeled as logistic curves making an S-shape, which take the form P_i(θ) = 1 / (1 + exp(-a_i (θ - b_i))), where P_i(θ) is the probability that a test-taker of proficiency θ answers item i correctly. The test-taker parameter, θ, shows the proficiency of the test-taker, with higher values indicating higher performance. Each of the item parameters, a_i and b_i, determines the shape of the item response function: the a parameter, called discrimination, indexes how steeply the item response function rises, while the b parameter is called difficulty, with difficult items featuring larger b values and item response functions shifted to the right. These item parameters are usually estimated by a maximum-likelihood method; for the computations, including this estimation, many commercial programs, such as BILOG (http://www.assess.com/), are available.</Paragraph> </Section>
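As a concrete illustration of the two-parameter logistic model above, the following Python fragment evaluates the item response function; the parameter values are invented for the example and are not taken from the paper.

```python
# Sketch of the 2PL item response function: the probability that a
# test-taker of proficiency theta answers item i (with discrimination
# a and difficulty b) correctly. Parameter values are illustrative.
import math

def item_response(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic (2PL) item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A discriminating item (a=2.0) of average difficulty (b=0.0):
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(item_response(theta, a=2.0, b=0.0), 3))
# The values trace the S-shaped curve: low for theta << b,
# exactly 0.5 at theta == b, and high for theta >> b.
```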
<Section position="6" start_page="63" end_page="64" type="sub_section"> <SectionTitle> 3.2 Reducing test size by selection of effective items </SectionTitle> <Paragraph position="0"> It is important to estimate the proficiency of the test-taker by using as few items as possible. For this, we have proposed a method based on the item information, I_i(θ_j), evaluated at θ_j, the proficiency of test-taker j, which indicates how much measurement discrimination an item provides; under the model above, it takes the form I_i(θ) = a_i^2 P_i(θ)(1 - P_i(θ)), and the information of a test is the sum of the information of its items.</Paragraph> <Paragraph position="1"> The procedure is as follows.</Paragraph> <Paragraph position="2"> 1. Initialize I by the set of all generated FBQs.</Paragraph> <Paragraph position="3"> 2. Select the item i* whose contribution to the test information is maximal, i.e., i* = argmax_{i ∈ I} Σ_j I_i(θ_j) (Equation (3)).</Paragraph> <Paragraph position="4"> 3. Eliminate the selected item from I, i.e., I ← I \ {i*} (Equation (4)).</Paragraph> <Paragraph position="5"> 4. If I is empty, we obtain the ordered list of effective items; otherwise, go back to step 2.</Paragraph> </Section>
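The greedy ordering in steps 1 through 4 can be sketched in a few lines of Python. The item parameters and the proficiency grid below are invented for illustration, and `item_information` follows the standard 2PL information formula rather than any implementation detail given in the paper.

```python
# Sketch of the item-selection procedure (Section 3.2): greedily order
# items by their contribution to test information over a set of
# proficiencies theta_j. All parameter values are illustrative.
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def order_effective_items(items, thetas):
    """items: list of (a, b) pairs; thetas: proficiencies theta_j."""
    remaining = list(items)          # step 1: I := all generated FBQs
    ordered = []
    while remaining:                 # step 4: loop until I is empty
        # step 2: item whose contribution to test information is maximal
        best = max(remaining,
                   key=lambda ab: sum(item_information(t, *ab) for t in thetas))
        remaining.remove(best)       # step 3: eliminate it from I
        ordered.append(best)
    return ordered

items = [(0.8, -1.0), (2.0, 0.0), (1.2, 1.5), (0.5, 0.3)]
thetas = [-1.0, 0.0, 1.0]
print(order_effective_items(items, thetas))
```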
<Section position="7" start_page="64" end_page="65" type="sub_section"> <SectionTitle> 4.1 Experiment </SectionTitle> <Paragraph position="0"> The FBQs for the experiment were generated in February of 2004. Seed sentences were obtained from ATR's corpus (Kikui et al., 2003) of the business and travel domains. The vocabulary of the corpus comprises about 30,000 words, and the sentences are relatively short, with an average length of 6.47 words. For each domain, 5,000 questions were generated automatically; each question consists of an English sentence with one blank and four choices.</Paragraph> <Paragraph position="1"> Experiment with non-native speakers. We used the TOEIC score as the experiment's proficiency measure and collected 100 Japanese subjects whose TOEIC scores ranged from 400 to less than 900. The actual range of TOEIC scores is 10 to 990; our subjects covered the dominant portion of TOEIC test-takers in Japan, over 70% of all test-takers (http://www.toeic.or.jp/toeic/data/data02.html), excluding the highest and lowest extremes. We covered only the range of TOEIC scores from 400 to 900 due to the expense of the experiment; in this restricted experiment, we do not claim that our proficiency estimation method covers the full range of TOEIC scores.</Paragraph> <Paragraph position="2"> We had the subjects answer 320 questions randomly selected from the 10,000 mentioned above. The raw marks were as follows: the average was 235.2 (73.5%), with a standard deviation of 29.8 (9.3%); the highest mark was 290 (90.6%); and the lowest was 158 (49.4%). This suggests that our FBQs are sensitive to test-takers' proficiency. In Figure 4, the y-axis represents the proficiency estimated according to IRT (Section 3.1) from the generated questions, while the x-axis is the real TOEIC score of each subject. As the graph illustrates, the IRT-estimated proficiency (θ) and the real TOEIC scores of the subjects correlate highly, with a correlation coefficient of about 80%.</Paragraph> <Paragraph position="3"> For comparison, we refer to CASEC (http://casec.evidus.com/), an off-the-shelf test consisting of human-made questions and IRT. Its correlation coefficient with real TOEIC scores is reported to be 86%. This means that the proposed automatically generated questions are promising for measuring English proficiency, achieving a nearly competitive level with human-made questions, but with a few reservations: (1) whether the difference of 6% is large depends on the standpoint of possible users; (2) as for the number of questions to be answered, our proposal uses 320 questions in the experiments, while TOEIC uses 200 questions and CASEC uses only about 60 (Figure 5 plots the correlation coefficient against the test size); (3) the proposed method uses FBQs only, whereas CASEC and TOEIC use various types of questions.</Paragraph>
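For readers who want to reproduce this kind of analysis, here is a minimal sketch, assuming item parameters are already known: it estimates each subject's θ by grid-search maximum likelihood under the 2PL model and then computes the Pearson correlation against TOEIC scores. All data in it are placeholders, not the paper's.

```python
# Sketch: estimate theta from 0/1 response patterns by maximum
# likelihood over a grid, then correlate the estimates with TOEIC
# scores. Items, responses, and scores below are placeholder data.
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(responses, items, grid=None):
    """responses: list of 0/1; items: list of (a, b) per question."""
    grid = grid or [g / 10.0 for g in range(-40, 41)]  # theta in [-4, 4]
    def log_lik(theta):
        ll = 0.0
        for u, (a, b) in zip(responses, items):
            p = min(max(p_correct(theta, a, b), 1e-9), 1 - 1e-9)
            ll += u * math.log(p) + (1 - u) * math.log(1 - p)
        return ll
    return max(grid, key=log_lik)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Toy data: 3 subjects answering 4 items.
items = [(1.0, -0.5), (1.5, 0.0), (0.8, 0.5), (2.0, 1.0)]
answers = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
toeic = [450.0, 650.0, 850.0]
thetas = [estimate_theta(r, items) for r in answers]
print(thetas, round(pearson(thetas, toeic), 3))
```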
<Paragraph position="4"> Experiment with a native speaker. To examine the quality of the generated questions, we asked a single subject, a native speaker of English, to answer 4,000 of the questions (Table 1). Note that because this analysis is based on a single native speaker, further analysis with multiple subjects is needed. The native speaker largely agreed with our generation, determining the correct choices (Type I); the rate was 93.50%, better than 90.6%, the highest mark among the non-native speakers.</Paragraph> <Paragraph position="5"> Table 1. Responses of a native speaker, by type.</Paragraph> <Paragraph position="6"> We present the problematic cases here. Type II is caused by the seed sentence being incorrect for the native speaker while a distracter is bad because it is correct, or, as in Type III, by the choices being ambiguous. Type III is caused by some generated distracters being correct, which makes the choices ambiguous. Type IV is caused by the seed sentence being incorrect and the generated distracters also being incorrect, so the question cannot be answered. Type V is caused by the seed sentence being nonsense to the native speaker, so the question, again, cannot be answered.</Paragraph> <Paragraph position="7"> Cases with bad seed sentences (portions of Types II, IV, and V) require cleaning of the corpus by a native speaker, and cases with bad distracters (portions of Types II and III) require refinement of the proposed generation algorithm. Since the questions produced by this method can be flawed in ways that make them unanswerable even by native speakers (about 6.5% of the time) for the above-mentioned reasons, it is difficult to use this method for high-stakes testing applications, although it is useful for estimating proficiency, as explained in the previous section.</Paragraph> </Section> </Section> </Paper>