<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0201">
  <Title>Applications of Lexical Information for Algorithmically Composing Multiple-Choice Cloze Items</Title>
  <Section position="5" start_page="2" end_page="4" type="metho">
    <SectionTitle>
4 Target Sentence Retriever
</SectionTitle>
    <Paragraph position="0"> The sentence retriever in Figure 2 extracts qualified sentences from the corpus. A sentence must contain the desired key of the requested POS to be considered as a candidate target sentence. Having identified such a candidate sentence, the item generator needs to determine whether the sense of the key also meets the requirement. We conduct this WSD task based on an extended notion of selectional preferences. null</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Extended Selectional Preferences
</SectionTitle>
      <Paragraph position="0"> Selectional preferences generally refer to the phenomenon that, under normal circumstances, some verbs constrain the meanings of other words in a sentence (Manning and Sch&amp;quot;utze, 1999; Resnik, 1997). We can extend this notion to the relationships between a word of interest and its signals, with the help of HowNet. Let w be the word of interest, and pi be the first listed class, in HowNet, of a signal word that has the syntactic relationship u with w.</Paragraph>
      <Paragraph position="1"> We define the strength of the association of w and pi as follows:</Paragraph>
      <Paragraph position="3"> where Pru(w) is the probability of w participating in the u relationship, and Pru(w,pi) is the probability that both w and pi participate in the u relationship.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="4" type="sub_section">
      <SectionTitle>
4.2 Word Sense Disambiguation
</SectionTitle>
      <Paragraph position="0"> We employ the generalized selectional preferences to determine the sense of a polysemous word in a sentence. Consider the task of determining the sense  of &amp;quot;spend&amp;quot; in the candidate target sentence &amp;quot;They say film makers don't spend enough time developing a good story.&amp;quot; The word &amp;quot;spend&amp;quot; has two possible  meanings in WordNet.</Paragraph>
      <Paragraph position="1"> 1. (99) spend, pass - (pass (time) in a specific way; &amp;quot;How are you spending your summer vacation?&amp;quot;) null 2. (36) spend, expend, drop - (pay out; &amp;quot;I spend  all my money in two days.&amp;quot;) Each definition of the possible senses include (1) the head words that summarize the intended meaning and (2) a sample sentence for sense. When we work on the disambiguation of a word, we do not consider the word itself as a head word in the following discussion. Hence, &amp;quot;spend&amp;quot; has one head word, i.e., &amp;quot;pass&amp;quot;, in the first sense and two head words, i.e., &amp;quot;extend&amp;quot; and &amp;quot;drop&amp;quot;, in the second sense. An intuitive method for determining the meaning of &amp;quot;spend&amp;quot; in the target sentence is to replace &amp;quot;spend&amp;quot; with its head words in the target sentence. The head words of the correct sense should go with the target sentence better than head words of other senses. This intuition leads to the a part of the scores for senses, i.e., St that we present shortly.</Paragraph>
      <Paragraph position="2"> In addition, we can compare the similarity of the contexts of &amp;quot;spend&amp;quot; in the target sentence and sample sentences, where context refers to the classes of the signals of the word being disambiguated. For the current example, we can check whether the subject and object of &amp;quot;spend&amp;quot; in the target sentence have the same classes as the subjects and objects of &amp;quot;spend&amp;quot; in the sample sentences. The sense whose sample sentence offers a more similar context for &amp;quot;spend&amp;quot; in the target sentence receives a higher score. This intuition leads to the other part of the scores for senses, i.e., Ss that we present below.</Paragraph>
      <Paragraph position="3"> Assume that the key w has n senses. Let Th = {th1,th2,*** ,thn} be the set of senses of w. Assume that sense thj of word w has mj head words in Word-Net. (Note that we do not consider w as its own head word.) We use the set Lj = {lj,1,lj,2,*** ,lj,mj} to denote the set of head words that WordNet provides for sense thj of word w.</Paragraph>
      <Paragraph position="4"> When we use the partial parser to parse the target sentence T for a key, we obtain information about the signal words of the key. Moreover, for each of these signals, we look up their classes in HowNet, and adopt the first listed class for each of the signals when the signal covers multiple classes.</Paragraph>
      <Paragraph position="5"> Assume that there are u(T) signals for the key w in a sentence T. We use the set Ps(T,w) = {ps1,T,ps2,T,*** ,psu(T),T} to denote the set of signals for w in T. Correspondingly, we use uj,T to denote the syntactic relationship between w and psj,T in T, use U(T,w) = {u1,T,u2,T,*** ,uu(T),T} for the set of relationships between signals in Ps(T,w) and w, use pij,T for the class of psj,T, and use</Paragraph>
      <Paragraph position="7"> classes of the signals in Ps(T,w).</Paragraph>
      <Paragraph position="8"> Equation (2) measures the average strength of association of the head words of a sense with signals of the key in T, so we use (2) as a part of the score for w to take the sense thj in the target sentence T.</Paragraph>
      <Paragraph position="9"> Note that both the strength of association and St fall in the range of [0,1].</Paragraph>
      <Paragraph position="11"> In (2), we have assumed that the signal words are not polysemous. If they are polysemous, we assume that each of the candidate sense of the signal words are equally possible, and employ a slightly more complicated formula for (2). This assumption may introduce errors into our decisions, but relieves us from the needs to disambiguate the signal words in the first place (Liu et al., 2005).</Paragraph>
      <Paragraph position="12"> Since WordNet provides sample sentences for important words, we also use the degrees of similarity between the sample sentences and the target sentence to disambiguate the word senses of the key word in the target sentence. Let T and S be the target sentence of w and a sample sentence of sense thj of w, respectively. We compute this part of score, Ss, for thj using the following three-step procedure.</Paragraph>
      <Paragraph position="13"> If there are multiple sample sentences for a given sense, say thj of w, we will compute the score in (3) for each sample sentence of thj, and use the average score as the final score for thj.</Paragraph>
      <Paragraph position="14">  1. Compute signals of the key and their relation- null ships with the key in the target and sample sentences. null</Paragraph>
      <Paragraph position="16"> 2. We look for psj,T and psk,S such that uj,T = uk,S, and then check whether pij,T = pik,S.</Paragraph>
      <Paragraph position="17"> Namely, for each signal of the key in T, we check the signals of the key in S for matching syntactic relationships and word classes, and record the counts of matched relationship in M(thj,T) (Liu et al., 2005).</Paragraph>
      <Paragraph position="18"> 3. The following score measures the proportion of matched relationships among all relationships between the key and its signals in the target sentence. null</Paragraph>
      <Paragraph position="20"> The score for w to take sense thj in a target sentence T is the sum of St(thj|w,T) defined in (2) and Ss(thj|w,T) defined in (3), so the sense of w in T will be set to the sense defined in (4) when the score exceeds a selected threshold. When the sum of St(thj|w,T) and Ss(thj|w,T) is smaller than the threshold, we avoid making arbitrary decisions about the word senses. We discuss and illustrate effects of choosing different thresholds in Section 6.</Paragraph>
      <Paragraph position="22"/>
    </Section>
  </Section>
  <Section position="6" start_page="4" end_page="5" type="metho">
    <SectionTitle>
5 Distractor Generation
</SectionTitle>
    <Paragraph position="0"> Distractors in multiple-choice items influence the possibility of making lucky guesses to the answers.</Paragraph>
    <Paragraph position="1"> Should we use extremely impossible distractors in the items, examinees may be able to identify the correct answers without really knowing the keys.</Paragraph>
    <Paragraph position="2"> Hence, we need to choose distractors that appear to fit the gap, and must avoid having multiple answers to items in a typical cloze test at the same time.</Paragraph>
    <Paragraph position="3"> There are some conceivable principles and alternatives that are easy to implement and follow.</Paragraph>
    <Paragraph position="4"> Antonyms of the key are choices that average examinees will identify and ignore. The part-of-speech tags of the distractors should be the same as the key in the target sentence. We may also take cultural background into consideration. Students in Taiwan tend to associate English vocabularies with their Chinese translations. Although this learning strategy works most of the time, students may find it difficult to differentiate English words that have very similar Chinese translations. Hence, a culturedependent strategy is to use English words that have similar Chinese translations with the key as the distractors. null To generate distractors systematically, we employ ranks of word frequencies for selecting distractors (Poel and Weatherly, 1997). Assume that we are generating an item for a key whose part-of-speech is r, that there are n word types whose part-of-speech may be r in the dictionary, and that the rank of frequency of the key among these n types is m.</Paragraph>
    <Paragraph position="5"> We randomly select words that rank in the range [m[?]n/10,m+n/10] among these n types as candidate distractors. These distractors are then screened by their fitness into the target sentence, where fitness is defined based on the concept of collocations of word classes, defined in HowNet, of the distractors and other words in the stem of the target sentence.</Paragraph>
    <Paragraph position="6"> Recall that we have marked words in the corpus with their signals in Section 3. The words that have more signals in a sentence usually contribute more to the meaning of the sentence, so should play a more important role in the selection of distractors. Since we do not really look into the semantics of the target sentences, a relatively safer method for selecting distractors is to choose those words that seldom collocate with important words in the target sentence.</Paragraph>
    <Paragraph position="7"> Let T = {t1,t2,*** ,tn} denote the set of words in the target sentence. We select a set Tprime [?] T such that each tprimei [?] Tprime has two or more signals in T and is a verb, noun, adjective, or adverb. Let k be the first listed class, in HowNet, of the candidate distractor, and N = {ti|ti is the first listed class of a tprimei [?] Tprime}. The fitness of a candidate distractor is defined in (5).</Paragraph>
    <Paragraph position="9"> The candidate whose score is better than 0.3 will be admitted as a distractor. Pr(k) and Pr(ti) are the probabilities that each word class appears individually in the corpus, and Pr(k,ti) is the probability that the two classes appear in the same sentence. Operational definitions of these probabilities  adv. 36.7%(11/30) 52.4%(11/21) 58.3%(7/12) are provided in (Liu et al., 2005). The term in the summation is a pointwise mutual information, and measures how often the classes k and ti collocate in the corpus. We negate the averaged sum so that classes that seldom collocate receive higher scores. We set the threshold to 0.3, based on statistics of (5) that are observed from the cloze items used in the</Paragraph>
  </Section>
  <Section position="7" start_page="5" end_page="6" type="metho">
    <SectionTitle>
6 Evaluations and Applications
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.1 Word Sense Disambiguation
</SectionTitle>
      <Paragraph position="0"> Different approaches to WSD were evaluated in different setups, and a very wide range of accuracies in [40%, 90%] were reported (Resnik, 1997; Wilks and Stevenson, 1997). Objective comparisons need to be carried out on a common test environment like SEN-SEVAL, so we choose to present only our results.</Paragraph>
      <Paragraph position="1"> We arbitrarily chose, respectively, 50, 50, 30, and 30 sentences that contained polysemous verbs, nouns, adjectives, and adverbs for disambiguation.</Paragraph>
      <Paragraph position="2"> Table 1 shows the percentage of correctly disambiguated words in these 160 samples.</Paragraph>
      <Paragraph position="3"> The baseline column shows the resulting accuracy when we directly use the most frequent sense, recorded in WordNet, for the polysemous words.</Paragraph>
      <Paragraph position="4"> The rightmost two columns show the resulting accuracy when we used different thresholds for applying (4). As we noted in Section 4.2, our system selected fewer sentences when we increased the threshold, so the selected threshold affected the performance. A larger threshold led to higher accuracy, but increased the rejection rate at the same time. Since the corpus can be extended to include more and more sentences, we afford to care about the accuracy more than the rejection rate of the sentence retriever.</Paragraph>
      <Paragraph position="5"> We note that not every sense of all words have sample sentences in the WordNet. When a sense does not have any sample sentence, this sense will receive no credit, i.e., 0, for Ss. Consequently, our current reliance on sample sentences in Word- null Net makes us discriminate against senses that do not have sample sentences. This is an obvious drawback in our current design, but the problem is not really detrimental and unsolvable. There are usually sample sentences for important and commonly-used senses of polysemous words, so the discrimination problem does not happen frequently. When we do want to avoid this problem once and for all, we can customize WordNet by adding sample sentences to all senses of important words.</Paragraph>
    </Section>
    <Section position="2" start_page="5" end_page="6" type="sub_section">
      <SectionTitle>
6.2 Cloze Item Generation
</SectionTitle>
      <Paragraph position="0"> We asked the item generator to create 200 items in the evaluation. To mimic the distribution over keys of the cloze items that were used in CEET, we used 77, 62, 35, and 26 items for verbs, nouns, adjectives, and adverbs, respectively, in the evaluation.</Paragraph>
      <Paragraph position="1"> In the evaluation, we requested one item at a time, and examined whether the sense and part-of-speech of the key in the generated item really met the requests. The threshold for using (4) to disambiguate word sense was set to 0.7. Results of this experiment, shown in Table 2, do not differ significantly from those reported in Table 1. For all four major classes of cloze items, our system was able to return a correct sentence for less than every 2 items it generated. In addition, we checked the quality of the distractors, and marked those items that permitted unique answers as good items. Table 3 shows that our system was able to create items with unique answers for another 200 items most of the time.</Paragraph>
    </Section>
    <Section position="3" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
6.3 More Applications
</SectionTitle>
      <Paragraph position="0"> We have used the generated items in real tests in a freshman-level English class at National Chengchi University, and have integrated the reported item generator in a Web-based system for learning English. In this system, we have two major subsystems: the authoring and the assessment subsystems. Using the authoring subsystem, test administrators may select items from the interface shown in Figure 4, save the selected items to an item bank, edit the items, including their stems if necessary, and finalize the selection of the items for a particular examination. Using the assessment subsystem, students answer the test items via the Internet, and can receive grades immediately if the administrators choose to do so. The answers of students are recorded for student modelling and analysis of the item facility and the item discrimination.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="6" end_page="7" type="metho">
    <SectionTitle>
7 Generating Listening Cloze Items
</SectionTitle>
    <Paragraph position="0"> We apply the same infrastructure for generating reading cloze items, shown in Figure 2, for the generation of listening cloze items (Huang et al., 2005).</Paragraph>
    <Paragraph position="1"> Due to the educational styles in Taiwan, students generally find it more difficult to comprehend messages by listening than by reading. Hence, we can regard listening cloze tests as an advanced format of reading cloze tests. Having constructed a database of sentences, we can extract sentences that contain the key for which the test administrator would like to have a listening cloze, and employ voice synthesizers to create the necessary recordings.</Paragraph>
    <Paragraph position="2"> Figure 5 shows an interface through which administrators choose and edit sentences for listening cloze items. Notice that we employ the concept that is related to ordinary concordance in arranging the extracted sentences. By defining a metric for measuring similarity between sounds, we can put sentences that have similar phonetic contexts around the key near each other. We hope this would better help teachers in selecting sentences by this rudimentary  clustering of sentences.</Paragraph>
    <Paragraph position="3"> Figure 6 shows the most simple format of listening cloze items. In this format, students click on the options, listen to the recorded sounds, and choose the option that fit the gap. The item shown in this figure is very similar to that shown in Figure 1, except that students read and hear the options. From this most primitive format, we can image and implement other more challenging formats. For instance, we can replace the stem, currently in printed form in Figure 6, into clickable links, demanding students to hear the stem rather than reading the stem. A middle ground between this more challenging format and the original format in the figure is to allow the gap to cover more words in the original sentence.</Paragraph>
    <Paragraph position="4"> This would require the students to listen to a longer stream of sound, so can be a task more challenging than the original test. In addition to controlling the lengths of the answer voices, we can try to modulate the speed that the voices are replayed. Moreover, for multiple-word listening cloze, we may try to find word sequences that sound similar to the answer sequence to control the difficulty of the test item.</Paragraph>
    <Paragraph position="5"> Defining a metric for measuring similarity between two recordings is the key to support the aforementioned functions. In (Huang et al., 2005), we consider such features of phonemes as place and manner of pronunciation in calculating the similarity between sounds. Using this metric we choose as distractors those sounds of words that have similar pronunciation with the key of the listening cloze. We have to define the distance between each phoneme so that we could employ the minimal-edit-distance algorithm for computing the distance between the sounds of different words.</Paragraph>
  </Section>
class="xml-element"></Paper>