<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2025"> <Title>Investigating the features that affect cue usage of non-native speakers of English</Title> <Section position="3" start_page="144" end_page="144" type="metho"> <SectionTitle> 8 draws a conclusion. 2 Related work </SectionTitle> <Paragraph position="0"> Almost all researches on cue phrases have been done for native speakers. (Elhadad and McKeown, 1990) explored the problem of cue selection. They presented a model that distinguishes a small set of similar cue phrases. (Moser and Moore, 1995a) put forward a method to identify the features that predict cue selection and placement. (Eugenio and Moore and Paolucci, 1997) used C4.5 to predict cue occurrence and placement. Until now, the research similar to ours is the GIRL system (Williams, 2004) which generates texts for poor readers and good readers of native speakers. The author measured the differences of reading speed (especially cue phrases) between good readers and bad readers, by which they inferred how discourse level choice (e.g., cue selection) makes the difference for the two kinds of readers.</Paragraph> </Section> <Section position="4" start_page="144" end_page="144" type="metho"> <SectionTitle> 3 Creating two corpora </SectionTitle> <Paragraph position="0"> We used two corpora (SUB-BNC and CNNSE) to investigate difference in cue usage between native and non-native speakers. The two corpora have the same size (200,000 words each). According to the Flesch Reading Ease scale, the readability of SUB-BNC and CNNSE is 47.5 (difficult) and 68.7 (easy) respectively.</Paragraph> <Paragraph position="1"> The two corpora are comparable. SUB-BNC is a sub-corpus of BNC (British National Corpus).</Paragraph> <Paragraph position="2"> While creating SUB-BNC, we selected the written texts according to the three features: domain (&quot;natural and pure science&quot;), medium (&quot;book&quot;), target audience (&quot;adult&quot;). CNNSE (Corpus of Non-Native Speaker of English) was created by the first author. Non-native speakers have three levels: primary (middle school student level), intermediate (high school student level) and advanced (university student level). The users of this study are assumed to be at intermediate level.</Paragraph> <Paragraph position="3"> We extracted English texts (written or rewritten by native speakers) from the books published in China and in Japan. The target audiences of these books were high school students in the two countries. The domain of the selected texts is natural and pure science as well.</Paragraph> </Section> <Section position="5" start_page="144" end_page="145" type="metho"> <SectionTitle> 4 Annotating two corpora </SectionTitle> <Paragraph position="0"> We followed (Carlson and Marcu and Okurowski, 2001) to classify the discourse relations. In the manual, some relations share some type of rhetorical meaning, so we defined several relations as follows: 1. background: background, circumstance 2. cause: cause, result, consequence 3. comparison: comparison, preference, analogy, proportion 4. condition: condition, hypothetical, contingency, otherwise 5. contrast: contrast, concession, antithesis 6. elaboration: elaboration-additional, elaboration-general-specific, elaborationpart-whole, elaboration-process-step, elaboration-object-attribute, elaboration-setmember null 7. enablement: purpose, enablement 8. evaluation: evaluation, interpretation, conclusion, comment 9. explanation: evidence, explanationargumentative, reason 10. 
Annotation includes two stages. First, the two coders identified &quot;explanation&quot; relations signaled by because, using (Hirschberg and Litman, 1993)'s 3-way classification. The word because can signal not only the &quot;explanation&quot; relation but also other relations. In addition, we did not consider some structures, e.g., &quot;not because ... but because&quot;. Thus, each because could be judged as &quot;explanation&quot;, &quot;other&quot;, or &quot;not considered&quot;. If both coders classified a because as &quot;explanation&quot;, that discourse was selected. In total, 228 instances of because were selected from the two corpora.</Paragraph>
<Paragraph position="1"> At the second stage, the two coders annotated the boundary between the nucleus and satellite of each selected discourse. Moreover, a selected discourse could itself be a span (nucleus or satellite) of another one (we call this an embedding structure). The coders labeled the discourse relation of the embedding structure and determined the boundary of its nucleus and satellite. Example 4.1 illustrates this.
Example 4.1 [Global warming will be a major threat to the whole world over the next century.]-S- contrast -N-[But [because it will take many years for our actions to produce a significant effect,]-S- explanation -N-[the problem needs attention now.]] (From CNNSE)
To assess the reliability of the annotation, we followed (Moser and Moore, 1995b)'s approach and compared the disagreements between the two independent coders from three aspects. First, the boundary of the nucleus and satellite of the relation signaled by because: disagreements occurred 7 times (96.9% agreement). Second, the discourse relation of the embedding structure: disagreements occurred 16 times (93% agreement). Third, the boundary of the nucleus and satellite of the embedding structure: disagreements occurred 9 times (96.1% agreement). Overall, the agreement between the two coders is 86%. This is better than the agreement reported in (Moser and Moore, 1995b).</Paragraph>
<Paragraph position="2">
5 Analyzing the usage of because within two corpora
Through investigating the annotated SUB-BNC, we found 104 &quot;explanation&quot; relations signaled by because, of which 96/104 (92.3%) occur in the second span (Table 1). This finding agrees with (Quirk, Greenbaum, Leech, and Svartvik, 1972) and (Moser and Moore, 1995b), i.e., because typically occurs in the second span. However, within CNNSE, we found that only 88/124 (71%) occur in the second span. This result is quite different from that of SUB-BNC. Moreover, a chi-square test (χ² = 16.54, p < 0.001) also supports this conclusion.</Paragraph>
<Paragraph position="4"/> </Section>
<Section position="6" start_page="145" end_page="146" type="metho"> <SectionTitle> 6 Machine learning program - C4.5 </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="145" end_page="145" type="sub_section"> <SectionTitle> 6.1 Evaluation method </SectionTitle>
<Paragraph position="0"> C4.5 learns classification models from training sets. The error rates of the learned models are estimated by cross-validation (Weiss and Kulikowski, 1991), which is widely used to evaluate decision trees, especially when the dataset is relatively small. The data is randomly divided into N subsets. The program is run N times; each run uses N-1 subsets as the training set and the remaining one as the test set. The error rate of a tree obtained by training on the whole dataset is then estimated as the average error rate on the test sets over the N runs (Eugenio, Moore, and Paolucci, 1997).</Paragraph>
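A minimal sketch of this cross-validation procedure, with hypothetical train() and error_rate() callables standing in for the actual C4.5 training and evaluation (the paper does not state N; 10 folds is a common default):

    import random

    def cross_validation_error(examples, train, error_rate, n_folds=10):
        """Estimate a model's error rate by N-fold cross-validation:
        split the data into N subsets, train on N-1 of them, test on
        the held-out one, and average the N test error rates."""
        examples = list(examples)
        random.shuffle(examples)
        folds = [examples[i::n_folds] for i in range(n_folds)]
        errors = []
        for i in range(n_folds):
            test_set = folds[i]
            training_set = [ex for j, fold in enumerate(folds)
                            if j != i for ex in fold]
            model = train(training_set)          # stand-in for a C4.5 run
            errors.append(error_rate(model, test_set))
        # The error of the tree trained on the whole dataset is then
        # approximated by the mean error over the N held-out folds.
        return sum(errors) / n_folds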
<Paragraph position="1"> The advantage of this method is that all data are eventually used for testing, and almost all examples are used in any given training run (Litman, 1996). This study follows the approach of (Eugenio, Moore, and Paolucci, 1997) and (Litman, 1996), identifying the best learned models by comparing their error rates with those of the other models. Whether two error rates are significantly different is determined by computing and comparing their 95% confidence intervals: if the upper bound of the 95% confidence interval for error rate ε1 is lower than the lower bound of the 95% confidence interval for ε2, then the difference between ε1 and ε2 is considered significant.</Paragraph> </Section>
<Section position="2" start_page="145" end_page="146" type="sub_section"> <SectionTitle> 6.2 Features </SectionTitle>
<Paragraph position="0"> We classified the features into two groups: sentence features and embedding structure features. Sentence features capture information about the relations signaled by because. Nt and St represent the tense of the nucleus and satellite, respectively. Nv and Sv represent the voice of the nucleus and satellite, respectively. We also used the features Ng (nucleus length) and Sg (satellite length). Meanwhile, nucleus structure (Ns) and satellite structure (Ss) were considered.</Paragraph>
<Paragraph position="1"> The other group of features reflects information about the embedding structures that contain the relations signaled by because. R represents the discourse relation of the embedding structure. C represents whether the embedding structure is cued or not. N-S indicates whether the relation signaled by because is the nucleus or the satellite of the embedding structure. P indicates whether the relation signaled by because occurs in the first span or in the second span. Bs represents the structure of the span containing the relation signaled by because. Os represents the structure of the span not containing it. The features used in the experiments are as follows (a sketch of how one instance is encoded with them appears after the list):
Sentence features
- Nt. Tense of nucleus: past, present, future.
- St. Tense of satellite: past, present, future.
- Nv. Voice of nucleus: active, passive.
- Sv. Voice of satellite: active, passive.
- Ng. Length of nucleus (in words): integer.
- Sg. Length of satellite (in words): integer.
- Ns. Structure of nucleus: simple, other.
- Ss. Structure of satellite: simple, other.
Embedding structure features
- R. Discourse relation of embedding structure: attribution, background, cause, comparison, condition, contrast, elaboration, example, enablement, evaluation, explanation, list, summary, temporal.
- C. Signaled by cue or not: yes, no.
- N-S. Role of the relation signaled by because: nucleus, satellite.
- P. Position of the relation signaled by because: first span, second span.
- Bs. Structure of the span containing the relation signaled by because: complex sentence, other.
- Os. Structure of the span not containing the relation signaled by because: simple sentence, other.</Paragraph>
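To make the feature inventory concrete, the sketch below shows how one annotated because instance might be encoded before being handed to C4.5. The class is our own illustration, not from the paper; the example values are our reading of Example 4.1 in Section 4.

    from dataclasses import dataclass

    @dataclass
    class BecauseInstance:
        # Sentence features (of the relation signaled by because)
        Nt: str   # tense of nucleus: past | present | future
        St: str   # tense of satellite: past | present | future
        Nv: str   # voice of nucleus: active | passive
        Sv: str   # voice of satellite: active | passive
        Ng: int   # length of nucleus in words
        Sg: int   # length of satellite in words
        Ns: str   # structure of nucleus: simple | other
        Ss: str   # structure of satellite: simple | other
        # Embedding structure features ("NS" stands in for the paper's N-S)
        R: str    # discourse relation of the embedding structure
        C: str    # embedding structure signaled by a cue: yes | no
        NS: str   # role of the because relation: nucleus | satellite
        P: str    # position of the because relation: first span | second span
        Bs: str   # structure of the span containing the relation
        Os: str   # structure of the span not containing the relation

    # Example 4.1: the because relation sits inside the nucleus of a
    # cued "contrast" embedding structure, in the second span.
    example = BecauseInstance(
        Nt="present", St="future", Nv="active", Sv="active",
        Ng=5, Sg=13, Ns="simple", Ss="other",
        R="contrast", C="yes", NS="nucleus", P="second span",
        Bs="complex sentence", Os="simple sentence",
    )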
</Section> </Section> </Paper>