<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-3030">
  <Title>Organizing English Reading Materials for Vocabulary Learning</Title>
  <Section position="4" start_page="117" end_page="119" type="metho">
    <SectionTitle>
3 Experiment
</SectionTitle>
    <Paragraph position="0"> This section describes how the courseware was constructed by applying the method described in the previous section. We will first describe the vocabulary and corpus used to construct the courseware and then present the statistics for the courseware.</Paragraph>
    <Section position="1" start_page="117" end_page="117" type="sub_section">
      <SectionTitle>
3.1 Vocabulary
</SectionTitle>
      <Paragraph position="0"> We used the specialized vocabulary of the Test of English for International Communication (TOEIC) because it is one of the most popular English certification tests in Japan. The vocabulary was compiled by Chujo (2003) and Chujo et al. (2004), who confirmed that it was useful in preparing for the TOEIC test. The vocabulary had 640 entries; as the target vocabulary, we used the 638 words among them that occurred at least once in the corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="117" end_page="118" type="sub_section">
      <SectionTitle>
3.2 Corpus
</SectionTitle>
      <Paragraph position="0"> We used articles from the English Wikipedia, a free-content encyclopedia that anyone can edit, as the target corpus. The version we used in this study had 478,611 articles. From these, we first discarded stubs and other non-normal articles, as well as short articles of less than 150 words. We then selected the 60,498 articles that were referred to (linked) by more than 15 articles; this 15-link threshold was set empirically to screen out noisy articles. Finally, we extracted a 150-word excerpt from the lead part of each of these 60,498 articles to prepare the target corpus. We set the 150-word limit on an empirical basis to reduce the burden imposed on learners. In short, the target corpus consisted of 60,498 excerpts from the English Wikipedia. In the rest of the paper, we use the term article to refer to an excerpt extracted according to this procedure.</Paragraph>
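The selection procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `articles` and `incoming_links` are hypothetical stand-ins for a processed Wikipedia dump, and whitespace tokenization is a simplifying assumption; only the thresholds (150 words, 15 incoming links) come from the text.

```python
def prepare_corpus(articles, incoming_links, min_words=150, min_links=15):
    """Select Wikipedia articles and extract 150-word lead excerpts.

    articles: dict mapping title to full article text (stubs and other
        non-normal articles assumed already removed).
    incoming_links: dict mapping title to the number of articles linking to it.
    Both inputs are hypothetical stand-ins for a real Wikipedia dump.
    """
    corpus = {}
    for title, text in articles.items():
        words = text.split()
        # Discard short articles of less than 150 words.
        if len(words) < min_words:
            continue
        # Keep only articles referred to by more than 15 other articles.
        if incoming_links.get(title, 0) <= min_links:
            continue
        # Extract a 150-word excerpt from the lead part of the article.
        corpus[title] = " ".join(words[:min_words])
    return corpus
```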
    </Section>
    <Section position="3" start_page="118" end_page="118" type="sub_section">
      <SectionTitle>
3.3 Example article
</SectionTitle>
      <Paragraph position="0"> Figure 1 shows an example article from the courseware; it was the first article selected by the algorithm. It shares 27 types and 49 tokens with the target vocabulary, and these shared words are printed in bold.</Paragraph>
      <Paragraph position="1"> Corporate finance Corporate finance is the specific area of finance dealing with the financial decisions corporations make, and the tools and analysis used to make the decisions. The discipline as a whole may be divided between long-term and short-term decisions and techniques. Both share the same goal of enhancing firm value by ensuring that return on capital exceeds cost of capital. Capital investment decisions comprise the long-term choices about which projects receive investment, whether to finance that investment with equity or debt, and when or whether to pay dividends to shareholders. Short-term corporate finance decisions are called working capital management and deal with balance of current assets and current liabilities by managing cash, inventories, and short-term borrowing and lending (e.g., the credit terms extended to customers). Corporate finance is closely related to managerial finance, which is slightly broader in scope, describing the financial techniques available to all forms of busi-</Paragraph>
    </Section>
    <Section position="4" start_page="118" end_page="118" type="sub_section">
      <SectionTitle>
3.4 Courseware statistics
3.4.1 Basic courseware statistics
</SectionTitle>
      <Paragraph position="0"> Table 1 lists basic statistics for the courseware constructed from the target vocabulary and corpus. (On our web site, we prepared 10 article sets, called course-1 to course-10, obtained by repeatedly applying our algorithm to the English Wikipedia while removing the articles included in earlier courses; the statistics presented in this paper were calculated from the first set, course-1.) The courseware consisted of 131 articles. Each article was 150 words long because only excerpts were used. The average number of tokens per article shared with the vocabulary (&amp;quot;num. of common tokens&amp;quot; in the table) was 18.4, and the average number of types (&amp;quot;num. of common types&amp;quot;) was 12.4. About 12.3% (= 18.4/150 x 100) of the tokens in each article were thus covered by the vocabulary.</Paragraph>
      <Paragraph position="1"> Each article in the courseware was referred to by 70.7 articles on average, as can be seen from the bottom row. Table 1 indicates that articles in the courseware included many target words and were heavily referred to by other articles.</Paragraph>
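The per-article figures discussed above (common tokens, common types, and token coverage) can be computed straightforwardly. A sketch, assuming whitespace tokenization and lowercasing, which the paper does not specify:

```python
def article_stats(article_text, vocabulary):
    """Count the tokens and types an article shares with the target vocabulary.

    vocabulary: set of target words (e.g., the 638 TOEIC words).
    Tokenization by lowercased whitespace splitting is a simplifying
    assumption, not the paper's stated method.
    """
    tokens = article_text.lower().split()
    common_tokens = [t for t in tokens if t in vocabulary]
    common_types = set(common_tokens)
    # Coverage as in the text: shared tokens / article length x 100.
    coverage = 100.0 * len(common_tokens) / len(tokens)
    return len(common_tokens), len(common_types), coverage
```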
    </Section>
    <Section position="5" start_page="118" end_page="119" type="sub_section">
      <SectionTitle>
3.4.2 Distribution of covered types
</SectionTitle>
      <Paragraph position="0"> Figure 2 plots the increase in the number of covered types against the order (ranking) of the articles that were put into the courseware. The horizontal axis represents the ranking of the articles; the vertical axis indicates the number of covered types. The increase was sharpest where the ranking value was lowest (left of the figure). The dotted horizontal lines indicate 50% and 90% of the target vocabulary. They cross the solid curve at the 22nd and 83rd articles, i.e., at 16.8% and 63.4% of the courseware, respectively. This means that learners can acquire most of the target vocabulary from the beginning of the courseware, which is desirable because learners sometimes do not have enough time to read all of it.</Paragraph>
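The curve in Figure 2 is simply the running union of covered vocabulary types over the article ranking, and the 50%/90% crossing points are the first ranks at which that union reaches the corresponding fraction of the vocabulary. A sketch, where `ranked_articles` is a hypothetical list of token lists in courseware order:

```python
def coverage_curve(ranked_articles, vocabulary):
    """Cumulative number of vocabulary types covered after each article."""
    covered = set()
    curve = []
    for tokens in ranked_articles:
        covered |= vocabulary.intersection(tokens)
        curve.append(len(covered))
    return curve

def first_crossing(curve, vocab_size, fraction):
    """1-based rank of the first article at which coverage reaches the
    given fraction (e.g., 0.5 or 0.9) of the vocabulary, or None."""
    threshold = fraction * vocab_size
    for rank, n_covered in enumerate(curve, start=1):
        if n_covered >= threshold:
            return rank
    return None
```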
      <Paragraph position="1">  the document frequencies (DFs) of the words, where the DF of a word is the number of articles in which the word occurred. These words were the most basic words in the target vocabulary with respect to the courseware.</Paragraph>
      <Paragraph position="2"> Table 2 lists the distribution of DFs. The first column lists the different DFs of the target words.</Paragraph>
      <Paragraph position="3"> Table 1: Basic statistics for the courseware.
                        Average     SD   Min  Median   Max
Num. of common tokens      18.4   10.8     1      16    55
Num. of common types       12.4    5.5     1      12    27
Num. of incoming links     70.7  145.3    16      32  1056
SD means standard deviation.</Paragraph>
      <Paragraph position="4"> The values in the &amp;quot;#DF&amp;quot; column are the numbers of words that occurred in the corresponding number of articles, i.e., that had the corresponding DF. The &amp;quot;CUM&amp;quot; and &amp;quot;CUM%&amp;quot; columns show the cumulative numbers and percentages of words calculated from the values in the second column. As we can see from Table 2, more than 50% of the target words occurred in multiple articles. Consequently, learners were likely to be exposed to the target words often enough to learn the vocabulary efficiently.</Paragraph>
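The DF distribution described above can be tabulated as follows. A sketch under the assumption that the corpus is available as a list of per-article token collections; the function names are illustrative, not the authors'.

```python
from collections import Counter

def df_distribution(articles_tokens, vocabulary):
    """Tabulate document frequencies (DFs) of the target words.

    Returns (df, rows): df maps each target word to its DF, and rows
    lists (DF value, #DF, CUM, CUM%) tuples in decreasing DF order,
    mirroring the columns of Table 2.
    """
    df = Counter()
    for tokens in articles_tokens:
        # Each article contributes at most 1 to a word's DF.
        for word in vocabulary.intersection(tokens):
            df[word] += 1
    dist = Counter(df.values())  # DF value -> number of words with that DF
    rows, cum = [], 0
    for value in sorted(dist, reverse=True):
        cum += dist[value]
        rows.append((value, dist[value], cum, 100.0 * cum / len(vocabulary)))
    return df, rows
```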
      <Paragraph position="5"> The target words with the highest DFs (shown in parentheses) were: service (19), form (17), information (12), feature (12), operation (11), cost (11), individual (10), department (10), consumer (9), company (9), product (9), complete (9), range (9), law (9), associate (9), cause (9), consider (9), offer (9), provide (9), present (8), activity (8), due (8), area (8), bill (8), require (8), order (8).</Paragraph>
    </Section>
  </Section>
</Paper>