<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1216">
  <Title>An Attempt to Use Weighted Cusums to Identify Sublanguages</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The method
</SectionTitle>
    <Paragraph position="0"> Our experiment is to use the WQsum test on a corpus of small texts which we believe can be grouped according to genre or sublanguage. We gathered 15 sets of different text-types: each set of three texts is assumed to represent a different sublanguage, and each text was written, as far as we know, by a different author. The 15 groups of texts were as follows:
    blurbs: publishers' announcements of scientific text-books
    BMJ: abstracts of articles appearing in the British</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Medical Journal
</SectionTitle>
      <Paragraph position="0"> childrens: extracts from children's stories
      church: articles from local Catholic church newsletters
      economy: economic reports from a Swiss bank
      e-mails: discussing arrangement of a meeting
      footie: reports of soccer matches from the same newspaper, same date
      lawreps: extracts from The Weekly Law Reports
      obits: obituaries of Jacques Cousteau, from different newspapers
      recipes: recipes from the Internet Chef web site
      TVscripts: Autocue scripts from Central TV News programmes
      tourism: extracts from the "Shopping" section of Berlitz guides
      univs: descriptions of Computer Science courses
      weather: state-wide general weather forecasts from US
      xwords: sets of clues to cryptic crosswords

      Our first task is to see whether the WQsum test can confirm the homogeneity of the text triplets. For each group of three texts, we ran our test and averaged the t-scores for each group. Table 2 shows an example of this for the 'church' group of texts. Table 3 lists the 15 groups together with some information about the texts, including their 'homogeneity score', an indication of their length (average number of sentences, and average words per sentence), and their source.</Paragraph>
      <Paragraph position="1"> The first thing to note is that all the groups of texts are well within the 1.65 threshold of significant difference. In other words, the pairwise WQsum test for each group firmly indicates homogeneity within the groups.</Paragraph>
      <Paragraph position="2"> Table 2: WQsum test results for the 'church' text set. Scores marked '*' suggest a difference significant at p &lt; .05.

      We now proceed to compare all the texts with each other, pairwise. It is fortunate that the WQsum procedure is so simple, since this pairwise comparison involves a huge number of iterations: each text comparison involves seven applications of the WQsum test, each group comparison involves nine text comparisons, and there are 105 pairwise group comparisons, making a total of 6615 tests. In the following section we will attempt to summarize the findings to be had from this large body of data.</Paragraph>
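      <Paragraph position="3"> The total of 6615 tests can be checked by enumerating the comparisons directly. The following is a minimal Python sketch, not part of our original procedure: the function name and constants are illustrative assumptions, and the code only counts comparisons without implementing the WQsum test itself.

```python
from itertools import combinations, product

N_GROUPS = 15            # number of text groups described in the text
TEXTS_PER_GROUP = 3      # each group is a triplet of texts
TESTS_PER_TEXT_PAIR = 7  # seven WQsum applications per text comparison


def count_comparisons(n_groups, texts_per_group, tests_per_pair):
    """Count group pairs, cross-group text pairs and WQsum applications."""
    # Unordered pairs of distinct groups.
    group_pairs = list(combinations(range(n_groups), 2))
    # Every text of one group is compared with every text of the other.
    text_pairs = [
        ((g1, i), (g2, j))
        for g1, g2 in group_pairs
        for i, j in product(range(texts_per_group), repeat=2)
    ]
    return len(group_pairs), len(text_pairs), len(text_pairs) * tests_per_pair


n_group_pairs, n_text_pairs, n_tests = count_comparisons(
    N_GROUPS, TEXTS_PER_GROUP, TESTS_PER_TEXT_PAIR)
print(n_group_pairs, n_text_pairs, n_tests)  # 105 945 6615
```

      The three counts correspond to the 105 pairwise group comparisons, the nine text comparisons within each of them (945 in all), and the resulting 6615 applications of the test.</Paragraph>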
    </Section>
  </Section>
</Paper>