<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1216"> <Title>An Attempt to Use Weighted Cusums to Identify Sublanguages</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Results </SectionTitle> <Paragraph position="0"> The full results of the comparison are given in Table 4.</Paragraph> <Paragraph position="1"> This table shows the pairwise average t-scores, replicated for ease of consultation. The groups are ordered as in Table 3, so that results in the top left-hand corner of the table are between the most homogeneous groups, and results in the bottom right between the least homogeneous. The scores given on the diagonal are repeated from Table 3 and show the average score for the internal comparison of the texts in that group.</Paragraph> <Paragraph position="2"> This time we are looking for high scores to support the hypothesis that the WQsum test can identify the texts as belonging to different sublanguages. At first glance the results look disappointing. If we again take a score of 1.65 as the notional cut-off point, then only 43% (45 out of 105) of the results qualify. On the other hand, if we compare the scores with those for the group-internal comparisons (Table 3), we may view the results more positively. The average internal score was 0.885 (s.d. = 0.232) and the worst was 1.175; 67% of our scores are better than that.</Paragraph> <Paragraph position="3"> One problem stems from averaging the scores for all the tests. When the WQsum test is used in authorship attribution, it is necessary first to determine which linguistic feature is significant for the author under investigation. Looking at the raw scores for our experiment, we see that very often consistently high scores on one test are undermined by low scores on others. 
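The averaging effect described here, and the max-over-features alternative introduced below, can be sketched as follows. This is a hypothetical illustration: the '1w34' score (2.197) echoes the example in the text, but the other feature names and their scores are invented for the sketch.

```python
# Per-feature average t-scores for one hypothetical pairwise comparison.
# '1w34' is the feature named in the text; 'featA'..'featC' are made up.
pairwise_scores = {
    "1w34": 2.197,   # one test scores well above the cut-off...
    "featA": 0.51,   # ...but is dragged down by the others
    "featB": 0.62,
    "featC": 0.97,
}

THRESHOLD = 1.65  # the notional cut-off point used in the text

# Strategy 1 (as in Table 4): average over all tests.
overall_avg = sum(pairwise_scores.values()) / len(pairwise_scores)

# Strategy 2 (as in Table 6): take the highest per-feature average.
best_feature = max(pairwise_scores, key=pairwise_scores.get)
best_score = pairwise_scores[best_feature]

print(f"average = {overall_avg:.3f}")                  # below the cut-off
print(f"best    = {best_score:.3f} ({best_feature})")  # clears the cut-off
```

With these toy numbers the overall average falls below the 1.65 threshold even though one feature clearly distinguishes the pair, which is exactly the motivation for switching to the best-feature score.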
Table 5 shows an example of this, where an average score of 2.197 on the '1w34' test is mitigated by insignificant scores on the other tests, giving an overall average of 1.074.</Paragraph> <Paragraph position="4"> Table 5 Raw scores for the 'childrens'-'emails' comparison. So an alternative that suggests itself is to take in each case the highest of the average scores for each linguistic feature, on a pairwise basis. These alternative results are presented in Table 6, which also shows in each case which linguistic feature gave the best result. Since we are now taking the highest rather than the average score for the pairwise comparisons, we should also take the highest score for the within-group comparison, which is again shown on the diagonal. As in Table 4, the groups are ordered from 'best' to 'worst' within-group score.</Paragraph> <Paragraph position="5"> The 'improvement' in the results is considerable: this time 82 of the 105 results (78%) are above the 1.65 threshold. However, taking the highest rather than the average score for the within-group comparisons leaves four of the groups, 'TVscripts', 'recipes', 'tourism' and 'childrens', with scores above the 1.65 threshold, and a fifth group, 'weather', has a score very close to it. The scores for these groups are often high for comparisons with other texts, but they are also high for the within-group comparison: this suggests that the texts in these groups are not homogeneous, and we have to take this into account when we consider the results in the discussion that follows.</Paragraph> </Section></Paper>