<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1216"> <Title>An Attempt to Use Weighted Cusums to Identify Sublanguages</Title> Harold Somers (1998) An Attempt to Use Weighted Cusums to Identify Sublanguages. In D.M.W. Powers (ed.) NeMLaP3/CoNLL 98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp. 131-139. <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Sublanguage </SectionTitle> <Paragraph position="0"> We will assume that readers of this paper are fairly familiar with the literature on sublanguage (e.g. Kittredge & Lehrberger 1982; Grishman & Kittredge 1986), including definitions of the notion, history of the basic idea, and, above all, why it is a useful concept. Some readers will prefer terms like 'register' (which Biber uses); an affinity with work on genre detection will also be apparent. Because there is sometimes some dispute about the use of the term 'sublanguage', let us clarify from the start that for our purposes a sublanguage is an identifiable genre or text-type in a given subject field, with a relatively or even absolutely closed set of syntactic structures and vocabulary. In recent years, the availability of large corpora and 'new' methods to process them have led to renewed interest in the question of sublanguage identification (e.g. Sekine 1997), while Karlgren & Cutting (1994) and Kessler et al. (1997) have focussed on the narrower but clearly related question of genre.</Paragraph> <Paragraph position="1"> Our purpose in this paper is to explore a technique for identifying whether a set of texts 'belong to' the
same sublanguage, and of quantifying the difference between texts: our technique compares texts pairwise and delivers a 'score' which can be used to group texts judged similar by the technique. As we shall see later, what is of interest here is that the score is derived from a simple count of linguistic features such as word length and whether words begin with a vowel; yet this apparently unpromising approach seems to deliver usable results.</Paragraph> <Paragraph position="2"> In his well-known study, Biber (1988) took a number of potentially distinct text genres and measured the incidence of 67 different linguistic features in the texts to see what correlation there was between genre and linguistic feature. He also performed factor analysis on the features to see how they could be grouped, and thereby see if sublanguages could be defined in terms of these factors.</Paragraph> <Paragraph position="3"> The linguistic features that Biber used [1] are a mixture of lexical and syntactic ones, and almost all require a quite sophisticated level of analysis of the text data: dictionary look-up, tagging, a parser. They are presumably also, it should be said, hand-picked as features whose use might differ significantly from one genre to another. Although Biber gives details of the algorithms used to extract the features, it is not a trivial matter to replicate his experiments.</Paragraph> <Paragraph position="4"> Kessler et al. (1997) make the same criticism of Biber and of Karlgren & Cutting (1994), and restrict their experimentation on genre recognition to &quot;surface cues&quot;. In their paper they do not give any detail about the cues they use, except to say that they are &quot;mainly punctuation cues and other separators and delimiters used to mark text categories like phrases, clauses, and sentences&quot; (p.
34); however, Hinrich Schütze (personal communication) has elaborated that &quot;The cues are punctuation, non-content words (pronouns, prepositions, auxiliaries), counts of words, [of] unique words, [of] sentences, and [of] characters; and deviation features (standard deviation of word length and sentence length)&quot;. As we shall see below, the use of superficial linguistic aspects of the text is a feature of the approach described here.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Authorship attribution and weighted cusums </SectionTitle> <Paragraph position="0"> Authorship attribution has for a long time been a significant part of literary stylistics, familiar even to lay people in questions such as &quot;Did Shakespeare really write all of his plays?&quot;, &quot;Who wrote the Bible?&quot;, and so on. [1] The features can be grouped into &quot;sixteen major categories: (A) tense and aspect markers, (B) place and time adverbials, (C) pronouns and pro-verbs, (D) questions, (E) nominal forms, (F) passives, (G) stative forms, (H) subordination features, (I) adjectives and adverbs, (J) lexical specificity, (K) specialized lexical classes, (L) modals, (M) specialized verb classes, (N) reduced or dispreferred forms, (O) coordination, and (P) negation.&quot; (Biber 1988:223)
With the advent of computers, this once rather subjective field of study has become more rigorous, attracting also the attention of statisticians, so that now the field of 'stylometrics' - the objective measurement of (aspects of) literary style - has become a precise and technical science.</Paragraph> <Paragraph position="1"> One technique that has been used in authorship attribution studies, though not without controversy, is the cumulative sum chart ('cusum') technique, a variant of which we shall be using for our own investigation.</Paragraph> <Paragraph position="2"> Since we are not actually using standard cusums here, our explanation can be relatively brief. Cusums are a fairly well-known statistical device used in process control. The technique was adapted for author identification by Morton (1978) - see also Farringdon (1996) - and achieved some notoriety for its use in court cases (e.g. to identify faked or coerced confessions) as well as in literary studies. The technique is easy to implement, and requires only small amounts of text.</Paragraph> <Paragraph position="3"> A cusum is a graphic plot based on a sequence of measures. For example, suppose we have a set of measures (11, 7, 4, 10, 2, ...) with a mean value of 6. The corresponding divergences from the mean are (5, 1, -2, 4, -4, ...). The cusum chart plots not these divergences, but their aggregate sum, i.e.</Paragraph> <Paragraph position="4"> (5, 6, 4, 8, 4, ...), the sequence inevitably ending in 0.</Paragraph> <Paragraph position="5"> The plot reflects the variability of the measure: the straighter the line, the more stable the measure. In authorship attribution studies, the cusum chart is used to plot the homogeneity of a text with respect to a linguistic 'feature' such as use of two- and three-letter words on a sentence-by-sentence basis.
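The cusum calculation just described can be reproduced in a few lines of Python. To make the paper's truncated example sequence concrete, we assume (purely for illustration) a sixth measure of 2, which yields exactly the mean of 6 used in the text:

```python
def cusum(values):
    """Plot points for a cumulative sum chart: the running total
    of each measure's divergence from the overall mean."""
    mean = sum(values) / len(values)
    total, chart = 0.0, []
    for v in values:
        total += v - mean
        chart.append(total)
    return chart

# The paper's sequence (11, 7, 4, 10, 2, ...), completed with an
# assumed final measure of 2 so that the mean is exactly 6.
print(cusum([11, 7, 4, 10, 2, 2]))  # [5.0, 6.0, 4.0, 8.0, 4.0, 0.0]
```

Note that the final value is necessarily 0 (the divergences from the mean sum to zero), which is why the chart is read for its shape, not its endpoint: the straighter the line, the more stable the measure.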
Two graphs are plotted, one for the sentence lengths, the other for the incidence of the feature, and superimposed after scaling so that they cover roughly the same range. The authorship identification technique involves taking the texts in question, concatenating them, and then plotting the cusum chart. If the authors differ in their use of the linguistic feature chosen, this will manifest itself as a marked divergence in the two plots at or near the point(s) where the texts have been joined.</Paragraph> <Paragraph position="6"> There are a number of drawbacks with this method, the main one being the manner in which the result of the test is arrived at, namely the need to scrutinize the plot and use one's skill and experience (i.e. subjective judgment) to determine whether there is a &quot;significant discrepancy&quot; at or near the join point in the plot.</Paragraph> <Paragraph position="7"> A solution to this and several other problems with the standard cusum technique is offered by Hilton & Holmes (1993) and Bissell (1995a,b) in the form of weighted cusums (henceforth WQsums). Since this is the technique we shall use for our experiments, we need to describe it in full detail.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The calculations </SectionTitle> <Paragraph position="0"> As in the standard cusum, the WQsum is a measure of the variation and homogeneity of use of a linguistic feature on a sentence-by-sentence basis throughout a text.
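The scaling step mentioned above, superimposing the two plots so they cover roughly the same range, can be done with a simple linear rescaling. The helper below is our own illustration (not a procedure specified by Morton or Farringdon): it maps one cusum series onto the min-max range of another so the two can be overlaid:

```python
def rescale(series, target):
    """Linearly map `series` onto the min-max range of `target`,
    so two cusum plots can be superimposed for visual comparison."""
    lo, hi = min(series), max(series)
    t_lo, t_hi = min(target), max(target)
    if hi == lo:  # flat series: pin to the target's minimum
        return [t_lo] * len(series)
    scale = (t_hi - t_lo) / (hi - lo)
    return [t_lo + (v - lo) * scale for v in series]

sentence_cusum = [5.0, 6.0, 4.0, 8.0, 4.0, 0.0]  # e.g. sentence lengths
feature_cusum = [1.5, 2.0, 1.0, 2.5, 1.0, 0.0]   # e.g. lw23 counts
print(rescale(feature_cusum, sentence_cusum))
```

After rescaling, the two curves occupy the same vertical range, and a divergence between them near the join point is what the analyst looks for.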
It captures not only the relative amount of use of the feature, but also whether its use is spread evenly throughout the texts in question.</Paragraph> <Paragraph position="1"> In a WQsum, instead of summing the divergence from the mean - wi − w̄ for the sentence lengths w and similarly xi − x̄ for the linguistic feature x - we sum xi − π̂wi, where π̂, the 'weight', is the overall proportion of feature words in the whole text, as given by (1).</Paragraph> <Paragraph position="2"> As Hilton & Holmes (1993) explain, this weighting means that we are calculating &quot;the cumulative sum of the difference between the observed number of feature occurrences and the 'expected' number of occurrences&quot; (p. 75).</Paragraph> <Paragraph position="3"> π̂ = Σxi / Σwi (1) As we shall see shortly, the variation in a WQsum can be measured systematically, and its statistical significance quantified with something like a t-test. This means that visual inspection of the WQsum plot is not necessary. There is no need, either, to concatenate or sandwich the texts to be compared. For the t-test, the two texts, A and B, are treated as separate samples.</Paragraph> <Paragraph position="4"> The formula for t is (2).</Paragraph> <Paragraph position="6"> The t-value is, in the words of Hilton & Holmes, &quot;a measure of the evidence against the null hypothesis that the frequency of usage of the habit [i.e. linguistic feature] under consideration is the same in Text A and Text B. The higher the t-value, the more evidence against the hypothesis&quot; (p. 76). The formula chosen for the calculation of the variance in (2) is given in (3), where n is the number of sentences in the text.</Paragraph> <Paragraph position="7"> The resulting value is looked up in a standard t-table, which will tell us how confidently we can assert that the difference is significant. For this we need to know the degrees of freedom ν, which depends on the number of sentences in the respective texts, and is given by (4).
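Since the displayed formulas (2)-(4) do not survive in this extract, the sketch below implements only what the prose pins down: the weight π̂ = Σxi/Σwi from (1) and the WQsum series whose r-th point is the sum of xi − π̂wi over the first r sentences. For the significance test it substitutes the standard Welch two-sample t-test on per-sentence feature proportions; this is an assumption on our part, standing in for Hilton & Holmes's exact formulas, not a reproduction of them:

```python
from statistics import mean, variance

def wqsum(sentence_lengths, feature_counts):
    """Weight pi_hat = sum(x)/sum(w) as in (1), and the WQsum series
    whose r-th point is sum over i <= r of (x_i - pi_hat * w_i)."""
    pi_hat = sum(feature_counts) / sum(sentence_lengths)
    total, chart = 0.0, []
    for w, x in zip(sentence_lengths, feature_counts):
        total += x - pi_hat * w
        chart.append(total)
    return pi_hat, chart

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom --
    a standard stand-in for formulas (2)-(4), which are lost here."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)   # sample variances
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / se2 ** 0.5
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, dof

# Toy comparison: per-sentence word counts and feature counts for two texts.
w_a, x_a = [12, 9, 15, 11], [4, 3, 6, 4]
w_b, x_b = [10, 14, 8, 13], [2, 3, 1, 2]
pi_a, chart_a = wqsum(w_a, x_a)
t, dof = welch_t([x / w for x, w in zip(x_a, w_a)],
                 [x / w for x, w in zip(x_b, w_b)])
```

Note that, like the standard cusum, the WQsum series necessarily returns to 0 at the last sentence, since π̂·Σwi = Σxi by the definition of the weight.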
Tradition suggests that p < .05 is the minimum acceptable confidence level, i.e. the probability is less than 5% that the differences between the texts are due to chance.</Paragraph> <Paragraph position="9"/> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 The linguistic features </SectionTitle> <Paragraph position="0"> A point of interest for us is that both cusums and WQsums have been used in the stylometrics field to measure the incidence of linguistically banal features, easily measured and counted. The linguistic features proposed by Farringdon (1996:25), and used in this experiment, involve the number of words of a given length, and/or beginning with a vowel, as listed in Table 1.</Paragraph> <Paragraph position="1"> Table 1 Linguistic features identified by Farringdon (1996:25).</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Habit Abbreviation </SectionTitle> <Paragraph position="0"> Two- and three-letter words lw23 Two-, three- and four-letter words lw234 Three- and four-letter words lw34 Initial-vowel words vowel Two- and three-letter words or initial-vowel words lw23v Two-, three- and four-letter words or initial-vowel words lw234v Three- and four-letter words or initial-vowel words lw34v Other experimenters have suggested counting the number of nouns and other parts of speech, but it is not clear if there are any limitations on the linguistic features that could be used for this test, except the obvious one that the feature should in principle be roughly correlated with sentence length. In any case, part of the attraction for our experiment is that the features are so fundamentally different from the linguistic features used by Biber in his experiments, and so will offer a point of comparison. Furthermore, they are easy to compute and involve no overheads (lexicons, parsers etc.)
whatsoever.</Paragraph> <Paragraph position="1"> It is also interesting to note that the WQsum is a measure of variation, a type of metric which, according to Kessler et al. (1997), has not previously been used in this type of study.</Paragraph> <Paragraph position="2"> In authorship identification, it is necessary first to determine which of these features is &quot;distinctive&quot; for a given author, and then to test the documents in question for that feature. This is not appropriate for our sublanguage experiment, so for each text comparison we run all seven tests. Each test gives us a t-score from which a confidence level can be determined. Obviously, the result over the seven tests may vary somewhat. For our experiment we simply take the average of the seven t-scores as the result of text comparison. It is not obvious that it makes sense any more to treat this as a t-score, and in the experiments described below we tend to treat it as a raw score, a lower score indicating cohesion, a higher score suggesting difference. Nevertheless it is useful to bear in mind that, given the degrees of freedom involved in all cases (the texts are all roughly the same length), the threshold for significance is around 1.65.</Paragraph> </Section> </Section> </Paper>