<?xml version="1.0" standalone="yes"?>
<Paper uid="C69-4401">
  <Title>SYNTACTIC PATTERNS IN A SAMPLE OF TECHNICAL ENGLISH</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SYNTACTIC PATTERNS IN A SAMPLE
OF TECHNICAL ENGLISH
</SectionTitle>
    <Paragraph position="0"> The Importance of the Concept of Homogeneity. A fundamental assumption of statistical linguistics is that there are differences worthy of note in the frequency of various units in certain texts. At the same time, there are differences in frequencies which would not be considered important. The question is, how is an &quot;important&quot; difference to be determined? The resolution of this problem has been made more important by the increasing popularity of statistical approaches to questions of style and authorship. Definitions of style from this point of view are based on notions of distinctiveness and consistency in literary performance.</Paragraph>
    <Paragraph position="1"> While distinctiveness appears to be the more important component of style, it is recognized that some consistency is necessary to lend significance to whatever feature might be distinctive.</Paragraph>
    <Paragraph position="2"> The Determination of Homogeneity. For this discussion we define homogeneity as the similarity of parts of the whole with respect to certain features. For some features it may be perfectly clear, even without counting, that parts of a text or texts from a genre are not alike. This seems more likely to occur for some features and for some genres than for others: for example, syntactic or phonological constructions in poetry, as opposed to parts of speech in technical writing.</Paragraph>
    <Paragraph position="3"> Few would be satisfied to rely solely on subjective impression for the estimation of the similarity of text samples. For statistical linguists the decision to count is the foundation of their science. For literary scholars the decision to count stems from a desire to give quantitative verification of existing theories and interpretations, and to gain greater insight into the structure of literary works for the purpose of proposing new theories and interpretations. Both groups are faced with the problem of evaluating the results of the counting.</Paragraph>
    <Paragraph position="4"> The Nature of Statistical Tests. The techniques of statistical description are, of course, uniquely suited to the statement of the raw, uninterpreted results. Measures of location such as means, modes, and medians are commonly used for this purpose.</Paragraph>
    <Paragraph position="5"> In examining the raw results it may be clear at once that there is a meaningful difference among the counts or scores. If samples of 100 sentences were taken at random from each of two texts, and the mean lengths for the two samples were 20 words and W0 words, no one would hesitate to conclude that one text revealed a &quot;significantly&quot; greater sentence length than the other. But if the figures were closer, say 27 and 33, more exact methods are needed. It is a law of nature that a sample taken from a population will not always yield exactly the statistics of the population, and that on occasion even a large discrepancy will be found. The extent to which sample values may be expected to vary from population values through chance alone is a subject of mathematical statistics, as is the extent to which two or more sample values from the same population will differ.</Paragraph>
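    <Paragraph> The closer case of 27 and 33 words can be made concrete with a minimal sketch of a two-sample test on means. The means and the sample size of 100 sentences come from the example above; the standard deviation of 12 words is a hypothetical value assumed only for illustration, as are the function names. The sketch uses the normal approximation and is not a procedure taken from this study.

```python
import math

def two_sample_z(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Return the z statistic for the difference between two sample means."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_b - mean_a) / standard_error

def two_sided_p(z):
    """Two-sided tail probability of a standard normal variate."""
    return math.erfc(abs(z) / math.sqrt(2))

# Means of 27 and 33 words come from the text; the standard deviation of
# 12 words is a hypothetical value chosen only for illustration.
z = two_sample_z(27, 33, 12, 12, 100, 100)
p = two_sided_p(z)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

    Under these assumptions z is about 3.54, so the difference would be judged significant. With a larger assumed spread, say 25 words, the same six-word difference gives z of about 1.70, which falls short of the conventional 1.96 threshold. The verdict depends on the variability of the samples as well as on the means.
    </Paragraph>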
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Language Statistics and Homogeneity
</SectionTitle>
      <Paragraph position="0"> There is considerable data that demonstrates overall similarities in the frequencies of various units between samples from the same writer, from different writers, and even from different languages.[1] The problem for statistical linguistics and stylistics is the ordering of degrees of similarity into groups according to some notion of homogeneity. If the sample values differ no more than could reasonably be attributed to chance, we see no reason why the populations from which the samples were taken could not be called one homogeneous population.</Paragraph>
      <Paragraph position="1"> Whether text samples pass a statistical test for homogeneity depends on the nature of the text, the chosen significance level, and the power of the test as determined by characteristics of the test itself in conjunction with the size of the sample. It is possible to imagine a perfectly uniform text, for example, one composed of nothing more than repetitions of the same sentence. In this case, a statistical test will reveal this homogeneity for any significance level or sample size. For real texts, though, the selection of the significance level (s.l.) and sample size (s.s.) poses a problem of practical and theoretical interest. The danger is that an investigator will be tempted to make a flat statement concerning the homogeneity of a feature for a text or a genre, when a slight change in s.l. or s.s. could have led to a reversal of that finding. Homogeneity, then, as a product of statistical hypothesis testing, should not be regarded as a function of the text alone, but rather as a function of the text and the significance level and power associated with the test and the sample size. If the samples represent different populations, even if different only in some minimal way, it is only a question of increasing the sample size sufficiently to cause the hypothesis of homogeneity to be rejected.</Paragraph>
      <Paragraph position="2"> [1] See, for example, Herdan, The Advanced Theory of Language as Choice and Chance, pp. i--7-/-27, and M. Renský, &quot;The Noun-Verb Quotient in English and Czech,&quot; Philologica Pragensia, VIII (1965), pp. 289-302.</Paragraph>
      <Paragraph position="3"> In discussing the size of samples to be taken, Herdan states that &quot;for statistical investigations in general, it is usually a question of how small the sample should be--for reasons of economy--without becoming unrepresentative of the universe, and without the errors acquiring such dimensions as to make significance testing illusory.&quot;[2] It is clear that hard information is needed on the extent to which parts of a single text will differ with respect to the frequency of various measured units. It is also clear that different units may occur with varying degrees of consistency throughout a text. The question of the homogeneity of a text is complex. But until the nature of variation within texts is understood, statements about variation between texts cannot be made with great authority. [2] Ibid., p. 170.</Paragraph>
      <Paragraph position="4"> The Design of the Study. A suitable model for the study of quantitative change in linguistic behavior is one which views change as taking place along dimensions, such that if two texts vary significantly in the proportion or distribution of one or more units, this difference would be attributed to the two texts occupying different positions in a context space. The examination of other texts of varying similarity to each of the original two texts should lead to the description of factors (dimensions) responsible for the original observed difference. The proposed dimensions can then be tested by predicting the behavior of texts not yet examined.</Paragraph>
      <Paragraph position="5"> In this study we propose to examine some aspects of the statistical behavior of certain syntactic units in a sample of technical English. In this as in any other study we must carefully set our goals and gather an appropriate amount of data to carry them out.</Paragraph>
      <Paragraph position="6"> The major focus of this study will be on the variation in frequency of syntactic units within the writing of two individuals. A primary hypothesis to be tested is that the distributions of units will remain reasonably the same throughout a single text written by one person. If the distributions are not uniform, several explanations could be offered. For example, the varying content could influence the frequencies; that is, even in a single text there might be contextual variations. A comparison of the individual chapters should reveal such variations since the chapters represent the way in which the content has been divided in the text. For this reason the chapters will be compared with each other in each of the two texts. There may be other causes for internal differences in a text. During the time that the text was written various circumstances could have arisen to influence the frequencies. This study does not attempt, however, to account for such influences except as they may be correlated with chapter content and position.</Paragraph>
      <Paragraph position="7"> The other primary hypothesis to be tested is that the two sample texts will reveal essentially the same distributions. Several studies have compared samples of technical writing as a whole with samples of non-technical writing, but no one seems to have reported on the variation in linguistic performance among individual American technical writers.</Paragraph>
      <Paragraph position="8"> In order to be sure that differences between the texts would be attributable as much as possible to the writers themselves, it was decided to select the sample texts from the same discipline. In other words, if a history text differed in average sentence length from a biology text, this could be due either to the different writers or the subject areas or both. While it may seem unreasonable to believe that biology and history writings could exhibit distinctive patterns, there is also no inherent reason why technical and non-technical writing should vary.</Paragraph>
      <Paragraph position="9"> The texts selected for this study are both from linguistics. They are: 1. Emmon Bach's Introduction to Transformational Grammars (New York, 1964), all but the exercises at the end of chapters.</Paragraph>
      <Paragraph position="10"> 2. Kenneth Pike's Language in Relation to a Unified Theory of the Structure of Human Behavior (The Hague, 1967), pp. 25-82, excluding bibliographical sections.</Paragraph>
      <Paragraph position="11"> The choice of linguistics as the technical field was arbitrary. These samples of technical writing cannot be regarded as random samples of technical writing as a whole, or even of linguistic writing, or even of Bach's or Pike's writing. The requirement of this study for large amounts of data from single texts precluded the possibility of gaining representativeness through the use of many smaller samples. Factors leading to the selection of the particular text by Bach were its relative shortness as a complete book, its recent publication date, and the varied material covered. The three chapters by Pike may be regarded as a smaller control sample to be available to confirm any major conclusions for the Bach sample. Moreover, it was felt that Pike exhibited a rather different approach to sentence construction from Bach, and that this difference, when demonstrated quantitatively, would dispel any notion that technical writers could not show individual styles. For convenience the samples from Bach and Pike will be referred to hereafter as simply Bach and Pike.</Paragraph>
      <Paragraph position="12"> Before conducting a statistical investigation of texts, various parameters or units must be selected which later will be counted and used as the basis for determining the similarity of the samples to be compared. The parameters discussed here represent two syntactic levels, that of the clause and that of the sentence. Table 1 depicts the basic clause level units.  This theorem is true. The description has not been useful.</Paragraph>
      <Paragraph position="13"> This description has many parts.</Paragraph>
      <Paragraph position="14"> Ideas flourish. Progress gives men hope. Linguists study language. We consider this false.</Paragraph>
      <Paragraph position="15"> This was realized by others.</Paragraph>
      <Paragraph position="16"> There are few days left. There seems to be no way to do this.</Paragraph>
      <Paragraph position="17"> It is not easy to estimate this quantity. It seems futile to try this.</Paragraph>
      <Paragraph position="18"> Sentence types are defined through constituent clause types. A sentence is assumed to consist of a sequence of clauses, each of which is either a main clause or a subordinate clause. In the coded text, symbols for main clauses are preceded by an &quot;M&quot;. Further, some clauses will be embedded within another clause. Embedded clauses appear in parentheses following the clause in which they are embedded. Thus, those sentences which are composed of the same clauses in the same order are considered to belong to the same sentence type. The following examples should clarify the clause and sentence type classifications: 1. Numerous examples and problems are presented throughout this introduction. Bach, page 2. One main passive clause: M5.</Paragraph>
      <Paragraph position="19"> 2. These are works that embody in the medium of language the esthetic values of the individual or the com[munity]. Bach, page 1. A main be clause followed by a subordinate transitive clause: M34.</Paragraph>
      <Paragraph position="20"> 3. The particular way of stating a theory of a language with which we shall be concerned has taken inspiration from modern logic. Bach, page 9. A main transitive clause with an embedded be clause: M4(3).</Paragraph>
      <Paragraph position="21"> 4. It is doubtful whether there are any natural languages conforming to any of these types. Bach, page 105. A main it clause followed by subordinate there and transitive clauses: MEC4.</Paragraph>
      <Paragraph position="22"> 5. We set up terminally discontinuous constructions as continuous ones and then separate them. Bach, page 120. Two main transitive clauses: M4M4.</Paragraph>
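      <Paragraph> The grouping of coded sentences into sentence types can be sketched as a simple tally. The list of coded sentences below is hypothetical, and the sketch assumes that every clause type is written as a single digit (3 for be, 4 for transitive, 5 for passive, as the examples above suggest). It does not reproduce the programs used in this study.

```python
import re
from collections import Counter

# Hypothetical coded sentences in the notation described above: "M" marks a
# main clause, digits name clause types, parentheses hold embedded clauses.
coded_sentences = ["M5", "M34", "M4(3)", "M4M4", "M5", "M4(3)", "M5"]

# Sentences with identical codes belong to the same sentence type.
sentence_types = Counter(coded_sentences)

# Clause-level tally: every digit is one clause token, whatever its role
# (main, subordinate, or embedded). Single-digit codes are an assumption.
clause_tokens = Counter(
    digit for code in coded_sentences for digit in re.findall(r"\d", code)
)

for code, count in sentence_types.most_common():
    print(f"{code:6s} {count}")
print(dict(clause_tokens))
```

      Because a sentence type is defined by the whole sequence of clause codes, the tally compares complete code strings, so M4(3) and M34 remain distinct types even though they share a be clause.
      </Paragraph>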
      <Paragraph position="23">  The coding of the original texts was carried out &quot;manually,&quot; that is, no computer program was written to convert the source text to coded text.</Paragraph>
      <Paragraph position="25"> For each chapter (8 in Bach, 3 in Pike) the occurrences or tokens of each of the clause and sentence types were counted and compared. The chi-square test was employed to determine the validity of the assumption that the chapters in each text can be regarded as random samples from one population.</Paragraph>
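      <Paragraph> A chi-square test of this kind can be sketched for a table whose rows are chapters and whose columns are clause types. The counts below are hypothetical and are not figures from this study; the function name is ours. The expected count for each cell is the product of its row and column totals divided by the grand total.

```python
def chi_square_homogeneity(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts.

    Rows are chapters, columns are clause types.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom

# Hypothetical counts for three chapters and three clause types
# (be, active, passive); these are not figures from the study.
chapter_counts = [
    [30, 50, 20],
    [20, 60, 20],
    [40, 40, 20],
]
statistic, df = chi_square_homogeneity(chapter_counts)
# 9.488 is the tabled 5 per cent critical value of chi-square for 4 d.f.
print(f"chi-square = {statistic:.2f} on {df} d.f.; reject at .05: {statistic > 9.488}")
```

      For these illustrative counts the statistic is about 10.67 on 4 degrees of freedom, which exceeds 9.488, so the hypothesis that the three chapters are random samples from one population would be rejected at the 5 per cent level.
      </Paragraph>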
      <Paragraph position="26"> The counting and statistical analysis were carried out through the facilities of the Michigan Terminal System at the University of Michigan Computing Center. This time-sharing system is presently driven by two IBM System/360-67 processors. The clause level unit analysis programs were written in assembly language and FORTRAN IV. The sentence type counting was programmed in SNOBOL4.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Results for Bach
</SectionTitle>
      <Paragraph position="0"> Table 2 depicts the frequency counts of the five clause types in Bach. Here considerable variation is apparent, especially in the be clause and the passive clause. The there and it clause frequencies appear to be relatively constrained. The assumption that the chapters may be regarded as random samples from one population must be rejected.</Paragraph>
      <Paragraph position="1"> The frequency of the most common sentence types in Bach is illustrated in Table 3. The percentages given in the table represent the proportion of a sentence type among the five sentence types listed. It was expected that a few sentence types would occur quite often, and that many types would be found only once. It was disappointing, however, to find that only five types occurred with sufficient frequency for statistical testing.</Paragraph>
      <Paragraph position="2"> There is clearly little consistency in the frequency of these sentence types, and the chi-square test is able to reject strongly the hypothesis of homogeneity of the chapters. A cursory inspection of the table reveals little overall pattern. The main passive type (M5) occurs least in chapters 1 and 8, the introduction and the conclusion. This is consistent with the notion of the passive clause being highly correlated with technical material. Of course, the main passive type is not the only source of passive clauses. The active plus subordinate passive type (M45) listed in the table also provides one passive clause per sentence.</Paragraph>
      <Paragraph position="3"> We find that this type has its lowest frequencies in chapters 4 and 7. There is, then, no strong correlation between sentence types on the basis that they both contain passive clauses.</Paragraph>
      <Paragraph position="4"> Bach and Pike Compared. Table 4 depicts the distribution of clauses in Pike.</Paragraph>
      <Paragraph position="5"> As for Bach, the assumption that the chapters represent random samples from one population must be rejected. As in Bach, the passive varies considerably from chapter to chapter. Bach's first chapter, the introduction, has the smallest proportion of passives, but Pike's first chapter has the most passives. Bach's be clauses range from 15.8 per cent to 30.5 per cent, but Pike's be clauses are more stable, ranging from 16.1 per cent to 22.4 per cent.</Paragraph>
      <Paragraph position="6"> Pike's active and passive clauses are also more consistent, but with eight chapters it must be taken into account that Bach has a greater opportunity to reveal inconsistency.</Paragraph>
      <Paragraph position="7"> Bach appears to use slightly more be clauses, many fewer active clauses, and somewhat more passive and it clauses. The difference in the frequency of there clauses does not seem substantial. A chi-square test comparing Bach's and Pike's clause totals yields a probability far less than .001.</Paragraph>
    </Section>
  </Section>
</Paper>