<?xml version="1.0" standalone="yes"?> <Paper uid="J93-2001"> <Title>Using Register-Diversified Corpora for General Language Studies</Title> <Section position="3" start_page="228" end_page="228" type="metho"> <SectionTitle> 3 The factorial structure was derived from a common factor analysis with a Promax rotation. A </SectionTitle> <Paragraph position="0"> seven-factor solution was extracted as the most adequate, although only the first five factors are presented here. The first factor in the analysis accounts for 26.8% of the shared variance, while all seven factors together account for 51.9% of the shared variance. Further details are given in Biber (1988).</Paragraph> </Section> <Section position="4" start_page="228" end_page="232" type="metho"> <SectionTitle> 4 The polarity of Dimension 1 has been reversed to aid in the comparison to Dimensions 3 and 5. 5 See Biber (in press, b) for a comprehensive framework comparing registers with respect to their </SectionTitle> <Paragraph position="0"> situational characteristics, such as the relations among participants, purposes, production circumstances, and typical topics.</Paragraph> <Paragraph position="1"> As can be seen from Figure 1, these nine registers are strikingly different in their linguistic characteristics, even within this two-dimensional space. When all six dimensions are considered, these differences are even more notable. Table 8 further shows that a significant and important amount of variation among texts can be accounted for based on the register distinctions. It is important to emphasize here that the register categories were not considered when the dimensions were originally identified; rather, the dimensions represent the linguistic co-occurrence patterns across texts, regardless of their register category. However, Table 8 shows that there are important differences across registers with respect to each dimension. 
The F values and probabilities report the results of an Analysis of Variance showing that the registers are significant discriminators for each dimension, and the r² values show their strength (r² is a direct measure of the percentage of variation in the dimension score that can be predicted on the basis of the register distinctions). Four of the five dimensions have r² values over 50%, with only Dimension 4 having a relatively small r² value of 16.9%. These values show that the registers are strong predictors of linguistic variability along all five dimensions. Douglas Biber Using Register-Diversified Corpora for General Language Studies</Paragraph> <Paragraph position="2"> Registers are defined in terms of their situational characteristics, and they can be analyzed at many different levels of specificity. 6 However, there are important linguistic differences even among many closely related subregisters (Biber 1988, Chapter 8). A complementary perspective is to analyze the total space of variation within a language in terms of linguistically well-defined text categories, or text types (Biber 1989). 7 Given a text type perspective, linguistically distinct texts within a register represent different types, while linguistically similar texts from different registers represent a single text type.</Paragraph> <Paragraph position="3"> In sum, all of these analyses show that there are extensive differences among English registers with respect to a wide array of linguistic features. A corpus restricted to any one or two of these registers would clearly be excluding much of the English language, linguistically as well as situationally.</Paragraph> </Section> <Section position="5" start_page="232" end_page="239" type="metho"> <SectionTitle> 4. 
Further Applications of Corpus-Based Analyses of Register Variation </SectionTitle> <Paragraph position="0"> The multidimensional model of register variation summarized in Section 3 can be used to address additional computational issues. In this section, I focus on two of these: the automated prediction of register category and cross-linguistic comparisons.</Paragraph> <Section position="1" start_page="232" end_page="235" type="sub_section"> <SectionTitle> 4.1 Automated Prediction of Registers </SectionTitle> <Paragraph position="0"> One issue of current relevance within computational linguistics is the automated prediction of register category, as a preliminary step to work in information retrieval, machine translation, and other kinds of text processing. Because the model of register variation summarized in the last section is multidimensional, with each dimension comprising a different set of linguistic features and representing a different set of relations among registers, it is well suited to this research question.</Paragraph> <Paragraph position="1"> One statistical procedure commonly used for classificatory purposes is discriminant analysis. This procedure computes the generalized squared distance between a text and each text category (or register), and the text is then automatically classified as belonging to the closest category. 8 To illustrate, Figure 2 plots the five-dimensional profiles of three target registers (academic prose, fiction, and newspaper reportage) together with an unclassified text 6 Because registers can be specified at many different levels of generality, there is no &quot;correct&quot; set of register distinctions for a language; rather, I have argued elsewhere that registers should be seen as semi-continuous (rather than discrete) constructs varying along multiple situational parameters (Biber, in press, b). 
Further, since registers are defined situationally rather than on a linguistic basis, they are not equally coherent in their linguistic characteristics. Some registers have quite focused norms and therefore show little internal linguistic variation (e.g., science fiction). Registers such as popular magazine articles, on the other hand, include a wide range of purposes, and thus show extensive linguistic differences among the texts within the register (cf. the investigations in Biber 1988, Chapter 8; Biber This procedure identifies concentrations of texts such that the texts within each cluster, or text type, are maximally similar to one another in their linguistic characteristics, while the types are maximally distinct from one another. Biber (1989) identifies eight major text types in English, which are interpretively labeled: Intimate Interpersonal Interaction, Informational Interaction, &quot;Scientific&quot; Exposition, Learned Exposition, Imaginative Narrative, General Narrative Exposition, Situated Reportage, and Involved Persuasion. Biber (in press, c) compares the text type distinctions in English the unclassified text is about equidistant to the mean scores for academic prose and newspaper reportage, but along Dimensions 2, 3, and 5, the unclassified text is much closer to the target means for academic prose. Generalizing over all five dimensions, this text has the smallest distance to academic prose and would be classified into that category.</Paragraph> <Paragraph position="2"> The predictive power of this five-dimensional model of variation was tested at three levels of abstraction. The first test was based on texts from three high-level register categories: newspaper articles (including press reportage, editorials, and reviews), academic prose (including humanities, social science, medicine, natural science, and engineering), and fiction. Table 9 presents the discriminant analysis results for these categories. 
The top part of the table presents the calibration results: the model was trained on 118 texts from these three categories and then used to classify the same texts. The rates for successful prediction are high in all three cases. This model was then tested on a new set of 124 'unknown' texts; the rates for successful classification are still high (ranging from 68.89% to 84.62%), although not as high as in the calibration data. Academic prose and fiction are clearly distinguished, with no misclassifications between these two groups. Newspaper texts are less sharply distinguished from the other two categories: 28.89% of the newspaper texts are incorrectly classified as academic prose; 27.5% of the academic prose texts are incorrectly classified as newspaper texts; and 15.38% of the fiction texts are incorrectly classified as newspaper texts. Overall, though, the large majority of texts in these three categories are correctly classified by the five-dimensional model. 9 To aid in comparison across dimensions, all dimension scores in Figure 2 have been converted to a common scale of plus-or-minus 10. The scaling coefficients are:</Paragraph> <Paragraph position="3"> Table 10 shows that roughly the same success rate can be achieved at a more specific level of prediction: distinguishing among press reportage, editorials, and reviews within newspapers. The calibration model shows success rates ranging from 69.23% to 87.5% for these three categories, and the test data are correctly predicted at comparable rates (ranging from 68.18% to 92.86%).</Paragraph> <Paragraph position="4"> Finally, Table 11 applies this technique at a much more specific level, attempting to discriminate among four kinds of press reportage that differ primarily in their content domains: political, sports, spot news, and financial reportage. 
In this case, only the calibration results are reported because of the small sample size. The results indicate, however, a high success rate (ranging from 61.54% to 85.71%), suggesting that this approach can be profitably used for prediction among closely related subregisters or sublanguages.</Paragraph> <Paragraph position="5"> Computational Linguistics Volume 19, Number 2. Table 10: Automatic classification of texts into three specific press registers (press reportage, press reviews, press editorials), based on a discriminant analysis using five underlying dimensions. Calibration results: Classification of the 43 newspaper texts used to derive the discriminant function.</Paragraph> <Paragraph position="6"> The predictive power of this technique is only as robust as the underlying model of variation. In the present case, that model represents multiple dimensions of linguistic variation derived from analysis of a register-diversified corpus, and the results achieved are generally robust for the successful prediction of different kinds of text at quite different levels of abstraction.</Paragraph> </Section> <Section position="2" start_page="235" end_page="239" type="sub_section"> <SectionTitle> 4.2 Cross-Linguistic Comparisons </SectionTitle> <Paragraph position="0"> The analysis of parallel text corpora in different languages has received considerable attention in recent years, usually in relation to research on information retrieval and machine translation. Many researchers dealing with these issues from a register perspective have focused on the computational analysis of sublanguages: subsystems of a language that operate within a particular domain of use with restricted subject matter (see Kittredge and Lehrberger 1982; Grishman and Kittredge 1986). 
Processing research in this area has achieved high levels of success by focusing on very restricted textual domains.</Paragraph> <Paragraph position="1"> Table 11: Automatic classification of 34 newspaper reportage texts into four content areas (political reportage, sports reportage, spot news reportage, financial reportage), based on a discriminant analysis using five underlying dimensions.</Paragraph> <Paragraph position="2"> Kittredge (1982) adopts a variation perspective, comparing the extent of sublanguage differences within and across languages. Some of the provocative conclusions of that study are: &quot;the written style of English and French tended to be more similar in specialized technical texts than in general language texts&quot; (1982, p. 108).</Paragraph> <Paragraph position="3"> &quot;parallel sublanguages of English and French are much more similar structurally than are dissimilar sublanguages of the same language. Parallel sublanguages seem to correspond more closely when the domain of reference is a technical one&quot; (1982, p. 108).</Paragraph> <Paragraph position="4"> The multidimensional framework provides a complementary approach to these issues. From a linguistic perspective, dimensions are more readily compared cross-linguistically than individual features, since structurally similar features often serve quite different functional roles across languages. Similarly, cross-linguistic comparisons of individual registers are more readily interpretable when they are situated relative to the range of other registers in each language, since the 'same' registers can serve quite different functions across languages when considered relative to their respective register systems.</Paragraph> <Paragraph position="5"> To date, there have been multidimensional analyses of register variation in four languages: English (summarized in Section 3), Nukulaelae Tuvaluan (Besnier 1988), Korean (Kim and Biber in press), and Somali (Biber and Hared 1992, in press). 
In each case, the description is based on analysis of a diversified corpus representing a wide range of spoken and written registers. The cross-linguistic patterns of variation represented by these four languages, both synchronic and diachronic, are discussed in Biber (in press, c).</Paragraph> <Paragraph position="6"> To illustrate, the multidimensional analysis of English (discussed above; see Table 7) can be compared with the multidimensional patterns of variation in Somali, summarized in Table 12. Both languages represent many of the same functional considerations in their dimensional structure, including interactiveness, involvement, production circumstances, informational focus, personal stance, and narrative purposes. There are also many similarities in the co-occurrence patterns among linguistic features. For example, first and second person pronouns, downtoners, stance features, contractions, and questions group together in both languages as markers of involvement; nouns and adjectives group together in both languages as markers of an informational focus; third person pronouns and past tense verbs group together in both languages as markers of narration. In other respects, though, the multidimensional structures of the two languages differ. For example, Somali has two dimensions marking different kinds of interaction plus a third dimension relating to production circumstances; all of these functions are combined into a single dimension in English (Dimension 1). Conversely, English Dimension 5 marks a passive, abstract style, which has no counterpart in Somali.</Paragraph> <Paragraph position="7"> One of the surprising findings from the comparison of all four languages (including Nukulaelae Tuvaluan and Korean) is the extent of the cross-linguistic similarities (see Biber, in press, c). 
Thus, all four languages have multiple dimensions reflecting oral/literate differences, interactiveness, production circumstances, and an informational focus; these dimensions are defined by similar kinds of linguistic features, and analogous registers have similar cross-linguistic characterizations along these dimensions. In addition, two functional domains that relate to purpose are marked in all four languages: personal stance (toward the content) and narration. These dimensions also have similar structural correlates across the languages.</Paragraph> <Paragraph position="8"> In contrast, there are fewer major differences among these languages in their patterns of register variation. Dimensions relating to argumentation/persuasion are found in only some languages, and there are other dimensions particular to a single language (such as abstract style in English, and honorification in Korean). Analogous registers show some differences cross-linguistically with respect to these latter dimensions plus the purpose-related dimensions mentioned above (e.g., marking personal stance or narration).</Paragraph> <Paragraph position="9"> Findings such as these are directly relevant to several of the issues raised in recent studies of sublanguages, since they can be used to specify the linguistic relations among sublanguages both within and across languages. In particular, these analyses support Kittredge's (1982) conclusion that parallel sublanguages across languages are more similar in their linguistic structure than are dissimilar sublanguages within the same language. Romaine (in press) discusses similar findings in a comparison of sports reportage in Tok Pisin and English. 
The multidimensional comparisons summarized here show that even when registers are defined at a high level of generality (e.g., conversation, fiction, academic prose), and even when comparisons are across markedly different language families and cultures, parallel registers are indeed more similar cross-linguistically than are disparate registers within a single language.</Paragraph> </Section> </Section> </Paper>