<?xml version="1.0" standalone="yes"?>
<Paper uid="J93-2001">
  <Title>Using Register-Diversified Corpora for General Language Studies</Title>
  <Section position="2" start_page="0" end_page="228" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> As the use of computer-based text corpora has become increasingly important for research in natural language processing, lexicography, and descriptive linguistics, issues relating to corpus design have also assumed central importance. Two main considerations are important here: 1) the size of the corpus (including the length and number of text samples), and 2) the range of text categories (or registers) that samples are selected from.1 Within social science, these considerations are associated with the two main kinds of error that can threaten 'external validity' (the extent to which it is possible to generalize from a sample to a larger target population): 'random error' and 'bias error.' Random error occurs when a sample is not large enough to accurately estimate the true population; bias error occurs when the selection of a sample is systematically different from the target population it is intended to represent. Both kinds of error must be minimized to achieve a representative corpus. [Author affiliation: Department of English, Northern Arizona University, P. O. Box 6032, Flagstaff, AZ 86011-6032; biber@nauvax.bitnet.] [Footnote 1: Corpus designs can differ along several other parameters, including: 1) bounded/static versus unbounded/dynamic; 2) richly encoded versus minimally encoded (e.g., grammatical tagging, phonological/prosodic encoding, tagging of social characteristics of participants and situational characteristics); 3) complete texts versus samples from texts; 4) selection of texts: convenience versus purposeful versus random within strata versus proportional random.]</Paragraph>
    <Paragraph position="1"> (c) 1993 Association for Computational Linguistics. Computational Linguistics, Volume 19, Number 2.</Paragraph>
    <Paragraph position="2"> Recent debates concerning the design of general purpose corpora have often divided into two opposing camps emphasizing one or the other source of error: those advocating large corpora versus those advocating &quot;balanced&quot; corpora (i.e., including a wide range of registers). For example, the ACL Data Collection Initiative (DCI) and the Linguistic Data Consortium are focusing on the rapid collection and dissemination of very large corpora, with relatively little attention to the range of registers; older corpora, such as the Brown Corpus and LOB Corpus, are small by present-day standards but are explicitly structured to 'represent a wide range of styles and varieties' (Francis and Kučera 1964 [cf. Johansson, Leech, and Goodluck 1978]). Projects such as the COBUILD Corpus, Longman/Lancaster Corpus, and British National Corpus (BNC) combine both emphases to varying extents.</Paragraph>
    <Paragraph position="3"> Although all of these corpora could be considered &quot;representative&quot; of at least some varieties of English, it is important to address the question of whether the varieties represented match the intended uses of a corpus. For example, studies of a single sublanguage are legitimately based on corpora representing only that variety, such as journal articles on lipoprotein kinetics (Sager 1986), Navy telegraphic messages (Fitzpatrick et al. 1986), weather reports (Lehrberger 1982), and aviation maintenance manuals (Kittredge 1982).</Paragraph>
    <Paragraph position="4"> One of the main issues addressed here, though, is whether general language studies must be based on a corpus that is register-diversified as well as large. Some proponents of very large corpora have suggested that size can compensate for a lack of diversity--that if a corpus is large enough, it will represent the range of linguistic patterns in a language, even though it represents only a restricted range of registers. Given this assumption, linguistic analyses of any very large corpus could be generalized to the entire language.</Paragraph>
    <Paragraph position="5"> In contrast, I argue here that analyses must be based on a diversified corpus representing a wide range of registers in order to be appropriately generalized to the language as a whole, as in a dictionary or grammar of English, or a general purpose tagging program for English. 2 In fact, global generalizations are often not accurate at all, because there is no adequate overall linguistic characterization of the entire language; rather, there are marked linguistic differences across registers (or sublanguages; cf. Kittredge 1982). Thus a complete description of the language often entails a composite analysis of features as they function in various registers. Such analyses must be based on corpora representing the range of registers.</Paragraph>
    <Paragraph position="6"> In the following discussion, I first briefly illustrate the extent of cross-register differences from consideration of individual grammatical and lexical features. Section 2.1 focuses on the marked differences in the distribution of dependent clauses across registers. This section then shows that register differences are also important for probabilistic part-of-speech taggers and syntactic parsers, because the probabilities associated with grammatically ambiguous forms are often markedly different across registers. Section 2.2 focuses on adjectives marking 'certainty' to illustrate how lexical patterns are also distributed differently across registers.</Paragraph>
    <Paragraph position="7"> Section 3 makes this point more strongly by describing a multidimensional analysis of register variation. That analysis shows that registers vary along several underlying dimensions of variation, and that there are systematic and important linguistic differences among registers with respect to these dimensions. The extent of those differences clearly shows that linguistic analyses based on a restricted corpus cannot be generalized to the language as a whole.</Paragraph>
    <Paragraph position="8"> Section 4, then, shows how multidimensional analyses of register variation can be used to address additional computational issues. First, Section 4.1 discusses the application of the multidimensional model of English to predict the register category of texts automatically with a high degree of accuracy. The predictive power of the model at three levels of abstraction is tested. In Section 4.2, then, I turn to cross-linguistic patterns of register variation, illustrating how multidimensional analyses of register-diversified corpora in English, Nukulaelae Tuvaluan, Korean, and Somali enable register comparisons of a kind not otherwise possible, providing the background for more detailed investigations of particular subregisters or sublanguages.</Paragraph>
    <Paragraph position="9"> 2. Particular Grammatical and Lexical Features
The analyses in this section illustrate the fact that there are systematic and important grammatical and lexical differences among the registers of English; Section 2.1 treats grammatical features and Section 2.2 discusses lexical features.</Paragraph>
    <Section position="1" start_page="220" end_page="225" type="sub_section">
      <SectionTitle>
2.1 Grammatical Issues
</SectionTitle>
      <Paragraph position="0"> 2.1.1 Descriptive Analyses. One of the main uses of general text corpora has been to provide grammatical descriptions of particular linguistic features, such as nominal premodification structures, relative clauses, verb and particle combinations, and clefts and pseudoclefts (see the numerous entries in the bibliography of corpus-based studies compiled by Altenberg [1991]). Two findings repeatedly come out of this literature: first, individual linguistic features are distributed differently across registers, and second, the same (or similar) linguistic features can have different functions in different registers.</Paragraph>
      <Paragraph position="1"> The linguistic description of dependent clauses in English illustrates these patterns. Although these constructions are often treated as a single coherent system, the various types of structural dependency actually have quite different distributions and functions in English (cf. Biber 1988, 1992). For example, Table 1 shows that relative clauses are quite frequent in official documents and prepared speeches but quite rare in conversation. In contrast, causative adverbial subordination occurs most frequently in conversation and is quite rare in official documents and press reports. Finally, that complement clauses occur most frequently in prepared speeches, and with moderate frequencies in conversations and press reports, but they are rare in official documents.</Paragraph>
      <Paragraph position="2"> There is further variation within these structural categories. For example, most relative clauses in official documents have WH rather than that relative pronouns (7.7 out of 8.6 total), while the two types are evenly split in conversation (1.6 WH relative clauses versus 1.3 that relatives). Although causative adverbial clauses are generally more frequent in spoken registers, clauses headed by as/since occur almost exclusively in writing (although they are relatively rare in both modes); clauses headed by because are much more frequent in speech (Tottie 1986).</Paragraph>
      <Paragraph position="3"> Biber (1992) uses confirmatory factor analysis, building on corpus-based frequency counts of this type, to show that discourse complexity is itself a multi-dimensional construct; different types of structural elaboration reflect different discourse functions, and different registers are complex in different ways (in addition to being more or less complex). The analysis further identifies a fundamental distinction between the discourse complexities of written and spoken registers: written registers exhibit many complexity profiles, differing widely in both the extent and the kinds of complexity, while spoken registers manifest a single major pattern differing only in extent.</Paragraph>
      <Paragraph position="4"> The descriptive patterns presented in Section 2.1.1 have important implications for probabilistic tagging and parsing techniques, which depend on accurate estimates of the relative likelihood of grammatical categories in particular contexts. Two kinds of probabilistic information are commonly used in part-of-speech taggers: 1) for ambiguous lexical items, the relative probability of each grammatical category (e.g., abstract as a noun, adjective, and verb); and 2) for groups of ambiguous words, the relative probability of various tag sequences (e.g., the likelihood of a noun being followed by a verb, adjective, or another noun).</Paragraph>
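As a concrete sketch, the first kind of probabilistic information can be estimated by relative frequency from a tagged corpus. The following Python fragment is illustrative only; the word, tag labels, and counts are invented and do not reproduce the paper's LOB-derived figures.

```python
from collections import Counter, defaultdict

def tag_probabilities(tagged_tokens):
    """Estimate P(tag | word) by relative frequency from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    probs = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        probs[word] = {tag: n / total for tag, n in tag_counts.items()}
    return probs

# Invented toy data: "observed" used 91 times as a past-tense verb
# and 9 times as a passive verb in a fiction sample.
fiction_sample = [("observed", "past-tense")] * 91 + [("observed", "passive")] * 9
fiction_probs = tag_probabilities(fiction_sample)
```

A register-sensitive tagger would build one such table per register, which is the comparison carried out below.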
      <Paragraph position="5"> To investigate whether grammatically ambiguous words have different distributions across registers, I compiled two separate on-line dictionaries from the LOB Corpus: one based on the expository registers, and one from the fiction registers. Table 2 presents descriptive statistics from a comparison of these two dictionaries. The first observation is that many words occurred in only one of the dictionaries. This is not surprising in the case of exposition, since there were over twice as many lexical entries in the exposition dictionary. However, it is more surprising that there were over 6000 words that occurred only in fiction. These included many common words, such as cheek, kissed, grandpa, sofa, wallet, briefcase, intently, and impatiently.</Paragraph>
      <Paragraph position="6"> A comparison of the probabilities of words occurring in both dictionaries is even more revealing. One thousand ten words had probability differences greater than 50%, while another nine hundred eighty words had probability differences greater than 30%.</Paragraph>
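The probability-difference comparison just described can be sketched as follows; the dictionary entries here are invented toy values, not the actual probabilities derived from the LOB Corpus.

```python
def max_probability_difference(entry_a, entry_b):
    """Largest absolute difference in P(tag | word) between two entries."""
    tags = set(entry_a).union(entry_b)
    return max(abs(entry_a.get(t, 0.0) - entry_b.get(t, 0.0)) for t in tags)

def divergent_words(dict_a, dict_b, threshold):
    """Words in both dictionaries whose tag probabilities differ by more
    than the threshold (e.g., 0.5 or 0.3, as in the comparison above)."""
    shared = set(dict_a).intersection(dict_b)
    return sorted(w for w in shared
                  if max_probability_difference(dict_a[w], dict_b[w]) > threshold)

# Invented toy entries for two register-specific dictionaries:
fiction = {"observed": {"past": 0.91, "passive": 0.09},
           "trust": {"verb": 0.80, "noun": 0.20}}
exposition = {"observed": {"past": 0.22, "passive": 0.45, "adjective": 0.33},
              "trust": {"verb": 0.15, "noun": 0.85}}
```

Sweeping the threshold over 0.5 and 0.3 yields the two counts reported above.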
      <Paragraph position="7"> These words represented many lexical types. For example, the first group of words listed on Table 2 are past participle forms. There is a strong likelihood that the -ed forms (i.e., admitted, observed, remembered, expected) will function as past tense verbs in fiction (probabilities of 77%, 91%, 89%, and 54%) but as passive verbs in exposition (probabilities of 67%, 45%, 72%, and 77%). The word observed shows a slightly different pattern in that it has a relatively high probability of occurring as an adjective in exposition (33%). In fiction, the -ed forms never occur as adjectives, apart from the 4% likelihood for remembered. Finally, the two true participles (known and given) are most likely to occur as perfect aspect verbs in fiction (65% and 77% likelihood), but they are similar to the -ed forms in typically occurring as passive verbs in exposition (65% and 71%).</Paragraph>
      <Paragraph position="8"> The words in the second group on Table 2 represent noun/verb/adjective ambiguities. These include noun/verb ambiguities (trust, rule), -ing participles (thinking, breathing), verb/adjective ambiguities (secure), and noun/adjective ambiguities (major, representative). Apart from the last two types, these forms all show a strong likelihood to function as verbs in fiction (62% to 92% probability). In contrast, these forms are more likely to occur as nouns in exposition (all greater than 80% likelihood except for thinking); secure shows a similar likelihood of having an adjectival function in exposition (80%); and the two noun/titular noun/adjective ambiguities are much more likely to occur as adjectives in exposition than fiction. [Table 2: Comparison of the probabilities of grammatically ambiguous words in a dictionary based on Exposition versus a dictionary based on Fiction. Both dictionaries are derived from the one million words of the LOB Corpus. Overall statistics: total lexical entries in the Fiction dictionary = 22,043; total lexical entries in the Expository dictionary = 50,549; words occurring in the Fiction dictionary only = 6,204; words occurring in the Expository dictionary only = 31,476; words having probability differences greater than 50% = 1,010; words having probability differences greater than 30% = 980. Note: The probabilities for some words do not add up to 100% because minor categories are not listed.]</Paragraph>
      <Paragraph position="9"> The third group of ambiguous forms are function words. The probability differences here are smaller than in the other two groups, but they are still important given the central grammatical role that these items serve. The first three of these words (until, before, as) are considerably more likely to occur as subordinators in fiction than in exposition, while they are more likely to occur as prepositions in exposition. The word that is quite complex: it has roughly the same likelihood of occurring as a relative pronoun in fiction and exposition, but it is more likely to occur as a demonstrative in fiction, and more likely to occur as a complementizer in exposition.</Paragraph>
      <Paragraph position="10"> Table 3 illustrates the same kinds of comparison for tag sequences. Although the differences are not as striking, several of them are large enough to be relevant for automatic tagging. For example, prepositions and nouns are considerably more likely to follow singular nouns in exposition than in fiction. Similarly, nouns are more likely to follow adjectives in exposition than in fiction. Passive verbs are considerably more likely to follow the copula be in exposition than in fiction, while progressive verb forms are more likely to follow be in fiction. Other differences are not great, but they are consistent across tag sequences.</Paragraph>
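The second kind of probabilistic information, tag-sequence likelihoods, can be estimated the same way from a register-specific stream of tags. The tag stream below is an invented toy example, not the LOB data behind Table 3.

```python
from collections import Counter, defaultdict

def transition_probabilities(tags):
    """Estimate P(next_tag | tag) from a register-specific tag stream."""
    pair_counts = Counter(zip(tags, tags[1:]))
    totals = Counter(tags[:-1])
    table = defaultdict(dict)
    for (t1, t2), n in pair_counts.items():
        table[t1][t2] = n / totals[t1]
    return table

# Invented toy stream; a real comparison would use the full tagged LOB texts.
exposition_tags = ["NN", "IN", "NN", "IN", "NN", "VBN", "IN", "NN"]
expo_table = transition_probabilities(exposition_tags)
```

Building one such table per register makes differences like the noun-to-preposition likelihood in exposition versus fiction directly comparable.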
      <Paragraph position="11"> Finally, comparisons of this type are also important for syntactic ambiguities, which are the bane of probabilistic parsers. To illustrate the importance of register differences in this arena, Table 4 presents frequency counts for prepositional phrase attachment, as a nominal versus verbal modifier, in ten-text subsamples taken from editorials and fiction. (The texts are from the LOB Corpus; these counts were done by hand.) For editorials, this table shows that there is nearly a 50/50 split for prepositional phrases attached as nominal versus verbal modifiers, with a slight preference for nominal modifiers. In fiction, on the other hand, there is a much greater likelihood that a prepositional phrase will be attached as a verbal modifier (78.7%) rather than a nominal modifier (21.3%). [Running head: Douglas Biber, Using Register-Diversified Corpora for General Language Studies.] [Table 3: Comparison of probabilities for selected tag sequences in Exposition versus Fiction (derived from analysis of tags in the LOB Corpus). Note: The probabilities do not add up to 100% because minor categories are not listed.]</Paragraph>
      <Paragraph position="12"> For any automated language processing that depends on probabilistic techniques, whether part-of-speech tagging or syntactic parsing, the input probabilities are crucial. The analyses in this section suggest that it might be advantageous to store separate probabilities for different major registers, rather than using a single set of probabilities for a general-purpose tool. Minimally, these analyses show that input probabilities must be based on the distribution of forms in a diversified corpus representing the major register distinctions; probabilities derived from a single register are likely to produce skewed results when applied to texts from markedly different registers.</Paragraph>
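The suggestion above, storing separate probabilities per major register, amounts to a simple conditional lookup with a pooled fallback. This sketch is a design illustration only; the table names and numbers are hypothetical.

```python
def lookup_probabilities(word, register, per_register, pooled):
    """Prefer a register-specific probability entry; fall back to probabilities
    pooled over a diversified corpus when the word (or register) is unseen."""
    entry = per_register.get(register, {}).get(word)
    if entry is not None:
        return entry
    return pooled.get(word, {})

# Hypothetical tables (names and numbers invented for illustration):
per_register = {"fiction": {"observed": {"past": 0.91, "passive": 0.09}}}
pooled = {"observed": {"past": 0.45, "passive": 0.40, "adjective": 0.15}}
```

The fallback table embodies the paper's minimal requirement: even the pooled estimates should come from a register-diversified corpus, so that no single register's skew dominates.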
    </Section>
    <Section position="2" start_page="225" end_page="228" type="sub_section">
      <SectionTitle>
2.2 Lexicographic Issues
</SectionTitle>
      <Paragraph position="0"> Text corpora have proven to be invaluable resources for research on word use and meaning, as in Sinclair's pioneering work on the COBUILD Dictionary (Sinclair 1987).</Paragraph>
      <Paragraph position="1"> In fact, corpus-based research shows that our intuitions about lexical patterns are often incorrect (Sinclair 1991: 112 ff.). However, similar to the patterns for grammatical structures, for many words there is no general pattern of use that holds across the whole language; rather, different word senses and collocational patterns are strongly preferred in different registers.</Paragraph>
      <Paragraph position="2"> This point can be illustrated from an analysis of certainty adjectives in English (exploring in more detail some of the findings of Biber and Finegan 1989). Table 5 presents overall frequencies for three certainty adjectives--certain, sure, and definite--in two text corpora: the Longman/Lancaster Corpus, including written texts from ten major text categories, and the London/Lund Corpus, made up of spoken texts from six major text categories.</Paragraph>
      <Paragraph position="3"> [Notes to Table 6: * marks collocations much more common in social science than in fiction; ** marks collocations much more common in fiction than in social science.</Paragraph>
      <Paragraph position="4"> *** These collocations represent that complement clauses where the complementizer has been deleted.</Paragraph>
      <Paragraph position="5"> **** These collocations are all tokens of the idiom for sure.]</Paragraph>
      <Paragraph position="6"> The overall pattern shows certain and sure occurring with approximately the same frequency in the written (Longman/Lancaster) corpus. In the spoken (London/Lund) corpus, sure occurs more frequently than certain, and both words are more common than in the written corpus. The word definite is relatively rare in both corpora, although it is slightly more common in the written corpus. Further, there are striking differences across written registers in the use of these words. In social science, certain is quite common, sure is relatively rare, and definite is common relative to its frequency in the whole written corpus. Fiction shows the opposite pattern: certain is relatively rare, sure is relatively common, and definite is quite rare. These patterns alone show that the semantic domain of certainty in English could not be adequately described without considering the patterns in complementary registers.</Paragraph>
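Because the written and spoken corpora differ in size, comparisons of this kind rest on normalized rates. The sketch below shows per-million normalization plus a generic window-based collocate count; the paper does not specify its extraction procedure, so the windowing method and the example tokens are assumptions for illustration.

```python
from collections import Counter

def per_million(raw_count, corpus_size):
    """Normalize a raw frequency to a rate per 1 million words, so counts
    from corpora of different sizes (written vs. spoken) are comparable."""
    return raw_count * 1_000_000 / corpus_size

def window_collocates(tokens, node, span=2):
    """Count words within span positions of each occurrence of node
    (a generic windowing method, not necessarily the paper's)."""
    hits = Counter()
    for i, w in enumerate(tokens):
        if w == node:
            start = max(0, i - span)
            hits.update(tokens[start:i])
            hits.update(tokens[i + 1:i + span + 1])
    return hits
```

Applying such counts separately to social science and fiction subcorpora, then normalizing, yields tables directly comparable to Table 6.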
      <Paragraph position="7"> Table 6 shows, however, that the actual patterns of use are even more complex. This table presents normalized frequencies (per 1 million words of text) of the major collocational patterns for certain and sure, comparing the distributions in social science and fiction. A single * is used to mark collocations that are much more common in social science, while ** is used to mark collocations that are much more common in fiction. The collocational patterns (confirmed by concordance listings) identify two surprising facts about these words: 1) certain is commonly used to mark uncertainty rather than certainty; and 2) certainty is rarely expressed in social science at all. Thus, the most common collocations for certain in social science reflect a kind of vagueness, marking a referent as possibly being known (by someone) but not specified in the text (e.g., a certain kind of..., in certain cases..., there are certain indications that...). These collocations are relatively rare in fiction. In contrast, those collocational patterns for certain that directly state that someone or something is certain--he/she/they/you/it + BE + certain and I/we + BE + certain--are extremely rare in social science but relatively common in fiction.</Paragraph>
      <Paragraph position="8"> Unlike the word certain, the term sure is most typically used to express certainty. This is apparently the reason why the overall frequency of sure is so low in social science. The difference between certain and sure, and between social science and fiction, is perhaps most striking for the collocations of pronoun + BE + certain/sure. These collocations are very common for sure in fiction; moderately common for certain in fiction; but quite rare for either sure or certain in social science.</Paragraph>
      <Paragraph position="9"> These examples illustrate the point that a corpus restricted to only one register would enable at best a partial analysis of lexical use; and if the results were generalized to the entire language, they would be incorrect. Thus, the major registers of English must be treated on their own terms in order to provide a comprehensive analysis of either grammatical structures or lexical patterns of use.</Paragraph>
      <Paragraph position="10"> 3. Multidimensional Differences among Registers in English
The inherent nature of register variation in English is illustrated even more clearly in a series of studies using the multidimensional framework (e.g., Biber 1988, 1989, 1992). These studies have shown that there are systematic patterns of variation among registers; that these patterns can be analyzed in terms of underlying dimensions of variation; and that it is necessary to recognize the existence of a multidimensional space in order to capture the overall relations among registers.</Paragraph>
      <Paragraph position="11"> Each dimension comprises a set of linguistic features that co-occur frequently in texts. The dimensions are identified from a quantitative analysis of the distribution of 67 linguistic features in the texts of the LOB and London-Lund Corpora. There is space here for only a brief methodological overview of this approach; interested readers are referred to Biber (1988; especially Chapters 4 and 5) for a more detailed presentation. First, texts were automatically tagged for linguistic features representing several major grammatical and functional characteristics: tense and aspect markers, place and time adverbials, pronouns and pro-verbs, nominal forms, prepositional phrases, adjectives, adverbs, lexical specificity, lexical classes (e.g., hedges, emphatics), modals, specialized verb classes, reduced forms and discontinuous structures, passives, stative forms, dependent clauses, coordination, and questions. All texts were post-edited by hand to correct mis-tags.</Paragraph>
      <Paragraph position="12"> The frequency of each linguistic feature in each text was counted, and all counts were normalized to their occurrence per 1,000 words of text. Then a factor analysis was run to identify the major co-occurrence patterns among the features. (Factor analysis is a statistical procedure that identifies groupings of linguistic features that co-occur frequently in texts.) So that texts and registers could be compared with respect to the dimensions, dimension scores were computed for each text by summing the major linguistic features grouped on each dimension. Finally, the dimensions were interpreted functionally, based on the assumption that linguistic features co-occur in texts because they share underlying communicative functions. Similarly, the patterns of variation among registers were interpreted from both linguistic and functional perspectives. Five major dimensions are identified and interpreted in Biber (1988; especially Chapters 6 and 7). Each comprises a distinct set of co-occurring linguistic features; each defines a different set of similarities and differences among spoken and written registers; and each has distinct functional underpinnings. The five dimensions are interpretively labeled:
1. Informational versus Involved Production
2. Narrative versus Nonnarrative Concerns
3. Elaborated versus Situation-Dependent Reference
4. Overt Expression of Persuasion
5. Abstract versus Nonabstract Style
The primary communicative functions, major co-occurring features, and characteristic registers associated with each dimension are summarized in Table 7. As this table shows, registers differ systematically along each of these dimensions, relating to functional considerations such as interactiveness, involvement, purpose, and production circumstances, all of which have marked correlates in linguistic structure. 
To illustrate these differences more concretely, Figure 1 presents the differences among nine spoken and written registers within the two-dimensional space defined by Dimension 1: 'Involved versus Informational Production' and Dimension 3: 'Elaborated versus Situation-Dependent Reference.' The register characterizations on Figure 1 reflect different relative frequencies of the linguistic features summarized in Table 7. For example, academic prose and newspaper reportage have the largest positive scores on Dimension 1, reflecting very frequent occurrences of nouns, adjectives, prepositional phrases, long words, etc. (the 'informational' features grouped on Dimension 1), together with markedly infrequent occurrences of 1st and 2nd person pronouns, questions, reductions, etc. (the 'involved' features on Dimension 1). On Dimension 3, academic prose and professional letters have the largest positive scores, reflecting very frequent occurrences of WH relative clause constructions (the features associated with 'elaborated reference'), together with markedly infrequent occurrences of time and place adverbials (the 'situation-dependent' features). At the other extreme, conversations have the largest negative score on Dimension 1, reflecting very frequent occurrence of the 'involved' features grouped on that dimension (1st and 2nd person pronouns, questions, etc.) together with markedly few occurrences of the 'informational' features (nouns, adjectives, etc.). Conversations also have a quite large negative score on Dimension 3, although broadcasts have the largest negative score, reflecting very frequent occurrences of time and place adverbials together with markedly few WH relative clauses, etc.</Paragraph>
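The dimension-score computation described above can be sketched as follows. Note that the actual procedure in Biber (1988) standardizes each feature's normalized rate to a z-score across the corpus before summing; that step is omitted here for brevity, and the feature names and counts are invented.

```python
def rate_per_1000(count, text_length):
    """Normalize a raw feature count to a rate per 1,000 words of text."""
    return 1000.0 * count / text_length

def dimension_score(feature_counts, text_length, positive_features, negative_features):
    """Sum the rates of positive-loading features and subtract the rates of
    negative-loading features (z-score standardization omitted for brevity)."""
    pos = sum(rate_per_1000(feature_counts.get(f, 0), text_length)
              for f in positive_features)
    neg = sum(rate_per_1000(feature_counts.get(f, 0), text_length)
              for f in negative_features)
    return pos - neg

# Invented counts for a hypothetical 2,000-word text:
counts = {"noun": 500, "long_word": 700, "pron_1_2": 40, "question": 10}
score = dimension_score(counts, 2000, ["noun", "long_word"], ["pron_1_2", "question"])
```

Computing such a score per text, then averaging by register, produces the register placements plotted in Figure 1.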
    </Section>
  </Section>
</Paper>