<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1005">
  <Title>Automatic Detection of Text Genre</Title>
  <Section position="2" start_page="0" end_page="33" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Computational linguists have been concerned for the most part with two aspects of texts: their structure and their content. That is, we consider texts on the one hand as formal objects, and on the other as symbols with semantic or referential values. In this paper we want to consider texts from the point of view of genre: that is, according to the various functional roles they play.</Paragraph>
    <Paragraph position="1"> Genre is necessarily a heterogeneous classificatory principle, which is based among other things on the way a text was created, the way it is distributed, the register of language it uses, and the kind of audience it is addressed to. For all its complexity, this attribute can be extremely important for many of the core problems that computational linguists are concerned with. Parsing accuracy could be increased by taking genre into account (for example, certain object-less constructions occur only in recipes in English). Similarly for POS-tagging (the frequency of uses of trend as a verb in the Journal of Commerce is 35 times higher than in Sociological Abstracts). In word-sense disambiguation, many senses are largely restricted to texts of a particular style, such as colloquial or formal (for example, the word pretty is far more likely to have the meaning &quot;rather&quot; in informal genres than in formal ones). In information retrieval, genre classification could enable users to sort search results according to their immediate interests. People who go into a bookstore or library are not usually looking simply for information about a particular topic, but rather have requirements of genre as well: they are looking for scholarly articles about hypnotism, novels about the French Revolution, editorials about the supercollider, and so forth.</Paragraph>
    <Paragraph position="2"> If genre classification is so useful, why hasn't it figured much in computational linguistics before now? One important reason is that, up to now, the digitized corpora and collections which are the subject of much CL research have been for the most part generically homogeneous (i.e., collections of scientific abstracts or newspaper articles, encyclopedias, and so on), so that the problem of genre identification could be set aside. To a large extent, the problems of genre classification don't become salient until we are confronted with large and heterogeneous search domains like the World-Wide Web.</Paragraph>
    <Paragraph position="3"> Another reason for the neglect of genre, though, is that it can be a difficult notion to get a conceptual handle on, particularly in contrast with properties of structure or topicality, which for all their complications involve well-explored territory.</Paragraph>
    <Paragraph position="4"> In order to do systematic work on automatic genre classification, by contrast, we require the answers to some basic theoretical and methodological questions. Is genre a single property or attribute that can be neatly laid out in some hierarchical structure? Or are we really talking about a multidimensional space of properties that have little more in common than that they are more or less orthogonal to topicality? And once we have the theoretical prerequisites in place, we have to ask whether genre can be reliably identified by means of computationally tractable cues.</Paragraph>
    <Paragraph position="5"> In a broad sense, the word &quot;genre&quot; is merely a literary substitute for &quot;kind of text,&quot; and discussions of literary classification stretch back to Aristotle. We will use the term &quot;genre&quot; here to refer to any widely recognized class of texts defined by some common communicative purpose or other functional traits, provided the function is connected to some formal cues or commonalities and that the class is extensible. For example, an editorial is a shortish prose argument expressing an opinion on some matter of immediate public concern, typically written in an impersonal and relatively formal style in which the author is denoted by the pronoun we.</Paragraph>
    <Paragraph position="6"> But we would probably not use the term &quot;genre&quot; to describe merely the class of texts that have the objective of persuading someone to do something, since that class -- which would include editorials, sermons, prayers, advertisements, and so forth -- has no distinguishing formal properties. At the other end of the scale, we would probably not use &quot;genre&quot; to describe the class of sermons by John Donne, since that class, while it has distinctive formal characteristics, is not extensible. Nothing hangs in the balance on this definition, but it seems to accord reasonably well with ordinary usage.</Paragraph>
    <Paragraph position="7"> The traditional literature on genre is rich with classificatory schemes and systems, some of which might in retrospect be analyzed as simple attribute systems. (For general discussions of literary theories of genre, see, e.g., Butcher (1932), Dubrow (1982), Fowler (1982), Frye (1957), Hernadi (1972), Hobbes (1908), Staiger (1959), and Todorov (1978).) We will refer here to the attributes used in classifying genres as GENERIC FACETS. A facet is simply a property which distinguishes a class of texts that answers to certain practical interests, and which is moreover associated with a characteristic set of computable structural or linguistic properties, whether categorical or statistical, which we will describe as &quot;generic cues.&quot; In principle, a given text can be described in terms of an indefinitely large number of facets. For example, a newspaper story about a Balkan peace initiative is an example of a BROADCAST as opposed to DIRECTED communication, a property that correlates formally with certain uses of the pronoun you.</Paragraph>
    <Paragraph position="8"> It is also an example of a NARRATIVE, as opposed to a DIRECTIVE (e.g., in a manual), SUASIVE (as in an editorial), or DESCRIPTIVE (as in a market survey) communication; and this facet correlates, among other things, with a high incidence of preterite verb forms.</Paragraph>
    <Paragraph position="9"> Apart from giving us a theoretical framework for understanding genres, facets offer two practical advantages. First, some applications benefit from categorization according to facet, not genre. For example, in an information retrieval context, we will want to consider the OPINION feature most highly when we are searching for public reactions to the supercollider, where newspaper columns, editorials, and letters to the editor will be of roughly equal interest.</Paragraph>
    <Paragraph position="10"> For other purposes we will want to stress narrativity, for example in looking for accounts of the storming of the Bastille in either novels or histories. Secondly, we can extend our classification to genres not previously encountered. Suppose that we are presented with the unfamiliar category FINANCIAL ANALYSTS' REPORT. By analyzing genres as bundles of facets, we can categorize this genre as INSTITUTIONAL (because of the use of we as in editorials and annual reports) and as NON-SUASIVE or non-argumentative (because of the low incidence of question marks, among other things), whereas a system trained on genres as atomic entities would not be able to make sense of an unfamiliar category.</Paragraph>
    <Section position="1" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
1.1 Previous Work on Genre Identification
</SectionTitle>
      <Paragraph position="0"> The first linguistic research on genre that uses quantitative methods is that of Biber (1986; 1988; 1992; 1995), which draws on work on stylistic analysis, readability indexing, and differences between spoken and written language. Biber ranks genres along several textual &quot;dimensions&quot;, which are constructed by applying factor analysis to a set of linguistic syntactic and lexical features. Those dimensions are then characterized in terms such as &quot;informative vs. involved&quot; or &quot;narrative vs. non-narrative.&quot;</Paragraph>
      <Paragraph position="1"> Factors are not used for genre classification (the values of a text on the various dimensions are often not informative with respect to genre). Rather, factors are used to validate hypotheses about the functions of various linguistic features.</Paragraph>
      <Paragraph position="2"> An important and more relevant set of experiments, which deserves careful attention, is presented in Karlgren and Cutting (1994). They too begin with a corpus of hand-classified texts, the Brown corpus. One difficulty here, however, is that it is not clear to what extent the Brown corpus classification used in this work is relevant for practical or theoretical purposes. For example, the category &quot;Popular Lore&quot; contains an article by the decidedly highbrow Harold Rosenberg from Commentary, and articles from Model Railroader and Gourmet, surely not a natural class by any reasonable standard. In addition, many of the text features in Karlgren and Cutting are structural cues that require tagging. We will replace these cues with two new classes of cues that are easily computable: character-level cues and deviation cues.</Paragraph>
    </Section>
  </Section>
</Paper>