<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1019">
  <Title>Exploring the Use of Linguistic Features in Domain and Genre Classification</Title>
  <Section position="3" start_page="0" end_page="142" type="intro">
    <SectionTitle>
2 Linguistic Cues to Genre
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="142" type="sub_section">
      <SectionTitle>
2.1 What is genre?
</SectionTitle>
      <Paragraph position="0"> The term &quot;genre&quot; is more frequent in philology and media studies than in mainstream linguistics (Swales, 1990, p. 38). When it is not used synonymously with the terms &quot;register&quot; or &quot;style&quot;, genre is defined on the basis of non-linguistic criteria.</Paragraph>
      <Paragraph position="1"> For example, (Biber, 1988) characterises genres in terms of author/speaker purpose, while text types classify texts on the basis of text-internal criteria.</Paragraph>
      <Paragraph position="2"> Swales phrases this more precisely: Genres are collections of communicative events with shared communicative purposes which can vary in their prototypicality. These communicative purposes are determined by the discourse community which produces and reads texts belonging to a genre.</Paragraph>
      <Paragraph position="3"> But how can we extract the communicative purpose of a given text? First of all, we need to define the genres we want to detect. The definitions which were used in this study are summarised in section 3.1. If we assume that the culture-specific conventions which form the basis for assigning a given text to a certain genre are reflected in the style of the text, and if that style can be characterised quantitatively as a tendency to favour certain linguistic options over others (Herdan, 1960), we can then proceed to search for linguistic features which both discriminate well between our genres and can be computed reliably from unannotated text. Potential sources for such options are comparative genre studies (Biber, 1988), authorship attribution research (Holmes, 1998; Forsyth and Holmes, 1996), content analysis (Martindale and MacKenzie, 1995), and quantitative stylistics (Pieper, 1979). For the last step, classification, we need a robust statistical method which should preferably work well on sparse and noisy data. This aspect will be discussed in more detail in section 5.</Paragraph>
      <Paragraph position="4"> In their paper on genre categorization, (Kessler et al., 1997) take a somewhat different approach.</Paragraph>
      <Paragraph position="5"> They classify texts according to generic facets.</Paragraph>
      <Paragraph position="6"> Those facets express distinctions that &quot;answer to certain practical interests&quot; (p. 33). The &quot;brow&quot; facet roughly corresponds to register, and the &quot;narrative&quot; facet is taken from text type theory, while the &quot;genre&quot; facet most closely corresponds to our usage of the term.</Paragraph>
    </Section>
    <Section position="2" start_page="142" end_page="142" type="sub_section">
      <SectionTitle>
2.2 Choice of features
</SectionTitle>
      <Paragraph position="0"> There are two basic types of features: ratios and frequencies. Typical ratios are the type/token ratio, sentence length (in words per sentence), or word length (in characters per word). More elaborate ratios which have been found to be useful in quantitative stylistics (Ross and Hunter, 1994) include the ratio of determiners to nouns or that of auxiliaries to VP heads.</Paragraph>
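      <Paragraph> The three typical ratios named above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function name ratio_features, the regular-expression tokeniser, and the punctuation-based sentence splitter are all assumptions made for illustration, and a real system would need more careful tokenisation.

```python
import re

def ratio_features(text: str) -> dict:
    """Compute three simple stylistic ratios from raw text (illustrative only)."""
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s+", text.strip()) if s]
    # Assumption: a token is a run of letters, digits or apostrophes.
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not tokens or not sentences:
        return {"type_token_ratio": 0.0, "words_per_sentence": 0.0, "chars_per_word": 0.0}
    return {
        # Number of distinct word forms divided by number of running words.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # Mean sentence length in words.
        "words_per_sentence": len(tokens) / len(sentences),
        # Mean word length in characters.
        "chars_per_word": sum(len(t) for t in tokens) / len(tokens),
    }

sample = "The cat sat. The dog ran away."
features = ratio_features(sample)
```

Because the type/token ratio depends on text length, such a value is only comparable across texts of similar size.</Paragraph>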
      <Paragraph position="1"> The most common features to be counted are words, or, more precisely, word stems. While most text categorisation research focusses on content words, function words have proved valuable in authorship attribution. The rationale behind this is that authors monitor their use of the most frequent words less carefully than that of other words. But this is not the reason why function words might prove to be useful in genre analysis. Rather, they indicate dimensions such as personal involvement (heavy use of first and second person pronouns), or argumentativity (high frequency of specific conjunctions). Content analysis counts the frequency of words which belong to certain diagnostic classes, such as aggressivity markers. The frequency of other linguistic features such as part-of-speech (POS), noun phrases, or infinitive clauses, has been examined selectively in quantitative stylistics. In his comparative analysis of written and spoken genres in English, Biber (1988) lists an impressive array of 67 linguistically motivated features which can be extracted reliably from text. However, he sometimes relies heavily on the fixed word order of English for their computation, which makes them difficult to transfer to a language with a more flexible word order, such as German. (Karlgren and Cutting, 1994) report good results in a genre classification task based on a subset of these features, while (Kessler et al., 1997) show that a prudent selection of cues based on words, characters, and ratios can perform at least equally well.</Paragraph>
      <Paragraph position="2"> In our paper, we explore a hybrid approach.</Paragraph>
      <Paragraph position="3"> Starting from the classical information retrieval representation of texts as vectors of word frequencies (Salton and McGill, 1983), we explore how performance is affected if we include two further kinds of features. The first is function word frequencies. For example, texts which aim at generalisable statements may contain more indefinite articles and pronouns and fewer definite articles.</Paragraph>
      <Paragraph position="4"> The second is POS frequencies. (This essentially condenses information implicitly available in the word vector.) For example, nominal style should lead to a higher frequency of nouns, whereas descriptive texts may show more adjectives and adverbials than others.</Paragraph>
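      <Paragraph> One way to picture a representation that combines word, function word, and POS frequencies is the following sketch. It is a hedged illustration, not the feature extraction used in the experiments: the function hybrid_features, the tiny FUNCTION_WORDS list, the feature-name prefixes, and the tag names DET, NOUN and VERB are all hypothetical, and the input is assumed to have been POS-tagged already by some external tagger.

```python
from collections import Counter

# Hypothetical, deliberately tiny function word list for illustration only.
FUNCTION_WORDS = {"a", "an", "the", "i", "you", "we", "but", "because"}

def hybrid_features(tagged_tokens):
    """Build one feature dictionary from (word, POS tag) pairs."""
    words = [word.lower() for word, _ in tagged_tokens]
    total = len(words)
    features = {}
    if total == 0:
        return features
    for word, count in Counter(words).items():
        # Function words are kept apart from other words by a name prefix.
        prefix = "fw:" if word in FUNCTION_WORDS else "word:"
        features[prefix + word] = count / total
    for tag, count in Counter(tag for _, tag in tagged_tokens).items():
        # Relative frequency of each POS tag in the text.
        features["pos:" + tag] = count / total
    return features

# Assumed pre-tagged input; the tags are placeholders, not a real tag set.
tagged = [("The", "DET"), ("dog", "NOUN"), ("saw", "VERB"), ("a", "DET"), ("cat", "NOUN")]
vec = hybrid_features(tagged)
```

Keeping the three feature groups under separate prefixes makes it easy to switch one group on or off when comparing classifier performance.</Paragraph>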
      <Paragraph position="5"> Note that we do not experiment with sophisticated feature selection strategies, which might be worthwhile for the POS information (cf. Sec. 4).</Paragraph>
      <Paragraph position="6"> POS frequency information is the only higher-level linguistic information which is encoded explicitly. Most current POS-taggers are reliable enough (at least for English) for their output to be used as the basis for a classification, whereas robust, reliable parsers are hard to find. Another source of information would have been the position of a word in a sentence, but incorporating this would have led to substantially larger feature spaces and will be left to future work. Semantic classes were not examined, because defining, building, fine-tuning, and maintaining such word lists can be an arduous task (cf. e.g. (Klavans and Kan, 1998)), which should therefore only be undertaken for corpora with both well-defined and well-represented genres, where inherently fuzzy class boundaries are less likely to counteract the effect of careful feature selection.</Paragraph>
    </Section>
  </Section>
</Paper>