<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1005">
  <Title>Automatic Detection of Text Genre</Title>
  <Section position="3" start_page="33" end_page="33" type="metho">
    <SectionTitle>
2 Identifying Genres: Generic Cues
</SectionTitle>
    <Paragraph position="0"> This section discusses generic cues, the &amp;quot;observable&amp;quot; properties of a text that are associated with facets.</Paragraph>
    <Section position="1" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
2.1 Structural Cues
</SectionTitle>
      <Paragraph position="0"> Examples of structural cues are passives, nominalizations, topicalized sentences, and counts of the frequency of syntactic categories (e.g., part-of-speech tags). These cues are not much discussed in the traditional literature on genre, but have come to the fore in recent work (Biber, 1995; Karlgren and Cutting, 1994). For purposes of automatic classification they have the limitation that they require tagged or parsed texts.</Paragraph>
    </Section>
    <Section position="2" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
2.2 Lexical Cues
</SectionTitle>
      <Paragraph position="0"> Most facets are correlated with lexical cues. Examples of ones that we use are terms of address (e.g., Mr., Ms.), which predominate in papers like the New York Times; Latinate affixes, which signal certain highbrow registers like scientific articles or scholarly works; and words used in expressing dates, which are common in certain types of narrative such as news stories.</Paragraph>
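      <Paragraph> As an illustration only, lexical cues of this kind can be counted with simple pattern matching. The regular expressions and cue inventories below are hypothetical stand-ins, not the paper's actual cue definitions:

```python
import re

# Illustrative lexical cue inventories (hypothetical, not the paper's list).
ADDRESS_TERMS = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.")
LATINATE_SUFFIX = re.compile(r"\w+(?:tion|ity|ism|ize)\b")
MONTHS = re.compile(r"\b(?:January|February|March|April|May|June|July|"
                    r"August|September|October|November|December)\b")

def lexical_cues(text):
    """Count occurrences of each lexical cue in a raw text."""
    return {
        "address_terms": len(ADDRESS_TERMS.findall(text)),
        "latinate": len(LATINATE_SUFFIX.findall(text)),
        "date_words": len(MONTHS.findall(text)),
    }
```

Counts like these require no tagging or parsing, which is the practical advantage over structural cues noted above.</Paragraph>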
    </Section>
    <Section position="3" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
2.3 Character-Level Cues
</SectionTitle>
      <Paragraph position="0"> Character-level cues are mainly punctuation cues and other separators and delimiters used to mark text categories like phrases, clauses, and sentences (Nunberg, 1990). Such features have not been used in previous work on genre recognition, but we believe they have an important role to play, being at once significant and very frequent. Examples include counts of question marks, exclamation marks, capitalized and hyphenated words, and acronyms.</Paragraph>
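      <Paragraph> A minimal sketch of such character-level counts; the exact definitions here (e.g., treating any all-capital word of two or more letters as an acronym) are simplifying assumptions, not the paper's:

```python
import re

def char_cues(text):
    # Illustrative character-level cue counts: question marks, exclamation
    # marks, capitalized words, hyphenated words, and acronyms.
    words = text.split()
    return {
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        "capitalized": sum(1 for w in words if w[:1].isupper()),
        "hyphenated": sum(1 for w in words if "-" in w),
        "acronyms": len(re.findall(r"\b[A-Z]{2,}\b", text)),
    }
```
</Paragraph>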
    </Section>
    <Section position="4" start_page="33" end_page="33" type="sub_section">
      <SectionTitle>
2.4 Derivative Cues
</SectionTitle>
      <Paragraph position="0"> Derivative cues are ratios and variation measures derived from measures of lexical and character-level features.</Paragraph>
      <Paragraph position="1"> Ratios correlate in certain ways with genre, and have been widely used in previous work. We represent ratios implicitly as sums of other cues by transforming all counts into natural logarithms. For example, instead of estimating separate weights α, β, and γ for the ratios words per sentence (average sentence length), characters per word (average word length), and words per type (the token/type ratio), respectively, we express this desired weighting:</Paragraph>
      <Paragraph position="3"> (where W = word tokens, S = sentences, C = characters, T = word types). The 55 cues in our experiments can be combined into almost 3000 different ratios. The log representation ensures that all these ratios are available implicitly, while avoiding overfitting and the high computational cost of training on a large set of cues.</Paragraph>
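      <Paragraph> The identity behind this log transformation can be checked directly: a weighted sum of log ratios equals a linear function of the raw log counts. The counts and weights below are made-up values, not figures from the paper:

```python
import math

# Hypothetical per-text counts: word tokens, sentences, characters, types.
W, S, C, T = 2000.0, 90.0, 9500.0, 800.0
# Hypothetical weights for the three ratios W/S, C/W, and W/T.
alpha, beta, gamma = 0.5, 1.2, -0.3

# Weighted sum of log ratios ...
ratio_form = (alpha * math.log(W / S)
              + beta * math.log(C / W)
              + gamma * math.log(W / T))

# ... equals the same weighting expressed over raw log counts, so a model
# trained on log counts has every such ratio implicitly available.
log_count_form = ((alpha - beta + gamma) * math.log(W)
                  - alpha * math.log(S)
                  + beta * math.log(C)
                  - gamma * math.log(T))
```
</Paragraph>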
      <Paragraph position="4"> Variation measures capture the amount of variation of a certain count cue in a text (e.g., the standard deviation in sentence length). This type of useful metric has not been used in previous work on genre.</Paragraph>
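      <Paragraph> A minimal sketch of one such variation measure, here the population standard deviation of sentence length in words (the choice of population rather than sample deviation is an assumption):

```python
import statistics

def sentence_length_variation(sentences):
    # Variation cue: how much sentence length (in words) varies in a text.
    lengths = [len(s.split()) for s in sentences]
    return statistics.pstdev(lengths)
```
</Paragraph>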
      <Paragraph position="5"> The experiments in this paper are based on 55 cues from the last three groups: lexical, character-level and derivative cues. These cues are easily computable in contrast to the structural cues that have figured prominently in previous work on genre.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="33" end_page="34" type="metho">
    <SectionTitle>
3 Method
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="33" end_page="34" type="sub_section">
      <SectionTitle>
3.1 Corpus
</SectionTitle>
      <Paragraph position="0"> The corpus of texts used for this study was the Brown Corpus. For the reasons mentioned above, we used our own classification system, and eliminated texts that did not fall unequivocally into one of our categories. We ended up using 499 of the 802 texts in the Brown Corpus. (While the Corpus contains 500 samples, many of the samples contain several texts.) For our experiments, we analyzed the texts in terms of three categorical facets: BROW, NARRATIVE, and GENRE. BROW characterizes a text in terms of the presumptions made with respect to the required intellectual background of the target audience. Its levels are POPULAR, MIDDLE, UPPER-MIDDLE, and HIGH. For example, the mainstream American press is classified as MIDDLE and tabloid newspapers as POPULAR. The NARRATIVE facet is binary, telling whether a text is written in a narrative mode, primarily relating a sequence of events. The GENRE facet has the values REPORTAGE, EDITORIAL, SCITECH, LEGAL, NONFICTION, and FICTION.</Paragraph>
      <Paragraph position="1"> The first two characterize two types of articles from the daily or weekly press: reportage and editorials.</Paragraph>
      <Paragraph position="2"> The level SCITECH denominates scientific or technical writings, and LEGAL characterizes various types of writings about law and government administration. Finally, NONFICTION is a fairly diverse category encompassing most other types of expository writing, and FICTION is used for works of fiction.</Paragraph>
      <Paragraph position="3"> Our corpus of 499 texts was divided into a training subcorpus (402 texts) and an evaluation subcorpus (97 texts). The evaluation subcorpus was designed to have approximately equal numbers of all represented combinations of facet levels. Most such combinations have six texts in the evaluation corpus, but due to small numbers of some types of texts, some extant combinations are underrepresented. Within this stratified framework, texts were chosen by a pseudo-random number generator. This setup results in different quantitative compositions of the training and evaluation sets. For example, the most frequent genre level in the training subcorpus is REPORTAGE, but in the evaluation subcorpus NONFICTION predominates.</Paragraph>
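      <Paragraph> The stratified selection can be sketched as follows. The grouping key is assumed to be a tuple of facet levels, and `per_stratum=6` mirrors the six-texts-per-combination target; this is an illustrative reconstruction, not the authors' actual sampling code:

```python
import random
from collections import defaultdict

def stratified_eval_split(texts, facet_levels, per_stratum=6, seed=0):
    # Group texts by their combination of facet levels, then draw up to
    # per_stratum texts per combination into the evaluation set using a
    # seeded pseudo-random generator; the remainder form the training set.
    rng = random.Random(seed)
    strata = defaultdict(list)
    for text, levels in zip(texts, facet_levels):
        strata[levels].append(text)
    eval_set, train_set = [], []
    for members in strata.values():
        rng.shuffle(members)
        eval_set.extend(members[:per_stratum])
        train_set.extend(members[per_stratum:])
    return train_set, eval_set
```

Underrepresented combinations simply contribute fewer than six texts, matching the behavior described above.</Paragraph>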
    </Section>
    <Section position="2" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.2 Logistic Regression
</SectionTitle>
      <Paragraph position="0"> We chose logistic regression (LR) as our basic numerical method. Two informal pilot studies indicated that it gave better results than linear discrimination and linear regression.</Paragraph>
      <Paragraph position="1"> LR is a statistical technique for modeling a binary response variable by a linear combination of one or more predictor variables, using the logit link function g(π) = log(π/(1 - π)),</Paragraph>
      <Paragraph position="3"> and modeling variance with a binomial random variable; i.e., the dependent variable log(π/(1 - π)) is modeled as a linear combination of the independent variables. The model has the form g(π) = xiβ, where π is the estimated response probability (in our case the probability of a particular facet value), xi is the feature vector for text i, and β is the weight vector, which is estimated from the matrix of feature vectors. The optimal value of β is derived via maximum likelihood estimation (McCullagh and Nelder, 1989), using S-Plus (Statistical Sciences, 1991).</Paragraph>
      <Paragraph position="4"> For binary decisions, the application of LR was straightforward. For the polytomous facets GENRE and BROW, we computed a predictor function independently for each level of each facet and chose the category with the highest prediction.</Paragraph>
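      <Paragraph> A minimal sketch of this one-predictor-per-level scheme; the weight vectors here are hypothetical placeholders, whereas the paper estimates them by maximum likelihood in S-Plus:

```python
import numpy as np

def logit_predict(x, beta):
    # Invert the logit link: g(pi) = x . beta, so pi = 1/(1 + exp(-x . beta)).
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

def classify_polytomous(x, level_betas):
    # One independently fitted predictor per facet level; choose the level
    # with the highest predicted probability.
    probs = {level: logit_predict(x, b) for level, b in level_betas.items()}
    return max(probs, key=probs.get)
```
</Paragraph>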
      <Paragraph position="5"> The most discriminating of the 55 variables were selected using stepwise backward selection based on the AIC criterion (see documentation for STEP.GLM in Statistical Sciences (1991)). A separate set of variables was selected for each binary discrimination task.</Paragraph>
      <Paragraph position="6">  In order to see whether our easily-computable surface cues are comparable in power to the structural cues used in Karlgren and Cutting (1994), we also ran LR with the cues used in their experiment. Because we use individual texts in our experiments instead of the fixed-length conglomerate samples of Karlgren and Cutting, we averaged all count features over text length.</Paragraph>
    </Section>
    <Section position="3" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
3.3 Neural Networks
</SectionTitle>
      <Paragraph position="0"> Because of the high number of variables in our experiments, there is a danger that overfitting occurs.</Paragraph>
      <Paragraph position="1"> LR also forces us to simulate polytomous decisions by a series of binary decisions, instead of directly modeling a multinomial response. Finally, classical LR does not model variable interactions.</Paragraph>
      <Paragraph position="2"> For these reasons, we ran a second set of experiments with neural networks, which generally do well with a high number of variables because they protect against overfitting. Neural nets also naturally model variable interactions. We used two architectures: a simple perceptron (a two-layer feed-forward network with all input units connected to all output units), and a multi-layer perceptron with all input units connected to all units of the hidden layer, and all units of the hidden layer connected to all output units. For binary decisions, such as determining whether or not a text is NARRATIVE, the output layer consists of one sigmoidal output unit; for polytomous decisions, it consists of four (BROW) or six (GENRE) softmax units (which implement a multinomial response model) (Rumelhart et al., 1995).</Paragraph>
      <Paragraph position="3"> The size of the hidden layer was chosen to be three times as large as the size of the output layer (3 units for binary decisions, 12 units for BROW, 18 units for GENRE).</Paragraph>
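      <Paragraph> The multi-layer architecture described above can be sketched as a forward pass; the weights are random placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x, n_out):
    # Fully connected input-to-hidden and hidden-to-output layers, with the
    # hidden layer three times the size of the output layer, sigmoid hidden
    # units, and softmax outputs implementing a multinomial response.
    n_hidden = 3 * n_out
    w1 = rng.normal(size=(x.size, n_hidden))
    w2 = rng.normal(size=(n_hidden, n_out))
    h = 1.0 / (1.0 + np.exp(-(x @ w1)))
    return softmax(h @ w2)
```
</Paragraph>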
      <Paragraph position="4"> For binary decisions, the simple perceptron fits a logistic model just as LR does. However, it is less prone to overfitting because we train it using three-fold cross-validation. Variables are selected by summing the cross-entropy error over the three validation sets and eliminating the variable whose removal results in the lowest summed cross-entropy error.</Paragraph>
      <Paragraph position="5"> The elimination cycle is repeated until this summed cross-entropy error starts increasing. Because this selection technique is time-consuming, we only apply it to a subset of the discriminations.</Paragraph>
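      <Paragraph> The elimination loop can be sketched as follows, assuming a hypothetical `cv_error` function that maps a variable subset to its summed cross-validation cross-entropy:

```python
def backward_eliminate(variables, cv_error):
    # Greedy backward selection: repeatedly drop the variable whose removal
    # yields the lowest summed cross-validation error, stopping once that
    # best achievable error starts increasing.
    current = list(variables)
    best_err = cv_error(current)
    while len(current) > 1:
        candidates = []
        for v in current:
            subset = [u for u in current if u != v]
            candidates.append((cv_error(subset), subset))
        err, subset = min(candidates)
        if err > best_err:
            break
        current, best_err = subset, err
    return current
```

Each pass refits once per remaining variable, which is why the paper notes the procedure is time-consuming and applies it only to a subset of the discriminations.</Paragraph>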
    </Section>
  </Section>
</Paper>