<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1026">
  <Title>Linguistic Profiling for Author Recognition and Verification</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Tasks and Application Scenarios
</SectionTitle>
    <Paragraph position="0"> Traditionally, work on the attribution of a text to an author is done in one of two environments.</Paragraph>
    <Paragraph position="1"> The first is that of literary and/or historical research where attribution is sought for a work of unknown origin (e.g. Mosteller &amp; Wallace, 1984; Holmes, 1998). As secondary information generally identifies potential authors, the task is authorship recognition: selection of one author from a set of known authors. Then there is forensic linguistics, where it needs to be determined if a suspect did or did not write a specific, probably incriminating, text (e.g. Broeders, 2001; Chaski, 2001). Here the task is authorship verification: confirming or denying authorship by a single known author. We would like to focus on a third environment, viz. that of the handling of large numbers of student essays.</Paragraph>
    <Paragraph position="2"> For some university courses, students have to write one or more essays every week and submit them for grading. Authorship recognition is needed in the case the sloppy student, who forgets to include his name in the essay. If we could link such an essay to the correct student ourselves, this would prevent delays in handling the essay. Authorship verification is needed in the case of the fraudulous student, who has decided that copying is much less work than writing an essay himself, which is only easy to spot if the original is also submitted by the original author.</Paragraph>
    <Paragraph position="3"> In both scenarios, the test material will be sizable, possibly around a thousand words, and at least several hundred. Training material can be sufficiently available as well, as long as text collection for each student is started early enough.</Paragraph>
    <Paragraph position="4"> Many other authorship verification scenarios do not have the luxury of such long stretches of test text. For now, however, we prefer to test the basic viability of linguistic profiling on such longer stretches. Afterwards, further experiments can show how long the test texts need to be to reach an acceptable recognition/verification quality.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Quality Measures
</SectionTitle>
      <Paragraph position="0"> For recognition, quality is best expressed as the percentage of correct choices when choosing between N authors, where N generally depends on the attribution problem at hand. We will use the percentage of correct choices between two authors, in order to be able to compare with previous work. For verification, quality is usually expressed in terms of erroneous decisions. When the system is asked to verify authorship for the actual author of a text and decides that the text was not written by that author, we speak of a False Reject. The False Reject Rate (FRR) is the percentage of cases in which this happens, the percentage being taken from the cases which should be accepted. Similarly, the False Accept Rate (FAR) is the percentage of cases where somebody who has not written the test text is accepted as having written the text. With increasing threshold settings, FAR will go down, while FRR goes up. The behaviour of a system can be shown by one of several types of FAR/FRR curve, such as the Receiver Operating Characteristic (ROC).</Paragraph>
      <Paragraph position="1"> Alternatively, if a single number is preferred, a popular measure is the Equal Error Rate (EER), viz. the threshold value where FAR is equal to FRR. However, the EER may be misleading, since it does not take into account the consequences of the two types of errors. Given the example application, plagiarism detection, we do not want to reject, i.e. accuse someone of plagiarism, unless we are sure. So we would like to measure the quality of the system with the False Accept Rate at the threshold at which the False Reject Rate becomes zero.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Test Corpus
</SectionTitle>
      <Paragraph position="0"> Before using linguistic profiling for any real task, we should test the technique on a benchmark corpus. The first component of the Dutch Authorship Benchmark Corpus (ABC-NL1) appears to be almost ideal for this purpose. It contains widely divergent written texts produced by first-year and fourth-year students of Dutch at the University of Nijmegen. The ABC-NL1 consists of 72 Dutch texts by 8 authors, controlled for age and educational level of the authors, and for register, genre and topic of the texts. It is assumed that the authors' language skills were advanced, but their writing styles were as yet at only weakly developed and hence very similar, unlike those in literary attribution problems.</Paragraph>
      <Paragraph position="1"> Each author was asked to write nine texts of about a page and a half. In the end, it turned out that some authors were more productive than others, and that the text lengths varied from 628 to 1342 words. The authors did not know that the texts were to be used for authorship attribution studies, but instead assumed that their writing skill was measured. The topics for the nine texts were fixed, so that each author produced three argumentative non-fiction texts, on the television program Big Brother, the unification of Europe and smoking, three descriptive non-fiction texts, about soccer, the (then) upcoming new millennium and the most recent book they read, and three fiction texts, namely a fairy tale about Little Red Riding Hood, a murder story at the university and a chivalry romance.</Paragraph>
      <Paragraph position="2"> The ABC-NL1 corpus is not only well-suited because of its contents. It has also been used in previously published studies into authorship attribution. A 'traditional' authorship attribution method, i.e. using the overall relative frequencies of the fifty most frequent function words and a  correlation matrix of the corresponding 50dimensional vectors, fails completely (Baayen et al., 2002). The use of Linear Discriminant Analysis (LDA) on overall frequency vectors for the 50 most frequent words achieves around 60% correct attributions when choosing between two authors, which can be increased to around 80% by the application of cross-sample entropy weighting (Baayen et al., 2002). Weighted Probability Distribution Voting (WPDV) modeling on the basis of a very large number of features achieves 97.8% correct attributions (van Halteren et al., To Appear). Although designed to produce a hard recognition task, the latter result show that very high recognition quality is feasible. Still, this appears to be a good test corpus to examine the effectiveness of a new technique.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML