<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1657">
<Title>Markov Chains and Author Unmasking: An Investigation</Title>
<Section position="4" start_page="482" end_page="483" type="intro">
<SectionTitle> 2 Markov Chain Based Approaches </SectionTitle>
<Paragraph position="0"> The opinion on how likely a given text X was written by author A, rather than by any other author, can be found via a log-likelihood ratio:</Paragraph>
<Paragraph position="1"> O_{A,G}(X) = \frac{1}{|e_z(X)|} \log \frac{p_A(e_z(X))}{p_G(e_z(X))} </Paragraph>
<Paragraph position="2"> where z \in \{words, chars\} and e_z(X) extracts an ordered set of items from X (the items are either words or characters, as indicated by z). The term |e_z(X)|^{-1} is used as a normalisation for the varying number of items, while p_A(e_z(X)) and p_G(e_z(X)) estimate the likelihood of the text having been written by author A and by a generic author, G, respectively.</Paragraph>
<Paragraph position="3"> Given a threshold t, text X is classified as having been written by author A when O_{A,G}(X) > t, or as written by someone else when O_{A,G}(X) \le t.</Paragraph>
<Paragraph position="4"> The |e_z(X)|^{-1} normalisation term allows for the use of a common threshold (i.e. shared by all authors), which facilitates the interpretation of performance (e.g. via the Equal Error Rate (EER) point on a Receiver Operating Characteristic (ROC) curve (Ortega-Garcia et al., 2004)). Appropriating a technique originally used in language modelling (Chen and Goodman, 1999), the likelihood of author A having written a particular sequence of items, X = (i_1, i_2, ..., i_{|X|}), can be approximated using the joint probability of all m-th order Markov chains in the sequence:</Paragraph>
<Paragraph position="5"> p_A(e_z(X)) \approx \prod_{j=m+1}^{|X|} p_A(i_j \mid i_{j-m}^{j-1})  (1) </Paragraph>
<Paragraph position="6"> where i_{j-m}^{j-1} is shorthand for i_{j-m} ... i_{j-1} and m indicates the length of the history. Given training material for author A, denoted as X_A, the maximum likelihood (ML) probability estimate for a particular m-th order Markov chain is:</Paragraph>
<Paragraph position="7"> p_A^{ML}(i_j \mid i_{j-m}^{j-1}) = \frac{C(i_{j-m}^{j} \mid X_A)}{C(i_{j-m}^{j-1} \mid X_A)}  (2) </Paragraph>
<Paragraph position="8"> where C(i_{j-m}^{j} \mid X_A) is the number of times the sequence i_{j-m}^{j} occurs in X_A. For chains that have not been seen during training, elaborate smoothing techniques (Chen and Goodman, 1999) are utilised to avoid zero probabilities in Eqn. (1). The probabilities for the generic author are estimated from a dataset comprised of texts from many authors.</Paragraph>
<Paragraph position="9"> In this work we utilise interpolated Moffat smoothing, where the probability of an m-th order chain is a linear interpolation of its ML estimate and the smoothed probability estimate of the corresponding (m-1)-th order chain:</Paragraph>
<Paragraph position="10"> p_A^{mof}(i_j \mid i_{j-m}^{j-1}) = \lambda \, p_A^{ML}(i_j \mid i_{j-m}^{j-1}) + (1 - \lambda) \, p_A^{mof}(i_j \mid i_{j-m+1}^{j-1}), \quad \lambda = \frac{C(i_{j-m}^{j-1} \mid X_A)}{C(i_{j-m}^{j-1} \mid X_A) + |\{i_j : C(i_{j-m}^{j-1} i_j \mid X_A) > 0\}|} </Paragraph>
<Paragraph position="11"> Here, |\{i_j : C(i_{j-m}^{j-1} i_j \mid X_A) > 0\}| is the number of unique (m+1)-grams that have the same i_{j-m}^{j-1} history items. Further elucidation of this method is given in (Chen and Goodman, 1999; Witten and Bell, 1991).</Paragraph>
<Paragraph position="12"> The (m-1)-th order probability will typically correlate with the m-th order probability and has the advantage of being estimated from a larger number of examples (Chen and Goodman, 1999).</Paragraph>
<Paragraph position="13"> The 0-th order probability is interpolated with the uniform distribution, given by p_A^{unif} = 1/|V_A|, where |V_A| is the vocabulary size (Chen and Goodman, 1999).</Paragraph>
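To make the estimation of Eqns. (1)-(2) and the interpolated smoothing above concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name MarkovAuthorModel, its internal data structures, and the handling of histories shorter than m are assumptions made purely for illustration.

```python
from collections import defaultdict

class MarkovAuthorModel:
    """m-th order Markov model with Witten-Bell style interpolation,
    a sketch of the interpolated Moffat smoothing described above."""

    def __init__(self, order):
        self.m = order
        self.ngram_counts = defaultdict(int)    # C(i_{j-m}^{j} | X_A)
        self.history_counts = defaultdict(int)  # C(i_{j-m}^{j-1} | X_A)
        self.continuations = defaultdict(set)   # unique items seen after each history
        self.vocab = set()                      # V_A

    def train(self, items):
        """items: ordered list of words or characters, i.e. e_z(X_A)."""
        self.vocab.update(items)
        for n in range(self.m + 1):             # collect counts for orders 0..m
            for j in range(n, len(items)):
                hist = tuple(items[j - n:j])
                self.ngram_counts[hist + (items[j],)] += 1
                self.history_counts[hist] += 1
                self.continuations[hist].add(items[j])

    def prob(self, item, history, p_unif=None):
        """Smoothed p(i_j | i_{j-m}^{j-1}); p_unif defaults to 1/|V_A|."""
        if p_unif is None:
            p_unif = 1.0 / max(len(self.vocab), 1)
        hist = tuple(history[-self.m:]) if self.m > 0 else ()
        return self._prob(item, hist, p_unif)

    def _prob(self, item, hist, p_unif):
        c_hist = self.history_counts[hist]
        if len(hist) == 0:
            if c_hist == 0:
                return p_unif                   # no training material at all
            unique = len(self.continuations[hist])
            lam = c_hist / (c_hist + unique)
            p_ml = self.ngram_counts[(item,)] / c_hist
            # 0-th order estimate interpolated with the uniform distribution
            return lam * p_ml + (1.0 - lam) * p_unif
        if c_hist == 0:
            # unseen history: back off to the reduced order chain
            return self._prob(item, hist[1:], p_unif)
        unique = len(self.continuations[hist])
        lam = c_hist / (c_hist + unique)
        p_ml = self.ngram_counts[hist + (item,)] / c_hist
        # linear interpolation of the ML estimate with the (m-1)-th order estimate
        return lam * p_ml + (1.0 - lam) * self._prob(item, hist[1:], p_unif)
```

The recursion in _prob mirrors the text: unseen histories trigger the back-off, and the recursion bottoms out in the 0-th order estimate interpolated with the uniform distribution.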
<Paragraph position="14"> When an m-th order chain has a history (i.e. the items i_{j-m}^{j-1}) which hasn't been observed during training, a back-off to the corresponding reduced order chain is done: if C(i_{j-m}^{j-1} \mid X_A) = 0, then</Paragraph>
<Paragraph position="15"> p_A^{mof}(i_j \mid i_{j-m}^{j-1}) = p_A^{mof}(i_j \mid i_{j-m+1}^{j-1}) </Paragraph>
<Paragraph position="16"> Note that if the 0-th order chain also hasn't been observed during training, we are effectively backing off to the uniform distribution.</Paragraph>
<Paragraph position="17"> A caveat: the training dataset for an author can be much smaller (and hence have a smaller vocabulary) than the combined training dataset for the generic author, resulting in p_A^{unif} > p_G^{unif}. Thus when a previously unseen chain is encountered there is a dangerous bias towards author A, i.e. p_A^{mof}(i_j \mid i_{j-m}^{j-1}) > p_G^{mof}(i_j \mid i_{j-m}^{j-1}). To avoid this, p_A^{unif} must be set equal to p_G^{unif}.</Paragraph>
</Section>
</Paper>
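Building on the sketch above, the opinion score and threshold decision described at the start of this section could be computed as follows. The function name opinion, the shared uniform probability taken from the generic model's vocabulary, and the example threshold of 0 are illustrative assumptions, not details from the paper.

```python
import math

def opinion(author_model, generic_model, items, order):
    """Length-normalised log-likelihood ratio O_{A,G}(X) for an item sequence e_z(X)."""
    # Shared uniform probability: p_unif_A set equal to p_unif_G, as the caveat above
    # requires, so that previously unseen chains do not bias the score towards author A.
    p_unif_shared = 1.0 / max(len(generic_model.vocab), 1)
    score = 0.0
    for j in range(order, len(items)):
        history = items[j - order:j]
        p_a = author_model.prob(items[j], history, p_unif=p_unif_shared)
        p_g = generic_model.prob(items[j], history, p_unif=p_unif_shared)
        score += math.log(p_a) - math.log(p_g)
    return score / max(len(items), 1)    # |e_z(X)|^{-1} normalisation

# Illustrative usage: classify the text as written by author A when the opinion
# exceeds a common threshold t (here t = 0, chosen arbitrarily for the example).
# author_A = MarkovAuthorModel(order=2); author_A.train(training_items_A)
# generic  = MarkovAuthorModel(order=2); generic.train(training_items_G)
# is_author_A = opinion(author_A, generic, test_items, order=2) > 0.0
```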