<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1026">
  <Title>Linguistic Profiling for Author Recognition and Verification</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Linguistic Profiling
</SectionTitle>
    <Paragraph position="0"> In linguistic profiling, the occurrences in a text are counted of a large number of linguistic features, either individual items or combinations of items. These counts are then normalized for text length and it is determined how much (i.e. how many standard deviations) they differ from the mean observed in a profile reference corpus. For the authorship task, the profile reference corpus consists of the collection of all attributed and non-attributed texts, i.e. the entire ABC-NL1 corpus. For each text, the deviation scores are combined into a profile vector, on which a variety of distance measures can be used to position the text in relation to any group of other texts.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Features
</SectionTitle>
      <Paragraph position="0"> Many types of linguistic features can be profiled, such as features referring to vocabulary, lexical patterns, syntax, semantics, pragmatics, information content or item distribution through a text.</Paragraph>
      <Paragraph position="1"> However, we decided to restrict the current experiments to a few simpler types of features to demonstrate the overall techniques and methodology for profiling before including every possible type of feature. In this paper, we first show the results for lexical features and continue with syntactic features, since these are the easiest ones to extract automatically for these texts. Other features will be the subject of further research.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Authorship Score Calculation
</SectionTitle>
      <Paragraph position="0"> In the problem at hand, the system has to decide if an unattributed text is written by a specific author, on the basis of attributed texts by that and other authors. We test our system's ability to make this distinction by means of a 9-fold cross-validation experiment. In each set of runs of the system, the training data consists of attributed texts for eight of the nine essay topics. The test data consists of the unattributed texts for the ninth essay topic. This means that for all runs, the test data is not included in the training data and is about a different topic than what is present in the training material. During each run within a set, the system only receives information about whether each training text is written by one specific author. All other texts are only marked as &amp;quot;not by this author&amp;quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Raw Score
</SectionTitle>
      <Paragraph position="0"> The system first builds a profile to represent text written by the author in question. This is simply the featurewise average of the profile vectors of all text samples marked as being written by the author in question. The system then determines a raw score for all text samples in the list. Rather than using the normal distance measure, we opted for a non-symmetric measure which is a weighted combination of two factors: a) the difference between sample score and author score for each feature and b) the sample score by itself.</Paragraph>
      <Paragraph position="1"> This makes it possible to assign more importance to features whose count deviates significantly from the norm. The following distance formula is used:</Paragraph>
      <Paragraph position="3"> are the values for the i th feature for the text sample profile and the author profile respectively, and D and S are the weighting factors that can be used to assign more or less importance to the two factors described. We will see below how the effectiveness of the measure varies with their setting. The distance measure is then transformed into a score by the formula</Paragraph>
      <Paragraph position="5"> In this way, the score will grow with the similarity between text sample profile and author profile. Also, the first component serves as a correction factor for the length of the text sample profile vector.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Normalization and Renormalization
</SectionTitle>
      <Paragraph position="0"> The order of magnitude of the score values varies with the setting of D and S. Furthermore, the values can fluctuate significantly with the sample collection. To bring the values into a range which is suitable for subsequent calculations, we express them as the number of standard deviations they differ from the mean of the scores of the text samples marked as not being written by the author in question.</Paragraph>
      <Paragraph position="1"> In the experiments described in this paper, a rather special condition holds. In all tests, we know that the eight test samples are comparable in that they address the same topic, and that the author to be verified produced exactly one of the eight test samples. Under these circumstances, we should expect one sample to score higher than the others in each run, and we can profit from this knowledge by performing a renormalization, viz. to the number of standard deviations the score differs from the mean of the scores of the unattributed samples. However, this renormalization only makes sense in the situation that we have a fixed set of authors who each produced one text for each topic. This is in fact yet a different task than those mentioned above, say authorship sorting. Therefore, we will report on the results with renormalization, but only as additional information. The main description of the results will focus on the normalized scores.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Profiling with Lexical Features
</SectionTitle>
    <Paragraph position="0"> The most straightforward features that can be used are simply combinations of tokens in the text.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Lexical features
</SectionTitle>
      <Paragraph position="0"> Sufficiently frequent tokens, i.e. those that were observed at least a certain amount of times (in this case 5) in some language reference corpus (in this case the Eindhoven corpus; uit den Boogaart, 1975) are used as features by themselves. For less frequent tokens we determine a token pattern consisting of the sequence of character types, e.g., the token &amp;quot;Uefa-cup&amp;quot; is represented by the pattern &amp;quot;#L#6+/CL-L&amp;quot;, where the first &amp;quot;L&amp;quot; indicates low frequency, 6+ the size bracket, and the sequence &amp;quot;CL-L&amp;quot; a capital letter followed by one or more lower case letters followed by a hyphen and again one or more lower case letters. For lower case words, the final three letters of the word are included too, e.g. &amp;quot;waarmaken&amp;quot; leads to &amp;quot;#L#6+/L/ken&amp;quot;. These patterns have been originally designed for English and Dutch and will probably have to be extended when other languages are being handled.</Paragraph>
      <Paragraph position="1"> In addition to the form of the token, we also use the potential syntactic usage of the token as a feature. We apply the first few modules of a morphosyntactic tagger (in this case Wotan-Lite; Van Halteren et al., 2001) to the text, which determine which word class tags could apply to each token. For known words, the tags are taken from a lexicon; for unknown words, they are estimated on the basis of the word patterns described above. The three (if present) most likely tags are combined into a feature, e.g. &amp;quot;niet&amp;quot; leads to &amp;quot;#H#Adv(stell,onverv)-N(ev,neut)&amp;quot; and &amp;quot;waarmaken&amp;quot; to &amp;quot;#L#V(inf)-N(mv,neut)-V(verldw, onverv)&amp;quot;. Note that the most likely tags are determined on the basis of the token itself and that the context is not consulted. The modules of the tagger which do context dependent disambiguation are not applied.</Paragraph>
      <Paragraph position="2"> Op top of the individual token and tag features we use all possible bi- and trigrams which can be built with them, e.g. the token combination &amp;quot;kon niet waarmaken&amp;quot; leads to features such as &amp;quot;wcw=#H#kon#H#Adv(stell,onverv)-N(ev,neut) #L#6+/L/ken&amp;quot;. Since the number of features quickly grows too high for efficient processing, we filter the set of features by demanding that a feature occurs in a set minimum number of texts in the profile reference corpus (in this case two).</Paragraph>
      <Paragraph position="3"> A feature which is filtered out instead contributes to a rest category feature, e.g. the feature above would contribute to &amp;quot;wcw=&lt;OTHER&gt;&amp;quot;. For the current corpus, this filtering leads to a feature set of about 100K features.</Paragraph>
      <Paragraph position="4"> The lexical features currently also include features for utterance length. Each utterance leads to two such features, viz. the exact length (e.g.</Paragraph>
      <Paragraph position="5"> &amp;quot;len=15&amp;quot;) and the length bracket (e.g. &amp;quot;len=1019&amp;quot;). null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Results with lexical features
</SectionTitle>
      <Paragraph position="0"> A very rough first reconnaissance of settings for D and S suggested that the best results could be achieved with D between 0.1 and 2.4 and S between 0.0 and 1.0. Further examination of this area leads to FAR FRR=0 scores ranging down to around 15%. Figure 1 shows the scores at various settings for D and S. The z-axis is inverted (i.e. 1</Paragraph>
      <Paragraph position="2"> is used) to show better scores as peaks rather than troughs.</Paragraph>
      <Paragraph position="3"> The most promising area is the ridge along the trough at D=0.0, S=0.0. A closer investigation of this area shows that the best settings are D=0.575 and S=0.15. The FAR FRR=0 score here is 14.9%, i.e. there is a threshold setting such that if all texts by the authors themselves are accepted, only 14.9% of texts by other authors are falsely accepted.</Paragraph>
      <Paragraph position="4"> The very low value for S is surprising. It indicates that it is undesirable to give too much attention to features which deviate much in the sample being measured; still, in the area in question, the score does peak at a positive S value, indicating that some such weighting does have effect. Successful low scores for S can also be seen in the hill leading around D=1.0, S=0.3, which peaks at an FAR FRR=0 score of around 17 percent. From the shape of the surface it would seem that an investigation of the area across the S=0.0 divide might still be worthwhile, which is in contradiction with the initial finding that negative values produce no useful results.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Beyond Lexical Features
</SectionTitle>
    <Paragraph position="0"> As stated above, once the basic viability of the technique was confirmed, more types of features would be added. As yet, this is limited to syntactic features. We will first describe the system quality using only syntactic features, and then describe the results when using lexical and syntactic features in combination.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Syntactic Features
</SectionTitle>
      <Paragraph position="0"> We used the Amazon parser to derive syntactic constituent analyses of each utterance (Coppen,  the right hand side, in their actual order, examining dominance and linear precedence null For each label, two representations are used. The first is only the syntactic constituent label, the second is the constituent label plus the head word. This is done for each part of the N-grams independently, leading to 2, 4 and 8 features respectively for the three types of N-gram. Furthermore, each feature is used once by itself, once with an additional marking for the depth of the rewrite in the analysis tree, once with an additional marking for the length of the rewrite, and once with both these markings. This means another multiplication factor of four for a total of 8, 16 and 32 features respectively. After filtering for minimum number of observations, again at least an observation in two different texts, there are about 900K active syntactic features, nine times as many as for the lexical features.</Paragraph>
      <Paragraph position="1"> Investigation of the results for various settings has not been as exhaustive as for the lexical features. The best settings so far, D=1.3, S=1.4,</Paragraph>
      <Paragraph position="3"> of 24.8%, much worse than the 14.9% seen for lexical features.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Combining Lexical and Syntactic Features
</SectionTitle>
      <Paragraph position="0"> as a function of D and S, with D ranging from 0.1 to 2.4 and S from 0.0 to 1.0.</Paragraph>
      <Paragraph position="1"> ther, since they perform much worse than lexical ones. However, they might still be useful if we combine their scores with those for the lexical features. For now, rather than calculating new combined profiles, we just added the scores from the two individual systems. The combination of the best two individual systems leads to an FAR FRR=0 of 10.3%, a solid improvement over lexical features by themselves. However, the best individual systems are not necessarily the best combiners. The best combination systems produce</Paragraph>
      <Paragraph position="3"> measurements down to 8.1%, with settings in different parts of the parameter space. It should be observed that the improvement gained by combination is linked to the chosen quality measure. If we examine the ROC-curves for several types of systems (plotting the FAR against the FRR; Figure 2), we see that the combination curves as a whole do not differ much from the lexical feature curve. In fact, the EER for the 'best' combination system is worse than that for the best lexical feature system. This means that we should be very much aware of the relative importance of FAR and FRR in any specific application when determining the 'optimal' features and parameters.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Parameter Settings
</SectionTitle>
    <Paragraph position="0"> A weak point in the system so far is that there is no automatic parameter selection. The best results reported above are the ones at optimal settings. One would hope that optimal settings on training/tuning data will remain good settings for new data. Further experiments on other data will have to shed more light on this. Another choice which cannot yet be made automatically is that of a threshold. So far, the presentation in this paper has been based on a single threshold for all author/text combinations. That there is an enormous potential for improvement can be shown by assuming a few more informed methods of threshold selection.</Paragraph>
    <Paragraph position="1"> The first method uses the fact that, in our experiments, there are always one true and seven false authors. This means we can choose the threshold at some point below the highest of the eight scores. We can hold on to the single threshold strategy if we first renormalize, as described in Section 3.4, and then choose a single value to threshold the renormalized values against. The second method assumes that we will be able to find an optimal threshold for each individual run of the system. The maximum effect of this can be estimated with an oracle providing the optimal threshold. Basically, since the oracle threshold will be at the score for the text by the author, we  varying threshold at good settings of D and S for different types of features. The top pane shows the whole range (0 to 1) for FAR and FRR. The bottom pane shows the area from 0.0 to 0.2.</Paragraph>
    <Paragraph position="2"> are examining how many texts by other authors score better than the text by the actual author.</Paragraph>
    <Paragraph position="3"> Table 1 compares the results for the best settings for these two new scenarios with the results presented above. Renormalizing already greatly improves the results. Interestingly, in this scenario, the syntactic features outperform the lexical ones, something which certainly merits closer investigation after the parameter spaces have been charted more extensively. The full potential of profiling becomes clear in the Oracle threshold scenario, which shows extremely good scores.</Paragraph>
    <Paragraph position="4"> Still, this potential will yet have to be realized by finding the right automatic threshold determination mechanism.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Comparison to Previous Authorship Attribution Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Attribution Work
</SectionTitle>
      <Paragraph position="0"> Above, we focused on the authorship verification task, since it is the harder problem, given that the potential group of authors is unknown. However, as mentioned in Section 2, previous work with this data has focused on the authorship recognition problem, to be exact on selecting the correct author out of two potential authors. We repeat the previously published results in Table 2, together with linguistic profiling scores, both for the 2-way and for the 8-way selection problem.</Paragraph>
      <Paragraph position="1"> To do attribution with linguistic profiling, we calculated the author scores for each author from the set for a given text, and then selected the author with the highest score. The results are shown in Table 2, using lexical or syntactic features or both, and with and without renormalization. The Oracle scenario is not applicable as we are comparing rather than thresholding.</Paragraph>
      <Paragraph position="2"> In each case, the best results are not just found at a single parameter setting, but rather over a larger area in the parameter space. This means that the choice of optimal parameters will be more robust with regard to changes in authors and text types.</Paragraph>
      <Paragraph position="3"> We also observe that the optimal settings for recognition are very different from those for verification. A more detailed examination of the results is necessary to draw conclusions about these differences, which is again not possible until the parameter spaces have been charted more exhaustively.</Paragraph>
      <Paragraph position="4">  methods.</Paragraph>
      <Paragraph position="5"> All results with normalized scores are already better than the previously published results.</Paragraph>
      <Paragraph position="6"> When applying renormalization, which might be claimed to be justified in this particular authorship attribution problem, the combination system reaches the incredible level of making no mistakes at all.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML