<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1657"> <Title>Markov Chains and Author Unmasking: An Investigation</Title> <Section position="10" start_page="489" end_page="489" type="concl"> <SectionTitle> 6 Main Findings and Future Directions </SectionTitle> <Paragraph position="0"> In this paper we investigated the use of character and word sequence kernels for the task of authorship attribution and compared their performance with two probabilistic approaches based on Markov chains of characters and words. The evaluations were done on a relatively large dataset (50 authors), where each author covered several topics. Rather than using the restrictive closed-set identification setup, a verification setup was used, which takes into account the realistic case of texts which are not written by any of the hypothesised authors.</Paragraph> <Paragraph position="1"> We also appraised the applicability of the recently proposed author unmasking approach for dealing with relatively short texts.</Paragraph> <Paragraph position="2"> In the framework of Support Vector Machines, several configurations of the sequence kernels were studied, showing that word sequence kernels do not achieve better performance than a bag-of-words kernel. Character sequence kernels (using sequences of length 4) generally perform better than the bag-of-words kernel and comparably to the two probabilistic approaches.</Paragraph> <Paragraph position="3"> A possible advantage of character sequence kernels over word-based kernels is their inherent ability to do partial matching of words. Let us consider two examples. (i) Given the words &quot;negotiation&quot; and &quot;negotiate&quot;, the character sequence kernel can match &quot;negotiat&quot;, while a standard word-based kernel requires explicit word stemming beforehand in order to match the two related words (as done in our experiments). 
(ii) Given the words &quot;negotiation&quot; and &quot;desalination&quot;, a character sequence kernel can match the common ending &quot;ation&quot;. Particular word endings may be indicative of a particular author's style; such information would not be picked up by a standard word-based kernel.</Paragraph> <Paragraph position="4"> Interestingly, the bag-of-words kernel based approach obtains worse performance than the corresponding word-based Markov chain approach.</Paragraph> <Paragraph position="5"> Apart from the issue of sparse feature space representation, factors such as the chunk size and the setting of the C parameter in SVM training can also affect the generalisation performance.</Paragraph> <Paragraph position="6"> The results also show that the amount of training material has more influence on discrimination performance than the amount of test material; about 5000 training words are required to obtain relatively good performance when using between 1250 and 5000 test words.</Paragraph> <Paragraph position="7"> Further experiments suggest that the author unmasking approach is less useful when dealing with relatively short texts, due to the unmasking effect being considerably less pronounced than for long texts, and due to different-author unmasking curves closely resembling the same-author curves.</Paragraph> <Paragraph position="8"> In future work it would be useful to appraise composite kernels (Joachims et al., 2001) in order to combine character and word sequence kernels. If the two kernel types use (partly) complementary information, better performance could be achieved. Furthermore, more sophisticated character sequence kernels could be evaluated, such as the mismatch string kernels used in bioinformatics, where mutations in the sequences are allowed (Leslie et al., 2004).</Paragraph> </Section></Paper>
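The partial-matching behaviour described in the two examples above can be illustrated with a minimal sketch of a set-based character n-gram ("spectrum") kernel. This is a simplification of the sequence kernels evaluated in the paper, which typically count n-gram occurrences and apply weighting rather than intersecting plain sets; the function names `char_ngrams` and `spectrum_kernel` are illustrative, not from the paper.

```python
def char_ngrams(text, n=4):
    """Return the set of character n-grams occurring in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def spectrum_kernel(a, b, n=4):
    """Unnormalised set-based spectrum kernel: number of shared
    character n-grams between two strings (a simplified sketch)."""
    return len(char_ngrams(a, n) & char_ngrams(b, n))

# (i) shared stem "negotiat" yields 5 shared 4-grams:
#     nego, egot, goti, otia, tiat
print(spectrum_kernel("negotiation", "negotiate"))    # 5

# (ii) only the common ending "ation" matches (atio, tion):
print(spectrum_kernel("negotiation", "desalination"))  # 2
```

A word-based bag-of-words kernel would score both pairs as zero without prior stemming, which is the advantage the examples above highlight.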