<?xml version="1.0" standalone="yes"?> <Paper uid="N06-2035"> <Title>Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach</Title> <Section position="6" start_page="139" end_page="139" type="evalu"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The system performs better with intra-token segmentation because it is forced to guess unseen events on fewer occasions. For instance, given the input URL www.ipodipodipod.com, a system that segments solely on punctuation forces both the spam model and the good model to estimate the probability of the single token ipodipodipod, so the result depends entirely on the smoothing technique. With intra-token segmentation, the same URL can instead be split into smaller units such as ipod, which the models are more likely to have seen in training.</Paragraph> <Paragraph position="1"> Even if the system reached the average accuracy of the human annotators, we would still expect room for improvement, since the maximum accuracy among the human annotators is 90%. Among the errors of the segmenter, the most common involve plural nouns ('girl*s' vs. 'girls') and past-tense verbs ('dedicate*d' vs. 'dedicated').</Paragraph> <Paragraph position="2"> The proposed approach has ramifications for splog filtering systems that want to consider the outward links from a weblog.</Paragraph> </Section> </Paper>