<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2805">
  <Title>Learning to Recognize Blogs: A Preliminary Exploration</Title>
  <Section position="4" start_page="24" end_page="25" type="metho">
    <SectionTitle>
3 Binary blog classification
</SectionTitle>
    <Paragraph position="0"> In our first experiment, we attempted binary blog classification (&amp;quot;is this a blog or not?&amp;quot;) using a small manually annotated dataset and a large variety of algorithms. The aim of this experiment was to discover what the performance of readily available, off-the-shelf algorithms is given this task.</Paragraph>
    <Paragraph position="1"> We used a broad spectrum of learners implemented in the well-known Weka machine learning toolkit (Witten and Frank, 2005).</Paragraph>
    <Section position="1" start_page="24" end_page="24" type="sub_section">
      <SectionTitle>
3.1 Dataset
</SectionTitle>
      <Paragraph position="0"> For our later resampling experiments, a large amount of data was gathered, as will be explained further on in this paper. To create a data-set for this experiment, 201 blog / blog-like pages were randomly selected from the collection, processed into Weka's arff format and manually annotated. These instances were then excluded from the rest of the collection. This yielded a small but reliable dataset, which we hoped would be sufficient for this task.</Paragraph>
    </Section>
    <Section position="2" start_page="24" end_page="25" type="sub_section">
      <SectionTitle>
3.2 Attribute selection
</SectionTitle>
      <Paragraph position="0"> All pages were processed into instances described by a variety of attributes. For binary blog classification to succeed, we had to find a large number of characteristics with which to accurately describe the data. This was done by manually browsing the HTML source code of several blogs as well as some simple intuition. These attributes range from &amp;quot;number of posts&amp;quot; and &amp;quot;post length&amp;quot; to checking for characteristic phrases such as &amp;quot;Comments&amp;quot; or &amp;quot;Archives&amp;quot; or checking for the use of style sheets. Interesting attributes are the &amp;quot;firstLine&amp;quot; / &amp;quot;lastLine&amp;quot; attributes, which calculate a score depending on the number of tokens found in those lines, which frequently occur in those lines in verified blog posts. The &amp;quot;contentType&amp;quot; attribute does something very similar, but based on the complete clean text of a page rather than particular lines in posts. It counts how many of the 100 most frequent tokens in clean text versions of actual blogs, are found in a page and returns a true  value if more than 60% of these are found, in which case the page is probably a blog. The &amp;quot;frequent terms&amp;quot;-lists for these attributes were generated using a manually verified list gathered from a general purpose dataset used for earlier experiments. A &amp;quot;host&amp;quot;-attribute is also used, which we binarised into a large number of binary host name attributes as most machine learning algorithms cannot cope with string attributes. For this purpose we took the 30 most common hosts in our dataset, which included Livejournal,  etc., but also a number of hosts that are obviously not blog sites (but host many pages that resemble blogs). Negative indicators on common hosts that don't serve blogs are just as valuable to the machine learner as the positive indicators of common blog hosts. Last but not least a binary attribute was added that acts as a class label for the instance. This process left us with the following 46 attributes:</Paragraph>
    </Section>
    <Section position="3" start_page="25" end_page="25" type="sub_section">
      <SectionTitle>
3.3 Experimental setup
</SectionTitle>
      <Paragraph position="0"> For this experiment, we trained a wide range of learners using the manually annotated data and tested using ten-fold cross-validation. We then compared the results to a baseline.</Paragraph>
      <Paragraph position="1"> This baseline is based mostly on simple heuristics, and is an extended version of the WWWBlog-Identify null  perl module that is freely available online. First of all, a URL check is done which looks for a large number of the well-known blog hosts as an indicator. Should this fail, a search is done for metatags which indicate the use of well-known blog creation tools such as  etc.</Paragraph>
      <Paragraph position="2"> Should this also fail, an actual content search is done for other indicators such as particular icons blog creation tools leave on pages (&amp;quot;created using... .gif&amp;quot; etc). Next, the module checks for an RSS feed, and as a very last resort checks the number of times the term &amp;quot;blog&amp;quot; is used on the page as an indicator.</Paragraph>
      <Paragraph position="3"> In earlier research, our version of the module was manually tested by a small group of individuals and found to have an accuracy of roughly 80% which means it is very useful as a target to aim for with our machine learning algorithms and a good baseline.</Paragraph>
      <Paragraph position="4">  rect predictions for each algorithm tested. It is clear that all algorithms bar ZeroR perform well, most topping 90%. ZeroR achieves no more than 73%, and is the only algorithm that actually performs worse than our baseline. The best algorithm for this task, and on this dataset, is clearly the support vector-based algorithm SMO, which scores 94.75%. These scores can be considered excellent for a classification task, and the wide success across the range of algorithms shows that our attribute selection has been a success. The attributes clearly describe the data well.</Paragraph>
      <Paragraph position="5"> Full results of this experiment can be found in</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="25" end_page="28" type="metho">
    <SectionTitle>
4 Resampling
</SectionTitle>
    <Paragraph position="0"> Now we turn to the second of our research questions: to what extent can resampling methods help create better blog classifiers.</Paragraph>
    <Paragraph position="1"> As reported earlier, the blogosphere today contains millions of blogs and therefore potentially plenty of data for our classifier. However, this data is all unlabeled. Furthermore, we have a distinct lack of reliably labeled data. Resampling may provide us with a solution to this problem and allow us to reliably label the data from our unlabeled data source and further improve upon the results gained using our very small manually annotated dataset.</Paragraph>
    <Paragraph position="2"> For these experiments we selected two resampling methods. The first is ordinary bootstrapping, which we chose because it is the simplest way of relabeling unlabeled data on the basis of a machine learning model. Additionally, we chose a modified form of co-training, as co-training is also a well-known resampling method, which was easily adaptable to our problem and seemingly offered a good approach.</Paragraph>
    <Section position="1" start_page="25" end_page="27" type="sub_section">
      <SectionTitle>
4.1 Data set
</SectionTitle>
      <Paragraph position="0"> To gather a large data set containing both blogs and non-blogs, a crawler was developed that included a blog detection module based on the heuristics in our baseline module mentioned earlier.</Paragraph>
      <Paragraph position="1"> After downloading a page judged likely to be a blog by the module on the basis of its URL, several additional checks were done by the blog detection module based on several other characteristics, most importantly the presence of date-entry combinations. Pages judged to be a blog and those judged not to be even though the URL looked promising, were consequently stored separately. Blogs were stored in html, clean text and single entry (text) formats. For non-blogs only the html was stored to conserve space while still allowing the documents to be fully analysed post-crawling.</Paragraph>
      <Paragraph position="2"> Using this system, 227.380 blog- and 285.337 non-blog pages (often several pages were gathered from the same blog, so the actual number of blogs gathered is significantly lower) were gathered in the period from July 7 until November 3, 2005. This amounts to roughly 30Gb of HTML and text, and includes blogs from all the well-known blog sites as well as personal hand-written blogs and in many different languages.</Paragraph>
      <Paragraph position="3"> The blog detection module in the crawler was used purely for the purpose of filtering out URLs and webpages that bear no resemblence to a blog. By performing this pre-classification, we were able to gather a dataset containing only blogs and pages that in appearance closely resemble blogs so that our dataset contained both positive examples and useful negative examples.</Paragraph>
      <Paragraph position="4"> This approach should force the machine learner to make a clear distinction between blogs and non-blogs. However, even though this data was pre-classified by our baseline, we treat it as unlabeled data in our experiments and make no further use of this pre-classification whatsoever.</Paragraph>
      <Paragraph position="5"> For our resampling experiments, we randomly divided the large dataset into small subsets containing 1000 instances, one for each iteration.</Paragraph>
      <Paragraph position="6"> This figure ensures that the training set grows at a reasonable rate at every iteration while preventing the training set from becoming too large too quickly which would mean a lot of unlabeled  instances being labeled on the basis of very few labeled instances and the model building process would take too long after only a few iterations.</Paragraph>
      <Paragraph position="7"> For training and test data we turned back to our manually annotated dataset used previously.</Paragraph>
      <Paragraph position="8"> Of this set, 100 instances were used for the initial training and the remaining 101 for testing.</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
4.2 Experimental setup: bootstrapping
</SectionTitle>
      <Paragraph position="0"> Generally, bootstrapping is an iterative process where at every iteration unlabeled data is labeled using predictions made by the learner model based on the previously available training set (Jones et al., 1999). These newly labeled instances are then added to the training set and the whole process repeats. Our expectation was that the increase in available training instances should improve the algorithm's accuracy, especially as it proved quite accurate to begin with so the algorihm's predictions should prove quite reliable.</Paragraph>
      <Paragraph position="1"> For this experiment we used the best performing algorithm from Section 3, the SMO support-vector based algorithm. The bootstrapping method is applied to this problem as follows: - Initialisation: use the training set containing 100 manually annotated instances to predict the labels of the first subset of 1000 unlabeled instances.</Paragraph>
      <Paragraph position="2"> - Iterations: Label the unlabeled instances according to the algorithm's prediction and add these instances to the previous training set to form a new training set.</Paragraph>
      <Paragraph position="3"> Build a new model based on the new training set and use it to predict the labels of the next subset.</Paragraph>
    </Section>
    <Section position="3" start_page="27" end_page="27" type="sub_section">
      <SectionTitle>
4.3 Results: bootstrapping
</SectionTitle>
      <Paragraph position="0"> We now present the results of our experiment using normal bootstrapping. After every iteration, the model built by the learner was tested on our manually annotated test set.</Paragraph>
      <Paragraph position="1">  strapping.</Paragraph>
      <Paragraph position="2"> After 36 iterations, the experiment was halted as there was clearly no more gain to be expected from any further iterations. Clearly, ordinary bootstrapping does not offer any advantages for our binary blog classification problem. Also, the availability of larger amounts of training instances does nothing to improve results as the results are best using only the very small training set.</Paragraph>
      <Paragraph position="3"> Generally, both precision and recall slowly decrease as the training set grows, showing that classifier accuracy as a whole declines. However, recall of instances with class label &amp;quot;no&amp;quot; (nonblogs) remains constant throughout. Clearly the classifier is able to easily detect non-blog pages on the basis of the attributes provided, and is thwarted only by a small number of outliers. This can be explained by the fact that the learner recognizes non-blogs mostly on the basis of the first few attributes having zero values (nrOfPosts, minPostLength, maxPostLength etc.). The outliers consistently missed by the classifier are probably blog-like pages in which date-entry combinations have been found but which nevertheless have been manually classified as nonblogs. Examples of this are calendar pages commonly associated with blogs (but which do not contain blog content), or MSN Space pages on which the user is using the photo album but hasn't started a blog yet. In this case the page is recognized as a blog, but contains no blog content and is therefore manually labeled a nonblog. null</Paragraph>
    </Section>
    <Section position="4" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
4.4 Experimental setup: co-training
</SectionTitle>
      <Paragraph position="0"> As mentioned in Section 2, we will use the predictions of several of the most successful learning algorithms from Section 3 as our indicators  in this experiment. The goal of our co-training experiment is to take unanimous predictions from the three best performing algorithms from Section 3, and use those predictions, which we assume to have a very high degree of confidence, to bootstrap the training set. We will then test to see if it offers an improvement over the SMO algorithm by itself. By unanimous predictions we mean the predictions of those instances, on which all the algorithms agree unanimously after they have been allowed to predict labels using their respective models.</Paragraph>
      <Paragraph position="1"> As instances for which the predictions are unanimous can be reasoned to have a very high level of confidence, the predictions for those instances are almost certainly correct. Therefore we expect this method to offer substantial improvements over any single algorithm as it potentially yields a very large number of correctly labeled instances for the learner to train on.</Paragraph>
      <Paragraph position="2">  tation of the co-training method.</Paragraph>
      <Paragraph position="3"> We chose to adapt the co-training idea in this fashion as we believe it to be a good way of radically reducing the fuzziness of potential predictions and a way to gain a very high degree of confidence in the labels attached to previously unlabeled data. Should the algorithms disagree on a large number of instances there would still not be a problem as we have a very large pool of unlabeled instances (133.000, we only used part of our corpus for our experiments as our dataset was so large that there was no need to use all the data available). The potential maximum of 133 iterations should prove quite sufficient even if the growth of the training set per iteration proves to be very small.</Paragraph>
      <Paragraph position="4"> The algorithms we chose for this experiment were SMO (support vector), J48 (decision tree, a C4.5 implementation) and Jrip (rule based). We chose not to use nearest neighbour algorithms for this experiment even though they performed well individually as we feared it would prove a less successful approach given the large training set sizes. Indeed, an earlier experiment done during our blog classification research showed the performance of near neighbour algorithms bottomed out very quickly so no real improvement can be expected from those algorithms given larger training sets and given the unanimous nature of this method of co-training it may spoil any gain that might otherwise be achieved.</Paragraph>
      <Paragraph position="5"> The process started with the manually annotated training set and used the predictions from the three algorithms, for unlabeled instances they agree unanimously on, to label those instances.</Paragraph>
      <Paragraph position="6"> Those instances were subsequently added to the trainingset and using this new trainingset, a number of the instances in another unlabeled set (1000 instances per set) were to be labeled (again, only those instances on which the algorithms agree unanimously). Once again, those instances are added to the training set and so on and so forth for as many iterations as possible.</Paragraph>
    </Section>
    <Section position="5" start_page="28" end_page="28" type="sub_section">
      <SectionTitle>
4.5 Results: co-training
</SectionTitle>
      <Paragraph position="0"> We now turn to the results of our experiment using our unanimous co-training method described above. The experiment was halted after 30 iterations, as Weka ran out of memory. The experiment was not re-run with altered memory settings as it was clear that no more gain was to be expected by doing so. Again, testing after each iteration was performed by building a model using the SMO support-vector learning algorithm and testing classifier accuracy on the manually annotated test set.</Paragraph>
      <Paragraph position="1">  mous co-training method.</Paragraph>
      <Paragraph position="2"> Even though the &amp;quot;steps&amp;quot; in test percentages shown represent only one more blog being classified correctly (or incorrectly), the classifier does perform better than it did using only the manually annotated training set at some stages of the experiment. This means that gains in classifier accuracy can be achieved by using this method of co-training on this problem. Also the classifier generally performs better than in our bootstrapping experiment, which shows that the instances unanimously agreed on by all three algorithms are certainly more reliable than the predictions of even the best algorithm by itself, as predicted.</Paragraph>
      <Paragraph position="3"> Clearly this method offers potential for an improvement even though the SMO algorithm was already very accurate in our first binary blog classification experiment.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="28" end_page="28" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> As the title suggests, these experiments are of a preliminary and exploratory nature. The high accuracy achieved by almost all algorithms in our binary classification experiment show that our attribute set clearly defines the subject well.</Paragraph>
    <Paragraph position="1"> However, these results must be viewed with an air of caution as they were obtained using a small subset and as such the data may not represent the nature of the complete dataset well. Indeed, how stable are the results obtained? Later experiments using a (disjoin, but) larger manually annotated dataset containing 700 instances show that the results obtained here are optimistic. The extremely diverse nature of the blogosphere means that describing an entire dataset using a relatively small subset is very difficult and as such both the performance and ranking of off-the-shelf machine learning algorithms will vary among different datasets. Off-the-shelf algorithms do however still perform far better than our baseline and the best performing algorithms still achieve accuracy rates in excess of 90%.</Paragraph>
    <Paragraph position="2"> Two aspects of our attribute set that need to be worked on in future are date detection and content checks. Outliers are almost always caused by the date detection algorithm not detecting certain date formats, and pages containing date-entry combinations but no real blog content. Therefore, although it is possible to perform binary blog classification based purely on the particular characteristics of blog pages with high accuracy, content checks are invaluable. The rise of blogspam, which cannot be separated from real blogs on the basis of page characteristics at all, further emphasises this. We have already developed a document frequency profile and replaced the contentType attribute used in these experiments, to extend the content-based attributes in our dataset and hopefully improve blog recognition.</Paragraph>
  </Section>
class="xml-element"></Paper>