<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3310">
  <Title>Exploring Text and Image Features to Classify Images in Bioscience Literature</Title>
  <Section position="4" start_page="74" end_page="76" type="metho">
    <SectionTitle>
3 Image Classification
</SectionTitle>
    <Paragraph position="0"> We explored supervised machine-learning methods to automatically classify images according to our image taxonomy (Table 1). Since it is straightforward to distinguish table separately by applying surface cues (e.g., &amp;quot;Table&amp;quot; and &amp;quot;Figure&amp;quot;), we have decided to exclude it from our experiments.</Paragraph>
    <Section position="1" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
3.1 Support Vector Machines
</SectionTitle>
      <Paragraph position="0"> We explored supervised machine-learning systems using Support Vector Machines (SVMs) which have shown to out-perform many other supervised machine-learning systems for text categorization tasks (Joachims, 1998). We applied the freely available machine learning MATLAB package The Spider to train our SVM systems (Sable and Weston, 2005; MATLAB). The Spider implements many learning algorithms including a multi-class SVM classifier which was used to learn our discriminative classifiers as described below in section 3.4.</Paragraph>
      <Paragraph position="1"> A fundamental concept in SVM theory is the projection of the original data into a high-dimensional space in which separating hyperplanes can be found. Rather than actually doing this projection, kernel functions are selected that efficiently compute the inner products between data in the high-dimensional space. Slack variables are introduced to handle non-separable cases and this requires an upper bound variable, C.</Paragraph>
      <Paragraph position="2"> Our experiments considered three popular kernel function families over five different variants and five different values of C. The kernel function implementations are explained in the software documentation. We considered kernel functions in the forms of polynomial, radial basis function, and Gaussian. The adjustable parameter for polynomial functions is the order of the polynomial. For radial basis function and Gaussian functions, sigma is the adjustable parameter. A grid search was performed over the adjustable parameter for values 1 to 5 and for values of C equal to [10^0, 10^1, 10^2, 10^3, 10^4].</Paragraph>
    </Section>
    <Section position="2" start_page="74" end_page="74" type="sub_section">
      <SectionTitle>
3.2 Text Features
</SectionTitle>
      <Paragraph position="0"> Previous work in the context of newswire image classification show that text features in image captions are efficient for image categorization (Sable, 2000, 2002, 2003). We hypothesize that image captions provide certain lexical cues that efficiently represent image content. For example, the words &amp;quot;diameter&amp;quot;, &amp;quot;gene-expression&amp;quot;, &amp;quot;histogram&amp;quot;, &amp;quot;lane&amp;quot;, &amp;quot;model&amp;quot;, &amp;quot;stained&amp;quot;, &amp;quot;western&amp;quot;, etc are strong indicators for image classes and therefore can be used to classify an image into categories.</Paragraph>
      <Paragraph position="1"> The features we explored are bag-of-words and n-grams from the image captions after processing the caption text by the Word Vector Tool (Wurst).</Paragraph>
    </Section>
    <Section position="3" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
3.3 Image Features
</SectionTitle>
      <Paragraph position="0"> We also investigated image features for the tasks of image classification. We started with four types of image features that include intensity histogram features, edge-direction histogram features, edge-based axis features, and the number of 8-connected regions in the binary-valued image obtained from thresholding the intensity.</Paragraph>
      <Paragraph position="1"> The intensity histogram was created by quantizing the gray-scale intensity values into the range 0255 and then making a 256-bin histogram for these values. The histogram was then normalized by dividing all values by the total sum. For the purpose of entropy calculations, all zero values in the histogram are set to one. From this adjusted, normalized histogram, we calculated the total entropy as the sum of the products of the entries with their logarithms. Additionally, the mean, 2nd moment, and 3rd moment are derived. The combination of the total entropy, mean, 2nd, and 3rd moments constitute a robust and concise representation of the image intensity.</Paragraph>
      <Paragraph position="2"> Edge-Direction Histogram (Jain and Vailaya, 1996) features may help distinguish images with predominantly straight lines such as those found in graphs, diagrams, or charts from other images with more variation in edge orientation. The EDH begins by convolving the gray-scale image with both  tor finds vertical gradients while the other finds horizontal gradients. The inverse tangent of the ratio of the vertical to horizontal gradient yields continuous orientation values in the range of -pi to +pi. These values are subsequently converted into degrees in the range of 0 to 179 degrees (we consider 180 and 0 degrees to be equal). A histogram is counted over these 180 degrees. Zero values in the histogram are set to one in order to anticipate entropy calculations and then the modified histogram is normalized to sum to one. Finally, the total  entropy, mean, 2nd and 3rd moments are extracted to summarize the EDH.</Paragraph>
      <Paragraph position="3"> The edge-based axis features are meant to help identify images containing graphs or charts. First, Sobel edges are extracted above a sensitivity threshold of 0.10 from the gray-scale image. This yields a binary-valued intensity image with 1's occurring in locations of all edges that exceed the threshold and 0's occurring otherwise. Next, the vertical and horizontal sums of this intensity image are taken yielding two vectors, one for each axis.</Paragraph>
      <Paragraph position="4"> Zero values are set to one to anticipate the entropy calculations. Each vector is then normalized by dividing each element by its total sum. Finally, we find the total entropy, mean, 2nd , and 3rd moments to represent each axis for a total of eight axis features.</Paragraph>
      <Paragraph position="5"> The last image feature under consideration was the number of 8-connected regions in the binaryvalued, thresholded Sobel edge image as described above for the axis features. An 8-connected region is a group of edge pixels for which each member touches another member vertically, horizontally, or diagonally in the eight adjacent pixel positions surrounding it. The justification for this feature is that the number of solid regions in an image may help separate classes.</Paragraph>
      <Paragraph position="6"> A preliminary comparison of various combinations of these image features showed that the intensity histogram features used alone yielded the best classification accuracy of approximately 54% with a quadratic kernel SVM using an upper slack limit of C = 10^4.</Paragraph>
    </Section>
    <Section position="4" start_page="75" end_page="76" type="sub_section">
      <SectionTitle>
3.4 Fusion
</SectionTitle>
      <Paragraph position="0"> We integrated both image and text features for the purpose of image classification. Multi-class SVM's were trained separately on the image features and the text features. A multi-class SVM attempts to learn the boundaries of maximal margin in feature space that distinguishes each class from the rest.</Paragraph>
      <Paragraph position="1"> Once the optimal image and text classifiers were found, they were used to process a separate set of images in the fusion set. We extracted the margins from each data point to the boundary in feature space.</Paragraph>
      <Paragraph position="2"> Thus, for a five-class classifier, each data point would have five associated margins. To make a fair comparison between the image-based classifier and the text-based classifier, the margins for each data point were normalized to have unit magnitude.</Paragraph>
      <Paragraph position="3"> So, the set of five margins for the image classifier constitutes a vector that then gets normalized by dividing each element by its L2 norm. The same is done for the vector of margins taken from the text classifier. Finally, both normalized vectors are concatenated to form a 10-dimensional fusion vector. To fuse the margin results from both classifiers, these normalized margins were used to train another multi-class SVM.</Paragraph>
      <Paragraph position="4"> A grid search through parameter space with cross validation identified near-optimal parameter settings for the SVM classifiers. See Figure 6 for our system flowchart.</Paragraph>
      <Paragraph position="5">  We randomly selected a subset of 554 figure images from the total downloaded image pool. One author of this paper is a biologist who annotated figures under five classes; namely, Gel_Image (102), Graph (179), Image_of_Thing (64), Mix (106), and Model (103).</Paragraph>
      <Paragraph position="6"> These images were split up such that for each category, roughly a half was used for training, a quarter for fusion, and a quarter for testing (see  fiers for the image-based and text-based features. The fusion set was used to train a classifier on top of the results of the image-based and text-based classifiers. The testing set was used to evaluate the final classification system.</Paragraph>
      <Paragraph position="7"> For each division of data, 10 folds were generated. Thus within the training and fusion data sets, there are 10 folds which each have a randomized partitioning into 90% for training and 10% for testing. The testing data set did not need to be partitioned into folds since all of it was used to test the final classification system. (See Figure 8).</Paragraph>
      <Paragraph position="8"> In the 10-fold cross-validation process, a classifier is trained on the training partition and then measured for accuracy (or error rate) on the testing partition. Of the 10 resulting algorithms, the one which performs the best is chosen (or just one which ties for the best accuracy).</Paragraph>
    </Section>
    <Section position="5" start_page="76" end_page="76" type="sub_section">
      <SectionTitle>
3.6 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> We report the widely used recall, precision, and F-score (also known as F-measure) as the evaluation metrics for image classification. Recall is the total number of true positive predictions divided by the total number of true positives in the set (true pos + false neg). Precision is the fraction of the number of true positive predictions divided by the total number of positive predictions (true pos + false pos). F-score is the harmonic mean of recall and precision equal to (C. J. van Rijsbergen, 1979):</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="76" end_page="77" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> Table 2 shows the Confusion Matrix for the image feature classifier obtained from the testing part of the training data. The actual categories are listed vertically and predicted categories are listed horizontally. For instance, of 26 actual GEL images, 18 were correctly classified as GEL, 4 were mis-classified as GRAPH, 2 as IMAGE_OF_THING, 0 as MIX, and 2 as MODEL.</Paragraph>
    <Paragraph position="1">  A near-optimal parameter setting for the classifier based on image features alone used a polynomial kernel of order 2 and an upper slack limit of C = 10^4. Table 3 shows the performance of image classification with image features. True Positives,  cording to the F-score scores, this classifier does best on distinguishing IMAGE_OF_THING images. The overall accuracy = sum of true positives / total number of images = (18+39+12+3+3)/138 = 75/138 = 54%. This can be compared with the baseline of (3+39+1+1)/138 = 32% if all images  were classified as the most popular category, GRAPH. Clearly, the image-based classifier does</Paragraph>
    <Section position="1" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
Text Classifier
</SectionTitle>
      <Paragraph position="0"> The text-based classifier excels in finding GEL, GRAPH, and IMAGE_OF_THING images. It achieves an accuracy of (22+36+11+12+14)/138 = 95/138 = 69%.</Paragraph>
      <Paragraph position="1"> A near-optimal parameter setting for the fusion classifier based on both image features and text</Paragraph>
    </Section>
    <Section position="2" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
Fusion Classifier
</SectionTitle>
      <Paragraph position="0"> From Table 7, it is apparent that the fusion classifier does best on IMAGE_OF_THING and also performs well on GEL and GRAPH. These are substantial improvements over the classifiers that were based on image or text feature alone. Average F-scores and accuracies are summarized below in Table 8.</Paragraph>
      <Paragraph position="1"> The overall accuracy for the fusion classifier = sum of true positives / total number of image =</Paragraph>
      <Paragraph position="3"> can be compared with the baseline of 44/138 = 32% if all images were classified as the most popular category, GRAPH.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="77" end_page="78" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> It is not surprising that the most difficult category to classify is Mix. This was due to the fact that Mix images incorporate multiple categories of other image types. Frequently, one other image type that appears in a Mix image dominates the image features and leads to its misclassification as the other image type. For example, Figure 9 shows that a Mix image was misclassified as Gel_Image.</Paragraph>
    <Paragraph position="1"> This mistake is forgivable because the image does contain sub-images of gel-images, even though the entire figure is actually a mix of gel-images and diagrams. This type of result highlights the overlap between classifications and the difficulty in defining exclusive categories.</Paragraph>
    <Paragraph position="2"> For both misclassifications, it is not easy to state exactly why they were classified wrongly based on their image or text features. This lack of  intuitive understanding of discriminative behavior of SVM classifiers is a valid criticism of the technique. Although generative machine learning methods (such as Bayesian techniques or Graphical Models) offer more intuitive models for explaining success or failure, discriminative models like SVM are adopted here due to their higher performance and ease of use.</Paragraph>
    <Paragraph position="3"> Figure 10 shows an example of a MIX figure that was mislabeled by the image classifier as GRAPH and as GEL_IMAGE by the text classifier. However, it was correctly labeled by the fusion classifier. This example illustrates the value of the fusion classifier for being able to improve upon its component classifiers.</Paragraph>
  </Section>
  <Section position="7" start_page="78" end_page="78" type="metho">
    <SectionTitle>
6 Conclusions
</SectionTitle>
    <Paragraph position="0"> From the comparisons in Table 8, we see that fusing the results of classifiers based on text and image features yields approximately 5% improvement over the text -based classifier alone with respect to both average F-score and Accuracy.</Paragraph>
    <Paragraph position="1"> In fact, the F-score improved for all categories except for MODEL which experienced a 6% drop.</Paragraph>
    <Paragraph position="2"> The natural conclusion is that the fusion classifier combines the classification performance from the text and image classifiers in a complementary fashion that unites the strengths of both.</Paragraph>
  </Section>
class="xml-element"></Paper>