<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1019">
  <Title>Exploring the Use of Linguistic Features in Domain and Genre Classification</Title>
  <Section position="8" start_page="146" end_page="146" type="evalu">
    <SectionTitle>
5.4.2 Results
</SectionTitle>
    <Paragraph position="0"> The results on the larger corpus differ substantially from that on the smaller corpus. It is far easier to determine if a text belongs to one of the three major domains covered in a corpus than to assign a text to a minor domain which covers only 4% of the complete corpus. If the class itself is not considerably more homogeneous (with respect to the classifier used) than the rest of the corpus, this will be a difficult task indeed. Our results suggest that the classes were indeed not homogeneous enough to ensure reliable classification. The reason for this is that LIMAS was designed to be as representative as possible, and consequently to be as heterogeneous as possible. This explains why we never achieved 100% precision and recall on any data set again. In fact, results became much worse, and varied a tot depending mainly on the type of classifier and the task. Again, if classes are very inhomogeneous, any change in the way similarity between data items is computed can have strong effects on the composition of the neighbourhood, and the erratic behaviour observed here is a vivid testimony of this. We therefore chose not to present general summaries, but to document some typical patterns of variation.</Paragraph>
    <Paragraph position="1"> Parameter settings: LVQ gives best results in terms of both precision and recall for even initialisation of codebook vectors, which makes sense because the number of positive examples has now become rather small in comparison to the rest of the corpus. A good codebook size appears to be  gories H and S, 50 codebook vectors, even initialization. null For RIBL, restricting the size of the relevant neighbourhood to 1 or 2 gives by far the best results in terms of both precision and recall, but not in terms of accuracy - the negative effect of false positives is too strong.</Paragraph>
    <Paragraph position="2"> IBL is also sensitive to the size of the neighbourhood; again, precision and recall are highest for k--1. For this size, incorporating information gain into the distance measure leads to a clear decrease in performance.</Paragraph>
    <Paragraph position="3"> Overall performance: Unsurprisingly, performance in terms of precision and recall is rather poor. Average LVQ performance under the best parameter settings in terms of precision and recall only improves on the baseline for two genres: H (baseline 78%, accuracy for feature set WSPOS 88%) and FL (feature sets CONT and CONTPOS, baseline 94%, accuracy 95%). Under matched conditions (same genre, same feature set, same number of features, optimal settings), IBL and RIBL both perform significantly worse than LVQ, which can interpolate between data points and so smooth out at least some of the noise. For example, IBL accuracy on task H is 69,1% for both WS and WSPOS, while accuracy on FL never much exceeds 92% and thus remains just below baseline.</Paragraph>
    <Paragraph position="4"> RIBL performs best on FL for condition CWPOS, but even then accuracy is only 90%.</Paragraph>
    <Paragraph position="5"> Size of Feature Vector: The number of features used did not significantly affect the performance of IBL. For LVQ, both precision and recall decrease sharply as the number of features increases (average precision for 50 lemma features 29.5%, for 200 24.8%; average recall for 50 9.1%, for 200 7.1%). But this was not the case for all genres, as Tab. 3 shows. The categories H and S are chosen for comparison because they are the largest. For H, the precision under conditions CW and CWPOS decreases, all others increase; for S, it is exactly the other way around.</Paragraph>
    <Paragraph position="6"> Composition of feature vectors: Another lesson of Tab. 3 is that the effect of the composition of the feature vectors can vary depending both on the task and on the size of the feature vector. The dramatic fall in precision for condition FWPOS, category S, shows that very clearly. Here, additional function word information has blurred the class boundaries, whereas for H, it has sharpened them considerably. Because of the large amount of noise in the results, we would be very hesitant to identify any condition as optimal or indeed claim that our hypotheses about the role of POS information or content vs. function words could be verified. However, what these results do confirm is that sometimes, comparing different representations might well pay off, as we have seen in the case of task H, where WSPOS indeed emerges as optimal feature set choice.</Paragraph>
  </Section>
class="xml-element"></Paper>