<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1019"> <Title>Exploring the Use of Linguistic Features in Domain and Genre Classification</Title>
<Section position="4" start_page="142" end_page="143" type="metho"> <SectionTitle> 3 The LIMAS corpus of German </SectionTitle>
<Paragraph position="0"> Since our focus is on genre detection, we decided not to use common benchmark collections such as Reuters and OHSUMED, because they are rather homogeneous with respect to genre.</Paragraph>
<Paragraph position="1"> LIMAS is a comprehensive corpus of contemporary written German, modelled on the Brown corpus (Kučera and Francis, 1967) and collected in the early 1970s. It consists of 500 sources with around 2000 words each. It has been completely tagged with POS tags using the MALAGA system (Beutel, 1998). MALAGA is based on the STTS tagset for German, which consists of 54 categories (Schiller et al., 1995). The corpus has already been used for text classification by (von der Grün, 1999).</Paragraph>
<Paragraph position="2"> Since the corpus is rather heterogeneous, we defined two sets of tasks, one based on the full corpus (CL), the other based on all texts from the categories law, politics, and economy (LPE; 104 sources in all). In the LPE experiments, the emphasis was on searching for good parameters for the various learning algorithms as well as on the contribution of POS and punctuation information to classification accuracy. The experiments on the complete corpus, on the other hand, focus more on the composition of the feature vectors.</Paragraph>
<Section position="1" start_page="143" end_page="143" type="sub_section"> <SectionTitle> 3.1 Genre Classes </SectionTitle>
<Paragraph position="0"> LIMAS is based on the 33 main categories of the Deutsche Bibliographie (German bibliography). Each of the bibliography's categories is represented according to its frequency in the texts published in 1970/1971, so that the corpus can be considered representative of the written German of that time (Bergenholtz and Mugdan, 1989).</Paragraph>
<Paragraph position="1"> Furthermore, the corpus designers took care to cover a wide range of genres within each subcategory. As a result, groups of more than 10 documents taken from LIMAS will be rather heterogeneous. For example, press reports can be taken from broadsheets or tabloids, and they can be commentaries, news reports, or reviews of cultural events.</Paragraph>
<Paragraph position="2"> Many of the main categories correspond to domains such as &quot;mathematics&quot; or &quot;history&quot;. Although not evident from the category label, genre distinctions can also be quite important for domain classification, because some domains have developed specific genres for communication within the associated community. There are three such domain categories in our experiments: politics (P), law (L), and economy (E). Two further categories are academic texts from the humanities (H) and from the field of science and technology (S). In the LPE corpus, this distinction is collapsed into &quot;academic&quot; (A), the set of all scholarly texts in the corpus. Four categories are based on genre only: on the one hand, press texts (N) and, more specifically, NH, press texts from high-quality broadsheets and magazines; on the other hand, fiction (F) and FL, a low-quality subset of F.
For LPE, we defined a category D consisting of articles from quality broadsheets. Table 1 gives an overview of the categories and the number of documents in each category for each corpus.</Paragraph>
<Paragraph position="3"> In all subsequent experiments, we assume as baseline the classification accuracy which we get when all documents are assigned to the majority class, i.e. when each document is judged not to belong to the category in question. The baselines are specified in Tab. 1.</Paragraph> </Section> </Section>
<Section position="6" start_page="143" end_page="144" type="metho"> <SectionTitle> 4 Validating the Features </SectionTitle>
<Paragraph position="0"> If the frequency of POS features does not vary significantly between categories, adding such information increases both the random variation in the data and its dimensionality. To check for this, we conducted a series of non-parametric tests on CL for each POS tag.</Paragraph>
<Paragraph position="1"> In addition, binary classification trees were grown on the complete set of documents for each category, and the structure of the tree was subsequently examined. Classification trees basically represent an ordered series of tests. Each tree node corresponds to one test, and the order of the tests is specified by the tree's branches. All tests are binary. The outcome of a test higher up in the tree determines which test to perform next.</Paragraph>
<Paragraph position="2"> A data item which reaches a leaf is assigned the class of the majority of the items which reached that leaf during training. The trees were grown using recursive partitioning; the splitting criterion was reduction in deviance. Using the Gini index led to larger trees and higher misclassification rates.</Paragraph>
<Paragraph position="3"> Since the primary purpose of the trees was not prediction of unseen data but analysis of seen data, they were not pruned. There were no separate test sets.</Paragraph>
<Paragraph position="4"> We tested, for 12 categories and all STTS POS tags, whether the distribution of a tag differs significantly between documents in a given category and documents not in that category. These categories consist of the nine defined in Sec. 3 plus the content-based domains history (HI) and religion (R), and texts from tabloids and similar publications (NL).</Paragraph>
<Paragraph position="5"> Choice of Feature Values: The value of a feature is its relative frequency in a given text. The frequencies were standardised using z-scores, so that the resulting random variables have a mean of 0 and a variance of 1. The z-scores were rounded down to the next integer, so that all features whose frequency does not deviate greatly from the mean have a value of 0. Z-scores were computed on the basis of all documents to be compared.</Paragraph>
<Paragraph position="6"> This makes sense if we view style as deviation from a default, and such defaults should be computed relative to the complete corpus of documents used, not relative to specific classification tasks.</Paragraph>
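As a concrete illustration of this feature construction, the following minimal sketch (Python) computes floored z-score values for one POS feature and tests whether the tag distinguishes a category. It is a reconstruction, not the authors' code: the paper does not name its non-parametric test, so the Mann-Whitney U test stands in here, and all identifiers (feature_values, tag_is_distinctive, rel_freqs, in_category) are hypothetical.

import numpy as np
from scipy.stats import mannwhitneyu

def feature_values(rel_freqs):
    # rel_freqs: relative frequencies of one POS tag, one entry per document.
    # Standardise over all documents to be compared, then round down to the
    # next integer, so that values close to the mean collapse to 0.
    z = (rel_freqs - rel_freqs.mean()) / rel_freqs.std()
    return np.floor(z)

def tag_is_distinctive(rel_freqs, in_category, alpha=0.05):
    # Non-parametric two-sample test: does the tag's distribution differ
    # between documents inside and outside the category? (The choice of
    # test is an assumption; the paper does not specify one.)
    _, p = mannwhitneyu(rel_freqs[in_category], rel_freqs[~in_category],
                        alternative="two-sided")
    return p < alpha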
<Paragraph position="7"> Results: In general, only 7 of all 54 tags show significant differences in distribution for more than half of the categories, and the actual differences are far smaller than a standard deviation. However, for most tasks, there are at least 15 POS tags with characteristic distributions, so that including POS frequency information might well be beneficial.</Paragraph>
<Paragraph position="8"> The four most important content word classes are VVFIN (finite forms of full verbs), NN (nouns), ADJD (adverbial adjectives), and ADJA (attributive adjectives). Importance is measured by the number of significant differences in distribution. A higher incidence of VVFIN characterises F, FL, and NL, whereas texts from academia or about politics and law show significantly fewer VVFIN. The difference between the means is around 0.2 for F and FL, and below 0.1 for the rest. (Numbers relate to the z-scores.)</Paragraph>
<Paragraph position="9"> Note that we cannot claim that more VVFIN means fewer nouns (NN): scholarly texts show both fewer VVFIN and fewer NN than the rest of the corpus. For adjectives, we find that academic texts are significantly richer in ADJA (differences between 0.02 and 0.04), while FL contains more adverbial adjectives (difference 0.04).</Paragraph>
<Paragraph position="10"> But function words can be equally important indicators, especially personal pronouns, which are usually part of the stop word list. They are significantly less frequent in academic texts and in the categories E, L, NH, and P, and more frequent in fiction, NL, and R. Again, all differences are at or below 0.1. A lower frequency of personal pronouns can indicate both less interpersonal involvement and shorter reference chains.</Paragraph>
<Paragraph position="11"> Other valuable categories are, for example, pronominal adverbs (PAV) and infinitives of auxiliary verbs (VAINF), where the difference between the means usually lies between 0.2 and 0.4 for significant differences. (We restrict ourselves to discussing these in more detail for reasons of space.) Pronominal adverbs such as &quot;deswegen&quot; (because of this) are especially frequent in texts from law and science, both of which tend to contain texts of argumentative types. The frequency of infinitives of auxiliaries reflects both the use of passive voice, which is formed with the auxiliary &quot;werden&quot; in German, and the use of present perfect or pluperfect tense (auxiliary &quot;haben&quot;). In this corpus, texts from the domains of law and economy contain more VAINF than others.</Paragraph>
<Paragraph position="12"> The potential meaning of common punctuation marks is quite clear: the longer the sentences an author constructs, the fewer full stops and the more commas and subordinating conjunctions we find. However, the frequency of full stops is distinctive for only four categories: L, E, and H have significantly fewer full stops, while NL has significantly more. We also find significantly more commas in fiction than in non-fiction. Possible sources for this are infinitive clauses and lists of adjectives.</Paragraph>
<Paragraph position="13"> With regard to the trees, we examined only those splits that actually discriminate well between positive and negative examples, with less than 40% false positives or negatives. We will not present our analyses in detail, but illustrate the type of information provided by such trees with the category F. For this category, PPER, KOMMA, PTKZU (&quot;to&quot; before infinitive), PTKNEG (negation particle), and PWS (substituting interrogative pronoun) discriminate well in the tree. In the case of PTKZU and PTKNEG, this difference in distribution is conditional: it was not observed in the significance tests and surfaced only through the tree experiments.</Paragraph>
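The tree analysis can be reproduced along the following lines. This is a sketch under stated assumptions, not the authors' setup: the paper does not name its tree software, so scikit-learn's DecisionTreeClassifier stands in, with criterion="entropy" approximating splitting by reduction in deviance; X and y are placeholders for the floored z-score vectors and binary category labels.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.integers(-2, 3, size=(500, 54)).astype(float)  # placeholder: floored z-scores, one column per STTS tag
y = rng.integers(0, 2, size=500)                       # placeholder: in-category vs. not

# Unpruned tree, as in the analysis above; entropy reduction is used as an
# approximation of reduction in deviance as the splitting criterion.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=[f"tag_{i}" for i in range(54)]))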
</Section>
<Section position="7" start_page="144" end_page="146" type="metho"> <SectionTitle> 5 Text Categorisation Experiments </SectionTitle>
<Paragraph position="0"> For our categorisation experiments, we chose a relational k-nearest-neighbour (k-NN) classifier, RIBL (Emde and Wettschereck, 1996; Bohnebeck et al., 1998), and two feature-based k-NN algorithms, learning vector quantisation (LVQ) (Kohonen et al., 1996) and IB1(-IG) (Daelemans et al., 1997; Aha et al., 1991). The reason for choosing k-NN-based approaches is that they have been very successful in text categorisation (Yang, 1997).</Paragraph>
<Paragraph position="1"> We first ran the experiments on the LPE corpus; these were mainly exploratory in character. We then moved on to the complete corpus.</Paragraph>
<Paragraph position="2"> In the LPE experiments, we distinguished six feature sets: CW, CWPOS, CWPP, WS, WSPOS, and WSPP, where CW stands for content word lemmata, WS for all lemmata, POS for POS information, and PP for POS and punctuation information. In the CL experiments, we controlled not for the potential contribution of punctuation features to the results, but for the type of lemma from which the features were derived. We again explored six feature sets, CW, CWPOS, WS, WSPOS, FW, and FWPOS, where FW stands for function word lemmata. Punctuation was included in the conditions WS, WSPOS, FW, and FWPOS, but not in CW and CWPOS. In addition to feature type, we also varied the length of the feature vectors. In the following subsections, we outline our general method for feature selection and evaluation and give a brief description of the algorithms used. We then report on the results of the two suites of experiments.</Paragraph>
<Section position="1" start_page="145" end_page="145" type="sub_section"> <SectionTitle> 5.1 Feature Selection </SectionTitle>
<Paragraph position="0"> The set of all potential features is large: there are more than 29000 lemmata in the LPE corpus, and more than 80000 in the full corpus.</Paragraph>
<Paragraph position="1"> In a first step, we excluded for the LPE corpus all lemmata occurring fewer than 5 times in the texts, and for the CL corpus all lemmata occurring in fewer than 10 sources, which left us with 4857 lemmata for LPE and 5440 lemmata and punctuation marks for CL. We then determined the relevance of each of these lemmata for a given classification task by their gain ratio (Yang and Pedersen, 1997). From this ranked list of lemmata, we constructed the final feature sets.</Paragraph> </Section>
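To make the ranking step concrete, here is a minimal sketch of a gain-ratio computation for one discretised feature against binary category labels. It is a generic reconstruction of the measure (cf. Yang and Pedersen, 1997), not the authors' code; entropy, gain_ratio, feature, and labels are hypothetical names, and the inputs are assumed to be numpy arrays.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature, labels):
    # Information gain of the feature with respect to the labels,
    # normalised by the feature's own entropy (its split information).
    total = entropy(labels)
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    conditional = sum(w * entropy(labels[feature == v])
                      for v, w in zip(values, weights))
    split_info = float(-(weights * np.log2(weights)).sum())
    return (total - conditional) / split_info if split_info > 0 else 0.0

# Rank all candidate lemmata by gain ratio and keep the top n as features:
# ranked = sorted(range(F.shape[1]), key=lambda j: gain_ratio(F[:, j], y), reverse=True)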
<Section position="2" start_page="145" end_page="145" type="sub_section"> <SectionTitle> 5.2 The Algorithms </SectionTitle>
<Paragraph position="0"> RIBL: RIBL is a k-NN classification algorithm where each object is represented as a set of ground facts, which makes encoding highly structured data easier. The underlying first-order logic distance measure is described in (Emde and Wettschereck, 1996; Bohnebeck et al., 1998). Features were not weighted, because using Kononenko's Relief feature weighting (Kononenko, 1994) did not significantly affect performance in preliminary experiments.</Paragraph>
<Paragraph position="1"> The input for RIBL consists of three relations, lemma(di,lemma,v), pos(di,POS-Tag,v), and document(di), with di the document index and v the standardised frequency, rounded to the next integer value. In the CL experiments, the lemma relation covers both real lemmata and punctuation marks; in LPE, punctuation marks had a separate predicate. Relations with a feature value of 0 are omitted, reducing the size of the input considerably. For these features, a true relational representation is not necessary, but that might change for more complex features such as syntactic relations.</Paragraph>
<Paragraph position="2"> IB1: IB1 stores all training set vectors in an instance base. New feature vectors are assigned the class of the most similar instance. We use the Euclidean distance metric for determining nearest neighbours. All experiments were run with (IB1-IG) or without (IB1) weighting the contribution of each feature with its gain ratio.</Paragraph>
<Paragraph position="3"> LVQ: LVQ also classifies incoming data based on prototype vectors. However, the prototypes are not selected, but interpolated from the training data so as to maximise the accuracy of a nearest-neighbour classifier based on these vectors. During learning, the prototypes are shifted gradually towards members of the class they represent and away from members of different classes. There are three main variants of the algorithm, two of which only modify codebook vectors at the decision boundary between classes.</Paragraph> </Section>
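The core LVQ update can be sketched as follows. This is a minimal illustration of the rule just described, not the OLVQ1 implementation used in the experiments below: OLVQ1 maintains one individually optimised learning rate per codebook vector, whereas a single fixed rate alpha is assumed here, and all identifiers are hypothetical.

import numpy as np

def lvq_step(codebooks, codebook_labels, x, y, alpha=0.05):
    # One learning step for training item (x, y): find the nearest codebook
    # vector and pull it towards x if the classes match, push it away otherwise.
    dists = np.linalg.norm(codebooks - x, axis=1)
    c = int(dists.argmin())
    if codebook_labels[c] == y:
        codebooks[c] += alpha * (x - codebooks[c])  # same class: move closer
    else:
        codebooks[c] -= alpha * (x - codebooks[c])  # different class: move away
    return codebooks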
<Section position="3" start_page="145" end_page="146" type="sub_section"> <SectionTitle> 5.3 LPE-Experiments 5.3.1 Procedure </SectionTitle>
<Paragraph position="0"> From the complete set of documents, we constructed three pairs of training and test sets for training the feature-based classifiers. The test sets are mutually disjoint; each of them contains 5 positive and 5 negative examples. The corresponding training sets contain the remaining 95 documents.</Paragraph>
<Paragraph position="1"> For RIBL, test set performance is determined using leave-one-out cross validation. Feature vectors contained either 100, 500, or 1000 lemma features.</Paragraph>
<Paragraph position="2"> On the basis of test set performance, we determined precision, recall, and accuracy. Instead of determining the recall/precision breakeven point as in (Joachims, 1998) or average precision over different recall values as in (Yang, 1997), we provide both values to determine which type of error an algorithm is more susceptible to. Tab. 2 summarises the results.</Paragraph>
<Paragraph position="3"> Condition IB1-IG resulted in significantly higher precision (+0.5%) than IB1, but lower recall and accuracy (difference not significant). The number of neighbouring vectors was also varied (k = 1, 3, 5, 7). For precision, recall, and accuracy, the best results were achieved with k = 3. A pure nearest-neighbour approach led to classifying all examples as negative. The number of neighbours k was also varied for RIBL. Contrary to IB1, it performs best for k = 1.</Paragraph>
<Paragraph position="4"> For the LVQ runs, we used the variant OLVQ1. In this algorithm, one codebook vector is adapted at a time; the rate of codebook vector adaptation is optimised for fast convergence. The resulting codebook was not tuned afterwards, to avoid overfitting. We varied both the number of codebook vectors (10, 20, 50, 90) and the initialisation procedure: during one set of runs, each class receives the same number of vectors; during the other, the number of codebook vectors is proportional to class size. Performance increases if codebook vectors are assigned proportionally to each class, and deteriorates with the number of codebook vectors, a clear sign of overfitting.</Paragraph>
[Table 2 caption fragment: ... runs for each task and for the best combination of feature set and number of features, precision and recall having equal weight. Key: all: ws/wspos/wspp/cw/cwpos/cwpp, cw*: cw/cwpos/cwpp, ws*: ws/wspos/wspp]
<Paragraph position="5"> LVQ achieves a performance ceiling of 100% precision and recall on nearly all tasks, the exception being genre task A. The low average performance of IB1 is due to bad results for k = 1; for higher k, IB1 performs as well as LVQ. Overall, performance decreases with an increasing number of features. IB1 is rather robust regarding the choice of feature set. LVQ tends to perform better on data sets derived from both content and function words, with the exception of task A. Because of the ceiling effect, it almost never matters whether the additional linguistic features are included or not. Recall is significantly better than precision for most tasks.</Paragraph>
<Paragraph position="6"> RIBL shows the greatest variation in performance. Although it performs fairly well, Tab. 2 shows differences of up to -5% on precision and -23% on recall. Overall, ws-based feature sets outperform cw-based ones. Performance declines sharply with the number of features. POS features almost always have a clear positive effect on recall (on average +28% for cw* and +16% for ws*), but an even larger negative effect on precision (-38% for cw* and -39% for ws*), which only shows for 500 and 1000 lemma features. Lemma and POS frequency information apparently conflict, with POS frequency leading to overgeneralisation. Perhaps semantic features describe the class boundaries more adequately; they may be covered implicitly in large vectors containing lemmata from that class. For 100 lemma features, where the representation is extremely sparse, we find that including POS information does indeed boost performance, especially for the two genre tasks, as we would have predicted.</Paragraph> </Section>
<Section position="4" start_page="146" end_page="146" type="sub_section"> <SectionTitle> 5.4 CL Experiments 5.4.1 Procedure </SectionTitle>
<Paragraph position="0"> In this set of experiments, RIBL and IB1 were both evaluated using leave-one-out cross validation. The performance of LVQ is reported on the basis of ten-fold cross validation for reasons of computing time. Training and test sets were also constructed somewhat differently: the test set contained the same proportion of positive examples as the training set. If we had balanced the test sets as above, this would have resulted in 4 pairs of sets instead of 10, and much smaller test sets, because some classes, such as L, are very small. This problem was not so grave for the LPE experiments because of the ceiling effect and the small size of the complete data set; therefore, we did not rerun the corresponding experiments.</Paragraph>
<Paragraph position="1"> Furthermore, the number of codebook vectors for LVQ was now varied between 10, 50, 100, and 200 in order to take into account the increased training set sizes.</Paragraph>
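As a sketch of this evaluation regime, the following assumes scikit-learn as a stand-in (the paper predates it): StratifiedKFold keeps the proportion of positive examples in each test fold equal to that in the training data, as described above. The data and the 3-NN classifier are placeholders, not the authors' setup.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 100))          # placeholder feature vectors (CL corpus size)
y = rng.integers(0, 2, size=500)    # placeholder binary category labels

# Ten-fold stratified cross validation: each test fold preserves the
# class proportions of the full data set.
scores = []
for train, test in StratifiedKFold(n_splits=10).split(X, y):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
    scores.append(clf.score(X[test], y[test]))
print(np.mean(scores))

</Section> </Section> </Paper>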