<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1153"> <Title>Learning Greek Verb Complements: Addressing the Class Imbalance</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Data Collection </SectionTitle> <Paragraph position="0"> The corpora used in our experiments were:</Paragraph> </Section> <Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 1. The ILSP/ELEFTHEROTYPIA (Hatzigeorgiu et al. 2000) and ESPRIT 860 (Partners of ESPRIT-291/860 1986) Corpora </SectionTitle> <Paragraph position="0"> These two corpora comprise a total of 300,000 words. Both are balanced and manually annotated with complete morphological information. The former also provides adverb type information (temporal, manner, etc.).</Paragraph> <Paragraph position="1"> Further (phrase structure) information is obtained automatically.</Paragraph> </Section> <Section position="6" start_page="1" end_page="2" type="metho"> <SectionTitle> 2. The DELOS Corpus (Kermanidis et al. 2002) </SectionTitle> <Paragraph position="0"> The DELOS Corpus is a collection of economic-domain texts of approximately five million words and of varying genre. It has been automatically annotated from the ground up. Morphological tagging on DELOS was performed by the analyzer of Sgarbas et al. (2000).</Paragraph> <Paragraph position="1"> Accuracy in pos tagging reaches 98%. Case and voice tagging reach 94% and 84% accuracy, respectively. Further (phrase structure) information is again obtained automatically. DELOS also contains subject-verb-object information, limited to nominal and prepositional objects and detected automatically by a shallow parser that reaches 70% precision and recall.</Paragraph> <Paragraph position="2"> All the corpora have been phrase-analyzed by the chunker described in detail in Stamatatos et al. (2000). Noun (NP), verb (VP), prepositional (PP), and adverbial (ADP) phrases, as well as conjunctions (CON), are detected via multi-pass parsing. 
Precision and recall reach 94.5% and 89.5%, respectively.</Paragraph> <Paragraph position="3"> Phrases are non-overlapping. Concerning phrase structure, complements (except for weak personal pronouns) are not included in the verb phrase, nominal modifiers in the genitive case are included within the noun phrase they modify, and coordinated simple noun and adverbial phrases are grouped into a single phrase.</Paragraph> <Paragraph position="4"> The next step is empirical headword identification. NP headwords are determined based on the pos and case of the phrase constituents. For VPs, the headword is the main verb or, if the VP is introduced by a conjunction, that conjunction. For PPs it is the preposition introducing them.</Paragraph> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.1 Data Formation </SectionTitle> <Paragraph position="0"> To account for the structural freedom of the language, the context information of every verb in the corpus focuses on the two phrases preceding and the three phrases following it. Only one out of 200 complements in the corpus appears outside this window. Each of these phrases is in turn the focus phrase (the candidate complement or adjunct), and an instance of twenty-nine features (28 features plus the class label) is formed for every focus phrase (fp). Thus, at most five instances are formed per verb occurrence. The formation of these instances from a corpus sentence is shown in Figure 1.</Paragraph> <Paragraph position="1"> The first five features are the verb lemma (VERB), its mode (F1), whether it is (im)personal (F2), its copularity (F3), and its voice (F4). Two features encode the presence of a personal pronoun in the accusative (F5) or genitive (F6) within the VP. For every fp (fps are in bold), apart from the seven features described above, a context window of three phrases preceding the fp and three phrases following it is taken into account. 
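To make the windowing concrete, the following sketch (our illustration, not the authors' code) forms one instance per candidate focus phrase around a verb. Phrases are simplified to (type, headword) pairs, and the paper's special handling of bare conjunctions and non-subordinated VPs is omitted; the function name and dictionary layout are our own.

```python
def instances_for_verb(phrases, verb_idx, before=2, after=3, ctx=3):
    """Form one candidate complement/adjunct instance per focus phrase.

    phrases: chunked sentence as a list of (type, headword) pairs.
    The focus phrases are the `before` phrases preceding the verb and
    the `after` phrases following it; each instance also records a
    context of `ctx` phrases on either side of the focus phrase,
    padding with ('-', '-') beyond the sentence boundaries.
    """
    candidates = [i for i in range(verb_idx - before, verb_idx + after + 1)
                  if 0 <= i < len(phrases) and i != verb_idx]
    instances = []
    for fp in candidates:
        context = [phrases[fp + off] if 0 <= fp + off < len(phrases)
                   else ('-', '-')
                   for off in range(-ctx, ctx + 1) if off != 0]
        instances.append({'verb': phrases[verb_idx],
                          'fp': phrases[fp],
                          'context': context})
    return instances
```

With a six-phrase sentence and the verb in second position, this yields four instances (one preceding phrase, three following); a verb with two phrases on its left yields the maximum of five.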
Each of these six phrases (as well as the fp itself) is encoded as a set of three features (twenty-one features in total). These triples appear next in each instance, from the leftmost phrase (-3) to the rightmost (+3).</Paragraph> <Paragraph position="2"> For each feature triple, the first feature is the type of the phrase. The second is the pos of the headword for NPs and ADPs. The third feature for NPs is the case of the headword; for ADPs it is the type of the adverb, if available. If a VP is introduced by a conjunction, the second feature is the conjunction's type (coordinating/subordinating) and the third is the conjunction itself; otherwise the second feature is the verb's pos and the third is empty. For PPs, the second feature is empty and the third is the preposition.</Paragraph> <Paragraph position="3"> NP[Labros] VP[is] NP[a good boy] CON[and] VP[believes] PP[in God.] (Labros is a good boy and believes in God.)

VERB, F1, F2, F3, F4, F5, F6, FP, -3, -2, -1, +1, +2, +3, LABEL
einai, O, P, C, P, F, F, NP,N,n, -,-,-, -,-,-, VP,V,-, NP,N,n, VP,V,-, PP,-,se, C
einai, O, P, C, P, F, F, NP,N,n, -,-,-, VP,V,-, NP,N,n, VP,V,-, PP,-,se, -,-,-, A
pisteuo, E, P, NC, A, F, F, NP,N,n, -,-,-, -,-,-, VP,V,-, NP,N,n, VP,V,-, PP,-,se, A
pisteuo, E, P, NC, A, F, F, NP,N,n, -,-,-, VP,V,-, NP,N,n, VP,V,-, PP,-,se, -,-,-, A
pisteuo, E, P, NC, A, F, F, PP,-,se, NP,N,n, NP,N,n, VP,V,-, -,-,-, -,-,-, -,-,-, C

Figure 1: The instances formed for the example sentence.</Paragraph> <Paragraph position="4"> The first instance is for the verb einai and the candidate complement/adjunct is the fp NP. In the second instance, for the same verb, the candidate complement/adjunct is the fp NP. There are only two instances for this verb because 1. there are no phrases preceding it, and 2. the third phrase following it (consisting only of the coordinating conjunction) has little to contribute and is disregarded altogether, forcing us to consider the next phrase in the sentence. 
As the next phrase is a verb phrase that is not introduced by a subordinating conjunction (and therefore cannot be a dependent of the verb einai), it is also disregarded and no further phrases are tested. In the same way, for the verb pisteuo we have one instance with the NP as the fp and one with the PP as the fp.</Paragraph> <Paragraph position="5"> We experimented with various window sizes for the context of the fp, i.e. [fp], [-1, fp], [-2, fp], [-2, +1], [-3, +3].</Paragraph> <Paragraph position="6"> The formation process described in the previous section was applied to the ILSP and ESPRIT corpora and to part (approximately 500,000 words) of the DELOS corpus. For the first two corpora, the class of each fp in every created instance was hand-labeled by two linguists, who looked up each verb in its context, based on the detailed descriptions of complements and adjuncts by Klairis and Babiniotis (1999). For DELOS, which already contained partial, automatically detected verb-object information, existing erroneous complement information was manually corrected, while clausal complements were manually detected. The dataset consisted of 63,000 instances. The imbalance ratio is 1:6.3 (one complement instance for every 6.3 adjunct instances).</Paragraph> </Section> </Section> <Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Addressing the Imbalance </SectionTitle> <Paragraph position="0"> As the ratio given above shows, the complement class is underrepresented compared to the adjunct class. As the number of majority-class examples increases, it becomes more likely that the nearest neighbor of a complement is an adjunct; complements are therefore prone to misclassification. We address this problem with One-sided Sampling, i.e. pruning redundant adjunct (negative) examples while keeping all the complement (positive) examples. 
Instances of the majority class can be categorized into four groups (Figure 2): noisy instances appear within a cluster of examples of the opposite class; borderline instances lie close to the boundary region between the two classes; redundant instances are already described by other examples of the same class; and safe instances are crucial for determining the class. Instances belonging to one of the first three groups need to be eliminated, as they do not contribute to class prediction.</Paragraph> <Paragraph position="1"> Noisy and borderline examples can be detected using Tomek links: let x and y be two examples of opposite classes at distance d(x,y). The pair constitutes a Tomek link if no other example lies at a smaller distance to x or to y than d(x,y).</Paragraph> <Paragraph position="2"> Redundant instances may be removed by creating a consistent subset of the initial training set. A subset C of a training set T is consistent with T if, when used as the training set for the nearest neighbor (1-NN) algorithm, it correctly classifies all the instances in T. To this end we start with a subset C consisting of all complement examples and one adjunct example. We train a learner on C and try to classify the remaining instances of the initial training set. All misclassified instances are added to C, which becomes the final reduced dataset.</Paragraph> <Paragraph position="3"> The exact process of the proposed algorithm is: 1. Let T be the original training set, in which the negative examples outnumber the positive ones.</Paragraph> <Paragraph position="4"> 2. Construct a dataset C containing all positive instances plus one randomly selected negative instance.</Paragraph> <Paragraph position="5"> 3. Classify T with 1-NN using the training examples of C and add all misclassified items to C. C is now consistent with T, only smaller.</Paragraph> <Paragraph position="6"> 4. Remove all negative examples participating in Tomek links. 
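The four steps above can be sketched in a few lines; this is our minimal illustration with a generic distance function, not the authors' implementation (function names and the brute-force 1-NN are our own choices).

```python
import random

def nn1_label(x, refs, dist):
    """Label of x's nearest neighbour among refs = [(vector, label), ...]."""
    return min(refs, key=lambda r: dist(x, r[0]))[1]

def one_sided_sampling(pos, neg, dist, seed=0):
    """One-sided sampling: keep all positives, prune negatives.

    pos, neg: lists of feature vectors (tuples); dist: distance function.
    Returns the reduced list of (vector, label) pairs (label 1 = positive).
    """
    rng = random.Random(seed)
    # Step 2: C = all positive instances plus one random negative.
    C = [(x, 1) for x in pos] + [(rng.choice(neg), 0)]
    # Step 3: add every negative that 1-NN over C misclassifies.
    for x in neg:
        if (x, 0) not in C and nn1_label(x, C, dist) != 0:
            C.append((x, 0))
    # Step 4: drop negatives participating in Tomek links, i.e. pairs of
    # opposite-class examples with no third example closer to either one.
    def tomek(x, y):
        d = dist(x[0], y[0])
        return all(dist(z[0], x[0]) >= d and dist(z[0], y[0]) >= d
                   for z in C if z is not x and z is not y)
    drop = {id(y) for x in C if x[1] == 1
                  for y in C if y[1] == 0 and tomek(x, y)}
    return [z for z in C if z[1] == 1 or id(z) not in drop]
```

Because only negatives are ever removed, every positive (complement) example survives the reduction, which is the defining property of the one-sided approach.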
The resulting, smaller set is used for classification instead of T.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Distance functions </SectionTitle> <Paragraph position="0"> The distance functions used to determine the instances participating in Tomek links are described in this section.</Paragraph> <Paragraph position="1"> The most commonly used distance function is the Euclidean distance. One drawback of the Euclidean distance is that it is not very flexible with nominal attributes. The value difference metric (VDM) is more appropriate for this type of attribute, as it considers two nominal values to be closer if they have more similar classifications, i.e. more similar correlations with the output class. In its simplest form, for two values a and b of an attribute A, VDM_A(a,b) = Sum_{c in C} |N_{A,a,c}/N_{A,a} - N_{A,b,c}/N_{A,b}|, where N_{A,a} is the number of times value a of attribute A was found in the training set, N_{A,a,c} is the number of times value a co-occurred with output class c, and C is the set of class labels.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 The reduced dataset </SectionTitle> <Paragraph position="0"> We used the above distance metrics to detect examples that are safe to remove, and then applied the methodology of the previous section to our data. Figure 3 depicts the reduction in the number of negative instances for both metrics and every fp context window. The more phrases are considered (i.e. the higher the vector dimension), the noisier the instances and the more redundant examples are removed. For small windows, the positive effect of VDM is clear: more redundant examples are detected and removed. As the window size increases, the Euclidean distance becomes smoother (as it depends on more features) and leads to the removal of as many examples as VDM.</Paragraph> <Paragraph position="1"> It is interesting to observe the types of instances that are removed from the initial dataset after balancing. 
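The VDM can be computed directly from co-occurrence counts. The sketch below (our illustration for a single nominal attribute, in the simplest q=1 form; function names are our own) precomputes N_{A,a} and N_{A,a,c} and then evaluates the metric:

```python
from collections import Counter

def vdm_counts(values, labels):
    """Count N_{A,a} and N_{A,a,c} for one nominal attribute A."""
    n_a = Counter(values)                # N_{A,a}: occurrences of each value
    n_ac = Counter(zip(values, labels))  # N_{A,a,c}: value/class co-occurrences
    return n_a, n_ac, set(labels)

def vdm(a, b, n_a, n_ac, classes):
    """Value difference metric between two values a, b of attribute A."""
    return sum(abs(n_ac[(a, c)] / n_a[a] - n_ac[(b, c)] / n_a[b])
               for c in classes)
```

Two values that are distributed similarly over the class labels (e.g. two phrase types that both mostly mark complements) come out close, regardless of how different the symbols themselves are.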
Redundant instances are usually those whose fp headword is a punctuation mark, a symbol, etc. Such fps could never constitute a complement and appear in the dataset only due to errors in the automatic pre-processing. Borderline instances are usually formed by fps with a syntactically ambiguous headword, such as a noun in the accusative case, an adjective in the nominative case when the verb is copular, or certain prepositional phrases. The following negative instance of the initial dataset (with window [fp]) shows the difference between the two distances.</Paragraph> <Paragraph position="2"> antikathisto , E, P, NC, A, F, F, PP,-,se , A This instance appears only as negative throughout the whole dataset. If the verb antikathisto (to replace) is omitted, the remaining instance appears several times in the data as positive, with a variety of other verbs. The Euclidean distance between these instances is small, while the VDM distance is greater, because the verb is a feature highly correlated with the output class. So the above instance is removed as borderline under the Euclidean distance, while it remains untouched under VDM.</Paragraph> </Section> </Section> <Section position="8" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Classifying new instances </SectionTitle> <Paragraph position="0"> For classification we experimented with a set of algorithms that have been widely used in several domains and whose performance is well known: instance-based learning (IB1), decision trees (an implementation of C4.5 with reduced error pruning) and Naive Bayes were used to classify new, unseen instances as complements or adjuncts. Unlike previous approaches, which test their methodology on only a few new verb examples, we performed 10-fold cross-validation on all our data: the dataset (whether initial or reduced) was divided into ten sets of equal size, making sure that the proportion of the examples of the two classes remained the same. 
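The class-proportion-preserving split can be sketched as follows; this is our illustration of a stratified k-fold partition (function name and shuffling scheme are our own), not the authors' code:

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Partition example indices into k folds, preserving class proportions.

    Examples of each class are shuffled and dealt round-robin into the
    folds, so every fold keeps (approximately) the global class ratio.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for n, i in enumerate(idxs):
            folds[n % k].append(i)
    return folds
```

With an imbalance like the paper's (adjuncts far outnumbering complements), plain random folds could leave some test folds with very few complements; stratification keeps the 1:6.3 ratio in every fold.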
For guiding the C4.5 pruning process, one of the ten subsets was used as the held-out validation set.</Paragraph> </Section> </Paper>