<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1161"> <Title>Acquisition of Semantic Classes for Adjectives from Distributional Evidence</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Methodology </SectionTitle> <Paragraph position="0"> We used a 16.5 million word Catalan corpus, semi-automatically morphologically tagged and hand-corrected (Rafel, 1994). The corpus contains modern written samples (1960-1988) from most topics and genres. We selected all adjectives in the corpus with more than 50 occurences (2283 lemmata), including some gerunds and participles with a predominant modifying function (for more details on the selection criteria, cf. Sanrom`a (2003)).</Paragraph> <Paragraph position="1"> In all the experiments, we clustered the whole set of 2283 adjectives, as the set of objects alters the vector space and thus the classification results. We therefore clustered always the same set and chose different subsets of the data in the evaluation and testing phases in order to analyse the results.</Paragraph> <Paragraph position="2"> tag gloss tag gloss *cd clause delimiter aj adjective *dd def. determiner av adverb *id indef. det. cn common noun *pe preposition co coordinating elem.</Paragraph> <Paragraph position="3"> *ve verb np noun phrase ey empty Phrase boundary markers signaled with *.</Paragraph> <Paragraph position="4"> In the evaluation phase we used a manually classified subset of 100 adjectives (tuning subset from now on). Two judges classified them along the two parameters explained in Section 2 and their judgements were merged by one of the authors of the paper. In the testing phase, we used a different subset with 80 adjectives as Gold Standard against which we could compare the clustering results (see Section</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 for details on the manual annotation process). 3.1 Feature representation </SectionTitle> <Paragraph position="0"> Although we already had some hypotheses with respect to what features could be relevant, as discussed in Section 2, we wanted to proceed as empirically as possible. Recall also from the Introduction that we wanted to restrict ourselves to shallow distributional features. For both reasons, we modelled the data in terms of blind n-gram distribution and then selected the features.</Paragraph> <Paragraph position="1"> The lemmata were modelled using pairs of bigrams: in a 4-word window (three to the left and one to the right of the adjective), the first two tags formed a feature and the second two tags another feature. They were encoded separately due to sparse data considerations. This window should be enough for the kind of information we gather, because of the locality of the relationships which most adjectives establish with their arguments (see Section 2). We subsumed the information in the original morphological tags in order to have the minimal number of categories needed for our task, listed in Table 1.2 In order to further reduce the number of features in a linguistically principled way, we took phrase boundaries into account: All words beyond a POS considered to be a phrase boundary marker (see Table 1) were assigned the tag empty.</Paragraph> <Paragraph position="2"> Examples 1 and 2 show the representation that would be obtained for two imaginary English sen2Clause delimiters are punctuation marks other than commata, relative pronouns and subordinating conjunctions. 
<Paragraph position="8"> This representation schema produced a total of 240 different feature (bigram) types, 164 of which had a prior probability below 0.001 and were discarded. In order to choose the most adequate features for each of the parameters (that is, features that allowed us to distinguish unary from binary adjectives, on the one hand, and basic from event and from object adjectives, on the other), we checked the distributions of their values in the tuning subset. Features were chosen if they had different distributions across the classes of a parameter and they made linguistic sense. We found that both criteria usually agreed, so that the selected features are consistent with the predictions made in Section 2, as will be discussed in Section 4. An alternative, more objective selection method would be to perform an ANOVA, which we plan to test in the near future.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Gold Standard </SectionTitle>
<Paragraph position="0"> Recall that we could not use any previously well-established classification. We therefore built our own Gold Standard, as mentioned at the beginning of this section.</Paragraph>
<Paragraph position="1"> The 80 lemmata were independently annotated by three human judges (PhD students in Computational Linguistics, two of whom had done research on adjectives), who had to classify each adjective as either unary or binary, on the one hand, and as either basic, event or object-denoting, on the other. They received instructions which referred only to semantic characteristics, not to the expected syntactic behaviour; for example, "check whether the state denoted by the adjective is necessarily related to a previous or simultaneous event". In addition, they were provided with (the same, randomly chosen) 18 examples from the corpus for each of the adjectives to be tagged.</Paragraph>
<Paragraph position="2"> The judges were allowed to assign a lemma to a second category in case of polysemy (e.g. econòmic has an object meaning, 'economic', and a basic one, 'cheap', less frequent in the corpus). However, the agreement scores for the polysemy judgements were not significant at all. We cannot perform any analysis of the clustering results with respect to polysemy until reliable scores are obtained.[3] We therefore ignored the polysemy judgements and considered only the main (first) class assigned by each judge in all subsequent analyses.</Paragraph>
<Paragraph position="3"> [Footnote 3: The low agreement is probably the result both of the fuzziness of the limits between polysemy and vagueness for adjectives, and of the way the instructions were written, as they induced judges to make hard choices and did not state clearly enough the conditions under which an item could be classified in more than one class.]</Paragraph>
<Paragraph position="4"> The three classifications were again merged by one of the authors of the paper into a single Gold Standard set (GS from now on). The agreement of the judges amongst themselves and with the GS with respect to the main class of each adjective can be found in Tables 2 and 3.</Paragraph>
<Paragraph position="5"> [Tables 2 and 3: agreement for each parameter: inter-judge (J1, J2, J3), and with GS.]</Paragraph>
<Paragraph position="6"> As can be seen, the agreement among the judges is remarkably high for a lexical semantics task: all but one of the kappa values are above 0.6 (+/-0.13 for a 95% confidence interval). The lowest agreement scores are those of J2, the only judge who had not done research on adjectives. This suggests that this judge is an outlier and that the level of expertise needed for humans to perform this kind of classification is quite high. However, there are too few data for this suspicion to be statistically testable.</Paragraph>
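<Paragraph position="7"> For reference, the following Python sketch shows how such kappa values and their confidence intervals can be computed. It is illustrative only: the paper does not specify its variance estimate, so the common large-sample approximation for Cohen's kappa is used below, and the judge data is invented.

# Cohen's kappa between two judges, with a large-sample 95% confidence
# interval (one standard approximation; not necessarily the paper's).
from collections import Counter
from math import sqrt

def cohen_kappa(labels_a, labels_b):
    """Return (kappa, 95% half-width) for two parallel label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two judges labelled independently.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    se = sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, 1.96 * se

# Invented judgements for the unary/binary parameter over 80 lemmata:
# the two judges agree on 72 of the 80 items.
j1 = ["unary"] * 50 + ["binary"] * 30
j2 = ["unary"] * 46 + ["binary"] * 4 + ["binary"] * 26 + ["unary"] * 4
k, half = cohen_kappa(j1, j2)
print(f"kappa = {k:.2f} +/- {half:.2f}")   # kappa = 0.79 +/- 0.14
</Paragraph>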
<Paragraph position="8"> Landis and Koch (1977) consider values of κ ≥ 0.61 to indicate substantial agreement, whereas Carletta (1996) says that 0.67 < κ < 0.8 allows just "tentative conclusions to be drawn". Merlo and Stevenson (2001) report inter-judge κ values of 0.53 to 0.66 for a task we consider comparable to ours, that of classifying verbs into unergative, unaccusative and object-drop, and argue that Carletta's "is too stringent a scale for our task, which is qualitatively quite different from content analysis" (Merlo and Stevenson, 2001, 396).</Paragraph>
<Paragraph position="9"> The results reported in Tables 2 and 3 are significantly higher than those of Merlo and Stevenson (2001). Although they are still not all above 0.8, as would be desirable according to Carletta, we consider them to be strong enough to back up both the classification and the feasibility of the task for humans. Thus, we will use the GS as the reference for the clustering analysis.</Paragraph>
</Section>
</Section>
</Paper>