<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-3008">
  <Title>Towards Robust Animacy Classification Using Morphosyntactic Distributional Features</Title>
  <Section position="4" start_page="47" end_page="48" type="metho">
    <SectionTitle>
2 Features of animacy
</SectionTitle>
    <Paragraph position="0"> As mentioned above, animacy is highly correlated with a number of other linguistic concepts, such as transitivity, agentivity, topicality and discourse salience. The expectation is that marked configurations along these dimensions, e.g. animate objects or inanimate agents, are less frequent in the data. However, these are complex notions to translate into extractable features from a corpus. In the following we will present some morphological andsyntactic features which, indifferent ways, approximate the multi-faceted property of animacy: Transitive subject and (direct) object As mentioned earlier, a prototypical transitive relation involves an animate subject and an inanimate object. In fact, a corpus study of animacy distribution in simple transitive sentences in Norwegian revealed that approximately 70% of the subjects of these types of sentences were animate, whereas as many as 90% of the objects were inanimate (Ovrelid, 2004). Although this corpus study involved all types of nominal arguments, including pronouns and proper nouns, it still seems that the frequency with which a certain noun occurs as a subject or an object of a transitive verb might be an indicator of its animacy.</Paragraph>
    <Paragraph position="1"> Demoted agent in passive Agentivity is another related notion to that of animacy, animate beings are usually inherently sentient, capable of acting volitionally and causing an event to take place - all properties of the prototypical agent (Dowty, 1991). The passive construction, or rather the property of being expressed as the demoted agent in a passive construction, is a possible approximator of agentivity. It is well known that transitive constructions tend to passivize better (hence more frequently) if the demoted subject bears a prominent thematic role, preferably agent.</Paragraph>
    <Paragraph position="2"> Anaphoric reference by personal pronoun Anaphoric reference is a phenomenon where the animacy of areferent isclearly expressed.</Paragraph>
    <Paragraph position="3"> The Norwegian personal pronouns distinguish their antecedents along the animacy dimension - animate han/hun 'he/she' vs.</Paragraph>
    <Paragraph position="4"> inanimate den/det 'it-MASC/NEUT'.</Paragraph>
    <Paragraph position="5"> Anaphoric reference by reflexive pronoun Reflexive pronouns represent another form of anaphoric reference, and, may, in contrast to the personal pronouns locate their antecedent locally, i.e. within the same clause.</Paragraph>
    <Paragraph position="6"> In the prototypical reflexive construction the subject and the reflexive object are coreferent and it describes an action directed at oneself. Although the reflexive pronoun in Norwegian does not distinguish for animacy, the agentive semantics of the construction might still favour an animate subject.</Paragraph>
    <Paragraph position="7"> Genitive -s There is no extensive case system for common nouns in Norwegian and the only  distinction that is explicitly marked on the noun is the genitive case by addition of -s.</Paragraph>
    <Paragraph position="8"> The genitive construction typically describes possession, arelation whichoften involves an animate possessor.</Paragraph>
    <Section position="1" start_page="48" end_page="48" type="sub_section">
      <SectionTitle>
2.1 Feature extraction
</SectionTitle>
      <Paragraph position="0"> In order to train a classifier to distinguish between animate and inanimate nouns, training data consisting of distributional statistics on the above features were extracted from a corpus. For this end, a 15 million word version of the Oslo Corpus, a corpus of Norwegian texts of approximately 18.5 million words, wasemployed.2 Thecorpus ismorphosyntactically annotated and assigns an under-specified dependency-style analysis to each sentence.3 null For each noun, relative frequencies for the different morphosyntactic features described above were computed from the corpus, i.e. the frequency of the feature relative to this noun is divided by the total frequency of the noun. For transitive subjects (SUBJ), we extracted the number of instances where the noun in question was unambiguously tagged as subject, followed by a finite verb and an unambiguously tagged object.4 The frequency of direct objects (OBJ) for a given noun was approximated to the number of instances where the noun in question was unambiguously tagged as object.</Paragraph>
      <Paragraph position="1"> We here assume that an unambiguously tagged object implies an unambiguously tagged subject.</Paragraph>
      <Paragraph position="2"> However, by not explicitly demanding that the object is preceded by a subject, we also capture objects with a &amp;quot;missing&amp;quot; subject, such as objects occurring in relative clauses and infinitival clauses. Asmentioned earlier, another context where animate nouns might be predominant is in the by-phrase expressing the demoted agent of a passive verb (PASS). Norwegian has two ways of expressing the passive, a morphological passive (verb + s) and a periphrastic passive (bli + past participle).</Paragraph>
      <Paragraph position="3"> The counts for passive by-phrases allow for both types ofpassives to precede the by-phrase containing the noun in question.</Paragraph>
      <Paragraph position="4">  (Karlsson et al., 1995), and the analysis is underspecified as the nodes are labelled only with their dependency function, e.g. subject or prepositional object, and their immediate heads are not uniquely determined.</Paragraph>
      <Paragraph position="5"> 4The tagger works in an eliminative fashion, so tokens may bear two or more tags when they have not been fully disambiguated.</Paragraph>
      <Paragraph position="6"> With regard to the property of anaphoric reference by personal pronouns, the extraction was bound to be a bit more difficult. The anaphoric personal pronoun is never in the same clause as the antecedent, and often not even in the same sentence. Coreference resolution is a complex problem, and certainly not one that we shall attempt to solve in the present context. However, we might attempt to come up with a metric that approximates the coreference relation in a manner adequate for our purposes, that is, which captures the different coreference relation for animate as opposed to inanimate nouns. To this end, we make useofthecommonassumption thatapersonal pronoun usually refers to a discourse salient element which is fairly recent in the discourse. Now, if a sentence only contains one core argument (i.e.</Paragraph>
      <Paragraph position="7"> an intransitive subject) and it is followed by a sentence initiated byapersonal pronoun, itseems reasonable to assume that these are coreferent (Hale and Charniak, 1998). For each of the nouns then, we count the number of times it occurs as a sub-ject with no subsequent object and an immediately following sentence initiated by (i) an animate personal pronoun (ANAAN) and (ii) an inanimate personal pronouns (ANAIN).</Paragraph>
      <Paragraph position="8"> The feature of reflexive coreference is easier to approximate, as this coreference takes place within the same clause. For each noun, the number of occurrences as a subject followed by a verb and the 3.person reflexive pronoun seg 'him/her-/itself' are counted and its relative frequency recorded. The genitive feature (GEN) simply contains relative frequencies ofthe occurrence of each noun with genitive case marking, i.e. the suffix -s.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="48" end_page="50" type="metho">
    <SectionTitle>
3 Method viability
</SectionTitle>
    <Paragraph position="0"> In order to test the viability of the classification method for this task, and in particular, the chosen features, a set of forty highly frequent nouns were selected - twenty animate and twenty inanimate nouns. A frequency threshold of minimum one thousand occurrences ensured sufficient data for all the features, as shown in table 1, which reports the mean values along with the standard deviation for each class and feature. The total data points for each feature following the data collection are as follows: SUBJ: 16813, OBJ: 24128, GEN: 7830, PASS: 577, ANAANIM: 989, ANAINAN: 944, REFL: 558. As we can see, quite a few of the features express morphosyntactic cues that are  from feature extraction (SUBJ=Transitive Subject, OBJ=Object, GEN=Genitive -s, PASS=Passive byphrase, ANAAN=Anaphoric reference by animate pronoun, ANAIN=Anaphoric reference by inanimate pronoun, REFL=Anaphoric reference by reflexive pronoun).  1. SUBJ OBJ GEN PASS ANAAN ANAIN REFL 87.5 2. OBJ GEN PASS ANAAN ANAIN REFL SUBJ 85.0 3. SUBJ GEN PASS ANAAN ANAIN REFL OBJ 87.5 4. SUBJ OBJ PASS ANAAN ANAIN REFL GEN 85.0 5. SUBJ OBJ GEN ANAAN ANAIN REFL PASS 82.5 6. SUBJ OBJ GEN PASS ANAIN REFL ANAAN 82.5 7. SUBJ OBJ GEN PASS ANAAN REFL ANAIN 87.5 8. SUBJ OBJ GEN PASS ANAAN ANAIN REFL 75.0 9. OBJ PASS ANAAN ANAIN SUBJ GEN REFL 77.5  rather rare. This isin particular true for the passive feature and the anaphoric features ANAAN, ANAIN and REFL. There is also quite a bit of variation in the data (represented by the standard deviation for each class-feature combination), a property which is to be expected as all the features represent approximations of animacy, gathered from an automatically annotated, possibly quite noisy, corpus. Even so, the features all express a difference between the two classes in terms of distributional properties; the difference between the mean feature values for the two classes range from double to five times the lowest class value.</Paragraph>
    <Section position="1" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
3.1 Experiment 1
</SectionTitle>
      <Paragraph position="0"> Based on the data collected on seven different features for our 40 nouns, a set of feature vectors are constructed for each noun. They contain the relative frequencies for each feature along with the name of the noun and its class (animate or inanimate). Note that the vectors do not contain the mean values presented in Table 1 above, but rather the individual relative frequencies for each noun.</Paragraph>
      <Paragraph position="1"> The experimental methodology chosen for the classification experiments is similar to the one described in Merlo and Stevenson (2001) for verb classification. We also make use of leave-one-out training and testing of the classifiers and the same software package for decision tree learning, C5.0(Quinlan, 1998), isemployed. Inaddition, all our classifiers employ the boosting option for constructing classifiers (Quinlan, 1993). For calculation of the statistical significance of differences in the performance of classifiers tested on the same data set, McNemar's test is employed.</Paragraph>
      <Paragraph position="2"> Table 2 shows the performance of each individual feature in the classification of animacy. As we can see, the performance of the features differ quite a bit, ranging from mere baseline performance (ANAIN) to a 70% improvement of the baseline (SUBJ). Thefirstline ofTable3showsthe performance using all the seven features collectively where we achieve an accuracy of 87.5%, a 75% improvement of the baseline. The SUBJ, GEN and REFL features employed individually are the best performing individual features and their classification performance do not differ significantly from the performance of the combined classifier, whereas the rest of the individual features do (at the p&lt;.05 level).</Paragraph>
      <Paragraph position="3"> The subsequent lines (2-8) of Table 3 show the accuracy results for classification using all features except one at a time. This provides an indication of the contribution of each feature to the classification task. In general, the removal of a feature causes a 0% - 12.5% deterioration of results, however, only the difference in performance caused by the removal of the REFL feature is significant (at the p&lt;0.05 level). Since this feature is one of the best performing features individually, it is not surprising that its removal causes a notable difference in performance. The removal of the  ANAIN feature, on the other hand, does not have any effect on accuracy whatsoever. This feature was the poorest performing feature with a baseline, or mere chance, performance. We also see, however, that the behaviour of the features incombination is not strictly predictable from their individual performance, as presented in table 2. The SUBJ, GEN and REFL features were the strongest features individually with a performance that did not differ significantly from that of the combined classifier. However, as line 9 in Table 3 shows, the classifier as a whole is not solely reliant on these three features. When they are removed from the feature pool, the performance (77.5% accuracy) does not differ significantly (p&lt;.05) from that of the classifier employing all features collectively.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="50" end_page="52" type="metho">
    <SectionTitle>
4 Data sparseness and back-off
</SectionTitle>
    <Paragraph position="0"> The classification experiments reported above impose a frequency constraint (absolute frequencies &gt;1000) on the nouns used for training and testing, in order to study the interaction of the different features without the effects of sparse data. In the light of the rather promising results from these experiments, however, it might be interesting to further test theperformance ofour features inclassification as the frequency constraint is gradually relaxed.</Paragraph>
    <Paragraph position="1"> To this end, three sets of common nouns each counting 40 nouns (20 animate and 20 inanimate nouns) were randomly selected from groups of nouns with approximately the same frequency in the corpus. The first set included nouns with an absolute frequency of 100 +/-20 ([?]100), the second of50+/-5 ([?]50)and thethird of10+/-2 ([?]10). Feature extraction followed the same procedure as in experiment 1, relative frequencies for all seven features were computed and assembled into feature vectors, one for each noun.</Paragraph>
    <Section position="1" start_page="50" end_page="51" type="sub_section">
      <SectionTitle>
4.1 Experiment 2: Effect of sparse data on classification
</SectionTitle>
      <Paragraph position="0"> classification In order to establish how much of the generalizing power of the old classifier is lost when the frequency ofthenouns islowered, anexperiment was conducted which tested the performance of the old classifier, i.e. a classifier trained on all the more frequent nouns, on the three groups of less frequent nouns. As we can see from the first column in Table 4, this resulted in a clear deterioration of results, from our earlier accuracy of 87.5% to new accuracies ranging from 70% to 52.5%, barely above the baseline. Not surprisingly, the results decline steadily as the absolute frequency of the classified noun is lowered.</Paragraph>
      <Paragraph position="1"> Accuracy results provide an indication that the classification is problematic. However, it does not indicate what the damage is to each class as such.</Paragraph>
      <Paragraph position="2"> A confusion matrix is in this respect more informative. Confusion matrices for the classification of the three groups of nouns,[?]100,[?]50 and[?]10, are provided in table 5. These clearly indicate that it is the animate class which suffers when data becomes more sparse. The percentage of misclassified animate nouns drop drastically from 50% at[?]100 to 80% at[?]50 and finally 95% at[?]10.</Paragraph>
      <Paragraph position="3"> The classification of the inanimate class remains pretty stable throughout. The fact that a majority of our features (SUBJ, GEN, PASS, ANAAN and REFL) target animacy, in the sense that a higher proportion of animate than inanimate nouns exhibit the feature, gives a possible explanation for this. As data gets more limited, this differentiation becomes harder to make, and the animate feature profiles come to resemble the inanimate more and more. Because the inanimate nouns are expected to have low proportions (compared to the animate) for all these features, the data sparseness is not as damaging. In order to examine the effect on each individual feature of the lowering of the frequency threshold, wealso ran classifiers trained on the high frequency nouns with only individual features on the three groups of new nouns. These resultsaredepicted inTable4. Inourearlier experiment, the performance of a majority of the individual features (OBJ, PASS, ANAAN, ANAIN) was significantly worse (at the p&lt;0.05 level) than the performance of the classifier including all the features. Three ofthe individual features (SUBJ, GEN, REFL) had a performance which did not differ significantly from that of the classifier employing all the features in combination.</Paragraph>
      <Paragraph position="4"> As the frequency threshold is lowered, however, the performance of the classifiers employing all features and those trained only on individual features become more similar. For the[?]100 nouns, only the two anaphoric features ANAAN andthe reflexivefeature REFL, have aperformance that differs significantly (p&lt;0.05) from the classifier employing all features. For the [?]50 and [?]10 nouns, there are no significant differences between the classifiers employing individual fea- null tures only and the classifiers trained on the feature set as a whole. This indicates that the combined classifiers no longer exhibit properties that are not predictable from the individual features alone and they do not generalize over the data based on the combinations of features.</Paragraph>
      <Paragraph position="5"> Interms ofaccuracy, afew ofthe individual features even outperform the collective result. On average, the three most frequent features, the SUBJ, OBJ and GEN features, improve the performance by 9.5% for the [?]100 nouns and 24.6% for the [?]50 nouns. For the lowest frequency nouns ([?]10) we see that the object feature alone improves the result by almost 24%, from 52.5% to 65 % accuracy. In fact, the object feature seems to be the most stable feature of all the features. When examining the means of the results extracted for the different features, the object feature is the feature which maintains thelargest difference between the two classes as the frequency threshold is lowered.</Paragraph>
      <Paragraph position="6"> The second most stable feature in this respect is the subject feature.</Paragraph>
      <Paragraph position="7"> Thegroup of experiments reported above shows that thelowering ofthe frequency threshold forthe classified nouns causes a clear deterioration of results in general, and most gravely when all the features are employed together.</Paragraph>
    </Section>
    <Section position="2" start_page="51" end_page="52" type="sub_section">
      <SectionTitle>
4.2 Experiment 3: Back-off features
</SectionTitle>
      <Paragraph position="0"> The three most frequent features, the SUBJ, OBJ and GEN features, were the most stable in the two experiments reported above and had a performance which did not differ significantly from the combined classifiers throughout. In light of this we ran some experiments where all possible combinations ofthese morefrequent features wereemployed. The results for each of the three groups of nouns ispresented inTable 6. The exclusion ofthe less frequent features has a clear positive effect on the accuracy results, as we can see in table 6. For the[?]100 and[?]50nouns, theperformance has improved compared to the classifier trained both on all the features and on the individual features. The classification performance for these nouns is now identical or only slightly worse than the performance for the high-frequency nouns in experiment 1. For the[?]10 group of nouns, the performance is, at best, the same as for all the features and at worse fluctuating around baseline.</Paragraph>
      <Paragraph position="1"> In general, the best performing feature combinations are SUBJ&amp;OBJ&amp;GEN and SUBJ&amp;OBJ .</Paragraph>
      <Paragraph position="2"> These two differ significantly (at the p&lt;.05 level) from the results obtained by employing all the features collectively for both the[?]100 and the[?]50 nouns, hence indicate a clear improvement. The feature combinations both contain the two most stable features - one feature which targets the animate class (SUBJ) and another which target the inanimate class (OBJ), a property which facilitates differentiation even as the marginals between the two decrease.</Paragraph>
      <Paragraph position="3"> It seems, then, that backing off to the most frequent features might constitute a partial remedy for the problems induced by data sparseness in the classification. The feature combinations SUBJ&amp;OBJ&amp;GEN and SUBJ&amp;OBJ both significantly improve the classification performance and actually enable us to maintain the same accuracy for both the[?]100 and[?]50 nouns as for the higher frequency nouns, as reported in experiment  tions of the most frequent features</Paragraph>
    </Section>
    <Section position="3" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
4.3 Experiment 4: Back-off classifiers
</SectionTitle>
      <Paragraph position="0"> Another option, besides a back-off to more frequent features in classification, is to back off to another classifier, i.e. a classifier trained on nouns with a similar frequency. An approach of this kind will attempt to exploit any group similarities that these nouns may have in contrast to the mores frequent ones, hopefully resulting in a better classification. null In this experiment classifiers were trained and tested using leave-one-out cross-validation on the three groups of lower frequency nouns and employing individual, as well as various other feature combinations. The results for all features as well as individual features are summarized in Table 7. As we can see, the result for the classifier employing allthefeatures hasimproved somewhat compared to the corresponding classifiers in experiment 3 (as reported above in Table 4) for all ourthreegroups ofnouns. Thisindicates thatthere is a certain group similarity for the nouns of similar frequency that is captured in the combination of the seven features. However, backing off to a classifier trained on nouns that are more similar frequency-wise does not cause an improvement in classification accuracy. Apart from the SUBJ feature for the [?]100 nouns, none of the other classifiers trained on individual or all features for the three different groups differ significantly (p&lt;.05) from their counterparts in experiment 3.</Paragraph>
      <Paragraph position="1"> As before, combinations of the most frequent features were employed in the new classifiers trained and tested on each of the three frequencyordered groups of nouns. In the terminology employed above, this amounts to a backing off both classifier- and feature-wise. The accuracy measures obtained for these experiments are summarized in table 8. For these classifiers, the backed off feature combinations do not differ significantly (at the p&lt;.05 level) from their counterparts in experiment 3, where the classifiers were trained on the more frequent nouns with feature back-off.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>