<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1612"> <Title>Learning Information Status of Discourse Entities</Title> <Section position="5" start_page="96" end_page="96" type="metho"> <SectionTitle> 4 Learning Information Status </SectionTitle> <Paragraph position="0"> Our starting point for the automatic assignment of information status are the three already introduced classes: old, mediated and new. Additionally, we experiment with binary classifications, by collapsing mediated entities in turn with old and new ones.</Paragraph> <Paragraph position="1"> For training, developing and evaluating the model we use the split described in Section 2.2 (see Table 1). Performance is evaluated according to overall accuracy and per class precision, recall, and f-score as described in Section 3. To train a C4.5 decision tree model we use the J48 Weka implementation (Witten and Frank, 2000). The choice of features to build the tree is described in the following section.</Paragraph> <Section position="1" start_page="96" end_page="96" type="sub_section"> <SectionTitle> 4.1 Features </SectionTitle> <Paragraph position="0"> The seven features we use are automatically extracted from the annotated data exploiting pre-existing morpho-syntactic markup and using sim- null ple pattern matching techniques. They are summarised in Table 4.</Paragraph> <Paragraph position="1"> The choice of features is motivated by the following observations. The information coming from partial previous mentions is particularly useful for the identification of mediated entities. This should account specifically for cases of mediation via set-relations; for example, &quot;your children&quot; would be considered a partial previous mention of &quot;my children&quot; or &quot;your four children&quot;. The value &quot;na&quot; stands for &quot;non-applicable&quot; and is mainly used for pronouns. Full previous mention is likely to be a good indicator of old entities. Both full and partial previous mentions are calculated within each dialogue without any constraints based on distance.</Paragraph> <Paragraph position="2"> NP type and determiner type are expected to be helpful for all categories, with pronouns, for instance, tending to be old and indefinite NPs being often new. We included the length of NPs (measured in number of words) since linguistic studies have shown that old entities tend to be expressed withlesslexicalmaterial(Wasow,2002). Inexperiments on the development data we also included the NP string itself, on the grounds that it might be of use in cases of general mediated instances (common knowledge entities), such as &quot;the sun&quot;, &quot;people&quot;, &quot;Mickey Mouse&quot;, and so on. However, this feature turned out to negatively affect performance, and was not included in the final model.</Paragraph> </Section> </Section> <Section position="6" start_page="96" end_page="97" type="metho"> <SectionTitle> 4.2 Results </SectionTitle> <Paragraph position="0"> With an overall final accuracy of 79.5% on the evaluation set, C4.5 significantly outperforms the hand-crafted algorithm (65.8%). Although the identification of old entities is quite successful (F=.928), performance is not entirely satisfactory.</Paragraph> <Paragraph position="1"> This is especially true for the classification of new entities, for which the final f-score is .320, mainly due to extremely low recall (.223). Mediated entities, instead, are retrieved with a fairly low precision but higher recall. 
<Paragraph position="2"> The major confusion in the classification arises between mediated and new (also the most difficult decision for human annotators; see Section 2.1), which are often distinguished on the basis of world knowledge not available to the classifier. This is clearly shown by the confusion matrix in Table 6: the highest proportion of mistakes is due to 1,453 new instances classified as mediated.</Paragraph>
<Paragraph position="3"> Also significant is the wrong assignment of mediated tags to old entities. Such behaviour of the classifier is to be expected, given the 'in-between' nature of mediated entities.</Paragraph>
<Section position="1" start_page="97" end_page="97" type="sub_section">
<SectionTitle> 4.3 Classification with two categories only </SectionTitle>
<Paragraph position="0"> Given the above observations, we collapsed mediated entities in turn with old ones (focusing on their non-newness) or new ones (emphasising their incomplete givenness), thus reducing the task to a binary classification.</Paragraph>
<Paragraph position="1"> Since it appears more difficult to distinguish mediated from new than mediated from old (Table 6), we expect the classifier to perform better when mediated is binned with new rather than with old. Moreover, when mediated and old entities are collapsed into a single class as opposed to new ones, the class distribution becomes highly skewed towards old entities (84.7%), so that the learner is likely to lack sufficient information for identifying new entities.</Paragraph>
<Paragraph position="2"> Table 7 shows the final accuracy for the two binary classifications (and the three-way one). As expected, when mediated entities are joined with new ones the classifier performs best (93.1%), with high f-scores for both old and new, and is significantly better than the alternative binary classification (t-test, p < 0.001). Indeed, the old+med vs new classification is nearly an all-old assignment, and its overall final accuracy (85.5%) is not a significant improvement over the all-old baseline (84.7%). These results suggest that mediated NPs are more similar to new than to old entities, and might provide interesting feedback for the theoretical assumptions underlying the corpus annotation.</Paragraph>
</Section>
<Section position="2" start_page="97" end_page="97" type="sub_section">
<SectionTitle> 4.4 Comparison with two categories only </SectionTitle>
<Paragraph position="0"> For a fair comparison, we performed a two-way classification using the hand-crafted algorithm, which had to be simplified to account for the lack of a mediated class.</Paragraph>
<Paragraph position="1"> In the case where all mediated instances are collapsed together with the old ones, the decision rules are very simple: pronouns, proper nouns, and common nouns that have been previously fully or partially mentioned are classified as old; first-mention common nouns are new; everything else is old. Both precision and recall for old instances are quite high (.868 and .906 respectively), for a resulting f-score of .887. Conversely, performance on identifying new entities is very poor, with a precision of .337 and a recall of .227, for a combined f-score of .271. The overall accuracy is .803, significantly lower than that of C4.5, which achieves .850 (t-test, p < 0.001).</Paragraph>
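The label collapsing and the simplified decision rules just described can be stated compactly in code. The sketch below is a hedged rendering of them, with illustrative field names ("pronoun", "proper", "common") rather than the paper's actual data format.

```python
# Collapsing the three-way tags into the two binary schemes, plus the
# simplified hand-crafted rules for the old+mediated vs new setting.

def collapse(label: str, scheme: str) -> str:
    """Map a three-way tag onto a binary scheme: under "old+med",
    mediated counts as old; under "med+new", it counts as new."""
    if label == "mediated":
        return "old" if scheme == "old+med" else "new"
    return label

def rule_old_med_vs_new(np_type: str, prev_full: bool, prev_partial: bool) -> str:
    """Pronouns, proper nouns, and previously (fully or partially)
    mentioned common nouns are old; first-mention common nouns are new;
    everything else defaults to old."""
    if np_type in ("pronoun", "proper"):
        return "old"
    if np_type == "common":
        return "old" if (prev_full or prev_partial) else "new"
    return "old"

assert collapse("mediated", "med+new") == "new"
assert rule_old_med_vs_new("common", False, False) == "new"
```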
<Paragraph position="2"> When mediated entities are collapsed with new ones, rule-based classification again uses very simple rules: proper nouns are new if first mentioned and old otherwise; common nouns that have been fully mentioned before are old, otherwise new; everything else is new, which in the training set is now the most frequent class (51.7%). The overall accuracy of .849 is significantly lower than that achieved by C4.5, which is .931 (t-test, p < 0.001). Unlike in the previous case (mediated collapsed with old), performance on the two classes is comparable, with precision, recall, and f-score of .863, .815, and .838 for old, and .838, .881, and .859 for new.</Paragraph>
</Section>
</Section>
<Section position="7" start_page="97" end_page="99" type="metho">
<SectionTitle> 5 Discussion </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="97" end_page="98" type="sub_section">
<SectionTitle> 5.1 Influence of training size </SectionTitle>
<Paragraph position="0"> In order to assess the contribution of training size to performance, we experimented with increasingly larger portions of the training data (from 50 to 30,000 instances). For each training size we ran the classifier 5 times, each with a different, randomly picked set of instances. This was done for the three-way classification and the two binary ones; reported results are always averaged over the 5 runs. Figure 2 shows the three learning curves.</Paragraph>
<Paragraph position="1"> Table 7: Overall accuracy of the hand-crafted rules and C4.5 on the development (DEV) and evaluation (EVAL) sets.

                        DEV             EVAL
classification       rules   C4.5    rules   C4.5
old vs med vs new    .658    .796    .644    .795
old+med vs new       .810    .861    .803    .855
old vs med+new       .844    .926    .849    .931
</Paragraph>
<Paragraph position="2"> The curve for the three-way classification shows a slight, constant improvement, though it appears to reach a plateau after 5,000 instances. The result obtained by training on the full set (40,865 instances) is significantly better only when compared to training sets of 4,000 instances or fewer (t-test, p < 0.05); no other significant differences in accuracy can be observed. Increasing the training size beyond 5,000 instances when learning to classify old+mediated vs new leads to a slight improvement, due to the learner becoming able to identify some new entities; with a smaller training set, the proportion of new entities is far too small to be of use. However, as noted above, the overall final accuracy of 85.5% (see Table 7) does not significantly improve over the baseline.</Paragraph>
</Section>
<Section position="2" start_page="98" end_page="98" type="sub_section">
<SectionTitle> 5.2 Feature contribution </SectionTitle>
<Paragraph position="0"> We are also interested in the contribution of each single feature. We therefore ran the classifier again, leaving out one feature at a time. No significant drop or gain was observed in any of the runs (t-test, p < 0.01), though the largest drops were caused by removing the grammatical role and the NP type. These two features, however, also appear to be the least informative in single-feature classification experiments, suggesting that such information is useful only when combined with other evidence (see also Section 5.4). All results for the leave-one-out and single-feature classifiers are shown in Table 8.</Paragraph>
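A sketch of the leave-one-out ablation described above, assuming instances are stored as feature dictionaries as in the earlier sketch. Data loading is omitted, and the feature names are assumptions: six of the seven features are named in the text, while the full inventory is given in the paper's Table 4.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Six of the seven features are named in the text; see Table 4 for all.
FEATURES = ["full_prev_mention", "partial_prev_mention", "np_type",
            "det_type", "np_length", "gram_role"]

def accuracy_without(held_out, train_feats, y_train, eval_feats, y_eval):
    """Retrain a tree with `held_out` removed and return eval accuracy."""
    drop = lambda d: {k: v for k, v in d.items() if k != held_out}
    vec = DictVectorizer()
    X_train = vec.fit_transform([drop(d) for d in train_feats])
    X_eval = vec.transform([drop(d) for d in eval_feats])
    clf = DecisionTreeClassifier().fit(X_train, y_train)
    return accuracy_score(y_eval, clf.predict(X_eval))

# for f in FEATURES:                      # one ablation run per feature
#     print(f, accuracy_without(f, train_feats, y_train, eval_feats, y_eval))
```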
</Section>
<Section position="3" start_page="98" end_page="99" type="sub_section">
<SectionTitle> 5.3 Error Analysis </SectionTitle>
<Paragraph position="0"> The overwhelming majority of mistakes in the three-way classification (1,453, or 56.1% of all errors) stems from classifying entities that are in fact new as mediated (Table 6). Significant confusion arises from proper nouns, as they are annotated as mediated or new depending on whether they are generally known (names of US presidents, for example) or domain/community-specific (such as the name of a local store that only the speaker knows). This inconsistency in the annotation may well reflect the actual status of entities in the dialogues, but it can be misleading for the classifier.</Paragraph>
<Paragraph position="1"> Another large group of errors is formed by old entities classified as mediated (452 cases). This is probably due to the fact that the first node in the decision tree is the "partial mention" feature (see Figure 3). The tree correctly captures the fact that an entity that is mentioned for the first time but has an earlier partial mention is mediated. An entity that has both a partial and a full previous mention, however, is classified as old only if it is a proper noun or a pronoun, and as mediated if it is a common noun. This yields a large number of mistakes, since many common nouns that have been previously mentioned (both in full and partially) are in fact old. Another problem with previous mentions is the lack of any restriction on distance: we count as a previous mention any identical mention of a given NP anywhere in the dialogue, and we have no means of checking that it is indeed the same entity that is referred to. A way to alleviate this problem might be to exploit speaker turn information. Using anaphoric chains could also help, but see Section 6.</Paragraph>
</Section>
<Section position="4" start_page="99" end_page="99" type="sub_section">
<SectionTitle> 5.4 Learnt trees meet hand-crafted rules </SectionTitle>
<Paragraph position="0"> The learnt trees provide interesting insights into the intuitions behind the choice of hand-crafted rules. The learnt tree (Figure 3) looks remarkably different from the rules in Figure 1. We had based our decision to emphasise the importance of NP type on the linguistic evidence that different syntactic realisations reflect different degrees of availability of discourse entities (Givón, 1983; Ariel, 1990; Grosz et al., 1995).</Paragraph>
<Paragraph position="1"> In the learnt model, however, knowledge about NP type is used only as subordinate to other features. This is indeed mirrored in the fact that removing NP type information from the feature set causes accuracy to drop, whereas a classifier built on NP type alone performs poorly (see Table 8). Interestingly, though, more informative knowledge about syntactic form seems to be derived from the determiner type, which helps distinguish degrees of oldness among common nouns.</Paragraph>
</Section>
<Section position="5" start_page="99" end_page="99" type="sub_section">
<SectionTitle> 5.5 Naive Bayes model </SectionTitle>
<Paragraph position="0"> For additional comparison, we also trained a Naive Bayes classifier with the same experimental settings. Results are significantly worse than C4.5's in all three scenarios (t-test, p < 0.005), with an accuracy of 74.6% in the three-way classification, 63.3% for old+mediated vs new, and 91.0% for old vs mediated+new. The latter distribution again appears to be the easiest to learn.</Paragraph>
</Section>
</Section>
</Paper>