<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0603">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. How and Where do People Fail with Time: Temporal Reference Mapping Annotation by Chinese and English Bilinguals. Yang Ye</Title>
  <Section position="5" start_page="13" end_page="15" type="metho">
    <SectionTitle>
3 Chinese Tense Annotation
Experiments
</SectionTitle>
    <Paragraph position="0"> In this section, we present three tense annotation experiments under the following scenarios:
1. A null-control situation with native Chinese speakers, where the annotators were provided with the source Chinese sentences but not the English translations;
2. A high-control situation with native English speakers, where the annotators were provided with the Chinese sentences as well as English translations with specified syntax and lexicons;
3. A semi-control situation with native English speakers, where the annotators were allowed to choose the syntax and lexicon of the English sentence along with the appropriate tense.
All experiments in this paper were approved by the Behavioral Sciences Institutional Review Board at the University of Michigan, IRB file number B04-00007481-I.</Paragraph>
    <Section position="1" start_page="14" end_page="14" type="sub_section">
      <SectionTitle>
3.1 Experiment One
</SectionTitle>
      <Paragraph position="0"> Experiment One presents the first scenario of tense annotation for Chinese verbs in a Chinese-to-English cross-lingual situation. The annotation was carried out on 25 news articles from the LDC Xinhua News release with catalog number LDC2001T11. The articles were divided into 5 groups of 5 articles each and contain a total of 985 verbs.</Paragraph>
      <Paragraph position="1"> For each group, three native Chinese speakers who were bilingual in Chinese and English annotated the tense of the verbs in the articles independently.</Paragraph>
      <Paragraph position="2"> Prior to annotating the data, the annotators underwent brief training during which they were asked to read an example Chinese sentence for each tense and to make sure they understood the examples. During the annotation, the annotators were asked to read each whole article first and then select a tense tag based on the context of each verb.</Paragraph>
      <Paragraph position="3"> The tense taxonomy provided to the annotators includes the twelve tenses formed by combining the simple tenses (present, past and future) with the progressive and perfect aspects.</Paragraph>
      <Paragraph position="4"> In cases where the judges were unable to decide the tense of a verb, they were instructed to tag it as &amp;quot;unknown&amp;quot;. In this experiment, the annotators were asked to tag the tense of all Chinese words that were tagged as verbs in the Penn Treebank corpora. Conceivably, the task under the current scenario is meta-linguistic in nature, because tense is an elusive notion for Chinese speakers. Nevertheless, the experiment provides a baseline situation for human tense annotation agreement. The following is an example of the annotation, where the annotators were to choose an appropriate tag from the provided tense tags:
1. simple present tense
2. simple past tense
3. simple future tense
4. present perfect tense
5. past perfect tense
6. future perfect tense
7. present progressive tense
8. past progressive tense
9. future progressive
10. present perfect progressive
11. past perfect progressive
12. future perfect progressive</Paragraph>
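The twelve tags above are simply the cross product of the three simple tenses with the four aspect combinations, plus an unknown escape tag. A minimal Python sketch makes this structure explicit; the label strings are illustrative choices of ours, not the annotation tool's actual tags:

```python
# Sketch: the 12-tag tense taxonomy as a cross product, plus "unknown".
# The label strings are illustrative, not the annotators' actual tags.
TENSES = ["present", "past", "future"]
ASPECTS = ["simple", "progressive", "perfect", "perfect progressive"]

def tense_taxonomy():
    """Return the twelve tense/aspect combinations plus 'unknown'."""
    return [f"{t} {a}" for t in TENSES for a in ASPECTS] + ["unknown"]

print(len(tense_taxonomy()))  # -> 13 (twelve tenses plus the escape tag)
```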
    </Section>
    <Section position="2" start_page="14" end_page="14" type="sub_section">
      <SectionTitle>
3.2 Experiment Two
</SectionTitle>
      <Paragraph position="0"> Experiment Two was carried out on 25 news articles from the parallel Chinese and English news articles available in the LDC Multiple Translation Chinese corpora (MTC catalog number LDC2002T01). In the previous experiment, the annotators tagged all verbs. In the current experimental set-up, we preprocessed the materials and removed the verbs that lose their verbal status in translation from Chinese to English due to nominalization. After this preprocessing, a total of 288 verbs remained to be annotated.</Paragraph>
      <Paragraph position="1"> Three native speakers, who were bilingually fluent in English and Chinese, were recruited to annotate the tense for the English verbs that were translated from Chinese. As in the previous scenario, the annotators were encouraged to pay attention to the context of the target verb when tagging its tense.</Paragraph>
      <Paragraph position="2"> The annotators were provided with the full taxonomy illustrated by examples of English verbs and they worked independently. The following is an example of the annotation where the annotators were to choose an appropriate tense tag from the provided tense tags:</Paragraph>
      <Paragraph position="4"> According to statistics, the cities (achieve) a combined gross domestic product of RMB19 billion last year, an increase of more than 90% over 1991 before their opening.</Paragraph>
      <Paragraph position="5"> A. achieves
B. achieved
C. will achieve
D. are achieving
E. were achieving
F. will be achieving
G. have achieved
H. had achieved
I. will have achieved
J. have been achieving
K. had been achieving
L. will have been achieving
M. would achieve</Paragraph>
    </Section>
    <Section position="3" start_page="14" end_page="15" type="sub_section">
      <SectionTitle>
3.3 Experiment Three
</SectionTitle>
      <Paragraph position="0"> Experiment Three was based on the same MTC materials described in the previous section. Since each Chinese article in the MTC corpora is translated into English by ten human translation teams, we can conceptually view these ten teams as different annotators making decisions about the appropriate tense for the English verbs. These annotators differ from those in Experiment Two in that they were allowed to choose any syntactic structure and verb lexicon, because they were performing tense annotation within the larger task of sentence translation. Their tense annotations were therefore performed with much less specification of the annotation context. We manually aligned the Chinese verbs with the English verbs from the 10 translation teams in the MTC corpora and thus obtained our third source of tense annotation results. For the Chinese verbs that were not translated as verbs into English, we assigned a &amp;quot;Not Available&amp;quot; tag. There are 1505 verbs in total, including the ones that lost their verbal status across the languages.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="15" end_page="16" type="metho">
    <SectionTitle>
4 Inter-Judge Agreement
</SectionTitle>
    <Paragraph position="0"> Researchers use consistency checking to validate human annotation experiments. The literature describes various ways of performing consistency checking, depending on the scale of the measurements, each with its advantages and disadvantages. Since our tense taxonomy is nominal, without any ordinal information, the Kappa statistic is the most appropriate measure of inter-judge agreement.</Paragraph>
    <Section position="1" start_page="15" end_page="15" type="sub_section">
      <SectionTitle>
4.1 Kappa Statistic
</SectionTitle>
      <Paragraph position="0"> Kappa scores were calculated for the three human judges' annotation results. The Kappa score is the de facto standard for evaluating inter-judge agreement on tagging tasks. It reports the agreement rate among multiple annotators while correcting for the agreement brought about by pure chance.</Paragraph>
      <Paragraph position="1"> It is defined by the following formula, where P(A) is the observed agreement among the judges and P(E) is the expected agreement: Kappa = (P(A) - P(E)) / (1 - P(E))</Paragraph>
      <Paragraph position="3"> Depending on how one identifies the expected agreement brought about by pure chance, there are two ways to calculate the Kappa score. One is the &amp;quot;Siegel-Castellan&amp;quot; Kappa discussed in (Eugenio, 2004), which assumes that there is one hypothetical distribution of labels for all judges. In contrast, the &amp;quot;Cohen&amp;quot; Kappa discussed in (Cohen, 1960) assumes that each annotator has an individual distribution of labels. This discrepancy slightly affects the calculation of P(E). There is no consensus regarding which Kappa is the &amp;quot;right&amp;quot; one, and researchers use both. In our experiments, we use the &amp;quot;Siegel-Castellan&amp;quot; Kappa.</Paragraph>
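As a concrete reference, the Siegel-Castellan-style Kappa under the pooled-label-distribution assumption described above can be sketched as follows; the function name and data layout (one list of judge labels per verb) are our own, not from the paper:

```python
from collections import Counter

def fleiss_kappa(annotations):
    """Siegel-Castellan / Fleiss-style Kappa: each item in `annotations`
    is the list of labels assigned to one verb by the m judges. A single
    pooled label distribution is assumed for all judges."""
    n_items = len(annotations)
    m = len(annotations[0])                      # judges per item
    pooled = Counter(lab for item in annotations for lab in item)
    total = n_items * m
    p_e = sum((c / total) ** 2 for c in pooled.values())   # chance agreement
    pairs = m * (m - 1)                          # ordered judge pairs per item
    p_a = sum(sum(c * (c - 1) for c in Counter(item).values()) / pairs
              for item in annotations) / n_items  # observed agreement
    return (p_a - p_e) / (1 - p_e)

# Perfect agreement on every item yields Kappa = 1.0
print(fleiss_kappa([["past"] * 3, ["present"] * 3]))  # -> 1.0
```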
      <Paragraph position="4"> The Kappa statistic for the annotation results of Experiment One is 0.277 on the full taxonomy and 0.37 if we collapse the tenses into three broad classes: present, past and future. The observed agreement rate, that is, P(A), is 0.42.</Paragraph>
      <Paragraph position="5"> The Kappa score for tense resolution from the ten human translation teams for the 52 Xinhua news articles is 0.585 on the full taxonomy; we expect the Kappa score to be higher if we exclude the verbs that are nominalized. Interestingly, the Kappa score calculated by collapsing the 13 tenses into 3 tenses (present, past and future) is only slightly higher: 0.595. The observed agreement rate is 0.72.</Paragraph>
      <Paragraph position="6"> Human tense annotation in the Chinese-to-English restricted translation scenario achieved a Kappa score of 0.723 on the full taxonomy with an observed agreement of 0.798. If we collapse simple past and present perfect, the Kappa score goes up to 0.792 with an observed agreement of 0.893.</Paragraph>
      <Paragraph position="7"> The Kappa score is 0.81 on the reduced taxonomy.</Paragraph>
    </Section>
    <Section position="2" start_page="15" end_page="16" type="sub_section">
      <SectionTitle>
4.2 Accuracy
</SectionTitle>
      <Paragraph position="0"> The Kappa score is a relatively conservative measurement of the inter-judge agreement rate. Conceptually, we could also obtain an alternative measurement of reliability by taking each annotator in turn as the gold standard and averaging the accuracies of the other annotators against these different gold standards. While it is true that, numerically, this yields a higher score than the Kappa and may seem to inflate the agreement rate, we argue that the difference between the Kappa score and the accuracy-based measurement is not merely that one is more aggressive than the other: the two measurements embody different policies. The Kappa score is concerned purely with agreement, without any consideration of truth or falsehood, while the procedure described above gives equal weight to each annotator as the gold standard; it therefore considers both the agreement and the truthfulness of the annotation. Additionally, the accuracy-based measurement is the same measurement typically used to evaluate machine performance, so it gives a genuine ceiling for machine performance.</Paragraph>
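The accuracy-based measure described above can be sketched in a few lines: rotate the role of gold standard over the annotators and average the pairwise accuracies. The function name and data layout are our own illustration:

```python
def rotating_gold_accuracy(annotations):
    """Treat each annotator in turn as the gold standard, score every
    other annotator against it, and average the resulting accuracies.
    `annotations` is a list of items, each a list of the m judges' tags."""
    n, m = len(annotations), len(annotations[0])
    accs = [sum(item[j] == item[g] for item in annotations) / n
            for g in range(m) for j in range(m) if j != g]
    return sum(accs) / len(accs)

# With full agreement the measure is 1.0, like the Kappa's P(A).
print(rotating_gold_accuracy([["past"] * 3, ["present"] * 3]))  # -> 1.0
```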
      <Paragraph position="1"> The accuracy under such a scheme for the three annotators in Experiment One is 43% on the full tense taxonomy.</Paragraph>
      <Paragraph position="2"> The accuracy under such a scheme for tense generation agreement from three annotators in Experiment Two is 80% on the full tense taxonomy.</Paragraph>
      <Paragraph position="3"> The accuracy under such a scheme for the ten translation teams in Experiment Three is 70.8% on the full tense taxonomy.</Paragraph>
      <Paragraph position="4"> Table 1 summarizes the inter-judge agreement for the three experiments.</Paragraph>
      <Paragraph position="5"> Examining the annotation results, we identified the following sources of disagreement. While the first two factors can be controlled for by a clearly pre-defined annotation guideline, the last two are intrinsically rooted in natural language and are therefore hard to deal with:
1. Different compliance with the Sequence of Tense (SOT) principle among annotators;
2. The &amp;quot;headline effect&amp;quot;;
3. Ambiguous part of speech of the &amp;quot;verb&amp;quot;: sometimes it is not clear whether a word is an adjective or a past participle, e.g. &amp;quot;The Fenglingdu Economic Development Zone is the only one in China that is/was built on the basis of a small town.&amp;quot;
4. Ambiguous aspectual properties of the verb: annotators may differ on whether the verb is telic or atelic, e.g. &amp;quot;statistics showed/show ...&amp;quot;
Put abstractly, ambiguity is an intrinsic property of natural languages. A taxonomy allows us to investigate the research problem, yet any clearly defined discrete taxonomy will inevitably fail on boundary cases between classes.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="16" end_page="18" type="metho">
    <SectionTitle>
5 Significance of Linguistic Factors in
Annotation
</SectionTitle>
    <Paragraph position="0"> In the NLP community, researchers carry out annotation experiments mainly to acquire a gold-standard data set for evaluation. Little effort has been made beyond the scope of agreement rate calculations. We propose that feature analysis for annotation experiments is not only the concern of psycholinguists; it also merits investigation within the enterprise of natural language processing. Besides providing a gold standard, the analysis of annotation results can help an NLP task in at least two ways: identifying the features that are responsible for inter-judge disagreement, and modeling the associations among the different features. The former attempts to answer the question of where the challenge for human classification comes from, and thereby provides an external reference for an automatic NLP system, although not necessarily in a direct way. The latter sheds light on the structures hidden among groups of features, whose identification could guide feature selection and offer convergent evidence for the significance of features confirmed in classification practice based on machine learning.</Paragraph>
    <Paragraph position="1"> In this section, we discuss at some length a feature analysis of the results of each annotation experiment discussed in the previous sections and summarize the findings.</Paragraph>
    <Section position="2" start_page="16" end_page="17" type="sub_section">
      <SectionTitle>
5.1 ANOVA analysis of Agreement and
Linguistic Factors in Free Translation
Tense Annotation
</SectionTitle>
      <Paragraph position="0"> This analysis tries to find the relationship between the linguistic properties of the verb and the tense annotation agreement across the ten different translation teams in Experiment Three. Specifically, we use an ANOVA analysis to explore how the overall variance in the inconsistency of the tenses of a particular verb with respect to different translation teams can be attributed to different linguistic properties associated with the Chinese verb. It is a three-way ANOVA with three linguistic factors under investigation: whether the sentence contains a temporal modifier or not; whether the verb is embedded in a relative clause, a sentential complement, an appositive clause or none of the above; and whether the verb is followed by aspect markers or not. The dependent variable is the inconsistency of the tenses from the teams. The  inconsistency rate is measured by the ratio of the number of distinct tenses over the number of tense tokens from the ten translation teams.</Paragraph>
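The dependent variable defined above, the ratio of distinct tenses to tense tokens, can be sketched in a couple of lines; the tag lists here are hypothetical, standing in for one row of ten team tags per Chinese verb:

```python
def inconsistency_rate(tags):
    """Ratio of distinct tense tags to tense tokens for one Chinese verb,
    e.g. over the ten translation teams' choices."""
    return len(set(tags)) / len(tags)

# Ten teams, three distinct tenses among their choices:
print(inconsistency_rate(["past"] * 6 + ["present"] * 3 + ["present perfect"]))  # -> 0.3
```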
      <Paragraph position="1"> Our ANOVA analysis shows that all three main effects, i.e. the embedding structure of the verb (p &lt;&lt; 0.001), the presence of aspect markers (p &lt;&lt; 0.01), and the presence of temporal modifiers (p &lt; 0.05), significantly affect the rate of disagreement in tense generation among the different translation teams. The following graphs show the trend: tense generation disagreement rates are consistently lower when a Chinese aspect marker is present, whether or not a temporal modifier is present (Figure 1). The model also suggests that the presence of temporal modifiers is associated with a lower rate of disagreement for three of the embedding structures, the exception being verbs in sentential complements (Figure 2; 0: the verb is not in any embedding structure; 1: the verb is embedded in a relative clause; 2: the verb is embedded in an appositive clause; 3: the verb is embedded in a sentential complement). Our explanation is that the annotators received varying degrees of prescriptive writing training, so when a temporal modifier is present in the sentence as a confounder, there is a higher incidence of SOT violations than when no temporal modifier is present. On top of this, the difference in tense tagging disagreement between sentences with and without a temporal modifier depends on the type of embedding structure (Figure 2, p &lt; 0.05).</Paragraph>
      <Paragraph position="2"> We also note that the relative clause embedding structure is associated with a much higher disagreement rate than any other embedding structures (Figure 3).</Paragraph>
    </Section>
    <Section position="3" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
5.2 Logistic Regression Analysis of
Agreement and Linguistic Factors in
Restricted Tense Annotation
</SectionTitle>
      <Paragraph position="0"> The ANOVA analysis in the previous section is concerned with the confounding power of the overt linguistic features. The current section examines the significance of the more latent features on tense annotation agreement when the SOT effect is removed by providing the annotators with a clear guideline about the SOT principle. Specifically, we are interested in the effect of the verb telicity and punctuality features on tense annotation agreement. The values of features such as the temporal modifier and the syntactic embedding structure were obtained through manual annotation based on the situation in the context. The data are from Experiment Two. Since there are only three annotators, the inconsistency rate discussed in 5.1 would have insufficient variance in the current scenario, making logistic regression a more appropriate analysis. The response is now binary: either agreement or disagreement (including partial agreement and pure disagreement). To avoid a multi-colinearity problem, we model Chinese features and English features separately. In order to truly investigate the effects of the latent features, we also keep the overt linguistic features in the model. The overt features include: type of syntactic embedding, presence of an aspect marker, presence of a temporal expression in the sentence, whether the verb is in a headline, and the presence of certain signal adverbs, including &amp;quot;yijing&amp;quot; (already), &amp;quot;zhengzai&amp;quot; (the Chinese pre-verbal progressive marker) and &amp;quot;jiang&amp;quot; (the Chinese pre-verbal adverb indicating future tense). We used backward elimination to obtain the final model.</Paragraph>
      <Paragraph position="1"> The result showed that punctuality is the only factor that significantly affects the agreement rate among multiple judges in both the model of English features and the model of Chinese features.</Paragraph>
      <Paragraph position="2"> The significance level is higher for the punctuality of the English verbs, suggesting that the source language environment is more relevant in tense generation. The annotators are roughly four times more likely to fail to agree on the tense of verbs associated with an interval event. This supports the hypothesis that human beings use the latent features in tense classification tasks. Surprisingly, the telicity feature is not significant at all. We suspect this is partly due to the correlation between the punctuality feature and the telicity feature. Additionally, none of the overt linguistic features is significant in the presence of the latent features, which implies that the latent features drive the disagreement among multiple annotators.</Paragraph>
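The roughly-four-times figure is how a fitted logistic regression coefficient is usually read off: for a binary predictor, the odds ratio is exp(beta). A sketch with a hypothetical coefficient, not a value reported in the paper:

```python
import math

# Hypothetical logistic regression coefficient for the punctuality
# (interval-event) indicator; exp(beta) is the odds ratio of disagreement.
beta_punctuality = 1.386  # illustrative only, not the paper's estimate
odds_ratio = math.exp(beta_punctuality)
print(round(odds_ratio, 1))  # -> 4.0: about four times the odds of disagreement
```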
    </Section>
    <Section position="4" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
5.3 Log-linear Model Analysis of Associations between
Linguistic Factors in Free Translation Tense Annotation
</SectionTitle>
      <Paragraph position="0"> This section discusses the association patterns between tense and the relevant linguistic factors via a log-linear model. A log-linear model is a special case of generalized linear models (GLMs) and has been widely applied in many fields of social science research for multivariate analysis of categorical data. The model reveals the interaction between categorical variables. The log-linear model is different from other GLMs in that it does not distinguish between &amp;quot;response&amp;quot; and &amp;quot;explanatory&amp;quot; variables. All variables are treated alike as &amp;quot;response variables&amp;quot;, whose mutual associations are explored. Under the log-linear model, the expected cell frequencies are functions of all variables in the model. The most parsimonious model that produces the smallest discrepancy between the expected and the observed cell frequencies is chosen as the final model. This provides the best explanation of the observed relationships among variables.</Paragraph>
      <Paragraph position="1"> We use the data from Experiment Two for the current analysis. The results show that the three linguistic features under investigation are significantly associated with tense. First, there is a strong association between aspect marker presence and tense, independent of the punctuality and telicity features and the embedding structure. Second, there is a strong association between telicity and tense, independent of punctuality, aspect marker presence and embedding structure. Third, there is a strong association between embedding structure and tense, independent of the telicity and punctuality features and aspect marker presence. This result is consistent with (Olsen, 2001) in that the lexical telicity feature, when used heuristically as the single knowledge source, can achieve a good prediction of verb tense in Chinese-to-English machine translation.</Paragraph>
      <Paragraph position="2"> For example, the odds of a verb being atelic in the past tense are 2.5 times the odds of a verb being atelic in the future tense, with a 95% confidence interval of (0.9, 7.2). And the odds of a verb in the future tense having an aspect marker approach zero when compared to the odds of a verb in the past tense having an aspect marker.</Paragraph>
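Estimates like the 2.5 odds ratio with its 95% confidence interval can be computed mechanically from a 2x2 contingency table; here is a minimal sketch using a Wald interval on the log odds ratio. The counts are hypothetical, not the paper's data:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table with cells
    a (atelic, past), b (telic, past), c (atelic, future), d (telic, future).
    The cell roles are illustrative; any 2x2 cross-classification works."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts giving an odds ratio of 2.5:
estimate, lower, upper = odds_ratio_ci(10, 4, 4, 4)
print(estimate)  # -> 2.5; the Wald interval brackets it: lower < 2.5 < upper
```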
      <Paragraph position="3"> Putting together the pieces from the logistic analysis and the current analysis, we see that annotators fail to agree on tense selection mostly with apunctual (interval-event) verbs, while the agreed-upon tense is jointly decided by the telicity feature, the aspect marker feature and the syntactic embedding structure associated with the verb.</Paragraph>
    </Section>
  </Section>
</Paper>