<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0107"> <Title>Latent Features in Automatic Tense Translation between Chinese and English. Yang Ye, Victoria Li Fossum</Title> <Section position="5" start_page="48" end_page="49" type="metho"> <SectionTitle> 3 Problem Definition </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="48" end_page="48" type="sub_section"> <SectionTitle> 3.1 Problem Formulation </SectionTitle> <Paragraph position="0"> The problem we are interested in can be formalized as a standard classification or labeling problem, in which we try to learn a classifier f : V -> T, where V is a set of verbs (each described by a feature vector) and T is the set of possible tense tags.</Paragraph> <Paragraph position="1"> Tense and aspect are morphologically merged in English; coarsely defined, there are twelve combinations of the simple tripartite tenses (present, past, and future) with the progressive and perfect grammatical aspects. For our classification experiments, in order to combat sparseness, we ignore the aspects and deal only with the three simple tenses: present, past, and future.</Paragraph> </Section> <Section position="2" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 3.2 Data </SectionTitle> <Paragraph position="0"> We use 152 pairs of parallel Chinese-English articles from an LDC release. The Chinese articles come from two news sources, Xinhua News Service and Zaobao News Service, and consist of 59,882 Chinese characters in total, roughly 350 characters per article. The English parallel articles are from the Multiple-Translation Chinese (MTC) Corpus from the LDC, catalog number LDC2002T01.</Paragraph> <Paragraph position="1"> We chose the best human translation from the 9 translation teams as our gold-standard parallel English data. The verb tenses are obtained through manual alignment between the Chinese source articles and the English translations. To avoid the noise introduced by alignment errors and to stay focused on the central question of this paper, we did not use automatic tools such as GIZA++ to obtain the verb alignments, since such tools typically produce a significant number of errors. We ignore Chinese verbs that are not translated into English as verbs because of &quot;nominalization&quot; (by which verbal expressions in Chinese are translated into nominal phrases in English). This exclusion is based on the rationale that a different choice of syntactic structure might have retained the verbal status in the target English sentence, but the tense of such a potential English verb would be decided jointly by a set of disparate features; those tenses are unknown in our training data. This preprocessing yields a total of 2,500 verb tokens in our data set.</Paragraph> </Section> </Section>
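To make the formulation above concrete, here is a minimal sketch of the classification setup in Python. The feature names, the toy examples, and the choice of a Naive Bayes learner are illustrative assumptions for this sketch, not the paper's actual feature encoding or model.

```python
# Sketch of learning f : V -> T over the three simple tenses.
# Assumes scikit-learn; every feature value below is hypothetical.
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

TENSES = {"present", "past", "future"}  # aspect ignored to combat sparseness

# Each Chinese verb token is a feature vector (here, a dict of categorical
# features); its label is the tense of its English translation.
train_verbs = [
    {"in_quote": "no", "embedding": "null", "adverb": "jiang1"},
    {"in_quote": "no", "embedding": "relative_clause", "marker": "le0"},
    {"in_quote": "yes", "embedding": "sentential_complement"},
]
train_tenses = ["future", "past", "present"]
assert set(train_tenses) <= TENSES

# DictVectorizer one-hot encodes the categorical features.
f = make_pipeline(DictVectorizer(), MultinomialNB())
f.fit(train_verbs, train_tenses)
print(f.predict([{"in_quote": "no", "embedding": "null", "adverb": "jiang1"}]))
```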
<Section position="6" start_page="49" end_page="50" type="metho"> <SectionTitle> 4 Feature Space </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 4.1 Surface Features </SectionTitle> <Paragraph position="0"> Many heterogeneous features contribute to the process of tense generation for Chinese verbs in the cross-lingual situation. Tenses in English, while manifesting a distinction in temporal reference, do not always reflect this distinction at the semantic level, as shown in the sentence &quot;I will leave when he comes.&quot; Hornstein (1990) accounts for this phenomenon by proposing the Constraints on Derived Tense Structures. Therefore, the feature space we use includes features that contribute to the construction of temporal reference at the semantic level as well as features that contribute to generating tense from that semantic level. The following is a list of the surface features that are directly extractable from the training data (a sketch of how they might be extracted follows this subsection): 1. Feature 1: Whether the verb is in quoted speech or not.</Paragraph> <Paragraph position="1"> 2. Feature 2: The syntactic structure in which the current verb is embedded. Possible structures include sentential complements, relative clauses, adverbial clauses, appositive clauses, and the null embedding structure.</Paragraph> <Paragraph position="2"> 3. Feature 3: Which of the following signal adverbs occur between the current verb and the previous verb: yi3jing1 (already), ceng2jing1 (once), jiang1 (future tense marker), zheng4zai4 (progressive aspect marker), yi4zhi2 (have always been).</Paragraph> <Paragraph position="3"> 4. Feature 4: Which of the following aspect markers occur between the current verb and the subsequent verb: le0, zhe0, guo4.</Paragraph> <Paragraph position="4"> 5. Feature 5: The distance in characters between the current verb and the previously tagged verb. (We discretize the continuous distance into three ranges: 0 < distance < 5, 5 <= distance < 10, or 10 <= distance < infinity.)</Paragraph> <Paragraph position="5"> 6. Feature 6: Whether the current verb is in the same clause as the previous verb.</Paragraph> <Paragraph position="6"> Features 1 and 2 are used to capture the discrepancy between semantic tense and syntactic tense. Features 3 and 4 are clues or triggers of certain aspectual properties of the verbs. Features 5 and 6 try to capture the dependency between the tenses of adjacent verbs.</Paragraph> </Section>
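As a concrete illustration of Section 4.1, the sketch below extracts the six surface features for one verb token. The token representation (character offsets, clause ids, quote and embedding flags) is an assumed format for this sketch, not one specified in the paper.

```python
# Hypothetical extraction of the six surface features for one Chinese verb.
SIGNAL_ADVERBS = ["yi3jing1", "ceng2jing1", "jiang1", "zheng4zai4", "yi4zhi2"]
ASPECT_MARKERS = ["le0", "zhe0", "guo4"]

def discretize_distance(d):
    """Feature 5: bucket the character distance to the previous verb."""
    if d < 5:
        return "d<5"
    if d < 10:
        return "5<=d<10"
    return "d>=10"

def surface_features(verb, prev_verb, tokens_before, tokens_after):
    """verb / prev_verb: dicts with assumed fields 'offset', 'clause_id',
    'in_quote', 'embedding'. tokens_before: tokens between the previous verb
    and the current verb; tokens_after: tokens between the current verb and
    the subsequent verb."""
    feats = {
        "f1_in_quote": verb["in_quote"],            # Feature 1
        "f2_embedding": verb["embedding"],          # Feature 2
        "f5_distance": discretize_distance(verb["offset"] - prev_verb["offset"]),
        "f6_same_clause": verb["clause_id"] == prev_verb["clause_id"],
    }
    for adv in SIGNAL_ADVERBS:                      # Feature 3
        feats["f3_" + adv] = adv in tokens_before
    for m in ASPECT_MARKERS:                        # Feature 4
        feats["f4_" + m] = m in tokens_after
    return feats

print(surface_features(
    {"offset": 12, "clause_id": 2, "in_quote": "no", "embedding": "null"},
    {"offset": 3, "clause_id": 1},
    ["yi3jing1"], ["le0"]))
```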
<Section position="2" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 4.2 Latent Features </SectionTitle> <Paragraph position="0"> A bottleneck in Artificial Intelligence is the imbalance between the knowledge sources available to human beings and those available to a computer system. Only a subset of the knowledge sources used by human beings can be formalized, extracted, and fed into a computer system; the rest are less accessible and very hard to share with a computer system. Despite their importance in human language processing, latent features have received little attention in the feature space exploration of most NLP tasks because they are impractical to extract. Although there have not yet been rigorous psycholinguistic studies demonstrating the extent to which these knowledge types are used in human temporal relation processing, we hypothesize that they are very significant in assisting humans' temporal relation decisions. Nevertheless, a quantitative assessment of the utility of latent features in NLP tasks has yet to be carried out. Olsen et al. (2001) illustrate the value of latent features by showing how the telicity feature alone can help with tense resolution in Chinese-to-English machine translation. Given the prevalence of latent features in human language processing, experimenting with latent features in automatic tense classification is crucial if we are to emulate human performance on this disambiguation task.</Paragraph> <Paragraph position="1"> Pustejovsky (2004) discusses the four basic problems in event-temporal identification: 1. Time-stamping of events (identifying an event and anchoring it in time) 2. Ordering events with respect to one another 3. Reasoning with contextually under-specified temporal expressions 4. Reasoning about the persistence of events (how long does an event or the outcome of an event last?)</Paragraph> <Paragraph position="2"> While time-stamping of events and reasoning with contextually under-specified temporal expressions might be too information-rich to serve as features in tense classification, information concerning the ordering of events and the persistence of events is relatively easy to encode as features in a tense classification task. Therefore, we experiment with these two latent knowledge sources, both of which are heavily utilized by human beings in tense resolution.</Paragraph> <Section position="3" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 4.3 Telicity and Punctuality Features </SectionTitle> <Paragraph position="0"> Following Vendler (1947), we assume that the temporal information encoded in verbs is largely captured by certain innate properties of verbs, of which telicity and punctuality are two very important ones. Telicity specifies a verb's ability to be bounded in a certain time span, while punctuality specifies whether or not a verb is associated with a point event in time. Telicity and punctuality predispose verbs to be assigned different tenses when they enter the context of the discourse. While it is true that isolated verbs are typically associated with certain telicity and punctuality values, such features are contextually volatile. In reaction to this volatility, we propose that verb telicity and punctuality should be evaluated only at the clausal or sentential level for the tense classification task. We manually obtained these two features for both the English and the Chinese verbs: all verbs in our data set were manually tagged as &quot;telic&quot; or &quot;atelic&quot;, and as &quot;punctual&quot; or &quot;apunctual&quot;, according to context.</Paragraph> </Section> <Section position="4" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 4.4 Temporal Ordering Feature </SectionTitle> <Paragraph position="0"> Allen (1981) defines thirteen relations that can possibly hold between any pair of situations. We experiment with six temporal relations which we think represent the most typical temporal relationships between two events. We did not adopt all thirteen of the temporal relationships proposed by Allen because some of them would require excessive deliberation from the annotators and would be hard to implement. The six relationships we explore are as follows (a small encoding sketch appears after the list): 1. event A precedes event B 2. event A succeeds event B 3. event A includes event B 4. event A subsumes event B 5. event A overlaps with event B 6. no temporal relation holds between event A and event B</Paragraph>
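The sketch below shows one way the manually annotated latent features of Sections 4.3 and 4.4 could be turned into classifier features; the annotation record format is a hypothetical assumption for this sketch.

```python
# Hypothetical encoding of the manually annotated latent features.
TEMPORAL_RELATIONS = {"precedes", "succeeds", "includes",
                      "subsumes", "overlaps", "none"}

def latent_features(annotation):
    """annotation: an assumed record holding the manual tags for one verb:
    telicity ('telic'/'atelic'), punctuality ('punctual'/'apunctual'),
    and its temporal relation to the previously tagged verb."""
    assert annotation["relation"] in TEMPORAL_RELATIONS
    return {
        "telicity": annotation["telicity"],
        "punctuality": annotation["punctuality"],
        "relation_to_prev": annotation["relation"],
    }

print(latent_features({"telicity": "telic",
                       "punctuality": "punctual",
                       "relation": "precedes"}))
```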
<Paragraph position="1"> For each Chinese verb in the source Chinese texts, we annotate the temporal relation between the verb and the previously tagged verb as belonging to one of the above classes. The annotation of the temporal relation classes mimics a deeper semantic analysis of the Chinese source text. Figure 1 illustrates a sentence in which each verb is tagged with the temporal relation class that holds between it and the previous verb.</Paragraph> <Paragraph position="2"> [Figure 1. Example sentence (English translation): He said that Henan Province not only possesses the hardware necessary for foreign investment, but also has, on the basis of the State policies and Henan's specific conditions, formulated its own preferential policies.]</Paragraph> </Section> </Section> </Paper>