File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1101_metho.xml
Size: 24,202 bytes
Last Modified: 2025-10-06 14:08:39
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1101"> <Title>Combining Linguistic Features with Weighted Bayesian Classifier for Temporal Reference Processing</Title> <Section position="3" start_page="1" end_page="1" type="metho"> <SectionTitle> 2 Representing Temporal Relations </SectionTitle> <Paragraph position="0"> With the growing interests to temporal information processing in NLP, a variety of temporal systems have been introduced to accommodate the characteristics of temporal information. In order to process temporal reference in a discourse, a formal represensentence relations.</Paragraph> <Paragraph position="1"> tation of temporal relations is required. Among those who worked on representing or explaining temporal relations, some have taken the work of Reichenbach (Reichenbach, 1947) as a starting point, while others based their works on Allen's (Allen, 1983).</Paragraph> <Paragraph position="2"> Reichenbach proposed a point-based temporal theory. Reichenbach's representation associated English tenses and aspects with three time points, namely event time (E), speech time (S) and reference time (R). The reference of E-R and R-S was either before (or after in reverse order) or simultaneous. This theory was later enhanced by Bruce who defined seven temporal relations (Bruce, 1972). Given two durative events, the interval relations between them were modeled by the order between the greatest lower bounding point and least upper bounding point of the two events. In the other camp, instead of adopting time points, Allen took intervals as temporal primitives to facilitate temporal reasoning and introduced thirteen basic relations. In this interval-based representation, points were relegated to a subsidiary status as &quot;meeting places&quot; of intervals. An extension to Allen's theory, which treated both points and intervals as primitives on an equal footing, was later investigated by Knight and Ma (Knight, 1994).</Paragraph> <Paragraph position="3"> In natural languages, events described can be either punctual or durative in nature. A punctual event, e.g., Bao Zha (explore), occurs instantaneously. It takes time but does not last in a sense that it lacks of a process of change. It is adequate to represent a punctual event with a simple point structure. Whilst, a durative event, e.g., Gai Lou (built a house), is more complex and its accomplishment as a whole involves a process spreading in time. Representing a durative event requires an interval representation. For this reason, Knight and Ma's model is adopted in our work (see Figure 1). Taking the sentence &quot;Xiu Cheng Li Jiao Qiao Yi Hou , Ta Men Jie Jue Liao Gai Shi De Jiao Tong Wen Ti (They solved the traffic problem of the city after the street bridge had been built)&quot; as an example, the relation held between building the bridge (i.e., an interval) and solving the problem (i.e., a point) is BEFORE.</Paragraph> <Paragraph position="4"> Figure 1 13 relations represented with points and intervals</Paragraph> </Section> <Section position="4" start_page="1" end_page="3" type="metho"> <SectionTitle> 3 Linguistic Background of Temporal Refer- </SectionTitle> <Paragraph position="0"> ence in a Discourse</Paragraph> <Section position="1" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.1 Literature Review </SectionTitle> <Paragraph position="0"> There were a number of theories in the literature about how temporal relations between events can be determined in English. 
Most of the research on temporal reference was based on Reichenbach's notion of tense/aspect structure, known as the Basic Tense Structure (BTS). For relating two events adjoined by a temporal/causal connective, Hornstein (Hornstein, 1990) proposed a neo-Reichenbachian structure which organizes the BTSs into a Complex Tense Structure (CTS). It has been argued that all sentences containing a matrix and an adjunct clause are subject to linguistic constraints on tense structure, regardless of the lexical words included in the sentence. Generally, such constraints were used to support syntactic disambiguation (Brent, 1990) or to generate acceptable sentences (Dorr, 2002).</Paragraph> <Paragraph position="1"> In a given CTS, the event described by a past perfect clause should precede the event described by a simple past clause.</Paragraph> <Paragraph position="2"> However, the order of two events in a CTS does not necessarily correspond to the order imposed by the interpretation of the connective (Dorr, 2002). Temporal/causal connectives, such as &quot;after&quot;, &quot;before&quot; or &quot;because&quot;, can supply explicit information about the temporal ordering of events. Passonneau (Passonneau, 1988), Brent (Brent, 1990) and Sing (Sing, 1997) determined intra-sentential relations by accounting for temporal or causal connectives. Dorr and Gaasterland (Dorr, 2002), on the other hand, studied how to generate sentences that reflect the temporal relations between events by selecting proper connecting words. However, temporal connectives can be ambiguous.</Paragraph> <Paragraph position="3"> For instance, a &quot;when&quot; clause permits many possible temporal relations.</Paragraph> <Paragraph position="4"> Several researchers have developed models that incorporate aspectual types (such as the distinctions among states, processes and events) to interpret temporal relations between clauses connected with &quot;when&quot;. Moens and Steedman (Moens, 1988) developed a tripartite structure of events (comprising a culmination, an associated preparatory process and a consequence state), and emphasized that it is the notion of causation and consequence that plays a central role in defining the temporal relations of events. Webber (Webber, 1988) improved upon the above work by specifying rules for how events are related to one another in a discourse, and Sing and Sing (Sing, 1997) defined semantic constraints through which events can be related. The importance of aspectual information in retrieving proper aspects and connectives for sentence generation was also recognized by Dorr and Gaasterland (Dorr, 2002).</Paragraph> <Paragraph position="5"> Some studies claimed that discourse structure suggests temporal relations. Lascarides and Asher (Lascarides, 1991) investigated various contextual effects on rhetorical relations (such as narration, elaboration, explanation, background and result).</Paragraph> <Paragraph position="6"> They mapped each of these discourse relations to a kind of temporal relation. Later, Hitzeman (Hitzeman, 1995) described a method for analyzing the temporal structure of a discourse by taking into account the effects of tense, aspect, temporal adverbials and rhetorical relations, so that these sources of information could mutually constrain each other.</Paragraph>
<Paragraph position="8"> To summarize, the interpretation of temporal relations draws on a combination of information sources, including explicit tense/aspect and connectives (temporal or otherwise), the temporal classes implicit in events, and the rhetorical relations hidden in a discourse. This conclusion, although drawn from studies of English, provides a common understanding of what information is required for determining temporal relations across languages.</Paragraph> </Section> <Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 3.2 Linguistic Features for Determining Temporal Relations in Chinese </SectionTitle> <Paragraph position="0"> Thirteen related linguistic features are recognized for determining Chinese temporal relations in this paper (see Table 1). The selected features are scattered across various grammatical categories owing to the nature of the language, but they fall into the following three groups.</Paragraph> <Paragraph position="1"> (1) Tense/aspect in English is manifested by verb inflections, but such morphological variations are inapplicable to Chinese verbs. Instead, tense and aspect are conveyed lexically. In other words, tense and aspect in Chinese are expressed using a combination of, for example, time words, auxiliaries, temporal position words, adverbs, prepositions and particular verbs. These are known as Tense/Aspect Markers.</Paragraph> <Paragraph position="2"> (2) Temporal Connectives in English primarily involve conjunctions, such as &quot;after&quot; and &quot;before&quot;, which are the key components of discourse structures. In Chinese, however, conjunctions, conjunctive adverbs, prepositions and position words, or their combinations, are required to represent connectives. A few verbs that express cause/effect also imply a temporal relation, and they are likewise regarded as a feature relating to discourse structure (causal conjunctions such as &quot;because&quot; are included in this group). The words which contribute to the tense/aspect and temporal connective expressions are explicit in a sentence and are generally known as Temporal Indicators.</Paragraph> <Paragraph position="3"> (3) Event Classes are implicit in a sentence. Events can be classified according to their inherent temporal characteristics, such as the degree of telicity and atomicity. The four widely accepted temporal classes are state, process, punctual event and developing event (Li, 2002). Based on their classes, events interact with the tense/aspect of verbs to determine the temporal relations between two events.</Paragraph> <Paragraph position="4"> Temporal indicators and event classes are both referred to as Linguistic Features. Table 1 shows the association between a temporal indicator and its effects. Note that the association is not one-to-one. For example, adverbs affect tense/aspect (e.g., Zheng, being) as well as discourse structure (e.g., Bian, at the same time). As another example, tense/aspect can be jointly affected by auxiliary words (e.g., Guo, were/was), trend verbs (e.g., Qi Lai, begin to), and so on.</Paragraph>
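Before turning to the learning approach, a small illustrative sketch (not from the paper) shows one way the three feature groups above could be organised for a clause pair. The indicator entries, tags and function names are invented examples and do not reproduce the paper's 209-word indicator list or Table 1.

    # Illustrative sketch only: one way to organise the linguistic features of
    # Section 3.2 for a pair of clauses.  The indicator entries and tags below
    # are invented examples, not the paper's actual indicator list.

    # Temporal indicators, grouped by the role they mainly signal.
    TENSE_ASPECT_MARKERS = {"Guo": "aux", "Qi Lai": "trend_verb", "Zheng": "adverb"}
    CONNECTIVE_WORDS     = {"Yi Hou": "position_word", "Yin Wei": "conjunction",
                            "Dao Zhi": "causal_verb", "Zhi Qian": "position_word"}

    # The four widely accepted event classes (Li, 2002).
    EVENT_CLASSES = {"state", "process", "punctual_event", "developing_event"}

    def extract_features(tokens, event1_class, event2_class):
        """Collect the feature groups observed in a tokenised clause pair."""
        assert {event1_class, event2_class} <= EVENT_CLASSES
        return {
            "tense_aspect": [w for w in tokens if w in TENSE_ASPECT_MARKERS],
            "connectives":  [w for w in tokens if w in CONNECTIVE_WORDS],
            "event_classes": (event1_class, event2_class),
        }

    # "Xiu Cheng Li Jiao Qiao Yi Hou, Ta Men Jie Jue Liao ... Wen Ti"
    feats = extract_features(["Xiu Cheng", "Li Jiao Qiao", "Yi Hou", "Jie Jue"],
                             "developing_event", "punctual_event")
    print(feats["connectives"])   # ['Yi Hou']

A real feature extractor would of course draw its entries from Table 1 and from a segmented, POS-tagged corpus rather than from hand-written dictionaries.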
<Paragraph position="5"> Obviously, it is not a simple task to map the combined effects of the thirteen linguistic features to the corresponding relations. Therefore, a machine learning approach is proposed, which investigates how these features contribute to the task and how they should be combined.</Paragraph> </Section> </Section> <Section position="6" start_page="3" end_page="321" type="metho"> <SectionTitle> 4 Combining Linguistic Features with Machine Learning Approach </SectionTitle> <Paragraph position="0"> Previous efforts in corpus-based NLP have incorporated machine learning methods to coordinate multiple linguistic features, for example, in accent restoration (Yarowsky, 1994) and event classification (Siegel, 1998).</Paragraph> <Paragraph position="1"> Temporal relation determination can be modeled as a relation classification task. We formulate the thirteen temporal relations (see Figure 1) as the classes to be decided by a classifier. The classification process assigns an event pair to one class according to its linguistic features. There exist numerous classification algorithms based on the supervised learning principle. One of the most effective is the Bayesian Classifier, introduced by Duda and Hart (Duda, 1973) and analyzed in more detail by Langley and Thompson (Langley, 1992). Its predictive performance is competitive with state-of-the-art classifiers, such as C4.5 and SVM (Friedman, 1997).</Paragraph> <Paragraph position="2"> Table 1 Linguistic features: eleven temporal indicators and one event class</Paragraph> <Section position="2" start_page="3" end_page="21" type="sub_section"> <SectionTitle> 4.1 Bayesian Classifier </SectionTitle> <Paragraph position="0"> Given the class c, a Bayesian Classifier learns from training data the conditional probability of each attribute. Classification is performed by applying Bayes rule to compute the posterior probability of c given a particular instance x, and then predicting the class with the highest posterior probability ratio. Let x = (e1, e2, t1, ..., tn), where e1 and e2 are the classes of the two events and t1, ..., tn are the temporal indicators (i.e. the words). E is the set of event classes and T is the set of temporal indicators. Then x is classified as:</Paragraph> <Paragraph position="2"> c* = argmax_c P(c|x) / P(c'|x) = argmax_c [P(c) P(x|c)] / [P(c') P(x|c')], where c' denotes the classes different from c. Assuming event classes are independent of temporal indicators given c, we have:</Paragraph> <Paragraph position="4"> P(x|c) = P(e1, e2|c) P(t1, ..., tn|c)</Paragraph> <Paragraph position="6"> Assuming temporal indicators are independent of each other, we have</Paragraph> <Paragraph position="8"> P(t1, ..., tn|c) = P(t1|c) P(t2|c) ... P(tn|c)</Paragraph> <Paragraph position="9"> A Naive Bayesian Classifier assumes strict independence among all attributes. However, this assumption is not satisfactory in the context of temporal relation determination. The conditional probabilities are estimated from the training data with smoothing, where u (=0.5) is the smoothing factor.</Paragraph> </Section> <Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.2 Estimating P(t1, ..., tn|c) </SectionTitle> <Paragraph position="0"> The effects of a temporal indicator are constrained by its position in a sentence. For instance, the conjunctive word Yin Wei (because) may represent different relations when it occurs before or after the first event.</Paragraph>
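The following Python sketch illustrates the classifier described in Section 4.1, before the position and grouping refinements introduced next are added. It is a minimal illustration, not the paper's implementation: the add-u smoothing shown here (with u = 0.5, as stated above) is a generic assumption standing in for the paper's estimation formula, and all class and variable names are invented.

    from collections import Counter, defaultdict
    from math import log

    class NaiveTemporalClassifier:
        """Score(c|x) ~ P(c) * P(e1,e2|c) * prod_j P(tj|c), with add-u smoothing."""

        def __init__(self, u=0.5):
            self.u = u                               # smoothing factor, as in the text
            self.class_count = Counter()             # N(c)
            self.pair_count = defaultdict(Counter)   # N((e1,e2), c)
            self.ind_count = defaultdict(Counter)    # N(t, c)
            self.pairs, self.indicators = set(), set()

        def train(self, data):
            """data: iterable of (event_class_pair, indicator_list, relation)."""
            for pair, indicators, c in data:
                self.class_count[c] += 1
                self.pair_count[c][pair] += 1
                self.pairs.add(pair)
                for t in indicators:
                    self.ind_count[c][t] += 1
                    self.indicators.add(t)

        def _smoothed(self, count, n_c, n_values):
            # generic add-u smoothing; not the paper's exact estimation formula
            return (count + self.u) / (n_c + self.u * max(n_values, 1))

        def _log_score(self, pair, indicators, c):
            n_c = self.class_count[c]
            score = log(n_c / sum(self.class_count.values()))          # P(c)
            score += log(self._smoothed(self.pair_count[c][pair], n_c,
                                        len(self.pairs)))              # P(e1,e2|c)
            for t in indicators:                                       # independence
                score += log(self._smoothed(self.ind_count[c][t], n_c,
                                            len(self.indicators)))
            return score

        def classify(self, pair, indicators):
            """Return the relation class with the highest score."""
            return max(self.class_count,
                       key=lambda c: self._log_score(pair, indicators, c))

Maximising this smoothed log-score over c is equivalent to maximising the posterior probability ratio described above, since the ratio is monotonic in P(c|x).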
<Paragraph position="1"> Therefore, in estimating P(t1, ..., tn|c), we consider an indicator located in four positions: (1) BEFORE the first event; (2) AFTER the first event and BEFORE the second, modifying the first event; (3) the same as (2), but modifying the second event; and (4) AFTER the second event. Note that cases (2) and (3) are ambiguous: the position of the temporal indicator is the same, but it is uncertain whether the indicator modifies the first or the second event if there is no punctuation (such as a comma, period, exclamation or question mark) separating their roles. The ambiguity is resolved by using POS information. We assume that an indicator modifies the first event if it is an auxiliary word, a trend word or a position word; otherwise it modifies the second.</Paragraph> <Paragraph position="2"> Thus, we rewrite P(t1, ..., tn|c) to take the positions of the indicators into account. In addition to taking positions into account, we further classify the temporal indicators into two groups according to their grammatical categories or semantic roles. The rationale for the grouping will be demonstrated in Section 4.3.</Paragraph> </Section> <Section position="4" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.3 Experimental Results </SectionTitle> <Paragraph position="0"> Several experiments have been designed to evaluate the proposed Bayesian Classifier in combining linguistic features for temporal relation determination and to reveal the impact of the linguistic features on learning performance. 700 instances are extracted from the financial pages of Ta Kong Pao (a local Hong Kong Chinese newspaper). Among them, 500 are used as training data and 200 as test data; the test data are partitioned equally into two sets. One is similar to the training data in class distribution, while the other is quite different. 209 lexical words, gathered from linguistics books and the corpus, are used as temporal indicators and manually marked with the tags given in Table 1.</Paragraph> <Paragraph position="1"> From a linguistic perspective, the thirteen features (see Table 1) are useful for temporal relation determination. To examine the impact of each individual feature, we feed a single linguistic feature to the Bayesian Classifier learning algorithm one at a time and study the accuracy of the resultant classifier. The experimental results are given in Table 2. They show that in the closed test event classes give the greatest accuracy, followed by conjunctions in second place and adverbs in third. Since punctuation shows no contribution, we only use it as a syntactic feature to differentiate cases (2) and (3) mentioned in Section 4.2.</Paragraph> <Paragraph position="2"> We now use the Bayesian Classifier introduced in Sections 4.1 and 4.2 to combine all the related temporal indicators and event classes, since none of the features can achieve a good result alone. The simplest way is to combine the features without distinction.</Paragraph> <Paragraph position="3"> The conditional probability P(tj|c) is estimated by (E7'). This model is called the Ungrouped Model (UG). However, as illustrated in Table 1, the temporal indicators play different roles in building temporal reference, and it is not reasonable to treat them equally. We claim that the temporal indicators have two functions: representing the connections between clauses, or representing the tense/aspect of the events. We identify them as connective words or tense/aspect markers and separate them into two groups.
This allows features to be compared with those in the same group. Let m and l be the numbers of connective words and tense/aspect markers in a sentence, respectively. We assume that the occurrences of the two groups are independent. By taking both the grouping and the position features into account, we replace P(t1, ..., tn|c) with the product of the conditional probabilities of the two groups.</Paragraph> <Paragraph position="7"> Grouping Based on Grammatical Categories or Semantic Roles. We partition the temporal indicators into connective words and tense/aspect markers in two ways. One is based simply on their grammatical categories (i.e. POS information). It separates conjunctions (e.g., Ran Hou, after; Yin Wei, because) and verbs relating to causality (e.g., Dao Zhi, cause) from the others; these are assumed to be the connective words. This model is called the Grammatical Function based Grouping Model (GFG). Unfortunately, such a separation is ineffective. In comparison with UG, the performance of GFG decreases, as shown in Figure 2. This reveals the complexity of Chinese connecting expressions. It arises from the fact that some other words, such as adverbs (e.g., Bian ... Bian, meanwhile), prepositions (e.g., Zai, at) and position words (e.g., Zhi Qian, before), can also serve such a connecting function (see Table 1). In fact, the roles of the words falling into these grammatical categories are ambiguous. For instance, the adverb Cai can express an event that happened in the past, e.g., &quot;Ta Cai Gang Gang Xie Wan Bao Gao (He just finished the report)&quot;. It can also be used in a connecting expression (such as Cai ... You ...), e.g., &quot;Ta Cai Xie Wan Bao Gao You Qu Tu Shu Guan Liao (He went to the library after he had finished the report)&quot;.</Paragraph> <Paragraph position="9"> This finding suggests that temporal indicators should be divided into two groups according to their semantic roles rather than their grammatical categories.</Paragraph> <Paragraph position="10"> We therefore propose a third model, namely the Semantic Role based Grouping Model (SRG), in which the indicators are manually re-marked as TI_j_pos or TI_at_pos (&quot;j&quot; and &quot;at&quot; are the tags representing the connecting and tense/aspect roles respectively, and &quot;pos&quot; is the POS tag of the temporal indicator TI).</Paragraph> <Paragraph position="11"> Figure 2 shows the accuracies of the four models (i.e., DM, UG, GFG and SRG) in the three tests.</Paragraph> <Paragraph position="12"> Test 1 is the closed test carried out on the training data, and tests 2 and 3 are open tests performed on the two different test sets. DM (the Default Model) assigns every incoming case the most likely class and is used as the evaluation baseline. In our case this is SAME_AS, which accounts for 50.2% of the training data.</Paragraph> <Paragraph position="13"> The SRG model outperforms the UG and GFG models.</Paragraph> <Paragraph position="14"> These results validate our previous assumption empirically. When the temporal indicators are classified into two groups based on their semantic roles in the SRG model, there are three types of linguistic features used in the Bayesian Classifier, i.e., tense/aspect markers, connective words and event classes. A set of experiments is conducted to investigate the impact of each individual feature type and the impact when they are used in combination (shown in Table 3). We find that the performance of methods 1 and 2 in the open tests drops dramatically compared with that in the closed test.
But the predictive strength of event classes in method 3 is surprisingly high. Two conclusions are thus drawn. Firstly, the models using tense/aspect markers and connective words are more likely to encounter the over-fitting problem when training data are insufficient. Secondly, different features carry different weights. We therefore incorporate an optimization approach to adjust the weights of the three types of features, and propose an algorithm to tackle the over-fitting problem in the next section.</Paragraph> </Section> <Section position="5" start_page="21" end_page="321" type="sub_section"> <SectionTitle> 5 Weighted Bayesian Classifier </SectionTitle> <Paragraph position="0"> Let λ1, λ2 and λ3 be the weights of event classes, connective words and tense/aspect markers, respectively. The Weighted Bayesian Classifier is then obtained by incorporating these weights into the classifier of Section 4. In order to estimate the weights, we need a suitable optimization approach to search for the optimal value of [λ1, λ2, λ3] automatically.</Paragraph> </Section> <Section position="6" start_page="321" end_page="321" type="sub_section"> <SectionTitle> 5.1 Estimating Weights with Simulated Annealing Algorithm </SectionTitle> <Paragraph position="0"> Quite a few optimization approaches are available for computing the optimal value of [λ1, λ2, λ3]. Here, the Simulated Annealing algorithm, a general and powerful optimization approach with excellent global convergence (Kirkpatrick, 1983), is employed to perform the task. Figure 3 shows the procedure for searching for an optimal weight vector with this algorithm. Note that the initial temperature is critical for a simulated annealing algorithm (Kirkpatrick, 1983); its value should ensure that the initial acceptance rate is greater than 90%.</Paragraph> </Section> <Section position="7" start_page="321" end_page="321" type="sub_section"> <SectionTitle> 5.2 K-fold Cross-Validation </SectionTitle> <Paragraph position="0"> The accuracy of the classifier is defined as the objective function of the Simulated Annealing algorithm illustrated in Figure 3. If it were evaluated as the accuracy over all training data, the Weighted Bayesian Classifier might run into the over-fitting problem and its performance might drop due to insufficient data. To avoid this, we employ the K-fold Cross-Validation technique. It partitions the original data set into K parts. One part is selected arbitrarily as evaluation data and the other K-1 parts as training data. K accuracies on the evaluation data are then obtained after K iterations, and their average is used as the objective function.</Paragraph> </Section> <Section position="8" start_page="321" end_page="321" type="sub_section"> <SectionTitle> 5.3 Experimental Results </SectionTitle> <Paragraph position="0"> Table 4 shows the result of the experiment which compares WSRG (Weighted SRG) with SRG. We use error reduction to evaluate the benefit of incorporating the weight parameters into the Bayesian Classifier; it is defined as the relative reduction in error rate achieved by WSRG over SRG.</Paragraph> <Paragraph position="1"> The experimental results show that the Weighted Bayesian Classifier outperforms the Bayesian Classifier significantly in the two open tests and tackles the over-fitting problem well. To test the Simulated Annealing algorithm's global convergence, we randomly choose several initial values; they finally converge to a small region, [7.2±0.09, 5.8±0.02, 3.0±0.02]. This empirical result demonstrates that the output of the Simulated Annealing algorithm is a globally optimal weight vector.</Paragraph> </Section> </Section> </Paper>
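To make Sections 5.1-5.2 concrete, the sketch below shows one way the weight vector [λ1, λ2, λ3] could be searched with simulated annealing, using K-fold cross-validation accuracy as the objective. It is illustrative only: the log-linear way of applying the weights, the Gaussian perturbation, the cooling schedule and all constants are assumptions made for this example and are not taken from the paper.

    # Sketch only: a simulated-annealing search for the weight vector
    # [lambda1, lambda2, lambda3] of Section 5, with K-fold cross-validation
    # accuracy as the objective (Section 5.2).  All settings are assumptions.

    import math
    import random

    def weighted_log_score(log_p_class, log_p_events, log_p_conn, log_p_tense, lam):
        """Apply the three weights to the feature-group log-likelihoods (assumed form)."""
        l1, l2, l3 = lam
        return log_p_class + l1 * log_p_events + l2 * log_p_conn + l3 * log_p_tense

    def cv_accuracy(lam, folds, evaluate):
        """Average accuracy over K folds; evaluate(lam, train, test) is user supplied."""
        return sum(evaluate(lam, train, test) for train, test in folds) / len(folds)

    def anneal(folds, evaluate, t0=1.0, cooling=0.95, steps=500, seed=0):
        rng = random.Random(seed)
        lam = [1.0, 1.0, 1.0]                      # initial weights
        best, best_acc = lam[:], cv_accuracy(lam, folds, evaluate)
        acc, t = best_acc, t0                      # t0 chosen so early moves are accepted
        for _ in range(steps):
            cand = [max(0.0, w + rng.gauss(0, 0.5)) for w in lam]   # perturb weights
            cand_acc = cv_accuracy(cand, folds, evaluate)
            # accept better candidates always, worse ones with Boltzmann probability
            if cand_acc >= acc or rng.random() < math.exp((cand_acc - acc) / t):
                lam, acc = cand, cand_acc
                if acc > best_acc:
                    best, best_acc = lam[:], acc
            t *= cooling                           # cool down
        return best, best_acc

Error reduction (Section 5.3) can then be reported as (err_SRG - err_WSRG) / err_SRG, assuming the conventional definition of the term.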