<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0402">
  <Title>Feature Engineering and Post-Processing for Temporal Expression Recognition Using Conditional Random Fields</Title>
  <Section position="5" start_page="10" end_page="11" type="metho">
    <SectionTitle>
3 Feature Engineering
</SectionTitle>
    <Paragraph position="0"> The success of applying CRFs depends on the quality of the set of features used and the tagging scheme chosen. Below, we discuss these two aspects in greater detail.</Paragraph>
    <Section position="1" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
3.1 Feature sets
</SectionTitle>
      <Paragraph position="0"> Our baseline feature set consists of simple lexical and character features. These features are derived from a context window of two words (left and right).</Paragraph>
      <Paragraph position="1"> Specifically, the lowercased form of each token in the span contributes a separate feature, and the lowercased tokens in the left and right context windows constitute another set of features. These feature sets capture the lexical content and context of timexes. Additionally, character-type pattern features (such as capitalization and digit sequences) of the tokens in the timexes capture the character patterns exhibited by some of the tokens in temporal expressions. Together, these features constitute the basic feature set.</Paragraph>
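The basic feature set just described can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the feature-name strings (w=, shape=, and so on) are our own invention.

```python
import re

def char_shape(token):
    # Collapse character runs into type symbols: "X" for uppercase,
    # "x" for lowercase, "9" for digits, so "May" gives "Xx" and
    # "1993" gives "9".
    shape = re.sub(r"[A-Z]+", "X", token)
    shape = re.sub(r"[a-z]+", "x", shape)
    shape = re.sub(r"[0-9]+", "9", shape)
    return shape

def basic_features(tokens, i, window=2):
    # Lowercased form of the current token, its character-type pattern,
    # and the lowercased tokens in a two-word window on each side.
    feats = ["w=" + tokens[i].lower(), "shape=" + char_shape(tokens[i])]
    for offset in range(1, window + 1):
        if i - offset >= 0:
            feats.append("w-%d=%s" % (offset, tokens[i - offset].lower()))
        if len(tokens) > i + offset:
            feats.append("w+%d=%s" % (offset, tokens[i + offset].lower()))
    return feats
```

For the token 'week' in 'During the past week , ...', this yields its lowercased form, its shape pattern, and the two context tokens on each side.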
      <Paragraph position="2"> Another important feature is the list of core timexes. The list is obtained by first extracting the phrases with -TMP function tags from the Penn Treebank (Marcus et al., 1993) and taking the words in these phrases. The resulting list is filtered for stopwords. Among others, the list contains the names of the days of the week and the months, and temporal units such as 'day,' 'month,' and 'year.' This list is used to generate binary features. In addition, the list is used to guide the design of other, more complex features that may involve one or more token-tag pairs in the context of the current token. One way of using the list for this purpose is to generate a feature over token bi-grams. In certain cases, information extracted from bi-grams, e.g. +Xx 99 (May 20), can be more informative than information generated from individual tokens. We refer to these features as the list feature set.</Paragraph>
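A minimal sketch of the list feature set, using a hypothetical, much-abbreviated list of core timexes and the same character-type patterns as the basic features; the feature names are our own:

```python
import re

# Hypothetical, much-abbreviated list of core timexes, as might be
# extracted from -TMP phrases in the Penn Treebank and filtered
# for stopwords.
CORE_TIMEXES = {"day", "week", "month", "year", "monday", "may", "yesterday"}

def shape(token):
    # Character-type pattern, as in the basic feature set.
    return re.sub(r"[0-9]+", "9",
           re.sub(r"[a-z]+", "x",
           re.sub(r"[A-Z]+", "X", token)))

def list_features(tokens, i):
    feats = []
    if tokens[i].lower() in CORE_TIMEXES:
        feats.append("in-list")  # binary list-membership feature
    if i >= 1:
        # Character-pattern bi-gram over the previous and current token,
        # e.g. "May 20" yields "Xx_9", which can be more informative
        # than the patterns of the individual tokens.
        feats.append("shape-bigram=%s_%s" % (shape(tokens[i - 1]),
                                             shape(tokens[i])))
    return feats
```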
    </Section>
    <Section position="2" start_page="10" end_page="11" type="sub_section">
      <SectionTitle>
3.2 Tagging schemes
</SectionTitle>
      <Paragraph position="0"> A second aspect of feature engineering that we consider in this paper concerns different tagging schemes. As mentioned previously, the task of recognizing timexes is reduced to a sequence-labeling task. We compare three tagging schemes, IO (our baseline), BCEUN, and BCEUN+PRE&amp;POST.</Paragraph>
      <Paragraph position="1"> While the first two are relatively standard, the last one is an extension of the BCEUN scheme. The intuition underlying this tagging scheme is that the most relevant features for timex recognition are extracted from the immediate context of the timex, e.g., the word 'During' in (1) below.</Paragraph>
      <Paragraph position="2">  (1) During &lt;TIMEX2&gt;the past week&lt;/TIMEX2&gt;,  the storm has pounded the city.</Paragraph>
      <Paragraph position="3"> During-PRE the-B past-C week-E ,-POST the storm has pounded the city.</Paragraph>
      <Paragraph position="4"> Therefore, instead of treating these elements uniformly as outside (N), which ignores their relative importance, we conjecture that it is worthwhile to assign them a special category, namely PRE and POST for the tokens immediately preceding and following a timex, and that this leads to improved results.</Paragraph>
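The three tagging schemes can be illustrated with a small span-to-tags converter. This is a sketch under our own naming conventions (the PRE and POST extension is spelled BCEUN_PRE_POST in code), not the authors' implementation:

```python
def tag_sequence(n_tokens, spans, scheme="BCEUN"):
    # spans: (start, end) token-index pairs per timex, end exclusive.
    # "IO" marks tokens inside a timex I and all others O. "BCEUN" marks
    # begin, continuation, end, unique (single-token timex), and outside
    # (N). "BCEUN_PRE_POST" additionally marks the tokens immediately
    # preceding and following a timex as PRE and POST.
    if scheme == "IO":
        tags = ["O"] * n_tokens
        for start, end in spans:
            for j in range(start, end):
                tags[j] = "I"
        return tags
    tags = ["N"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "U"
        else:
            tags[start] = "B"
            tags[end - 1] = "E"
            for j in range(start + 1, end - 1):
                tags[j] = "C"
    if scheme == "BCEUN_PRE_POST":
        for start, end in spans:
            if start >= 1 and tags[start - 1] == "N":
                tags[start - 1] = "PRE"
            if n_tokens > end and tags[end] == "N":
                tags[end] = "POST"
    return tags
```

For example (1), 'During the past week , ...' with the timex spanning tokens 1 to 3, the extended scheme reproduces the tagging During-PRE the-B past-C week-E ,-POST.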
    </Section>
  </Section>
  <Section position="6" start_page="11" end_page="12" type="metho">
    <SectionTitle>
4 Post-processing Using a List
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the proposed method for incorporating a list of core lexical timexes for post-processing the output of a machine learner. As we will see below, although the baseline system (with the IO tagging scheme and the basic feature set) achieves high accuracy, the recall scores leave much to be desired. One important problem that we have identified is that timexes headed by core lexical items on the list may be missed. This is either because some of these lexical items are semantically ambiguous and appear in a non-temporal sense, or because the training material does not cover the particular context. In such cases, a reliable list of core timexes can be used to identify the missing timexes.</Paragraph>
    <Paragraph position="1"> For the purposes of this paper, we have created a list containing mainly headwords of timexes. These words are called trigger words since they are good indicators of the presence of temporal expressions.</Paragraph>
    <Paragraph position="2"> How can we use trigger words? Before describing our method in some detail, we briefly describe a more naive (and problematic) approach. Observe that trigger words usually appear in a text along with their complements or adjuncts. As a result, picking only these words will usually contribute to token recall, but span precision is likely to drop. Furthermore, there is no principled way of deciding which occurrences to pick (semantically ambiguous elements will also be picked). Our method, in contrast, aims to take into account the knowledge acquired by the trained model and to search for the next optimal sequence of tags, one which assigns the missed timex a non-negative tag. However, searching for this sequence over the whole word sequence is impractical, since the number of possible tag sequences (the number of all possible paths in a Viterbi search) is very large. But if one limits the search to a window of size n (n &lt; 6), sequential search becomes feasible. The method, then, works on the output of the system. We illustrate it using the example given in (2) below.</Paragraph>
    <Paragraph position="3"> (2) The chairman arrived in the city yesterday, and will leave next week. The press conference will be held tomorrow afternoon.</Paragraph>
    <Paragraph position="4"> Now, assume that (2) is a test instance (a two-sentence document), and that the system returns the following best sequence (3). For readability, the tag N is not shown on the words that are assigned negative tags in all the examples below.</Paragraph>
    <Paragraph position="5"> (3) The chairman arrived in the city yesterday-U , and will leave next week . The press conference will be held tomorrow-B afternoon-E .</Paragraph>
    <Paragraph position="6"> According to (3), the system recognizes only 'yesterday' and 'tomorrow afternoon' but misses 'next week'. Assuming our list of timexes contains the word 'week', it tells us that there is a missing temporal expression, headed by 'week.' The naive method is to go through the above output sequence and change the token-tag pair 'week-N' to 'week-U'. This procedure recognizes the token 'week' as a valid temporal expression, but this is not correct: the valid temporal expression is 'next week'.</Paragraph>
    <Paragraph position="7"> We now describe a second approach to incorporating the knowledge contained in a list of core lexical timexes as a post-processing device. To illustrate our ideas, take the complete sequence in (3) and extract the following segment, which is a window of 7 tokens centered at 'week'.</Paragraph>
    <Paragraph position="8"> (4) . . . [will leave next week . The press] . . .</Paragraph>
    <Paragraph position="9"> We reclassify the tokens in (4) assuming the history contains the token 'and' (the token which appears to the left of this segment in the original sequence) and the associated parameters. Of course, the best sequence will still assign both 'next' and 'week' the N tag since the underlying parameters (feature sets and the associated weights) are the same as the ones in the system. However, since the word sequence in (4) is now short (contains only 7 words) we can maintain a list of all possible tag sequences for it and perform a sequential search for the next best sequence, which assigns the 'week' token a non-negative tag.</Paragraph>
    <Paragraph position="10"> Assume the new tag sequence looks as follows: (5) . . . [will leave next-B week-E . The press] . . .</Paragraph>
    <Paragraph position="11"> This tag sequence will then be placed back into the original sequence resulting in (6):  (6) The chairman arrived in the city yesterday-U , and will leave next-B week-E . The press conference will be held tomorrow-B afternoon-E .</Paragraph>
    <Paragraph position="12"> In this case, all the temporal expressions will be extracted since the token sequence 'next week' is properly tagged. Of course, the above procedure can also return other, invalid sequences as in (7):  (7) a. . . . will leave next-B week-C . The press . . . b. . . . will leave next week-C . The press . . .</Paragraph>
    <Paragraph position="13"> c. . . . will leave next week-C .-E The press . . .</Paragraph>
    <Paragraph position="14">  The final extraction step will not return any timex, since none of the candidate sequences in (7) contains a valid tag sequence. The assumption here is that, of all the tag sequences that assign the token 'week' a non-negative tag, those which contain the segment 'next-B week-E' are likely to receive a higher weight, since the underlying system is trained to recognize temporal expressions and the phrase 'next week' is a likely temporal expression.</Paragraph>
    <Paragraph position="15"> This way, we hypothesize, it is possible to exploit the knowledge embodied in the trained model.</Paragraph>
    <Paragraph position="16"> As pointed out previously, simply going through the list and picking only head words like 'week' does not guarantee that the extracted tokens form a valid temporal expression. The above heuristic, by contrast, relies on the trained model and is therefore likely to also pick up the adjunct 'next'.</Paragraph>
    <Paragraph position="17"> The post-processing method we have just outlined thus boils down to reclassifying a small segment of a complete sequence using the same parameters (feature sets and associated weights) as the original model, keeping all possible candidate sequences, and searching through them to find a valid sequence.</Paragraph>
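The windowed re-decoding just outlined can be sketched as follows. The exhaustive enumeration stands in for keeping all possible tag sequences over the short window; the scoring function is a toy stand-in for the trained CRF's sequence score, and all names here are hypothetical:

```python
from itertools import product

TAGS = ["B", "C", "E", "U", "N"]

def well_formed(tags):
    # Accept only sequences whose non-N tags form valid spans: a lone U,
    # or a B followed by zero or more C and a closing E. This rules out
    # invalid candidates such as "next-B week-C" with no closing E.
    i, n = 0, len(tags)
    while n > i:
        if tags[i] in ("N", "U"):
            i += 1
        elif tags[i] == "B":
            i += 1
            while n > i and tags[i] == "C":
                i += 1
            if i == n or tags[i] != "E":
                return False
            i += 1
        else:
            return False
    return True

def repair_window(window_tokens, trigger_index, score_fn):
    # Enumerate every tag sequence over the window (feasible because the
    # window is short), sort by model score, and return the best-scoring
    # well-formed sequence that gives the trigger token a non-negative
    # (non-N) tag; None if no candidate qualifies.
    candidates = sorted(product(TAGS, repeat=len(window_tokens)),
                        key=lambda tags: score_fn(window_tokens, tags),
                        reverse=True)
    for tags in candidates:
        if tags[trigger_index] != "N" and well_formed(tags):
            return list(tags)
    return None
```

On the 7-token window of example (4), a model that favors 'next-B week-E' recovers exactly the repaired sequence in (5), while candidates like those in (7) are rejected as not well formed.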
  </Section>
  <Section position="7" start_page="12" end_page="14" type="metho">
    <SectionTitle>
5 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section we provide an experimental assessment of the feature engineering and post-processing methods introduced in Sections 3 and 4. Specifically, we want to determine what their impact is on the precision and recall scores of the baseline system, and how they can be combined to boost recall while keeping precision at an acceptable level.</Paragraph>
    <Section position="1" start_page="12" end_page="12" type="sub_section">
      <SectionTitle>
5.1 Experimental data
</SectionTitle>
      <Paragraph position="0"> The training data consists of 511 files, and the test data consists of 192 files; these files were made available in the 2004 Temporal Expression Recognition and Normalization Evaluation. The temporal expressions in the training files are marked with XML tags. The minorThird system takes care of automatically converting from XML format to the corresponding tagging schemes. A temporal expression enclosed by &lt;TIMEX2&gt; tags constitutes a span.</Paragraph>
      <Paragraph position="1"> The features in the training instances are generated by looking at the surface forms of the tokens in the spans and their surrounding contexts.</Paragraph>
    </Section>
    <Section position="2" start_page="12" end_page="14" type="sub_section">
      <SectionTitle>
5.2 Experimental results
</SectionTitle>
      <Paragraph position="0"> Richer feature sets Table 1 lists the results of the first part of our experiments. Specifically, for every tagging scheme, there are two sets of features, basic and list. The results are based on both exact match and partial match between the spans in the gold standard and the spans in the output of the systems, as explained in Subsection 2.1. Under both the exact and partial match criteria, the addition of the list features leads to an improvement in recall, and to no change or a decrease in precision.</Paragraph>
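The exact and partial match criteria can be illustrated with a simplified span scorer. This is our own sketch, not the official TERN scorer; in particular, partial-match recall is computed naively here:

```python
def span_scores(gold, predicted, exact=True):
    # gold and predicted are lists of (start, end) spans, end exclusive.
    # Exact match requires identical boundaries; partial match credits a
    # predicted span that overlaps any gold span (a simplification of
    # the official evaluation).
    if exact:
        correct = len(set(gold).intersection(predicted))
    else:
        def overlaps(a, b):
            return a[1] > b[0] and b[1] > a[0]
        correct = sum(1 for p in predicted
                      if any(overlaps(p, g) for g in gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

A system that finds 'next week' but truncates a second timex thus scores lower on exact match than on partial match, which is the pattern the tables below exhibit.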
      <Paragraph position="1"> In sum, the feature addition helps recall more than it hurts precision, as the F score goes up nearly everywhere, except for the exact-match/baseline pair.</Paragraph>
      <Paragraph position="2"> Tagging schemes In Table 1 we also list the extraction scores for the tagging schemes we consider, IO, BCEUN, and BCEUN+PRE&amp;POST, as described in Section 3.2.</Paragraph>
      <Paragraph position="3"> Let us first look at the impact of the different tagging schemes in combination with the basic feature set (rows 3, 5, 7). As we go from the baseline tagging scheme IO to the more complex BCEUN and BCEUN+PRE&amp;POST, precision increases on the exact-match criterion but remains almost the same on the partial-match criterion. Recall, on the other hand, does not show the same trend.</Paragraph>
      <Paragraph position="4"> BCEUN has the highest recall values followed by BCEUN+PRE&amp;POST and finally IO. In general, IO based tagging seems to perform worse whereas BCEUN based tagging scores slightly above its extended tagging scheme BCEUN+PRE&amp;POST.</Paragraph>
      <Paragraph position="5"> Next, considering the combination of extending the feature set and moving to a richer tagging scheme (rows 4, 6, 8), we see very much the same pattern.</Paragraph>
      <Paragraph position="6"> In both the exact match and the partial match setting, BCEUN tops (or almost tops) the two other schemes in both precision and recall.</Paragraph>
      <Paragraph position="7"> In sum, the richer tagging schemes function as precision-enhancing devices. The effect is clearly visible in the exact-match setting, but less so for partial matching. It is not the case that the learner trained on the richest tagging scheme outperforms all learners trained with poorer schemes.</Paragraph>
      <Paragraph position="8"> Post-processing Table 2 shows the results of applying the post-processing method described in Section 4. One general pattern we observe in Table 2 is that the addition of the list features improves precision for the IO and BCEUN tagging schemes and yields a minor reduction in precision for the BCEUN+PRE&amp;POST tagging scheme, under both matching criteria. Similarly, in the presence of post-processing, the use of a more complex tagging scheme results in better precision. Recall, on the other hand, shows a different pattern. The addition of list features improves recall for both BCEUN and BCEUN+PRE&amp;POST, but hurts recall for the IO scheme under both matching criteria.</Paragraph>
      <Paragraph position="9"> Comparing the results in Table 1 and Table 2, we see that post-processing is a recall-enhancing device, since all the recall values in Table 2 are higher than those in Table 1. Precision values in Table 2, on the other hand, are lower than those in Table 1. Importantly, the use of a more complex tagging scheme such as BCEUN+PRE&amp;POST allows us to minimize the drop in precision. In general, the best result (on partial match) in Table 1 is achieved by the combination of BCEUN and basic &amp; list features, whereas the best result in Table 2 is achieved by the combination of BCEUN+PRE&amp;POST and basic &amp; list features. Although both have the same overall scores on the exact-match criterion, the latter performs better on the partial-match criterion. This, in turn, shows that the combination of post-processing and BCEUN+PRE&amp;POST achieves better results.</Paragraph>
      <Paragraph position="10"> Stepping back We have seen that the extended tagging scheme and the post-processing method improve different aspects of the overall performance. As mentioned previously, the extended tagging scheme is both recall- and precision-oriented, while the post-processing method is primarily recall-oriented. Combining these two methods results in a system that maintains both properties and achieves a better overall result. In order to see how the two methods complement each other, it is sufficient to look at the highest scores for both precision and recall. The extended tagging scheme with basic features achieves the highest precision but relatively low recall. At the other extreme, the simplest configuration, the IO tagging scheme with basic features and post-processing, achieves the highest recall and the lowest precision on partial match. This shows that the IO tagging scheme with basic features imposes a minimal amount of constraints, which allows most of the timexes in the list to be extracted. Put differently, it does not discriminate well between valid and invalid occurrences in the text of timexes from the list. The extended tagging scheme with 7 tags, by contrast, imposes strict criteria on the type of words that constitute a timex, thereby restricting which occurrences of a listed timex count as valid. In general, although the overall gain in score is limited, our feature engineering and post-processing efforts reveal some interesting facts. First, they show one possible way of using a list for post-processing.</Paragraph>
      <Paragraph position="11"> [Table 2 caption fragment] ... is repeated for ease of reference; it does not use post-processing. Highest scores (Precision, Recall, F-measure) are in bold face.</Paragraph>
      <Paragraph position="12"> This method is especially appropriate for situations where better recall is important. It offers a means of controlling the loss in precision (gain in recall) by allowing a systematic process of recovering missing expressions that exploits the knowledge already embodied in a probabilistically trained model, thereby reducing the extent to which we have to make random decisions. The method is particularly sensitive to the criterion (the quality of the list in the current experiment) used for post-processing.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="14" end_page="15" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> A large number of publications deal with the extraction of temporal expressions; the task is often treated as part of a more involved task combining recognition and normalization of timexes. As a result, many timex interpretation systems are a mixture of rule-based and machine learning approaches (Mani and Wilson, 2000). This is partly due to the fact that timex recognition is more amenable to data-driven methods, whereas normalization is best handled using primarily rule-based methods. We focused on machine learning methods for the timex recognition task only. See (Katz et al., 2005) for an overview of methods used for addressing the TERN 2004 task.</Paragraph>
    <Paragraph position="1"> In many machine learning-based named-entity recognition tasks, dictionaries are used to improve results; they are commonly used to generate binary features. Sarawagi and Cohen (2004) showed that semi-CRF models for NE recognition perform better than conventional CRFs. One advantage of semi-CRF models is that the units to be tagged are segments, which may contain one or more tokens, rather than single tokens as in conventional CRFs. This in turn allows one to incorporate segment-based features, e.g., segment length, and also facilitates the integration of external dictionaries, since segments are more likely than tokens to match the entries of an external dictionary. In this paper, we stuck to conventional CRFs, which are computationally less expensive, and introduced post-processing techniques that take into account the knowledge embodied in the trained model.</Paragraph>
    <Paragraph position="2"> Kristjannson et al. (2004) introduced constrained CRFs (CCRFs), a model which returns an optimal label sequence that fulfills a set of constraints imposed by the user. The model is meant to be used in an interactive information extraction environment, in which the system extracts structured information (fields) from a text and presents it to the user, and the user makes the necessary corrections and submits them back to the system. These corrections constitute an additional set of constraints for CCRFs. CCRFs recompute the optimal sequence by taking these constraints into account. The method is shown to reduce the number of user interactions required in validating the extracted information. In a very limited sense, our approach is similar to this work. The list of core lexical timexes that we use represents the set of constraints on the output of the underlying system. However, our method differs in the way in which the constraints are implemented. In our case, we take a segment of the whole sequence that contains a missing timex and reclassify the words in this segment, keeping all possible tag sequences sorted by their weights. We then sequentially search for the next optimal sequence that assigns the missing timex a non-negative tag. Kristjannson et al. (2004), on the other hand, take the whole sequence and recompute an optimal sequence that satisfies the given constraints, where the constraints are a set of states that the resulting optimal sequence should include.</Paragraph>
  </Section>
</Paper>