<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1011">
  <Title>EXTRACTING AND NORMALIZING TEMPORAL EXPRESSIONS</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. PROBLEM DESCRIPTION
</SectionTitle>
    <Paragraph position="0"> The task of automatically extracting temporal information can be divided into four parts:</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="52" type="metho">
    <SectionTitle>
1) Recognize the temporal expression.
</SectionTitle>
    <Paragraph position="0"> The event happened Saturday.</Paragraph>
    <Paragraph position="1"> 2) Extract its features.</Paragraph>
    <Paragraph position="2"> Saturday is a day name and a relative expression.</Paragraph>
    <Paragraph position="3"> 3) Compute its interval representation. Based on the reference date of the document and the features of the expression, determine which calendar day is meant. Represent this as an interval (see footnote 2): 08291998 08291998.</Paragraph>
    <Paragraph position="4"> 4) Normalize the interval for database use. Store each part of the interval expression, i.e., day, month, and year for the start and end points, into an NLToolset structure. Final output format varies according to application requirements.</Paragraph>
    <Paragraph position="5"> Footnote 2: For the purpose of this paper, the interval will not address units of time smaller than days, i.e., hours, minutes, and seconds. An interval for a day will have identical endpoints.</Paragraph>
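    <Paragraph> Steps 3 and 4 above can be illustrated with a minimal sketch. It assumes the interval endpoints are written as MMDDYYYY, which is how the example 08291998 08291998 reads; the function names and the dictionary layout are hypothetical stand-ins, not the NLToolset structure itself.

```python
from datetime import date


def format_interval(start: date, end: date) -> str:
    """Render an interval as two MMDDYYYY endpoints (assumed format)."""
    return f"{start:%m%d%Y} {end:%m%d%Y}"


def normalize(start: date, end: date) -> dict:
    """Split both endpoints into day, month, and year fields for storage.

    The dictionary is a hypothetical stand-in for the NLToolset structure.
    """
    return {
        "start": {"day": start.day, "month": start.month, "year": start.year},
        "end": {"day": end.day, "month": end.month, "year": end.year},
    }


# An interval for a single day has identical endpoints.
saturday = date(1998, 8, 29)
print(format_interval(saturday, saturday))  # prints 08291998 08291998
print(normalize(saturday, saturday)["start"])
```

Keeping the endpoint fields separate is what lets the final output format vary by application without recomputing the interval.</Paragraph>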
    <Section position="1" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
Feature Complexity
</SectionTitle>
      <Paragraph position="0"> The greatest difficulty in building an automatic system for interpreting time expressions is the seemingly infinite variety of ways in which human beings express time.</Paragraph>
      <Paragraph position="1"> The term &quot;feature&quot; in this context refers to a category of information that can be used to interpret the expression. For example, the feature &quot;unit of time&quot; refers to the terms month, day, year, century; &quot;interval endpoint&quot; refers to an explicit reference to at least one end of a time interval, such as before the end of or from June to September.</Paragraph>
      <Paragraph position="2"> Each of the following numbered examples represents a different kind of time expression, based on the features available for its interpretation.</Paragraph>
      <Paragraph position="3">
1. before the end of the year
2. next April
3. March 1, 1992
4. from June to September
5. in the 90's
6. in two weeks
7. the first year
8. beginning July 1
9. last Summer
10. next month
11. in the first quarter of fiscal 1992
12. the turn of the century
13. Saturday
14. yesterday
15. the previous April

Table 1 illustrates the relationship between a set of features and the temporal expressions in which they appear. This is often a many-to-many relationship, which makes the manual construction of a decision tree a formidable task.</Paragraph>
    </Section>
    <Section position="2" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
Feature Available: Example Number
</SectionTitle>
      <Paragraph position="0">
unit of time: 1, 6, 7, 10
interval endpoint: 1, 4, 5, 7, 8
relative to dateline: 1, 2, 4, 6, 8, 9, 10, 13, 14
month name: 2, 3, 4, 8, 15
relative direction: 2, 9, 10, 15

Each expression will require a unique computation function, based on the features present and their interaction. For example, the second expression, next April, is different from April of next year only if the reference date is within the interval between January 1 and March 31.</Paragraph>
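      <Paragraph> The next April case can be made concrete with a short sketch. It assumes that next April means the first April that begins after the reference date's month, which matches the statement that the two readings differ only between January 1 and March 31; the function names are hypothetical.

```python
import calendar
from datetime import date


def month_interval(year: int, month: int) -> tuple:
    """First and last calendar day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def next_named_month(reference: date, month: int) -> tuple:
    """Interval for 'next MONTH' relative to the reference date.

    Assumption: if the named month still lies ahead in the reference
    year, that year is used; otherwise the following year is used.
    """
    year = reference.year if month > reference.month else reference.year + 1
    return month_interval(year, month)


def named_month_of_next_year(reference: date, month: int) -> tuple:
    """Interval for 'MONTH of next year', which always adds one year."""
    return month_interval(reference.year + 1, month)


# Reference date in February: the two expressions differ.
print(next_named_month(date(1998, 2, 10), 4)[0])          # 1998-04-01
print(named_month_of_next_year(date(1998, 2, 10), 4)[0])  # 1999-04-01
# Reference date in August: the two expressions agree.
print(next_named_month(date(1998, 8, 29), 4)[0])          # 1999-04-01
```

Even this one expression needs the reference date and the month name together, which is the kind of feature interaction the paragraph above describes.</Paragraph>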
      <Paragraph position="1"> There are many possible combinations of features. Additionally, there are many idiomatic temporal expressions, such as the turn of the century. These possibilities must be captured within the NLToolset's rule packages so that the expression can be recognized.</Paragraph>
    </Section>
    <Section position="3" start_page="51" end_page="52" type="sub_section">
      <SectionTitle>
Relative Expressions
</SectionTitle>
      <Paragraph position="0"> Some time expressions are specific, e.g. March 1, 1992; others are relative expressions, either of a contiguous or non-contiguous nature. For example, expressions like yesterday or next month are non-contiguous because they are relative to the dateline of the message. But, expressions like the previous April or the following day usually refer to the immediately preceding time expression, and thus are thought of as contiguous.</Paragraph>
      <Paragraph position="1"> The CEO announced his retirement on March 5. The following day, the company's stock price rose.</Paragraph>
      <Paragraph position="2">  In this example, on March 5 is a non-contiguous expression and is calculated from the document reference date, while the following day is contiguous and is calculated using the previous temporal expression, March 5, as the reference date.</Paragraph>
      <Paragraph position="3"> The tense of the verb must also be considered when interpreting relative expressions.</Paragraph>
      <Paragraph position="4"> The ship sailed on Saturday.</Paragraph>
      <Paragraph position="5"> The ship will sail on Saturday.</Paragraph>
      <Paragraph position="6"> Computation of the correct interval depends on whether the date is meant to indicate past or future.

Ambiguity: Some expressions are deliberately vague, indicating a general vicinity of time rather than an exact one. When the expression next week is used, does it mean the seven days beginning on Sunday, or the five days of the business week? The expression definitely carries information, but the problem is capturing that information without overstating the accuracy of its representation. The following example further illustrates this point.</Paragraph>
      <Paragraph position="7"> Baseball season begins next week.</Paragraph>
      <Paragraph position="8"> In this case, what is meant is that the season will begin at some point during the interval that is next week; however, the exact time is ambiguous.</Paragraph>
      <Paragraph position="9"> The NLToolset's current implementation will arbitrarily decide what the interval of next week is. It will make no attempt to resolve the ambiguity, nor to note that such ambiguity exists. This is an area for future research.</Paragraph>
    </Section>
    <Section position="4" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
Specialized Calendars
</SectionTitle>
      <Paragraph position="0"> Information extraction systems are often developed for specialized domains. The following examples illustrate the problem of specialized calendars. The first example is from a business domain from which the system must extract information about joint ventures.</Paragraph>
      <Paragraph position="1"> Profits during the first year reached $5 million.
In this example, the reference point is the date that the joint venture began operations. This is used to calculate the interval represented by the first year. The second example is from the automotive domain.</Paragraph>
      <Paragraph position="2"> Since the 1990 model year began on October 1, Buick sales have plunged.</Paragraph>
      <Paragraph position="3"> Introduction of world knowledge to the system would be necessary to have it understand that the start of the model year was in 1989.</Paragraph>
      <Paragraph position="4"> The third example might appear in an agricultural domain.</Paragraph>
      <Paragraph position="5"> During the current crop year, Brazil will produce 7 million tons of sugar.</Paragraph>
      <Paragraph position="6"> This time period would depend on the crop grown and the growing location.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="52" end_page="52" type="metho">
    <SectionTitle>
3. EXTRACTION AND COMPUTATION
</SectionTitle>
    <Paragraph position="0"> The NLToolset has a rule package that can recognize common temporal expressions, both absolute and relative; its accuracy has been measured at above 90%. An important feature of the NLToolset is the ability it affords the developer to add variables to the rule patterns. In the case of temporal expressions, the pattern variables capture the features, such as month, day, or year, that make up the expressions. These values are used in the computation of the interval representation.</Paragraph>
    <Paragraph position="1"> Computing the Interval: The computation stage involves determining the reference point and using it, together with the feature information and the information from the expression's context, to compute the interval. For example, if the expression is next year, the system would find the reference year and then add one; the interval would extend from January 1 until December 31 of that year. If the expression is Saturday, the system must decide whether it refers to next Saturday or last Saturday, based on the sentence tense. It must then ascertain the weekday name of the reference date and add or subtract the appropriate number of days to reach the proper calendar date.</Paragraph>
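    <Paragraph> The two computations just described can be sketched as follows. This is an illustration only, not the NLToolset code: the function names are hypothetical, the tense is assumed to arrive as the string "past" or "future", and a reference date that already falls on the named weekday is assumed to resolve a full week away.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def next_year_interval(reference: date) -> tuple:
    """'next year': add one to the reference year, January 1 to December 31."""
    year = reference.year + 1
    return date(year, 1, 1), date(year, 12, 31)


def resolve_weekday(name: str, reference: date, tense: str) -> tuple:
    """Resolve a bare weekday name to a one-day interval.

    Past tense looks backward from the reference date, anything else
    looks forward. A zero offset is replaced by 7 (assumption).
    """
    target = WEEKDAYS.index(name.lower())
    current = reference.weekday()  # Monday is 0, Sunday is 6
    if tense == "past":
        offset = (current - target) % 7 or 7
        day = reference - timedelta(days=offset)
    else:
        offset = (target - current) % 7 or 7
        day = reference + timedelta(days=offset)
    return day, day


dateline = date(1998, 9, 1)  # a Tuesday
print(resolve_weekday("Saturday", dateline, "past")[0])    # 1998-08-29
print(resolve_weekday("Saturday", dateline, "future")[0])  # 1998-09-05
print(next_year_interval(dateline))
```

The past-tense result crosses a month boundary; the date arithmetic is delegated to the standard library here so that no month-length logic appears in the sketch.</Paragraph>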
    <Paragraph position="2"> Arithmetic of calendar days across months can be problematic. To avoid this problem, the NLToolset converts each calendar day into Julian day number form (see footnote 3). This number is the count of days, starting with day 0 on the 1st of January, 4713 BC. After the calculation is completed, the NLToolset converts the Julian day back to its original time scale.</Paragraph>
  </Section>
  <Section position="6" start_page="52" end_page="53" type="metho">
    <SectionTitle>
Footnote 3
</SectionTitle>
    <Paragraph position="0"> The Julian day number was introduced in 1583 by the French scholar Joseph Justus Scaliger to define a number from which all time could be reckoned. As a starting point, Scaliger chose the last year in which the following cycles began simultaneously: the 28-year Sun cycle, in which the calendar dates repeat on the same weekdays; the 19-year Metonic cycle, in which the phases of the Moon repeat on almost the same calendar dates; and the 15-year cycle for tax collection and census that was used in the Roman empire. This starting year for the Julian day was 4713 BC.</Paragraph>
    <Paragraph position="1"> For the majority of cases, it is a simple matter to write a computation for a specific pattern that takes into consideration all of the relevant features and then determines the interval; however, the many-to-many relationship between features and expressions, coupled with a context dependency, complicates the overall process.</Paragraph>
    <Section position="1" start_page="53" end_page="53" type="sub_section">
      <SectionTitle>
Algorithm Complexity
</SectionTitle>
      <Paragraph position="0"> The simplest approach would be to write a package of rules, each with a left-hand side that matches a certain time expression and a right-hand side that is the relevant computation function. This method, while simple to implement, would bog down our pattern matcher by giving it too many possible paths to check. The following example illustrates this point.</Paragraph>
      <Paragraph position="1"> Straightforward mapping of patterns to functions</Paragraph>
      <Paragraph position="3"> In this example, if the pattern matcher finds a monthname, it must check each of these patterns to see which one is applicable. If, instead, we construct one non-deterministic pattern, we can eliminate this problem. The curly brackets indicate optional elements.</Paragraph>
      <Paragraph position="4"> Collapse of four patterns into one:
&lt; monthname { day } { year } &gt; &gt;&gt; Call-correct-function
In this case, the complexity migrates to the right side of the rule. The Call-correct-function function now must compute the interval based on the features that have matched. The difficult part, with a variety of candidate features, is constructing a decision tree that is efficient, and then, when new cases are added, reconstructing the decision tree while maintaining its efficiency.</Paragraph>
      <Paragraph position="5"> Identifying the Interval Type: Because the NLToolset represents dates as intervals, it must decide how to fill the start and end points of each interval. A starting or ending point could be unknown, a part of the date that is being interpreted, or the dateline (or other reference date).</Paragraph>
      <Paragraph position="6"> The decision as to what will fill each point of the interval is based partly on the prepositions and context, and partly on the date being interpreted. For instance, next week will have a start date at the beginning of the week following the dateline, and an end date at the end of that week. However, by next week will use the dateline as the start date.</Paragraph>
      <Paragraph position="7"> There are, by our reckoning, twelve ways to fill in the start and end dates. By examining the context in which the date appears, we can select one of these ways rather than trying to work with the contextual information directly as we fill in the interval. Table 2 enumerates the possibilities.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="53" end_page="53" type="metho">
    <SectionTitle>
START END EXAMPLE
</SectionTitle>
    <Paragraph position="0">
unk    beg    before last week
unk    end    through last week
unk    dl     until today
beg    unk    as of this week
beg    end    (during) next week
beg    dl     beginning last week
end    unk    after next week
end    dl     since last week
dl     unk    after today
dl     beg    until next week
dl     end    through next week
dl     dl     today
KEY:</Paragraph>
    <Paragraph position="2"/>
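    <Paragraph> The START and END codes in Table 2 lend themselves to a small lookup, sketched below. The readings of the codes are an assumption drawn from the surrounding text (unk for unknown, beg and end for the beginning and end of the date being interpreted, dl for the dateline). The Sunday-to-Saturday week and all function names are likewise hypothetical.

```python
from datetime import date, timedelta


def week_bounds(dateline: date, offset: int) -> tuple:
    """Sunday-to-Saturday week lying `offset` weeks from the dateline's week.

    The choice of Sunday as the first day is an arbitrary assumption.
    """
    days_since_sunday = (dateline.weekday() + 1) % 7
    start = dateline - timedelta(days=days_since_sunday) + timedelta(weeks=offset)
    return start, start + timedelta(days=6)


def fill_interval(start_kind: str, end_kind: str, unit: tuple, dateline: date) -> tuple:
    """Fill each endpoint from its Table 2 code; None stands for unknown."""
    beg, end = unit
    choices = {"unk": None, "beg": beg, "end": end, "dl": dateline}
    return choices[start_kind], choices[end_kind]


dateline = date(1998, 9, 1)  # a Tuesday
next_week = week_bounds(dateline, 1)
last_week = week_bounds(dateline, -1)

print(fill_interval("beg", "end", next_week, dateline))  # (during) next week
print(fill_interval("dl", "beg", next_week, dateline))   # until next week
print(fill_interval("unk", "beg", last_week, dateline))  # before last week
```

Selecting one of the twelve code pairs from the context first, and only then filling in dates, keeps the preposition handling separate from the calendar arithmetic.</Paragraph>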
  </Section>
  <Section position="8" start_page="53" end_page="54" type="metho">
    <SectionTitle>
4. LEARNING THE DECISION TREE
</SectionTitle>
    <Paragraph position="0"> We decided to try using machine learning to help generate the Call-correct-function code. We chose Quinlan's C4.5 software because it has been successfully applied to many problems requiring decision trees. C4.5 uses training examples to build a classification system, which, in this case, will comprise a decision tree that lays out a feature-based path to each correct computation. As new cases are added to the rule package, the tree can be quickly regenerated by adding more training examples.</Paragraph>
    <Paragraph position="1"> Using C4.5: We will describe our experiment with C4.5. For a complete description of C4.5, see Quinlan's own publication (footnote 4). To use C4.5, the developer specifies 1) the classes of interest, which will become the leaves of the decision tree, and 2) the features and their possible values, which are the nodes of the tree. A set of training examples is provided and, when the tree has been generated, each path can be considered a rule. The C4.5 specification builds a description space whose number of dimensions corresponds to the number of features describing the problem. Each training example is a point within the space. The decision tree is a classifier that divides the description space into regions, each one labelled with a classification type. C4.5 decides which feature is the best one to use as a first discriminator, and then starts to divide the region based on that feature. This is a key element of C4.5: it provides the most efficient tree that it can discover. It also includes heuristics for simplifying the tree. In general, C4.5 generates a decision tree by ordering the testing of features according to how much information each feature will provide. Each decision splits the region into smaller pieces, until finally the classification is reached.</Paragraph>
    <Paragraph position="2"> According to Quinlan's guidelines, the best classifier will have few classes, few regions per class, many training cases relative to the volume of the regions, and no misclassification of the training cases.</Paragraph>
    <Section position="1" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
Failed Attempt
</SectionTitle>
      <Paragraph position="0"> Our first attempt at describing the problem in C4.5 syntax resulted in something like the following model.</Paragraph>
      <Paragraph position="1"> Classes: one class for each computation function

This approach failed because it does not abide by Quinlan's guidelines. We are trying to classify into many categories, one for each computation function. Our preliminary working set consists of fifteen classes. We also have many features with many possible values, and not all of the features are relevant in every case. In fact, in all cases, only a subset of the features is relevant. As a result, C4.5 has difficulty in generating a good decision tree, even with several hundred training examples.</Paragraph>
    </Section>
    <Section position="2" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
Different Approach
</SectionTitle>
      <Paragraph position="0"> To remedy this situation, we transformed the description space by converting the feature values to boolean -- Y or N -- because the value of the feature does not matter as much to the decision as whether the feature is present.</Paragraph>
      <Paragraph position="1"> Classes: one class for each computation function

This change, although it maintains the large number of classes, allows us to reduce the volume of the regions and avoid the fragmentation of the previous model. Additionally, this model produces a binary tree, which is a simple if-then-else algorithm to implement. In fact, we can automatically convert the generated decision tree to C++ code, using a Perl script.</Paragraph>
      <Paragraph position="2"> This is an unusual use of C4.5 in that it does not follow Quinlan's guidelines for developing a good classifier; however, it does work for our purposes. It has alleviated the tedious and time-consuming problem of generating and re-generating an efficient decision tree in C++ code.</Paragraph>
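      <Paragraph> The Y or N encoding and the resulting if-then-else structure can be pictured with the sketch below. It is written by hand for illustration and is not output from C4.5 or from the Perl conversion script; the feature names, the tree shape, and the computation function names are all hypothetical.

```python
FEATURES = ("monthname", "day", "year", "relative_direction")


def to_boolean_features(matched: dict) -> dict:
    """Record only whether each feature is present (Y) or absent (N)."""
    return {name: "Y" if matched.get(name) is not None else "N"
            for name in FEATURES}


def choose_function(f: dict) -> str:
    """A binary decision tree written as nested if-then-else tests.

    Each test asks a single Y or N question, so every path through the
    tree reads as one rule ending in the name of a computation function.
    """
    if f["monthname"] == "Y":
        if f["year"] == "Y":
            if f["day"] == "Y":
                return "compute_full_date"
            return "compute_month_of_year"
        if f["relative_direction"] == "Y":
            return "compute_relative_month"
        return "compute_month_from_dateline"
    return "compute_other"


# "March 1, 1992": month name, day, and year all matched.
print(choose_function(to_boolean_features(
    {"monthname": "March", "day": 1, "year": 1992})))
# "next April": month name plus a relative direction.
print(choose_function(to_boolean_features(
    {"monthname": "April", "relative_direction": "next"})))
```

Because the tests depend only on presence, the actual values (which month, which year) are left to the chosen computation function.</Paragraph>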
    </Section>
  </Section>
</Paper>