<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2008">
  <Title>Temporal Classification of Text and Automatic Document Dating</Title>
  <Section position="3" start_page="0" end_page="29" type="metho">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> Temporal information is presently under-utilised for document and text processing purposes. Past and ongoing research has largely focused on the identification and tagging of temporal expressions, with the creation of tagging methodologies such as TimeML/TIMEX (Gaizauskas and Setzer, 2002; Pustejovsky et al., 2003; Ferro et al., 2004) and TDRL (Aramburu and Berlanga, 1998), and associated evaluations such as the ACE TERN competition (Sundheim et al., 2004).</Paragraph>
    <Paragraph position="1"> Temporal analysis has also been applied in Question-Answering systems (Pustejovsky et al., 2004; Schilder and Habel, 2003; Prager et al., 2003), email classification (Kiritchenko et al., 2004), improving the precision of Information Retrieval results (Berlanga et al., 2001), document summarisation (Mani and Wilson, 2000), time stamping of event clauses (Filatova and Hovy, 2001), temporal ordering of events (Mani et al., 2003) and temporal reasoning from text (Boguraev and Ando, 2005; Moldovan et al., 2005).</Paragraph>
    <Paragraph position="2"> [Figure 1: original series on the left and the remaining time-series component after filtering on the right; the Y-axis shows the frequency count and the X-axis shows the day number (time).] There is also a large body of work on time series analysis and temporal logic in Physics, Economics and Mathematics that provides important techniques and general background. In particular, this work uses techniques adapted from Seasonal Auto-Regressive Integrated Moving Average (SARIMA) models. SARIMA models are a class of seasonal, non-stationary temporal models based on the ARIMA process, itself a non-stationary extension of the stationary ARMA model. Non-stationary ARIMA processes are defined by:</Paragraph>
    <Paragraph position="4"> φ(B)(1 - B)^d X_t = θ(B)Z_t, where B is the backshift operator, d is a non-negative integer, and φ and θ are polynomials of degrees p and q respectively. The exact parameters for each process (one process per word) are determined automatically by the system. A discussion of the general SARIMA model is beyond the scope of this paper (details can be found in the Mathematics and Physics literature). The NLP application of temporal classification and prediction to estimate likely document and text creation dates appears to be novel.</Paragraph>
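To make the (1 - B)^d differencing operator in the definition above concrete, here is a minimal pure-Python sketch. This is an illustration of the general ARIMA differencing idea, not part of the paper's system; the function name is ours.

```python
def difference(series, d=1):
    """Apply the (1 - B) operator d times to a list of floats.

    B is the backshift operator (B x_t = x_{t-1}), so each pass
    replaces the series with its successive differences.
    """
    out = list(series)
    for _ in range(d):
        out = [out[i] - out[i - 1] for i in range(1, len(out))]
    return out

# A series with a linear trend is non-stationary in the mean:
trend = [2.0 * t + 5.0 for t in range(10)]
# One round of differencing leaves only the constant slope,
# which a stationary ARMA model can then describe.
print(difference(trend, d=1))   # nine values, all 2.0
print(difference(trend, d=2))   # eight values, all 0.0
```

Higher d handles higher-degree polynomial trends in the same way, which is why d appears as a free parameter of the process.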
  </Section>
  <Section position="4" start_page="29" end_page="30" type="metho">
    <SectionTitle>
3 Temporal Periodicity Analysis
</SectionTitle>
    <Paragraph position="0"> We have created a high-performance system that decomposes a time series into two parts: a periodic component that repeats itself in a predictable manner, and a non-periodic component that remains after the periodic component has been filtered out of the original series. Figure 1 shows an example of the filtering results on the time series of the words &amp;quot;January&amp;quot; and &amp;quot;the&amp;quot;. The time series are based on training documents selected at random from the GigaWord English corpus: 10% of all the documents in the corpus were used for training, with the rest available for evaluation and testing. A total of 395,944 time series spanning 9 years were calculated from the GigaWord corpus. Figure 2 presents pseudo-code for the time series decomposition algorithm:

1. Find the min/max/mean and standard deviation of the time series.
2. Start with a pre-defined maximum window size (presently set to 366 days).
3. While the window size is greater than 1, repeat steps a. to d. below:
   a. Look at the current value in the time series (starting at the first value).
   b. Do the values at positions current, current + window size, current + 2 x window size, etc. vary by less than 1/2 standard deviation?
   c. If yes, mark the current value/window size pair as a possible decomposition match.
   d. Move to the next value in the time series until the end is reached.
   e. Decrease the window size by one.
4. Select the minimum number of decomposition matches that cover the entire time series, using a greedy algorithm.

Figure 2: Time Series Decomposition Algorithm

The time series decomposition algorithm was applied to all 395,944 time series, taking an average of 419 ms per series. The algorithm runs in O(n log n) time for a time series of length n.</Paragraph>
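Steps 1-3 of the Figure 2 pseudo-code can be sketched in Python roughly as follows. This is a simplified illustration under our own assumptions (the greedy covering of step 4 is omitted, and all names are hypothetical), not the authors' implementation.

```python
import statistics

def decomposition_matches(series, max_window=366):
    """Collect (position, window) pairs whose strided values vary by
    less than half a standard deviation (steps 1-3 of Figure 2)."""
    sd = statistics.pstdev(series)          # step 1
    n = len(series)
    matches = []
    for window in range(min(max_window, n - 1), 1, -1):  # steps 2, 3e
        for pos in range(n):                             # steps 3a, 3d
            strided = series[pos::window]                # step 3b
            if len(strided) > 1 and max(strided) - min(strided) < sd / 2:
                matches.append((pos, window))            # step 3c
    return matches

# Synthetic daily series that spikes every 7 days:
series = [10.0 if day % 7 == 0 else 1.0 for day in range(60)]
matches = decomposition_matches(series)
print((0, 7) in matches)   # True: the weekly spikes line up at window 7
```

Step 4 would then pick a minimal subset of these pairs that covers the whole series, greedily preferring matches that cover the most uncovered positions.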
    <Paragraph position="1"> The periodic component of the time series is then analysed to extract temporal association rules between words and different &amp;quot;seasons&amp;quot;, including Day of Week, Week Number, Month Number, Quarter, and Year. To determine whether a word predominantly peaks on a weekly basis, for example, a sliding window of size 7 is applied and the system checks whether the periodic time series always spikes within this window. Figure 3 shows the frequency distribution of the periodic time series component of the day-of-week names (&amp;quot;Monday&amp;quot;, &amp;quot;Tuesday&amp;quot;, etc.). Note that the frequency counts peak exactly on the corresponding day of the week. For example, the word &amp;quot;Monday&amp;quot; is automatically associated with Day 1, and &amp;quot;April&amp;quot; with Month 4. The creation of temporal association rules generalises the inferences obtained from the periodic data. Each association rule records a period type, a period number, and a score matrix. The period number and score matrix represent a probability density function that gives the likelihood of a word appearing on a particular period number. For example, the score matrix for &amp;quot;January&amp;quot; will have a high score for period 1 (with period type set to Monthly). Figure 4 shows some examples of extracted association rules. The PDF scores in Figure 4 are shown as they are stored internally (as multiples of the standard deviation of that time series); they are automatically normalised during the classification process at runtime. Rule generalisation is not possible in such a straightforward manner for the non-periodic data; the use of non-periodic data to optimise the results of the temporal classification and automatic dating system is not covered in this paper.</Paragraph>
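The sliding-window spike check described above can be sketched as follows. The function name and the simple per-window argmax test are our own assumptions for illustration; the paper does not specify this exact procedure.

```python
def weekly_peak_day(periodic, window=7):
    """Return the day-of-week index a periodic component always peaks
    on, or None if the spikes do not fall on one consistent weekday."""
    peak_days = set()
    for start in range(0, len(periodic) - window + 1, window):
        week = periodic[start:start + window]
        # Index of the spike within this window (hypothetical test).
        peak_days.add(max(range(window), key=week.__getitem__))
    # A single surviving index means every window spikes on the same day.
    return peak_days.pop() if len(peak_days) == 1 else None

# A "Monday"-like periodic component: spikes on day 0 of every week.
monday = [9.0 if d % 7 == 0 else 1.0 for d in range(28)]
print(weekly_peak_day(monday))   # 0
```

A word whose spikes drift across weekday positions would yield multiple peak indices and thus no weekly association rule.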
  </Section>
  <Section position="5" start_page="30" end_page="31" type="metho">
    <SectionTitle>
4 Temporal Classification and Dating
</SectionTitle>
    <Paragraph position="0"> The periodic temporal association rules are used to automatically estimate the creation date of a document. Documents are input into the system and the probability density functions for each word are weighted and summed.</Paragraph>
    <Paragraph position="1"> Each PDF is weighted according to the inverse document frequency (IDF) of its associated word. Periods that obtain high scores are then ranked for each period type, and two guesses per period type are obtained for each document. Ten guesses in total are thus obtained for Day of Week,</Paragraph>
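The IDF-weighted scoring step can be sketched as follows for the Monthly period type. The PDF scores, document frequencies and function names below are invented for illustration; they are not values from the paper.

```python
import math

# Hypothetical per-word PDF scores over month numbers (1-12); in the
# real system these come from the extracted association rules.
month_pdf = {
    "january": {1: 3.0, 2: 0.2},
    "snow":    {1: 1.5, 2: 1.0, 12: 1.2},
    "the":     {m: 0.1 for m in range(1, 13)},
}
doc_freq = {"january": 20, "snow": 50, "the": 1000}  # of 1000 docs
N_DOCS = 1000

def guess_month(words):
    """Sum IDF-weighted PDF scores per month and rank the months."""
    scores = {m: 0.0 for m in range(1, 13)}
    for w in words:
        if w not in month_pdf:
            continue
        idf = math.log(N_DOCS / doc_freq[w])   # rarer words weigh more
        for month, score in month_pdf[w].items():
            scores[month] += idf * score
    return sorted(scores, key=scores.get, reverse=True)

ranking = guess_month(["the", "january", "snow"])
print(ranking[:2])   # → [1, 2]: the two Monthly guesses for this text
```

Note how &amp;quot;the&amp;quot;, appearing in every document, has an IDF of zero and contributes nothing, while the rare, strongly seasonal &amp;quot;january&amp;quot; dominates the ranking.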
  </Section>
</Paper>