<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0903">
  <Title>Automatic Dating of Documents and Temporal Text Classification</Title>
  <Section position="5" start_page="18" end_page="19" type="metho">
    <SectionTitle>
3 Temporal Periodicity Analysis
</SectionTitle>
    <Paragraph position="0"> We have created a high-performance system that decomposes time series into two parts: a periodic component that repeats itself in a predictable manner, and a non-periodic component that remains after the periodic component has been filtered out of the original time series. Figure 1 shows an example of the filtering results for the time series of the words &quot;January&quot; and &quot;the&quot;. The original series is presented together with two series representing the periodic and non-periodic components of the original time series. The time series are based on training documents selected at random from the GigaWord English corpus.</Paragraph>
    <Paragraph position="1"> 10% of all the documents in the corpus were used as training documents, with the rest being available for evaluation and testing. A total of 395,944 time series spanning 9 years were calculated from the GigaWord corpus. The availability of 9 years of data also mitigated the negative effects of using short time series in combination with SARIMA models (as up to 3,287 data points were available for some words, well above the 50 data point minimum recommendation).</Paragraph>
    <Paragraph position="2"> Figure 2 presents pseudo-code for the time series decomposition algorithm:
1. Find the min/max/mean and standard deviation of the time series.
2. Start with a pre-defined maximum window size (set to 366 days in our present system).
3. While the window size is greater than 1, repeat steps a. to d. below:
   a. Look at the current value in the time series (starting from the first value).
   b. Do the values at positions current, current + window size, current + 2 x window size, etc. vary by less than half a standard deviation?
   c. If yes, mark the current value/window size pair as a possible decomposition match.
   d. Look at the next value in the time series until the end is reached.
   e. Decrease the window size by one.
4. Select the minimum number of decomposition matches that cover the entire time series, using a greedy algorithm.
The time series decomposition algorithm was applied to the 395,944 time series, taking an average of 419ms per series. The algorithm runs in O(n log n) time for a time series of length n. The periodic component of each time series is then analysed to extract temporal association rules between words and different &quot;seasons&quot;, including Day of Week, Week Number, Month Number, Quarter, and Year. To determine whether a word peaks, for example, on a weekly basis, a sliding window of size 7 (in the case of weekly periods) is applied, and the system checks whether the periodic time series always spikes within this window. Figure 3 shows the frequency distribution of the periodic time series component of the day-of-week names (&quot;Monday&quot;, &quot;Tuesday&quot;, etc.). Note that the frequency counts peak exactly on the corresponding day of the week. Thus, for example, the word &quot;Monday&quot; is automatically associated with Day 1, and &quot;April&quot; with Month 4. The creation of temporal association rules generalises the inferences obtained from the periodic data.
Each association rule associates a word with a period type, a period number, and a score matrix. The period number and score matrix represent a probability density function that shows the likelihood of a word appearing in a particular period number. Thus, for example, the score matrix for &quot;January&quot; will have a high score for period 1 (with the period type set to Monthly). Figure 4 shows some examples of extracted association rules. The probability density function (PDF) scores are shown in Figure 4 as they are stored internally (as multiples of the standard deviation of that time series); they are automatically normalised during the classification process at runtime. The standard deviation of values in the time series is used instead of absolute values in order to reduce the variance between fluctuations in different time series for words that occur frequently (like pronouns) and those that appear relatively less frequently.</Paragraph>
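The decomposition steps above can be sketched in a few lines of Python. This is a naive illustration under our own assumptions (function names and the simple set-cover loop are ours, and it does not reproduce the paper's O(n log n) running time): scan every window size, mark (start, window) pairs whose sampled values vary by less than half a standard deviation, then greedily pick matches until the whole series is covered.

```python
import statistics

def decompose_matches(series, max_window=366):
    """Collect (start, window) decomposition matches, then greedily
    cover the series with as few matches as practical."""
    sd = statistics.pstdev(series)
    matches = []
    for window in range(min(max_window, len(series)), 1, -1):
        for start in range(window):
            values = series[start::window]
            # Step b: do the sampled values vary by less than half a
            # standard deviation?
            if len(values) > 1 and max(values) - min(values) < 0.5 * sd:
                matches.append((start, window))

    def positions(match):
        start, window = match
        return set(range(start, len(series), window))

    # Step 4: greedy cover -- repeatedly take the match covering the
    # most still-uncovered positions.
    cover, covered = [], set()
    while covered != set(range(len(series))):
        best = max(matches, key=lambda m: len(positions(m) - covered),
                   default=None)
        if best is None or not (positions(best) - covered):
            break
        cover.append(best)
        covered |= positions(best)
        matches.remove(best)
    return cover
```

For a purely periodic series such as `[1, 5, 2] * 4`, the greedy pass settles on the three window-3 matches that tile the series exactly.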
    <Paragraph position="3"> Rule generalisation is not possible in such a straightforward manner for the non-periodic data. The use of non-periodic data to optimise the results of the temporal classification and automatic dating system is not covered in this paper. Non-periodic data may be used to generate specific rules that are associated only with particular dates or date ranges. Non-periodic data can also draw on hapax words and other low-frequency words to generate additional refinement rules. However, there is a danger that rules extracted from non-periodic data will simply reflect the specific characteristics of the corpus used to train the system, rather than the language in general. Research into calculating relevance levels for rules extracted from non-periodic data is ongoing.</Paragraph>
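The association-rule scores described in this section are stored as multiples of each series' standard deviation and normalised at classification time. A minimal sketch of that normalisation, assuming plain sum-normalisation (the paper does not specify the exact runtime formula, and the example rule for &quot;January&quot; is hypothetical):

```python
def normalise_pdf(score_matrix):
    """Convert raw rule scores (stored as multiples of the series'
    standard deviation) into a probability distribution over period
    numbers. Sum-normalisation is an assumption on our part."""
    total = sum(score_matrix.values())
    return {period: score / total for period, score in score_matrix.items()}

# Hypothetical rule for "January": period type Monthly, peaking at period 1.
january_scores = {month: (5.0 if month == 1 else 0.1)
                  for month in range(1, 13)}
january_pdf = normalise_pdf(january_scores)
```

After normalisation the scores sum to 1 and the mass concentrates on period 1, matching the behaviour described for &quot;January&quot; above.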
  </Section>
  <Section position="6" start_page="19" end_page="20" type="metho">
    <SectionTitle>
4 Temporal Classification and Automatic Dating
</SectionTitle>
    <Paragraph position="0"> The periodic temporal association rules are used to automatically guess the creation date of documents. Documents are input into the system, and the probability density functions for each word are weighted and summed. Each PDF is weighted according to the inverse document frequency (idf) of its associated word. Periods that obtain a high score are then ranked for each period type, and two guesses per period type are obtained for each document. Ten guesses in total are thus obtained across Day of Week, Week Number, Month Number, Quarter, and Year (5 period types x 2 guesses each).</Paragraph>
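A minimal sketch of this weighting-and-ranking step follows. The data layout (`rules` mapping each word to a period type and score matrix) and all names are our assumptions, not the authors' implementation:

```python
import math
from collections import defaultdict

def guess_periods(doc_words, rules, doc_freq, n_docs, guesses_per_type=2):
    """Sum each known word's PDF weighted by its idf, then return the
    top-scoring period numbers for each period type."""
    totals = defaultdict(lambda: defaultdict(float))
    for word in doc_words:
        if word not in rules:
            continue  # words without association rules contribute nothing
        period_type, pdf = rules[word]
        # Standard idf weighting; the +1 guards against unseen words.
        idf = math.log(n_docs / (1 + doc_freq.get(word, 0)))
        for period, score in pdf.items():
            totals[period_type][period] += idf * score
    # Keep the top guesses per period type (two each in the paper).
    return {ptype: sorted(scores, key=scores.get, reverse=True)[:guesses_per_type]
            for ptype, scores in totals.items()}
```

With rules for all five period types, a document yields up to ten guesses (5 period types x 2 guesses each), as described above.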
    <Paragraph position="1"> [Figure 3: Frequency distribution for the extracted periodic component, displayed in a Weekly Period Type format]</Paragraph>
    <Section position="1" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
4.1 TimeML Output
</SectionTitle>
      <Paragraph position="0"> The system can output TimeML-compliant markup using TIMEX tags, which can be used by other TimeML-compliant applications, especially during temporal normalisation. If the base anchor reference date for a document is unknown and the document contains exclusively relative temporal references, our system's output can provide a baseline date against which all the relative dates mentioned in the document can be normalised. The system has been integrated with a fine-grained temporal analysis system based on TimeML, with promising results, especially when processing documents obtained from the Internet.</Paragraph>
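For illustration, a guessed creation date can be serialised as a TimeML TIMEX3 element roughly as follows. The tag and attribute names (`tid`, `type`, `value`, `functionInDocument`) come from the TimeML specification; the helper function itself is hypothetical:

```python
def creation_time_timex(year, month=None, tid="t0"):
    """Emit a TIMEX3 element for the system's guessed document date.
    functionInDocument="CREATION_TIME" marks it as the anchor that
    relative expressions in the document can be normalised against."""
    value = f"{year:04d}" if month is None else f"{year:04d}-{month:02d}"
    return (f'<TIMEX3 tid="{tid}" type="DATE" value="{value}" '
            f'functionInDocument="CREATION_TIME"/>')
```

A year-plus-month guess (e.g. from the Month Number and Year period types) produces a value such as `1998-04`; a year-only guess produces just `1998`.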
    </Section>
  </Section>
</Paper>