<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0903">
  <Title>Automatic Dating of Documents and Temporal Text Classification</Title>
  <Section position="4" start_page="0" end_page="18" type="intro">
    <SectionTitle>
2 Background and Assumptions
</SectionTitle>
    <Paragraph position="0"> The main assumption behind this work is that natural language exhibits a unique signature of varying word frequencies over time. New words come into popular use continually, while other words fall into disuse either after a brief fad or when they become obsolete or archaic. Current events, popular issues and topics also affect writers in their choice of words, as does the time period in which they create documents. This assumption is implicitly made when people try to guess the creation date of a document: we would expect a document written in Shakespeare's time to contain higher frequency counts of words and phrases such as &quot;thou art&quot;, &quot;betwixt&quot;, &quot;fain&quot;, &quot;methinks&quot;, &quot;vouchsafe&quot; and so on than would a modern 21st-century document.</Paragraph>
    <Paragraph position="1"> Similarly, a document that contains a high frequency of occurrence of the words &quot;terrorism&quot;, &quot;Al Qaeda&quot;, &quot;World Trade Center&quot;, and so on is more likely to have been written after 11 September 2001. New words can also be used to create absolute constraints on the creation dates of documents; for example, it is highly improbable that a document containing the word &quot;blog&quot; was written before July 1999 (it was first used in a newsgroup in July 1999 as an abbreviation for &quot;weblog&quot;), or that a document containing the word &quot;Google&quot; was written before 1997.</Paragraph>
    <Paragraph position="2"> Words that are now in common use can also be used to impose constraints on the creation date; for example, the word &quot;bedazzled&quot; has been attributed to Shakespeare, thus allowing documents from his time onwards to be identified automatically. Traditional dictionaries often try to record the date of appearance of new words in the language, and there are various Internet sites, such as WordSpy.com, devoted to chronicling the appearance of new words and their meanings.</Paragraph>
    <Paragraph position="3"> Our system is building up a knowledge base of the first occurrences of various words in different languages, enabling more accurate constraints to be imposed on the likely document creation date automatically.</Paragraph>
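The lower-bound constraint described above can be sketched as follows. This is a minimal illustration and not the system's actual implementation: the FIRST_SEEN table and the earliest_possible_date helper are hypothetical names, and the two entries simply restate the dates quoted in the text (with the first day of the month or year assumed where no day is given).

```python
import re
from datetime import date

# Hypothetical knowledge base: word -> date of first known use.
# Both entries restate dates quoted in the text; the day (and, for
# "google", the month) is an assumption because the text gives none.
FIRST_SEEN = {
    "blog": date(1999, 7, 1),
    "google": date(1997, 1, 1),
}


def earliest_possible_date(text, first_seen=FIRST_SEEN):
    """Return the latest first-use date among known words in text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    dates = [first_seen[w] for w in words if w in first_seen]
    # A document is unlikely to predate the newest word it contains.
    return max(dates) if dates else None
```

A document mentioning both words would be bounded by the later of the two dates, since every word it contains must already have existed when it was written.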
    <Paragraph position="4"> Commercial trademarks and company names are also useful in dating documents, as their registration dates are usually available in public registries. Temporal information extracted from the document itself is also useful in dating it; for example, if a document contains many references to the year 2006, it is quite likely that the document was written in 2006 (or in the last few weeks of 2005).</Paragraph>
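The year-reference heuristic can be illustrated with a short sketch. The most_mentioned_year helper is a hypothetical name, and restricting matches to bare four-digit numbers between 1000 and 2099 is an assumption made for this example; a real system would rely on proper temporal-expression tagging.

```python
import re
from collections import Counter


def most_mentioned_year(text):
    """Return the four-digit year mentioned most often in text, or None."""
    # Assumption: only bare four-digit numbers from 1000 to 2099 count as years.
    years = re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", text)
    if not years:
        return None
    year, _count = Counter(years).most_common(1)[0]
    return int(year)
```

The most frequent year is only a candidate creation date; as the text notes, a document written late in the preceding year may already refer mostly to the year ahead.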
    <Paragraph position="5"> These notions have been used implicitly by researchers and historians when validating the authenticity of documents, but have not been utilised much in automated systems. Similar applications have so far been largely confined to authorship identification, such as (Mosteller and Wallace, 1964; Fung, 2003) and the identification of association rules (Yarowsky, 1994; Silverstein et al., 1997).</Paragraph>
    <Paragraph position="6"> Temporal information is presently underutilised for automated document classification purposes, especially when it comes to guessing at the document creation date automatically. This work presents a method of using periodical temporal-frequency information present in documents to create temporal-association rules that can be used for automatic document dating.</Paragraph>
    <Paragraph position="7"> Past and ongoing related research work has largely focused on the identification and tagging of temporal expressions, with the creation of tagging methodologies such as TimeML/TIMEX (Gaizauskas and Setzer, 2002; Pustejovsky et al., 2003; Ferro et al., 2004), TDRL (Aramburu and Berlanga, 1998) and their associated evaluations such as the ACE TERN competition (Sundheim et al., 2004).</Paragraph>
    <Paragraph position="8"> Temporal analysis has also been applied in Question-Answering systems (Pustejovsky et al., 2004; Schilder and Habel, 2003; Prager et al., 2003), email classification (Kiritchenko et al., 2004), aiding the precision of Information Retrieval results (Berlanga et al., 2001), document summarisation (Mani and Wilson, 2000), time stamping of event clauses (Filatova and Hovy, 2001), temporal ordering of events (Mani et al., 2003) and temporal reasoning from text (Boguraev and Ando, 2005; Moldovan et al., 2005).</Paragraph>
    <Paragraph position="9"> A growing body of work related to the computational treatment of time in language has also been building up, largely since 2000 (COLING 2000; ACL 2001; LREC 2002; TERQAS 2002; TANGO 2003; Dagstuhl 2005).</Paragraph>
    <Paragraph position="10"> There is also a large body of work on time series analysis and temporal logic in Physics, Economics and Mathematics, providing important techniques and general background information.</Paragraph>
    <Paragraph position="11"> In particular, this work uses techniques adapted from Seasonal ARIMA (auto-regressive integrated moving average) models (SARIMA).</Paragraph>
    <Paragraph position="12"> SARIMA models are a class of seasonal, non-stationary temporal models based on the ARIMA process. The ARIMA process is in turn defined as a non-stationary extension of the stationary ARMA model. The ARMA model is one of the most widely used models for analysing time series, especially in Physics, and incorporates both auto-regressive terms and moving average terms (Box and Jenkins, 1976). Non-stationary ARIMA processes are defined by the following equation:</Paragraph>
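The standard ARIMA(p, d, q) definition has the form sketched here, with d, p and q as in the surrounding text. The symbols $Y_t$ for the observed series and $\varepsilon_t$ for white noise, and the reading of $X$ as the backward-shift operator ($X Y_t = Y_{t-1}$), are assumptions made for this sketch; $\phi$ and $\theta$ denote the auto-regressive and moving average polynomials.

```latex
\[
  (1 - X)^{d} \, \phi(X) \, Y_t = \theta(X) \, \varepsilon_t
\]
```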
    <Paragraph position="14"> where d is a non-negative integer, and φ(X) and θ(X) are polynomials of degrees p and q respectively. The SARIMA extension adds seasonal AR and MA polynomials that can handle seasonally varying data in time series.</Paragraph>
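The differencing step that separates a non-stationary ARIMA process from a stationary ARMA one, together with its seasonal counterpart, can be illustrated with a short sketch. The difference helper and its lag and order parameters are hypothetical names chosen for this illustration; the code covers only differencing and does not fit the AR or MA polynomials.

```python
def difference(series, lag=1, order=1):
    """Apply (1 - X**lag) to series `order` times, X being the backward shift."""
    values = list(series)
    for _ in range(order):
        # Each pass subtracts the value `lag` steps earlier and shortens
        # the series by `lag` points.
        values = [values[i] - values[i - lag] for i in range(lag, len(values))]
    return values
```

With lag=1 this removes a polynomial trend of degree `order`; with lag set to the seasonal period it removes a repeating seasonal pattern, which is the kind of variation the seasonal terms are meant to handle.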
    <Paragraph position="15"> The exact formulation of the SARIMA model is beyond the scope of this paper and can be found in various mathematics and physics publications, such as (Chatfield, 2003; Brockwell et al., 1991; Janacek, 2001).</Paragraph>
    <Paragraph position="16"> The main drawback of SARIMA modelling (and associated models built on the basic ARMA model) is that it requires fairly long time series before accurate results are obtained. The majority of authors recommend that a time series of at least 50 data points be used to build the SARIMA model.</Paragraph>
  </Section>
</Paper>