<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-2004">
  <Title>Temporal Context: Applications and Implications for Computational Linguistics</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Time in information retrieval
</SectionTitle>
    <Paragraph position="0"> In the task of retrieving relevant documents based upon keyword queries, it is customary to treat each document as a vector of terms with associated &amp;quot;weights&amp;quot;. One notion of term weight simply counts the occurrences of each term. Of more utility is the scheme known as term frequency-inverse document frequency (TF.IDF): w_kd = tf_kd * log(N / n_k), where w_kd is the weight of term k in document d, tf_kd is the frequency of k in d, N is the total number of documents in the corpus, and n_k is the total number of documents containing k. Very frequent terms (such as function words) that occur in many documents are downweighted, while those that are fairly unique have their weights boosted.</Paragraph>
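The weighting above can be sketched in a few lines of Python (a minimal illustration, not the authors' implementation; tokenization and normalization are omitted):

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Compute w_kd = tf_kd * log(N / n_k) for each term k in each document d.

    `documents` is a list of token lists; returns one {term: weight} dict
    per document. A term occurring in every document gets weight 0.
    """
    N = len(documents)
    doc_freq = Counter()          # n_k: number of documents containing k
    for doc in documents:
        doc_freq.update(set(doc))
    return [{k: tf[k] * math.log(N / doc_freq[k]) for k in tf}
            for tf in (Counter(doc) for doc in documents)]
```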
    <Paragraph position="1"> Many variations of TF.IDF have been suggested (Singhal, 1997). Our variation, temporal term weighting (TTW), incorporates a term's IDF at different points in time: w_kd = tf_kd * log(N(t) / n_k(t)). Under this scheme, the document collection is divided into T time slices, and N and n_k are computed for each slice t. Figure 1 illustrates why such a modification is useful. It depicts the frequency of the terms neural networks and expert system for each year in a collection of Artificial Intelligence-related dissertation abstracts.</Paragraph>
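The time-sliced variant can be sketched the same way: group documents by slice, then compute N(t) and n_k(t) within each group (a sketch of the idea under that assumption, not the authors' code):

```python
import math
from collections import Counter, defaultdict

def ttw_weights(documents):
    """Temporal term weighting: w_kd = tf_kd * log(N(t) / n_k(t)), where
    N(t) and n_k(t) are computed only over documents in d's time slice t.

    `documents` is a list of (year, tokens) pairs; returns one
    {term: weight} dict per document, in input order.
    """
    slices = defaultdict(list)              # slice -> document indices
    for i, (year, _) in enumerate(documents):
        slices[year].append(i)

    weights = [None] * len(documents)
    for year, idxs in slices.items():
        N_t = len(idxs)
        df_t = Counter()                    # n_k(t) within this slice
        for i in idxs:
            df_t.update(set(documents[i][1]))
        for i in idxs:
            tf = Counter(documents[i][1])
            weights[i] = {k: tf[k] * math.log(N_t / df_t[k]) for k in tf}
    return weights
```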
    <Paragraph position="2"> Both terms follow a fairly linear trend, moving in opposite directions.</Paragraph>
    <Paragraph position="3"> As was demonstrated for CL in Section 1, the terms which best characterize AI have also changed through time. Table 2 lists the top five &amp;quot;rising&amp;quot; and &amp;quot;falling&amp;quot; bigrams in this corpus, along with their least-squares fit to a linear trend. Lexical variants (such as plurals) are omitted. Using an atemporal TF.IDF, both rising and falling terms would be assigned weights proportional only to tf_kd. A novice user issuing a query would be given a temporally random scattering of documents, some of which might be state-of-the-art, others very outdated.</Paragraph>
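A least-squares fit of the kind used to rank rising and falling terms can be sketched as follows (an illustrative helper of our own; the paper does not specify its fitting code):

```python
def trend_slope(series):
    """Least-squares slope of a term's yearly frequency series.

    A positive slope marks a "rising" term, a negative slope a "falling"
    one; ranking terms by slope (or by fit quality) recovers lists like
    the rising/falling bigrams of Table 2.
    """
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var
```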
    <Paragraph position="4"> But with TTW, the weights are proportional to the collective &amp;quot;community interest&amp;quot; in the term at a given point in time. In academic research documents, this yields two benefits. If a term rises from obscurity to popularity over the duration of a corpus, it is not unreasonable to assume that this term originated in one or a few seminal articles. The term is not very frequent across documents when these articles are published, so its weight in the seminal articles will be amplified. Similarly, the term will be downweighted in articles when it has become ubiquitous throughout the literature.</Paragraph>
    <Paragraph position="5"> For a falling term, its weight in early documents will be dampened, while its later use will be emphasized. If a term is very frequent in a document after it has been relegated to obscurity, this is likely to be an historical review article. Such an article would be a good place to start an investigation for someone who is unfamiliar with the term.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Future work
</SectionTitle>
      <Paragraph position="0"> We have discovered clear frequency trends over time in several corpora. Given this, TTW seems well suited to information retrieval, but it is still at an embryonic stage. The next step will be the development and implementation of empirical tests.</Paragraph>
      <Paragraph position="1"> IR systems typically are evaluated by measures such as precision and recall, but a different test is necessary to compare TTW to an atemporal TF.IDF. One idea we are exploring is to have a system explicitly tag seminal and historical review articles that are centered around a query term, and then compare the results with those generated by bibliometric methods. Few bibliometric analyses have gone beyond examinations of citation networks and the keywords associated with each article. We would consider the entire text.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Time in text categorization
</SectionTitle>
    <Paragraph position="0"> Text categorization (TC) is the problem of assigning documents to one or more pre-defined categories. As Section 2 demonstrated, the terms which best characterize a category can change through time, so intelligent use of temporal context may prove useful in TC.</Paragraph>
    <Paragraph position="1"> Consider the example of sorting newswire documents into the categories ENTERTAINMENT, BUSINESS, SPORTS, POLITICS, and WEATHER. Suppose we come across the term athens in a training document. We might expect a fairly uniform distribution of this term throughout the five categories; that is, P(C|athens) = 0.20 for each C. However, in the summer of 2004, we would expect P(SPORTS|athens) to be greatly increased relative to the other categories due to the city's hosting of the Olympic games.</Paragraph>
    <Paragraph position="4"> Documents with &amp;quot;temporally perturbed&amp;quot; terms like athens contain potentially valuable information, but this is lost in a statistical analysis based purely on the content of each document, irrespective of its temporal context. This information can be recovered with a technique we call temporal feature modification (TFM). We first outline a formal model of its use.</Paragraph>
    <Paragraph position="5"> Each term k is assumed to have a generator Gk that produces a &amp;quot;true&amp;quot; distribution P(C|k) across all categories. External events at time y can perturb k's generator, causing P(C|k) in year y to be different relative to the background P(C|k) computed over the entire corpus. If the perturbation is significant, we want to separate the instances of k at time y from all other instances. We thus treat athens and &amp;quot;athens+summer2004&amp;quot; as though they were actually different terms, because they came from two different generators.</Paragraph>
    <Paragraph position="6"> TFM is a two-step process that is captured by this pseudocode:</Paragraph>
    <Paragraph position="7"> Training, for each year y:
    build PreModList(C,y,L) for each category C
    ModifyList(y) = DecisionRule(PreModList(C,y,L))
    for each term k in ModifyList(y):
        add pseudo-term &amp;quot;k+y&amp;quot; to Vocab</Paragraph>
    <Paragraph position="8"> Classification, for each document with timestamp y:
    for each term k:
        if &amp;quot;k+y&amp;quot; in Vocab:
            replace k with &amp;quot;k+y&amp;quot;
    classify modified document</Paragraph>
    <Paragraph position="9"> PreModList(C,y,L) is a list of the top L lexemes that, by the odds ratio measure, are highly associated with category C in year y. We test the hypothesis that these come from a perturbed generator in year y, as opposed to the atemporal generator Gk, by comparing the odds ratios of term-category pairs in a PreModList in year y with the same pairs across the entire corpus. Terms which pass this test are added to the final ModifyList(y) for year y. For the results that we report, DecisionRule is a simple ratio test with threshold factor f. Suppose f is 2.0: if the odds ratio between C and k is twice as great in year y as it is atemporally, the decision rule is &amp;quot;passed&amp;quot;. The generator Gk is considered perturbed in year y, and k is added to ModifyList(y). In the training and testing phases, the documents are modified so that a term k is replaced with the pseudo-term &amp;quot;k+y&amp;quot; if it passed the ratio test.</Paragraph>
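The ratio test and the pseudo-term substitution can be sketched minimally as below, under the assumption that odds ratios have already been computed per year and atemporally (the function names and data layout are ours, not the paper's):

```python
def build_modify_list(odds_ratio_year, odds_ratio_all, f=2.0):
    """Ratio test: keep terms whose (category, term) odds ratio in year y
    is at least f times the atemporal odds ratio over the whole corpus.

    Both arguments map (category, term) -> odds ratio; pairs missing from
    the atemporal table are skipped. Returns the set of terms whose
    generator is considered perturbed in year y.
    """
    return {k for (c, k), r_y in odds_ratio_year.items()
            if r_y >= f * odds_ratio_all.get((c, k), float("inf"))}

def apply_tfm(tokens, year, vocab):
    """Replace each term k with the pseudo-term "k+y" when that
    pseudo-term is in the vocabulary; other terms pass through."""
    return [f"{k}+{year}" if f"{k}+{year}" in vocab else k for k in tokens]
```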
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 ACM Classifications
</SectionTitle>
      <Paragraph position="0"> We tested TFM on corpora representing genres from academic publications to Usenet postings, and it improved classification accuracy in every case. The results reported here are for abstracts from the proceedings of several of the Association for Computing Machinery's conferences: SIGCHI, SIGPLAN, and DAC. TFM can benefit the ACM community through retrospective categorization in two ways: (1) 7.73% of abstracts (nearly 6000) across the entire ACM corpus that are expected to have category labels do not have them; (2) when a group of terms becomes popular enough to induce the formation of a new category, a frequent occurrence in the computing literature, TFM would separate the &amp;quot;old&amp;quot; uses from the &amp;quot;new&amp;quot; ones.</Paragraph>
      <Paragraph position="1"> The ACM classifies its documents in a hierarchy of four levels; we used an aggregating procedure to &amp;quot;flatten&amp;quot; these. The characteristics of each corpus are described in Table 3. The &amp;quot;TC minutiae&amp;quot; used in these experiments are: Stoplist, Porter stemming, 90/10% train/test split, Laplacian smoothing. Parameters such as type of classifier (Naive Bayes, KNN, TF.IDF, Probabilistic indexing) and threshold factor f were varied.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the improvement in classification accuracy for different percentages of terms modified, using the best parameter combinations for each corpus, which are noted in Table 4. A baseline of 0.0 indicates accuracy without any temporal modifications. Despite the relative paucity of data in terms of document length, TFM still performs well on the abstracts. The actual accuracies when no terms are modified are less than stellar, ranging from 30.7% (DAC) to 33.7% (SIGPLAN) when averaged across all conditions, due to the difficulty of the task (20-22 categories; each document can only belong to one). Our aim is simply to show improvement.</Paragraph>
      <Paragraph position="1"> In most cases, the technique performs best when making relatively few modifications: the left side of Figure 2 shows a rapid performance increase, particularly for SIGCHI, followed by a period of diminishing returns as more terms are modified.</Paragraph>
      <Paragraph position="2"> After requiring the one-time computation of odds ratios in the training set for each category/year, TFM is very fast and requires negligible extra storage space.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Future work
</SectionTitle>
      <Paragraph position="0"> The &amp;quot;bare bones&amp;quot; version of TFM presented here is intended as a proof-of-concept. Many of the parameters and procedures can be set arbitrarily. For initial feature selection, we used odds ratio because it exhibits good performance in TC (Mladenic, 1998), but it could be replaced by another method such as information gain. The ratio test is not a very sophisticated way to choose which terms should be modified, and presently only detects the surges in the use of a term, while ignoring the (admittedly rare) declines.</Paragraph>
      <Paragraph position="1"> Using TFM on a Usenet corpus that was more balanced in terms of documents per category and per year, we found that allowing different terms to &amp;quot;compete&amp;quot; for modification was more effective than the egalitarian practice of choosing L terms from each category/year. There is no reason to believe that each category/year is equally likely to contribute temporally perturbed terms.</Paragraph>
      <Paragraph position="2"> Finally, we would like to exploit temporal contiguity. The present implementation treats time slices as independent entities, which precludes the possibility of discovering temporal trends in the data. One way to incorporate trends implicitly is to run a smoothing filter across the temporally aligned frequencies. Also, we treat each slice at annual resolution. Initial tests show that aggregating two or more years into one slice improves performance for some corpora, particularly those with temporally sparse data such as DAC.</Paragraph>
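The smoothing idea can be sketched as a centered moving average over each term's yearly frequencies (one simple choice of filter; the paper leaves the filter unspecified):

```python
def smooth(series, window=3):
    """Centered moving average over a temporally aligned frequency series,
    shrinking the window at the edges. This lets adjacent time slices
    inform each other instead of being treated as independent."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out
```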
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Future work
</SectionTitle>
    <Paragraph position="0"> A third part of this research program, presently in the exploratory stage, concerns lexical (semantic) change, the broad class of phenomena in which words and phrases are coined or take on new meanings (Bauer, 1994; Jeffers and Lehiste, 1979). Below we describe an application in document clustering and point toward a theoretical framework for lexical change based upon recent advances in network analysis.</Paragraph>
    <Paragraph position="1"> Consider a scenario in which a user queries a document database for the term artificial intelligence. We would like to create a system that will cluster the returned documents into three categories, corresponding to the types of change the query has undergone. These responses illustrate the three categories, which are not necessarily mutually exclusive:  1. &amp;quot;This term is now more commonly referred to as AI in this collection&amp;quot;, 2. &amp;quot;These documents are about artificial intelligence, though it is now more commonly called machine learning&amp;quot;, 3. &amp;quot;The following documents are about  artificial intelligence, though in this collection its use has become tacit&amp;quot;.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Acronym formation
</SectionTitle>
      <Paragraph position="0"> In Section 2, we introduced the notions of &amp;quot;rising&amp;quot; and &amp;quot;falling&amp;quot; terms. Figure 3 shows relative frequencies of two common terms and their acronyms in the first and second halves of a corpus of AI discussion board postings collected from 1983-1988. While the acronyms increased in frequency, the expanded forms decreased or remained the same. A reasonable conjecture is that in this informal register, the acronyms AI and CS largely replaced the expansions. During the same time period, the more formal register of dissertation abstracts did not show this pattern for any acronym/expansion pairs.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Lexical replacement
</SectionTitle>
      <Paragraph position="0"> Terms can be replaced by their acronyms, or by other terms. In Table 1, database was listed among the top five terms that were most characteristic of the ACL proceedings in 1979-1984. Bisecting this time slice and including bigrams in the analysis, data base ranks higher than database in 1979-1981, but drops much lower in 1982-1984. Within this brief period of time, we see a lexical replacement event taking hold. In the AI dissertation abstracts, artificial intelligence shows the greatest decline, while the conceptually similar terms machine learning and pattern recognition rank sixth and twelfth among the top rising terms.</Paragraph>
      <Paragraph position="1"> There are social, geographic, and linguistic forces that influence lexical change. One example stood out as having an easily identified cause: political correctness. In a corpus of dissertation abstracts on communication disorders from 1982-2002, the term subject showed the greatest relative decrease in frequency, while participant showed the greatest increase. Among the top ten bigrams showing the sharpest declines were three terms that included the word impaired and two that included disabled.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 &amp;quot;Tacit&amp;quot; vocabulary
</SectionTitle>
      <Paragraph position="0"> Another, more subtle lexical change involves the gradual disappearance of terms due to their increasingly &amp;quot;tacit&amp;quot; nature within a particular community of discourse. Their existence becomes so obvious that they need not be mentioned within the community, but would be necessary for an outsider to fully understand the discourse.</Paragraph>
      <Paragraph position="1"> Take, for example, the terms backpropagation and hidden layer. If a researcher of neural networks uses these terms in an abstract, then neural network does not even warrant printing, because within this research community these terms have come to imply its presence.</Paragraph>
      <Paragraph position="2"> Applied to IR, one might call this &amp;quot;retrieval by implication&amp;quot;. Discovering tacit terms is no simple matter, as many of them will not follow simple is-a relationships (e.g. terrier is a dog). The example of the previous paragraph seems to contain a hierarchical relation, but it is difficult to define. We believe that examining the temporal trajectories of closely related networks of terms may be of use here, and is also part of a more general project that we hope to undertake. Our intention is to improve existing models of lexical change using recent advances in network analysis (Barabasi et al., 2002; Dorogovtsev and Mendes, 2001).</Paragraph>
    </Section>
  </Section>
</Paper>