<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0403"> <Title>Temporal Feature Modification for Retrospective Categorization</Title> <Section position="7" start_page="21" end_page="22" type="concl"> <SectionTitle> 6 Summary and Future Work </SectionTitle> <Paragraph position="0"> In this paper, we have demonstrated a feature modification technique that accounts for three kinds of lexical changes in a set of documents with category labels.</Paragraph> <Paragraph position="1"> Within a category, the distribution of terms can change to reflect the changing nature of the category. Terms can also &quot;migrate&quot; between categories. Finally, the categorization system itself can change, leading to necessary lexical changes in categories whose labels remain unaltered. Temporal feature modification (TFM) accounts for these changes and improves performance on the retrospective categorization task when applied to subsets of the Association for Computing Machinery's document collection.</Paragraph> <Paragraph position="2"> While the results presented in this paper indicate that TFM can improve classification accuracy, we would like to demonstrate that its mechanism truly incorporates changes in the lexical content of categories, such as those outlined in Section 1.1. A simple baseline comparison would pit TFM against a procedure in which the corpus is divided temporally into slices, and a classifier is trained and tested on each slice individually. However, due to changes in community interest in certain topics, and in the structure of the hierarchy, some categories are heavily represented in certain (temporal) parts of the corpus and virtually absent elsewhere. Thus, the chance of finding every category represented in a single year is very low. For our corpora, this did not occur even once.</Paragraph> <Paragraph position="3"> The &quot;bare bones&quot; version of TFM presented here is intended as a proof-of-concept.
Many of the parameters and procedures can be set arbitrarily. For initial feature selection, we used odds ratio because it exhibits good performance in TC (Mladenic, 1998), but it could be replaced by another method such as information gain, mutual information, or simple term/category probabilities.</Paragraph> <Paragraph position="4"> The ratio test is not a very sophisticated way to choose which terms should be modified, and it currently detects only surges in the use of a term, ignoring the (admittedly rare) declines.</Paragraph> <Paragraph position="5"> In experiments on a Usenet corpus (not reported here) that was more balanced in terms of documents per category and per year, we found that allowing different terms to &quot;compete&quot; for modification was more effective than the egalitarian practice of choosing L terms from each category/year. There is no reason to believe that each category/year is equally likely to contribute temporally perturbed terms.</Paragraph> <Paragraph position="6"> We would also like to exploit temporal contiguity. The present implementation treats time slices as independent entities, which precludes the possibility of discovering temporal trends in the data. One way to incorporate trends implicitly is to run a smoothing filter across the temporally aligned frequencies. We also treat each slice at annual resolution. Initial tests show that aggregating two or more years into one slice improves performance for some corpora, particularly those with temporally sparse data such as DAC.</Paragraph> </Section> </Paper>