<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0403">
<Title>Temporal Feature Modification for Retrospective Categorization</Title>
<Section position="3" start_page="17" end_page="18" type="intro">
<SectionTitle> 2 Data </SectionTitle>
<Paragraph position="0"> To make our experiments tractable and easily repeatable for different parameter combinations, we chose to train and test on two subsets of the ACM corpus. One subset consists of collections of abstracts from several different ACM conferences. The other is the full-text collection of documents from a single conference.</Paragraph>
<Section position="1" start_page="17" end_page="17" type="sub_section">
<SectionTitle> 2.1 The ACM hierarchy </SectionTitle>
<Paragraph position="0"> All classifications were performed with respect to the ACM's Computing Classification System, 1998 version.</Paragraph>
<Paragraph position="1"> This, the most recent version of the ACM-CCS, is a hierarchical classification scheme that potentially presents a wide range of hierarchical classification issues. Because the work reported here focuses on temporal aspects of text classification, we adopted a strategy that effectively "flattens" the hierarchy. We interpret a document whose primary category sits at a narrow, low level of the hierarchy (e.g., H.3.3.CLUSTERING) as also classified at all broader, higher-level categories leading to the root (H, H.3, H.3.3). With this construction, the most refined categories have fewer example documents, while broader categories have more.</Paragraph>
<Paragraph position="2"> For each of the corpora considered, a threshold of 50 documents was set to guarantee a sufficient number of instances to train a classifier. Narrower branches of the full ACM-CCS tree were truncated if they contained too few examples, and their documents were associated with the parent nodes. For example, if H.3.3 contained 20 documents and H.3.4 contained 30, both would be "collapsed" into the H.3 category, as sketched below.</Paragraph>
<Paragraph position="3"> All of our corpora carry publication timestamps spanning one to three decades. The field of computer science, not surprisingly, has been especially fortunate in that most of its publications have been recorded electronically. While this sample is obviously skewed relative to scientific and academic publishing more generally, we nevertheless find significant "micro-cultural" variation among the different special interest groups.</Paragraph>
</Section>
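The following is a minimal sketch of the flattening and collapsing described in Section 2.1; it is not the authors' code, and the function names and data layout (dot-separated category codes mapped from document ids) are our own assumptions.

from collections import defaultdict

MIN_DOCS = 50  # threshold from Section 2.1 (assumed to apply per category)

def ancestors(category):
    # "H.3.3.CLUSTERING" -> ["H", "H.3", "H.3.3", "H.3.3.CLUSTERING"]
    parts = category.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def flatten(doc_categories):
    # Map each document id to its primary category plus every broader
    # category leading to the root of the hierarchy.
    return {doc_id: set(ancestors(primary))
            for doc_id, primary in doc_categories.items()}

def collapse_small(expanded, min_docs=MIN_DOCS):
    # Drop categories with fewer than min_docs documents; their documents
    # remain counted at the parent levels, which flatten() already added.
    counts = defaultdict(int)
    for cats in expanded.values():
        for c in cats:
            counts[c] += 1
    keep = {c for c, n in counts.items() if n >= min_docs}
    return {doc_id: cats & keep for doc_id, cats in expanded.items()}

For instance, with 20 documents whose primary category falls under H.3.3 and 30 under H.3.4, neither leaf survives the threshold, but all 50 documents remain training instances for H.3 (and for H).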
<Section position="2" start_page="17" end_page="18" type="sub_section">
<SectionTitle> 2.2 SIGIR full text </SectionTitle>
<Paragraph position="0"> We have processed the annual proceedings of the Association for Computing Machinery's Special Interest Group in Information Retrieval (SIGIR) conference from its inception in 1978 to 2002. The collection contains over summaries. Every document is tagged with its year of publication. Unfortunately, only about half of the SIGIR documents bear category labels. The majority of these omissions fall within the 1978-1987 range, leaving us with the remaining 15 years to work with.</Paragraph>
</Section>
<Section position="3" start_page="18" end_page="18" type="sub_section">
<SectionTitle> 2.3 Conference abstracts </SectionTitle>
<Paragraph position="0"> We collected nearly 8,000 abstracts from the Special Interest Group in Programming Languages (SIGPLAN), the Special Interest Group in Computer-Human Interaction (SIGCHI), and the Design Automation Conference (DAC). Characteristics of these collections, and of the SIGIR texts, are shown in Table 2.</Paragraph>
</Section>
<Section position="4" start_page="18" end_page="18" type="sub_section">
<SectionTitle> 2.4 Missing labels in ACM </SectionTitle>
<Paragraph position="0"> We derive the statistics below from the corpus of all documents published by the ACM between 1960 and 2003.</Paragraph>
<Paragraph position="1"> The arguments can be applied to any corpus that has categorized documents but classification gaps in the record.</Paragraph>
<Paragraph position="2"> The first column of Table 3 shows that nearly one fifth of all ACM documents, from both conference proceedings and periodicals, do not possess category labels. We define a document's label as "expected" when more than half of the other documents in its publication (one conference proceeding or one issue of a periodical) are labeled and the publication contains more than ten documents in total (see the sketch below). The second column lists the percentage of documents for which we expected a label but did not find one.</Paragraph>
</Section>
</Section>
</Paper>
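A minimal sketch of the "expected label" rule from Section 2.4, under our own assumptions about the data layout (one list of (doc_id, has_label) pairs per publication); this is not the authors' code.

def expected_missing(publication_docs):
    # publication_docs: list of (doc_id, has_label) pairs for the documents
    # of one conference proceeding or one periodical issue.
    # Returns the ids of unlabeled documents whose label is "expected":
    # the publication has more than ten documents and more than half of
    # the *other* documents in it are labeled.
    missing = []
    for doc_id, has_label in publication_docs:
        if has_label:
            continue
        others = [(d, lab) for d, lab in publication_docs if d != doc_id]
        labeled = sum(1 for _, lab in others if lab)
        if len(publication_docs) > 10 and labeled > len(others) / 2:
            missing.append(doc_id)
    return missing

The per-publication grouping keeps the rule local: a sparsely labeled venue never generates "expected but missing" counts, while a well-labeled one does.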