<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1050">
  <Title>Creating a Test Collection for Citation-based IR Experiments</Title>
  <Section position="3" start_page="0" end_page="392" type="intro">
    <SectionTitle>
2 Motivation
</SectionTitle>
    <Paragraph position="0"> The idea of using terms external to a document, coming from a 'citing' document, has been borrowed from web-based IR. When one paper cites another, a link is made between them, and this link structure is analogous to that of the web: 'hyperlinks ... provide semantic linkages between objects, much in the same manner that citations link documents to other related documents' (Pitkow and Pirolli, 1997). Link structure, particularly anchor text, has been used to advantage in web-based IR.</Paragraph>
    <Paragraph position="1"> While web pages are often poorly self-descriptive (Brin and Page, 1998), anchor text is often a higher-level description of the pointed-to page. Davison (2000) provides a good discussion of how well anchor text does this, supported by experimental results. Thus, beginning with McBryan (1994), there has been a trend of propagating anchor text along its hyperlink to associate it with the linked page, as well as with the page in which it is found. Google, for example, includes anchor text as index terms for the linked page (Brin and Page, 1998). The TREC Web tracks have also shown that using anchor text improves retrieval effectiveness for some search tasks (Hawking and Craswell, 2005).</Paragraph>
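The anchor-text propagation described above can be sketched as a toy inverted index. This is a minimal illustration, not any system's actual implementation; the page records and their fields are invented for the example.

```python
# Sketch of anchor-text propagation (in the spirit of McBryan 1994):
# each page is indexed by its own terms AND by the anchor text of
# hyperlinks pointing to it. The corpus below is hypothetical.
from collections import defaultdict

def build_index(pages):
    """pages: list of dicts with 'id', 'text', and 'links',
    where 'links' is a list of (target_id, anchor_text) pairs.
    Returns a term -> set-of-page-ids inverted index."""
    index = defaultdict(set)
    for page in pages:
        # index the page by its own content terms
        for term in page["text"].lower().split():
            index[term].add(page["id"])
        # propagate each link's anchor text to the *linked* page
        for target_id, anchor in page["links"]:
            for term in anchor.lower().split():
                index[term].add(target_id)
    return index

pages = [
    {"id": "A", "text": "welcome home", "links": [("B", "search engine paper")]},
    {"id": "B", "text": "PageRank details", "links": []},
]
index = build_index(pages)
# "engine" never occurs in B's own text, yet B is now retrievable by it
```

The same mechanism carries over to citations: replace hyperlinks with references and anchor text with the sentence(s) surrounding the citation.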
    <Paragraph position="2"> This idea has already been applied to citations and scientific articles (Bradshaw, 2003). In Bradshaw's experiment, scientific documents are indexed by the text that refers to them in documents that cite them.</Paragraph>
    <Paragraph position="3"> However, unlike in experiments with previous collections, we need both the citing and the cited article as full documents in our collection. The question of how to identify citation 'anchor text' and its extent is a matter for research; this requires the full text of the citing article. Previous experiments and test collections have had only limited access to the content of the citing article: Bradshaw had access only to a fixed window of text around the citation, as provided by CiteSeer's 'citation context'; in the GIRT collections (Kluck, 2003), a dozen or so content-bearing information fields (e.g., title, abstract, methodological descriptors) represent each document, and the full text is not available. Additionally, in Bradshaw's experiment, no access is given to the text of the cited article itself, so the influence of a term-based IR model cannot be studied; moreover, documents can be indexed only if they have been cited at least once.</Paragraph>
    <Paragraph position="4"> A test collection containing full text for many citing and cited documents, thus, has advantages from a methodological point of view.</Paragraph>
    <Section position="1" start_page="391" end_page="392" type="sub_section">
      <SectionTitle>
2.1 Choosing a Genre
</SectionTitle>
      <Paragraph position="0"> When choosing a scientific field to study, we looked for one in which compiling the document collection is practicable for us (freely available machine-readable documents; as few document styles as possible), while still ensuring good coverage of research topics in an entire field. Had we chosen the medical field or bioinformatics, the large number of journals would have made practical document preparation a problem.</Paragraph>
      <Paragraph position="1"> We also looked for a relatively self-contained field. As we aim to propagate referential text to cited papers as index terms, references from documents in the collection to other documents within the collection will be most useful. We call these internal references. While it is impossible to find or create a collection of documents with only internal references, we aim for as high a proportion of internal references as possible.</Paragraph>
      <Paragraph position="2"> We chose the ACL (Association for Computational Linguistics) Anthology1, a freely available digital archive of computational linguistics research papers. Computational linguistics is a small, homogeneous research field, and the Anthology contains the most prominent publications since the beginning of the field in 1960. It consists of only 2 journals, 7 conferences and 5 less important publications, such as discontinued conferences and a series of workshops, resulting in only 7000 papers2.</Paragraph>
      <Paragraph position="3"> With the ACL Anthology, we expect a high proportion of internal references within a relatively compact document collection. We empirically measured the proportion of collection-internal references to all references and found it to be 0.33 (the in-factor). We wanted to compare this number to the situation in another, larger field (genetics), but no straightforward comparison is possible, as there are very many genetics journals, and journal quality probably plays a larger role in a bigger field. We tried to simulate a collection similar to the 9 main journals and conferences in the Anthology by considering 10 journals in genetics with a range of impact factors3, resulting in an in-factor of 0.17 (dropping to 0.14 if only 5 journals are considered). Thus, our hypothesis that the Anthology is reasonably self-contained, at least in comparison with other possible collections, was confirmed.</Paragraph>
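The in-factor defined above is simply internal references divided by all references. A minimal sketch of that computation, with invented reference data for illustration:

```python
# Hedged sketch of the "in-factor" calculation: the proportion of
# references that point to documents inside the collection.
# The document ids and reference pairs below are hypothetical.
def in_factor(references, collection):
    """references: list of (citing_id, cited_id) pairs;
    collection: set of document ids in the test collection.
    Returns internal references / all references."""
    if not references:
        return 0.0
    internal = sum(1 for _, cited in references if cited in collection)
    return internal / len(references)

collection = {"P1", "P2", "P3"}
references = [
    ("P1", "P2"),   # internal: cited document is in the collection
    ("P1", "X99"),  # external: cited document is outside it
    ("P2", "P3"),   # internal
]
# in_factor(references, collection) -> 2/3, i.e. about 0.67
```

An in-factor of 0.33 for the Anthology thus means one in three references in its papers points to another Anthology paper.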
      <Paragraph position="4"> The choice of computational linguistics has the added benefit that we are familiar with the domain; we can interpret the subject matter better than we could in the medical domain. This should be of use to us in our eventual experiments.</Paragraph>
    </Section>
  </Section>
</Paper>