<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1050">
  <Title>Creating a Test Collection for Citation-based IR Experiments</Title>
  <Section position="4" start_page="392" end_page="394" type="metho">
    <SectionTitle>
3 Building Test Collections
</SectionTitle>
    <Paragraph position="0"> To turn our document collection into a test collection, a parallel set of search queries and relevance judgements is needed. There are a number of alternative methods for building a test collection. For TREC, humans devise queries speci cally for a given set of documents and make relevance judgements on pooled retrieved documents from that set (Harman, 2005). Theirs is an extremely labour-intensive and expensive process and an unrealistic option in the context of our project.</Paragraph>
    <Paragraph position="1"> The Cran eld 2 tests (Cleverdon et al., 1966) introduced an alternative method for creating a test collection, speci cally for scienti c texts. The method was subject to criticism and has not been employed much since. Nevertheless, we believe this method to be worth revisiting for our current situation. In this section, we describe in turn the Craneld 2 method and our adapted method. We discuss some of the original criticisms and their bearing on our own work, then describe our returns thus far.</Paragraph>
    <Section position="1" start_page="392" end_page="392" type="sub_section">
      <SectionTitle>
3.1 The Cranfield 2 Test Collection
</SectionTitle>
      <Paragraph position="0"> The Cran eld 2 tests (Cleverdon et al., 1966) were a comparative evaluation of indexing language devices. From a base collection of 182 (high speed aerodynamics and aircraft structures) papers, the Cran eld test collection was built by asking the authors to formulate the research question(s) behind their work and to judge how relevant each reference in their paper was to each of their research questions, on a 5-point scale. Referenced documents were obtained and added to the base set. Authors were also asked to list additional relevant papers not cited in their paper. The collection was further expanded in a second stage, using bibliographic coupling to search for similar papers to the referenced ones and employing humans to search the collection for other relevant papers. The resultant collection comprised 1400 documents and 221 queries (Cleverdon, 1997).</Paragraph>
      <Paragraph position="1"> The principles behind the Cran eld technique are: * Queries: Each paper has an underlying research question or questions; these constitute valid search queries.</Paragraph>
      <Paragraph position="2"> * Relevant documents: A paper's reference list is a good starting point for nding papers relevant to its research questions.</Paragraph>
      <Paragraph position="3"> * Judges: The paper author is the person best quali ed to judge relevance.</Paragraph>
    </Section>
    <Section position="2" start_page="392" end_page="393" type="sub_section">
      <SectionTitle>
3.2 Our Anthology Test Collection
</SectionTitle>
      <Paragraph position="0"> We altered the Cran eld design to t to a xed, existing document collection. We designed our methodology around an upcoming conference and approached the paper authors at around the time of the conference, to maximize their willingness to participate and to minimise possible changes in their perception of relevance since they wrote the paper.</Paragraph>
      <Paragraph position="1"> Due to the relatively high in-factor of the collection, we expected a signi cant proportion of the relevance judgements gathered in this way to be about Anthology documents and, thus, useful as evaluation data. Hence, the authors of accepted papers for ACL-2005 and HLT-EMNLP-2005 were asked, by email, for their research questions and relevance judgements for their references. We de ned a 4-point relevance scale, c.f. Table 1, since we felt that the distinctions between the Cran eld grades were not clear enough to warrant 5. Our guidelines also included examples of referencing situations that might t each category. Personalized materials for participation were sent, including a reproduction of their paper's reference list in their response form. This meant that invitations could only be sent once the paper had been made available online.</Paragraph>
      <Paragraph position="2"> We further deviated from the Cran eld methodology by deciding not to ask the authors to try to list additional references that could have been included in their reference list. An author's willingness to name such references will differ more from author to author than their naming of original references, as referencing is part of a standardized writing process.</Paragraph>
      <Paragraph position="3"> By asking for this data, the consistency of the data across papers will be degraded and the status of any additional references will be unclear. Furthermore, feedback from an informal pilot study conducted on ten paper authors con rmed that some authors found this task particularly dif cult.</Paragraph>
      <Paragraph position="4"> Each co-author of the papers was invited individually to participate, rather than inviting the rst author alone. This increased the number of invitations that needed to be prepared and sent (by a factor of around 2.5) but also increased the likelihood of getting a return for a given paper. Furthermore, data from multiple co-authors of the same paper can be used to  Grade Description 4 The reference is crucially relevant to the problem. Knowledge of the contents of the referred work will be fundamental to the reader's understanding of your paper. Often, such relevant references are afforded a substantial amount of text in a paper e.g., a thorough summary.</Paragraph>
      <Paragraph position="5"> 3 The reference is relevant to the problem. It may be helpful for the reader to know the contents of the referred work, but not crucial. The reference could not have been substituted or dropped without making signi cant additions to the text. A few sentences may be associated with the reference. 2 The reference is somewhat (perhaps indirectly) relevant to the problem. Following up the reference probably would not improve the reader's understanding of your paper. Alternative references may have been equally appropriate (e.g., the reference was chosen as a representative example from a number of similar references or included in a list of similar references). Or the reference could have been dropped without damaging the informativeness of your paper. Minimal text will be associated with the reference.</Paragraph>
      <Paragraph position="6"> 1 The reference is irrelevant to this particular problem.</Paragraph>
      <Paragraph position="7">  measure co-author agreement on the relevance task.</Paragraph>
      <Paragraph position="8"> This is an interesting research question, as it is not at all clear how much even close collaborators would agree on relevance, but we do not address this here.</Paragraph>
      <Paragraph position="9"> We plan to expand the collection in a second stage, in line with the Cran eld 2 design. We will reapproach contributing authors after obtaining retrieval results on our collection (e.g., with a standard IR engine) and ask them to make additional relevance judgements on these papers.</Paragraph>
    </Section>
    <Section position="3" start_page="393" end_page="394" type="sub_section">
      <SectionTitle>
3.3 Criticisms of Cranfield 2
</SectionTitle>
      <Paragraph position="0"> Both Cran eld 1 (Cleverdon, 1960) and 2 were sub-ject to various criticisms; (Spcurrency1arck Jones, 1981) gives an excellent account of the tests and their criticisms.</Paragraph>
      <Paragraph position="1"> The majority were criticisms of the test collection paradigm itself and are not pertinent here. However, the source-document principle (i.e., the use of queries created from documents in the collection) attracted particular criticisms. The fundamental concern was that the way in which the queries were created led to an unnaturally close relation between the terms in the queries and those used to index the documents in the colection (Vickery, 1967); any such relationship might have created a bias towards a particular indexing language, distorting the comparisons that were the goal of the project.</Paragraph>
      <Paragraph position="2"> In Cran eld 1, system success was measured by retrieval of source documents alone, criticized for being an over-simpli cation and a distortion of 'real-life' searching. The evaluation procedure was changed for Cran eld 2 so that source documents were excluded from searches and, instead, retrieval of other relevant documents was used to measure success. This removed the problem that, usually, when a user searches, there is no source document for their query. Despite this, Vickery notes that there were still verbal links between sought document and question in the new method: each query author was asked to judge the relevance of the source document's references and the questions ... were formulated after the cited papers had been read and has possibly in uenced the wording of his question .</Paragraph>
      <Paragraph position="3"> While adapting the Cran eld 2 method to our needs, we have tried to address some of the criticisms, e.g., that authors' relevance judgements change over time. Nevertheless, we still have source-document queries and must consider the associated criticisms. Firstly, our test collection is not intended for comparisons of indexing languages.</Paragraph>
      <Paragraph position="4"> Rather, we aim to compare the effect of adding extra index terms to a base indexing of the documents.</Paragraph>
      <Paragraph position="5"> The source documents will have no in uence on the base indexing of a document above that of the other documents. The additional index terms, coming from citations to that document, will generally be 'chosen' by someone other than the query author, with no knowledge of the query terms4. Also, our documents will be indexed fully automatically, further diminishing the scope of any subconscious human in uence.</Paragraph>
      <Paragraph position="6"> Thus, we believe that the suspect relationship between queries and indexing is negligible in the con4The exception to this is self-citation. This (very indirectly) allows the query author to in uence the indexing but it seems highly improbable that an author would be thinking about their query whilst citing a previous work.</Paragraph>
      <Paragraph position="7">  text of our work, as opposed to the Cran eld tests, and that the source-document principle is sound.</Paragraph>
    </Section>
    <Section position="4" start_page="394" end_page="394" type="sub_section">
      <SectionTitle>
3.4 Returns and Analysis
</SectionTitle>
      <Paragraph position="0"> Out of around 500 invitations sent to conference authors, 85 resulted in research questions with relevance judgements being returned; 235 queries in total. Example queries are: * Do standard probabilistic parsing techniques, developed for English, fare well for French and does lexicalistion help improve parsing results? * Analyze the lexical differences between genders engaging in telephone conversations.</Paragraph>
      <Paragraph position="1"> Of the 235 queries, 18 were from authors whose co-authors had also returned data and were discarded (for retrieval purposes); we treat co-author data on the same paper as 'the same' and keep only the rst authors'. 47 queries had no relevant Anthology-internal references and were discarded.</Paragraph>
      <Paragraph position="2"> Another 15 had only relevant Anthology references not yet included in the archive5; we keep these for the time being. This leaves 170 unique queries with at least 1 relevant Anthology reference and an average of 3.8 relevant Anthology references each. The average in-factor across queries is 0.42 (similar to our previously estimated Anthology in-factor)6.</Paragraph>
      <Paragraph position="3"> Our average number of judged relevant documents per query is lower than for Cran eld, which had an average of 7.2 (Spcurrency1arck Jones et al., 2000). However, this is the nal number for the Craneld collection, arrived at after the second stage of relevance judging, which we have not yet carried out. Nevertheless, we must anticipate a potentially low number of relevant documents per query, particularly in comparison to, e.g., the TREC ad hoc track (Voorhees and Harman, 1999), with 86.8 judged relevant documents per query.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="394" end_page="394" type="metho">
    <SectionTitle>
4 Document Collection and Processing
</SectionTitle>
    <Paragraph position="0"> The Anthology documents are distributed in PDF, a format designed to visually render printable documents, not to preserve editable text. So the PDF collection must be converted into a fully textual format.</Paragraph>
    <Paragraph position="1">  Firstly, OmniPage Pro 147, a commercial PDF processing software package, scans the PDFs and produces an XML encoding of character-level page layout information. AI algorithms for heuristically extracting character information (similar to OCR) are necessary since many of the PDFs were created from scanned paper-copies and others do not contain character information in an accessible format.</Paragraph>
    <Paragraph position="2"> The OmniPage output describes a paper as text blocks with typesetting information such as font and positional information. A pre-processor (Lewin et al., 2005) lters and summarizes the OmniPage output into Intermediate XML (IXML), as well as correcting certain characteristic errors from that stage. A journal-speci c template converts the IXML to a logical XML-based document structure (Teufel and Elhadad, 2002), by exploiting low-level, presentational, journal-speci c information such as font size and positioning of text blocks.</Paragraph>
    <Paragraph position="3"> Subsequent stages incrementally add more detailed information to the logical representation. The paper's reference list is annotated in more detail, marking up individual references, author names, titles and years of publication. Finally, a citation processor identi es and marks up citations in the document body and their constituent parts, e.g., author names and years.</Paragraph>
  </Section>
  <Section position="6" start_page="394" end_page="396" type="metho">
    <SectionTitle>
5 Preliminary Experimentation
</SectionTitle>
    <Paragraph position="0"> We expect that our test collection, built for our citation experiments, will be of wider value and we intend to make it publicly available. As a sanity check on our data so far, we carried out some preliminary experimentation, using standard IR tools: the Lemur  its integrated language-model based search engine, and the TREC evaluation software, trec eval9.</Paragraph>
    <Section position="1" start_page="395" end_page="395" type="sub_section">
      <SectionTitle>
5.1 Experimental Set-up
</SectionTitle>
      <Paragraph position="0"> We indexed around 4200 Anthology documents.</Paragraph>
      <Paragraph position="1"> This is the total number of documents that have, at the time of writing, been processed by our pipeline (24 years of CL journal, 25 years of ACL proceedings, 14 years of assorted workshops), plus another [?]90 documents for which we have relevance judgements that are not currently available through the Anthology website but should be incorporated into the archive in the future. The indexed documents do not yet contain annotation of the reference list or citations in text. 19 of our 170 queries have no relevant references in the indexed documents and were not included in these experiments. Thus, Figure 2 shows the distribution of queries over number of relevant Anthology references, for a total of 151 queries.</Paragraph>
      <Paragraph position="2"> Our Indri index was built using default parameters with no optional processing, e.g., stopping or stemming, resulting in a total of 20117410 terms, 218977 unique terms and 2263 'frequent'10 terms.</Paragraph>
      <Paragraph position="3"> We then prepared an Indri-style query le from the conference research questions. The Indri query language is designed to handle highly complex queries but, for our very basic purposes, we created simple bag-of-words queries by stripping all punctuation from the natural language questions and using Indri's #combine operator over all the terms. This means Indri ranks documents in accordance with query likelihood. Again, no stopping or stemming was applied.</Paragraph>
      <Paragraph position="4"> Next, the query le was run against the Anthology index using IndriRunQuery with default parameters and, thus, retrieving 1000 documents for each query.</Paragraph>
      <Paragraph position="5"> Finally, for evaluation, we converted the Indri's ranked document lists to TREC-style top results le and the conference relevance judgements compiled into a TREC-style qrels le, including only judgements corresponding to references within the indexed documents. These les were then input to trec eval, to calculate precision and recall metrics.</Paragraph>
    </Section>
    <Section position="2" start_page="395" end_page="396" type="sub_section">
      <SectionTitle>
5.2 Results and Discussion
</SectionTitle>
      <Paragraph position="0"> Out of 489 relevant documents, 329 were retrieved within 1000 (per query) documents. The mean average precision (MAP) was 0.1014 over the 151 queries. This is the precision calculated at each relevant document retrieved (0.0, if that document is not retrieved), averaged over all relevant documents for all queries, i.e., non-interpolated. R-precision, the precision after R (the number of relevant documents for a query) documents are returned, was 0.0965.</Paragraph>
      <Paragraph position="1"> The average precision at 5 documents was 0.0728.</Paragraph>
      <Paragraph position="2"> We investigated the effect of excluding queries with lower than a threshold number of judged relevant documents. Figure 3 shows that precision at 5 documents increases as greater threshold values are applied. Similar trends were observed with other evaluation measures, e.g., MAP and R-precision increased to 0.2018 and 0.1528, respectively, when only queries with 13 or more relevant documents were run, though such stringent thresholding does result in very few queries. Nevertheless, these trends do suggest that the present low number of relevant documents has an adverse effect on retrieval results and is a potential problem for our test collection.</Paragraph>
      <Paragraph position="3"> We also investigated the effect of including only authors' main queries, as another potential way of objectively constructing a 'higher quality' query set.</Paragraph>
      <Paragraph position="4"> Although, this decreased the average in-factor of relevant references, it did, in fact, increase the average absolute number of relevant references in the index.</Paragraph>
      <Paragraph position="5"> Thus, MAP increased to 0.1165, precision at 5 documents to 0.1016 and R-precision to 0.1201.</Paragraph>
      <Paragraph position="6"> These numbers look poor in comparison to the performance of IR systems at TREC but, importantly, they are not intended as performance results. Their purpose is to demonstrate that such numbers can be produced using the data we have collected,  rather than to evaluate the performance of some new retrieval system or strategy.</Paragraph>
      <Paragraph position="7"> A second point for consideration follows directly from the rst: our experiments were carried out on a new test collection and different test collections have different intrinsic dif culty (Buckley and Voorhees, 2004). Thus, it is meaningless to compare statistics from this data (from a different domain) to those from the TREC collections, where queries and relevance judgements were collected in a different way, and where there are very many relevant documents.</Paragraph>
      <Paragraph position="8"> Thirdly, our experiments used only the most basic techniques and the results could undoubtedly be improved by, e.g., applying a simple stop-list. Nevertheless, this notion of intrinsic dif culty means that it may be the case that evaluations carried out on this collection will produce characteristically low precision values.</Paragraph>
      <Paragraph position="9"> Low numbers do not necessarily preclude our data's usefulness as a test collection, whose purpose is to facilitate comparative evaluations. (Voorhees, 1998) states that To be viable as a laboratory tool, a [test] collection must reliably rank different retrieval variants according to their true effectiveness and defends the Cran eld paradigm (from criticisms based on relevance subjectivity) by demonstrating that the relative performance of retrieval runs is stable despite differences in relevance judgements. The underlying principle is that it is not the absolute precision values that matter but the ability to compare these values for different retrieval techniques or systems, to investigate their relative bene ts. A test collection with low precision values will still allow this. It is known that all evaluation measures are unstable for very small numbers of relevant documents (Buckley and Voorhees, 2000) and there are issues arising from incomplete relevance information in a test collection (Buckley and Voorhees, 2004). This makes the second stage of our test collection compilation even more indispensable (asking subjects to judge retrieved documents), as this will increase the number of judged relevant documents, as well as bridging the completeness gap.</Paragraph>
      <Paragraph position="10"> There are further possibilities of how the problem could be countered. We could exclude queries with lower than a threshold number of relevant documents (after the second stage). Given the respectable number of queries we have, we might be able to afford this luxury. We could add relevant documents from outside the Anthology to our collection. This is least preferable methodologically: using the Anthology has the advantage that it has a real identity and was created for real reasons outside our experiments. Furthermore, the collection 'covers a eld', i.e., it includes all important publications and only those. By adding external documents to the collection, it would lose both these properties.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>