<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1035">
  <Title>Toward a Task-based Gold Standard for Evaluation of NP Chunks and Technical Terms</Title>
  <Section position="2" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The standard metrics for evaluation of the output of NLP systems are precision and recall. Given an arguably correct list of the units that a system would identify if it performed perfectly, there should in principle be no discrepancy between the units identified by a system and the units that are either useful in a particular application or are preferred by human beings for use in a particular task. But when the satisfactory output can take many different forms, as in summarization and generation, evaluation by precision and recall is not sufficient. In these cases, the challenge for system designers and users is to effectively distinguish between systems that provide generally satisfactory output and systems that do not.</Paragraph>
    <Paragraph position="1"> NP chunks (Abney 1991; Ramshaw and Marcus 1995; Evans and Zhai 1996; Frantzi and Ananiadou 1996) and technical terms (Dagan and Church 1994; Justeson and Katz 1995; Daille 1996; Jacquemin 2001; Bourigault et al. 2002) fall into this difficult-toassess category. NPs are recursive structures. For the maximal NP large number of recent newspaper articles on biomedical science and clinical practice, a full-fledged parser would legitimately identify (at least) seven NPs in addition to the maximal one: large number; recent newspaper articles; large number of recent newspaper articles; biomedical science; clinical practice; biomedical science and clinical practice; and recent newspaper articles on biomedical science and clinical practice. To evaluate the performance of a parser, NP chunks can usefully be evaluated by a gold standard; many systems (e.g., Ramshaw and Marcus 1995 and Cardie and Pierce 1988) use the Penn Treebank for this type of evaluation. But for most applications, output that lists a maximal NP and each of its component NPs is bulky and redundant. Even a system that achieves 100% precision and recall in identifying all of the NPs in a document needs criteria for determining which units to use in different contexts or applications.</Paragraph>
    <Paragraph position="2"> Technical terms are a subset of NP chunks. Jacquemin (2001:3) defines terms as multi-word &amp;quot;vehicles of scientific and technical information&amp;quot;.  The operational difficulty, of course, is to decide whether a specific term is a vehicle of scientific and technical information (e.g., birth date or light truck). Evaluation of mechanisms that filter out some terms while retaining others is subject to this difficulty. This is exactly the kind of case where context plays a significant role in deciding whether a term conforms to a definition and where experts disagree.</Paragraph>
    <Paragraph position="3"> In this paper, we turn to an information access task in order to assess terms identified by different techniques. There are two basic types of information access mechanisms, searching and browsing. In searching, the user generates the search terms; in  browsing, the user recognizes potentially useful terms from a list of terms presented by the system. When an information seeker can readily think up a suitable term or linguistic expression to represent the information need, direct searching of text by user-generated terms is faster and more effective than browsing.</Paragraph>
    <Paragraph position="4"> However, when users do not know (or can't remember) the exact expression used in relevant documents, they necessarily struggle to find relevant information in full-text search systems. Experimental studies have repeatedly shown that information seekers use many different terms to describe the same concept and few of these terms are used frequently (Furnas et al. 1987; Saracevic et al. 1988; Bates et al. 1998). When information seekers are unable to figure out the term used to describe a concept in a relevant document, electronic indexes are required for successful information access.</Paragraph>
    <Paragraph position="5"> NP chunks and technical terms have been proposed for use in this task (Boguraev and Kennedy 1997; Wacholder 1998). NP chunks and technical terms have also been used in phrase browsing and phrase hierarchies (Jones and Staveley 1999; Nevill-Manning et al. 1999; Witten et al. 1999; Lawrie and Croft 2000) and summarization (e.g., McKeown et al.</Paragraph>
    <Paragraph position="6"> 1999; Oakes and Paice 2001). In fact, the distinction between task-based evaluation of a system and precision/recall evaluation of the quality of system output is similar to the extrinsic/intrinsic evaluation of summarization (Gallier and Jones 1993).</Paragraph>
    <Paragraph position="7"> In order to focus on the subjects' choice of index terms rather than on other aspects of the information access process, we asked subject to find answers to questions in a college level text book. Subjects used the Experimental Searching and Browsing Interface (ESBI) to browse a list of terms that were identified by different techniques and then merged. Subjects select an index term by clicking on it in order to hyperlink to the text itself. By design, ESBI forces the subjects to access the text indirectly, by searching and browsing the list of index terms, rather than by direct searching of the text.</Paragraph>
    <Paragraph position="8"> Three sets of terms were used in the experiment: one set (HS) was identified using the head-sorting method of Wacholder (1998); the second set (TT) was identified by an implementation of the technical term algorithm of Justeson and Katz (1995); a third set (HUM) was created by a human indexer. The methods for identifying these terms will be discussed in greater detail below.</Paragraph>
    <Paragraph position="9"> Somewhat to our surprise, subjects displayed a very strong preference for the index terms that were identified by the human indexer. Table 1 shows that when measured by percentage terms selected, subjects chose over 13% of the available human terms, but only 1.73% and 1.43% of the automatically selected terms; by this measure the subjects' preference for the human terms was more than 7 times greater than the preference for either of the automatic techniques. (In Table 1 and in the rest of this paper, all index term counts are by type rather than by token,  subjects relative to number of terms in the entire index.</Paragraph>
    <Paragraph position="10"> This initial experiment strongly indicates that 1) people have a demonstrable preference for different types of index terms; 2) these human terms are a very good gold standard. If subjects use a greater proportion of the terms identified by a particular technique, the terms can be judged better than the terms identified by another technique, even if the terms are different. Any automatic technique capable of identifying terms that are preferred over these human terms would be a very strong system indeed. Furthermore, the properties of the terms preferred by the experimental subjects can be used to guide design of systems for identifying and selecting NP chunks and technical terms.</Paragraph>
    <Paragraph position="11"> In the next section, we describe the design of the experiment and in Section 3, we report on what the experimental data shows about human preferences for different kinds of index terms.</Paragraph>
  </Section>
class="xml-element"></Paper>