<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1613"> <Title>Automatic classification of citation function</Title> <Section position="4" start_page="106" end_page="107" type="metho"> <SectionTitle> 3 Features for automatic recognition of citation function </SectionTitle> <Paragraph position="0"> This section summarises the features we use for machine learning of citation function. Some of these features were previously found useful for a different application, namely Argumentative Zoning (Teufel, 1999; Teufel and Moens, 2002); others are specific to citation classification.</Paragraph> <Section position="1" start_page="106" end_page="106" type="sub_section"> <SectionTitle> 3.1 Cue phrases </SectionTitle> <Paragraph position="0"> Myers (1992) calls meta-discourse the set of expressions that talk about the act of presenting research in a paper, rather than the research itself (which is called object-level discourse). For instance, Swales (1990) names phrases such as "to our knowledge, no ..." or "As far as we are aware" as meta-discourse associated with a gap in the current literature. Strings such as these have been used successfully in extractive summarisation ever since Paice's (1981) work.</Paragraph> <Paragraph position="1"> We model meta-discourse (cue phrases) and treat it differently from object-level discourse.</Paragraph> <Paragraph position="2"> There are two different mechanisms. The first is a finite grammar over strings with a placeholder mechanism for POS and for sets of similar words which can be substituted into a string-based cue phrase (Teufel, 1999). The grammar corresponds to 1762 cue phrases. It was developed on 80 papers which are different from the papers used for our experiments here.</Paragraph> <Paragraph position="3"> The second mechanism is a POS-based recogniser of agents together with a recogniser for specific actions these agents perform. 
Two main agent types (the authors of the paper, and everybody else) are modelled by 185 patterns. (Footnote 5: Spiegel-Rüsing found that, out of 2309 citations she examined, 80% substantiated statements.)</Paragraph> <Paragraph position="4"> For instance, in a paragraph describing related work, we expect to find references to other people in subject position more often than in the section detailing the authors' own methods, whereas in the background section, we often find general subjects such as researchers in computational linguistics or in the literature.</Paragraph> <Paragraph position="5"> For each sentence to be classified, its grammatical subject is determined by POS patterns and, if possible, classified as one of these agent types. We also use the observation that in sentences without meta-discourse, one can assume that agenthood has not changed.</Paragraph> <Paragraph position="6"> Twenty different action types model the main verbs involved in meta-discourse. For instance, there is a set of verbs that is often used when the overall scientific goal of a paper is defined. These are the verbs of presentation, such as propose, present, report and suggest; in the corpus we found other verbs in this function, but with a lower frequency, namely describe, discuss, give, introduce, put forward, show, sketch, state and talk about. There are also specialised verb clusters which co-occur with PBas sentences, for example the cluster of continuation of ideas (e.g., adopt, agree with, base, be based on, be derived from, be originated in, be inspired by, borrow, build on, ...). On the other hand, the semantics of verbs in Weak sentences is often concerned with failing (of other researchers' approaches), and such sentences often contain verbs such as abound, aggravate, arise, be cursed, be incapable of, be forced to, be limited to, ... 
.</Paragraph> <Paragraph position="7"> We use 20 manually acquired verb clusters.</Paragraph> <Paragraph position="8"> Negation is recognised, but is too rare to define its own clusters: out of the 20 × 2 = 40 theoretically possible verb clusters, only 27 were observed in our development corpus. We have recently automated the process of verb-object pair acquisition from corpora for two types of cue phrases (Abdalla and Teufel, 2006) and are planning to expand this work to other cue phrases.</Paragraph> </Section> <Section position="2" start_page="106" end_page="107" type="sub_section"> <SectionTitle> 3.2 Cues identified by annotators </SectionTitle> <Paragraph position="0"> During the annotator training phase, the annotators were encouraged to type in the meta-description cue phrases that justify their choice of category. We went through this list by hand and extracted 892 cue phrases (around 75 per category). The files these cues came from were not part of the test corpus. We included 12 features that recorded the presence of cues that our annotators associated with a particular class.</Paragraph> </Section> <Section position="3" start_page="107" end_page="107" type="sub_section"> <SectionTitle> 3.3 Other features </SectionTitle> <Paragraph position="0"> There are other features which we use for this task. We know from Teufel and Moens (2002) that verb tense and voice should be useful for recognising statements of previous work, future work and work performed in the paper. We also recognise modality (whether or not a main verb is modified by an auxiliary, and which auxiliary it is).</Paragraph> <Paragraph position="1"> The overall location of a sentence containing a reference should be relevant. We observe that more PMot citations appear towards the beginning of the paper, as do Weak citations, whereas comparative results (CoCoR0, CoCoR-) appear towards the end of articles. 
More fine-grained location features, such as the location within the paragraph and the section, have also been implemented. The fact that a citation points to the authors' own previous work can be recognised, as we know who the paper authors are. As we have access to the information in the reference list, we also know the last names of all cited authors (even in the case where an et al. statement in running text obscures the later-occurring authors). With self-citations, one might assume that the probability of re-use of material from the authors' own previous work should be higher, and the tendency to criticise lower.</Paragraph> </Section> </Section> </Paper>
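<!--
The self-citation check described in Section 3.3 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalise` and `is_self_citation`, the case-insensitive exact match on last names, and the example names are all hypothetical. It assumes the cited last names are taken from the reference list entry, so that authors hidden behind an et al. in running text are still available.

```python
from typing import Iterable


def normalise(last_name: str) -> str:
    """Lower-case and trim a last name so comparison ignores case and padding."""
    return last_name.strip().lower()


def is_self_citation(paper_authors: Iterable[str], cited_authors: Iterable[str]) -> bool:
    """Return True if the citing paper and the cited work share a last name.

    Assumption (hypothetical): cited_authors comes from the reference list
    entry, so it holds every cited author even when the running text says et al.
    """
    citing = {normalise(name) for name in paper_authors}
    cited = {normalise(name) for name in cited_authors}
    # A shared last name means the two sets are not disjoint.
    return not citing.isdisjoint(cited)


# Hypothetical example names, for illustration only.
shared = is_self_citation(["Smith", "Jones"], ["Lee", "Garcia", "jones"])
unrelated = is_self_citation(["Smith", "Jones"], ["Lee", "Garcia"])
```

Matching on last names alone may flag unrelated authors who share a common surname; a fuller version might also compare initials, which this sketch leaves out.
-->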