<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1205"> <Title>Zone Identification in Biology Articles as a Basis for Information Extraction</Title> <Section position="3" start_page="0" end_page="30" type="intro"> <SectionTitle> 2 Overview of the framework </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="29" type="sub_section"> <SectionTitle> 2.1 The need for zone identification (ZI) </SectionTitle> <Paragraph position="0"> Below we discuss the critical issues in bioNLP involved in pinpointing and organizing factual information, and show how ZI can be applied.</Paragraph> <Paragraph position="1"> First, articles present information with various rhetorical statuses (e.g. new vs. old results; own vs. previous work). Current IE relies on surface lexical and syntactic patterns and neglects the rhetorical status of information. Thus, we are in danger of extracting old results mixed with new ones.</Paragraph> <Paragraph position="2"> (1) Recent data suggest that ... ~ is involved in DPC removal in mammalian cells (ref.), ... ...The data presented here suggest that ... The data in (1) contain statements with different rhetorical statuses (boldfaced by us). Preprocessing the text in terms of such information helps filter out old results (i.e. the first statement).</Paragraph> <Paragraph position="3"> Secondly, bioNLP has so far largely been limited to abstracts. But arguably, the final goal should be full texts, given that they are a much richer source of information and increasingly easy to access (e.g. open access to collections such as PUBMED-central; online journals such as EMBO, PNAS, and JCB). This involves exploring new techniques because full texts differ from abstracts in some essential ways. Among other things, full texts present much more complex sentence structure and vocabulary (e.g. inserted phrases, embedded sentences, nominalization of verbs, more anaphoric expressions). 
Thus, we expect that analyzing the whole text requires a much more complex set of patterns and algorithms, which makes errors more likely. A solution to this problem is to identify the subset of the article relevant to the analysis at issue. For example, in order to extract certain kinds of biological interactions found by the author, we could skip statements about previous work such as those found in the Introduction section. Thirdly, experimental results make sense only in relation to the experimental goal and procedure. Also, there is usually a sequence of experiments, each of which yields complex results. Therefore, it is important to extract a set of experimental results in an organized manner. This also helps identify the referents of demonstratives (e.g. this) and pronouns (e.g. it).</Paragraph> <Paragraph position="4"> From these points of view, ZI in articles plays an essential role in extracting factual information of different sorts from different zone classes.</Paragraph> </Section> <Section position="2" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 2.2 Characteristics of the framework </SectionTitle> <Paragraph position="0"> The idea underlying ZI in our sense contrasts with other notions based on discourse relations (e.g.</Paragraph> <Paragraph position="1"> Mann et al., 1987; Kando, 1999; van Dijk, 1980); we focus on the global type of information. For example, in our ZI, reference to previous work as background information remains as such whether it is supported or refuted by the author later in the article, whereas this difference plays an essential role in discourse relations-based analyses.</Paragraph> <Paragraph position="2"> A. Koike (at AVIRG 2004) reported that to extract the interactions between two biological elements from PUBMED abstracts, about 400 patterns were necessary. The larger picture we have consists of two levels: 1) ZI, and 2a) analysis of zone interactions (e.g. 
discourse relations), or 2b) analysis of specific zones (i.e. extraction of biological interactions). In this paper we focus on the first step.</Paragraph> </Section> <Section position="3" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 2.3 Annotation scheme </SectionTitle> <Paragraph position="0"> Our annotation scheme was proposed in (Mizuta et al., 2004) and is based on Teufel et al.'s (2002) scheme.</Paragraph> <Paragraph position="1"> Three major modifications are made: 1) a fine-grained OWN class based on the model of an experimental procedure which we identified across journals, 2) CNN and DFF classes to cover the relations between data/findings, and 3) nested annotation. The set of zone classes is as follows: * BKG (Background): given information (reference to previous work or a generally accepted fact) * PBM (Problem-setting): the problem to be solved; the goal of the present work/paper.</Paragraph> <Paragraph position="2"> * OTL (Outline): a characterization/summary of the content of the paper.</Paragraph> <Paragraph position="3"> * TXT (Textual): section organization of the paper (e.g. &quot;Section 3 describes our method&quot;). * OWN: the author's own work: * MTH (Method): experimental procedure; * RSL (Result): the results of the experiment; * INS (Insight): the author's insights and findings obtained from experimental results (including the interpretation) or from previous work * IMP (Implication): the implications of experimental results (e.g. conjectures, assessment, applications, future work) or those of previous work * ELS (Else): anything else within OWN.</Paragraph> <Paragraph position="4"> * CNN (Connection): correlation or consistency between data and/or findings.</Paragraph> <Paragraph position="5"> * DFF (Difference): a contrast or inconsistency between data and/or findings.</Paragraph> <Paragraph position="6"> The basic annotation unit is a sentence, but in some cases it may be a phrase. 
To handle units that fit into multiple zones, we employ 2-level annotation. Empirical analysis indicates that even though zone classes are conceptually non-overlapping, an annotation unit may fit into multiple classes. That is, a linguistic unit (e.g. a sentence) may well represent complex concepts. Therefore, we consider that nested annotation is necessary, even though it complicates annotation. 3 Zone identification -1: Main features of each zone Based on our sample annotation of full texts, we discuss the major features extracted from the data of each zone class. Complex cases and the location of zones will be discussed in later sections.</Paragraph> </Section> <Section position="4" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 3.1 BACKGROUND (BKG) </SectionTitle> <Paragraph position="0"> (1) In cells, DNA is tightly associated with ...</Paragraph> <Paragraph position="1"> (2) Ref. suggested/suggests that ~ (3) A wide variety of restriction-modification (R-M) systems have been discovered ....</Paragraph> <Paragraph position="2"> BKG has three tense variations: 1) simple present for a generic statement about background information (e.g. biological facts; reference to previous work), 2) simple past, and 3) present perfect, to mention the current relevance of previous work. A wider range of verbs is used to cover both biological and bibliographical facts. 
A citation in sentence-final position whose scope is the whole sentence signals BKG, but an intra-sentential citation with a smaller scope does not.</Paragraph> </Section> <Section position="5" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 3.2 PROBLEM SETTING (PBM) </SectionTitle> <Paragraph position="0"> There are two types of PBM.</Paragraph> <Paragraph position="1"> (2) X has not been established/addressed; there has been no study on X; little is currently known about ~; there are very limited data concerning X; X remains unclear. The first type, illustrated above, is observed in the I-section; it addresses the problem to be solved. It has a 'negative polarity' in that it mentions something missing in the current situation (e.g. knowledge, study, a research question). It contains vocabulary expressing negation or incompleteness (boldfaced). Tense variation is either simple present or present perfect, depending on the temporal interval referred to. The range of verbs used has not been analyzed yet.</Paragraph> <Paragraph position="2"> (3) To test {whether ~ / this hypothesis/...}; To evaluate X; To address the question of X. The second type of PBM is observed in the R-section. As illustrated in (3), it corresponds to a to-phrase appearing sentence-initially or sentence-finally. It is combined with a description of experimental procedure, as illustrated in (8).</Paragraph> <Paragraph position="3"> The two types of PBM are both related to a goal description. The first type concerns the whole work and the second type its subset (i.e. an experiment).</Paragraph> </Section> <Section position="6" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 3.3 OUTLINE (OTL) </SectionTitle> <Paragraph position="0"> (4) We report here the results of experiments.... In brief, we have asked, ... To address the first question, we utilized ... We found ... Together, these results not only confirm that .... but also that... 
(End of the I-section) In what follows, I-, M-, R-, and D-section stand for the Introduction, Method and Materials, Results, and Discussion sections, respectively. OTL provides a concise characterization of (or an 'excerpt' from) the work, as an abstract does. (5) [Introduction Body Conclusion] full-text article The rhetorical scheme of the whole article is analyzed as (5). OTL has as its scope &quot;Body&quot;, and thus it is expected to appear either in Introduction or Conclusion. This is consistent with our investigation. Tense choices are between simple present and future (in Introduction), and between present perfect and simple past (in Conclusion).</Paragraph> <Paragraph position="1"> The first element of (4) signals the beginning of an OTL zone. By itself it would fit into AIM (of the paper) employed in (Teufel et al., 2002). It contains certain kinds of linguistic signals such as: (6) Indexicals: e.g. in this paper; in the present study; here. 'Reporting verbs' or verbs for presentation: e.g. we show/demonstrate/present/report. However, OTL consists of a wider range of sentences. As illustrated in (4), OTL also contains those elements which provide information relevant to other zones (e.g. PBM, MTH and RSL). We consider that the whole sequence of sentences in (4) deserves an independent class from both theoretical and practical perspectives. That is, it is embedded in a reporting context, and provides abstract-like information. Thus, we propose OTL.</Paragraph> </Section> <Section position="7" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 3.4 TEXTUAL (TXT) </SectionTitle> <Paragraph position="0"> TXT zones were not observed in our sample.</Paragraph> <Paragraph position="1"> This makes sense because the journals investigated provide a rigid section format. However, we retain this class for future application to other journals which may provide a more flexible section format.</Paragraph> </Section> </Section> </Paper>