<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1205"> <Title>Zone Identification in Biology Articles as a Basis for Information Extraction</Title> <Section position="4" start_page="30" end_page="31" type="metho"> <SectionTitle> 3.5 METHOD (MTH) </SectionTitle> <Paragraph position="0"> (7) we performed X, using ...; we exploited the presence of ~; we utilized sucrose-gradient fractionation; X was normalized. MTH takes the form of an event description in the past tense, using matrix verbs expressing the experimental procedure (e.g. perform, examine, use, collect, purify). Either a passive or an active form (with we as its semantic subject) is used.</Paragraph> <Paragraph position="1"> We observed that a paragraph in the R-section starts with a combination of PBM and MTH as illustrated in (8). It is much more common for PBM to come first. This can be explained in terms of 'iconicity', the phenomenon that the conceptual and/or real-world ordering of elements is often reflected in linguistic expressions. In (8), the PBM portion (to-phrase) is preposed, conforming to the fact that the author first had the experimental goal. (For a more comprehensive set of expressions, see, for example, Swales (1990) and Teufel et al. (2002).)</Paragraph> </Section> <Section position="5" start_page="31" end_page="31" type="metho"> <SectionTitle> 3.6 RESULT (RSL) </SectionTitle> <Paragraph position="0"> (9) the distribution of ~ was shifted from ...; no significant change was seen; cells ... demonstrated an enrichment in ~. RSL usually describes an event in the past tense, as MTH does, using a certain set of verbs expressing: 1) phenomena (e.g. represent, show and demonstrate, having as its subject the material used), 2) observations (e.g. observe, recognize and see, having we as its subject, or in the passive form), or 3) biological processes (e.g. mutate, translate, express, often in the passive form). (10) the distribution of ~ is shifted from ...; 
no significant change is seen; cells devoid of Scp160p demonstrate ~; ~ are presented in Table 2.</Paragraph> <Paragraph position="1"> As illustrated above, RSL, unlike MTH, may also be written in the present tense to create a context in which the author observes and presents the results in real time, referring to figures.</Paragraph> <Paragraph position="2"> In the R-section, RSL zones were observed to follow MTH with no discourse connectives.</Paragraph> <Paragraph position="3"> However, the boundary was rather easy to identify, by virtue of the cause-effect relation identified between the two zones. Specifically, matrix verbs used in these zones played a critical role; some of them present a rather complementary distribution. This feature is useful for machine learning too.</Paragraph> <Paragraph position="4"> MTH and RSL may be combined by resulted in: (11) [Parallel ... transcription reactions using...] MTH resulted in [... strong smears. ]</Paragraph> </Section> <Section position="6" start_page="31" end_page="32" type="metho"> <SectionTitle> RSL </SectionTitle> <Paragraph position="0"> However, result in is usually observed in relating biological events, and the above usage relating a method and results is uncommon. Also, the explicit use of result(s) as below is uncommon: (12) The results, ......, were striking. First, ...</Paragraph> <Paragraph position="1"> Given these observations, keyword searches using result(s) do not work for the purpose of identifying experimental results. In contrast, RSL zones can be identified using features such as matrix verbs and location. Thus, annotating RSL zones is important. (13) Interestingly/ Surprisingly/ Noticeably/..., In an RSL zone, emphatic expressions as in (13) may be used, often sentence-initially, to call the reader's attention. The adjective version (e.g. 
striking in (12)) is also used.</Paragraph> <Paragraph position="2"> The occurrences of matrix verbs in MTH/RSL zones in our sample were: perform 38/2, use 181/12, collect 10/1, purify 23/2, observe 1/43, reduce 1/15, affect 1/15, associate 6/25. However, some verbs had a rather neutral distribution (e.g. detect 11/13, follow 26/8). Such cases require the use of other features too, as we will discuss later on. We have identified three major patterns for INS. The examples below illustrate the first pattern: (14) [As can be seen in Figure 2C, ... was not significantly different compared with that in These are conventionalized forms which the author uses in stating his/her interpretation of the results with respect to a biological process behind the observed results. A generalization is: (16) X indicate Y (a variant: X, indicating Y). X: results/experiments/studies; Y: biological statement or model. Verb variations from our sample: indicate/suggest/demonstrate/represent/reveal. The second pattern is a sentence using the verb seem/appear or consider, such as: (17) X seem/appear to V (It seems/appears that ~); X is considered to V. The third pattern is the use of confirm/support: (18) This was confirmed, as shown in Figure 3. Here, this refers to the author's hypothesis. Although (18) refers to a figure which shows the result, the sentence does not fit into RSL but into INS. We consider that it describes the author's interpretation of the result and that the hypothesis is now licensed as an insight. A generalization is: (19) X confirm that Y; Y was confirmed.</Paragraph> <Paragraph position="3"> X: results/experiments/studies; Y: proposition (hypothesis or prediction).</Paragraph> <Paragraph position="4"> As we will discuss later, confirm also signals CNN, relating two things (X and Y). 
Therefore, it triggers a nested annotation for INS and CNN.</Paragraph> <Section position="1" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 3.7 IMPLICATION (IMP) </SectionTitle> <Paragraph position="0"> The IMP class is used as a cover category for the author's 'weaker' insights from experimental results and for other kinds of implication of the work (e.g. assessment, applications, future work).</Paragraph> <Paragraph position="1"> (20) Fusion of ...of type III enzymes, ..., would result in type IIG enzymes...</Paragraph> <Paragraph position="2"> (21) We speculate that as ~ lose ..., ~ increases. 'Weaker' insights (vs. 'regular' insights fitting into INS) are signaled by: 1) modal expressions (e.g. could, may, might, be possible, one possibility is that) and 2) verbs related to conjecture (e.g. speculate, hypothesize), as in the examples above. (In our data, suggest occurred mainly in INS (63%) and BKG (23%), and indicate in INS (55%), RSL (20%) and MTH (10%). This means that these verbs strongly signal INS but other features are also needed for ZI (e.g. location, zone sequence, and the subject of the verb).) (22) These data are significant because ...</Paragraph> <Paragraph position="3"> (23) This approach has the potential to increase ... (24) ~ provides structural insights into ~. Assessment is signaled by weak linguistic clues as illustrated in (22) - (24) above.</Paragraph> <Paragraph position="4"> (25) Potential targets also remain to be studied; we do not yet know (26) Further experiments will focus on ~; a future study/work/challenge...</Paragraph> <Paragraph position="5"> Taken out of context, an IMP zone mentioning future work looks very similar to PBM, as in (25), unless it contains key words such as future and further, as in (26). 
The critical feature for the distinction between them is the section in which they appear.</Paragraph> </Section> <Section position="2" start_page="32" end_page="32" type="sub_section"> <SectionTitle> 3.8 ELSE (ELS) </SectionTitle> <Paragraph position="0"> We found only a few cases of ELS in our data.</Paragraph> <Paragraph position="1"> The following is an example (a naming statement).</Paragraph> <Paragraph position="2"> (27) ..., we refer to this gene as gip-1 and ~ as ...</Paragraph> <Paragraph position="3"> The scarcity of ELS zones in our data indicates that the domain of experimental biology has a more established methodology and that the focus is on the experiments and the findings obtained. In other domains where the methodology is less standardized (e.g. computer science), we would expect some essential cases fitting into ELS (e.g.</Paragraph> <Paragraph position="4"> the author's proposal and invention) and thus further elaboration of classes would be needed.</Paragraph> <Paragraph position="5"> As in (28) and (29), DFF is signaled by a limited set of vocabulary (mainly, different and contrast and their variants). Also, as illustrated above, DFF often overlaps with other classes (e.g. INS, IMP, RSL), and therefore involves nested annotation.</Paragraph> </Section> </Section> <Section position="7" start_page="32" end_page="32" type="metho"> <SectionTitle> 3.10 CONNECTION (CNN) </SectionTitle> <Paragraph position="0"> (30) This conservation further supports their putative regulatory role in exon skipping.</Paragraph> <Paragraph position="1"> (31) this peroxide treatment experiment was consistent with previous data (32) The results also confirm the recent discovery of MntH ... (ref).</Paragraph> <Paragraph position="2"> (33) This conclusion was supported not only by ... but also by ...</Paragraph> <Paragraph position="3"> The CNN class covers statements mentioning consistency (i.e. some sort of positive relation) between data/findings. 
A generalization is: (34) X is consistent with Y; X conform to Y; X is {similar to/ same as} Y; X support Y. X/Y: previous work, the author's observation, model, hypothesis, insight, etc.</Paragraph> <Paragraph position="4"> (35) X. Similarly, Y. (X/Y: a proposition) The specific relation mentioned varies (e.g. correlation or similarity; support for the author's own or others' data/idea/findings). Interestingly, we observed more CNN zones than DFF zones in our sample (Mizuta et al., 2004), and we consider that this is not accidental; this asymmetry indicates that biologists put more focus on correlation between two elements.</Paragraph> </Section> <Section position="8" start_page="32" end_page="34" type="metho"> <SectionTitle> 4 Zone identification -2: elaboration </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="32" end_page="33" type="sub_section"> <SectionTitle> 4.1 Nested zones for complex concepts </SectionTitle> <Paragraph position="0"> The following examples illustrate complex zones motivating nested annotation: (36) [ [Similar DNA links were also observed in the complexes with ... (ref.), which show structural difference from the previous proposal is...] DFF Sentence (36) provides a result and compares it with other results (boldfaced). Thus, the sentence fits into RSL and CNN simultaneously; it is a case of combined zones, conceptually distinct from indeterminacy between two zones. Sentence (37) illustrates an example of nested zones. The first two sentences fit into BKG and INS respectively. Also, they contrast with each other with respect to the role which zinc is claimed to play, and thus deserve DFF annotation as a whole (but there is no explicit clue at this point). The key word in the third sentence, another difference (boldfaced), licenses the sentence as DFF and also indicates an element referring to a difference already mentioned. 
Accordingly, the first two sentences will be annotated for DFF.</Paragraph> <Paragraph position="1"> Precisely speaking, combined zones and nested zones are not identical. But we treat combined zones as a special case of nesting, as two zones having the same scope and an arbitrary ordering. Importantly, nested zones (in a wider sense) are conceptually distinct from ambiguity between two zones; the sentences simultaneously fit into multiple zones. In fact, in our sample, most CNN and DFF zones overlap with another zone such as INS and IMP. Since CNN and DFF zones are important for our purpose, we consider that nested annotation is necessary.</Paragraph> <Paragraph position="2"> DFF and CNN classes cover a wide range of relations between data and findings.</Paragraph> <Paragraph position="3"> This insight was checked with a biologist. This asymmetry also suggests the essential difference between the biology and the computer science domains. In the scheme by Teufel et al. (2002) focusing on computer science articles, CONTRAST seems to be more important than BASIS.</Paragraph> </Section> <Section position="2" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 4.2 Controversial cases </SectionTitle> <Paragraph position="0"> (38) However, it was not evident whether DPCs composed of ... were ... or proteolytic degradation was involved in the process.</Paragraph> <Paragraph position="1"> A PBM zone (in I-section) and an IMP zone describing future work (or limitations) often look very similar on the surface, as illustrated in (38), which is the last sentence in the article describing the limitation of the work presented. A critical feature is the location; PBM in this use is located in the I-section, whereas IMP is located in other sections.</Paragraph> <Paragraph position="2"> A PBM zone in I-section (e.g. X remains unclear) is considered to be a subset of a larger BKG zone when the problem mentioned is a generally accepted fact. 
However, we chose to avoid nested annotation in this case, because 1) the situation above is rather common, and yet 2) we recognize the significance of the PBM zone in its own right. If a single sentence consists of a clause fitting into BKG and another fitting into PBM, it results in a complex annotation. That is, we annotate the sentence as both BKG and PBM.</Paragraph> <Paragraph position="3"> 5 Zone identification -3: location We now analyze the zones appearing in each section and their sequence, to try to describe the locations where a specific zone class may appear.</Paragraph> <Paragraph position="4"> The section organization of the sample articles is mapped onto the scheme shown in (5) as follows: In what follows, I, M, R, and D stand for the corresponding sections.</Paragraph> </Section> <Section position="3" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 5.1 I-section and M-section </SectionTitle> <Paragraph position="0"> Common to all sample articles, the I-section consists of a large number of BKG zones with a few PBM zones inserted among them, followed by an OTL zone. The OTL zone may or may not constitute a separate paragraph.</Paragraph> <Paragraph position="1"> The M-section focuses on methodological details, and thus consists of MTH zones, possibly with a negligible number of other zones (e.g. BKG, INS).</Paragraph> </Section> <Section position="4" start_page="33" end_page="33" type="sub_section"> <SectionTitle> 5.2 R-section </SectionTitle> <Paragraph position="0"> The R-section consists of 'problem-solving' units following the experimental procedure. 
The main elements of each unit are PBM, MTH, and RSL zones, which are often then followed by an</Paragraph> </Section> <Section position="5" start_page="33" end_page="34" type="sub_section"> <SectionTitle> 5.3 D-section </SectionTitle> <Paragraph position="0"> The D-section is much more complex and flexible, but some generalization is possible.</Paragraph> <Paragraph position="1"> First, the essential components of the D-section, both quantitatively and qualitatively, are INS and IMP zones. This indicates that the focus of the D-section is on obtaining deeper insights. In contrast with the zone sequence in the R-section, INS and IMP often precede, or even lack, RSL and BKG zones related to them. A closer look at examples explains the apparent lack of RSL/BKG: (43) The data within this report demonstrate... (44) As for the C-rich element, its comparison with the PTB binding motif has shown that these are different motifs.</Paragraph> <Paragraph position="2"> (45) Similarly, the failure of ... protein (Fig. 7) suggests that...</Paragraph> <Paragraph position="3"> The italicized elements in (43) - (45) would fit into RSL or MTH, but are constituents too small to be annotated. As a result, only the whole sentence gets annotated as INS. A similar tendency holds also for BKG (e.g. since-clause), but less frequently. We may consider extracting these cases in future work. The D-section usually ends with OTL (summary) or IMP (assessment or future work). 6 Zone identification using multiple features Table 1 illustrates multiple features contributing to ZI, as we identified them through our manual annotation. We observed that certain pairs of zone classes present a similar distribution of key features, with the same primary feature, and that BKG lacks a key feature, indicating its neutral nature. Using multiple features is critical in ZI. We intend to improve our insight shown here through quantitative analysis (cf. fn. 3 and 4). 
This will in turn help determine the right set of features and their relative priority to be used in machine learning. Explanatory notes on the priority of features: : primary feature (with specific clues); : major feature; : secondary feature; x: negative feature; -: non-/less informative</Paragraph> <Paragraph position="4"> We observe that these paragraph-initial zones trigger the PBM zone. For example, this in (41) refers to the preceding IMP zone, and the RSL in (42) mentions the results of a preceding experiment.</Paragraph> </Section> </Section> </Paper>