<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0304"> <Title>Parallel Entity and Treebank Annotation</Title> <Section position="3" start_page="21" end_page="22" type="metho"> <SectionTitle> 2 Guidelines for Entity Annotation </SectionTitle> <Paragraph position="0"> Here we summarize the main features of our annotation guidelines. We have been influenced in this by the annotation guidelines for the Automatic Content Extraction (ACE) project (Consortium, 2004). (Another source of influence is previous work in annotation for biomedical information extraction, such as (Ohta et al., 2002). Space prevents adequate discussion here of the differences.)</Paragraph> <Paragraph position="1"> However, our source materials are medical abstracts from PubMed (http://www.ncbi.nlm.nih.gov/entrez/), and important differences between the domains have required significant changes and additions to many definitions, guidelines, and procedures.</Paragraph> <Paragraph position="2"> Most obviously, the vocabulary is very different. Many of the tokens in our source texts are chemical terms with a complex productive morphology, and a certain number are unique in PubMed. Many others are strings of notation, like S37F, often containing relevant entity references that must be isolated (S, 37, and F). Even apart from these, we are looking at a very different dialect of English from that used by the Wall Street Journal and the Associated Press. Annotation of English newswire requires native English competency; entity annotation of biomedical English requires a background in biology as well.</Paragraph> <Paragraph position="3"> The entity instances in the text are also qualitatively different. Instead of individual pieces of the physical or social universe - Emanuel Sosa, the Eiffel Tower, the man in the yellow hat - we have abstractions, categories that are not to be confused with their instantiations: neuroblastoma, K-ras (a gene), codon 42.</Paragraph>
<Paragraph position="4"> We are not currently annotating pronominal or other forms of coreference.</Paragraph> <Section position="1" start_page="21" end_page="22" type="sub_section"> <SectionTitle> 2.1 Entities Annotated </SectionTitle> <Paragraph position="0"> For the purposes of this project, the definition of &quot;Gene Entity&quot; has two significant characteristics. First, as just mentioned, &quot;Gene&quot; refers to a conceptual entity as opposed to the specific manifestation of a gene (e.g., not the &quot;K-ras&quot; in some specific cell in some individual, but an abstraction that cannot be pointed to).</Paragraph> <Paragraph position="1"> Second, &quot;Gene&quot; refers to a composite entity as opposed to the strict biological definition. There are often ambiguities in the usage of the entity names. It is sometimes unclear whether the gene or the protein is being referenced, and the same name can refer to the gene or the protein at different locations in the same document. Just as the ACE project allows &quot;geopolitical&quot; entities to have different roles, such as &quot;location&quot; or &quot;organization&quot;, we consider a &quot;Gene&quot; to be a composite entity that can have different roles throughout a document.
Therefore, Gene entity mentions can have types Gene-generic, Gene-protein, and Gene-RNA. (This domain shows no such clear distinction between Name and Nominal mentions as in the texts covered by ACE.)</Paragraph> <Paragraph position="2"> As mentioned in the introduction, Variation events are relations between entities representing different aspects of a Variation; specifically, a Variation is a relationship between two or more of the following entities: Type (e.g., point mutation, translocation, or inversion), Location (e.g., codon 14, 1p36.1, or base pair 278), Original-State and Altered-State (e.g., Thymine).</Paragraph> <Paragraph position="3"> The entities as such are independent and unconnected. We add a level of relation to annotate the associations between them. For example, the text fragment a single nucleotide substitution at codon 249, predicting a serine to cysteine amino acid substitution (S249C) contains the entities:</Paragraph> <Paragraph position="4"> Variation-state-original: serine; Variation-state-altered: cysteine. These entities are annotated individually but are also collected into a single Variation relation.</Paragraph> <Paragraph position="5"> It is also possible for a Variation relation to arise from a more compact collection of entities. For example, the text S249C consists of three entities (S, 249, and C) collected into a Variation relation. The four components (type, location, original state, and altered state) represent the key elements necessary to describe any genomic variation event. Variations are often underspecified in the literature. For example, the first relation above has all four components while the second is missing the Variation-type.
Characterizing individual Variations as relations among such components provides us with a great deal of flexibility.</Paragraph> <Paragraph position="6"> The &quot;Gene&quot; entities are analogous to the ACE geopolitical entity, in that the second part of the entity names (&quot;-RNA&quot;, &quot;-generic&quot;, &quot;-protein&quot;) disambiguates the metonymy of the &quot;Gene&quot;. The subtypes of the Variation entities, in contrast, indicate different kinds of entities in their own right, which can also function as components of a Variation relation.</Paragraph> <Paragraph position="7"> The Malignancy annotation guidelines were under development during the annotation of the corpus described here. While they have since been more completely defined, they are not included as part of the annotated files discussed here, and so are not further discussed in this paper.</Paragraph> </Section> <Section position="2" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 2.2 Discontinuous Entities </SectionTitle> <Paragraph position="0"> We have introduced a mechanism we call &quot;chaining&quot; to annotate discontinuous entities, which may be more common in abstracts than in full text because of the pressure to reduce word count. For example, in K- and N-ras there are two entities, K-ras and N-ras, of which only the second is a solid block of text. Our entity annotators are allowed to change the tokenization if necessary to isolate the components of K-ras: 1. K- ... ras (chain with separated tokens) 2. N-ras (contiguous tokens)</Paragraph> </Section> <Section position="3" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 2.3 Entity Frequencies </SectionTitle> <Paragraph position="0"> Table 1 shows the number of instances of each of the entity types in the 318 abstracts, discussed further in Section 4, that have been both entity annotated and treebanked.
We separate the entities into single-token and multiple-token categories since it is only the multiple-token categories that raise an issue for mapping constituents.</Paragraph> </Section> </Section> <Section position="4" start_page="22" end_page="23" type="metho"> <SectionTitle> 3 Treebank Annotation </SectionTitle> <Paragraph position="0"> The Penn Treebank II guidelines (Bies et al., 1995) were followed as closely as possible, but the nature of the biomedical corpus has made some changes necessary or desirable. We have also taken this opportunity to address several long-standing issues with the original set of guidelines, with regard to NP structure in particular. This has resulted in the introduction of one new node label for sub-NP nominal substrings (NML). One additional empty category (*P*) has been introduced in order to improve the match-up of chained entity categories with treebank nodes. It is used as a placeholder to represent distributed modification in nominals and does not represent the trace of movement.</Paragraph> <Section position="1" start_page="23" end_page="23" type="sub_section"> <SectionTitle> 3.1 Tokenization/Part-of-Speech </SectionTitle> <Paragraph position="0"> We have also adopted several changes in word-level tokenization, leading to a number of part-of-speech and structural differences as well. Many hyphenated words are now treated as separate tokens (New York - based would be four tokens, for example). These hyphens now have the part-of-speech tag HYPH. If the separated prefix is a morphological unit that does not exist as a free-standing word, it has the part-of-speech tag AFX. 
With chemical names and scientific notation in the biomedical corpus in particular, spaces and punctuation may occur within a single &quot;token&quot;, which will have a single POS tag.</Paragraph> </Section> <Section position="2" start_page="23" end_page="23" type="sub_section"> <SectionTitle> 3.2 Right-Branching Default </SectionTitle> <Paragraph position="0"> We assume a default binary right-branching structure under any NP and NML node. Each daughter of the phrase (whether a single token or itself a constituent node) is assumed to have scope over everything to its right. This means that every daughter also forms a constituent with everything to its right.</Paragraph> <Paragraph position="1"> This assumption makes the annotation process for multi-token nominals less complex and the resulting trees more legible, but still allows us to readily derive constituent nodes not explicitly represented. For example, in</Paragraph> <Paragraph position="3"> we assume that &quot;liver cancer&quot; is a constituent, and that &quot;primary&quot; has scope over it.</Paragraph> <Paragraph position="4"> So, although we do not show the intermediate nodes explicitly in our annotation, our assumed structure for this NP could be derived as</Paragraph> <Paragraph position="6"> As discussed in Section 5, entities sometimes map to such implicit constituents, and a node needs to be added to make the constituent explicit so that the entity can be mapped to it.</Paragraph> </Section> <Section position="3" start_page="23" end_page="23" type="sub_section"> <SectionTitle> 3.3 New Node Level for Non-Right-Branching: NML </SectionTitle> <Paragraph position="0"> We use the NML node label to mark nominal sub-constituents that do not follow the default binary right-branching structure.
Any two or more non-final elements that form a constituent are bound together by NML.</Paragraph> <Paragraph position="2"/> </Section> <Section position="4" start_page="23" end_page="23" type="sub_section"> <SectionTitle> 3.4 New Empty Category for Distributed Readings within NP: *P* </SectionTitle> <Paragraph position="0"> As discussed in Section 2.2, discontinuous entities are annotated using the &quot;chaining&quot; mechanism. Analogously, we have introduced a placeholder, *P*, for distributed material in the treebank. It is used exclusively in coordinated nominal structures, placed in coordinated elements that are missing either a distributed head or a distributed premodifier. In K- and N-ras, the coordinated premodifier K- is missing the distributed head ras, so the placeholder *P* is inserted after K- and coindexed with ras:</Paragraph> <Paragraph position="2"> This creates constituent nodes K-ras and N-ras that align with the entities being represented by chaining. (In spite of the apparent similarity between *P* and right node raising structures (*RNR*), they are not interchangeable, as the shared element often occurs to the left rather than the right, e.g., codon 12 or 13 in Section 5.3.)</Paragraph> </Section> </Section> <Section position="5" start_page="23" end_page="25" type="metho"> <SectionTitle> 4 Annotation Process </SectionTitle> <Paragraph position="0"> The annotation process comprises the following steps: paragraph and sentence annotation (including the delimitation of irrelevant text such as author names); tokenization; entity annotation; part-of-speech (POS) annotation; treebanking; merged representation.</Paragraph> <Paragraph position="1"> Entity annotation precedes POS annotation, since the entity annotators often have to correct the tokenization, which affects the POS labels.
For example, nephro- and hepatocarcinoma refers to two entities, nephrocarcinoma and hepatocarcinoma, and so the entity annotator would split hepatocarcinoma into two tokens, for chaining nephro and carcinoma (see Section 2.2).</Paragraph> <Paragraph position="2"> Since the entity annotators are not qualified for POS annotation, doing POS annotation after entity annotation allows the POS annotators to annotate any such tokenization changes.</Paragraph> <Paragraph position="3"> Treebank annotation uses the same tokenization as the corresponding entity file. Continuing the above example, the treebank file would have separate tokens for hepato and carcinoma. Note that this would be the case even if we did not have the goal of mapping entities to constituents. It arises from the more minimal requirement of maintaining identical tokenization in the treebank and entity files, and so leads to changes in treebank annotation such as those discussed in Section 3.4.</Paragraph> <Paragraph position="4"> All of the annotation steps except entity annotation use automated taggers (or a parser in the case of treebanking), producing annotation that then gets hand-corrected.</Paragraph> <Paragraph position="5"> The use of the parser to produce a parse for correction by the treebankers includes a somewhat unusual feature that arises from our parallel entity and treebank annotation. The parser that we are using (Bikel, 2004) allows prebracketing of parts of the parser input, so that the parser will respect the prebracketing. We use this ability to prebracket entities, which can also help to disambiguate the constituencies for prenominal modifiers, which can often be unclear for annotators without a medical background.
For example, the input to the parser might contain something like:</Paragraph> <Paragraph position="7"> indicating by the (* ) that tyrosine kinase should be a constituent. (It is a Gene-protein.) Our first release of data, PennBioIE Release 0.9 (http://bioie.ldc.upenn.edu/publications), contains 1157 oncology PubMed abstracts, all annotated for entities and POS, of which 318 have also been treebanked. The website also contains full documentation for the various annotation guidelines mentioned in this paper.</Paragraph> <Section position="1" start_page="24" end_page="25" type="sub_section"> <SectionTitle> 4.1 Example of Merged Output </SectionTitle> <Paragraph position="0"> The 318 files that have been both treebanked and entity annotated are also available in a merged &quot;.mrg&quot; format. The treebank and entity annotations are both stand-off, referring to character spans in the same source file, and we take advantage of this so that the merged representation relates the entities and constituents by these spans. Figure 1 shows a fragment of one such .mrg file.</Paragraph> <Paragraph position="1"> This .mrg file excerpt shows the text of sentence 4 in the file, which spans the character offsets 331..605. Each entity is listed by span (which can include several tokens), entity type, and the text of the entity.
The treebank part has the same basic format as the .mrg files from the Penn Treebank, except that each terminal has the format (POSTag:[from..to] terminal) where [from..to] is that terminal's span in the source file.</Paragraph> <Paragraph position="2"> The first entity listed, K-ras, is a Gene-RNA entity with span [373..378], which corresponds to the single token: (NN:[373..378] K-ras) The second entity, exon 2, is a Variation-location with span [379..385], which corresponds to the two tokens:</Paragraph> <Paragraph position="4"> The third entity, point mutations, is a Variation-type with span [386..401], which corresponds to the two tokens:</Paragraph> <Paragraph position="6"> By including the terminal span information in the treebank, we make explicit how the tokens that make up the entities are treated in the treebank representation.</Paragraph> </Section> </Section> <Section position="6" start_page="25" end_page="27" type="metho"> <SectionTitle> 5 Entity-Constituent Mapping </SectionTitle> <Paragraph position="0"> One of our goals for the release of the corpus is to allow users to choose how they wish to handle the integration of the entity and treebank information.</Paragraph> <Paragraph position="1"> By providing the corresponding spans for both aspects of the annotation, we provide the raw material for any integrated approach.</Paragraph> <Paragraph position="2"> We therefore do not attempt to force the entities and constituents to line up perfectly. However, given the parallel annotation just illustrated, we can analyze how close we come to the ideal of the entities behaving as semantic types on syntactic constituents.
</Paragraph> <Section position="1" start_page="25" end_page="26" type="sub_section"> <SectionTitle> 5.1 Mapping Categories </SectionTitle> <Paragraph position="0"> Leaving aside chains for the moment, we categorize each entity/treebank mapping in one of three ways: Exact match: There is a node in the tree that yields exactly the entity. For example, the entity exon 2 in Figure 1, listed as ;[379..385]:variation-location: &quot;exon 2&quot;, corresponds exactly to the NML node in Figure 1.</Paragraph> <Paragraph position="2"> Missing node: There is no node in the tree that yields exactly that entity, but it is possible to add a node to the tree that would yield the entity. A common reason for this is that the default right-branching treebank annotation (Section 3.2) does not make the required node explicit. For example, the entity point mutations in Figure 1 has no node that yields it exactly, but such a node can be added. Note that this node corresponds exactly to the implicit constituency assumed by the right-branching rule. For our own internal research purposes we have generated a version of the treebank with such nodes added, although they are not in the current release. Crossing: The most troublesome case, in which the entity does not match a node in the tree and also cuts across constituent boundaries, so it is not even possible to add a node yielding the entity.</Paragraph> </Section> <Section position="2" start_page="26" end_page="26" type="sub_section"> <SectionTitle> Token Instances </SectionTitle> <Paragraph position="0"> Typically this is due to an entity containing text corresponding to a prepositional phrase.
For example, the sentence One ER showed a G-to-T mutation in the second position of codon 12 has the entity [1280..1307]:variation-location: &quot;second position of codon 12&quot;. In the corresponding tree, the determiner is included in the NP the second position, but it is absent from the entity, which does include the following PP. It is therefore not possible to add a node to the tree yielding exactly second position of codon 12.</Paragraph> <Paragraph position="1"> The inclusion of the PP in an entity can be a problem for the constituent mapping even aside from the determiner issue. It is possible for the PP, such as of codon 12, to be followed by another PP, such as in K-ras. Since all PPs are attached at the same level, of codon 12 and in K-ras are sisters, and so, even if the determiner were included in the entity name, there is no constituent consisting of just the second position of codon 12. However, in that case it is then possible to add a node yielding the NP and first PP. A similar issue sometimes arises when attempting to relate Propbank arguments to tree constituents.</Paragraph> </Section> <Section position="3" start_page="26" end_page="26" type="sub_section"> <SectionTitle> Instances </SectionTitle> <Paragraph position="0"> It is possible to relax the requirements on exact match to include the determiner.</Paragraph> <Paragraph position="1"> However, one of our initial goals in this investigation was to determine whether this sort of limited crossing is indeed a major source of the mapping mismatches.</Paragraph> </Section> <Section position="4" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 5.2 Overall Mapping Results </SectionTitle> <Paragraph position="0"> Table 2 is a breakdown of how well the (non-chain) entities can be mapped to constituents.
Here we are concerned only with entities that consist of multiple tokens, since single-token entities can of course map directly to the relevant token.</Paragraph> <Paragraph position="1"> The number of crossing cases is relatively small.</Paragraph> <Paragraph position="2"> One reason for this is the use of relations for breaking potentially large entities into component parts, since the component entities either already map to a constituent or can easily be made to do so by making implicit constituents explicit to disambiguate the tree structure. The crossing cases tend to be ones in which the entities are in a sense a bit too &quot;big&quot;, such as including a prepositional phrase.</Paragraph> <Paragraph position="3"> Besides relaxing the exact-match requirement to include the determiner, another alternative would be to modify the treatment of noun phrases and determiners in the treebank annotation to be more akin to DPs. However, this has proved to be an impractical addition to the annotation process.</Paragraph> <Paragraph position="4"> As discussed in Section 4, we are prebracketing entities in the parses prepared for the treebankers to correct. There are two possibilities for how the entities can nevertheless cross treebank constituents: (1) the treebank annotation was done before we started doing such prebracketing, so the treebank annotator was not aware of the entities, or (2) the prebracketing was indeed done, but the treebank annotator could not abide by the resulting tree and modified the parser output accordingly.</Paragraph> </Section> <Section position="5" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 5.3 Chained Entities </SectionTitle> <Paragraph position="0"> Table 3 shows the matching status of multiple token instances that are also chains (and so were not included in Table 2). The presence of chains is mostly localized to certain entity types, and the mapping is mostly successful.
Variation-location contains many of the chains due to the occurrences of phrases such as codon 12 or 13, which map exactly to the corresponding use of the *P* placeholder. Cases that do not map exactly are ones in which the syntactic context does not permit the use of the placeholder *P*. For example, the text specific codons (12, 13, and 61) has three discontinuous entities (codons..12, codons..13, codons..61), but the parenthetical context does not permit using the placeholder. We have described here parallel syntactic and entity annotation and how changes in the guidelines facilitate a mapping between entities and syntactic constituents. Our main purpose in this paper has been to investigate the success of this mapping. As Tables 2 and 3 show, once we make explicit the implicit right-branching binary structure, only 6.2% of the entities cannot be mapped directly to a node in the tree (of 1410 total multiple-token entities, both chained and nonchained, 87 cannot be mapped: 55 crossing and 32 chained non-exact matches). It also appears likely that a significant percentage of even the non-matching cases can match as well, with a slight relaxation of the matching requirement (e.g., allowing entities to have an optional determiner).</Paragraph> <Paragraph position="1"> We view this in part as a successful experiment illustrating how both linguistic content and entity annotation can be enhanced by their interaction.</Paragraph> <Paragraph position="2"> We expect this enhancement to be useful both for biomedical information extraction in particular and more generally for the development of statistical systems that can take into account different levels of annotation in a mutually beneficial way.</Paragraph> </Section> </Section> </Paper>