<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3101"> <Title>A resource for constructing customized test suites for molecular biology entity identification systems</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The taxonomy of features for entities and sentential contexts </SectionTitle> <Paragraph position="0"> In this section we describe the feature sets for entities and sentences, and motivate the inclusion of each, where not obvious.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Feature set for entities </SectionTitle> <Paragraph position="0"> Conceptually, the features for describing name-inputs are separated into four categories: orthographic/typographic, morphosyntactic, source, and lexical.</Paragraph> <Paragraph position="1"> * Orthographic/typographic features describe the presence or absence of features on the level of individual characters, for example the case of letters, the presence or absence of punctuation marks, and the presence or absence of numerals.</Paragraph> <Paragraph position="2"> * Morphosyntactic features describe the presence or absence of features on the level of the morpheme or word, such as the presence or absence of participles, the presence or absence of genitives, and the presence or absence of function words.</Paragraph> <Paragraph position="3"> * Source features are defined with reference to the source of an input. (It should be noted that in software engineering, as in Chomskyan theoretical linguistics, data need not be naturally-occurring to be useful; however, with the wealth of data available for gene names, there is no reason not to include naturalistic data, and knowing its source may be useful, e.g. in evaluating performance on FlyBase names, etc.) Source features include source type, e.g. literature, database, or invention; identifiers in a database; canonical form of the entity in the database; etc.</Paragraph> <Paragraph position="4"> * Lexical features are defined with respect to the relationship between an input and some outside source of lexical information, for instance whether or not an input is or contains a common English word. This is also the place to indicate whether or not an input is present in a resource such as LocusLink, whether or not it is on a particular stoplist, whether it is in-vocabulary or out-of-vocabulary for a particular language model, etc.</Paragraph> <Paragraph position="5"> The distinction between these four broad categories of features is not always clear-cut. For example, presence of numerals is an orthographic/typographic feature, and is also morphosyntactic when the numeral postmodifies a noun, e.g. in heat shock protein 60. Likewise, features may be redundant--for example, the presence of a Greek letter in the square-bracket- or curly-bracket-enclosed formats, or the presence of an apostrophized genitive, is not independent of the presence of the associated punctuation marks. However, Boolean queries over the separate feature sets let them be manipulated and queried independently. So, entities with names like A' can be selected independently of names like Parkinson's disease.</Paragraph> <Paragraph position="6"> Length: Length is defined in characters for symbols and in whitespace-tokenized words for names. Case: This feature is defined in terms of five possible values: all-upper-case, all-lower-case, upper-case-initial-only, each-word-upper-case-initial (e.g. Pray For Elves), and mixed.
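As an illustration only (not part of the published resource), the length and case features just defined can be computed along the following lines in Python; the function names and tokenization details are assumptions of this sketch.

    # Sketch of the length and case features; names and details are illustrative only.
    def length_feature(entity, kind):
        """Length in characters for symbols, in whitespace-tokenized words for names."""
        return len(entity) if kind == "symbol" else len(entity.split())

    def case_feature(entity):
        """Classify an entity into one of the five case values described above."""
        letters = [c for c in entity if c.isalpha()]
        words = [w for w in entity.split() if any(c.isalpha() for c in w)]
        if letters and all(c.isupper() for c in letters):
            return "all-upper-case"
        if letters and all(c.islower() for c in letters):
            return "all-lower-case"
        if len(words) > 1 and all(w[0].isupper() for w in words):
            return "each-word-upper-case-initial"  # e.g. Pray For Elves
        if letters and letters[0].isupper() and all(c.islower() for c in letters[1:]):
            return "upper-case-initial-only"
        return "mixed"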
The fault model motivating this feature hypothesizes that taggers may rely on case to recognize entities and may fail on some combinations of cases with particular sentential positions. For example, one system performed well on gene symbols in general, except when the symbols were lower-case-initial and in sentence-initial position (e.g. p100 is abundantly expressed in liver... (PMID 1722209) and bif displays strong genetic interaction with msn (PMID 12467587)).</Paragraph> <Paragraph position="7"> Numeral-related features: A set of features encodes whether or not an entity contains a numeral, whether the numeral is Arabic or Roman, and the positions of numerals within the entity (initial, medial, or final). The motivation for this feature is the hypothesis that a system might be sensitive to the presence or absence of numerals in entities. One system failed when the entity was a name (vs. a symbol), it contained a number, and the number was in the right-most (vs. a medial) position in a word. It correctly tagged entities like glucose 6 phosphate dehydrogenase but missed the boundary on <gp>alcohol dehydrogenase</gp> 6. This pattern was specific to numbers--letters in the same position were handled correctly.</Paragraph> <Paragraph position="8"> Punctuation-related features: A set of features includes whether an entity contains any punctuation, the count of punctuation marks, and which marks they are (hyphen, apostrophe, etc.). One system failed to recognize names (but typically not symbols) when they included hyphens. Another system had a very reliable pattern of failure involving apostrophes just in case they were in genitives.</Paragraph> <Paragraph position="9"> Greek-letter-related features: These features encode whether or not an entity contains a Greek letter, the position of the letter, and the format of the letter. (This feature is an example of an orthographic feature which may be defined on a substring longer than a character, e.g. beta.) Two systems had problems recognizing gene names when they contained Greek letters in the PubMed Central format, i.e. [beta]1 integrin.</Paragraph> <Paragraph position="10"> The most salient morphosyntactic feature is whether an entity is a name or a symbol. The fault model motivating this feature suggests that a system might perform differently depending on whether an input is a name or a symbol. The most extreme case of a system being sensitive to this feature was one system that performed very well on symbols but recognized no names whatsoever.</Paragraph> <Paragraph position="11"> Features related to function words: A set of features encodes whether or not an entity contains a function word, the number of function words in the entity, and their positions--for instance, the facts: that scott of the antarctic (FlyBase ID FBgn0015538) contains two function words; that they are of and the; and that they are medial to the string. This feature is motivated by two fault models. One posits that a system might apply a stoplist to its input and that processing of function words might therefore halt at an early stage. The other posits that a system might employ shallow parsing to find boundaries of entities and that the shallow parser might insert boundaries at the locations of function words, causing some words to be omitted from the entity. One system always had partial hits on multi-word names unless each word in the name was upper-case-initial, or there was an alphanumeric postmodifier (i.e. a numeral, upper-cased singleton letter, or Greek letter) at the right edge.
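To make the remaining orthographic and function-word features concrete, the following sketch (again illustrative only; the word lists and regular expressions are assumptions of this example, not the resource's definitions) computes numeral-, punctuation-, Greek-letter-, and function-word-related features for an entity.

    import re
    import string

    # Illustrative inventories; the resource's actual lists are not reproduced here.
    GREEK_LETTERS = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "sigma", "omega"}
    FUNCTION_WORDS = {"of", "the", "a", "an", "and", "for", "in", "on", "to"}

    def token_position(i, last):
        return "initial" if i == 0 else ("final" if i == last else "medial")

    def numeral_features(entity):
        tokens = entity.split()
        hits = [i for i, t in enumerate(tokens)
                if re.search(r"\d", t) or re.fullmatch(r"[IVX]+", t)]  # crude Roman-numeral check
        return {"contains_numeral": bool(hits),
                "numeral_positions": [token_position(i, len(tokens) - 1) for i in hits]}

    def punctuation_features(entity):
        marks = [c for c in entity if c in string.punctuation]
        return {"contains_punct": bool(marks),
                "punct_count": len(marks),
                "punct_marks": sorted(set(marks))}

    def greek_letter_features(entity):
        lowered = entity.lower()
        spelled_out = [g for g in GREEK_LETTERS if g in lowered]   # e.g. "beta"
        pmc_format = re.findall(r"\[([a-z]+)\]", lowered)          # e.g. "[beta]1 integrin"
        return {"contains_greek": bool(spelled_out or pmc_format),
                "pmc_format": bool(pmc_format)}

    def function_word_features(entity):
        tokens = entity.lower().split()
        hits = [i for i, t in enumerate(tokens) if t in FUNCTION_WORDS]
        return {"function_word_count": len(hits),
                "function_word_positions": [token_position(i, len(tokens) - 1) for i in hits]}

    # e.g. function_word_features("scott of the antarctic")
    #   -> {'function_word_count': 2, 'function_word_positions': ['medial', 'medial']}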
Features related to inflectional morphology: A set of features encodes whether or not an entity contains nominal number or genitive morphology or verbal participial morphology, and the positions of the words in the entity that contain those morphemes, for instance the facts that apoptosis antagonizing transcription factor (HUGO ID 19235) contains a present participle and that the word that contains it is medial to the string. Features related to parts of speech: Future development of the data will include features encoding the parts of speech present in names.</Paragraph> <Paragraph position="12"> Source or authority: This feature encodes the source of or authority cited for an entity. For many of the entries in the current data, it is an identifier from some database. For others, it is a website (e.g. www.flynome.org).</Paragraph> <Paragraph position="13"> Other possible values include the PMID of a document in which it was observed.</Paragraph> <Paragraph position="14"> Original form in source: Where there is a source for the entity or for some canonical form of the entity, the original form is given. This is not equivalent to the &quot;official&quot; form, but rather is the exact form in which the entity occurs; it may even contain typographic errors (e.g. the extraneous space in nima -related kinase, LocusID 189769 (reported to the NCBI service desk)).</Paragraph> <Paragraph position="15"> Lexical features: These might be better called lexicographic features.</Paragraph> <Paragraph position="16"> They can be encoded impressionistically, or can be defined with respect to an external source, such as WordNet, the UMLS, or other lexical resources. They may also be useful for encoding strictly local information, such as whether or not a gene was attested in training data or whether it is present in a particular language model or other local resource. These features are allowed in the taxonomy but are not implemented in the current data. Our own use of the entity data suggests that they should be, especially the encoding of whether or not names include common English words.</Paragraph> <Paragraph position="17"> (The presence of function words is already encoded.)</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Feature set for sentential contexts </SectionTitle> <Paragraph position="0"> In many ways, this data is much harder to build and classify than the names data, for at least two reasons.</Paragraph> <Paragraph position="1"> First, many more features interact with each other. Second, as soon as a sentence contains more than one gene name, it contains more than one environment, and the number of features for the sentence as a whole is multiplied, as are the interactions between them. For this reason, we have focussed our attention so far on sentences containing only a single gene name, although the current version of the data does include a number of multi-name sentences.</Paragraph> <Paragraph position="2"> The fundamental distinction in the feature set for sentences has to do with whether the sentence is intended to provide an environment in which gene names actually appear, or whether it is intended to provide a non-trivial opportunity for false positives. True positive sentences contain some slot in which entities from the names data can be inserted, e.g.
<> polymorphisms may be correlated with an increased risk of larynx cancer or <> interacts with <> and <> in the two-hybrid system.</Paragraph> <Paragraph position="3"> False positive sentences contain one or more tokens that are deliberately intended to pose challenging opportunities for false positives. Certainly any sentence that does not consist solely of a single gene name contains opportunities for false positives, but not all potential false positives are created equal. We include in the data set sentences that contain tokens with orthographic and typographic characteristics that mimic the patterns commonly seen in gene names and symbols, e.g. The aim of the present study is to evaluate the impact on QoL... where QoL is an abbreviation for quality of life. We also include sentences that contain &quot;keywords&quot; that may often be associated with genes, such as gene, protein, mutant, expression, etc., e.g. Demonstration of antifreeze protein activity in Antarctic lake bacteria.</Paragraph> <Paragraph position="4"> Number and positional features encode the total number of slots in the sentence, and their positions. The value for the position feature is a list whose values range over initial, medial, and final. For example, the sentence <> interacts with <> and <> in the two-hybrid system has the value I,M (initial and medial) for the position feature.</Paragraph> <Paragraph position="5"> Typographic context features encode issues related to tokenization, specifically punctuation, for example whether a slot has punctuation on its left or right edge, and the identity of the punctuation marks.</Paragraph> <Paragraph position="6"> List context features encode data about position in lists. These include the type of list (coordination, asyndetic coordination, or complex coordination).</Paragraph> <Paragraph position="7"> The appositive feature is for the special case of appositioned symbols or abbreviations and their full names or definitions, e.g. The Arabidopsis INNER NO OUTER (INO) gene is essential for formation and...</Paragraph> <Paragraph position="8"> For the systems that we have tested with it, this feature has not revealed problems that are independent of the typographic context. However, we expect it to be of future use in testing systems for abbreviation expansion in this domain.</Paragraph> <Paragraph position="9"> Source features encode the identification and type of the source for the sentence and its original form in the source. The source identifier is often a PubMed ID. It bears pointing out again that there is no a priori reason to use sentences with any naturally-occurring &quot;source&quot; at all, as opposed to the products of the software engineer's imagination. Our primary rationale for using naturalistic sources at all for the sentence data has more to do with convincing the user that some of the combinations of entity features and sentential features that we claim to be worth generating actually do occur. For instance, it might seem counterintuitive that gene symbols or names would ever occur lower-case-initial in sentence-initial position, but in fact we found many instances of this phenomenon; or that a multi-word gene name would occur in text in all upper-case letters, but see the INNER NO OUTER example above.</Paragraph>
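As a sketch of how such slotted templates can be used (the slot marker and helper names below are assumptions of this example, not the resource's format), a test case is instantiated by filling each slot with an entity from the names data, and the position feature is read off the slot indices.

    def slot_position_feature(template, slot="<>"):
        """Return the position feature, e.g. 'I,M' for the two-hybrid example above."""
        tokens = template.split()
        labels = []
        for i, tok in enumerate(tokens):
            if tok == slot:
                labels.append("I" if i == 0 else ("F" if i == len(tokens) - 1 else "M"))
        # Collapse duplicates while preserving order, as in the value "I,M".
        deduped = []
        for lab in labels:
            if lab not in deduped:
                deduped.append(lab)
        return ",".join(deduped)

    def instantiate(template, entities, slot="<>"):
        """Fill each slot, left to right, with the next entity from the names data."""
        pool = iter(entities)
        return " ".join(next(pool) if tok == slot else tok for tok in template.split())

    template = "<> interacts with <> and <> in the two-hybrid system"
    slot_position_feature(template)                  # -> "I,M"
    instantiate(template, ["bif", "msn", "p100"])    # one generated true positive sentence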
<Paragraph position="10"> Syntactic features encode the characteristics of the local environment. Some are very lexical, such as whether the following word is a keyword or whether the preceding word is a species name. Others are more abstract, such as whether the preceding word is an article, an adjective, a conjunction, or a preposition. Interactions with the list context features are complex. The fault model motivating these features hypothesizes that POS context and the presence of keywords might affect a system's judgments about the presence and boundaries of names.</Paragraph> <Paragraph position="11"> Most features for FP sentences encode the characteristics that give the contents of the sentence their FP potential. The keyword feature is a list of keywords present in the sentence, e.g. gene, protein, expression, etc. The typographic features feature encodes whether or not the FP potential comes from orthographic or typographic features of some token in the sentence, such as mixed case, containing hyphens and a number, etc. The morphological features feature encodes whether or not the FP potential comes from apparent morphology, such as words that end with -ase or -in.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Testing the relationship between predictions from performance on a test suite and performance on a corpus </SectionTitle> <Paragraph position="0"> Precision and recall on data in a structured test suite should not be expected to predict precision and recall on a corpus, since there is no relation between the prevalence of features in the test suite and the prevalence of features in the corpus. However, we hypothesized that performance on an equivalence class of inputs in a test suite might predict performance on the same equivalence class in a corpus. To test this hypothesis, we ran a number of test suites through one of the systems and analyzed the results, looking for patterns of errors. The test suites were very simple, varying only entity length, case, hyphenation, and sentence position. Then we ran two corpora through the same system and examined the output for the actual corpora to see if the predictions based on the system's behavior on the test suite actually described performance on similar entities in the corpora.</Paragraph> <Paragraph position="1"> One corpus, which we refer to as PMC (since it was sampled from PubMed Central), consists of 2417 sentences sampled randomly from a set of 1000 full-text articles. This corpus contains 3491 entities. It is described in Tanabe and Wilbur (2002b). The second corpus was distributed as training data for the BioCreative competition. It consists of 10,000 sentences containing 11,851 entities and is described in detail at www.mitre.org/public/biocreative. Each corpus is annotated for entities.</Paragraph> <Paragraph position="2"> The predictions based on the system's performance on the test suite data concerned five equivalence classes of entities, the fifth involving entities that end with numerals.</Paragraph> <Paragraph position="3"> We then examined the system's true positive, false positive, and false negative outputs from the two corpora for outputs that belonged to the equivalence classes in 1-5. Table 1 shows the results.</Paragraph>
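The per-class comparison reported in Table 1 amounts to grouping the gold-standard entities into the predicted equivalence classes and computing recall within each group; a minimal sketch, assuming hypothetical data structures rather than the authors' actual tooling:

    from collections import defaultdict

    def recall_by_class(gold_entities, predicted_spans, classify):
        """gold_entities: iterable of (doc_id, span, text) gold annotations;
        predicted_spans: set of (doc_id, span) pairs output by the system;
        classify: maps an entity's text to an equivalence-class label (or None)."""
        found, total = defaultdict(int), defaultdict(int)
        for doc_id, span, text in gold_entities:
            label = classify(text)
            if label is None:          # entity falls outside the predicted classes
                continue
            total[label] += 1
            if (doc_id, span) in predicted_spans:
                found[label] += 1
        return {label: found[label] / total[label] for label in total}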
<Paragraph position="4"> In Table 1, numbers in the far left column refer to the predictions listed above. Overall performance on the corpora was: BioCreative P = .65, R = .68, and PMC P = .71, R = .62.</Paragraph> <Paragraph position="5"> For equivalence classes 1, 2, and 4, the predictions mostly held. Low recall was predicted, and actual recall was .41, 0.0, .52, 1.0 (the one anomaly), .33, and .46 for these classes of names, versus overall recall of .68 on the BioCreative corpus and .62 on the PMC corpus. The prediction held for equivalence class 5, as well; good recall was predicted, and actual recall was .80 and .70--higher than the overall recalls for the two corpora. The third prediction could not be evaluated due to the normalization of case in the gold standards. These results suggest that a test suite can be a good predictor of performance on entities with particular typographic characteristics.</Paragraph> </Section> </Paper>