<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1201">
  <Title>Recognizing Names in Biomedical Texts using Hidden Markov Model and SVM plus Sigmoid</Title>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
* Word Formation Pattern (F
WFP
): The purpose
</SectionTitle>
    <Paragraph position="0"> of this feature is to capture capitalization, digitalization and other word formation  information. This feature has been widely used in the biomedical domain (Kazama et al 2002; Shen et al 2003; Zhou et al 2004). In this paper, the same feature as in Shen et al 2003 is used.</Paragraph>
  </Section>
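The word formation pattern feature can be sketched as a mapping from a token to a coarse orthographic class. The class inventory below is illustrative only; the exact classes follow Shen et al 2003 and may differ.

```python
def word_formation_pattern(word):
    """Map a token to a coarse word-formation class (illustrative
    inventory; the paper reuses the feature set of Shen et al 2003)."""
    if word.isdigit():
        return "AllDigits"        # e.g. "1995"
    if word.isupper():
        return "AllCaps"          # e.g. "TCF", "IL2"
    if any(c.isdigit() for c in word):
        return "ContainsDigit"    # e.g. "p53-like"
    if word[0].isupper():
        return "InitCap"          # e.g. "Factor"
    if word.islower():
        return "LowerCase"        # e.g. "binds"
    return "Other"
```

Such a class is attached to each token as part of its feature set before decoding.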
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
* Morphological Pattern (F
MP
): Morphological
</SectionTitle>
    <Paragraph position="0"> information, such as prefix and suffix, is considered as an important cue for terminology identification and has been widely applied in the biomedical domain (Kazama et al 2002; Lee et al 2003; Shen et al 2003; Zhou et al 2004). As in Shen et al 2003, we use a statistical method to get the most useful prefixes/suffixes from the training data.</Paragraph>
  </Section>
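A simple frequency criterion over the training entity names can stand in for the statistical prefix/suffix selection of Shen et al 2003 (the paper's exact selection statistic is not specified here, so plain counts are an assumption):

```python
from collections import Counter

def frequent_affixes(entity_words, n=3, top_k=10):
    """Collect the most frequent length-n prefixes and suffixes from
    words occurring in training entity names. Frequency ranking is a
    stand-in for the paper's statistical selection method."""
    prefixes = Counter(w[:n].lower() for w in entity_words if len(w) >= n)
    suffixes = Counter(w[-n:].lower() for w in entity_words if len(w) >= n)
    return ([p for p, _ in prefixes.most_common(top_k)],
            [s for s, _ in suffixes.most_common(top_k)])
```

The selected affixes (e.g. "-ase" for enzymes) then serve as binary features on each token.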
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
* Part-of-Speech (F
POS
</SectionTitle>
    <Paragraph position="0"> ): Since many of the words in biomedical entity names are in lowercase, capitalization information in the biomedical domain is not as evidential as that in the newswire domain. Moreover, many biomedical entity names are descriptive and very long. Therefore, POS may provide useful evidence about the boundaries of biomedical entity names.</Paragraph>
  </Section>
  <Section position="7" start_page="1" end_page="1" type="metho">
    <SectionTitle>
* Head Noun Trigger (F
HEAD
</SectionTitle>
    <Paragraph position="0"> ): The head noun, which is the major noun of a noun phrase, often describes the function or the property of the noun phrase. In this paper, we automatically extract unigram and bigram head nouns from the training data, and rank them by frequency. For each entity class, we select 50% of top ranked head nouns as head noun triggers. Table 1 shows some of the examples.</Paragraph>
  </Section>
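The head noun trigger extraction described above can be sketched as follows, under the simplifying assumption that the head noun of an entity name is its final word (the paper does not spell out its head-finding rule):

```python
from collections import Counter

def head_noun_triggers(entity_names, keep=0.5):
    """Extract unigram and bigram head nouns from training entity
    names, rank them by frequency, and keep the top fraction
    (50% in the paper) as head noun triggers. Assumes the head
    noun is the rightmost word of the name."""
    counts = Counter()
    for name in entity_names:
        toks = name.lower().split()
        counts[toks[-1]] += 1                      # unigram head noun
        if len(toks) >= 2:
            counts[" ".join(toks[-2:])] += 1       # bigram head noun
    ranked = [h for h, _ in counts.most_common()]
    return ranked[: max(1, int(len(ranked) * keep))]
```

Run per entity class, this yields class-specific trigger lists like those in Table 1.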
  <Section position="8" start_page="1" end_page="1" type="metho">
    <SectionTitle>
* Name Alias (F
ALIAS
</SectionTitle>
    <Paragraph position="0"> ): Besides the above widely used features, we also propose a novel name alias feature. The intuition behind this feature is the name alias phenomenon: relevant entities are referred to in many ways throughout a given text, so successful named entity recognition depends on determining when one noun phrase refers to the same entity as another noun phrase.</Paragraph>
    <Paragraph position="1"> During decoding, the entity names already recognized from the previous sentences of the document are stored in a list. When the system encounters an entity name candidate (e.g. a word with a special word formation pattern), a name alias algorithm (similar to Schwartz et al 2003) is invoked to first dynamically determine whether the entity name candidate might be an alias of a previously recognized name in the list. This is done by checking whether all the characters in the entity name candidate exist in a recognized entity name in the same order and whether the first character in the entity name candidate is the same as the first character in the recognized name. For related work, see Jacquemin (2001). The name alias feature F</Paragraph>
  </Section>
  <Section position="9" start_page="1" end_page="1" type="metho">
    <SectionTitle>
ALIAS
</SectionTitle>
    <Paragraph position="0"> is represented as ENTITYnLm (L indicates the locality of the name alias phenomenon). Here ENTITY indicates the class of the recognized entity name and n indicates the number of the words in the recognized entity name, while m indicates the number of the words in the recognized entity name from which the name alias candidate is formed. For example, when the decoding process encounters the word &amp;quot;TCF&amp;quot;, the word &amp;quot;TCF&amp;quot; is proposed as an entity name candidate and the name alias algorithm is invoked to check if the word &amp;quot;TCF&amp;quot; is an alias of a recognized named entity. If &amp;quot;T cell Factor&amp;quot; is a &amp;quot;Protein&amp;quot; name recognized earlier in the document, the word &amp;quot;TCF&amp;quot; is recognized as an alias of &amp;quot;T cell Factor&amp;quot; with the name alias feature Protein3L3, since it takes the three initial letters of the three-word &amp;quot;protein&amp;quot; name &amp;quot;T cell Factor&amp;quot;.</Paragraph>
  </Section>
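The alias check and the ENTITYnLm feature construction can be sketched together. The greedy left-to-right character matching below is one possible reading of the paper's in-order character test; the exact tie-breaking is an assumption.

```python
def alias_feature(candidate, recognized_name, entity_class):
    """Return the name-alias feature ENTITYnLm, or None if the
    candidate is not an alias of the recognized name. A candidate
    is an alias when its characters occur in order in the recognized
    name and the first characters agree (cf. Schwartz et al 2003)."""
    words = recognized_name.split()
    if not candidate or candidate[0].lower() != words[0][0].lower():
        return None
    text = recognized_name.lower()
    # word index for each character position (spaces belong to the
    # preceding word; the trailing extra entry is never reached)
    word_of = []
    for i, w in enumerate(words):
        word_of.extend([i] * (len(w) + 1))
    pos, last_word = 0, 0
    for ch in candidate.lower():
        pos = text.find(ch, pos)        # greedy in-order match
        if pos == -1:
            return None
        last_word = word_of[pos]
        pos += 1
    n = len(words)                      # words in the recognized name
    m = last_word + 1                   # words the alias is formed from
    return f"{entity_class}{n}L{m}"
```

For the paper's example, matching "TCF" against the recognized "Protein" name "T cell Factor" yields Protein3L3.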
  <Section position="10" start_page="1" end_page="11" type="metho">
    <SectionTitle>
3. METHODS
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="11" type="sub_section">
      <SectionTitle>
3.1 Hidden Markov Model
</SectionTitle>
      <Paragraph position="0"> Given the various features above, the key problem is how to integrate them effectively and efficiently and find the optimal resolution to biomedical named entity recognition. Here, we use the Hidden Markov Model (HMM) as described in Zhou et al 2002. An HMM is a model where a sequence of outputs is generated in addition to the Markov state sequence. It is a latent variable model in the sense that only the output sequence is observed while the state sequence remains &amp;quot;hidden&amp;quot;.</Paragraph>
      <Paragraph position="1"> Given an observation sequence O = o_1 o_2 ... o_n, the purpose of an HMM is to find the most likely state sequence S = s_1 s_2 ... s_n that maximizes log P(S|O). Here, the observation o_i = (w_i, f_i), where w_i is the word and f_i is the feature set of the word w_i, and the state s_i is structural, taking the form:</Paragraph>
      <Paragraph position="3"> s_i = BOUNDARY_ENTITY-FEATURE</Paragraph>
      <Paragraph position="5"> where BOUNDARY denotes the position of the current word in the entity; ENTITY indicates the class of the entity; and FEATURE is the feature set used to model the ngram more precisely.</Paragraph>
      <Paragraph position="7"> log P(S|O) = log P(S) + log [P(S|O) / P(S)]    (1)</Paragraph>
      <Paragraph position="9"> The second term in Equation (1) is the mutual information between S and O. In order to simplify the computation of this term, we assume mutual information independence:</Paragraph>
      <Paragraph position="11"> log [P(S|O) / P(S)] = SUM_i log [P(s_i|O) / P(s_i)]    (2)</Paragraph>
      <Paragraph position="13"> That is, an individual tag is only dependent on the output sequence O and independent of the other tags in the tag sequence S. This assumption is reasonable because the dependence among the tags in the tag sequence S has already been captured by the first term in Equation (1). Applying assumption (2) to Equation (1), we have: log P(S|O) = log P(S) - SUM_i log P(s_i) + SUM_i log P(s_i|O)    (3)</Paragraph>
      <Paragraph position="15"> From Equation (3), we can see that: * The first term can be computed by applying chain rules. In ngram modeling (Chen et al 1996), each tag is assumed to be dependent on the N-1 previous tags.</Paragraph>
      <Paragraph position="16"> * The third term corresponds to the lexical component (dictionary) of the tagger. The idea behind the model is that it tries to assign each output an appropriate tag (state), which contains boundary and class information. For example, consider &amp;quot;TCF 1 binds stronger than NF kB to TCEd DNA&amp;quot;. The tag assigned to the token &amp;quot;TCF&amp;quot; should indicate that it is at the beginning of an entity name and belongs to the &amp;quot;Protein&amp;quot; class, while the tag assigned to the token &amp;quot;binds&amp;quot; should indicate that it does not belong to an entity name. Here, the Viterbi algorithm (Viterbi 1967) is implemented to find the most likely tag sequence. The problem with the above HMM lies in the data sparseness problem raised by P(s_i|O) in the third term of Equation (3). Ideally, we would have sufficient training data for every event whose conditional probability we wish to calculate.</Paragraph>
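Viterbi decoding over the decomposition in Equation (3) can be sketched as below. The scores are toy values: in the real system the transition scores come from the ngram model and the per-position output scores log P(s_i|O) - log P(s_i) come from the SVM plus sigmoid described later.

```python
import math

def viterbi(obs_scores, trans, states):
    """Find the most likely tag sequence: bigram transition log-scores
    in `trans` (covering the first two terms of Equation (3)) plus
    per-position output scores in `obs_scores` (the third term)."""
    best = [{s: obs_scores[0].get(s, -math.inf) for s in states}]
    back = [{}]
    for t in range(1, len(obs_scores)):
        cur, bp = {}, {}
        for s in states:
            score, prev = max(
                (best[t - 1][p] + trans.get((p, s), -math.inf), p)
                for p in states)
            cur[s] = score + obs_scores[t].get(s, -math.inf)
            bp[s] = prev
        best.append(cur)
        back.append(bp)
    # backtrack from the best final state
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for t in range(len(obs_scores) - 1, 0, -1):
        tag = back[t][tag]
        path.append(tag)
    return path[::-1]
```

With plausible scores for "TCF binds", the decoder tags "TCF" as the start of a protein name and "binds" as outside any entity.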
      <Paragraph position="17"> Unfortunately, there is rarely enough training data to compute accurate probabilities when decoding on new data. Generally, two smoothing approaches (Chen et al 1996) are applied to resolve this problem: linear interpolation and back-off.</Paragraph>
      <Paragraph position="18"> However, these two approaches only work well when the number of different information sources is limited. When a few features and/or a long context are considered, the number of different information sources is exponential. In this paper, a Support Vector Machine (SVM) plus sigmoid is proposed to resolve this problem in our system.</Paragraph>
      <Paragraph position="20"/>
    </Section>
    <Section position="2" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
3.2 Support Vector Machine plus Sigmoid
</SectionTitle>
      <Paragraph position="0"> Support Vector Machines (SVMs) are a popular machine learning approach first presented by Vapnik (1995). Based on the structural risk minimization of statistical learning theory, SVMs seek an optimal separating hyper-plane to divide the training examples into two classes and make decisions based on support vectors, which are selected as the only effective examples in the training set. However, SVMs produce an uncalibrated value that is not a probability. That is, the unthresholded output of an SVM (with a linear kernel) can be represented as f(x) = w * x + b, and a sigmoid function is fitted to map this output to a posterior probability: P(y=1|x) = 1 / (1 + exp(A*f(x) + B)).</Paragraph>
      <Paragraph position="2"> Basically, SVMs are binary classifiers.</Paragraph>
      <Paragraph position="3"> Therefore, we must extend SVMs to multi-class (e.g. K) classifiers. For efficiency, we apply the one vs. others strategy, which builds K classifiers so as to separate one class from all others, instead of the pairwise strategy, which builds K*(K-1)/2 classifiers considering all pairs of classes.</Paragraph>
      <Paragraph position="4"> Moreover, we only apply the simple linear kernel, although other kernels (e.g. the polynomial kernel) and the pairwise strategy can achieve better performance.</Paragraph>
      <Paragraph position="5"> Finally, for each state s_i, there is one sigmoid sigmoid_i. Therefore, the sigmoid outputs are normalized to get a probability distribution using:</Paragraph>
      <Paragraph position="7"> P(s_i|O) = sigmoid_i(f_i) / SUM_j sigmoid_j(f_j)</Paragraph>
    </Section>
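The SVM-plus-sigmoid step can be sketched as follows. The sigmoid parameters A, B here are placeholders; in practice one sigmoid per state is fitted on held-out data (Platt's method), and the normalized values feed the third term of Equation (3).

```python
import math

def state_probabilities(svm_outputs, sigmoid_params):
    """Turn uncalibrated per-state SVM outputs f_s(x) into a
    probability distribution over states: each output passes through
    its own sigmoid 1 / (1 + exp(A*f + B)), then the sigmoid values
    are normalized to sum to one."""
    sig = {}
    for state, f in svm_outputs.items():
        A, B = sigmoid_params[state]          # fitted per state
        sig[state] = 1.0 / (1.0 + math.exp(A * f + B))
    z = sum(sig.values())
    return {s: v / z for s, v in sig.items()}
```

Normalization is what lets K independently calibrated one-vs-others classifiers act as a single distribution over states.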
    <Section position="3" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
3.3 Post-Processing
</SectionTitle>
      <Paragraph position="0"> Two post-processing modules, namely cascaded entity name resolution and abbreviation resolution, are applied in our system to further improve the performance.</Paragraph>
      <Paragraph position="1"> Cascaded Entity Name Resolution It is found (Shen et al 2003) that 16.57% of entity names in GENIA V3.0 have cascaded constructions, e.g.</Paragraph>
      <Paragraph position="2"> &lt;RNA&gt;&lt;DNA&gt;CIITA&lt;/DNA&gt; mRNA&lt;/RNA&gt;.</Paragraph>
      <Paragraph position="3"> Therefore, it is important to resolve this phenomenon.</Paragraph>
      <Paragraph position="4"> Here, a pattern-based module is proposed to resolve the cascaded entity names, while the above HMM is applied to recognize embedded entity names and non-cascaded entity names. In the GENIA corpus, we find that there are six useful patterns of cascaded entity name constructions. In our experiments, all the rules of these six patterns are extracted from the cascaded entity names in the training data to deal with the cascaded entity name phenomenon.</Paragraph>
    </Section>
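One family of such pattern rules can be sketched as "embedded entity + class-bearing head noun => cascaded entity". The rule table below is hypothetical (the actual six pattern types and their rules are extracted from the GENIA training data and are not listed here); the DNA + "mRNA" rule mirrors the CIITA example above.

```python
# Hypothetical rules: (embedded class, trailing head noun) -> cascaded class.
RULES = {
    ("DNA", "mRNA"): "RNA",       # e.g. <DNA>CIITA</DNA> mRNA -> RNA
    ("Protein", "gene"): "DNA",   # illustrative only
}

def resolve_cascade(embedded_class, embedded_text, next_word):
    """Apply a pattern rule: an embedded entity followed by a
    class-bearing head noun forms a cascaded entity of the rule's
    class; return (class, text) or None if no rule fires."""
    target = RULES.get((embedded_class, next_word))
    if target is None:
        return None
    return (target, f"{embedded_text} {next_word}")
```

The HMM recognizes the embedded name; the rule then wraps it into the cascaded name.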
    <Section position="4" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
Abbreviation Resolution
</SectionTitle>
      <Paragraph position="0"> While the name alias feature is useful to detect the inter-sentential name alias phenomenon, it is unable to identify the inner-sentential name alias phenomenon: the inner-sentential abbreviation.</Paragraph>
      <Paragraph position="1"> Such abbreviations widely occur in the biomedical domain.</Paragraph>
      <Paragraph position="2"> In our system, we present an effective and efficient algorithm to recognize the inner-sentential abbreviations more accurately by mapping them to their full expanded forms. In the GENIA corpus, we observe that the expanded form and its abbreviation often occur together via parentheses. Generally, there are two patterns: &amp;quot;expanded form (abbreviation)&amp;quot; and &amp;quot;abbreviation (expanded form)&amp;quot;.</Paragraph>
      <Paragraph position="3"> Our algorithm is based on the fact that it is much harder to classify an abbreviation than its expanded form. Generally, the expanded form is more evidential than its abbreviation in determining its class. The algorithm works as follows: given a sentence with parentheses, we use an algorithm similar to that of Schwartz et al 2003 to determine whether it contains an abbreviation with parentheses. This is done by starting from the end of both the abbreviation and the expanded form, moving from right to left and trying to find the shortest expanded form that matches the abbreviation. Any character in the abbreviation can match any character in the expanded form, with one exception: the character at the beginning of the abbreviation must match the first alphabetic character of the first word in the expanded form. If so, we remove the abbreviation and the parentheses from the sentence. After the sentence is processed, we restore the abbreviation with parentheses to its original position in the sentence. Then, the abbreviation is assigned the same class as the expanded form, if the expanded form is recognized as an entity name. Meanwhile, we also adjust the boundaries of the expanded form according to the abbreviation, if necessary. Finally, the expanded form and its abbreviation are stored in the recognized list of biomedical entity names from the document to help resolve forthcoming occurrences of the same abbreviation in the document.</Paragraph>
    </Section>
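The right-to-left matching step can be sketched as below, following the Schwartz-and-Hearst-style procedure the paragraph describes; the exact character-skipping rules of the paper's variant are an assumption.

```python
def find_expansion(expanded, abbrev):
    """Right-to-left shortest match of an abbreviation against the
    text preceding its parenthesis: every abbreviation character must
    appear, in order, in the expanded form, and the abbreviation's
    first character must align with the start of a word. Returns the
    matched expanded form, or None if no match exists."""
    i, j = len(expanded) - 1, len(abbrev) - 1
    while j >= 0:
        ch = abbrev[j].lower()
        if not ch.isalnum():            # skip punctuation in the abbreviation
            j -= 1
            continue
        # move left until this character matches; the first abbreviation
        # character must additionally sit at the beginning of a word
        while i >= 0 and (expanded[i].lower() != ch or
                          (j == 0 and i > 0 and expanded[i - 1].isalnum())):
            i -= 1
        if i < 0:
            return None
        i -= 1
        j -= 1
    return expanded[i + 1:]
```

Because the scan is right-to-left and greedy, the returned span is the shortest expanded form that still covers the abbreviation.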
  </Section>
  <Section position="11" start_page="11" end_page="11" type="metho">
    <SectionTitle>
4. EXPERIMENTS AND EVALUATION
</SectionTitle>
    <Paragraph position="0"> We evaluate our PowerBioNE system on GENIA V1.1 and GENIA V3.0 using precision/recall/F-measure. For each evaluation, we select 20% of the corpus as the held-out test data and the remaining 80% as the training data. All the experiments are done 5 times and the evaluations are averaged over the held-out test data. For cascaded entity name resolution, an average of 59 and 97 rules are extracted from the cascaded entity names in the training data of GENIA V1.1 and V3.0 respectively. For POS, all the POS taggers are trained on the training data, with POS tags imported from the corresponding POS-annotated GENIA V3.02p.</Paragraph>
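The precision/recall/F-measure scoring can be sketched as exact matching over entity spans, assuming the balanced F-measure (the scoring details are standard for this task rather than spelled out in the paper):

```python
def evaluate(gold, predicted):
    """Exact-match precision, recall and balanced F-measure over
    sets of (start, end, class) entity spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # exact span + class match
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```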
    <Paragraph position="1"> Table 2 shows the performance of our system on GENIA V1.1 and GENIA V3.0, and the comparison with that of the best reported system (Shen et al 2003). It shows that our system achieves the F-measure of 69.1 on GENIA V1.1 and the F-measure of 71.2 on GENIA V3.0 respectively, without the help of any dictionaries. It also shows that our system outperforms Shen et al (2003) by 6.9 in F-measure on GENIA V1.1 and 4.6 in F-measure on GENIA V3.0. This is largely due to the superiority of the SVM plus sigmoid in our system (improvement of 3.7 in F-measure on GENIA V3.0) over the back-off approach in Shen et al (2003) and the novel name alias feature (improvement of 1.2 in F-measure on GENIA V3.0). Finally, evaluation also shows that the cascaded entity name resolution and the abbreviation resolution contribute 3.4 and 2.1 respectively in F-measure on GENIA V3.0.</Paragraph>
    <Paragraph position="2"> One important question concerns the performance on different entity classes. Table 3 shows the performance of some of the biomedical entity classes on GENIA V3.0. Of particular interest, our system achieves the F-measure of 77.8 on the class &amp;quot;Protein&amp;quot;. It shows that the performance varies a lot among different entity classes. One reason may be the different difficulty of recognizing different entity classes. Another reason may be the different numbers of instances in different entity classes. Though GENIA V3.0 provides a good basis for named entity recognition in the biomedical domain, and is probably the best available, it has a clear bias. Table 3 shows that, while GENIA V3.0 is large enough for recognizing the major classes, such as &amp;quot;Protein&amp;quot;, &amp;quot;Cell Type&amp;quot;, &amp;quot;Cell Line&amp;quot;, &amp;quot;Lipid&amp;quot; etc, it is of limited size for recognizing other classes, such as &amp;quot;Virus&amp;quot;.</Paragraph>
  </Section>
  <Section position="12" start_page="11" end_page="11" type="metho">
    <SectionTitle>
5. ERROR ANALYSIS
</SectionTitle>
    <Paragraph position="0"> In order to further evaluate our system and explore possible improvements, we have performed an error analysis. This is done by randomly choosing 100 errors from our recognition results. During the error analysis, we find that many errors are due to the strict annotation scheme and the annotation inconsistency in the GENIA corpus, and can be considered acceptable. Therefore, we will also examine the acceptable F-measure of our system, in particular the acceptable F-measure on the &amp;quot;protein&amp;quot; class.</Paragraph>
    <Paragraph position="1"> All the 100 errors are classified as follows: * Left boundary errors (14): It includes the errors with correct class identification, correct right boundary detection and only wrong left boundary detection. We find that most of such errors come from the long and descriptive naming convention.</Paragraph>
    <Paragraph position="2"> We also find that 11 of 14 errors are acceptable and that ignoring the descriptive words often does not make much difference for the entity names.</Paragraph>
    <Paragraph position="3"> In fact, it is even hard for biologists to decide whether the descriptive words should be a part of the entity names, such as &amp;quot;normal&amp;quot;, &amp;quot;activated&amp;quot;, etc. In particular, 4 of 14 errors belong to the &amp;quot;protein&amp;quot; class. Among them, two errors are acceptable, e.g. &amp;quot;classical &lt;PROTEIN&gt;1,25 (OH) 2D3 receptor&lt;/PROTEIN&gt;&amp;quot; =&gt; &amp;quot;&lt;PROTEIN&gt;classical 1,25 (OH) 2D3 receptor&lt;/PROTEIN&gt;&amp;quot; (with format of &amp;quot;annotation in the corpus =&gt; identification made by our system&amp;quot;), while the other two are unacceptable, e.g. &amp;quot;&lt;PROTEIN&gt;viral transcription factor&lt;/PROTEIN&gt; =&gt; viral &lt;PROTEIN&gt;transcription factor&lt;/PROTEIN&gt;&amp;quot;.</Paragraph>
    <Paragraph position="4"> * Cascaded entity name errors (15): It includes the errors caused by the cascaded entity name phenomenon. We find that most of such errors come from the annotation inconsistency in the GENIA corpus: In some cases, only the embedded entity names are annotated while in other cases, the embedded entity names are not annotated. Our system tends to annotate both the embedded entity names and the whole entity names. Among them, we find that 13 of 16 errors are acceptable. In particular, 2 of 16 errors belong to the &amp;quot;protein&amp;quot; class and both are acceptable, e.g. &amp;quot;&lt;DNA&gt;NF kappa B binding site&lt;/DNA&gt;&amp;quot; =&gt;</Paragraph>
  </Section>
  <Section position="13" start_page="11" end_page="11" type="metho">
    <SectionTitle>
&amp;quot;&lt;DNA&gt;&lt;PROTEIN&gt;NF kappa B&lt;/PROTEIN&gt;
</SectionTitle>
    <Paragraph position="0"> binding site&lt;/DNA&gt;&amp;quot;.</Paragraph>
    <Paragraph position="1"> * Misclassification errors (18): It includes the errors with wrong class identification, correct right boundary detection and correct left boundary detection. We find that this kind of error mainly comes from the sense ambiguity of biomedical entity names and is very difficult to disambiguate. Among them, 8 errors are related to the &amp;quot;DNA&amp;quot; class and 6 errors are related to the &amp;quot;Cell Line&amp;quot; and &amp;quot;Cell Type&amp;quot; classes. We also find that only 3 of 18 errors are acceptable. In particular, there are 6 errors related to the &amp;quot;protein&amp;quot; class. Finally, we find that all 6 errors are caused by misclassification of the &amp;quot;DNA&amp;quot; class as the &amp;quot;protein&amp;quot; class and all of them are unacceptable, e.g. &amp;quot;&lt;DNA&gt;type I IFN&lt;/DNA&gt;&amp;quot; =&gt; &amp;quot;&lt;PROTEIN&gt;type I IFN&lt;/PROTEIN&gt;&amp;quot;.</Paragraph>
    <Paragraph position="2"> * True negative (23): It includes the errors of missing the identification of biomedical entity names. We find that 16 errors come from the &amp;quot;other&amp;quot; class and 10 errors from the &amp;quot;protein&amp;quot; class. We also find that the GENIA corpus annotates some general noun phrases as biomedical entity names, e.g. &amp;quot;protein&amp;quot; in &amp;quot;the protein&amp;quot; and &amp;quot;cofactor&amp;quot; in &amp;quot;a cofactor&amp;quot;. Finally, we find that 11 of 23 errors are acceptable. In particular, 9 of 23 errors are related to the &amp;quot;protein&amp;quot; class. Among them, 3 are acceptable while the other 6 are unacceptable, e.g. &amp;quot;&lt;PROTEIN&gt;80 kDa&lt;/PROTEIN&gt;&amp;quot; =&gt; &amp;quot;80 kDa&amp;quot;. * False positive (15): It includes the errors of wrongly identifying biomedical entity names which are not annotated in the GENIA corpus. We find that 9 of 15 errors come from the &amp;quot;other&amp;quot; class. This suggests that the annotation of the &amp;quot;other&amp;quot; class lacks consistency and is the most problematic in the GENIA corpus. We also find that 7 of 15 errors are acceptable. In particular, 2 of 15 errors are related to the &amp;quot;protein&amp;quot; class and both are acceptable, e.g. &amp;quot;affinity sites&amp;quot; =&gt; &amp;quot;&lt;PROTEIN&gt;affinity sites&lt;/PROTEIN&gt;&amp;quot;.</Paragraph>
    <Paragraph position="3"> * Miscellaneous (14): It includes all the other errors, e.g. combinations of the above errors and errors caused by parentheses. We find that only 1 of 14 errors is acceptable. We also find that, among them, 2 errors are related to the &amp;quot;protein&amp;quot; class and both are unacceptable, e.g.</Paragraph>
    <Paragraph position="4"> &amp;quot;&lt;PROTEIN&gt;17 amino acid epitope&lt;/PROTEIN&gt;&amp;quot; =&gt; &amp;quot;17 &lt;RNA&gt;amino acid epitope&lt;/RNA&gt;&amp;quot;.</Paragraph>
    <Paragraph position="5"> From the above error analysis, we find that about half (46/100) of the errors are acceptable and can be avoided by a flexible annotation scheme (e.g. regarding the modifiers at the left boundaries) and consistent annotation (e.g. in the annotation of the &amp;quot;other&amp;quot; class and the cascaded entity name phenomenon). In particular, about one third (9/25) of the errors on the &amp;quot;protein&amp;quot; class are acceptable. This means that the acceptable F-measure can reach about 84.4 on the 23 classes of GENIA V3.0. In particular, the acceptable F-measure on the &amp;quot;protein&amp;quot; class is about 85.8. In addition, this performance is achieved without using any extra resources (e.g. dictionaries). With the help of extra resources, we think an acceptable F-measure of near 90 can be achieved in the near future.</Paragraph>
  </Section>
</Paper>