<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1030">
  <Title>Extracting the Names of Genes and Gene Products with a Hidden Markov Model</Title>
  <Section position="4" start_page="202" end_page="204" type="metho">
    <SectionTitle>
3 Method
</SectionTitle>
    <Paragraph position="0"> The lmrl)osc of our mod(;1 is Io lind t;hc n,osl: likely so(tilth, liCe of name classes (C) lbr a given se(tucncc of wor(ls (W). The set of name ('lasses inchutcs the 'Unk' name (:lass whi('h we use li)r 1)ackgromM words not 1)elonging to ally ()\[ the interesting name classes given in Tal)lc 1 and t;hc given st;qu(m(:e of words which w(~ ,&gt;('. spans a single s(,Jd;cn('c. The task is thcrcfor(~ 1(} maxintize Pr((TIH:). \Y=c iml)lem(mt a I\]MM to estimate this using th('. Markov assuml)tion that Pr(CIIY= ) can be t'(mnd from t)igrams of ha.me classes.</Paragraph>
    <Paragraph position="1"> In th('. following model we (:onsid(u&amp;quot; words to 1)c ordered pairs consisting of a. surface word, W, and a. word tbature, 1&amp;quot;, given as &lt; W, F &gt;.</Paragraph>
    <Paragraph position="2"> The word features thcms('Jvcs arc discussed in Section 3.1.</Paragraph>
    <Paragraph position="3"> As is common practice, we need to (:alculatc the 1)rol)abilities for a word sequence for the first; word's name class and every other word diflbrently since we have no initial nalnt&gt;class to make a transition frolll. Accordingly we use l;he R)llowing equation to (:alculatc the ilfitial</Paragraph>
    <Paragraph position="5"> likelihood estimates from counts on training data, so that tbr example,</Paragraph>
    <Paragraph position="7"> Where T() has been found from counting the events in thc training cortms. In our current sysl;oln \vc SC\[; t;tlc C()llSt;&amp;lltS ~i }lJld o- i \])y halld all(l let ~ ai = 1.0, ~ Ai = 1.0, a0 &gt; al k O-2, A0 &gt; A I... _&gt; As. Tile current name-class Ct is conditioned oil the current word and feat;llrc~ thc I)rcviolls name-class, ~*t--l: and t)rcvious word an(t tbaturc.</Paragraph>
    <Paragraph position="8"> Equations 1 and 2 implement a linearinterpolating HMM that incorporates a mmfl)cr  of sub-models (rethrred to fl'om now by their A coefficients) designed to reduce the effects of data sparseness. While we hope to have enough training data to provide estimates tbr all model parameters, in reality we expect to encounter highly fl'agmented probability distributions. In the worst case, when even a name class pair has not been observed beibre in training, the model defaults at A5 to an estimate of name class unigrams. We note here that the bigram language model has a non-zero probability associated with each bigram over the entire vocal)ulary. null Our model differs to a backoff formulation because we tbund that this model tended to suffer fl'om the data sparseness problem on our small training set. Bikel et al for example considers each backoff model to be separate models, starting at the top level (corresl)onding approximately to our Ao model) and then falling back to a lower level model when there not enough evidence. In contrast, we have combined these within a single 1)robability calculation tbr state (class) transitions. Moreover, we consider that where direct bigram counts of 6 or more occur in the training set, we can use these directly to estimate the state transition probability and we nse just the ,~0 model in this case. For counts of less than 6 we smooth using Equation 2; this can be thought of as a simt)le form of q)ncketing'. The HMM models one state per name (:lass as well as two special states tbr the start and end ofa sentence.</Paragraph>
    <Paragraph position="9"> Once the state transition l)rol)abilities have been calcnlated according to Equations 1 and 2, the Viterbi algorithm (Viterbi, 1967) is used to search the state space of 1)ossible name class assignments. This is done in linear time, O(MN 2) for 54 the nunfl)er of words to be classified and N the number of states, to find the highest probability path, i.e. to maxinfise Pr(W, C). In our exl)eriments 5/i is the length of a test sentence. The final stage of our algorithm that is used after name-class tagging is complete is to use ~ clean-up module called Unity. This creates a frequency list of words and name-classes tbr a docmnent and then re-tags the document using the most frequently nsed name class assigned by the HMM. We have generally tbund that this improves F-score performance by al)out 2.3%, both tbr re-tagging spuriously tagged words and</Paragraph>
    <Section position="1" start_page="203" end_page="204" type="sub_section">
      <SectionTitle>
3.1 Word features
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the character t'eatnres that we used which are based on those given for Nymble and extended to give high pertbrmance in both molecular-biology and newswire domains. The intnition is that such features provide evidence that helps to distinguish nmne classes of words.</Paragraph>
      <Paragraph position="1"> Moreover we hyt)othesize that such featnres will help the model to find sinfilarities between known words that were tbnnd in the training set and unknown words (of zero frequency in the training set) and so overcome the unknown word t)rol)lem. To give a simple example: if we know that LMP - 1 is a member of PROTEIN and we encounter AP - 1 for the first time in testing, we can make a fairly good guess about the category of the unknown word 'LMP' based on its sharing the same feature TwoCaps with the known word 'AP' and 'AP's known relationship with '- 1'.</Paragraph>
      <Paragraph position="2"> Such unknown word evidence is captured in submodels A1 through ),3 in Equation 2. \Y=e  consider that character information 1)rovides more mealfingflll distinctions between name (;\]asses than for examI)le part-of-speech (POS), since POS will 1)redominmltly 1)e noun fi)r all name-class words. The t'catures were chosen to be as domain independent as possit)le, with the exception of Ilyphon and Greel,:Letter which have t)articular signitieance for the terminology in this dolnain.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="204" end_page="204" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="204" end_page="204" type="sub_section">
      <SectionTitle>
4.1 Training and testing set
</SectionTitle>
      <Paragraph position="0"> The training set we used in our experiments ('onsisted of 100 MEI)II, INI~ al)stra(:ts, marked Ul) ill XS/\[L l)y a (lonmin ext)ert for the name ('lasses given in Tal)le 1. The mmfl)er of NEs that were marked u 1) by class are also given in Tfl)le 1 and the total lmmber of words in the corlms is 299/\]:0. The al)stracts were chosen from a sul)(lomain of moleeular-1)iology that we formulated by s(',ar(;hing under the terms h/uman, blood cell, trav,.scription ,/'actor in the 1)utiMed datal)asc, This yiel(l('.(t al)t)roximately 33(10 al/stracts. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>