<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-2002">
  <Title>Identifying Chemical Names in Biomedical Text: An Investigation of the Substring Co-occurrence Based Approaches</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Available Data
</SectionTitle>
    <Paragraph position="0"> In order to train a statistical model for recognizing chemicals a list of about 240 thousands entries have been download from National Cancer Institute website (freely available at dtp.nci.nih.gov). Entries are unique names of about 45 thousands unique chemicals. Each entry includes a name of a chemical possibly followed by alternative references and some comments. This additional information had to be deleted in order to compute statistics from chemical names only. While there were no clean separators between chemical names and the additional materials, several patterns were designed to clean up the list. Applying those patterns shrunk each entry on average by half. This cleaning step has not produced perfect results in both leaving some unusable material in and deleting some useful strings, yet it improved the performance of all methods dramatically. Cleaning the list by hand might have produced better results, but it would require more expertise and take a lot of time and would contradict the goal of building the system from readily available data.</Paragraph>
    <Paragraph position="1"> We used text from MEDLINE abstracts to model general biomedical language. These were available as a part of the MEDLINE database of bibliographical records for papers in biomedical domain. Records that had non-empty abstracts have been extracted. From those 'title' and 'abstract' fields were taken and cleaned off from remaining XML tags.</Paragraph>
    <Paragraph position="2"> Both the list of chemical names (LCN) and the text corpus obtained from the MED LINE database (MED) were tokenized by splitting on the white spaces. White space tokenization was used over other possible approaches, as the problem of tokenization is very hard for chemical names, because they contain a lot of internal punctuation. We also wanted to avoid splitting chemical names into tokens that are too small, as they would contain very little internal information to work with. The counts of occurrences of tokens in LCN and MD were used in all experiments to build models of chemical names and general biomedical text.</Paragraph>
    <Paragraph position="3"> In addition, 15 abstracts containing chemical names were selected from the parts of MEDLINE corpus not used for the creation of the above list. These abstracts have been annotated by hand and used as development and test sets.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Classification Using Substring Importance Criteria
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Classification Approach
</SectionTitle>
      <Paragraph position="0"> Most obvious approach to this problem is to try to match the chemicals in the list against the text and label only the matches, i.e. chemicals that are known from the list. This approach is similar to the memory-based baseline described by Palmer et al., 1997, where instead of using precompiled list they memorized all the entries that occurred in a training text.</Paragraph>
      <Paragraph position="1"> A natural extension of matching is a decision list.</Paragraph>
      <Paragraph position="2"> Each classification rule in the list checks if a substring is present in a token. Matching can be viewed as just an extreme of this approach, where the strings selected into the decision list are the complete tokens from the LCN (including token boundary information). Using other substrings increases recall, as non-exact matches are detected, and it also improves precision, as it decreases the number of error coming from noise in LCN.</Paragraph>
      <Paragraph position="3"> While decision list performs better than matching, its performance is still unsatisfactory. Selecting only highly indicative substrings results in high precision, but very low recall. Lowering the thresholds and taking more substrings decreases the precision without improving the recall much until the precision gets very low.</Paragraph>
      <Paragraph position="4"> The decision list approach makes each decision based on a single substring. This forces us to select only substrings that are extreemly rare outside the chemical names. This in turn results in extremely low recall. An alternative would be to combine the information from multiple substrings into a single decision using Naive Bayes framework. This would keep precision from dropping as dramatically when we increase the number of strings used in classification.</Paragraph>
      <Paragraph position="5"> We would like to estimate the probability of a token being a part of a chemical name given the token (string) p(c|s) . Representing each string as a set of its substrings we need to estimate p(c|s1...sn). Using Bayes Rule, we</Paragraph>
      <Paragraph position="7"> Assuming independence of substrings s1...sn and conditional independence of substrings s1...sn given c, we can rewrite:</Paragraph>
      <Paragraph position="9"> (2) Now notice that for most applications we would like to be able to vary precision/recall tradeoff by setting some threshold t and classifying each string s as a chemical only if</Paragraph>
      <Paragraph position="11"> This allows us to avoid estimation of p(c) (estimating p(c) is hard without any labeled text). We can estimate p(si|c) and p(si) from the LCN and MED respectively as tokens)(/#) containg tokens(#)( ii ssp = (5)</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Substring Selection
</SectionTitle>
      <Paragraph position="0"> For this approach, we need to decide what set of substring {si} of s to use to represent s. We would like to select a set of non-overlapping substrings to make the independence assumption more grounded (while it is clear that even non-overlapping substrings are not independent, assuming independence of overlapping substrings clearly causes major problems). In order to do this we need some measure of usefulness of substrings. We would like to select substrings that are both informative and reliable as features, i.e. the substrings fraction of which in LCN is different from the fraction of them in MED and which occur often enough in LCN. Once this measure is defined, we can use dynamic programming algorithm similar to Viterbi decoding to select the set of non-overlapping substrings with maximum value.</Paragraph>
      <Paragraph position="1"> Kullback-Leibler divergence based measure If we view the substring frequencies as a distribution, we can ask the question which substrings account for the biggest contribution to Kullback-Leibler divergence (Cover et al, 1991) between distribution given by LCN and that given by MED. From this view it is reasonable to take p(si|c)*log(p(si|c)/p(si)) as a measure of value of a substring. Therefore, the selection criterion would be tspcspcsp iii &gt;))(/)|(log()|( (6) where t is some threshold value. Notice that this measure combines frequency of a substring in chemicals and the difference between frequencies of occurrences of the substring in chemicals and non-chemicals.</Paragraph>
      <Paragraph position="2"> A problem with this approach arises when either p(si|c) or p(si) is equal to zero. In this case, this selection criterion cannot be computed, yet some of the most valuable strings could have p(si) equal to zero.</Paragraph>
      <Paragraph position="3"> Therefore, we need to smooth probabilities of the strings to avoid zero values. One possibility is to include all strings si, such that p(si)=0 and p(si|c)&gt;t', where t'&lt;t is some new threshold needed to avoid selecting very rare strings. It would be nice though not to introduce an additional parameter. An alternative would be to reassign probabilities to all substrings and keep the selection criterion the same. It could be done, for example, using Good-Turing smoothing (Good 1953).</Paragraph>
      <Paragraph position="4"> Selection by significance testing A different way of viewing this is to say that we want to select all the substrings in which we are confident. It can be observed that tokens might contain certain substrings that are strong indicators of them being chemicals. Useful substrings are the ones that predict significantly different from the prior probability of being a chemical. I.e. if the frequency of chemicals among all tokens is f(c), then s is a useful substring if the frequency of chemicals among tokens containing s f(c|s) is significantly different from f(c). We test the significance by assuming that f(c) is a good estimate for the prior probability of a token being a chemical p(c), and trying to reject the null hypothesis, that actual probability of chemicals among tokens that contain s is also p(c). If the number of tokens containing s is n(s) and the number of chemicals containing s is c(s) , then the selection criterion becomes</Paragraph>
      <Paragraph position="6"> This formula is obtained by viewing occurrences of s as Bernoulli trials with probability p(c) of the occurrence being a chemical and probability (1-p(c)) of the occurrence being non-chemical. Distribution obtained by n(s) such trials can be approximated with the normal distribution with mean n(s)p(c) and variance n(s)p(c)(1-p(c)).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="11" type="metho">
    <SectionTitle>
4 Classification Using N-gram Models
</SectionTitle>
    <Paragraph position="0"> We can estimate probability of a string given class (chemical or non-chemical) as the probability of letters of the string based on a finite history.</Paragraph>
    <Paragraph position="1">  where S is the string to be classified and si are the letters of S.</Paragraph>
    <Paragraph position="2"> The N-gram approach has been a successful modeling technique in many other applications. It has a number of advantages over the Bayesian approach. In this framework we can use information from all substrings of a token, and not only sets of non-overlapping ones. There is no (incorrect) independence assumption, so we get a more sound probability model. As a practical issue, there has been a lot of work done on smoothing techniques for N-gram models (Chen et al., 1998), so it is easier to use them.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Investigating Usefulness of Different N-gram
Lengths
</SectionTitle>
      <Paragraph position="0"> As the first task in investigating N-gram models, we investigated usefulness of N-grams of different length.</Paragraph>
      <Paragraph position="1"> For each n, we constructed a model based on the substrings of this length only using Laplacian smoothing to avoid zero probability.</Paragraph>
      <Paragraph position="3"> where N is the length of the N-grams, nii-N+1 and ncii-N+1 are the number of occurrences of N-gram sisi-1...si-N-1 in MEDLINE and chemical list respectively, d is the smoothing parameter, and B is the number of different N-grams of length N.</Paragraph>
      <Paragraph position="4"> The smoothing parameter was tuned for each n individually using the development data (hand annotated MEDLINE abstracts). The results of these experiments showed that 3-grams and 4-grams are most useful. While poor performance by longer N-grams was somewhat surprising, results indicated that overtraining might be an issue for longer N-grams, as the model they produce models the training data more precisely. While unexpected, the result is similar to the conclusion in Dunning '94 for language identification task.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="11" type="sub_section">
      <SectionTitle>
4.2 Interpolated N-gram Models
</SectionTitle>
      <Paragraph position="0"> In many different tasks that use N-gram models, interpolated or back-off models have been proven useful. The idea here is to use shorter N-grams for smoothing longer ones.</Paragraph>
      <Paragraph position="2"> where lj's are the interpolation coefficients, m and mc are the total number of letters in MEDLINE and chemical list respectively. lj can generally depend on si-1...si-N+1 , with the only constraint that all l j coefficients sum up to one. One of the main question for interpolated models is learning the values for l's.</Paragraph>
      <Paragraph position="3"> Estimating N different l's for each context si-1...si-N+1 is a hard learning task by itself that requires a lot of development data. There are two fundamentally different ways for dealing with this problem. Often grouping different coefficients together and providing single value for each group, or imposing some other constraints on the coefficients is used to decrease the number of parameters. The other approach is providing a theory for values of l's without tuning them on the development data (This is similar in spirit to Minimal Description Length approach). We have investigated several different possibilities in both of these two approaches.</Paragraph>
    </Section>
    <Section position="3" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.3 Computing Interpolation Coefficients: Fixed
Coefficients
</SectionTitle>
      <Paragraph position="0"> Equation (10) can be rewritten in a slightly different form:</Paragraph>
      <Paragraph position="2"> This form states more explicitly that each N-gram model is smoothed by all lower models. An extreme of the grouping approach is then to make all lj's equal, and tune this single parameter on the development data.</Paragraph>
    </Section>
    <Section position="4" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.4 Computing Interpolation Coefficients:
Context Independent Coefficients
</SectionTitle>
      <Paragraph position="0"> Relaxing this constraint and going back to the original form of equation (10), we can make all lj's independent of their context, so we get only N parameters to tune.</Paragraph>
      <Paragraph position="1"> When N is small, this can be done even with relatively small development set. We can do this by exploring all possible settings of these parameters in an N dimensional grid with small increment. For larger N we have to introduce an additional constraint that l j's should lie on some function of j with a smaller number of parameters. We have used a quadratic function (2 parameters, as one of them is fixed by the constraint that all lj's have to sum up to 1). Using higher order of the function gives more flexibility, but introduces more parameters, which would require more development data to tune well. The quadratic function seems to be a good trade off that provides enough flexibility, but does not introduce too many parameters.</Paragraph>
    </Section>
    <Section position="5" start_page="11" end_page="11" type="sub_section">
      <SectionTitle>
4.5 Computing Interpolation Coefficients:
Confidence Based Coefficients
</SectionTitle>
      <Paragraph position="0"> The intuition for using interpolated models is that higher level N-grams give more information when they are reliable, but lower level N-grams are usually more reliable, as they normally occur more frequently. We can formalize this intuition by computing the confidence of higher level N-grams and weight them proportionally. We are trying to estimate p(si|si-1...si-N+1) with the ratio nii-N+1 /ni-1i-N+1. We can say that our observation in the training data was generated by ni-1i-N+1 Bernoulli trials with outcomes either si or any other letter. We consider si to be a positive outcome and any other letter would be a negative outcome. Given this model we have nii-N+1 positive outcomes in ni-1i-N+1 Bernoulli trials with probability of positive outcome p(si|si-1...si-N+1). This means that the estimate given by nii-N+1 /ni-1i-N+1 has the confidence interval of binomial distribution approximated by normal given by  where c = ni-1i-N+1 .</Paragraph>
      <Paragraph position="1"> Since the true probability is within I of the estimate, the lower level models should not change the estimate given by the highest-level model by more than I. This means that lN-1 in the equation (11) should be equal to I. By recursing the argument we get</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>