<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1010">
  <Title>Exploiting Domain Structure for Named Entity Recognition</Title>
  <Section position="3" start_page="74" end_page="75" type="metho">
    <SectionTitle>
2 Generalizability-Based Feature Ranking
</SectionTitle>
    <Paragraph position="0"> We take a commonly used approach and treat NER as a sequential tagging problem (Borthwick, 1999; Zhou and Su, 2002; Finkel et al., 2005). Each token is assigned the tag I if it is part of an NE and the tag O otherwise. Let x denote the feature vector for a token, and let y denote the tag for x. We first compute the probability p(y|x) for each token, using a  learned classifier. We then apply Viterbi algorithm to assign the most likely tag sequence to a sequence of tokens, i.e., a sentence. The features we use follow the common practice in NER, including surface word features, orthographic features, POS tags, substrings, and contextual features in a local window of size 5 around the target token (Finkel et al., 2005).</Paragraph>
    <Paragraph position="1"> As in any learning problem, feature selection may affect the NER performance significantly. Indeed, a very likely cause of the domain overfitting problem may be that the learned NE recognizer has picked up some non-generalizable features, which are not useful for a new domain. Below, we present a generalizability-based feature ranking method, which favors more generalizable features.</Paragraph>
    <Paragraph position="2"> Formally, we assume that the training examples are divided into m subsets T</Paragraph>
    <Paragraph position="4"> sponding to m different domains D</Paragraph>
    <Paragraph position="6"> We further assume that the test set T  is from a new domain D  , and this new domain shares some common features of the m training domains. Note that these are reasonable assumptions that reflect the situation in real problems. We use generalizability to denote the amount of contribution a feature can make to the classification accuracy on any domain. Thus, a feature with high generalizability should be useful for classification on any domain. To identify the highly generalizable features, we must then compare their contributions to classification among different domains. Suppose in each individual domain, the features can be ranked by their contributions to the classification accuracy. There are different feature ranking methods based on different criteria. Without loss of generality, let us use r T : F - {1, 2, . . . , |F|} to denote a ranking function that maps a feature f [?] F to a rank r T (f) based on a set of training examples T, where F is the set of all features, and the rank denotes the position of the feature in the final ranked list. The smaller the rank r T (f) is, the more important the feature f is in the training set T. For the m training domains, we thus have m ranking functions</Paragraph>
    <Paragraph position="8"> To identify the generalizable features across the m different domains, we propose to combine the m individual domain ranking functions in the following way. The idea is to give high ranks to features that are useful in all training domains . To achieve this goal, we first define a scoring function s : F - R as follows:</Paragraph>
    <Paragraph position="10"> We then rank the features in decreasing order of their scores using the above scoring function. This is essentially to rank features according to their maximum rank max</Paragraph>
    <Paragraph position="12"> (f) among the m domains. Let function r gen return the rank of a feature in this combined, generalizability-based ranked list. The original ranking function r T used for individual domain feature ranking can use different criteria such as information gain or kh  statistic (Yang and Pedersen, 1997). In our experiments, we used a ranking function based on the model parameters of the classifier, which we will explain in Section 5.2. Next, we need to incorporate this preference for generalizable features into the classifier. Note that because this generalizability-based feature ranking method is independent of the learning algorithm, it can be applied on top of any classifier. In this work, we choose the logistic regression classifier. One way to incorporate the feature ranking into the classifier is to select the top-k features, where k is chosen by cross validation. There are two potential problems with this hard feature selection approach. First, once k features are selected, they are treated equally during the learning stage, resulting in a loss of the preference among these k features. Second, this incremental feature selection approach does not consider the correlation among features. We propose an alternative way to incorporate the feature ranking into the classifier, where the preference for generalizable features is transformed into a non-uniform prior over the feature parameters in the model. This can be regarded as a soft feature selection approach.</Paragraph>
  </Section>
  <Section position="4" start_page="75" end_page="75" type="metho">
    <SectionTitle>
3 Logistic Regression for NER
</SectionTitle>
    <Paragraph position="0"> In binary logistic regression models, the probability of an observation x being classified as I is</Paragraph>
    <Paragraph position="2"> are the weights for the features, and x prime is the augmented feature vector with x  = 1. The weight vector b can be learned from the training examples by a maximum likelihood estimator. It is worth pointing out that logistic regression has a close relation with maximum entropy models. Indeed, when the features in a maximum entropy model are defined as conjunctions of a feature on observations only and a Kronecker delta of a class label, which is a common practice in NER, the maximum entropy model is equivalent to a logistic regression model (Finkel et al., 2005). Thus the logistic regression method we use for NER is essentially the same as the maximum entropy models used for NER in previous work. To avoid overfitting, a zero mean Gaussian prior on the weights is usually used (Chen and Rosenfeld, 1999; Bender et al., 2003), and a maximum a posterior (MAP) estimator is used to maximize the posterior probability:</Paragraph>
    <Paragraph position="4"> are set uniformly to the same value for all features, because there is in general no additional prior knowledge about the features.</Paragraph>
  </Section>
  <Section position="5" start_page="75" end_page="77" type="metho">
    <SectionTitle>
4 Rank-Based Prior
</SectionTitle>
    <Paragraph position="0"> Instead of using the same s</Paragraph>
    <Paragraph position="2"> for all features, we propose a rank-based non-uniform Gaussian prior on the weights of the features so that more generalizable features get higher prior variances (i.e., low prior strength) and features on the bottom of the list get low prior variances (i.e., high prior strength).</Paragraph>
    <Paragraph position="3"> Since the prior has a zero mean, such a prior would force features on the bottom of the ranked list, which have the least generalizability, to have near-zero weights, but allow more generalizable features to be assigned higher weights during the training stage.</Paragraph>
    <Section position="1" start_page="75" end_page="75" type="sub_section">
      <SectionTitle>
4.1 Transformation Function
</SectionTitle>
      <Paragraph position="0"> We need to find a transformation function h :</Paragraph>
      <Paragraph position="2"> so that we can set s</Paragraph>
      <Paragraph position="4"> in the generalizability-based ranked feature list, as defined in Section 2. We choose the following h function because it has the desired properties as described above:</Paragraph>
      <Paragraph position="6"> where a and b (a, b &gt; 0) are parameters that control the degree of the confidence in the generalizability-based ranked feature list. Note that a corresponds to the prior variance assigned to the top-most feature in the ranked list. When b is small, the prior variance drops rapidly as the rank r increases, giving only a small number of top features high prior variances.</Paragraph>
      <Paragraph position="7"> When b is larger, there will be less discrimination among the features. When b approaches infinity, the prior becomes a uniform prior with the variance set to a for all features. If we set a small threshold t on the variance, then we can derive that at least m =</Paragraph>
      <Paragraph position="9"> features have a prior variance greater than t.</Paragraph>
      <Paragraph position="10"> Thus b is proportional to the logarithm of the number of features that are assigned a variance greater than the threshold t when a is fixed. Figure 1 shows the h function when a is set to 20 and b is set to a set of different values.</Paragraph>
    </Section>
    <Section position="2" start_page="75" end_page="77" type="sub_section">
      <SectionTitle>
4.2 Parameter Setting using Domain-Aware
Validation
</SectionTitle>
      <Paragraph position="0"> We need to set the appropriate values for the parameters a and b. For parameter a, we use the following  simple strategy to obtain an estimation. We first train a logistic regression model on all the training data using a Gaussian prior with a fixed variance (set to 1 in our experiments). We then find the maximum</Paragraph>
      <Paragraph position="2"> in this trained model. Finally we set a = b</Paragraph>
      <Paragraph position="4"> reasoning is that since a is the variance of the prior for the best feature, a is related to the &amp;quot;permissible range&amp;quot; of b for the best feature, and b max gives us a way for adjusting a according to the empirical range</Paragraph>
      <Paragraph position="6"> As we pointed out in Section 4.1, when a is fixed, parameter b controls the number of top features that are given a relatively high prior variance, and hence implicitly controls the number of top features to choose for the classifier to put the most weights on.</Paragraph>
      <Paragraph position="7"> To select an appropriate value of b, we can use a held-out validation set to tune the parameter value b. Here we present a validation strategy that exploits the domain structure in the training data to set the parameter b for a new domain. Note that in regular validation, both the training set and the validation set contain examples from all training domains. As a result, the average performance on the validation set may be dominated by domains in which the NEs are easy to classify. Since our goal is to build a classifier that performs well on new domains, we should pay more attention to hard domains that have lower classification accuracy. We should therefore examine the performance of the classifier on each training domain individually in the validation stage to gain an insight into the appropriate value of b for a new domain, which has an equal chance of being similar to any of the training domains.</Paragraph>
      <Paragraph position="8"> Our domain-aware validation strategy first finds the optimal value of b for each training domain. For</Paragraph>
      <Paragraph position="10"> of the training data belonging to do-</Paragraph>
      <Paragraph position="12"> . Then for each domain D</Paragraph>
      <Paragraph position="14"> , we train a classifier on the training sets of all domains, that is, we train on</Paragraph>
      <Paragraph position="16"> based on the assumption that D  is a mixture of all the training domains. b  is then chosen to be a weighted average of b</Paragraph>
      <Paragraph position="18"> is an even mixture of all training domains.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="77" end_page="78" type="metho">
    <SectionTitle>
5 Empirical Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="77" end_page="78" type="sub_section">
      <SectionTitle>
5.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> We evaluated our domain-aware approach to NER on the problem of gene recognition in biomedical literature. The data we used is from BioCreAtIvE Task 1B (Hirschman et al., 2005). We chose this data set because it contains three subsets of MED-LINE abstracts with gene names from three species (fly, mouse, and yeast), while no other existing annotated NER data set has such explicit domain structure. The original BioCreAtIvE 1B data was not provided with every gene annotated, but for each abstract, a list of genes that were mentioned in the abstract was given. A gene synonym list was also given for each species. We used a simple string matching method with slight relaxation to tag the gene mentions in the abstracts. We took 7500 sentences from each species for our experiments, where half of the sentences contain gene mentions. We further split the 7500 sentences of each species into two sets, 5000 for training and 2500 for testing.</Paragraph>
      <Paragraph position="1"> We conducted three sets of experiments, each combining the 5000-sentence training data of two species as training data, and the 2500-sentence test data of the third species as test data. The 2500-sentence test data of the training species was used for validation. We call these three sets of experi-</Paragraph>
    </Section>
    <Section position="2" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
5.2 Comparison with Baseline Method
</SectionTitle>
      <Paragraph position="0"> Because the data set was generated by our automatic tagging procedure using the given gene lists, there is no previously reported performance on this data set for us to compare with. Therefore, to see whether using the domain structure in the training data can really help the adaptation to new domains, we compared our method with a state-of-the-art baseline method based on logistic regression. It uses a Gaussian prior with zero mean and uniform variance on all model parameters. It also employs 5-fold regular cross validation to pick the optimal variance for the prior. Regular feature selection is also considered in the baseline method, where the features are first ranked according to some criterion, and then cross validation is used to select the top-k features. We tested three popular regular feature ranking methods: feature frequency (F), information gain (IG), and kh  statistic (CHI). These methods were discussed in (Yang and Pedersen, 1997). However, with any of the three feature ranking criteria, cross validation showed that selecting all features gave the best average validation performance. Therefore, the best baseline method which we compare our method with uses all features. We call the baseline method BL.</Paragraph>
      <Paragraph position="1"> In our method, the generalizability-based feature ranking requires a first step of feature ranking within each training domain. While we could also use F, IG or CHI to rank features in each domain, to make our method self-contained, we used the following strategy. We first train a logistic regression model on each domain using a zero-mean Gaussian prior with variance set to 1. Then, features are ranked in decreasing order of the absolute values of their weights. The rationale is that, in general, features with higher weights in the logistic regression model are more important. With this ranking within each training domain, we then use the generalizability-based feature ranking method to combine the m domain-specific rankings. The obtained ranked feature list is used to construct the rank-based prior, where the parameters a and b are set in the way as discussed in Section 4.2. We call our method DOM.</Paragraph>
      <Paragraph position="2"> In Table 1, we show the precision, recall, and F1 measures of our domain-aware method (DOM) and the baseline method (BL) in all three sets of experiments. We see that the domain-aware method out-performs the baseline method in all three cases when F1 is used as the primary performance measure. In F+Y=M and M+Y=F, both precision and recall are also improved over the baseline method.</Paragraph>
      <Paragraph position="3">  Note that the absolute performance shown in Table 1 is lower than the state-of-the-art performance of gene recognition (Finkel et al., 2005).</Paragraph>
      <Paragraph position="4">  One reason is that we explicitly excluded the test domain from the training data, while most previous work on gene recognition was conducted on a test set drawn from the same collection as the training data. Another reason is that we used simple string matching to generate the data set, which introduced noise to the data because gene names often have irregular lexical variants.</Paragraph>
      <Paragraph position="5">  ing and generalizability-based feature ranking on M+Y=F To further understand how our method improved the performance, we compared the generalizability-based feature ranking method with the three regular feature ranking methods, F, IG, and CHI, that were used in the baseline method. To make fair comparison, for the regular feature ranking methods, we also used the rank-based prior transformation as described in Section 4 to incorporate the preference for top-ranked features. Figure 2, Figure 3 and Figure 4 show the performance of different feature ranking methods in the three sets of experiments as the parameter b for the rank-based prior changes. As we pointed out in Section 4, b is proportional to the logarithm of the number of &amp;quot;effective features&amp;quot;. From the figures, we clearly see that the curve for the generalizability-based ranking method DOM is always above the curves of the other methods, indicating that when the same amount of top features are being emphasized by the prior, the features selected by DOM give better performance on a new domain than the features selected by the other methods. This suggests that the top-ranked features in DOM are indeed more suitable for adaptation to new domains than the top features ranked by the other methods.</Paragraph>
      <Paragraph position="6"> The figures also show that the ranking method DOM achieved better performance than the baseline over a wide range of b values, especially in F+Y=M and M+Y=F, whereas for methods F, IG and CHI, the performance quickly converged to the baseline performance as b increased.</Paragraph>
      <Paragraph position="7"> It is interesting to note the comparison between F and IG (or CHI). In general, when the test data is similar to the training data, IG (or CHI) is advantageous over F (Yang and Pedersen, 1997). However, in this case when the test domain is different from the training domains, F shows advantages for adaptation. A possible explanation is that frequent features are in general less likely to be domain-specific, and therefore feature frequency can also be used as a criterion to select generalizable features and to filter out domain-specific features, although it is still not as effective as the method we proposed.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="78" end_page="80" type="metho">
    <SectionTitle>
6 Related Work
</SectionTitle>
    <Paragraph position="0"> The NER problem has been extensively studied in the NLP community. Most existing work has focused on supervised learning approaches, employing models such as HMMs (Zhou and Su, 2002), MEMMs (Bender et al., 2003; Finkel et al., 2005), and CRFs (McCallum and Li, 2003). Collins and Singer (1999) proposed an unsupervised method for named entity classification based on the idea of cotraining. Ando and Zhang (2005) proposed a semi-supervised learning method to exploit unlabeled data for building more robust NER systems. In all these studies, the evaluation is conducted on unlabeled data similar to the labeled data.</Paragraph>
    <Paragraph position="1"> Recently there have been some studies on adapting NER systems to new domains employing techniques such as active learning and semi-supervised learning (Shen et al., 2004; Mohit and Hwa, 2005),  or incorporating external lexical knowledge (Ciaramita and Altun, 2005). However, there has not been any study on exploiting the domain structure contained in the training examples themselves to build generalizable NER systems. We focus on the domain structure in the training data to build a classifier that relies more on features generalizable across different domains to avoid overfitting the training domains. As our method is orthogonal to most of the aforementioned work, they can be combined to further improve the performance.</Paragraph>
  </Section>
class="xml-element"></Paper>