<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2918">
  <Title>Using Gazetteers in Discriminative Information Extraction</Title>
  <Section position="5" start_page="133" end_page="133" type="metho">
    <SectionTitle>
3 Previous Use of Gazetteers
</SectionTitle>
    <Paragraph position="0"> Gazetteers have been widely used in a variety of information extraction systems, including both rule-based systems and statistical models. In addition to lists of people names, locations, etc., recent work in the biomedical domain has utilised gazetteers of biological and genetic entities such as gene names (Finkel et al., 2005; McDonald and Pereira, 2005).</Paragraph>
    <Paragraph position="1"> In general gazetteers are thought to provide a useful source of external knowledge that is helpful when an entity cannot be identi ed from knowledge contained solely within the data set used for training.</Paragraph>
    <Paragraph position="2"> However, some research has questioned the usefulness of gazetteers (Krupka and Hausman, 1998).</Paragraph>
    <Paragraph position="3"> Other work has supported the use of gazetteers in general but has found that lists of only moderate size are suf cient to provide most of the bene t (Mikheev et al., 1999). Therefore, to date the effective use of gazetteers for information extraction has in general been regarded as a black art . In this paper we explain some of the likely reasons for these ndings, and propose ways to more effectively handle gazetteers when they are used by maxent-style models.</Paragraph>
    <Paragraph position="4"> In work developed independently and in parallel to the work presented here, Sutton et al. (2006) identify general problems with gazetteer features and propose a solution similar to ours. They present results on NP-chunking in addition to NER, and provide a slightly more general approach. By contrast, we motivate the problem more thoroughly through analysis of the actual errors observed and through consideration of the success of other candidate solutions, such as traditional regularisation over feature subsets.</Paragraph>
  </Section>
  <Section position="6" start_page="133" end_page="578" type="metho">
    <SectionTitle>
4 Our Experiments
</SectionTitle>
    <Paragraph position="0"> In this section we describe our experimental setup, and provide results for the baseline models.</Paragraph>
    <Section position="1" start_page="133" end_page="578" type="sub_section">
      <SectionTitle>
4.1 Task and Dataset
</SectionTitle>
      <Paragraph position="0"> Named entity recognition (NER) involves the identi cation of the location and type of pre-de ned entities within a sentence. The CRF is presented with a set of sentences and must label each word so as to indicate whether the word appears outside an entity, at the beginning of an entity of a certain type or  within the continuation of an entity of a certain type. Our results are reported on the CoNLL-2003 shared task English dataset (Sang and Meulder, 2003). For this dataset the entity types are: persons (PER), locations (LOC), organisations (ORG) and miscellaneous (MISC). The training set consists  tokens and the test set consists of 3</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="2" start_page="578" end_page="578" type="sub_section">
      <SectionTitle>
4.2 Gazetteers
</SectionTitle>
      <Paragraph position="0"> We employ a total of seven gazetteers for our experiments. These cover names of people, places and organisations. Speci cally, we have gazetteers</Paragraph>
    </Section>
    <Section position="3" start_page="578" end_page="578" type="sub_section">
      <SectionTitle>
4.3 Feature set
</SectionTitle>
      <Paragraph position="0"> Our experiments are centred around two CRF models, one with and one without gazetteer features.</Paragraph>
      <Paragraph position="1"> The model without gazetteer features, which we call standard, comprises features de ned in a window of ve words around the current word. These include features encoding n-grams of words and POS tags, and features encoding orthographic properties of the current word. The orthographic features are based on those found in (Curran and Clark, 2003).</Paragraph>
      <Paragraph position="2"> Examples include whether the current word is capitalised, is an initial, contains a digit, contains punctuation, etc. In total there are 450  345 features in the standard model.</Paragraph>
      <Paragraph position="3"> We call the second model, with gazetteer features, standard+g. This includes all the features contained in the standard model as well as 8  329 gazetteer features. Our gazetteer features are a typical way to represent gazetteer information in maxent-style models. They are divided into two categories: unlexicalised and lexicalised. The unlexicalised features model the dependency between a word's presence in a gazetteer and its NER label, irrespective of the word's identity. The lexicalised features, on the other hand, include the word's identity and so provide more re ned word-speci c modelling of the  in the standard+g model.</Paragraph>
    </Section>
    <Section position="4" start_page="578" end_page="578" type="sub_section">
      <SectionTitle>
4.4 Baseline Results
</SectionTitle>
      <Paragraph position="0"> Table 1 gives F scores for the standard and standard+g models. Development set scores are included for completeness, and are referred to later in the paper. We show results for both unregularised and regularised models. The regularised models are trained with a zero-mean Gaussian prior, with the variance set using the development data.</Paragraph>
      <Paragraph position="1"> We see that, as expected, the presence of the gazetteer features allows standard+g to outperform standard, for both the unregularised and regularised models. To test signi cance, we use McNemar's matched-pairs test (Gillick and Cox, 1989) on point-wise labelling errors. In each case, the standard+g model outperforms the standard model at a significance level of p a1 0 a2 02. However, these results camou age the fact that the gazetteer features introduce some negative effects, which we explore in the next section. As such, the real bene t of including the gazetteer features in standard+g is not fully realised. null</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="578" end_page="578" type="metho">
    <SectionTitle>
5 Problems with Gazetteer Features
</SectionTitle>
    <Paragraph position="0"> We identify problems with the use of gazetteer features by considering test set labelling errors for both standard and standard+g. We use regularised models here as an illustration. Table 2 shows the 1Many gazetteer entries involve strings of words where the individual words in the string do not appear in the gazetteer in isolation. For this reason the lexicalised gazetteer features are not simply determined by the word identity features.</Paragraph>
    <Paragraph position="1">  number of sites (a site being a particular word at a particular position in a sentence) where labellings have improved, worsened or remained unchanged with respect to the gold-standard labelling with the addition of the gazetteer features. For example, the value in the top-left cell is the number of sites where both the standard and standard+g label words correctly. null The most interesting cell in the table is the top-right one, which represents sites where standard is correctly labelling words but, with the addition of the gazetteer features, standard+g mislabels them.</Paragraph>
    <Paragraph position="2"> At these sites, the addition of the gazetteer features actually worsens things. How well, then, could the standard+g model do if it could somehow reduce the number of errors in the top-right cell? In fact, if it had correctly labelled those sites, a signi cantly higher test set F score of 90 a2 36% would have been obtained. This potential upside suggests much could be gained from investigating ways of correcting the errors in the top-right cell. It is not clear whether there exists any approach that could correct all the errors in the top-right cell while simultaneously maintaining the state in the other cells, but approaches that are able to correct at least some of the errors should prove worthwhile.</Paragraph>
    <Paragraph position="3"> On inspection of the sites where errors in the top-right cell occur, we observe that some of the errors occur in sequences where no words are in any gazetteer, so no gazetteer features are active for any possible labelling of these sequences. In other cases, the errors occur at sites where some of the gazetteer features appear to have dictated the label, but have made an incorrect decision. As a result of these observations, we classify the errors from the top-right cell of Table 2 into two types: type A and type B.</Paragraph>
    <Section position="1" start_page="578" end_page="578" type="sub_section">
      <SectionTitle>
5.1 Type A Errors
</SectionTitle>
      <Paragraph position="0"> We call type A errors those errors that occur at sites where gazetteer features seem to have been directly responsible for the mislabelling. In these cases the gazetteer features effectively over-rule the other features in the model causing a mislabelling where the standard model, without the gazetteer features, correctly labels the word.</Paragraph>
      <Paragraph position="1"> An example of a type A error is given in the sentence extract below: about/O Healy/I-LOC This is the labelling given by standard+g. The correct label for Healy here is I-PER. The standard model is able to decode this correctly as Healy appears in the training data with the I-PER label. The reason for the mislabelling by the standard+g model is that Healyappears in both the gazetteer of place names and the gazetteer of person surnames.</Paragraph>
      <Paragraph position="2"> The feature encoding the gazetteer of place names with the I-LOC label has a l value of 4 a2 20, while the feature encoding the gazetteer of surnames with the I-PER label has a l value of 1 a2 96, and the feature encoding the word Healy with the I-PER label has a l value of 0 a2 25. Although other features both at the word Healyand at other sites in the sentence contribute to the labelling of Healy, the in uence of the rst feature above dominates. So in this case the addition of the gazetteer features has confused things.</Paragraph>
    </Section>
    <Section position="2" start_page="578" end_page="578" type="sub_section">
      <SectionTitle>
5.2 Type B Errors
</SectionTitle>
      <Paragraph position="0"> We call type B errors those errors that occur at sites where the gazetteer features seem to have been only indirectly responsible for the mislabelling. In these cases the mislabelling appears to be more attributable to the non-gazetteer features, which are in some sense less expressive after being trained with the gazetteer features. Consequently, they are less able to decode words that they could previously label correctly.</Paragraph>
      <Paragraph position="1"> An example of a type B error is given in the sentence extract below: Chanderpaul/O was/O This is the labelling given by standard+g. The correct labelling, given by standard, is I-PER for Chanderpaul. In this case no words in the sentence (including the part not shown) are present in any of the gazetteers so no gazetteer features are active for any labelling of the sentence. Consequently, the gazetteer features do not contribute at all to the labelling decision. Non-gazetteer features in standard+g are, however, unable to nd the correct labelling for Chanderpaul when they previously could in the standard model.</Paragraph>
      <Paragraph position="2"> For both type A and type B errors it is clear that the gazetteer features in standard+g are in some  sense too powerful while the non-gazetteers features have become too weak . The question, then, is: can we train all the features in the model in a more sophisticated way so as to correct for these effects? null</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="578" end_page="578" type="metho">
    <SectionTitle>
6 Feature Dependent Regularisation
</SectionTitle>
    <Paragraph position="0"> One interpretation of the ndings of our error analysis above is that the addition of the gazetteer features to the model is having an implicit over-regularising effect on the other features. Therefore, is it possible to adjust for this effect through more careful explicit regularisation using a prior? Can we directly regularise the gazetteer features more heavily and the non-gazetteer features less? We investigate this possibility in this section.</Paragraph>
    <Paragraph position="1"> The standard+g model is regularised by tting a single Gaussian variance hyperparameter across all features. The optimal value for this single hyperparameter is 45. We now relax this single constraint by allocating a separate variance hyperparameter to different feature subsets, one for the gazetteer features (sgaz) and one for the non-gazetteer features (snon-gaz). The hope is that the differing sub-sets of features are best regularised using different prior hyperparameters. This is a natural approach within most standardly formulated priors for log-linear models. Clearly, by doing this we increase the search space signi cantly. In order to make the search manageable, we constrain ourselves to three scenarios: (1) Hold snon-gaz at 45, and regularise the gazetteer features a little more by reducing sgaz. (2) Hold sgaz at 45, and regularise the non-gazetteer features a little less by increasing snon-gaz. (3) Simultaneously regularise the gazetteer features a little more than at the single variance optimum, and regularise the non-gazetteer features a little less.</Paragraph>
    <Paragraph position="2"> Table 3 gives representative development set F scores for each of these three scenarios, with each scenario separated by a horizontal dividing line. We see that in general the results do not differ signi cantly from that of the single variance optimum. We conjecture that the reason for this is that the regularising effect of the gazetteer features on the non-gazetteer features is due to relatively subtle interactions during training that relate to the dependencies the features encode and how these dependen- null by different amounts with a Gaussian prior does not directly address these interactions but instead just rather crudely penalises the magnitude of the parameter values of different feature sets to different degrees. Indeed this is true for any standardly formulated prior. It seems therefore that any solution to the regularising problem should come through more explicit restricting or removing of the interactions between gazetteer and non-gazetteer features during training.</Paragraph>
  </Section>
  <Section position="9" start_page="578" end_page="578" type="metho">
    <SectionTitle>
7 Combining Separately Trained Models
</SectionTitle>
    <Paragraph position="0"> We may remove interactions between gazetteer and non-gazetteer features entirely by quarantining the gazetteer features and training them in a separate model. This allows the non-gazetteer features to be protected from the over-regularising effect of the gazetteer features. In order to decode taking advantage of the information contained in both models, we must combine the models in some way. To do this we use a logarithmic opinion pool (LOP) (Smith et al., 2005). This is similar to a mixture model, but uses a weighted multiplicative combination of models rather than a weighted additive combination.</Paragraph>
    <Paragraph position="1"> Given models pa and per-model weights wa, the LOP distribution is de ned by:</Paragraph>
    <Paragraph position="3"> with wa a1 0 and [?]a wa a3 1, and where ZLOP a0 oa2 is a normalising function. The weight wa encodes the dependence of the LOP on model a. In the case of a CRF, the LOP itself is a CRF and so decoding is no more complex than for standard CRF decoding.</Paragraph>
    <Paragraph position="4"> In order to use a LOP for decoding we must set the weights wa in the weighted product. In (Smith et  al., 2005) a procedure is described whereby the (normalised) weights are explicitly trained. In this paper, however, we only construct LOPs consisting of two models in each case, one model with gazetteer features and one without. We therefore do not require the weight training procedure as we can easily t the two weights (only one of which is free) using the development set.</Paragraph>
    <Paragraph position="5"> To construct models for the gazetteer and non-gazetteer features we rst partition the feature set of the standard+g model into the subsets outlined in Table 4. The simple structural features model labellabel and label-word dependencies, while the advanced structural features include these features as well as those modelling label-label-word conjunctions. The simple orthographic features measure properties of a word such as capitalisation, presence of a digit, etc., while the advanced orthographic properties model the occurrence of pre xes and sufxes of varying length.</Paragraph>
    <Paragraph position="6"> We create and train different models for the gazetteer features by adding different feature sub-sets to the gazetteer features. We regularise these models in the usual way using a Gaussian prior. In each case we then combine these models with the standard model and decode under a LOP.</Paragraph>
    <Paragraph position="7"> Table 5 gives results for LOP decoding for the different model pairs. Results for the standard+g model are included in the rst row for comparison.</Paragraph>
    <Paragraph position="8"> For each LOP the hyphen separates the two models comprising the LOP. So, for example, in the second row of the table we combine the gazetteer features with simple structural features in a model, train and decode with the standard model using a LOP. The simple structural features are included so as to provide some basic support to the gazetteer features. We see from Table 5 that the rst two LOPs signi cantly outperform the regularised standard+g  model (at a signi cance level of p a1 0 a2 01, on both the test and development sets). By training the gazetteer features separately we have avoided their over-regularising effect on the non-gazetteer features. This relies on training the gazetteer features with a relatively small set of other features. This is illustrated as we read down the table, below the top two rows. As more features are added to the model containing the gazetteer features we obtain decreasing test set F scores because the advantage created from separate training of the features is increasingly lost.</Paragraph>
    <Paragraph position="9"> Table 6 gives the corresponding weights for the LOPs in Table 5, which are set using the development data. We see that in every case the LOP allocates a smaller weight to the gazetteer features model than the non-gazetteer features model and in doing so restricts the in uence that the gazetteer features have in the LOP's labelling decisions.</Paragraph>
    <Paragraph position="10"> Table 7, similar to Table 2 earlier, shows test set labelling errors for the standard model and one of the LOPs. We take the s2g-standard LOP here for illustration. We see from the table that the number of errors in the top-right cell shows a reduction of 29% over the corresponding value in Table 2. We have therefore reduced the number errors of the type we were targeting with our approach. The approach has also had the effect of reducing the number of errors in the bottom-right cell, which further improves model accuracy.</Paragraph>
    <Paragraph position="11"> All the LOPs in Table 5 contain regularised mod- null els. Table 8 gives test set F scores for the corresponding LOPs constructed from unregularised models. As we would expect, the scores are lower than those in Table 5. However, it is interesting to note that the s1g-standard LOP still outperforms the regularised standard+g model.</Paragraph>
    <Paragraph position="12"> In summary, by training the gazetteer features and non-gazetteer features in separate models and decoding using a LOP, we are able to overcome the problems described in earlier sections and can achieve much higher accuracy. This shows that successfully deploying gazetteer features within maxent-style models should involve careful consideration of restrictions on how features interact with each other, rather than simply considering the absolute values of feature parameters.</Paragraph>
  </Section>
  <Section position="10" start_page="578" end_page="578" type="metho">
    <SectionTitle>
8 Gazetteer-Like Features
</SectionTitle>
    <Paragraph position="0"> So far our discussion has focused on gazetteer features. However, we would expect that the problems we have described and dealt with in the last section also occur with other types of features that have similar properties to gazetteer features. By applying similar treatment to these features during training we may be able harness their usefulness to a greater degree than is currently the case when training in a single model. So how can we identify these features? The task of identifying the optimal partitioning for creation of models in the previous section is in general a hard problem as it relies on clustering the features based on their explanatory power relative to all other clusters. It may be possible, however, to devise some heuristics that approximately correspond to the salient properties of gazetteer features (with respect to the clustering) and which can then be used to identify other features that have these properties.</Paragraph>
    <Paragraph position="1"> In this section we consider three such heuristics. All of these heuristics are motivated by the observation that gazetteer features are both highly discriminative and generally very sparse.</Paragraph>
    <Paragraph position="2"> Family Singleton Features We de ne a feature family as a set of features that have the same conjunction of predicates de ned on the observations.</Paragraph>
    <Paragraph position="3"> Hence they differ from each other only in the NER label that they encode. Family singleton features are features that have a count of 1 in the training data when all other members of that feature family have zero counts. These features have a avour of gazetteer features in that they represent the fact that the conjunction of observation predicates they encode is highly predictive of the corresponding NER label, and that they are also very sparse.</Paragraph>
    <Paragraph position="4"> Family n-ton Features These are features that have a count of n (greater than 1) in the training data when all other members of that feature family have zero counts. They are similar to family singleton features, but exhibit gazetteer-like properties less and less as the value of n is increased because a larger value of n represents less sparsity.</Paragraph>
    <Paragraph position="5"> Loner Features These are features which occur with a low mean number of other features in the training data. They are similar to gazetteer features in that, at the points where they occur, they are in some sense being relied upon more than most features to explain the data. To create loner feature sets we rank all features in the standard+g model based on the mean number of other features they are observed with in the training data, then we take subsets of increasing size. We present results for subsets of size 500, 1000, 5000 and 10000.</Paragraph>
    <Paragraph position="6"> For each of these categories of features we add simple structural features (the s1 set from earlier), to provide basic structural support, and then train a regularised model. We also train a regularised model consisting of all features in standard+g except the features from the category in question. We decode these model pairs under a LOP as described earlier.</Paragraph>
    <Paragraph position="7"> Table 9 gives test set F scores for LOPs created from each of the categories of features above  (with abbreviated names derived from the category names). The results show that for the family singleton features and each of the loner feature sets we obtain LOPs that signi cantly outperform the regularised standard+g model (p a1 0 a2 0002 in every case). The family n-ton features' LOP does not do as well, but that is probably due to the fact that some of the features in this set have a large value of n and so behave much less like gazetteer features.</Paragraph>
    <Paragraph position="8"> In summary, we obtain the same pattern of results using our quarantined training and LOP decoding method with these categories of features that we do with the gazetteer features. We conclude that the problems with gazetteer features that we have identi ed in this paper are exhibited by general discriminative features with gazetteer feature-like properties, and our method is also successful with these more general features. Clearly, the heuristics that we have devised in this section are very simple, and it is likely that with more careful engineering better feature partitions can be found.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML