<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0435">
  <Title>Memory-Based Named Entity Recognition using Unannotated Data</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Memory Based Learning
</SectionTitle>
    <Paragraph position="0"> We used Timbl (Daelemans et al., 2002), a memory-based learner. When presented with training instances, the learner stores them all, and then classifies new data on the basis of its k nearest neighbours in the training set.</Paragraph>
    <Paragraph position="1"> Before classification, the learner assigns a weight to each feature, marking its importance for the learning task. Features with higher weights are treated as more important in classification than those with lower weights.</Paragraph>
    <Paragraph position="2"> Timbl has some parameters which can be adjusted in order to improve learning. For the NER system described in this paper, we varied the parameters k and m. k is the number of nearest neighbours Timbl looks at. m determines the feature metrics, i.e. the importance weights given to each feature, and the way similarity between values of the same feature is computed. This parameter can be adjusted separately for each feature. The two metrics used were weighted overlap and modified value difference.</Paragraph>
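    <Paragraph position="3"> The classification scheme described above can be illustrated with a minimal sketch of nearest-neighbour classification under a weighted overlap metric. This is not Timbl's implementation: the function names are hypothetical, information gain is assumed as the feature weight, and ties are broken by training order.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain_weights(instances, labels):
    """One weight per feature: class entropy minus the expected
    class entropy after splitting the data on that feature."""
    base = entropy(labels)
    weights = []
    for i in range(len(instances[0])):
        by_value = defaultdict(list)
        for inst, label in zip(instances, labels):
            by_value[inst[i]].append(label)
        remainder = sum(len(part) / len(labels) * entropy(part) for part in by_value.values())
        weights.append(base - remainder)
    return weights


def weighted_overlap(a, b, weights):
    """Distance: the summed weights of the features on which a and b differ."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def classify(instance, train_x, train_y, weights, k=1):
    """Majority vote among the k stored instances closest to `instance`."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: weighted_overlap(instance, pair[0], weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Storing all training instances and deferring the work to classification time is what makes the learner memory-based. Modified value difference would replace the exact-match test in weighted_overlap with a graded distance between feature values.</Paragraph>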
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 System 1: Description
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Features
</SectionTitle>
      <Paragraph position="0"> For the basic English system, 37 features were used. The first seven features were the lowercase versions of the focus word and of a context of three words to its left and three to its right. The next seven features were the part-of-speech tags of the same seven words. Then followed seven features indicating, for each of the seven words, whether it was capitalized. The next six features represented the first and last three letters of the word to be classified.</Paragraph>
      <Paragraph position="1"> These features were included in order to make it possible for the memory-based learner to use word-internal information. Frequent prefixes and suffixes can thus be used to learn names. Finally, ten features indicated if the focus word appears in any of the gazetteers used for this task.</Paragraph>
      <Paragraph position="2"> These gazetteers are discussed in more detail in the next section.</Paragraph>
      <Paragraph position="3"> For the German system, the same features were used, with an additional seven features: for each word in the seven-word window, the stem of the word was also included.</Paragraph>
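      <Paragraph position="4"> The 37-feature layout of the English system can be sketched as follows. The function names, the padding symbol and the feature values ("cap", "in", and so on) are hypothetical, and two details are assumptions: each of the six letter features holds a single character, and each gazetteer is a set of word forms.

```python
PAD = "_"  # hypothetical filler for positions outside the sentence or word


def window(seq, i, size=3):
    """Items from i-size to i+size, padded where the window leaves the sequence."""
    return [seq[j] if j in range(len(seq)) else PAD
            for j in range(i - size, i + size + 1)]


def build_instance(words, pos_tags, i, gazetteers):
    """Return the 37 features for the word at index i.

    `gazetteers` is assumed to be a list of ten sets of word forms.
    """
    context = window(words, i)
    focus = words[i]
    features = [w.lower() for w in context]                               # 7 lowercased words
    features += window(pos_tags, i)                                       # 7 part-of-speech tags
    features += ["cap" if w[:1].isupper() else "nocap" for w in context]  # 7 capitalization flags
    features += list(focus[:3].ljust(3, PAD))                             # first three letters
    features += list(focus[-3:].rjust(3, PAD))                            # last three letters
    features += ["in" if focus in g else "out" for g in gazetteers]       # 10 gazetteer flags
    return features
```

For the German system, seven stem features for the same window would be appended in the same way, given a stemmer.</Paragraph>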
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Gazetteers
</SectionTitle>
      <Paragraph position="0"> Ten gazetteers were used to provide features. These gazetteers listed names of all four kinds, as well as words which often appear inside names (such as International (for organization names) and de (for person names)).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Type-token generalization
</SectionTitle>
      <Paragraph position="0"> A module was created to generalize NE tags from types to tokens. It is a simple program which assumes that if two capitalized words have the same form, they will also have the same NE tag. This is potentially problematic, because many words can be used either as part of a name or not, and in this case it indeed proved to be unhelpful.</Paragraph>
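      <Paragraph position="1"> A minimal sketch of such a generalization step is given below. The text only states that identical capitalized forms are assumed to share a tag; the function name, the plain tag labels and the choice of the most frequent predicted tag per form are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def generalize_tags(tokens, tags, outside="O"):
    """Give every capitalized token the most frequent entity tag
    predicted for its word form anywhere in the text."""
    votes = defaultdict(Counter)
    for token, tag in zip(tokens, tags):
        if token[:1].isupper() and tag != outside:
            votes[token][tag] += 1
    return [
        votes[token].most_common(1)[0][0] if token in votes else tag
        for token, tag in zip(tokens, tags)
    ]
```

The sketch also shows the risk: a single wrong prediction for an ambiguous form is copied to every capitalized occurrence of that form.</Paragraph>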
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 System 2: Description
</SectionTitle>
    <Paragraph position="0"> For the extended English system, five more features were added to each instance. The first four indicated whether the focus word was part of a named entity found in a list of named entities derived from the unannotated data. The fifth indicated whether the focus word is more often capitalized or uncapitalized in the unannotated data.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Gazetteers extracted from conjunctions
</SectionTitle>
      <Paragraph position="0"> First, potential names were identified in the unannotated data. This was done using the gazetteers which were used for the first system, and a simple grammar of names.</Paragraph>
      <Paragraph position="1"> Then we looked for all conjunctions of capitalized strings in the unannotated data. If one of the strings was tagged in its entirety as being of one NE type, and no other strings in the conjunction had another NE tag, it was hypothesized that all strings in this conjunction were of the same type. All strings would then be stored in a gazetteer of NEs of that type.</Paragraph>
      <Paragraph position="2"> The next step was to add four more features to the training and test sets of the NE system. In the training and test texts, strings of capitalized words were matched with the strings in the newly made gazetteers. All instances were enlarged by four binary features, one for each type of NE (L, M, O, P). These features are on when the focus word (and its context in the case of a longer name) matches a string in the associated gazetteer, and off when it does not.</Paragraph>
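      <Paragraph position="3"> The conjunction heuristic can be sketched as follows, assuming the conjunctions have already been extracted as lists of capitalized strings and that the first pass has produced a mapping from tagged strings to NE types. All names are hypothetical; only the type labels L, M, O and P come from the description above.

```python
from collections import defaultdict


def grow_gazetteers(conjunctions, seed_tags):
    """Build one gazetteer per NE type from conjunctions.

    conjunctions: lists of capitalized strings joined by a conjunction.
    seed_tags: dict from a string to the NE type assigned in the first pass.
    """
    gazetteers = defaultdict(set)
    for members in conjunctions:
        types = {seed_tags[m] for m in members if m in seed_tags}
        if len(types) == 1:  # at least one tagged string and no conflicting tag
            (ne_type,) = types
            gazetteers[ne_type].update(members)
    return gazetteers


def gazetteer_features(gazetteers, focus_string, ne_types=("L", "M", "O", "P")):
    """Four binary features, one per NE type."""
    return [focus_string in gazetteers.get(t, ()) for t in ne_types]
```

A conjunction with no tagged member, or with members of two different types, contributes nothing, which keeps the extracted lists conservative.</Paragraph>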
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Ratio of capitalized to non-capitalized occurrence of tokens
</SectionTitle>
      <Paragraph position="0"> A last feature added to all instances indicated whether the focus word (the word to be classified) appears more often capitalized or uncapitalized in the unannotated corpus. This approach was used earlier by Collins (2002). To build this feature, a list was made of all word forms in the corpus, converted to lowercase, together with the ratio of capitalized to uncapitalized occurrences of each. The extra feature was binary: on if a word appears more often capitalized than not, and off otherwise.</Paragraph>
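      <Paragraph position="1"> A minimal sketch of this feature, assuming the unannotated corpus is available as a flat list of tokens; the function name is hypothetical, and comparing the two counts directly stands in for the stored ratio.

```python
from collections import Counter


def capitalization_feature(corpus_tokens):
    """Map each lowercased word form to True if it occurs
    capitalized more often than uncapitalized in the corpus."""
    capitalized = Counter()
    total = Counter()
    for token in corpus_tokens:
        form = token.lower()
        total[form] += 1
        if token[:1].isupper():
            capitalized[form] += 1
    return {form: capitalized[form] > total[form] - capitalized[form]
            for form in total}
```

Word forms that never occur in the unannotated corpus are absent from the mapping; treating them as off is one possible choice.</Paragraph>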
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 System 1: Discussion of results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Role of gazetteers
</SectionTitle>
      <Paragraph position="0"> Two experiments were run to assess the importance of the gazetteers in this experiment: the first used only the word to be classified and its context, the second used binary features indicating inclusion in gazetteers, as well as the features used in the first experiment. Perhaps surprisingly, the English system did worse when gazetteer information was used. This was true using the default parameter settings, and also after (limited) separate parameter optimization. The German system did slightly better on the development data when gazetteers were used.</Paragraph>
      <Paragraph position="1"> The difference between the English and German systems is very surprising, as the lists were not adjusted to include extra German names. They contain mainly English and Dutch names, as a result of previous work on Dutch and English. In order to find an explanation, we looked at the performance (not optimized) of the lists on their own, not using any context or word-internal information at all. The result did not make things at all clearer: the precision of the lists on the German data was striking, even more so than on the English data.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Type-token generalization
</SectionTitle>
      <Paragraph position="0"> Type-token generalization was attempted only on the English data. The intuition behind this approach is that a memory-based learner may recognize a name from its context, but it will not generalize the classification to other tokens of the same type. However, a concern is that mistakes will be introduced by generalizing ambiguous words to the wrong type, and by repeating mistakes which would otherwise occur only sporadically. In the end, introducing generalization did not make much of a difference. While precision declined marginally (two more phrases were incorrectly tagged as names), recall was unaffected.</Paragraph>
      <Paragraph position="1"> The results in Table 2 were derived using Timbl with default parameters. The lack of optimization explains the low result even without generalization.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Parameter optimization and feature selection
</SectionTitle>
      <Paragraph position="0"> Parameter optimization was used both for system 1 and for system 2. This was combined with limited feature selection. The difference that feature selection can make is already evident from the results above, and is shown again in the rest of the paper. Parameter optimization can have a major effect on the performance of machine learning systems in general, and of Timbl in particular, as can be seen in Table 3.</Paragraph>
      <Paragraph position="1"> As was shown by Daelemans and Hoste (2002), parameter optimization and feature selection interact heavily in machine learning: optimizing them separately leads to results inferior to those of interleaved optimization. Different parameter settings might be best for different feature selections, and vice versa. It would therefore be best to optimize both at the same time, treating feature selection and parameter optimization together as one search space. This was done to a very limited extent for this problem, but because of the time needed for each experiment, a full search of the solution space was impossible.</Paragraph>
      <Paragraph position="2"> Another restriction is that not all parameters of the learner were optimized, again due to time constraints. Only the two that were found to have a large effect were varied. These are k, the number of nearest neighbours taken into account when classifying a new instance, and m, the feature metric, which was toggled between weighted overlap and modified value difference.</Paragraph>
      <Paragraph position="3"> The results shown in Table 3 are those on the consistently best feature set found, i.e. the one using all information minus gazetteers.</Paragraph>
      <Paragraph position="4"> On the German data, parameter optimization and feature selection were also found to be beneficial, but optimization had to be cut short due to time constraints.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 System 2: Discussion of results
</SectionTitle>
    <Paragraph position="0"> In this system, extra information is added to the training set in the following way: the number of instances in the training set remains the same, but the number of features for each instance is increased. The information for the extra features is found in the unannotated data, so this should bring the benefit of using this extra information source. At the same time, only the hand-tagged training set is used, which means that no extra noise is introduced into the training set.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Gazetteers extracted from conjunctions
</SectionTitle>
      <Paragraph position="0"> In this step, four new features were added to each instance in the training and test sets, one for each type of NE.</Paragraph>
      <Paragraph position="1"> Even though gazetteers were already in use, we extracted new gazetteers from the unannotated data. The hope was that these gazetteers would be more useful for this particular task, as they would be corpus-specific. The gazetteers which were used originally, and which did not improve performance, were mainly taken off the internet, and partially hand-crafted. This means that they are general-purpose gazetteers. Also, they were a mixture of Dutch and English names. The new gazetteers were only English, and only included those names which were found in the Reuters corpus.</Paragraph>
      <Paragraph position="2"> Once the gazetteers were extracted, their entries were matched against the text in the training data. When a string of words in the training data matched a name, this would be reflected in the new features. For example, if New York was found both in the locations gazetteer and in the training set, then both New and York would receive a feature value Ltag (for location tag) for the newly added location feature. The results in Table 4 show that this strategy was successful.</Paragraph>
      <Paragraph position="3"> The results were found using Timbl with default settings.</Paragraph>
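      <Paragraph position="4"> The matching step described above, in which both New and York receive the value Ltag, can be sketched as follows. The function name, the off value and the maximum name length are hypothetical, and preferring the longest match is an assumption.

```python
def mark_matches(tokens, gazetteer, on="Ltag", off="-", max_len=5):
    """Return one feature value per token: `on` for tokens inside a
    gazetteer match, `off` for all other tokens."""
    values = [off] * len(tokens)
    for start in range(len(tokens)):
        # Try the longest candidate first, never running past the last token.
        for end in range(min(len(tokens), start + max_len), start, -1):
            if " ".join(tokens[start:end]) in gazetteer:
                values[start:end] = [on] * (end - start)
                break
    return values
```

The same function would be called once per gazetteer, with a different on value for each NE type, to fill the four new features.</Paragraph>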
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Ratio of capitalized to non-capitalized occurrence of tokens
</SectionTitle>
      <Paragraph position="0"> Next, another feature was added to the training and test instances. This is another binary feature, and it indicates whether the focus word of the instance is found more often in its capitalized form or in its non-capitalized form. This feature can help the process of NER in different ways. One of them is the identification of sentence-initial words. These are always capitalized in English, but if they tend to appear uncapitalized more often, they are probably not names. Another way it can help is in finding words which are sometimes names and sometimes ordinary words (e.g. Apple). They should not be tagged as names if the uncapitalized version occurs more frequently.</Paragraph>
      <Paragraph position="1"> This approach was also successful. Results shown in Table 5 were once again obtained by using Timbl with default settings.</Paragraph>
      <Paragraph position="2">
English devel.   Precision  Recall   Fβ=1
No cap. info     75.88%     82.88%   79.22
With cap. info   77.18%     84.20%   80.54
</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Combination of conjunction lists and capitalization information
</SectionTitle>
      <Paragraph position="0"> Finally, all features were combined, and a number of optimization and (limited) feature selection runs were executed. The best run found used all five of the extra features derived from the unannotated data. This is good news, because it means that using unannotated data can help to improve NER of English.</Paragraph>
      <Paragraph position="1"> Both results shown in Table 6 are those of the best runs after optimization.</Paragraph>
    </Section>
  </Section>
</Paper>