<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2005">
  <Title>Exploiting Named Entity Taggers in a Second Language</Title>
  <Section position="4" start_page="25" end_page="25" type="metho">
    <SectionTitle>
3 Data sets
</SectionTitle>
    <Paragraph position="0"> In this paper we report the results of experiments with two data sets. The Spanish corpus is the one used in the CoNLL 2002 competition for the NE extraction task. This corpus is divided into three sets: a training set consisting of 20,308 NEs and two different test sets, testa with 4,634 NEs and testb with 3,948 NEs. The former was designated for tuning the parameters of the classifiers (development set), while testb was designated for comparing the results of the competitors. We performed experiments with testa only.</Paragraph>
    <Paragraph position="1"> For evaluating NER on Portuguese we used the corpus provided by &amp;quot;HAREM: Evaluation contest on named entity recognition for Portuguese&amp;quot;. This corpus contains newspaper articles and consists of 8,551 words with 648 NEs.</Paragraph>
  </Section>
  <Section position="5" start_page="25" end_page="28" type="metho">
    <SectionTitle>
4 Two-step Named Entity Recognition
</SectionTitle>
    <Paragraph position="0"> Our approach to NER consists of dividing the problem into two subproblems that are addressed sequentially. We first solve the problem of determining the boundaries of named entities, a process we call Named Entity Delimitation (NED). Once we have determined which words belong to named entities, we turn to the task of classifying those entities into categories, a process we call Named Entity Classification (NEC). We explain the two procedures in the following subsections.</Paragraph>
    <Section position="1" start_page="25" end_page="26" type="sub_section">
      <SectionTitle>
4.1 Named Entity Delimitation
</SectionTitle>
      <Paragraph position="0"> We used the BIO scheme for delimiting named entities. In this approach each word in the text is labeled with one of three possible classes: the B tag is assigned to words believed to be at the beginning of a NE, the I tag to words that belong to an entity but are not at its beginning, and the O tag to all words that satisfy neither of the previous two conditions.</Paragraph>
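A minimal sketch of the BIO labeling described above, assuming entity spans are given as (start, end) token offsets; the helper function and span format are illustrative, not part of the paper:

```python
# Assign B/I/O tags to a token sequence given entity spans.
# Spans are (start, end) token indices with `end` exclusive (an assumption).

def bio_tags(tokens, entity_spans):
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"                 # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I"                 # subsequent words of the entity
    return tags

tokens = ["El", "Ejercito", "Mexicano", "puso", "en", "marcha",
          "el", "Plan", "DN-III"]
print(bio_tags(tokens, [(1, 3), (7, 9)]))
# → ['O', 'B', 'I', 'O', 'O', 'O', 'O', 'B', 'I']
```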
      <Paragraph position="1"> Table 1 shows an example of the learning setting for NER in Spanish. The fragment presented in the table, &amp;quot;El Ej'ercito Mexicano puso en marcha el Plan DN-III&amp;quot;, translates as &amp;quot;The Mexican Army launched the DN-III plan&amp;quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
Internal Features External Features
</SectionTitle>
      <Paragraph position="0">
Word       Caps  Position  POS tag  BIO tag  Class
El         3     1         DA       O        O
Ej'ercito  2     2         NC       B        B
Mexicano   2     3         NC       I        I
puso       2     4         VM       O        O
en         2     5         SP       O        O
marcha     2     6         NC       O        O
el         3     7         DA       O        O
Plan       2     8         NC       B        B
DN-III     3     9         NC       I        I
In our approach, NED is tackled as a learning task. The features used as attributes are automatically extracted from the documents and are used to train a machine learning algorithm. We used a modified version of the C4.5 algorithm (Quinlan, 1993) implemented within the WEKA environment (Witten and Frank, 1999).</Paragraph>
      <Paragraph position="1"> For each word we combine two types of features: internal and external. We consider as internal features the word itself, its orthographic information and its position in the sentence. The external features are provided by the hand coded NER system for Spanish: the Part-of-Speech tag and the BIO tag. The attributes for a given word w are then extracted using a window of five words anchored at w, with each word in the window described by the internal and external features mentioned above.</Paragraph>
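The five-word window can be sketched as follows; the feature names and the padding token for words falling outside the sentence are illustrative assumptions, not the paper's actual encoding:

```python
# Build the window of features for word i: the word itself plus its two
# neighbors on each side, each described by its per-word feature dict.
# Out-of-range positions get a padding record (an assumption).

def window_features(features, i, size=5):
    half = size // 2
    pad = {"word": "<PAD>", "caps": 0, "pos": 0, "postag": "<PAD>", "bio": "<PAD>"}
    window = []
    for j in range(i - half, i + half + 1):
        window.append(features[j] if 0 <= j < len(features) else pad)
    return window

feats = [{"word": w, "caps": c, "pos": p, "postag": t, "bio": b}
         for w, c, p, t, b in [("El", 3, 1, "DA", "O"),
                               ("Ejercito", 2, 2, "NC", "B"),
                               ("Mexicano", 2, 3, "NC", "I")]]
print([f["word"] for f in window_features(feats, 0)])
# → ['<PAD>', '<PAD>', 'El', 'Ejercito', 'Mexicano']
```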
      <Paragraph position="2"> Within the orthographic information we consider 6 possible states of a word. A value of 1 in this attribute means that all the letters in the word are capitalized. A value of 2 means the opposite: all letters are lower case. The value 3 is for words with only the initial letter capitalized, 4 means the word contains digits, 5 is for punctuation marks and 6 refers to marks representing the beginning and end of sentences. The hand coded system used in this work was developed by the TALP research center (Carreras and Padr'o, 2002), which has developed a set of NLP analyzers for Spanish, English and Catalan that include practical tools such as POS taggers, semantic analyzers and NE extractors. This NER system is based on hand-coded grammars, lists of trigger words and gazetteer information.</Paragraph>
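The six orthographic states might be computed as below; treating sentence boundaries as explicit `&lt;s&gt;`/`&lt;/s&gt;` tokens, and the order of the checks, are assumptions made for illustration:

```python
import string

# Map a token to one of the six orthographic states described above.
# Sentence-boundary tokens and the fallback value 0 are assumptions.

def caps_code(token):
    if token in ("<s>", "</s>"):
        return 6                              # sentence boundary marks
    if all(ch in string.punctuation for ch in token):
        return 5                              # punctuation marks
    if any(ch.isdigit() for ch in token):
        return 4                              # contains digits
    if token.isalpha():
        if token.isupper():
            return 1                          # all letters capitalized
        if token.islower():
            return 2                          # all letters lower case
        if token[0].isupper():
            return 3                          # initial capital only
    return 0                                  # fallback, not in the paper

print([caps_code(t) for t in ["ONU", "marcha", "Plan", "1999", ",", "<s>"]])
# → [1, 2, 3, 4, 5, 6]
```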
      <Paragraph position="3"> In contrast to other methods, such as that of (Carreras et al., 2003b), we do not perform binary classification, and thus do not build specialized classifiers for each of the tags. Our classifier learns to discriminate among the three classes and assigns labels to all the words, processing them sequentially. In Table 1 we present an example, taken from the data used in the experiments, where internal and external features are extracted for each word in a sentence.</Paragraph>
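As a toy illustration of a single three-way (B/I/O) classifier applied sequentially, here is a hand-written decision tree over two of the features; the splits are invented for this sketch, not those actually learned by C4.5:

```python
# Toy three-class decision procedure standing in for the learned tree.
# Uses the orthographic code ("caps") and the Spanish system's BIO output
# ("bio") plus the previously assigned tag. All splits are invented.

def classify(word_feats, prev_tag):
    caps, bio_from_system = word_feats["caps"], word_feats["bio"]
    if bio_from_system == "B":
        return "B"                       # trust the Spanish system's B
    if caps == 3 and prev_tag in ("B", "I"):
        return "I"                       # capitalized word inside an entity
    if bio_from_system == "I":
        return "I"
    return "O"

sent = [{"caps": 3, "bio": "O"}, {"caps": 2, "bio": "B"},
        {"caps": 2, "bio": "I"}, {"caps": 2, "bio": "O"}]
tags, prev = [], "O"
for feats in sent:                       # words are processed sequentially
    prev = classify(feats, prev)
    tags.append(prev)
print(tags)
# → ['O', 'B', 'I', 'O']
```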
      <Paragraph position="4"> For all results reported here we show the overall average of several runs of 10-fold cross-validation. We use common measures from information retrieval: precision, recall and F1, and we report results for individual classes, which we believe is important in a learning setting such as this one, where nearly 90% of the instances belong to a single class. Table 2 presents comparative results on the Spanish corpus. We show four sets of results: those of the hand coded system, labeled NER system for Spanish; those of a classifier trained with only the internal features described above, labeled Internal features; those of a classifier trained using only the output of the NER system, under the column External features; and the results of our system, in the column labeled Our method. Even though the NER system performs very well by itself, training the C4.5 algorithm on its outputs improves performance in all cases except precision for class B. Given that the hand coded system was built for this collection, it is very encouraging to see our method outperform it. In Table 3 we show the results of applying our method to the Portuguese corpus. In this case the improvements are much more pronounced, particularly for class B; in all cases the best results are obtained with our technique. This was expected, as we are using a system developed for a different language, yet our method yields very competitive results for Portuguese: although the internal features alone already outperform the hand coded system, combining the information with our method increases accuracy further. From the results presented above, it is clear that the method can perform NED in Spanish and Portuguese with very high accuracy.
Another insight suggested by these results is that an existing NED system for Spanish is not required to perform NED in Portuguese, since the internal features perform well by themselves; but if such a system is available, the information it provides can be used to build a more accurate NED method.</Paragraph>
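The per-class precision, recall and F1 used in these tables can be computed as below; the gold and predicted tag sequences are hypothetical, not the paper's data:

```python
# Per-class precision, recall and F1 over aligned gold/predicted tags.

def prf(gold, pred, cls):
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["O", "B", "I", "O", "B", "I", "I", "O"]
pred = ["O", "B", "O", "O", "B", "I", "O", "O"]
print(prf(gold, pred, "B"))
# → (1.0, 1.0, 1.0)
```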
    </Section>
    <Section position="3" start_page="27" end_page="28" type="sub_section">
      <SectionTitle>
4.2 Named Entity Classification
</SectionTitle>
      <Paragraph position="0"> As mentioned previously, we build our NE classifiers using the output of a hand coded system. Our assumption is that by using machine learning algorithms we can improve the performance of NE extractors without considerable effort, as opposed to the effort involved in extending or rewriting grammars, lists of trigger words and gazetteers. Another assumption underlying this approach is that the misclassifications of the hand coded system for Spanish will not mislead the learner. We believe that, by having the correct NE classes available in the training corpus, the learner will be capable of generalizing error patterns and using them to assign the correct NE class. If this assumption holds, then by learning from the other system's mistakes the learner will end up outperforming the hand coded system.</Paragraph>
      <Paragraph position="1"> In order to build a training set for the learner, each instance is described with the same attributes as for the NED task described in section 4.1, with the addition of a new attribute: since NEC is a more difficult task, we consider it useful to add the suffix of each word as an attribute, with a maximum length of 5 characters. Another important difference between this classification task and NED lies in the set of target values. For the Spanish corpus the possible class values are the same as those used in the CoNLL-2002 competition task: person, organization, location and miscellaneous. For the Portuguese corpus, however, we have 10 possible classes: person, object, quantity, event, organization, artifact, location, date, abstraction and miscellaneous. Thus the task of adapting the Spanish system to perform NEC in Portuguese is much more complex than that of NED, given that the Spanish system only discerns the four NE classes defined in CoNLL-2002. Regardless of this, we believe that the learner will be capable of achieving good accuracies by using the other attributes in the learning task.</Paragraph>
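The suffix attribute described above (at most 5 characters per word) can be extracted as:

```python
# Take the last (up to) max_len characters of a word as its suffix feature.

def suffix(word, max_len=5):
    return word[-max_len:]

print([suffix(w) for w in ["Ejercito", "Plan", "Mexicano"]])
# → ['rcito', 'Plan', 'icano']
```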
      <Paragraph position="2"> Similarly to the NED case, we trained C4.5 classifiers for the NEC task; results are presented in Tables 4 and 5. Again, we compare the hand coded system against classifiers using different subsets of attributes. For Spanish NEC, we can see in Table 4 that our method using internal and external features presents the best results. The improvements are impressive, especially for the NE class Miscellaneous, where the hand coded system achieved an F measure below 1 while our system achieved an F measure of 56.7. For NEC in Portuguese the results are very encouraging: the hand coded system performed poorly, but training a C4.5 algorithm improves results considerably, even for the classes that the hand coded system was not capable of recognizing. As expected, the external features did not solve NEC by themselves but contributed to improving performance. This, together with the results from using only internal features, suggests that we do not need complex linguistic resources in order to achieve good results. Additionally, we can see that for some classes the classifiers were not able to perform an accurate classification, as in the case of object and miscellaneous. This may be due to poor representation of these classes in the training set; for instance, the class object has only 4 instances. We believe that with more instances available the learners would improve these results.</Paragraph>
    </Section>
  </Section>
</Paper>