<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0434">
  <Title>A Robust Risk Minimization based Named Entity Recognition System</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 System description
</SectionTitle>
    <Paragraph position="0"> Following the approach employed in our text chunking system (Zhang et al., 2002), we treat the named entity recognition problem as a sequential token-based tagging problem. We denote by {wi} (i = 0,1,...,m) the sequence of tokenized text, which is the input to our system. In token-based tagging, the goal is to assign a class-label ti, taking its value from a predefined set of labels, to every token wi.</Paragraph>
    <Paragraph position="1"> For named entity recognition, and text segmentation in general, the entities (segments) can be encoded as a token-based tagging problem by using various encoding schemes. In this paper, we shall only use the IOB1 encoding scheme which is provided in the CoNLL-2003 shared task.</Paragraph>
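Concretely, IOB1 marks entity tokens with I-TYPE and uses B-TYPE only when an entity immediately follows another entity of the same type. A minimal decoder recovering entity spans from a tag sequence might look as follows (a sketch; the function and naming conventions are ours, not the paper's):

```python
def iob1_to_spans(tags):
    """Decode an IOB1 tag sequence into (type, start, end) spans.

    In IOB1 (as used in CoNLL-2003), entities normally begin with I-TYPE;
    B-TYPE appears only when an entity directly follows another entity of
    the same type.
    """
    spans = []
    cur_type, cur_start = None, None  # span currently being built
    prev_type = None                  # entity type of the previous token
    for i, tag in enumerate(tags):
        if tag == "O":
            if cur_type is not None:
                spans.append((cur_type, cur_start, i))
                cur_type = None
            prev_type = None
            continue
        prefix, typ = tag.split("-", 1)
        # A new entity starts on B-, or when the type changes.
        if prefix == "B" or typ != prev_type:
            if cur_type is not None:
                spans.append((cur_type, cur_start, i))
            cur_type, cur_start = typ, i
        prev_type = typ
    if cur_type is not None:
        spans.append((cur_type, cur_start, len(tags)))
    return spans
```

For example, `["I-PER", "I-PER", "O", "I-LOC", "B-LOC"]` decodes to one two-token PER span followed by two adjacent single-token LOC spans.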
    <Paragraph position="2"> The goal of our learning system is to predict the class-label value ti associated with each token wi. In our system, this is achieved by estimating the conditional probability P(ti = c|xi) for every possible class-label value c, where xi is a feature vector associated with token i. It is essentially a sufficient statistic in our model: we assume that P(ti = c|xi) = P(ti = c|{wi},{tj}j&lt;i). The feature vector xi can depend on previously predicted class-labels {tj}j&lt;i, but the dependency is typically assumed to be local. Given such a conditional probability model, in the decoding stage, we estimate the best possible sequence of ti's using a dynamic programming approach, similar to what is described in (Zhang et al., 2002).</Paragraph>
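The decoding stage can be sketched as a first-order dynamic program. Assuming, purely for illustration, that the feature vector xi depends only on the previous predicted tag, the decoder maximizes the product of per-token conditional probabilities (the `prob` callback and its signature are hypothetical, not the paper's API):

```python
import math


def decode(n, labels, prob):
    """Dynamic-programming decoder (a sketch in the spirit of the decoder
    described in (Zhang et al., 2002)).

    prob(i, prev_label, label) -> estimate of P(t_i = label | x_i), where
    x_i may depend on the previously predicted tag prev_label.
    Returns the highest-scoring label sequence of length n.
    """
    # best[c] = (log-score of best path ending in c, that path)
    best = {c: (math.log(max(prob(0, None, c), 1e-12)), [c]) for c in labels}
    for i in range(1, n):
        new = {}
        for cur in labels:
            score, path = max(
                (s + math.log(max(prob(i, prev, cur), 1e-12)), p)
                for prev, (s, p) in best.items()
            )
            new[cur] = (score, path + [cur])
        best = new
    return max(best.values())[1]
```

The small floor `1e-12` guards against taking the logarithm of a zero probability produced by the truncated linear model.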
    <Paragraph position="3"> In our system, the conditional probability model has the following parametric form:</Paragraph>
    <Paragraph position="5"> P(ti = c|xi) = T(wc . xi + bc), where T(y) = min(1,max(0,y)) is the truncation of y into the interval [0,1], wc is a linear weight vector, and bc is a constant. The parameters wc and bc can be estimated from the training data.</Paragraph>
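Evaluating this model is straightforward: one truncated linear score per class. A minimal sketch (the dictionary-based API and parameter names are illustrative, not from the paper):

```python
def class_probs(x, weights, biases):
    """Estimate P(t = c | x) = T(wc . x + bc), T(y) = min(1, max(0, y)).

    x: feature vector (list of floats); weights: class label -> weight
    vector wc; biases: class label -> offset bc. All names illustrative.
    """
    probs = {}
    for c, wc in weights.items():
        y = sum(wj * xj for wj, xj in zip(wc, x)) + biases[c]
        probs[c] = min(1.0, max(0.0, y))  # truncate into [0, 1]
    return probs
```

Note that, unlike a softmax model, the truncated scores need not sum to one across classes; each is a separate per-class probability estimate.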
    <Paragraph position="6"> Given training data (xi, ti) for i = 1,...,n, it was shown in (Zhang et al., 2002) that such a model can be estimated by solving the following optimization problem:</Paragraph>
    <Paragraph position="8"> (wc, bc) = arg min over wc, bc of (1/n) sum over i = 1,...,n of f(wc . xi + bc, yic), where yic = 1 when ti = c and yic = -1 otherwise. The function f is defined as:</Paragraph>
    <Paragraph position="10"> f(p, y) = -2py when py is below -1; f(p, y) = (p - y)^2/2 when py lies in [-1, 1]; and f(p, y) = 0 when py exceeds 1. This risk function is closely related to Huber's loss function in robust estimation. We call a classification method that is based on approximately minimizing this risk function robust risk minimization. The generalized Winnow method in (Zhang et al., 2002) implements such a method. The numerical algorithm used for the experiments in this paper is a variant, similar to the one given in (Damerau et al., 2003).</Paragraph>
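A sketch of a Huber-style piecewise loss of this kind, assuming breakpoints at py = -1 and py = 1 (the exact constants are an assumption on our part): linear for badly misclassified points, quadratic near the decision boundary, and zero once the example is confidently correct.

```python
def robust_loss(p, y):
    """Huber-style robust classification loss (a sketch; constants assumed).

    p: real-valued prediction, y: label in {-1, +1}.
    """
    v = p * y
    if v < -1.0:
        return -2.0 * v            # linear tail: robust to outliers
    if v <= 1.0:
        return (p - y) ** 2 / 2.0  # quadratic near the margin
    return 0.0                     # no penalty when confidently correct
```

The linear tail is what makes the risk robust: unlike squared loss, a single badly misclassified outlier contributes only linearly, not quadratically, to the objective.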
    <Paragraph position="11"> The main purpose of this paper is to investigate the impact of local linguistic features on the named entity detection task. The basic linguistic features considered here are all aligned with the tokens; they are listed in Table 1. These features are represented using a binary encoding scheme in which each component of the feature vector x corresponds to an occurrence of a feature listed in Table 1. We use a window of +-1 centered at the current token unless indicated otherwise in Table 1.</Paragraph>
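As an illustration of binary feature encoding over a +-1 window, a hypothetical extractor producing the set of active features for one token (the feature names and the reduced inventory are our own simplification of Table 1, not the paper's exact feature set):

```python
def extract_features(tokens, i, prev_tags):
    """Active binary features for token i (simplified sketch of Table 1).

    tokens: token list; prev_tags: tags predicted for tokens 0..i-1.
    Each string in the returned set corresponds to one component of the
    binary feature vector x being switched on.
    """
    feats = set()
    for d in (-1, 0, 1):  # window of +-1 around the current token
        j = i + d
        if 0 <= j < len(tokens):
            w = tokens[j]
            feats.add(f"tok[{d}]={w.lower()}")
            feats.add(f"cap[{d}]={w[:1].isupper()}")  # initial capitalization
    w = tokens[i]
    for k in (3, 4):              # prefixes of length three and four
        feats.add(f"pre{k}={w[:k]}")   # short words yield the whole word
    for k in (1, 2, 3, 4):        # suffixes of length one to four
        feats.add(f"suf{k}={w[-k:]}")
    if i > 0:                     # local dependence on predicted tags
        feats.add(f"prevtag={prev_tags[i - 1]}")
    return feats
```

In a real system these strings would be hashed or interned into indices of the sparse binary vector x; the set representation keeps the sketch readable.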
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We study the performance of our system with different feature combinations on the English development set.</Paragraph>
    <Paragraph position="1"> Our results are presented in Table 2. All of these results are significantly better than the baseline performance of 71.18. We will now discuss the implications of these experimental results.</Paragraph>
    <Paragraph position="2"> The small difference between Experiment 1 and Experiment 2 implies that tokens by themselves, whether represented as mixed case text or not, do not significantly affect the system performance.</Paragraph>
    <Paragraph position="3"> Experiment 3 shows that even without case information, the performance of a statistical named entity recognition system can be greatly enhanced with token prefix and suffix information. Intuitively, such information allows us to build a character-based token-model which can predict whether an (unseen) English word looks like an entity-type or not. The performance of this experiment is comparable to that of the mixed-case English text plus capitalization feature reported in Experiment 4.</Paragraph>
    <Paragraph position="4"> Experiment 4 suggests that capitalization is a very useful feature for mixed case text, and can greatly enhance the performance of a named entity recognition system. With token prefix and suffix information that incorporates a character-based entity model, the system performance is further enhanced, as reported in Experiment 5.</Paragraph>
    <Paragraph position="5"> Up to Experiment 5, we have only used very simple token-based linguistic features. Despite their simplicity, these features give very significant performance enhancement. In addition, such features are readily available for many languages, implying that they can be used in a language independent statistical named entity recognition system.</Paragraph>
    <Paragraph position="6"> In Experiment 6, we added the provided part-of-speech and chunking information. Clearly they only lead to a relatively small improvement. We believe that most information contained in part-of-speech has already been captured in the capitalization and prefix/suffix features. The chunking information might be more useful, though its value is still quite limited.</Paragraph>
    <Paragraph position="7"> By adding the four supplied dictionaries, we observe a small, but statistically significant improvement. The performance is reported in Experiment 7. At this point we have only used information provided by the shared task.</Paragraph>
    <Paragraph position="8"> Table 1. Features used in the system. A: Tokens turned into all upper-case, in a window of +-2. B: The tokens themselves, in a window of +-2.</Paragraph>
    <Paragraph position="9"> C: The previous two predicted tags, and the conjunction of the previous tag and the current token. D: Initial capitalization of tokens in a window of +-2.</Paragraph>
    <Paragraph position="10"> E: More elaborate word-type information: initial capitalization, all capitals, all digits, or digits containing punctuation. F: Token prefixes (lengths three and four) and token suffixes (lengths one to four). G: Part-of-speech information provided in the shared task.</Paragraph>
    <Paragraph position="11"> H: Chunking information provided in the shared task: we use a bag-of-words representation of the chunk at the current token. I: The four dictionaries provided in the shared task: PER, ORG, LOC, and MISC. J: A number of additional dictionaries from different sources: trigger words for ORG, PER, and LOC, and lists of locations, persons, and organizations. Further performance enhancement can be achieved by using extra information that is not provided in the shared task. In this study, we report performance only with the additional dictionaries we have gathered from various sources. With these additional dictionaries, our system achieved a performance of 92, as reported in Experiment 8. Table 4 presents the performance of each entity type separately.</Paragraph>
    <Paragraph position="12"> Clearly, the construction of extra linguistic features is open-ended. It is possible to improve system performance with additional and higher-quality dictionaries. Although dictionaries are language dependent, they are often readily available, and providing them does not pose a major impediment to customizing a language independent system. However, for more difficult cases, it may be necessary to provide high-precision, manually developed rules to capture particular linguistic patterns. Language dependent features of this kind are harder to develop than dictionaries and correspondingly pose a greater obstacle to customizing a language independent system. We have found that such features can appreciably improve the performance of our system, but a discussion is beyond the scope of this paper. A related idea is to combine the outputs of different systems; see (Florian et al., 2003) for such a study. Fortunately, as our experiments indicate, special purpose patterns may not be necessary for quite reasonable accuracy.</Paragraph>
    <Paragraph position="13"> In Table 4, we report the performance of our system on the German data. We note that the performance is significantly lower than the corresponding English performance. Our experience indicates that even for English, the real-world performance of a statistical named entity recognizer can be very low. The performance we report for the German data is achieved by using the following features: B+C+D+E+F+G+H+I+J (with some small modifications), plus the German word-lemma feature provided by the task.</Paragraph>
    <Paragraph position="14"> The additional German dictionaries were provided to us by Radu Florian. Without these additional dictionaries (in this case, all information we use is provided by the CoNLL task), the overall performance is listed in Table 3.</Paragraph>
    <Paragraph position="15"> It is also interesting to note that without any dictionary information, the overall performance drops to an Fβ=1 score of 65.5 on the development set, and to 70.2 on the test set. Clearly, for this data, dictionary information helps more on the development data than on the test data.</Paragraph>
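For reference, the Fβ=1 score used in the CoNLL-2003 evaluation combines precision and recall; a small sketch:

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta score; beta = 1 gives the harmonic mean of precision and
    recall, as used in the CoNLL-2003 shared task evaluation."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```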
    <Paragraph position="16"> Table 3: system performance using only the information provided by the CoNLL task on German data.</Paragraph>
  </Section>
</Paper>