<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-2010">
  <Title>A Machine Learning Approach to German Pronoun Resolution</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Boosting
</SectionTitle>
    <Paragraph position="0"> All of the systems described in the previous section use a single classifier to resolve coreference.</Paragraph>
    <Paragraph position="1"> Our intuition, however, is that a combination of classifiers is better suited for this task. The concept of ensemble learning (Dietterich, 2000) is based on the assumption that combining the hypotheses of several classifiers yields a hypothesis that is much more accurate than that of an individual classifier.</Paragraph>
    <Paragraph position="2"> One of the most popular ensemble learning methods is boosting (Schapire, 2002). It is based on the observation that finding many weak hypotheses is easier than finding one strong hypothesis. This is achieved by running a base learning algorithm over several iterations. Initially, an importance weight is distributed uniformly among the training examples. After each iteration, the weight is redistributed, so that misclassified examples get higher weights. The base learner is thus forced to concentrate on difficult examples.</Paragraph>
    <Paragraph position="3"> Although boosting has not yet been applied to coreference resolution, it has outperformed stateof-the-art systems for NLP tasks such as part-ofspeech tagging and prepositional phrase attachment (Abney et al., 1999), word sense disambiguation (Escudero et al., 2000), and named entity recognition (Carreras et al., 2002).</Paragraph>
    <Paragraph position="4"> The implementation used for this project is BoosTexter (Schapire and Singer, 2000), a toolkit freely available for research purposes. In addition to labels, BoosTexter assigns confidence weights that reflect the reliability of the decisions.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 System Description
</SectionTitle>
    <Paragraph position="0"> Our system resolves pronouns in three stages: preprocessing, classification, and postprocessing.</Paragraph>
    <Paragraph position="1"> Figure 1 gives an overview of the system architecture, while this section provides details of each component.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Training and Test Data
</SectionTitle>
      <Paragraph position="0"> The system was trained with data from the Heidelberg Text Corpus (HTC), provided by the European Media Laboratory in Heidelberg, Germany.</Paragraph>
      <Paragraph position="1">  The HTC is a collection of 250 short texts (30-700 tokens) describing architecture, historical events and people associated with the city of Heidelberg. To examine its domain (in)dependence, the system was tested on 40 unseen HTC texts as well as on 25 articles from the Spiegel magazine, the topics of which include current events, science, arts and entertainment, and travel.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 The MMAX Annotation Tool
</SectionTitle>
      <Paragraph position="0"> The manual annotation of the training data was done with the MMAX (Multi-Modal Annotation in XML) annotation tool (M&amp;quot;uller and Strube, 2001). The fist step of coreference annotation is to identify the markables, i.e. noun phrases that refer to real-word entities. Each markable is annotated with the following attributes: a0 np form: proper noun, definite NP, indefinite NP, personal pronoun, possessive pronoun, or demonstrative pronoun.</Paragraph>
      <Paragraph position="1"> a0 grammatical role: subject, object (direct or indirect), or other.</Paragraph>
      <Paragraph position="2"> a0 agreement: this attribute is a combination of person, number and gender. The possible values are 1s, 1p, 2s, 2p, 3m, 3f, 3n, 3p.</Paragraph>
      <Paragraph position="3"> a0 semantic class: human, physical object (includes animals), or abstract. When the semantic class is ambiguous, the &amp;quot;abstract&amp;quot; option is chosen.</Paragraph>
      <Paragraph position="4"> a0 type: if the entity that the markable refers to is new to the discourse, the value is &amp;quot;none&amp;quot;. If the markable refers to an already mentioned entity, the value is &amp;quot;anaphoric&amp;quot;. An anaphoric markable has another attribute for its relation to the antecedent. The values for this attribute are &amp;quot;direct&amp;quot;, &amp;quot;pronominal&amp;quot;, and &amp;quot;ISA&amp;quot; (hyponym-hyperonym).</Paragraph>
      <Paragraph position="5"> To mark coreference, MMAX uses coreference sets, such that every new reference to an already mentioned entity is added to the set of that entity. Implicitly, there is a set for every entity in the discourse - if an entity occurs only once, its set contains one markable.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Feature Vector
</SectionTitle>
      <Paragraph position="0"> The features used by our system are summarised in Table 4.3. The individual features for anaphor</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Feature Description
</SectionTitle>
      <Paragraph position="0"> pron the pronoun ana npform NP form of the anaphor ana gramrole grammatical role of the anaphor ana agr agreement of the anaphor ana semclass* semantic class of the anaphor ante npform NP form of the antecedent ante gramrole grammatical role of the antecedent null ante agr agreement of the antecedent ante semclass* semantic class of the an- null tecedent dist distance in markables between anaphor and antecedent (1 .. 20) same agr same agreement of anaphor and antecedent? same gramrole same grammatical role of anaphor and antecedent? same semclass* same semantic class of anaphor and antecedent?  tures were only used for 10-fold cross-validation on the manually annotated data and antecedent - NP form, grammatical role, semantic class - are extracted directly from the annotation. The relational features are generated by comparing the individual ones. The binary target function - coreferent, non-coreferent - is determined by comparing the values of the member attribute. If both markables are members of the same set, they are coreferent, otherwise they are not.</Paragraph>
      <Paragraph position="1"> Due to lack of resources, the semantic class attribute cannot be annotated automatically, and is therefore used only for comparison with (Strube et al., 2002).</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Noun Phrase Chunking, NER and
POS-Tagging
</SectionTitle>
      <Paragraph position="0"> To identify markables automatically, the system uses the noun phrase chunker described in (Schmid and Schulte im Walde, 2000), which displays case information along with the chunks.</Paragraph>
      <Paragraph position="1"> The chunker is based on a head-lexicalised probabilistic context free grammar (H-L PCFG) and achieves an F-measure of 92 for range only and 83 for range and label, whereby a range of a noun chunk is defined as &amp;quot;all words from the beginning of the noun phrase to the head noun&amp;quot;. This is different from manually annotated markables, which can be complex noun phrases.</Paragraph>
      <Paragraph position="2"> Despite good overall performance, the chunker fails on multi-word proper names in which case it marks each word as an individual chunk.1 Since many pronouns refer to named entities, the chunker needs to be supplemented by a named entity recogniser. Although, to our knowledge, there currently does not exist an off-the-shelf named entity recogniser for German, we were able to obtain the system submitted by (Curran and Clark, 2003) to the 2003 CoNLL competition. In order to run the recogniser, the data needs to be tokenised, tagged and lemmatised, all of which is done by the Tree-Tagger (Schmid, 1995).</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Markable Creation
</SectionTitle>
      <Paragraph position="0"> After the markables are identified, they are automatically annotated with the attributes described in Section 4.4. The NP form can be reliably determined by examining the output of the noun chunker and the named entity recogniser. Pronouns and named entities are already labeled during chunking. The remaining markables are labelled as definite NPs if their first words are definite articles or possessive determiners, and as indefinite NPs otherwise. Grammatical role is determined by the case assigned to the markable - subject if nominative, object if accusative. Although datives and genitives can also be objects, they are more likely to be adjuncts and are therefore assigned the value &amp;quot;other&amp;quot;.</Paragraph>
      <Paragraph position="1"> For non-pronominal markables, agreement is determined by lexicon lookup of the head nouns.</Paragraph>
      <Paragraph position="2"> Number ambiguities are resolved with the help of the case information. Most proper names, except for a few common ones, do not appear in the lexicon and have to remain ambiguous. Although it is impossible to fully resolve the agreement ambiguities of pronominal markables, they can be classi1An example is [Verteidigunsminister Donald] [Rumsfeld] ([Minister of Defense Donald] [Rumsfeld]).</Paragraph>
      <Paragraph position="3"> fied as either feminine/plural or masculine/neuter.</Paragraph>
      <Paragraph position="4"> Therefore we added two underspecified values to the agreement attribute: 3f 3p and 3m 3n. Each of these values was made to agree with both of its subvalues.</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.6 Antecedent Selection
</SectionTitle>
      <Paragraph position="0"> After classification, one non-pronominal antecedent has to be found for each pronoun. As BoosTexter assigns confidence weights to its predictions, we have a choice between selecting the antecedent closest to the anaphor (closest-first) and the one with the highest weight (best-first).</Paragraph>
      <Paragraph position="1"> Furthermore, we have a choice between ignoring pronominal antecedents (and risking to discard all the correct antecedents within the window) and resolving them (and risking multiplication of errors). In case all of the instances within the window have been classified as non-coreferent, we choose the negative instance with the lowest weight as the antecedent. The following section presents the results for each of the selection strategies.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML