<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1055">
  <Title>Using Machine Learning to Maintain Rule-based Named-Entity Recognition and Classification Systems (Georgios Petasis, Frantz Vichot, Francis Wolinski)</Title>
  <Section position="8" start_page="2" end_page="2" type="evalu">
    <SectionTitle>
5.3 Results for the French System
</SectionTitle>
    <Paragraph position="0"> The corpus for the French experiment consisted of Agence France-Presse dispatches from April 1998 to January 2001. Its thematic domain was shareholding events. The corpus contained six thousand documents, including 180,983 NE instances, distributed as follows: companies (45%), locations (45%), persons (7%) and associations, i.e. non-commercial organisations (3%).</Paragraph>
    <Paragraph position="1"> For this experiment, the corpus was split chronologically into two parts. The part containing the earlier messages was used for training, while the second part, containing the most recent messages, was used to evaluate our approach. We focused on four NE categories, rather than the three used in the Greek experiment, because the French NERC system further divides organisations into associations (non-profit organisations) and companies. The contingency matrix summarising the cases of agreement and disagreement between the two systems (the rule-based NERC system and the system learned by C4.5) is shown in Table 2. The two systems agree in 91% of the cases.</Paragraph>
    <Paragraph position="2">
                 associat.   person   location   company
    associat.        808          6        31        618
    person             3      4,498        46        509
    location          11         51     6,870      2,526
    company          296         67       534     34,946

    Examining the disagreement cases gave us important insight regarding problems of the rule-based system. The following sections present some interesting examples.</Paragraph>
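The 91% agreement figure can be checked directly against the counts in Table 2. The following is a minimal Python sketch; the names MATRIX, agreement_rate and largest_disagreements are illustrative and not part of the original work. No assumption is made about which system labels the rows and which the columns, since agreement depends only on the diagonal.

```python
# Counts copied from Table 2 (order: associat., person, location, company).
LABELS = ["associat.", "person", "location", "company"]
MATRIX = [
    [808, 6, 31, 618],
    [3, 4498, 46, 509],
    [11, 51, 6870, 2526],
    [296, 67, 534, 34946],
]

def agreement_rate(matrix):
    """Fraction of instances on the diagonal, i.e. where both systems agree."""
    agreed = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return agreed / total

def largest_disagreements(matrix, labels, top=3):
    """Off-diagonal cells sorted by count, largest first."""
    cells = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(matrix))
        for j in range(len(matrix))
        if i != j
    ]
    return sorted(cells, reverse=True)[:top]

print(round(100 * agreement_rate(MATRIX), 1))   # 90.9
print(largest_disagreements(MATRIX, LABELS))
```

Sorting the off-diagonal cells shows where manual inspection is likely to pay off first: the largest cell by far is the location/company pair, with 2,526 instances.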
    <Paragraph position="3"> As in the Greek experiment, examining the disagreements revealed some interesting problems in the recognition of NEs. For instance, &quot;Europe 1&quot; is a well-known French radio station, whose name is sometimes also written as &quot;Europe Un&quot; (Europe One). The rule-based system failed to identify &quot;Europe Un&quot; and identified only &quot;Europe&quot;, as a location. The source of the problem is the lack of a mapping between numbers written out in words and numerical figures.</Paragraph>
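One way to picture the missing mapping is a normalisation step applied to candidate names before lookup. The sketch below is only an illustration under stated assumptions: the NUMBER_WORDS table and both function names are hypothetical, the table is deliberately tiny, and the original system's rule formalism is not reproduced here. Note that "un" is also a French article, so a real rule would have to restrict the rewrite to tokens inside a candidate name.

```python
# Hypothetical mapping from spelled-out French numbers to digits.
# Deliberately small; a real system would need a fuller table.
NUMBER_WORDS = {"un": "1", "deux": "2", "trois": "3", "quatre": "4", "cinq": "5"}

def normalise_numbers(name):
    """Rewrite spelled-out number tokens in an entity name as digits."""
    tokens = name.split()
    return " ".join(NUMBER_WORDS.get(tok.lower(), tok) for tok in tokens)

def same_entity(a, b):
    """Treat two surface forms as equal if they match after normalisation."""
    return normalise_numbers(a).lower() == normalise_numbers(b).lower()

print(normalise_numbers("Europe Un"))         # Europe 1
print(same_entity("Europe Un", "Europe 1"))   # True
```

With such a step in place, both surface forms of the radio station's name would reach the lexicon as the same string.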
    <Paragraph position="4"> Another example is the phrase &quot;Le Mans Re&quot;, a shortened form of the company name &quot;Les mutuelles du Mans Reassurance&quot; (a reinsurance company). The rule-based system recognised only &quot;Le Mans&quot;, as a location, because of the well-known French city of that name. What is needed here is an extension of the segmentation rules so that &quot;Re&quot; is treated as a &quot;company designator&quot;, like &quot;Motor&quot;, &quot;Bank&quot; or &quot;Telecom&quot;.
    Most of the classification problems identified concerned NEs already known to the system that have since acquired new meanings. For example, &quot;Ariane II rachete&quot; (Ariane II buys) is classified as a person, because the lexicon contains the word &quot;Ariane&quot; as a person forename. In reality, &quot;Ariane II&quot; is a new company that should also be included in the lexicon database. Another example is &quot;Orange&quot;, already included in the lexicon as an old French city. Meanwhile, a new French company with the same name has been created, as in the example &quot;Orange, valorisee par les analystes&quot; (Orange, valued by analysts). In this case too, the lexicon must be updated with a second entry for this entity, categorised as a company.
    Besides lexicon omissions, some problems in the classification grammar were also revealed. First, overly general rules were identified, such as the one that classifies entities starting with &quot;A&quot; and followed by numbers as French highway names. This rule wrongly classified the NE &quot;A3XX&quot; as a highway, while the text was referring to an airplane model: &quot;L'A3XX, un avion&quot; (The A3XX, an airplane). Our approach also succeeded in locating well-known NEs used in a new context. For example, the rule-based NERC system recognises &quot;Taittinger&quot; as a company, while the system learned by C4.5 disagrees with this classification in the sentence &quot;la famille Taittinger&quot; (the Taittinger family). In this case, the grammar should be updated with a rule stating that the word &quot;famille&quot; (family) in front of a proper name suggests a person name.</Paragraph>
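The two grammar fixes discussed above, a narrower highway pattern and a context rule for "famille", can be sketched in Python as follows. Everything here is an assumption made for illustration: the LEXICON contents, the HIGHWAY pattern and the classify function are hypothetical and do not reproduce the actual grammar formalism of the French NERC system.

```python
import re

# Hypothetical single-sense lexicon entries, mirroring the examples in the text.
LEXICON = {"Taittinger": "company", "Orange": "location"}

# Assumed stricter variant of the highway rule: the whole token must be
# "A" followed only by digits, so "A3XX" no longer matches.
HIGHWAY = re.compile(r"A\d+")

def classify(tokens, i):
    """Classify tokens[i]: context rule first, then pattern, then lexicon."""
    word = tokens[i]
    previous = tokens[i - 1].lower() if i > 0 else ""
    # Context rule: "famille" in front of a capitalised word suggests a person.
    if previous == "famille" and word[:1].isupper():
        return "person"
    if HIGHWAY.fullmatch(word):
        return "highway"
    return LEXICON.get(word, "unknown")

print(classify(["la", "famille", "Taittinger"], 2))   # person
print(classify(["A3XX"], 0))                          # unknown
```

Checking context before the lexicon is a design choice of this sketch: it lets a known name such as "Taittinger" keep its company reading by default while still yielding to local evidence. The stricter pattern only removes the "A3XX" error; a token consisting of "A" and digits that is not a highway would still need contextual evidence to be classified correctly.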
  </Section>
</Paper>