<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2013">
  <Title>Automatic recognition of French expletive pronoun occurrences</Title>
  <Section position="5" start_page="75" end_page="77" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> I have worked on the French newspaper Le Monde. More precisely, I have worked on a corpus of 3,782,613 tokens extracted from the corpus Le Monde'94. Unitex segments this corpus into 71,293 sentences. It contains 13,611 occurrences of the token il and 20,549 occurrences of third person subject pronouns, i.e. il, elle, ils, elles (he, she, it, they). So il is the most frequent third person subject pronoun, accounting for about 66% of these occurrences.</Paragraph>
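As a quick check of the 66% figure, the share of il among third person subject pronouns can be recomputed from the two counts given above. This is only an illustrative sketch: the function name `share` and the constant names are hypothetical and are not part of ILIMP or Unitex.

```python
def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return round(100 * part / whole, 1)

# Counts reported for the Le Monde'94 corpus in this section.
IL_OCCURRENCES = 13611
THIRD_PERSON_SUBJECT_PRONOUNS = 20549

il_share = share(IL_OCCURRENCES, THIRD_PERSON_SUBJECT_PRONOUNS)
print(il_share)
```

The result is consistent with the rounded rate of 66% stated in the text.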
    <Paragraph position="1"> S is the symbol for the pattern that represents a sentence. This pattern consists of a non-empty sequence of tokens that includes a finite verb.</Paragraph>
    <Paragraph position="2"> From this corpus, 8544 sentences that include at least one occurrence of il were extracted; together they contain around 10,000 occurrences of il (a complex sentence with embedded clauses may include several occurrences of il). These sentences were given as input to ILIMP, and the results, i.e. the tags [IMP], [ANA] and [AMB], were manually evaluated. The evaluators were asked to follow only their intuition.</Paragraph>
    <Paragraph position="3"> The result of this evaluation is a precision rate of 97.5%. We now examine the 2.5% of errors, setting aside [AMB].</Paragraph>
    <Section position="1" start_page="75" end_page="76" type="sub_section">
      <SectionTitle>
4.1 Errors from morphological ambiguities
</SectionTitle>
      <Paragraph position="0"> Errors coming from morphological ambiguities are (of course) counted in the same way as the other errors arising from the implementation of ILIMP (which are examined in the next sections). Recall (Section 2.2) that the pre-processing in Unitex does not include any disambiguation: it is not a tagger. To illustrate the consequences of this point, consider the pattern in (7a), in which &lt;V6:W&gt; targets the past participle of verbs of Table 6, e.g. choisi (chosen), and S matches a sequence of tokens which includes a finite verb (see note 3). This pattern aims at targeting impersonal clauses such as (7b).</Paragraph>
      <Paragraph position="1"> Nevertheless, it also targets (7c), in which the pronoun il is thus wrongly tagged as [IMP]. This error comes from the fact that the dictionary DELAF rightly includes two entries for the word metres - finite form of the verb metrer and plural form of the noun metre - and Unitex does not make any distinction between these two entries.</Paragraph>
      <Paragraph position="2"> Therefore, the sequence le beton pour soutenir une toiture de 170 metres is interpreted as including a finite verb, and hence matches pattern S.</Paragraph>
      <Paragraph position="3"> (7)a Il[IMP] &lt;avoir.V:3s&gt; ete &lt;V6:W&gt; (ADV) que S
      b Il a ete choisi que les seances se feraient le matin vers 9h (It has been chosen that sessions would take place around 9 am)
      c Il a ete choisi plutot que le beton pour soutenir une toiture de 170 metres (It has been chosen rather than concrete to support a 170 meter roof)
      Any tagger should tag the word metres in (7c) as a noun. Taking as input the output of a tagger, rather than a raw text pre-processed by Unitex, would avoid the error on il in (7c). However, ILIMP would then depend on the errors of the tagger. Which is best? More generally, assuming that a syntactic parser relies upon a modular approach in which a set of modules (tagger, named entity recognition module, ILIMP, chunker, etc.) collaborate, the question arises of the order in which the modules should be chained.</Paragraph>
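The effect of the missing disambiguation can be sketched with a toy lexicon. The code below is only an illustration under stated assumptions: the dictionary `TOY_LEXICON`, its category labels, and the function `may_contain_finite_verb` are hypothetical stand-ins, not the DELAF format or the Unitex interface. It shows that when every dictionary entry of a token is kept, the sequence containing metres passes a "contains a finite verb" test, whereas a reading restricted to the noun entry does not.

```python
# Toy lexicon: each token maps to the set of categories it may have.
# Hypothetical stand-in for a dictionary lookup without disambiguation.
TOY_LEXICON = {
    "le": {"DET", "PRO"},
    "beton": {"N"},
    "pour": {"PREP"},
    "soutenir": {"V_INF"},
    "une": {"DET"},
    "toiture": {"N"},
    "de": {"PREP"},
    "170": {"NUM"},
    "metres": {"N", "V_FIN"},  # plural noun or finite form of metrer
}

def may_contain_finite_verb(tokens, lexicon):
    """True if at least one token has a finite-verb entry in the lexicon."""
    return any("V_FIN" in lexicon.get(tok, set()) for tok in tokens)

sequence = "le beton pour soutenir une toiture de 170 metres".split()

# Without disambiguation, the ambiguous entry for metres is enough.
ambiguous_match = may_contain_finite_verb(sequence, TOY_LEXICON)

# With a hypothetical tagger output that keeps only the noun reading.
tagged_lexicon = dict(TOY_LEXICON, metres={"N"})
disambiguated_match = may_contain_finite_verb(sequence, tagged_lexicon)

print(ambiguous_match, disambiguated_match)
```

In this sketch the first test succeeds and the second fails, which mirrors the trade-off discussed above: a tagger removes this error but introduces a dependence on its own mistakes.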
      <Paragraph position="4"> Let us leave this question open and come back to the errors made by ILIMP when it takes a raw text as input.</Paragraph>
      <Paragraph position="5"> 4.2 il wrongly tagged as [IMP] instead of [ANA]: 0.3%
      Very few errors: 33. This is surprising given the frequent use of "brutal" heuristics. As an illustration, il in the pattern Il y a is systematically tagged as [IMP]. This heuristic yields two errors, as in (8a), but around 1500 right tags, as in (8b).</Paragraph>
      <Paragraph position="6"> (8)a Il revient de Rimini. Il y a donne la replique a Madeleine. (He is back from Rimini. He gave there the cue to Madeleine.)
      b Il y a beaucoup de trafic a 8h (There is a lot of traffic at 8 am)
      4.3 il wrongly tagged as [ANA] instead of [IMP]: 2%
      More errors. This type of error arises because [ANA] is the default value. These errors are thus directly attributable to gaps in the patterns making up ILIMP.</Paragraph>
      <Paragraph position="7"> Among these gaps, there are first those coming from my laziness/tiredness/lack of time. For example, I have introduced quotation marks in some places in patterns but not everywhere. Hence, il is wrongly tagged as [ANA] in (9a) just because of the quotation marks. Similarly, I wrote some patterns for cases with subject inversion, but I did not take the time to write all of them, hence the error in (9b).</Paragraph>
      <Paragraph position="8"> (9)a Il[ANA] etait "meme souhaitable" que celui-ci soit issu ... (It was "even desirable" that this one be from ...)
      b Est-il [ANA] inconcevable ... (Is it inconceivable that ...)
      Secondly, there are lexical gaps. In particular, some adjectives which can be the head of impersonal clauses are missing: the list of 682 adjectives I have compiled needs to be extended.</Paragraph>
      <Paragraph position="9"> Thirdly, there are syntactic gaps. In particular, I have treated every extraposed clausal subject as obligatory, whereas there are cases where such a subject is not realized, for example in phrases introduced by comme (as), see (10). I have created a pattern to take such phrases into account, but it does not handle all of them.</Paragraph>
      <Paragraph position="10"> (10) comme il a ete annonce (as it has been announced)
      Finally, gaps are found for impersonal clauses with a nominal extraposed subject. In particular, I have written no pattern for verbs in the passive form used in a refined register, see Section 2.1. To conclude this section on the occurrences of il wrongly tagged as [ANA], I would like to add that although the first three types of errors can be avoided with a little effort, this is not the case for the last type.</Paragraph>
    </Section>
    <Section position="2" start_page="76" end_page="76" type="sub_section">
      <SectionTitle>
4.4 Other errors: 0.2%
</SectionTitle>
      <Paragraph position="0"> Some errors arise because the word il is used not as a subject pronoun but as part of a named entity in a foreign language, see (11).</Paragraph>
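The error rates reported in Sections 4.2 to 4.4 can be checked against the overall precision rate. The sketch below is illustrative only: the dictionary keys are hypothetical labels, and the rates are the rounded figures given in the text, so the check is only as exact as that rounding.

```python
from fractions import Fraction

# Rounded error rates (in percent) as reported in Sections 4.2 to 4.4.
ERROR_RATES = {
    "IMP_instead_of_ANA": Fraction("0.3"),
    "ANA_instead_of_IMP": Fraction("2"),
    "other": Fraction("0.2"),
}

# Exact rational arithmetic avoids floating-point noise in the sum.
total_error = sum(ERROR_RATES.values())
precision = 100 - total_error

print(float(total_error), float(precision))
```

Under these rounded figures, the three error types add up to the overall error rate, and the remainder matches the precision rate stated at the start of the evaluation.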
    </Section>
    <Section position="3" start_page="76" end_page="77" type="sub_section">
      <SectionTitle>
4.5 Evaluation on other corpora
</SectionTitle>
      <Paragraph position="0"> An evaluation of ILIMP has also been carried out on French literary texts written in the 19th century. It concerns 1858 occurrences of il. The precision rate falls compared to the journalistic genre: it goes from 97.5% to 96.8%. This fall comes, on the one hand, from impersonal expressions which are no longer in use, as in (11), and on the other hand, from a high number of sentences with subject inversion, as in (9b) in Section 4.3.</Paragraph>
      <Paragraph position="1"> Recall that I have not handled subject inversion systematically.</Paragraph>
      <Paragraph position="2"> (11) Mais peut-etre etait-il un peu matin pour organiser un concert (But maybe was it a little bit morning to organize a concert)
      The percentage of impersonal il in literary texts increases compared to the Le Monde corpus: it goes from 42% to 49.8%. More generally, I expect substantial differences in the percentage of il with an impersonal use depending on the genre of the corpus, though I do not expect significant differences in the precision rate of ILIMP (especially if the first three types of errors described in Section 4.3 are corrected). This is because the list of lexical heads for impersonal clauses is closed and stable.
      Note on the named entity errors of Section 4.4: this kind of error would be avoided if ILIMP took as input a text in which the named entities are recognized.</Paragraph>
    </Section>
  </Section>
</Paper>