<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-2005">
  <Title>Scaled log likelihood ratios for the detection of abbreviations in text corpora</Title>
  <Section position="2" start_page="2" end_page="3" type="metho">
    <SectionTitle>
2 Scaling log likelihood ratios
</SectionTitle>
    <Paragraph position="0"> Since a pure log l approach falsely classifies many non-abbreviations as being abbreviations, we use log l as a basic ranking which is scaled by several factors. These factors have been experimentally developed by measuring their effect in terms of precision and recall on a training corpus from WSJ.</Paragraph>
    <Paragraph position="1">  The result of the scaling operation is a much more compact ranking of the true positives in the corpus. The effect of the scaling methods on the data presented in (4) are illustrated in (5).</Paragraph>
    <Paragraph position="2"> By applying the scaling factors, the asymptotic relation to the kh  distribution cannot be retained.</Paragraph>
    <Paragraph position="3"> The threshold value of the classification is hence no longer determined by the kh  distribution, but determined on the basis of the classification results derived from the training corpus. The scaling factors, once they have been determined on the basis of the training corpus, have not been modified any further. In this sense, the method described here can be characterized as a corpusfilter method, where a given corpus is used to  This is the corresponding kh  value for a confidence degree of 99.99 %.</Paragraph>
    <Paragraph position="4">  The training corpus had a size of 6 MB. filter the initial results (cf. Grefenstette 1999:128f.).</Paragraph>
    <Paragraph position="5">  (5) Result of applying scaling factors  In the present setting, applying the scaling factors to the training corpus has led to to a threshold value of 1.0. Hence, a value above 1.0 allows a classification of a given pair as an abbreviation, while a value below that leads to an exclusion of the candidate. An ordering of the candidates from table (5) is given in (6), where the threshold is indicated through the dashed line.</Paragraph>
    <Paragraph position="6">  (6) Ranking according to S(log l)  As can be witnessed in (6), the scaling methods are not perfect. In particular, ounces is still wrongly considered as an initial element of an abbreviation, poiting to a weakness of the approach which will be discussed in section 5.</Paragraph>
  </Section>
  <Section position="3" start_page="3" end_page="7" type="metho">
    <SectionTitle>
3 The scaling factors
</SectionTitle>
    <Paragraph position="0"> We have employed three different scaling factors, as given in (7), (8), and (9).</Paragraph>
    <Paragraph position="1">  factor is applied to the log l of a candidate pair. The weighting factors are formulated in such a way that allows a tension between them (cf. section 3.4). The effect of this tension is that an increase following from one factor may be cancelled out or reduced by a decrease following from the application of another factor, and vice versa.</Paragraph>
    <Paragraph position="2">  additionally weighted by the ranking which is determined by the occurrence of pairs of the form (word, *) in relation to pairs of the form (word, !*). If events of the second type are either rare or at least lower than events of the first type, the scaling factor leads to an increase of the initial log l value.</Paragraph>
    <Paragraph position="3">  in general leads to a reduction of the initial log l value. S  also has a significant effect on log l if the occurrence of word with * equals the occurrence of word without *. In this case, S  will be 0. Since the log l values are multiplied with each scaling factor, a value of 0 for S  will lead to a value of 0 throughout. Hence the pair (word, *) will be excluded from being an abbreviation. This move seems extremely plausible: if  reflecting an even higher likelihood that the pair should actually count as an abbreviation. word occurs approximately the same time with and without a following *, it is quite unlikely that the pair (word, *) forms an abbreviation.  Similarly, the value of S  will be negative if the number of occurrences of word without * is higher than the number of occurrences of word with *. Again, the resulting decrease reflects that the pair (word, *) is even more unlikely to be an abbreviation.</Paragraph>
    <Paragraph position="4"> Both the relative difference (S  ) allow a scaling that abstracts away from the absolute figure of occurrence, which strongly influences log l.</Paragraph>
    <Section position="1" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.3 Length of abbreviations: S
</SectionTitle>
      <Paragraph position="0"> Scaling factor (9), finally, leads to a reduction of log l depending on the length of the word which preceeds a period. This scaling factor follows the idea that an abbreviation is more likely to be short.</Paragraph>
    </Section>
    <Section position="2" start_page="7" end_page="7" type="sub_section">
      <SectionTitle>
3.4 Interaction of scaling factors
</SectionTitle>
      <Paragraph position="0"> As was already mentioned, the scaling factors can interact with each other. Consequently, an increase by a factor may be reduced by another one. This can be illustrated with the pair (U.N,  *) in (6). The application of the scaling factors does not change the value as the initial log l calculation.</Paragraph>
      <Paragraph position="1"> (13) S  , which however is fully compensated by the application of S  .</Paragraph>
      <Paragraph position="2">  Obviously, this assumption is only valid if the absolute number of occurrence is not too small.  As an illustration, consider the pairs (outstanding, *) and (Adm, *). The first pair occurs 260 times in our training corpus, the second one 51 times. While (outstanding, !*) occurs 246 times, (Adm, !*) never occurs. Still, the log l value for (outstanding, *) is 804.34, while the log l value for (Adm, *) is just 289.38, reflecting a bias for absolute numbers of occurrence.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="7" end_page="10" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> The scaling methods described in section 3 have been applied to test corpora from English (Wall Street Journal, WSJ) and German (Neue Zurcher Zeitung, NZZ). The scaled log l was calculated for all pairs of the form (word, *). The test corpora were annotated in the following fashion: If the value was higher than 1, the tag &lt;A&gt; was assigned to the pair. All other candidates were tagged as &lt;S&gt;.</Paragraph>
    <Paragraph position="1">  The automatically classified corpora were compared with their hand-tagged references.</Paragraph>
    <Paragraph position="2">  We have chosen two different types of test corpora: First, we have used two test corpora of an approximate size of 2 and 6 MB, respectively. The WSJ corpus contained 19,776 candidates of the form (word, *); the NZZ corpus contained 37,986 such pairs. Second, we have tried to determine the sensitivity of the present approach to data sparseness. Hence, the approach was applied to ten individual articles from each WSJ and NZZ. For English, these articles contained between 7 and 26 candidate pairs, for German the articles comprised between 16 and 52 pairs. The reference annotation allowed the determination of a baseline which determines the percentage of correctly classified end-of-sentence marks if each pair (word, *) is classified as an end-of-sentence mark.</Paragraph>
    <Paragraph position="3">  The baseline varies from corpus to corpus, depending on a variety of factors (cf. Palmer/Hearst 1997). In the following tables, we have reported two measures: first, the error rate, which is defined in (15), and second, the F measure (cf. van Rijsbergen 1979:174), which is  A tokenizer should treat pairs which have been annotated with &lt;A&gt; as single tokens, while tokens which have been annotated with &lt;S&gt; should be treated as two separate tokens. Three-dot-ellipses are currently not considered. Also &lt;A&gt;&lt;S&gt; tags are not considered in the experiments (cf. section 5).  Following this baseline, we assume that correctly classified end-of-sentence marks count as true positives in the evaluations.</Paragraph>
    <Paragraph position="4"> a weighted measure of precision and recall, as defined in (16).</Paragraph>
    <Section position="1" start_page="10" end_page="10" type="sub_section">
      <SectionTitle>
4.1 Results of first experiment
</SectionTitle>
      <Paragraph position="0"> The results of the classification process for the larger files are reported in table (17). F(B) and F(S) are the F measure of the baseline, and the present approach, respectively. E(B) is the error rate of the baseline, and E(S) is the error rate of the scaled log l approach.</Paragraph>
      <Paragraph position="1">  As (17) shows, the application of the scaled log l leads to significant improvements for both files. In particular, the error rate has dropped from over 30 % to 0.6 % in the WSJ corpus. For both files, the accuracy is beyond 99 %.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>