<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1015">
  <Title>Sentence Compression for Automated Subtitling: A Hybrid Approach</Title>
  <Section position="4" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> The evaluation of a sentence compression module is not an easy task. The output of the system needs to be judged manually for its accuracy, which is very time consuming. Unlike (Jing, 2001), we do not compare the system results with human sentence reductions. Jing reports a success rate of 81.3% for her program, but this measure is calculated as the percentage of decisions on which the system agrees with the decisions taken by the human summarizer.</Paragraph>
    <Paragraph position="1"> This means that 81.3% of all system decisions are correct, but does not say anything about how many sentences are correctly reduced.</Paragraph>
    <Paragraph position="2"> In our evaluation we do not expect the compressor to simulate human summarizer behaviour. The results presented here are calculated at the sentence level: the number of validly reduced sentences, i.e. those reductions which human raters judge to be accurate: grammatical sentences with (more or less) the same meaning as the input sentence, taking into account the meaning of the previous sentences on the same topic.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Method
</SectionTitle>
      <Paragraph position="0"> To estimate the number of characters available in a subtitle, it is necessary to estimate the average pronunciation time of the input sentence when it is not known. We estimate sentence duration by counting the number of syllables in a sentence and multiplying this by the average duration per syllable (ASD).</Paragraph>
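This estimate amounts to a one-line computation; a minimal Python sketch (the function name and the default ASD value are our own choices, not part of the original system):

```python
def estimate_duration_ms(syllable_count, asd_ms=177):
    """Estimated pronunciation time in milliseconds: number of
    syllables times the average duration per syllable (ASD)."""
    return syllable_count * asd_ms
```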
      <Paragraph position="1"> The ASD for Dutch is reported to be about 177 ms (Koopmans-van Beinum and van Donzel, 1996), which is the syllable speed without including pauses between words or sentences.</Paragraph>
      <Paragraph position="2"> We carried out similar research on CGN, using the ASD as the unit of analysis and considering both the situation without pauses and the situation with pauses included. The results of this research are presented in table 2.</Paragraph>
      <Paragraph position="3"> [Table 2: ASD without pauses vs. with pauses included.] We extract the word durations from all the files in each component of CGN. A description of the components can be found in (Oostdijk et al., 2002). We created a syllable counter for Dutch words, which we evaluated on all words in the CGN lexicon. For 98.3% of all words in the lexicon, the syllables are counted correctly. Most errors occur in very low-frequency words or in foreign words.</Paragraph>
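The paper does not describe how its syllable counter works; a naive approximation for Dutch counts maximal clusters of vowel letters. This sketch is our own assumption, not the authors' implementation, and will miss cases such as diaeresis-marked hiatus:

```python
import re

def count_syllables(word):
    """Naive Dutch syllable estimate (assumed method): count maximal
    clusters of vowel letters; every word gets at least one syllable."""
    clusters = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(clusters))
```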
      <Paragraph position="4"> By combining word duration information and the number of syllables we can calculate the average speaking speed.</Paragraph>
      <Paragraph position="5"> We evaluated sentence compression in three different conditions. The fastest ASD in our ASD research was 185 ms (one speaker, no pauses); this was used for Condition A, and we consider it the maximum speaking speed for Dutch.</Paragraph>
      <Paragraph position="6"> The slowest ASD (256 ms) was used for Condition C. We consider this ASD to be the minimum speed for Dutch.</Paragraph>
      <Paragraph position="7"> We created a test set of 100 sentences, mainly from news broadcasts, for which we used the real pronunciation time of each sentence; this results in an ASD of 192 ms. This ASD was used for Condition B and is considered the real speed for news broadcasts.</Paragraph>
      <Paragraph position="8"> We created a test set of 300 sentences, of which 200 were taken from transcripts of television news and 100 from the 'broadcast news' component of CGN.</Paragraph>
      <Paragraph position="9"> To evaluate the compressor, we estimate the duration of each sentence by counting the number of syllables and multiplying that number by the ASD for that condition. This leads to an estimated pronunciation time, which is then converted into the number of characters available for the subtitle. We know that the average time for subtitle presentation at the VRT (Flemish Broadcasting Corporation) is 70 characters in 6 seconds, which gives an average of 11.67 characters per second.</Paragraph>
      <Paragraph position="10"> So, for example, a test sentence of 15 syllables gives an estimated pronunciation time of 2.775 seconds (15 syllables × 185 ms/syllable) in condition A. Converting this to available characters, we multiply 2.775 seconds by 11.67 characters/second, resulting in 32 available characters (2.775 s × 11.67 ch/s = 32.4 ch).</Paragraph>
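The worked example above can be reproduced directly; in this sketch (the function name is ours) the character rate is kept as the exact fraction 70/6 rather than the rounded 11.67:

```python
def available_characters(syllable_count, asd_ms, chars_per_second=70.0 / 6.0):
    """Character budget for a subtitle: estimated pronunciation time
    (syllables times ASD) converted to characters at the VRT rate of
    70 characters per 6 seconds."""
    duration_s = syllable_count * asd_ms / 1000.0
    return duration_s * chars_per_second
```

For the example in the text, available_characters(15, 185) gives about 32.4, i.e. a budget of 32 characters.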
      <Paragraph position="11"> In condition B (considered to be real-time), for the part of the test sentences coming from CGN, the pronunciation time was not estimated, as it was available in CGN.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> The results of our experiments on the sentence compression module are presented in table 3.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Sentence Level
</SectionTitle>
      <Paragraph position="0"> The sentence compressor does not generate output for all test sentences in all conditions: in those cases where no output was generated, the compressor was not able to generate a sentence alternative shorter than the maximum number of characters available for that sentence.</Paragraph>
      <Paragraph position="1"> The cases where no output is generated are not considered errors, because it is often impossible, even for humans, to reduce a sentence by about 40% without changing the content too much. The number of test sentences for which no output was generated is presented in table 3. The high percentage of such sentences in conditions A and B is most probably due to the fact that the compression rates in these conditions are higher than they would be in a real-life application. Condition C seems to be closer to the real-life compression rate needed in subtitling.</Paragraph>
      <Paragraph position="2"> Each condition has an average reduction rate over the 300 test sentences. This reduction rate is based on the number of characters available in the subtitle and the number of characters in the source sentence.</Paragraph>
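Under the natural reading of this definition (our assumption; the paper does not spell out the formula), the reduction rate is the fraction of source characters that must be cut to fit the budget:

```python
def reduction_rate(source_chars, available_chars):
    """Fraction of the source sentence that must be cut to fit the
    subtitle character budget (assumed formula, not given in the paper)."""
    return 1.0 - available_chars / source_chars
```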
      <Paragraph position="3"> A rater scores a compressed sentence as + when it is grammatically correct and semantically equivalent to the input sentence; no essential information may be missing. A sentence is scored as +/- when it is grammatically correct and some information is missing, but that information is clear from the context in which the sentence occurs. All other compressed sentences are scored as -.</Paragraph>
      <Paragraph position="4"> Each sentence is evaluated by two raters. The lower of the two raters' scores is the score the sentence receives. Interrater agreement is calculated on a two-point scale: if both raters score a sentence as + or +/-, or both raters score a sentence as -, it is considered an agreed judgement. Interrater agreement results are presented in table 3.</Paragraph>
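The scoring and agreement rules above can be made precise in a few lines (the function names are ours; the paper only describes the rules in prose):

```python
def combine_ratings(rater1, rater2):
    """A sentence receives the lower of its two raters' scores,
    on the ordering - below +/- below +."""
    order = {"-": 0, "+/-": 1, "+": 2}
    return min(rater1, rater2, key=order.get)

def raters_agree(rater1, rater2):
    """Two-point agreement: both scores acceptable (+ or +/-),
    or both unacceptable (-)."""
    return (rater1 == "-") == (rater2 == "-")
```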
      <Paragraph position="5"> Sentence compression results are presented in table 3. We consider both the + and +/- results as reasonable compressions.</Paragraph>
      <Paragraph position="6"> The resulting percentages of reasonable compressions seem rather low, but one should keep in mind that these results are calculated at the sentence level. One small mistake in a sentence can lead to an inaccurate compression, even though the majority of the decisions taken in the compression process may still be correct. This makes it very hard to compare our results to those presented by Jing (2001), but we present our results as sentence evaluations because this gives a clearer idea of how well the system would actually perform in a real-life application.</Paragraph>
      <Paragraph position="7"> As we do not try to imitate human subtitling behaviour, but rather to develop an equivalent approach, our system is not evaluated in the same way as the system devised by Jing.</Paragraph>
    </Section>
  </Section>
</Paper>