<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1015"> <Title>Sentence Compression for Automated Subtitling: A Hybrid Approach</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 From Full Sentence to Compressed Sentence </SectionTitle> <Paragraph position="0"> The sentence compression tool is inspired by (Jing, 2001). Although her goal is text summarization rather than subtitling, her sentence compression system could serve this purpose.</Paragraph> <Paragraph position="1"> Her sentence reduction draws on multiple knowledge sources. She makes use of a corpus of sentences aligned with human-written sentence reductions, which is similar to the parallel corpus we use (Vandeghinste and Tjong Kim Sang, 2004). She applies a syntactic parser to analyse the syntactic structure of the input sentences. As no syntactic parser was available for Dutch (Daelemans and Strik, 2002), we created ShaRPa (Vandeghinste, submitted), a shallow rule-based parser which gives us a shallow parse tree of the input sentence. Jing uses several other knowledge sources, which are either not used in our system (not available for Dutch) or not yet used (like WordNet). Figure 1 sketches the processing flow of an input sentence.</Paragraph> <Paragraph position="2"> First we describe how the sentence is analysed (2.1), then how the actual sentence compression is done (2.2), and after that how words can be reduced for extra compression (2.3). The final part describes the selection of the output sentence (2.4).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Sentence Analysis </SectionTitle> <Paragraph position="0"> In order to apply an accurate sentence compression, we need a syntactic analysis of the input sentence.</Paragraph> <Paragraph position="1"> In a first step, the input sentence gets tagged for parts-of-speech. 
Before that, it needs to be transformed into a valid input format for the part-of-speech tagger. The tagger we use is TnT (Brants, 2000), a hidden Markov trigram tagger, which was trained on the Spoken Dutch Corpus (CGN), Internal Release 6. The accuracy of TnT trained on CGN is reported to be 96.2% (Oostdijk et al., 2002).</Paragraph> <Paragraph position="2"> In a second step, the sentence is sent to the Abbreviator. This tool connects to a database of common abbreviations which are often pronounced as full words (e.g. European Union becomes EU) and replaces the full form with its abbreviation. The database can also contain the tag of the abbreviated part (e.g. the tag for EU is N(eigen,zijd,ev,basis,stan) [E: singular non-neuter proper noun]).</Paragraph> <Paragraph position="3"> In a third step, all numbers which are written in words in the input are replaced by their form in digits. This is done for all numbers smaller than one million, both cardinal and ordinal numerals. In a fourth step, the sentence is sent to ShaRPa, which results in a shallow parse tree of the sentence. The chunking accuracy for noun phrases (NPs) has an F-value of 94.7%, while that for prepositional phrases (PPs) has an F-value of 95.1% (Vandeghinste, submitted).</Paragraph> <Paragraph position="4"> A last step before the actual sentence compression consists of rule-based clause detection: relative phrases (RELP), subordinate clauses (SSUB) and OTI-phrases (OTI is om ... te + infinitive1) are detected. The accuracy of these detections was evaluated on 30 files from the CGN component of read-aloud books, which contained 7880 words. The evaluation results are presented in table 1.</Paragraph> <Paragraph position="5"> The errors are mainly due to a wrong analysis of coordinating conjunctions, which is a weak point not only in the clause-detection module, but also in ShaRPa. 
A full parse is needed to solve this problem accurately.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Sentence Compression </SectionTitle> <Paragraph position="0"> For each chunk or clause detected in the previous steps, the probabilities of removal, non-removal and reduction are estimated. This is described in more detail in 2.2.1.</Paragraph> <Paragraph position="1"> Besides the statistical component in the compression, there are also a number of rules in the compression program, which are described in more detail in 2.2.2.</Paragraph> <Paragraph position="2"> The way the statistical component and the rule-based component are combined is described below. Chunk and clause removal, non-removal and reduction probabilities are estimated from the frequencies of removal, non-removal and reduction of certain types of chunks and clauses in the parallel corpus. The parallel corpus consists of transcripts of television programs on the one hand and the subtitles of those programs on the other. A detailed description of how the parallel corpus was collected and how the sentences and chunks were aligned is given in (Vandeghinste and Tjong Kim Sang, 2004).</Paragraph> <Paragraph position="3"> All sentences in the source corpus (transcripts) and the target corpus (subtitles) are analysed in the same way as described in section 2.1, and are chunk-aligned. The chunk alignment accuracy is about 95% (F-value).</Paragraph> <Paragraph position="4"> We estimated the removal, non-removal and reduction probabilities for chunks of the types NP, PP, adjectival phrase (AP), SSUB, RELP, and OTI, based on their chunk removal, non-removal and reduction frequencies.</Paragraph> <Paragraph position="5"> For tokens not belonging to any of these types, the removal and non-removal probabilities were estimated based on the part-of-speech tag of those words. 
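The frequency-based estimation of these measures can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `estimate_measures`, the outcome labels and the toy observation counts are all invented for the example.

```python
from collections import Counter

def estimate_measures(aligned_chunks):
    """Estimate removal, non-removal and reduction measures per chunk type
    from (chunk_type, outcome) observations in a chunk-aligned corpus.
    Outcomes used here: 'removed', 'kept', 'reduced' (invented labels)."""
    counts = Counter(aligned_chunks)
    totals = Counter(ctype for ctype, _ in aligned_chunks)
    return {
        (ctype, outcome): counts[(ctype, outcome)] / totals[ctype]
        for (ctype, outcome) in counts
    }

# Invented toy observations: 10 NP chunks seen in transcript/subtitle pairs.
observations = [("NP", "kept")] * 6 + [("NP", "removed")] * 3 + [("NP", "reduced")]
measures = estimate_measures(observations)  # e.g. measures[("NP", "kept")] == 0.6
```

In the real system the three measures need not sum to 1, since an aligned target chunk can also be a paraphrase of, or even longer than, the source chunk; the sketch above simplifies this away by counting only the three outcomes.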
A reduced tagset was used, as the original CGN tagset (Van Eynde, 2004) is too fine-grained and would multiply the number of rules now used in ShaRPa. The first step in ShaRPa consists of this reduction.</Paragraph> <Paragraph position="6"> For the PPs, the SSUBs and the RELPs, as well as for the adverbs, the chunk/tag information was considered not fine-grained enough. The estimation of the removal, non-removal and reduction probabilities for these types is therefore based on the first word of those phrases/clauses and the reduction, removal and non-removal probabilities of such phrases in the parallel corpus, as the first words of these chunk types are almost always their heads. This makes it possible, for instance, to distinguish between several adverbs in one sentence, so that they do not all get the same removal and non-removal probabilities. A disadvantage is that this approach leads to sparse data for the less frequent adverbs, for which a default value (the average over all adverbs) is employed.</Paragraph> <Paragraph position="7"> An example: a noun phrase.</Paragraph> <Paragraph position="8"> de grootste Belgische bank [E: the largest Belgian bank] After tagging and chunking the sentence and after detecting subordinate clauses, for every non-terminal node in the shallow parse tree we retrieve the measure of removal, of non-removal and of reduction2. For the terminal nodes, only the measures of removal and of non-removal are used.</Paragraph> <Paragraph position="10"> For every combination the probability estimate is calculated. So if we generate all possible compressions (including no compression), the phrase de grootste Belgische bank gets as probability estimate the product of the measures of its nodes. 
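The generation and ranking of compression alternatives can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works with keep/drop decisions on flat tokens rather than on the full shallow parse tree with reduction, and all measure values are invented.

```python
from itertools import product

# Toy flat analysis of the example phrase: each token carries invented
# non-removal ("keep") and removal ("drop") measures.
tokens = [
    ("de", {"keep": 0.93, "drop": 0.07}),
    ("grootste", {"keep": 0.59, "drop": 0.41}),
    ("Belgische", {"keep": 0.59, "drop": 0.41}),
    ("bank", {"keep": 0.95, "drop": 0.05}),
]

def alternatives(tokens):
    """Generate every compression (including no compression) with its
    probability estimate: the product of the chosen per-node measures."""
    for choices in product(["keep", "drop"], repeat=len(tokens)):
        estimate = 1.0
        words = []
        for (word, measures), choice in zip(tokens, choices):
            estimate *= measures[choice]
            if choice == "keep":
                words.append(word)
        yield " ".join(words), estimate

# Rank all 2^4 = 16 alternatives by their probability estimate.
ranked = sorted(alternatives(tokens), key=lambda pair: -pair[1])
```

With these invented values the full phrase outranks the compressed variants; in the real system the grammaticality rules then filter this ranked list.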
For the phrase de Belgische bank the probability estimate is likewise the product of the measures of its nodes, and so on for the other alternatives.</Paragraph> <Paragraph position="11"> In this way, the probability estimate of every possible alternative is calculated.</Paragraph> <Paragraph position="12"> As the statistical information allows the generation of ungrammatical sentences, a number of rules were added to avoid generating such sentences. The procedure keeps the necessary tokens for each kind of node. The rules were built in a bootstrapping manner. In some of these rules, this procedure is applied recursively. These are the rules implemented in our system: - If a node is of type SSUB or RELP, keep the first word.</Paragraph> <Paragraph position="13"> - If a node is of type S, SSUB or RELP, keep - the verbs. If there are prepositions which are particles of the verb, keep those prepositions. If there is a prepositional phrase whose preposition is in the complements list of the verb, keep the necessary tokens3 of that prepositional phrase.</Paragraph> <Paragraph position="14"> 2 These measures are estimated probabilities and do not need to add up to 1, because in the parallel training corpus a match was sometimes detected with a chunk which was not a reduction of the source chunk and not identical to it: the chunk could be paraphrased, or even have become longer.</Paragraph> <Paragraph position="15"> 3 Recursive use of the rules. - each token which is in the list of negative words. These words are kept to avoid altering the meaning of the sentence by dropping words which negate it. 
- the necessary tokens of the te + infinitives (TI).</Paragraph> <Paragraph position="16"> - the conjunctions.</Paragraph> <Paragraph position="17"> - the necessary tokens of each NP.</Paragraph> <Paragraph position="18"> - the numerals.</Paragraph> <Paragraph position="19"> - the adverbially used adjectives.</Paragraph> <Paragraph position="20"> - If a node is of type NP, keep - each noun.</Paragraph> <Paragraph position="21"> - each nominalised adjectival phrase.</Paragraph> <Paragraph position="22"> - each token which is in the list of negative words.</Paragraph> <Paragraph position="23"> - the determiners.</Paragraph> <Paragraph position="24"> - the numerals.</Paragraph> <Paragraph position="25"> - the indefinite prenominal pronouns.</Paragraph> <Paragraph position="26"> - If a node is of type PP, keep - the preposition.</Paragraph> <Paragraph position="27"> - the determiners.</Paragraph> <Paragraph position="28"> - the necessary tokens of the NPs.</Paragraph> <Paragraph position="29"> - If the node is of type adjectival phrase, keep - the head of the adjectival phrase.</Paragraph> <Paragraph position="30"> - the prenominal numerals.</Paragraph> <Paragraph position="31"> - each word which is in the list of negative words.</Paragraph> <Paragraph position="32"> - If the node is of type OTI, keep - the verbs.</Paragraph> <Paragraph position="33"> - the te + infinitives.</Paragraph> <Paragraph position="34"> - If the node is of type TI, keep the node. - If the node is a time phrase4, keep it. These rules were chosen because tests on earlier versions of the system, using a different test set, generated ungrammatical output. By using these rules the output should be grammatical, provided that the input sentence was analysed correctly. 4 A time phrase, as defined in ShaRPa, is used for special phrases like dates and times, e.g. 
27 september 1998, kwart voor drie [E: quarter to three].</Paragraph> <Paragraph position="35"> In the current version of the system, a first stage generates all variations on a sentence in the statistical part and ranks them according to their probability. In a second stage, all ungrammatical sentences are (or should be) filtered out, so that the only sentence alternatives which remain should be grammatical ones.</Paragraph> <Paragraph position="36"> This holds only if both tagging and chunking were correct. If errors are made at these levels, an ungrammatical alternative can still be generated.</Paragraph> <Paragraph position="37"> For efficiency reasons, a future version of the system should combine the rules and statistics in one stage, so that the statistical module only generates grammatically valid sentence alternatives. This has no effect on correctness, as the resulting sentence alternatives would be the same.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Word Reduction </SectionTitle> <Paragraph position="0"> After the generation of several grammatical reductions, ordered according to the probability estimated as the product of the removal, non-removal and reduction probabilities of all their chunks, every word in every compressed alternative of the sentence is checked to see whether it can be reduced.</Paragraph> <Paragraph position="1"> The words are sent to a WordSplitter module, which takes a word as input and checks whether it is a compound by trying to split it into two parts: the modifier and the head. This is done by lexicon lookup of both parts. If such a split is possible, it is checked whether the modifier and the head can be recompounded according to the word formation rules for Dutch (Booij and van Santen, 1995; Haeseryn et al., 1997). 
This is done by sending the modifier and the head to a WordBuilding module, which is described in more detail in (Vandeghinste, 2002).</Paragraph> <Paragraph position="2"> This is a hybrid module combining the compounding rules with statistical information about the frequency of compounds with the same head, the frequency of compounds with the same modifier, and the number of different compounds with the same head.</Paragraph> <Paragraph position="3"> Only if this module allows the recomposition of the modifier and the head is the word considered a compound, which can then potentially be reduced to its head by removing the modifier.</Paragraph> <Paragraph position="4"> If a word occurs in a database containing a list of compounds which should not be split up, it cannot be reduced. For example, the word voetbal [E: football] can be split up and recompounded according to the word formation rules for Dutch (voet [E: foot] and bal [E: ball]), but we should not replace voetbal with bal if we want an accurate compression with the same meaning as the original sentence, as this would alter the meaning too much.</Paragraph> <Paragraph position="5"> The word voetbal has (at least) two different meanings: soccer and the ball with which soccer is played. Reducing it to bal would keep only the second meaning. The word gevangenisstraf [E: prison sentence] can be split up and recompounded (gevangenis [E: prison] and straf [E: punishment]). We can replace the word gevangenisstraf by the word straf. 
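The splitting-and-blocking logic for these two examples can be sketched as follows. This is a minimal illustration under stated assumptions: the tiny lexicon, the no-split list and the function name `reduce_compound` are invented, and the statistical checks of the real WordBuilding module are omitted.

```python
# Invented toy lexicon and no-split list; the real system uses a full Dutch
# lexicon plus the statistical WordBuilding module (Vandeghinste, 2002).
LEXICON = {"voet", "bal", "voetbal", "gevangenis", "straf", "gevangenisstraf"}
NO_SPLIT = {"voetbal"}  # compounds whose reduction would change the meaning

def reduce_compound(word):
    """Return the head of `word` if it may be reduced to it, else None."""
    if word in NO_SPLIT:
        return None
    for i in range(1, len(word)):
        modifier, head = word[:i], word[i:]
        # Both parts must be known words (lexicon lookup of both parts).
        if modifier in LEXICON and head in LEXICON:
            return head
    return None

print(reduce_compound("gevangenisstraf"))  # straf
print(reduce_compound("voetbal"))          # None (blocked by the no-split list)
```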
This would still alter the meaning of the sentence, but not to the same extent as in the case of voetbal.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Selection of the Compressed Sentence </SectionTitle> <Paragraph position="0"> Applying all the steps described in the previous sections results in an ordered list of sentence alternatives, which should be grammatically correct.</Paragraph> <Paragraph position="1"> When word reduction was possible, the word-reduced alternative is inserted in this list, just after its full-word equivalent.</Paragraph> <Paragraph position="2"> The first sentence in this list with a length smaller than the maximal length (which depends on the available presentation time) is selected.</Paragraph> <Paragraph position="3"> In a future version of the system, the word reduction information could be integrated more tightly with the rest of the module, by combining the probability of reduction/non-reduction of a word with the probability of the sentence alternative. The reduction probability of a word would then contribute to the estimated probability of the compressed sentence alternative containing this reduced word.</Paragraph> </Section> </Section> </Paper>