<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2907"> <Title>Investigating Lexical Substitution Scoring for Subtitle Generation</Title> <Section position="4" start_page="45" end_page="46" type="intro"> <SectionTitle> 2 Background and Setting </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="45" end_page="45" type="sub_section"> <SectionTitle> 2.1 Subtitling </SectionTitle> <Paragraph position="0"> Automatic generation of subtitles is a summarization task at the level of individual sentences or, occasionally, of a few contiguous sentences. Viewers' reading speed and the amount of screen space that can be filled with text without the image becoming too cluttered are the constraints that dynamically determine how much compression, measured in characters, must be achieved in transforming the transcript into subtitles. Subtitling is not a trivial task, and it is expensive and time-consuming when experts have to carry it out manually. As for other NLP tasks, both statistical (machine learning) and linguistic knowledge-based techniques have been considered for this problem. Examples of the former are (Knight and Marcu, 2002; Hori et al., 2002), and of the latter are (Grefenstette, 1998; Jing and McKeown, 1999). A comparison of both approaches in the context of a Dutch subtitling system is provided in (Daelemans et al., 2004). The required sentence simplification is achieved either by deleting material or by paraphrasing parts of the sentence into shorter expressions with the same meaning. As a special case of the latter, lexical substitution is often used to reach a compression target by replacing a word with a shorter synonym. It is on this subtask that we focus in this paper. Table 1 provides a few examples. E.g. 
by substituting &quot;happen&quot; by &quot;occur&quot; (example 3), one character is saved without affecting the sentence meaning.</Paragraph> </Section> <Section position="2" start_page="45" end_page="46" type="sub_section"> <SectionTitle> 2.2 Experimental Setting </SectionTitle> <Paragraph position="0"> The data used in our experiments was collected in the context of the MUSA (Multilingual Subtitling of Multimedia Content) project (Piperidis et al., 2004) and was kindly provided for the current study. The data was provided by the BBC in the form of Horizon documentary transcripts with the corresponding audio and video. The data for two documentaries was used to create a dataset consisting of sentences from the transcripts and the corresponding substitution examples, in which selected words are substituted by a shorter WordNet synonym. More concretely, a substitution example consists of an original sentence s = w1 ... wi ... wn, a specific source word wi in the sentence, and a target (shorter) WordNet synonym w' to substitute for the source. See Table 1 for examples. The dataset consists of 918 substitution examples originating from 231 different sentences.</Paragraph> <Paragraph position="1"> An annotation environment was developed to allow efficient annotation of the substitution examples with the classes true (admissible substitution in the given context) or false (inadmissible substitution).</Paragraph> <Paragraph position="2"> About 40% of the examples were judged as true.</Paragraph> <Paragraph position="3"> Part of the data was annotated by an additional annotator to compute annotator agreement. The Kappa score turned out to be 0.65, corresponding to &quot;Substantial Agreement&quot; (Landis and Koch, 1977). 
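The dataset construction described in the text (for each sentence, propose shorter synonyms of selected words as substitution candidates) can be sketched as follows. This is a minimal illustration, not the paper's implementation: a toy synonym table stands in for WordNet, and the function name, dictionary, and example sentence are hypothetical.

```python
# Toy stand-in for WordNet synonym lookup (hypothetical data).
SYNONYMS = {
    "happen": ["occur", "take place"],
    "consume": ["devour", "use"],
}

def shorter_substitutions(sentence):
    """Return (index, source, target) triples where the target is a
    strictly shorter synonym of the source word, i.e. a candidate
    that saves at least one character of subtitle space."""
    out = []
    for i, word in enumerate(sentence.split()):
        for cand in SYNONYMS.get(word, []):
            if len(word) > len(cand):  # keep only shorter synonyms
                out.append((i, word, cand))
    return out

subs = shorter_substitutions("accidents happen every day")
# substituting "happen" by "occur" saves one character
```

In the paper's setting, each such candidate triple would then be judged true or false by annotators depending on whether the substitution is admissible in context.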
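The agreement statistic reported for the two annotators is Cohen's Kappa; a minimal sketch of its computation over true/false judgments is shown below (the sample labels at the end are illustrative only, not the paper's annotations).

```python
from collections import Counter

def cohen_kappa(ann_a, ann_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    # observed agreement: fraction of items with identical labels
    p_o = sum(x == y for x, y in zip(ann_a, ann_b)) / n
    # chance agreement: dot product of the two label distributions
    ca, cb = Counter(ann_a), Counter(ann_b)
    p_e = sum(ca[lab] * cb[lab] for lab in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# illustrative labels only; with these, kappa = 0.5
kappa = cohen_kappa([True, True, False, False],
                    [True, False, False, False])
```

A Kappa of 0.65, as reported in the text, falls in the 0.61-0.80 band that Landis and Koch label "substantial agreement".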
Since some of the methods we are comparing need tuning, we held out a random subset of 31 original sentences (with 121 corresponding examples) for development and kept the remaining 797 substitution examples for testing.

Table 1: Substitution examples (the sentence for example 3 was lost in extraction; examples 4 and 5 share the same sentence).

id | sentence | source | target | judgment
3 | | happen | occur | true
4 | we delay the movement of the subject's left hand | subject | topic | false
5 | | subject | theme | false
6 | people weren't laughing they were going stone sober. | stone | rock | false
7 | if we can identify a place where the seizures are coming from then we can go in and remove just that small area. | identify | place | false</Paragraph> <Paragraph position="4"> 8 | my approach has been the first to look at the actual structure of the laugh sound. | approach | attack | false
9 | He quickly ran into an unexpected problem. | problem | job | false
10 | today American children consume 5 times more Ritalin than the rest of the world combined | consume | devour | false</Paragraph> </Section> </Section> </Paper>