<?xml version="1.0" standalone="yes"?> <Paper uid="A00-3002"> <Title>Efficient parsing strategies for syntactic analysis of closed captions</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In this paper we present ongoing research on parsing closed captions (subtitles) from a news broadcast. The research has been conducted as part of an effort to build a prototype of a real-time Machine Translation (MT) system translating news captions from English into Cantonese (Nyberg and Mitamura, 1997). We describe an efficient multi-level chart parser designed to handle the kind of language used in our domain quickly enough to allow real-time automatic translation. To achieve high parsing speed, we divided an existing English grammar into multiple levels. The parser proceeds in stages, and at each stage only the rules belonging to one level are used. A constituent pruning step is added between levels to ensure that constituents unlikely to be part of the final parse are removed. This results in a significant reduction in both parse time and ambiguity. Since the domain is unrestricted, out-of-coverage sentences are to be expected, and the parser might not produce a single analysis spanning the whole input. Thus, the set of final constituents has to be extracted from the chart.</Paragraph> <Paragraph position="1"> Despite the incomplete parsing strategy and the radical pruning, the initial evaluation results show that the loss of parsing accuracy is acceptable. The parsing time compares favorably with that of a Tomita parser and of a chart parser run on the same grammar and lexicon.</Paragraph> <Paragraph position="2"> The outline of the paper is as follows. In Section 2 we describe the syntactic and semantic characteristics of the input domain. Section 3 provides a short summary of previously published research. 
Section 4 gives an overview of the requirements that our application imposes on the parsing algorithm. Section 5 describes how the grammar was partitioned into levels. Section 6 describes the constituent pruning algorithm that we used. In Section 7 we present the method for extracting final constituents from the chart. Section 8 presents the results of an initial evaluation. Finally, Section 9 closes with directions for future research.</Paragraph> </Section> </Paper>