<?xml version="1.0" standalone="yes"?>
<Paper uid="A88-1030">
<Title>FINDING CLAUSES IN UNRESTRICTED TEXT BY FINITARY AND STOCHASTIC METHODS</Title>
<Section position="3" start_page="0" end_page="210" type="intro">
<SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> The present paper describes the procedure that was followed in an extended experiment to reliably find basic surface clauses in unrestricted English text, using various combinations of finitary and stochastic methods. The purpose was to make some improvements in the detection and treatment of large prosodic units above the level of fgroups in the Bell Labs text-to-speech system. This system currently relies exclusively on punctuation (commas and periods) for the detection of such units, i.e. tonal minor and major phrases. Commas are correlated with tonal minor phrases, and sentence-final periods with tonal major phrases. The notion of fgroup (one or more function words followed by one or more content words), and its implementation in the Bell Labs text-to-speech system, are described in Liberman & Buchsbaum (1985).</Paragraph>
<Paragraph position="1"> Correct automatic detection of major syntactic boundaries, in particular clause boundaries, is a prerequisite for automatic insertion of final lengthening, boundary tones and pauses at such boundaries within sentences (cf. Allen, Hunnicutt & Klatt 1987, and Altenberg 1987). These prosodic phenomena make a significant contribution to the naturalness and intelligibility of synthetic speech. Unfortunately, the task of parsing unrestricted text correctly, in order to find the relevant sentence-internal syntactic boundaries, has turned out to be very difficult.</Paragraph>
<Paragraph position="2"> This paper is a report of an attempt to provide a better foundation for parsing text by the use of simple finitary and stochastic computational methods. These simple methods have not figured prominently in the theory and practice of natural language parsing, with some exceptions (Langendoen 1975, Church 1982, Ejerhed & Church 1983). For an experimental and more complicated method to derive all prosodic units in the text-to-speech system, i.e. not just tonal minor and major phrases but every type of prosodic unit, from the syntactic structure and length of constituents, see Wright, Bachenko & Fitzpatrick (1986).</Paragraph>
<Paragraph position="3"> The first purpose of the experiment was to test the performance of a finite state parser, when the parser was given the rather difficult and substantive task of finding basic, non-recursive clauses in continuous text, in which each word had been tagged with a part of speech label.</Paragraph>
<Paragraph position="4"> Parts of the tagged Brown corpus were used, representing the genres of both informative and imaginative prose. The clause grammar, consisting of a regular expression for clauses of different kinds, was constructed by the author, and it was first applied to text that was guaranteed to have correct parts of speech assigned to the words, so that problems in constructing the grammar could be isolated from problems in assigning correct parts of speech.</Paragraph>
<Paragraph position="5"> The finite state parser that used the clause grammar consisted of a program that matched regular expressions for clauses against the longest substrings of tagged words that fit them, and it was constructed and implemented by K. Church.</Paragraph>
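To make the finite state method concrete, the following is a minimal sketch, in Python and not taken from the system described here, of matching a regular expression over part-of-speech tags and taking the longest substring that fits at each starting position. The tagset, the clause pattern, and the function find_basic_clauses are all invented for illustration; the clause grammar actually used in the experiment is considerably richer.

```python
import re

# Toy illustration only: NOT the clause grammar of the paper, just a sketch of
# the general technique it describes -- matching a regular expression over
# part-of-speech tags and taking, at each starting position, the longest
# substring of tagged words that fits the pattern.

# Map each (simplified, invented) tag to a single character so that a standard
# regex engine can treat the tag sequence as a string.
TAG_CODES = {
    "DET": "d", "ADJ": "j", "NOUN": "n", "PRON": "p",
    "VERB": "v", "ADV": "r", "PREP": "i", "PUNCT": "x",
}

# A deliberately crude "basic clause" pattern: an optional subject noun phrase,
# a verb group, and any number of post-verbal noun or prepositional phrases.
NP = r"(?:p|d?j*n+)"
CLAUSE = re.compile(rf"{NP}?r*v+r*(?:{NP}|i{NP})*")

def find_basic_clauses(tagged_words):
    """Return (start, end) word indices of clause matches.

    `tagged_words` is a list of (word, tag) pairs whose tags are keys of
    TAG_CODES.  Greedy quantifiers make each match as long as the pattern
    allows from its starting position.
    """
    tag_string = "".join(TAG_CODES[tag] for _, tag in tagged_words)
    spans = []
    pos = 0
    while pos < len(tag_string):
        m = CLAUSE.search(tag_string, pos)
        if m is None:
            break
        spans.append((m.start(), m.end()))
        # Every match consumes at least the verb, so this always advances.
        pos = m.end() if m.end() > pos else pos + 1
    return spans

if __name__ == "__main__":
    sentence = [
        ("the", "DET"), ("old", "ADJ"), ("system", "NOUN"),
        ("relies", "VERB"), ("on", "PREP"), ("punctuation", "NOUN"),
    ]
    for start, end in find_basic_clauses(sentence):
        print(" ".join(word for word, _ in sentence[start:end]))
```

On the toy sentence above, the sketch brackets the whole string "the old system relies on punctuation" as a single basic clause, since the greedy quantifiers extend each match as far as the pattern allows, approximating the longest-substring behaviour described for the actual parser.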
<Paragraph position="7"> The second purpose was to see whether basic clauses could also be recognized by stochastic programs, after these had been trained on suitable training material. The training material was prepared by hand-correcting the output of the program that processed the regular expressions for clauses. A stochastic program for assigning unique part of speech tags to words in unrestricted text had been created by K. Church, and trained on the tagged Brown corpus (see Church 1987). The resultant program is 95-99% correct in its performance, depending on the criteria of correctness used, and it can be used as a lexical front end to any kind of parser, i.e. not necessarily a stochastic or finite state parser.</Paragraph>
<Paragraph position="8"> However, the question arose whether the stochastic procedure that was so successful in recognizing parts of speech could also be applied to more advanced tasks such as recognizing noun phrases and clauses. The present paper concentrates on the parsing of basic clauses. The parsing of noun phrases by the same two methods is compared in Ejerhed (1987), and the stochastic parsing of noun phrases is described in detail in Church (1987).</Paragraph>
<Paragraph position="9"> The structure of the paper is as follows. Section 2 defines the target notion of a basic clause, and reports on the outcome of the search for such units by the two methods. Section 3 discusses the correlations between clause units as defined by this paper, and the prosodic units of tonal minor and major phrases in the Bell Labs text-to-speech system.</Paragraph>
</Section>
</Paper>