File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/88/a88-1030_metho.xml
Size: 20,746 bytes
Last Modified: 2025-10-06 14:12:07
<?xml version="1.0" standalone="yes"?> <Paper uid="A88-1030"> <Title>FINDING CLAUSES IN UNRESTRICTED TEXT BY FINITARY AND STOCHASTIC METHODS</Title> <Section position="4" start_page="210" end_page="223" type="metho"> <SectionTitle> 2. Finding Clauses </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="210" end_page="210" type="sub_section"> <SectionTitle> 2.1 Why Clauses? </SectionTitle> <Paragraph position="0"> Syntactic surface clauses are interesting units of language processing for a variety of reasons. In the surface clause, criteria of form and meaning converge to guarantee both that it can be recognized solely by surface syntactic properties and that it constitutes a meaningful unit (ideally a proposition) in a semantic representation.</Paragraph> <Paragraph position="1"> Clauses have been investigated in psycholinguistic research. Jarvella (1971) found effects of both sentence boundaries and clause boundaries in recall of spoken complex sentences and took them, along with previous results of Jarvella & Pisoni (1970), to support a clause-by-clause view of within-sentence processing.</Paragraph> <Paragraph position="2"> Later research on reading comprehension has found effects on gaze duration not only of word length and word frequency, but also of syntactic local ambiguity (garden paths) and of ends of sentences (Just & Carpenter 1984). However, the study of clause units as distinct from sentence units has not been carried out systematically in psycholinguistic experiments so far, and many basic facts remain to be established about the role of clause units of different kinds in the processes whereby spoken and written language is comprehended.</Paragraph> </Section> <Section position="2" start_page="210" end_page="210" type="sub_section"> <SectionTitle> 2.2 The Definition of A Basic Clause </SectionTitle> <Paragraph position="0"> Finding basic noun phrases is important as a stepping stone to finding clauses, on the assumption that an important subset of them have an initial sequence consisting of a noun phrase followed by a tensed verb as a defining characteristic. The result of scoring the respective success of the two methods of parsing basic noun phrases in sample text portions, reported in Ejerhed (1987), was the following.</Paragraph> <Paragraph position="1"> The regular expression output had 6 errors in 185 noun phrases, i.e. a 3.3% error rate. The stochastic output had 3 errors in 218 noun phrases, i.e. a 1.4% error rate. Both results must be considered good in the absolute sense of an automatic analysis of unrestricted text, but the stochastic method has a clear advantage over the regular expression method.
Basic noun phrases can be found, which is important for clause recognition.</Paragraph> <Paragraph position="2"> The definition of basic clause that was used in this study has the following characteristics: a) it concentrates on certain defining characteristics present at the beginnings of clauses; b) it follows from a particular hypothesis about syntactic working memory: that it is limited to processing one clause at a time; and c) it assumes that the recognition of any beginning of a clause automatically leads to the syntactic closure of the previous clause.</Paragraph> <Paragraph position="3"> It should be clear from the above that the theoretical reasons for pursuing a recursion-free definition of a basic clause have to do with a theory of linguistic performance, rather than with a theory of linguistic competence, in which memory limitations play no part. It is a hypothesis of the author's current clause-by-clause processing theory that a unit corresponding to the basic clause is a stable and easily recognizable surface unit, and that it is also an important partial result and building block in the construction of a richer linguistic representation that encompasses syntax as well as semantics and discourse structure.</Paragraph> </Section> <Section position="3" start_page="210" end_page="221" type="sub_section"> <SectionTitle> 2.3 A Regular Expression for Basic Clauses </SectionTitle> <Paragraph position="0"> Several versions of a regular expression for basic clauses were written by the author and preceded the one presented in Appendix 1, which was</Paragraph> <Paragraph position="2"> applied to 60 files of Brown corpus tagged text of 2000 words each, newspaper texts A01-A20, scientific texts J01-J20 and fiction texts K01-K20. The first half of the definition of *clause* introduces a few auxiliary definitions: comp for a set of complementizers, punct for a set of punctuation marks, and tense for a set of verb forms that are either certainly tensed (&quot;BED&quot; &quot;BEDZ&quot; &quot;BEM&quot; &quot;BER&quot; &quot;BEZ&quot; &quot;DOD&quot; &quot;DOZ&quot; &quot;HVD&quot; &quot;HVZ&quot; &quot;MD&quot; &quot;VBD&quot; &quot;VBZ&quot;) or possibly tensed (&quot;BE&quot; &quot;DO&quot; &quot;HV&quot; &quot;VB&quot;). The definition of clause also uses the previously defined *brown-np-regex*. The second and larger part of the definition of *clause* consists of a union of six concatenations.</Paragraph> <Paragraph position="3"> The first defines complete main clauses as consisting of a sequence of an optional coordinating conjunction CC followed by an obligatory basic noun phrase followed by optional non-clausal complements and an optional adverb followed by an obligatory tensed verb followed by anything except the punctuations or complementizers indicated in the list after (not ...), followed by optional punctuation.</Paragraph> <Paragraph position="4"> The second defines clauses introduced by an obligatory CC followed by an optional adverb followed by an obligatory element which is either a tensed or participial verb form, followed by the same clause ending as in the first definition.</Paragraph> <Paragraph position="5"> The third concatenation defines a subordinate clause as starting with an optional coordinating conjunction followed by an obligatory complementizer followed by the same clause ending as in the first and second definitions.</Paragraph>
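As an illustration of the kind of tag-sequence pattern these concatenations describe, the following minimal Python sketch encodes a simplified version of the first (main clause) concatenation over a flat string of Brown tags. The stand-in noun phrase sub-pattern, the choice of tags, and the string-of-tags input format are assumptions of this sketch, not the actual definitions of *brown-np-regex* or *clause* given in Appendix 1.

import re

# Simplified stand-in for the previously defined *brown-np-regex*:
# an optional run of determiner/adjective tags followed by a noun tag.
NP = r"(?:(?:AT|DT|JJ|PP\$)\s)*(?:NN|NNS|NP|NPS|PPS|PPSS)"

# Tensed verb tags, following the certainly tensed and possibly tensed
# sets named in the definition of tense above (longer alternatives first).
TENSE = r"(?:BEDZ|BED|BEM|BER|BEZ|DOD|DOZ|HVD|HVZ|MD|VBD|VBZ|BE|DO|HV|VB)"

# First concatenation, simplified (the optional non-clausal complements are
# omitted): optional CC, obligatory basic noun phrase, optional adverb,
# obligatory tensed verb, then any tags that are not complementizers or
# punctuation, then optional punctuation.
MAIN_CLAUSE = re.compile(
    rf"(?:CC\s)?{NP}\s(?:RB\s)?{TENSE}(?:\s(?!CS)[A-Z$]+)*(?:\s[.,;:])?"
)

tags = "AT NN VBD AT JJ NN ."   # toy tag sequence, e.g. "The man ate a big cake ."
match = MAIN_CLAUSE.match(tags)
print(match.group(0) if match else "no main clause found")

In the study itself the full definition was compiled by Church's program into a deterministic finite-state automaton; the backtracking regex engine is used here purely for exposition.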
<Paragraph position="6"> The remaining three definitions are of clause fragments rather than full clauses. Consider the following sentence: The man [who liked ice cream,] ate too much.</Paragraph> <Paragraph position="7"> In it, the relative clause makes a basic clause unit that breaks up the main clause into two clause fragments. The fourth concatenation defines noun phrase fragments that begin with a basic noun phrase followed optionally by one or more prepositional phrases, or sequences of CC np or $ np, followed by the same clause ending as in the other definitions. In the example above, [the man] would be a noun phrase fragment.</Paragraph> <Paragraph position="8"> The fifth concatenation defines verb phrase fragments, e.g. [ate too much].</Paragraph> <Paragraph position="9"> The sixth concatenation defines clause fragments that are adjuncts, i.e. adverbial phrases, prepositional phrases and adjective phrases. The typical case in which such a fragment is recognized is when it precedes another clause: [On a clear day,] [you can see forever].</Paragraph> </Section> <Section position="4" start_page="221" end_page="221" type="sub_section"> <SectionTitle> 2.4 Output of Regular Expression for Clauses </SectionTitle> <Paragraph position="0"> The regular expression in Appendix 1 was automatically expanded into a deterministic fsa for clause recognition by Church's program. This rule compilation will not be described here. An excerpt from the result of applying it to the 60 files mentioned in the introduction to this section is presented in Appendix 2, where the location and nature of hand-corrections have been highlighted. The hand-correction was guided by the following principles.</Paragraph> <Paragraph position="1"> 1) There should be at most one tensed verb per clause. This inserts a clause boundary after a tensed clause and before a tensed verb in the following kind of case, which the current regular expression matcher does not capture: [The announcement] [that the President was late] [was made late in the afternoon].</Paragraph> <Paragraph position="2"> 2) There should be a clause boundary after a sentence initial prepositional or adverbial phrase and before the sequence np tensed verb, whether or not they are separated by a comma: [At the summit in Iceland] [Gorbachev insisted ...].</Paragraph> <Paragraph position="3"> 3) There should be a clause boundary before CC followed by a tensed verb. Although the second concatenation in the clause regex aims at capturing such clauses, it is not always successful in doing so because there is no way, given the current implementation of negation in the regular expression program, to state that a clause should end before a concatenation of items, i.e. before (* CC tense). Only single items can be negated at present. Example: [The Purchasing Departments are well operated] [and follow generally accepted practices].</Paragraph> <Paragraph position="4"> 4) There should be a clause boundary before a preposition (IN) followed by a wh-word, i.e. before (* IN (+ WDT WPO WP$ WRB WQL)). For the same reason given under 3), there is no way currently to state that a clause should end before such a sequence. Example: [The City Executive Committee deserves praise for the manner] [in which the election was conducted].</Paragraph>
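Principle 1 lends itself to a mechanical post-processing pass. The following Python sketch, offered only as an illustration and not as the procedure actually used for the hand-corrections, splits a tagged token sequence before every tensed verb after the first; the tag set and the (word, tag) input format are assumptions. In the example, the boundary before that/CS would come from the subordinate clause pattern itself; the pass shown here supplies only the boundary before the second tensed verb.

# At most one tensed verb per clause (hand-correction principle 1).
TENSED = {"BEDZ", "BED", "BEM", "BER", "BEZ", "DOD", "DOZ",
          "HVD", "HVZ", "MD", "VBD", "VBZ"}

def split_at_second_tensed_verb(tagged_tokens):
    """Split a list of (word, tag) pairs so that no resulting piece
    contains more than one tensed verb, opening a new clause
    immediately before each additional tensed verb."""
    pieces, current, seen_tensed = [], [], False
    for word, tag in tagged_tokens:
        if tag in TENSED and seen_tensed:
            pieces.append(current)      # close the current clause here
            current, seen_tensed = [], False
        current.append((word, tag))
        if tag in TENSED:
            seen_tensed = True
    if current:
        pieces.append(current)
    return pieces

example = [("The", "AT"), ("announcement", "NN"), ("that", "CS"),
           ("the", "AT"), ("President", "NN"), ("was", "BEDZ"),
           ("late", "JJ"), ("was", "BEDZ"), ("made", "VBN"),
           ("late", "JJ"), ("in", "IN"), ("the", "AT"), ("afternoon", "NN")]
for piece in split_at_second_tensed_verb(example):
    print("[" + " ".join(word for word, _ in piece) + "]")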
<Paragraph position="7"> Several interesting observations were made in the course of doing these hand-corrections. For one, there were errors in the Brown corpus assignment of tags, in particular several errors confusing VBD and VBN, and there were errors where the sequence TO VB was tagged IN NN.</Paragraph> <Paragraph position="8"> More seriously, it turned out that the words as and like, which have the property of functioning either like prepositions IN or subordinating conjunctions CS, were always tagged CS, thus leading to incorrect recognition of clauses in many cases. Another problem for recognizing clauses on the basis of identifying tensed verbs was that the tag VB is applied to forms that are either infinitival or present tensed (or subjunctive), depending on context. It would have been better if such forms had been considered lexically ambiguous and given distinct tags. However, by and large the tagged Brown corpus is a very good and useful product, both in the choice of tags and in the consistency with which they have been applied. Doing the hand-correction also forced the realization that the clause recognition program, like the noun phrase recognition program, depends crucially on accurate assignment of parts of speech to all words in order to work well. For this task, Church's stochastic parts program is admirably suited, since it gives correct assignments in a very large number of cases, and it holds the potential of further improvement in its performance with further training.</Paragraph> </Section> <Section position="5" start_page="221" end_page="223" type="sub_section"> <SectionTitle> 2.5 Stochastic Recognition of Clauses </SectionTitle> <Paragraph position="0"> As stated before, the regex *clause* was applied to sixty texts in the Brown corpus, and the output was hand-corrected. The hand-corrected files, containing an estimated total of at least 20,000 basic clauses, including clause fragments, were then used as training material for a stochastic recognition program. The training consisted of observing the location of clause opens and clause closes, and a special training specifically in locating tensed verbs. After training, the stochastic parts program and thereafter the stochastic clause recognizer was applied by K. Church to a large amount of Associated Press newswire text from May 26, 1987 (526 blocks, 2381353(8) bytes). An excerpt of the result is presented in Appendix 3. The result, again, is strikingly good.</Paragraph>
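As a minimal sketch of the general idea of learning clause-boundary locations from hand-corrected text (and emphatically not a description of Church's actual program), one can estimate, for each pair of adjacent part-of-speech tags, how often a clause was observed to open between them, and then predict an opening wherever that relative frequency is high enough. The input format, the boolean marking convention and the threshold below are assumptions of the sketch.

from collections import defaultdict

# Learn, from hand-corrected text, the relative frequency with which a
# clause opens between each pair of adjacent tags, then predict clause
# openings where that frequency exceeds a threshold.  A toy illustration
# only, not Church's stochastic clause recognizer.

def train(corpus):
    """corpus: list of sentences, each a list of (tag, opens_clause) pairs,
    where opens_clause is True if the hand-corrected text placed a clause
    opening immediately before this word."""
    pair_count = defaultdict(int)
    open_count = defaultdict(int)
    for sentence in corpus:
        prev = "START"
        for tag, opens_clause in sentence:
            pair_count[(prev, tag)] += 1
            if opens_clause:
                open_count[(prev, tag)] += 1
            prev = tag
    return {pair: open_count[pair] / pair_count[pair] for pair in pair_count}

def predict(tags, model, threshold=0.5):
    """Return the token indices before which a clause opening is predicted."""
    prev, boundaries = "START", []
    for i, tag in enumerate(tags):
        if model.get((prev, tag), 0.0) >= threshold:
            boundaries.append(i)
        prev = tag
    return boundaries

# Toy training data: one sentence with openings before sentence-initial AT
# and before the complementizer CS.
corpus = [[("AT", True), ("NN", False), ("VBD", False),
           ("CS", True), ("PPSS", False), ("VBD", False)]]
model = train(corpus)
print(predict(["AT", "NN", "VBD", "CS", "PPSS", "VBD"], model))   # -> [0, 3]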
<Paragraph position="1"> A comparison of the nature and amount of errors in recognizing basic clauses in a sample of uncorrected regex output, and a sample of output from the stochastic clause program, can be made on the basis of Tables 1 and 2 at the end.</Paragraph> <Paragraph position="2"> It appears that the stochastic program is more successful than the current regular expression method. However, certain improvements in the regex program could change that. What is needed is the facility to process generalized regular expressions, which admit the operations of complement and intersection, in addition to the operations of concatenation, union and Kleene star that characterize regular expressions. In any case there are some interesting differences in the kinds of errors made by the current regex program and the stochastic one for recognizing clauses. The regex program systematically errs by underrecognizing, never by overrecognizing, and in the selected portions that were scored, it only puts a few clause boundaries in the wrong place. It misses lots of clause boundaries, but the ones it gets are mostly correct.</Paragraph> <Paragraph position="3"> The stochastic program, on the other hand, is able to get right many clause boundaries that elude the regular expression matcher, e.g. clauses not introduced by complementizers. The stochastic program errs both by overrecognizing and underrecognizing clauses, and sometimes it also places the clause open or clause close in the wrong place. Some cases of incorrect clause recognition are due to incorrect assignments of parts of speech to words. However, the total number of errors with the stochastic method (21) is smaller than the total number of errors with the regex method (40), for approximately the same number of clauses to be recognized, 304 versus 308. This is a very surprising outcome indeed, and if taken literally, without any further weighting of the different types of errors, it means that the error rate for the stochastic method for recognizing clauses is 6.5%, as compared with 13% for the regex method.</Paragraph> <Paragraph position="5"> 3. On the Relation between Clauses and Intonation Units Finding basic clause units in arbitrary text is necessary in order to locate tonal minor phrases, which, in addition to a phrase accent, also have a boundary tone, and, particularly at slow rates of speech, a pause at the end of the phrase. The current experiment in text analysis has been concerned primarily with informative rather than imaginative prose, and envisages applications of the text-to-speech system to the reading of informative prose like newspaper text.</Paragraph> <Paragraph position="6"> In the current Bell Labs text-to-speech system, tonal minor phrase boundaries are identified on the basis of commas, and tonal major phrase boundaries are identified on the basis of periods. Finding more tonal minor phrase boundaries by using syntactic structure, in addition to punctuation, is the problem we are trying to address with the methods described in this paper. In order to know where tonal minor phrase boundaries actually occur in the reading of informative texts, which typically have very long sentences (an average of 21 words compared with 14 words in general fiction based on Brown corpus data), it would be necessary to make recordings of several persons reading both authentic and prepared texts in a rhetorically explicit way, to borrow a phrase from Beckman & Pierrehumbert (1986), and then make extensive speech analyses of them, particularly of fundamental frequency movements and pauses.</Paragraph> <Paragraph position="7"> In the absence of such data for American English, the following kinds of boundaries between clauses and clause fragments were hypothesized to constitute intonation breaks with the status of tonal minor phrase boundaries.
They are marked with # in the examples below.</Paragraph> <Paragraph position="8"> a) After sentence initial adverbials and before np tense: [At the summit in Iceland] # [Gorbachev insisted...] b) After a relative clause and before a tensed verb: [A House Committee] [which heard his local option proposal] # [is expected] [to give it a favorable report.] c) After other noun phrases with clausal complements and before a tensed verb: [The announcement] [that the President was late] # [was made by the Press Secretary to the waiting journalists.] d) Before a set of complementizers categorized CS in the Brown corpus, it is frequently the case that there is an intonation break: [that/CS ...],</Paragraph> <Paragraph position="10"> However, there are some exceptions to this, in particular: (i) Comparatives: [This is not as/QL fast/JJ] [as/CS I would like ... ] or [The theorem is more/RBR general/JJ] [than/CS what we have described] (ii) The words as/CS and like/CS when used as prepositions, i.e. followed by noun phrases that are not subjects of clauses: [Jenkins left the White House in 1984,] [and joined Wedtech] [as/CS its director of marketing two years later.] For testing purposes, short passages of seven consecutive sentences each from the Brown files, and four sentences each from the AP newswire stories were synthesized by the author, using the Bell Labs text-to-speech system. Those boundaries between clauses and clause fragments that are identified above were implemented in the same way that commas are, i.e. with a phrase accent belonging to the tonal minor phrase, final lengthening, a boundary tone, and a short pause of 200 ms. The results have not yet been subjected to perceptual tests.</Paragraph> <Paragraph position="11"> There are some studies of the relation between clause units and intonation units that provide relevant data for future work. Garding (1967) studied prosodic features in spontaneous and read Swedish speech. She found that in the spontaneous speech, pauses were equally divided between syntactic pauses and hesitation pauses, a syntactic pause being defined as one that coincides with a syntactic boundary. In the read speech, all pauses were syntactic pauses: &quot;They appear between main clause and subordinate clause, before adverbial modifiers and between the different parts of an enumeration. The pause length is shortest in enumerations and before relative clauses (4-10 cs) and longest before adverbial modifiers and between complete sentences.&quot; (p. 48).</Paragraph> <Paragraph position="12"> In a study of the intonational properties of relative clauses in British English, Taglicht (1977) compared the speech of a news broadcast with impromptu speech, and found that both genres separated nonrestrictive relative clauses prosodically. The news broadcast also separated a large proportion (71%) of the restrictive relative clauses prosodically.</Paragraph> <Paragraph position="13"> A recent and very extensive study of the grammatical properties of intonation units, or tone units (TU), is Altenberg (1987). He studied a monologue of 48 minutes duration from the London-Lund Corpus of spoken English, and his results concerning the correlation of clause boundaries and tone unit boundaries are presented in Table 3 at the end.</Paragraph> </Section> </Section>
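As a footnote to the synthesis experiment described above, the following minimal sketch (an assumption-laden illustration, not the actual Bell Labs implementation) shows one crude way to feed the hypothesized boundaries to a comma-driven synthesizer: since that system already realizes a comma as a tonal minor phrase boundary, with a phrase accent, final lengthening, a boundary tone and a short pause, each # marker can simply be rewritten as a comma before synthesis.

import re

def boundaries_to_commas(marked_text):
    """Rewrite '#' clause-boundary markers as commas, so that an unmodified
    comma-driven text-to-speech front end places a tonal minor phrase
    boundary (and its associated pause) at each hypothesized break."""
    # drop any comma already preceding the marker, then substitute ", "
    text = re.sub(r"\s*,?\s*#\s*", ", ", marked_text)
    return re.sub(r"\s+", " ", text).strip()

print(boundaries_to_commas("At the summit in Iceland # Gorbachev insisted ..."))
# -> "At the summit in Iceland, Gorbachev insisted ..."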
<Section position="5" start_page="223" end_page="226" type="metho"> <SectionTitle> 4. Conclusion </SectionTitle> <Paragraph position="0"> The study reported above shows that basic clauses, including basic noun phrases, are stable and surface recognizable units in the definitions they were given here, and that both finitary and stochastic methods can be used to find them in unrestricted text with a high degree of success.</Paragraph> <Paragraph position="1"> The comparison between the error rates of these two methods showed that the stochastic method performed better both in the recognition of basic noun phrases and basic clauses, which is an unexpected result.</Paragraph> <Paragraph position="2"> APPENDIX 1 Regular expression for basic clauses.</Paragraph> <Paragraph position="3"> APPENDIX 2 Sample of output of applying the regular expression *clause*, as defined in Appendix 1, to Brown newspaper story A01. Hand-corrections are marked by double asterisks for underrecognized, and single asterisks for overrecognized clause boundaries.</Paragraph> <Paragraph position="5"/> <Paragraph position="7"> APPENDIX 3 Sample of output of stochastic procedure for finding clause boundaries. Tensed verbs should be in bold face. In the recognition of these clauses, the constraint was enforced that there be at most one tensed verb per clause. Hand-corrections marked as in Appendix 2.</Paragraph> <Paragraph position="9"/> </Section> </Paper>