<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1073"> <Title>THE USE OF PROSODY IN SYNTACTIC DISAMBIGUATION</Title> <Section position="5" start_page="372" end_page="372" type="metho"> <SectionTitle> CORPUS </SectionTitle> <Paragraph position="0"> Our methodology involved (1) recording pairs of structurally ambiguous sentences, (2) presenting the resulting utterances to naive listeners for perceptual judgements, and (3) comparing the phonological and phonetic characteristics of the spoken utterances with listeners' ability to disambiguate them. The recordings, which formed the basis for both perceptual experiments and phonetic and phonological analyses, are described below.</Paragraph> <Paragraph position="1"> We used 35 sentence pairs, ambiguous in that the two members of each pair contained the same string of phones, and could be associated with two contrasting syntactic bracketings. The sentences manifested seven types of structural ambiguity: (1) parenthetical clauses vs. non-parenthetical subordinate clauses, (2) appositions vs. attached noun (or prepositional) phrases, (3) main clauses linked by coordinating conjunctions vs. a main clause and a subordinate clause, (4) tag questions vs. attached noun phrases, (5) far vs. near attachment of final phrase, (6) left vs. right attachment of middle phrase, and (7) particles vs. prepositions.</Paragraph> <Paragraph position="2"> Note that &quot;high vs. low&quot; attachment is probably a more accurate syntactic description than &quot;far vs. near&quot; attachment. However, high vs. low attachment could involve the same site in the string of words being parsed, and our instances of far (high) attachment all involve attachment to phrases ending in a word that is not neighboring the word to be attached. Therefore, we instead use the more descriptive terms &quot;far&quot; and &quot;near&quot;. In each of the 7 categories, there were 5 pairs of ambiguous sentences. 
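As a concrete illustration (not from the paper itself), the stimulus inventory just described can be represented as a small data structure; the type names below abbreviate the seven categories listed above:

```python
# Seven structural ambiguity types, five A/B sentence pairs each: 35 pairs total.
AMBIGUITY_TYPES = [
    "parenthetical vs. non-parenthetical clause",
    "apposition vs. attached phrase",
    "main-main vs. main-subordinate clause",
    "tag question vs. attached noun phrase",
    "far vs. near attachment of final phrase",
    "left vs. right attachment of middle phrase",
    "particle vs. preposition",
]

corpus = [
    {"type_id": t + 1, "pair_id": p + 1, "versions": ("A", "B")}
    for t, _ in enumerate(AMBIGUITY_TYPES)
    for p in range(5)
]

print(len(corpus))  # 35 ambiguous sentence pairs
```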
In presentation, each sentence was preceded by a disambiguating context of one or two sentences. The target sentences were fully voiced to facilitate pitch tracking for acoustic analysis. We use the term size of syntactic break to reflect the number of syntactic brackets that would occur between two pairs of words: more brackets correspond to a larger syntactic break. The site with the largest number of brackets is the major syntactic break. For structural categories 1-4, sentence A of the pair involved a larger syntactic break than sentence B. For the attachment ambiguities 5-7, sentence A of the pair had the larger syntactic break later in the sentence than did sentence B.</Paragraph> <Paragraph position="3"> The sentences were recorded by four professional FM public radio newscasters, one male and three female, who were naive with respect to the purposes of the experiment. The newscasters were asked to read the sentences in context, using their standard radio style of speaking. In a pilot study, we found the FM radio style to have more clearly and consistently marked prosodic cues than a non-professional speaking style \[18\]. Our hope was that this style would be easier to label prosodically, and therefore the contributions of specific phonological cues would be easier to identify. The announcers were presented with the written sentences in context paragraphs, with the sentence types and A/B members of the pairs assigned to two recording sessions, so that the two contrasting members of a pair did not occur in the same session. The speakers were not told that there were special target sentences within the paragraphs. The recording sessions were separated by at least a few days and often several weeks, to minimize the possibility that the announcers would produce unnatural versions in an attempt to emphasize potential differences between the two members of a pair. 
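The session design described above, in which the two contrasting members of a pair never occur in the same recording session, can be sketched as follows; this is a hypothetical reconstruction, and the function name and seeding are ours:

```python
import random

def assign_sessions(n_pairs=35, seed=0):
    """Split the A and B members of each pair across two sessions so that
    the two contrasting versions of a pair never occur in the same session,
    with roughly half A and half B versions per session."""
    rng = random.Random(seed)
    session1, session2 = [], []
    for pair_id in range(n_pairs):
        # Randomly decide which session gets the A version; B goes to the other.
        if rng.getrandbits(1):
            session1.append((pair_id, "A"))
            session2.append((pair_id, "B"))
        else:
            session1.append((pair_id, "B"))
            session2.append((pair_id, "A"))
    return session1, session2

s1, s2 = assign_sessions()
# Each pair appears once per session, never with both versions together.
assert all(v1 != v2 for (_, v1), (_, v2) in zip(sorted(s1), sorted(s2)))
```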
Our goal was to create sentence pairs that were segmentally identical but syntactically different, so that we could investigate the relationship between syntax and prosody independent of any differences contributed by the segments. Although they were not prosodically incorrect, tag sentences in which the tags were read as questions were rerecorded as statements so that the question boundary tone cue would not confound the potential contribution of other prosodic cues.</Paragraph> </Section> <Section position="6" start_page="372" end_page="373" type="metho"> <SectionTitle> PERCEPTUAL EXPERIMENTS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="372" end_page="373" type="sub_section"> <SectionTitle> Methods </SectionTitle> <Paragraph position="0"> For the perceptual experiments, the spoken context sentences were edited out so that the target sentences could be presented in isolation. The 35 sentence pairs produced by a single speaker were presented to listeners in two sessions; only one member of each pair was heard in each session using a mixed assignment of half type A and half type B sentences in each session (analogous to the strategy used for recording the sentences). The different syntactic types were interleaved, and A versions always appeared before B versions on the answer sheet. The listeners heard the sentences in a small conference room from a portable stereo. The tape player was stopped between sentences until subjects were ready to continue; the subjects were under no time constraints to make their judgements. Each listening session (35 sentences) took approximately 40 minutes, and was conducted without any additional breaks. Listening sessions were separated by at least three weeks to minimize listener recall of the previous session's sentences. Listeners were given an answer sheet with both disambiguating contexts written out for each sentence; the target sentence was printed in bold at the end of each context. 
They were asked to mark the context which they thought best matched what they heard, with an additional marker if they were confident of their decision. Subjects were rewarded with pizza and soft drinks after the session.</Paragraph> <Paragraph position="1"> The subjects were all native speakers of American English, naive with respect to the purpose of the experiments. Most were engineering students, recruited through flyers advertising the free pizza. For the second two speakers, to attract more subjects, we increased the incentive by offering an additional $50 prize to the person who scored highest on this task. The number of listeners who heard both sessions for each of the different speakers was 13 for Speaker F1A, 15 for F2B, 17 for F3A and 12 for M1B. Different subjects participated in the experiments for the different speakers, although there was some overlap in the subject pool. Four subjects participated in all four experiments.</Paragraph> </Section> <Section position="2" start_page="373" end_page="373" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> For the analysis, we assume that the speaker produced the intended version of the sentence, and define a correct listener response as one which identifies that version. Accuracy is the percentage of correct listener responses. Confidence is the percent of the time that listeners indicated that they were confident of the response choice. Table 1 summarizes average subject accuracy for the different types of ambiguity. The averages are taken over the four speaker averages, so as not to more heavily weight the utterances that were heard by more listeners. The averages for each speaker are taken across five versions of each structural type, as well as across the various listeners (12-17 per talker).</Paragraph> <Paragraph position="1"> Table 1 shows that subjects could reliably disambiguate many, but not all of the ambiguities. 
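The scoring just described, accuracy and confidence averaged over per-speaker averages so that speakers heard by more listeners are not weighted more heavily, can be sketched as follows (the data below are invented for illustration):

```python
def macro_accuracy(responses):
    """responses: list of dicts with keys 'speaker', 'correct' (bool),
    'confident' (bool). Returns (accuracy, confidence) averaged over
    per-speaker averages rather than over raw responses."""
    by_speaker = {}
    for r in responses:
        by_speaker.setdefault(r["speaker"], []).append(r)
    acc = sum(
        sum(r["correct"] for r in rs) / len(rs) for rs in by_speaker.values()
    ) / len(by_speaker)
    conf = sum(
        sum(r["confident"] for r in rs) / len(rs) for rs in by_speaker.values()
    ) / len(by_speaker)
    return acc, conf

# Toy example: one speaker heard by 2 listeners, another by 4.
toy = (
    [{"speaker": "F1A", "correct": True, "confident": True}] * 2
    + [{"speaker": "M1B", "correct": False, "confident": False}] * 4
)
acc, conf = macro_accuracy(toy)
print(acc)  # 0.5: each speaker contributes equally despite unequal listener counts
```

A raw (micro) average over the six responses would instead give 0.33, which is why the averaging is done speaker by speaker.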
Subjects were rarely confident and incorrect, and the confidence is somewhat correlated (0.64) with the accuracy. On the average, subjects did well above chance (84% correct) in assigning the sentences to their appropriate contexts, although subjects were confident of their judgments only 52% of the time. Also on average, main-subordinate (3B) sentences and near attachments (5B) were close to the chance level; parentheticals (1A), far attachments (5A) and non-tags (4B) were recognized at levels greater than chance but not reliably; and all other sentence types were reliably disambiguated.</Paragraph> <Paragraph position="2"> [Table 1: average subject accuracy and confidence for each type of ambiguity (row labels recovered: 1. Parenthetical or not; 2. Apposition or not; 3. M-M vs. M-S; 4. Tags or not; 5. Far/near attachment; remaining rows and the numeric cells are not recoverable from this text), averaged over speakers, for ambiguous sentence interpretation. The Version A/B figures are based on 285 total observations of each class. An asterisk marks the A and B version responses that had high accuracy in listener responses. (High accuracy was defined to be average accuracy minus the standard deviation greater than 50%.)]</Paragraph> </Section> </Section> <Section position="7" start_page="373" end_page="374" type="metho"> <SectionTitle> PHONOLOGICAL ANALYSIS </SectionTitle> <Paragraph position="0"> The perceptual experiments described above clearly show that speakers can encode prosodic cues to structural ambiguities in ways that listeners can use reliably. This section attempts to find a phonological answer to the question: How do they do it? To approach this question, we labeled discrete prosodic phenomena (specifically, prosodic phrase boundaries and prominences) that could mark structural contrasts phonologically. We then analyzed the relationship between these labels and the patterns in the perceptual accuracy study. 
There are other prosodic cues (e.g., the type of pitch accent), and there are other phonological correlates of the prosodic structure (e.g., phonological processes at prosodic boundaries) which can likely play a role in disambiguation. However, analysis of these phenomena was beyond the scope of the present study. In the following section, we describe our labeling system and analyze the associated constituents in terms of their relationship to the syntactic structures in our corpus, and the accuracy with which sentences are identified.</Paragraph> <Section position="1" start_page="373" end_page="373" type="sub_section"> <SectionTitle> Perceptual Labels </SectionTitle> <Paragraph position="0"> We chose labels based on three criteria: (1) they should be used consistently within and across labelers, (2) they should be rather close to surface forms (to make eventual automatic detection more tractable and to improve labeler consistency), and (3) they should provide a mechanism for communicating information to a parser. For these reasons, our notation differs somewhat from that of other systems, although it is similar in many respects.</Paragraph> <Paragraph position="1"> We used seven levels to represent perceptual groupings (or, viewed another way, degrees of separation) between words. These seven levels appeared adequate for our corpus and also reflected the levels of prosodic constituents described in the literature. Our labeling experience led us to adopt the maximum number of levels suggested in the literature, although not all are universally accepted. 
We used numbers to express the degree of decoupling between each pair of words as follows: 0 - boundary within a clitic group, 1 - normal word boundary, 2 - boundary marking a grouping of words generally having only one prominence, 3 - intermediate phrase boundary, 4 - intonational phrase boundary, 5 - boundary marking a grouping of intonational phrases, and 6 - sentence boundary.</Paragraph> <Paragraph position="2"> Break indices of 4, 5, and 6 are &quot;major&quot; prosodic boundaries; constituents defined by these boundaries are often referred to as 'intonation phrases' (e.g., see \[2\]), and are marked by a boundary tone. Boundary tones were labeled using two types of falls (final fall and non-final fall), and two types of rises (continuation rise and question rise). The break index 3 corresponds to the unit referred to as an 'intermediate phrase' in \[2\] or a 'phonological phrase' in \[14\]. The 'phrase accent' pitch marker theoretically associated with the intermediate phrase was not labeled.</Paragraph> <Paragraph position="3"> Prominent syllables in the sentences were labeled using P1 for major phrasal prominence; P0 for a lesser prominence; and C for contrastive stress, which occurred rarely in these sentences (marked on 1% of the total words for four speakers).</Paragraph> <Paragraph position="4"> The prosodic cues were labeled perceptually by three listeners using multiple passes. The data were first labeled by the listeners individually; any differences in markings were then discussed; and then the sentence was replayed a few times to allow the labelers to revise their markings. Finally, a majority vote of the labels (which at this point had a correlation of 0.96 across labelers) was used as the final hand-marked label set. 
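The break-index inventory and the majority-vote labeling procedure described above can be made concrete in a short sketch; the sentence length and index values below are invented:

```python
from collections import Counter

# The seven-level break-index scale described in the text.
BREAK_LEVELS = {
    0: "boundary within a clitic group",
    1: "normal word boundary",
    2: "grouping of words with generally one prominence",
    3: "intermediate phrase boundary",
    4: "intonational phrase boundary",
    5: "grouping of intonational phrases",
    6: "sentence boundary",
}

def majority_vote(labelings):
    """labelings: one break-index sequence per labeler (one index per
    inter-word position). Returns the per-position majority label."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*labelings)]

# Three labelers' indices for the inter-word positions of a short sentence.
labeler1 = [1, 4, 1, 1, 6]
labeler2 = [1, 4, 1, 2, 6]
labeler3 = [1, 3, 1, 1, 6]
print(majority_vote([labeler1, labeler2, labeler3]))  # [1, 4, 1, 1, 6]
```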
All labeling was perceptual.</Paragraph> </Section> <Section position="2" start_page="373" end_page="374" type="sub_section"> <SectionTitle> Analysis </SectionTitle> <Paragraph position="0"> To separate semantic effects from effects that should occur throughout the syntactic class, we paid particular attention to those cues that reliably occurred in the A versions of one class, but never in the contrasting B versions, or vice versa. We also paid particular attention to those sentences that had high accuracy and confidence and to the outlier sentences. Below we mention some general results and then discuss briefly the individual classes investigated.</Paragraph> <Paragraph position="1"> General Observations: We found that prosodic boundary cues are associated with almost all reliably identified sentences. Presence of an intonational phrase boundary (break index 4 or 5) was often, but not always, a reliable cue and was most often observed at embedded or conjoined clause boundaries (marked by commas in the text). In addition, a difference in the relative size of prosodic break indices, or in the location of the largest break regardless of size, was frequently the only disambiguating information in the labels for the smaller syntactic constituents that were reliably disambiguated. By and large, relatively larger break indices tended to mean that syntactic attachment was higher rather than lower. In contrast to the pervasive association of boundary cues with successful disambiguation, prominence seemed to play mainly a supporting role, and was the sole cue in only a few sentences.</Paragraph> <Paragraph position="2"> Parenthetical (A) vs. non-parentheticals (B): The A versions always have break indices larger than 3 surrounding the parenthetical, except for one talker's rendition of one sentence. The B members have break indices less than 4 at one or both of the corresponding sites. 
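The regularity just described suggests a simple decision rule, sketched here with illustrative break indices (the sentence and numbers are invented, not drawn from the corpus):

```python
def looks_parenthetical(break_indices, start, end):
    """Rule suggested by the labels: the parenthetical (A) version has major
    breaks (index above 3) at both edges of the candidate span, while the B
    version has an index below 4 at one or both edges. start and end are the
    inter-word positions just before and just after the span."""
    return break_indices[start] > 3 and break_indices[end] > 3

# Illustrative indices for the two versions of one sentence.
version_a = [1, 4, 1, 1, 4, 1, 6]
version_b = [1, 1, 1, 1, 3, 1, 6]
print(looks_parenthetical(version_a, 1, 4))  # True
print(looks_parenthetical(version_b, 1, 4))  # False
```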
In all cases, the sentences with major prosodic breaks surrounding the parenthetical were identified as version A by 75% or more of the listeners, and sentences without the major prosodic breaks were identified as version B 80% of the time or more. This generalization includes an anomalous A version having a 3 at the parenthetical boundary, which was identified in accordance with the indices rather than in accordance with the speaker's intent.</Paragraph> <Paragraph position="3"> Apposition (A) vs. non-apposition (B): The A version of the pair, the appositive, always has a major prosodic break both before and immediately following the appositive. The B version of the pair typically has a small break index at one or both of the corresponding sites. Two speakers produced a major break at the 'wrong' location, i.e., after &quot;are&quot; in &quot;Wherever you are in Romania or Bulgaria, remember me.&quot; This predicts that the sets should be clearly separable, except for this sentence, which is what we found: All were labeled by the naive listeners at 87% accuracy or higher, except for this sentence, which was 73% correct.</Paragraph> <Paragraph position="4"> Main-main (A) vs. main-subordinate sentences (B): The A versions of the pairs were typically well-identified, whereas the B versions tended to be close to the chance level. This could be the result of a syntactic response bias if the conjunction constructions are preferred over the deleted &quot;that&quot; in the alternants. This is interesting since the bracketings differ for the two versions of the sentence, and yet the two versions are apparently not well separated perceptually.</Paragraph> <Paragraph position="5"> The prosodic transcriptions suggest a reason: both versions of the sentence have a major prosodic boundary in the same location, associated with the embedded (B) or conjoined (A) sentence.</Paragraph> <Paragraph position="6"> Tags (A) vs. 
non-tags (B): The A members all have a major prosodic break before the tag, and these were all identified as A versions (92% or more of the time). One talker produced one B version with a major prosodic boundary in the &quot;wrong&quot; place, and 92% of the listeners identified this utterance as version A, in accordance with the prosody. Two other B versions were frequently misidentified; these sentences had no boundary tone, but did have a break index of 3 (the largest in these sentences) at the site corresponding to the boundary of the tag.</Paragraph> <Paragraph position="7"> Far (A) vs. near (B) attachment sentences: The A versions showed a tendency to have the largest break index in the sentence before the phrase to be attached to a &quot;far&quot; site (i.e., a site other than to a phrase ending in the immediately preceding word). This pattern occurred in 15 of the 20 A utterances and only one of the B utterances. One talker's production of one A version had a 2 at the site in question, and a majority of the listeners labeled this as version B, which happened with none of the other A versions. Thus, the location of a relatively large break index at the site in question appears to block the &quot;near&quot; (low) attachment, and a relatively small index appears to enhance it.</Paragraph> <Paragraph position="8"> Left (A) vs. right (B) attachment sentences: For every rendition by every talker, there was a smaller break index at the attachment location than at the other end of the word or phrase to be attached.</Paragraph> <Paragraph position="9"> For the four sentence pairs that differed in comma location, the difference between the two break indices was large (2 or more), typically 0 or 1 in the location without a comma and 3, 4 or 5 in the location with the comma. These utterances were very reliably identified, with greater than 92% accuracy for all but one case.</Paragraph> <Paragraph position="10"> Particles (A) vs. 
prepositions (B): There is less frequently a major prosodic break before a prepositional phrase compared to conjoined or embedded sentences: 60% of the prepositional phrases in this class followed a major prosodic break, compared to 90% observed in the context of clauses. The real structural clue appears to be not the absolute size of the break index but its relative size. For all A versions, we observed a smaller break index between the verb and particle, compared to the indices before the verb or after the particle. For the B versions, the relations were reversed: there was a tendency to have a larger break between the verb and preposition, compared to those before the verb or after the preposition.</Paragraph> <Paragraph position="11"> There was little systematic difference in the speakers' use of prosodic cues. There were some differences in individual sentences which accounted for the variation in listener responses, but no consistent characteristics attributed to any one speaker. The correlation of break indices between pairs of speakers was 0.94-0.95, and the relative frequencies of prominences for the different speakers were also very similar. This result is consistent with the finding in \[5\] of a high correlation in duration patterns between different versions of the same utterance read by non-professional speakers.</Paragraph> </Section> </Section> <Section position="8" start_page="374" end_page="375" type="metho"> <SectionTitle> PHONETIC ANALYSIS </SectionTitle> <Paragraph position="0"> We have thus far presented evidence that naive listeners can reliably use prosody to separate structurally ambiguous sentences, and phonological evidence that suggests how listeners might use prosody to assign syntactic structure. Other studies have focused on syntactic differences associated with disambiguation. 
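The inter-speaker agreement reported earlier (break-index correlations of 0.94-0.95 between pairs of speakers) is a standard Pearson correlation over aligned break-index sequences; a minimal sketch, with invented indices:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two speakers' break indices at the same inter-word positions (invented).
speaker1 = [1, 4, 1, 2, 6, 0, 3, 1]
speaker2 = [1, 4, 1, 1, 6, 1, 3, 1]
print(round(pearson(speaker1, speaker2), 2))
```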
Our evidence shows that the prosodic structure can point to the syntactic differences in systematic ways: sentences with certain correspondences between syntactic and prosodic structures are reliably disambiguated, whereas others are not. In this section we investigate some of the phonetic evidence that might be responsible for the prosodic disambiguation. Since previous work suggests that the primary prosodic cues are duration and intonation, the present study is confined to these two cues. However, we acknowledge that other cues, such as the application or non-application of phonological rules, contribute to the perception of prosodic boundaries. We tried to minimize such effects by asking the speakers to reread sentences in which overt segmental cues were produced, i.e., where the gross phonetic transcription of the two versions of the sentence would differ.</Paragraph> <Paragraph position="1"> In the results presented here, segment duration normalization is determined automatically using an HMM-based speech recognition system, the SRI Decipher system, which uses phonological rules to generate bushy pronunciation networks that should enable more accurate phonetic transcription and alignment than single pronunciation speech recognizers \[22\]. Each phone duration was normalized according to speaker- and phone-dependent means as described in \[15\]. The variance of normalized duration in different contexts tends to be large, because the normalization has not accounted for effects such as syllable position, phonological and phonetic context, and speaking rate. In other work, we have found that variance can be reduced by adapting the phone means according to a local estimate of the speaking rate, which also plays a role in determining phoneme duration.</Paragraph> <Paragraph position="2"> We observed longer normalized durations for phones preceding major phrase boundaries and for phones bearing major prominences compared to other contexts. 
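The speaker- and phone-dependent duration normalization referred to above can be sketched as a z-score per (speaker, phone) group; this is an illustrative reconstruction, not the exact procedure of \[15\]:

```python
from collections import defaultdict
from statistics import mean, stdev

def normalize_durations(tokens):
    """tokens: list of (speaker, phone, duration_ms). Returns z-scored
    durations relative to speaker- and phone-dependent statistics."""
    groups = defaultdict(list)
    for spk, ph, dur in tokens:
        groups[(spk, ph)].append(dur)
    stats = {k: (mean(v), stdev(v)) for k, v in groups.items() if len(v) > 1}
    out = []
    for spk, ph, dur in tokens:
        mu, sd = stats.get((spk, ph), (dur, 1.0))  # singletons map to 0
        out.append((dur - mu) / sd if sd > 0 else 0.0)
    return out

tokens = [
    ("F1A", "aa", 90), ("F1A", "aa", 110),  # phrase-medial tokens
    ("F1A", "aa", 150),                     # lengthened before a major break
]
print([round(z, 2) for z in normalize_durations(tokens)])  # [-0.87, -0.22, 1.09]
```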
As mentioned earlier, it has long been noted that syntactic breaks are often associated with duration lengthening in the phrase-final syllable, though the scope of the lengthening is in dispute. We measured average normalized duration in the rhyme of the final syllable of all words and found that higher break indices are generally associated with greater normalized duration. The fact that duration is affected by constituents at many levels in the prosodic hierarchy is interesting, and consistent with our observations that relative break index size is meaningful even below the level of the intonational phrase (4,5). However, more research is needed on this question, since only the difference between the groups 0-3 (without boundary tone) and 4-6 (with boundary tone) is statistically significant; differences within those groups are not. Pauses are also associated with major prosodic boundaries, occurring at 48/212 (23%) boundaries marked with 4 and 17/25 (67%) boundaries marked with 5. Sentence-final pauses could not be measured for these sentences, which were always the final sentence in a paragraph. In only one case did a pause occur after a 3.</Paragraph> <Paragraph position="3"> Our analysis of normalized duration of the vowel nucleus for the different prominence markings revealed that: (1) major prominences (P1, C) tend to be longer than unmarked or minor (P0) prominences, although the effect is small before major prosodic breaks; (2) word-final syllables tend to be longer than non-word-final syllables; (3) syllables are longer in words before major breaks than before smaller breaks, though the effect is more dramatic for word-final syllables than for non-word-final syllables; and (4) the effects seem to be somewhat independent: the longest syllables are those with a major prominence, in word-final position, before a major break.</Paragraph> <Paragraph position="4"> Intonational cues observed included boundary tones, pitch range changes and pitch accents. 
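The association between break index and normalized rhyme duration reported above amounts to grouping the duration measurements by break index; a toy sketch, with invented numbers:

```python
from collections import defaultdict

def mean_duration_by_break(observations):
    """observations: list of (break_index, normalized_rhyme_duration).
    Returns a dict mapping break index to mean normalized duration."""
    sums = defaultdict(lambda: [0.0, 0])
    for bi, dur in observations:
        sums[bi][0] += dur
        sums[bi][1] += 1
    return {bi: s / n for bi, (s, n) in sorted(sums.items())}

obs = [(1, -0.2), (1, 0.0), (3, 0.3), (4, 0.9), (4, 1.1), (6, 1.4)]
by_break = mean_duration_by_break(obs)
print(by_break)  # durations rise with break index in this toy data
```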
Boundary tones are involved for the break indices 4, 5 and 6. Sentence-final (6) boundary tones are typically final falls; level (5) boundary tones are usually perceived as incomplete falls; and intonational phrase (4) boundary tones are most often continuation rises but occasionally are perceived as partial falls. Tags were sometimes associated with a sentence-final question rise, though we tried to eliminate this cue as much as possible by asking the radio announcers to reread versions when this occurred. Another intonational cue was a perceived drop in pitch baseline and range in a parenthetical phrase, relative to the rest of the sentence. This pitch range change was not always perceived for appositives. In examining the associated fundamental frequency (F0) contours, we observed a region of reduced F0 excursion during the period of perceived range change. Though intonation is an important cue, duration and pauses alone provide enough information to automatically label break indices with a high correlation (greater than 0.86) to hand-labeled break indices \[15\].</Paragraph> <Paragraph position="5"> Since prominence was not consistently associated with specific syntactic structures in any systematic pattern (with the exception of particles), it appears that the disambiguating role of prominences (or pitch accents) differs from that of boundary phenomena, being associated more with the semantics rather than with the syntax of an utterance. In other words, we suspect, with others, that prominence is related more to the contextual focus of the sentence.</Paragraph> </Section> class="xml-element"></Paper>