<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1029"> <Title>Digression: What's Statistical Parsing Good For?</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Commas and Constituency </SectionTitle> <Paragraph position="0"> Insofar as commas are used as separators or delimiters, we should see a correlation between comma position and the constituent structure of sentences. A simple test reveals that this is so. We define the start count sc_i of a string position i as the number of constituents whose left boundary is at i. The end count ec_i is defined analogously. For example, in Figure 2, sc_0 is 4, as the constituents labeled JJ, NPB, S, and S start there; ec_0 is 0. We compute the end count for positions that have a comma by first dropping the comma from the tree. Thus, at position 5, sc_5 is 2 (constituents DT, NPB) and ec_5 is 4 (constituents JJ, ADJP, VP, S).</Paragraph> <Paragraph position="1"> We expect the distributions of sc and ec for positions at which a comma is inserted to differ from those at which no comma appears. Figure 3 shows that this intuition is correct. The charts show the percentage of string positions with each possible value of sc (resp. ec), for positions with commas and positions without. [Footnote 5: Counterintuitively, Beeferman et al. (1998) come to the opposite expectation, and their reported results bear out their intuition. We have no explanation for this disparity with our results.]</Paragraph> <Paragraph position="2"> [Figure 2 caption fragment: ... shown circled, and the two constituents leading to sc_5 = 2 are shown circled and shaded.]</Paragraph> <Paragraph position="3"> We draw the data again from sections 02-22 of the Wall Street Journal, using as the specification for sentence constituency the parses of these sentences from the Penn Treebank. 
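The start- and end-count computation described above can be sketched as follows. This is a minimal illustration, not the paper's code; the tree representation (nested (label, children) tuples) and the toy sentence are my own assumptions, and counts are taken over inter-word string positions as in the text.

```python
# Sketch of computing start counts (sc_i) and end counts (ec_i) from a
# Penn-Treebank-style constituency parse.  Trees are (label, children)
# tuples; a leaf is (preterminal_tag, word).  Positions are the n+1
# inter-word boundary points of an n-word sentence.

def constituent_spans(tree):
    """Return (word_count, spans), where each span is (start, end, label)
    over word positions, collected from the whole tree (preterminals
    such as DT count as constituents, as in the paper's Figure 2)."""
    spans = []

    def walk(node, start):
        label, children = node
        if isinstance(children, str):          # leaf: preterminal over one word
            spans.append((start, start + 1, label))
            return start + 1
        pos = start
        for child in children:                 # recurse over subconstituents
            pos = walk(child, pos)
        spans.append((start, pos, label))      # this constituent covers [start, pos)
        return pos

    n = walk(tree, 0)
    return n, spans

def start_end_counts(tree):
    """sc[i] = number of constituents whose left boundary is at position i;
    ec[i] = number of constituents whose right boundary is at position i."""
    n, spans = constituent_spans(tree)
    sc = [0] * (n + 1)
    ec = [0] * (n + 1)
    for start, end, _label in spans:
        sc[start] += 1
        ec[end] += 1
    return sc, ec

# Toy example (not the paper's Figure 2): (S (NP (DT the) (NN cat)) (VP (VBD slept)))
tree = ("S", [("NP", [("DT", "the"), ("NN", "cat")]),
              ("VP", [("VBD", "slept")])])
sc, ec = start_end_counts(tree)
```

For this toy tree, position 0 starts DT, NP, and S (sc_0 = 3), while position 3 ends VBD, VP, and S (ec_3 = 3), mirroring the way sc_0 and ec_5 are read off the tree in the text.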
The distributions are quite different, hinting at an opportunity for improved comma restoration.</Paragraph> <Paragraph position="4"> The ec distribution is especially well differentiated, with a crossover point at about 2 constituents. We can add this kind of information, a single bit specifying an ec value of k or greater (call it cec_i), to the language model as follows. We replace p(y_i | y_{i-2} y_{i-1}) with the probability p(y_i | y_{i-2} y_{i-1} cec_i). We smooth the model using the lower-order models p(y_i | y_{i-1} cec_i), p(y_i | cec_i), and p(y_i). [Footnote 6] These distributions can be estimated directly from the training data and smoothed appropriately.</Paragraph> <Paragraph position="5"> Adding just this one bit of information provides a significant improvement in comma restoration performance. A k value of 3 maximizes performance. [Footnote 7] Compared to the baseline, F-measure increases to 63.2% and sentence accuracy to 52.3%. This experiment shows that constituency information, even in rarefied form, can provide a significant performance improvement in comma restoration. (Figure 1 lists performance figures as model 3.) Of course, this result does not provide a practical algorithm for comma restoration, as it is based on a probabilistic model that requires data from a manually constructed parse of the sentence to be restored. To make the method practical, we might replace the Treebank parse with a statistically generated parse. In the sequel, we use Collins's statistical parser (Collins, 1997) as our canonical automated approximation of the Treebank. [Figure 3 caption fragment: ... ends for string positions with and without commas. Chart (a) shows the percentage of positions with various values of sc (number of constituents starting at the position) for string positions with commas (square points) and without (diamond points). Chart (b) shows the corresponding pattern for values of ec (number of constituents ending).] We can train a similar model, but using ec values extracted from</Paragraph> <Paragraph position="6"> Collins parses of the training data, and use the model to restore commas on a test sentence, again using ec values from the Collins parse of the test datum. This model, listed as model 5 in Table 1, has an F-measure of 64.5%, better than the pure trigram model (62.2%) but not as good as the oracular Treebank-trained model (68.4%).</Paragraph> <Paragraph position="7"> The other metrics show similar relative orderings.</Paragraph> <Paragraph position="8"> Since the test sentence initially contains no commas, we want to train the model on parses of sentences from which the commas have been removed, so that the model is applied to data as similar as possible to that on which it was trained. We would expect, and experiments verify (model 6), that training on parses with commas retained yields inferior performance (in particular, an F-measure of 64.1% and sentence accuracy of 48.6%).</Paragraph> <Paragraph position="9"> Again consistent with expectations, if we could clairvoyantly know the value of cec_i from a Collins parse of the test sentence with the commas that we are trying to restore (model 7), performance improves over model 5; F-measure rises to 66.8%.</Paragraph> <Paragraph position="10"> The steady rise in performance from model 6 to model 5 to model 7 to model 3 exactly tracks the improving quality of the syntactic information available to the system. 
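The cec-augmented trigram model of Section 3 can be sketched as below. This is a hedged illustration only: the training events and their counts are invented toy data, and simple linear interpolation stands in for the paper's (unspecified) smoothing scheme; only the conditioning structure p(y_i | y_{i-2} y_{i-1} cec_i) with lower-order models p(y_i | y_{i-1} cec_i), p(y_i | cec_i), and p(y_i) comes from the text.

```python
# Toy sketch of the cec-augmented trigram model.  Each training event is
# (y_{i-2}, y_{i-1}, cec_i, y_i), where cec_i = 1 iff ec_i >= k at the
# position before y_i.  Counts below are fabricated for illustration.
from collections import Counter

events = [
    ("said", "Monday", 1, ","), ("said", "Monday", 1, ","),
    ("said", "Monday", 0, "that"), ("the", "company", 0, "said"),
    ("the", "company", 1, ","), ("company", "said", 0, "that"),
]

# Count tables for each conditioning context, highest order first.
c3 = Counter((y2, y1, cec, y) for y2, y1, cec, y in events)
c3_ctx = Counter((y2, y1, cec) for y2, y1, cec, _ in events)
c2 = Counter((y1, cec, y) for _, y1, cec, y in events)
c2_ctx = Counter((y1, cec) for _, y1, cec, _ in events)
c1 = Counter((cec, y) for _, _, cec, y in events)
c1_ctx = Counter(cec for _, _, cec, _ in events)
c0 = Counter(y for _, _, _, y in events)
n = len(events)

def p(y, y2, y1, cec, lams=(0.5, 0.25, 0.15, 0.1)):
    """Linearly interpolate p(y|y2,y1,cec), p(y|y1,cec), p(y|cec), p(y).
    The interpolation weights are arbitrary placeholders."""
    def ratio(num, den):
        return num / den if den else 0.0
    return (lams[0] * ratio(c3[(y2, y1, cec, y)], c3_ctx[(y2, y1, cec)])
            + lams[1] * ratio(c2[(y1, cec, y)], c2_ctx[(y1, cec)])
            + lams[2] * ratio(c1[(cec, y)], c1_ctx[cec])
            + lams[3] * ratio(c0[y], n))

# When many constituents end at the position (cec = 1), a comma becomes
# far more likely than in the same lexical context with cec = 0:
assert p(",", "said", "Monday", 1) > p(",", "said", "Monday", 0)
```

The single bit cec_i simply partitions each n-gram context in two, which is why it costs little in sparsity while still capturing the ec cross-over effect.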
As the quality of the syntactic information better approximates ground truth, our ability to restore commas gradually and monotonically improves.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Using More Detailed Syntactic Information </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Using full end count information </SectionTitle> <Paragraph position="0"> The examples above show that even a tiny amount of syntactic information can provide a substantive advantage for comma restoration. To use more information, we might imagine using the values of ec directly, rather than thresholding them. However, this quickly leads to data sparsity problems. To remedy this, we assume independence between the bigram in the conditioning context and the syntactic information; that is, we take p(y_i | y_{i-2} y_{i-1} ec_i) ≈ p(y_i | y_{i-2} y_{i-1}) p(y_i | y_{i-1} ec_i) / p(y_i). This model [Footnote 9] (model 8) has an F-measure of 72.1%, due to a substantial increase in recall, demonstrating that the increased articulation in the available syntactic information provides a concomitant benefit. Although the sentence accuracy is slightly less than that with thresholded ec, we will show in a later section that this model combines well with other modifications to generate further improvements.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Using part of speech </SectionTitle> <Paragraph position="0"> Additional information from the parse can be useful in predicting comma location. In this section, we incorporate part-of-speech information into the model, generating model 10. We estimate the joint probability of each word x_i and its part of speech X_i as follows: p(x_i, X_i | x_{i-2}, X_{i-2}, x_{i-1}, X_{i-1}, ec) ≈ p(x_i | x_{i-2} x_{i-1} ec) p(X_i | X_{i-2} X_{i-1}). The first term is computed as in model 8, the second backs off to bigram and unigram models. 
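The independence combination used by model 8 (and reused for the first term of model 10) can be sketched numerically. All probabilities below are invented toy values, and the final renormalization over outcomes is my own assumption (the combined scores need not sum to one otherwise); only the combination rule p(y | context, ec) ≈ p(y | context) p(y | y_prev, ec) / p(y) comes from the text.

```python
# Toy sketch of the model-8 independence combination: two dense
# conditional models are multiplied and divided by the prior, then
# renormalized over the outcome set (renormalization is an assumption
# added here, not stated in the paper).

def combine(p_trigram, p_syntax, p_unigram):
    """Each argument maps outcome y -> probability.  Returns the
    renormalized distribution p(y|y2,y1) * p(y|y1,ec) / p(y)."""
    raw = {y: p_trigram[y] * p_syntax[y] / p_unigram[y] for y in p_unigram}
    z = sum(raw.values())
    return {y: v / z for y, v in raw.items()}

# Outcomes: insert a comma (",") or an ordinary word ("w").  Toy numbers:
p_tri = {",": 0.10, "w": 0.90}   # p(y | y2, y1): lexical trigram alone
p_syn = {",": 0.60, "w": 0.40}   # p(y | y1, ec): many constituents end here
p_uni = {",": 0.05, "w": 0.95}   # p(y): prior

combined = combine(p_tri, p_syn, p_uni)
```

The syntactic factor sharply raises the comma probability relative to the trigram alone, because a comma is much likelier than its prior whenever many constituents end at the position, which is exactly the signal the full ec value carries.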
Adding a part-of-speech model in this way provides a further improvement in performance: F-measure improves to 74.8% and sentence accuracy to 57.9%, a 28% improvement over the baseline. These models (8 and 10), like model 3, assume availability of the Treebank parse and part-of-speech tags. Using the Collins-generated parses instead still shows improvement over the corresponding model 5: an F-measure of 70.1% and sentence accuracy of 54.9%, twice the improvement over the baseline exhibited by model 5.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Related Work </SectionTitle> <Paragraph position="0"> We compare our comma restoration methods to those of Beeferman et al. (1998), as their results use only textual information to predict punctuation. Several researchers have shown prosodic information to be useful in predicting punctuation (Christensen et al., 2001; Kim and Woodland, 2001), along with related phenomena such as disfluencies and overlapping speech (Shriberg et al., 2001).</Paragraph> <Paragraph position="1"> These studies, typically based on augmenting a Markovian language model with duration or other prosodic cues as conditioning features, show that prosodic information is orthogonal to language-model information: combined models outperform models based on either type of information separately. We would expect, therefore, that our techniques would similarly benefit from the addition of prosodic information.</Paragraph> <Paragraph position="2"> In the introduction, we mentioned the problem of sentence boundary detection, which is related to the punctuation reconstruction problem, especially with regard to predicting sentence-boundary punctuation such as periods, question marks, and exclamation marks. 
(This problem is distinct from the problem of sentence boundary disambiguation, where punctuation is provided but the categorization of the punctuation, as to whether or not it marks a sentence boundary, is at issue (Palmer and Hearst, 1994; Reynar and Ratnaparkhi, 1997).) [Footnote 9: An alternative method of resolving the data sparsity issues is to back off the model p(y_i | y_{i-2} y_{i-1} ec_i), for instance to p(y_i | y_{i-2} y_{i-1}) or to p(y_i | y_{i-1} ec_i). Both of these perform less well than the approximation in model 8.]</Paragraph> <Paragraph position="3"> Stolcke and Shriberg (1996) used HMMs for the related problem of linguistic segmentation of text, where the segments correspond to sentences and other self-contained units such as disfluencies and interjections. They argue that a linguistic segmentation is useful for improving the performance and utility of language models and speech recognizers. Like the present work, they segment clean text rather than automatically transcribed speech. Stevenson and Gaizauskas (2000) and Gotoh and Renals (2000) address the sentence boundary detection problem directly, again using lexical and, in the latter, prosodic cues.</Paragraph> </Section> </Paper>