<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2067"> <Title>Parsing and Subcategorization Data</Title> <Section position="4" start_page="515" end_page="515" type="metho"> <SectionTitle> 2 Experiment Design </SectionTitle> <Paragraph position="0"> Three models will be investigated for parsing and extracting SCCs from the parser's output: 1. punc: leaving punctuation in both training and test data.</Paragraph> <Paragraph position="1"> 2. no-punc: removing punctuation from both training and test data.</Paragraph> <Paragraph position="2"> 3. punc-no-punc: removing punctuation from only the test data.</Paragraph> <Paragraph position="3"> Following the convention in the parsing community, for written language, we selected sections 02-21 of WSJ as training data and section 23 as test data (Collins, 1999). For spoken language, we designated sections 2 and 3 of Switchboard as training data and files sw4004 to sw4135 of section 4 as test data (Roark, 2001). Since we are also interested in extracting SCCs from the parser's output, we eliminated from the two test corpora all sentences that do not contain verbs. [Footnote 1: We use punctuation to refer to sentence-internal punctuation unless otherwise specified.]</Paragraph> <Paragraph position="4"> [Table 1. Columns: label, clause type, desired SCCs.] Our experiments proceed in the following three steps: 1. Tag test data using the POS-tagger described in Ratnaparkhi (1996).</Paragraph> <Paragraph position="5"> 2. Parse the POS-tagged data using Bikel's parser.</Paragraph> <Paragraph position="6"> 3. Extract SCCs from the parser's output. The extractor we built first locates each verb in the parser's output, then identifies the syntactic categories of all its sisters and combines them into an SCC. However, there are cases where the extractor has more work to do. * Finite and Infinite Clauses: In the Penn Treebank, S and SBAR are used to label different types of clauses, obscuring too much detail about the internal structure of each clause. 
Our extractor is designed to identify the internal structure of different types of clause, as shown in Table 1.</Paragraph> <Paragraph position="7"> * Passive Structures: As noted above, Roland and Jurafsky (1998) have noticed that written language tends to have a much higher percentage of passive structures than spoken language. Our extractor is also designed to identify passive structures from the parser's output.</Paragraph> </Section> <Section position="5" start_page="515" end_page="518" type="metho"> <SectionTitle> 3 Experiment Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="515" end_page="517" type="sub_section"> <SectionTitle> 3.1 Parsing and SCCs </SectionTitle> <Paragraph position="0"> We used the EVALB measures Labeled Recall (LR) and Labeled Precision (LP) to compare the parsing performance of different models. To compare the accuracy of SCCs proposed from the parser's output, we calculated SCC Recall (SR) and SCC Precision (SP). SR and SP are defined as follows:</Paragraph> <Paragraph position="2"/> <Paragraph position="4"> The results for parsing WSJ and Switchboard and extracting SCCs are summarized in Table 2.</Paragraph> <Paragraph position="5"> The LR/LP figures show the following trends: 1. Roark (2001) showed LR/LP of 86.4%/86.8% for punctuated written language and 83.4%/84.1% for unpunctuated written language. We achieve a higher accuracy in both punctuated and unpunctuated written language, and the decrease if punctuation is removed is smaller. 2. For spoken language, Roark (2001) showed LR/LP of 85.2%/85.6% for punctuated spoken language and 84.0%/84.6% for unpunctuated spoken language. We achieve a lower accuracy in both punctuated and unpunctuated spoken language, and the decrease if punctuation is removed is smaller. The trends in (1) and (2) may be due to parser differences, or to the removal of sentences lacking verbs. 3. 
Unsurprisingly, if the test data is unpunctuated, but the models have been trained on punctuated language, performance decreases sharply.</Paragraph> <Paragraph position="6"> In terms of the accuracy of extraction of SCCs, the results follow a similar pattern. However, the utility of punctuation turns out to be even smaller. Removing punctuation from both the training and test data results in a 0.8% drop in the accuracy of SCC extraction for written language and a 0.3% drop for spoken language.</Paragraph> <Paragraph position="7"> Figure 1 exhibits the relation between the accuracy of parsing and that of extracting SCCs. If we consider WSJ and Switchboard individually, there seems to exist a positive correlation between the accuracy of parsing and that of extracting SCCs. In other words, higher LR/LP indicates higher SR/SP. However, Figure 1 also shows that although the parser achieves a higher F-measure value for parsing WSJ, it achieves a higher F-measure value for generating SCCs from Switchboard. The fact that the parser achieves a higher accuracy of extracting SCCs from Switchboard than WSJ merits further discussion. Intuitively, it seems to be true that the shorter an SCC is, the more likely the parser is to get it right. This intuition is confirmed by the data shown in Figure 2. Figure 2 plots the accuracy level of extracting SCCs by SCC length. It is clear from Figure 2 that as SCCs get longer, the F-measure value drops progressively for both WSJ and Switchboard. Again, Roland and Jurafsky (1998) have suggested that one major subcategorization difference between written and spoken corpora is that spoken corpora have a much higher percentage of the zero-anaphora construction. We then examined the distribution of SCCs of different length in WSJ and Switchboard. Figure 3 shows that SCCs of length 0 (footnote 2) account for a much higher percentage in Switchboard than WSJ, but it is always the other way around for SCCs of non-zero length. 
This observation led us to believe that the better performance that Bikel's parser achieves in extracting SCCs from Switchboard may be attributed to the following two factors: 1. Switchboard has a much higher percentage of SCCs of length 0.</Paragraph> <Paragraph position="8"> 2. The parser is very accurate in extracting shorter SCCs.</Paragraph> <Paragraph position="9"> [Footnote 2: Verbs have a length-0 SCC if they are intransitive and have no modifiers.]</Paragraph> </Section> <Section position="2" start_page="517" end_page="518" type="sub_section"> <SectionTitle> 3.2 Extraction of Dependents </SectionTitle> <Paragraph position="0"> In order to estimate the effects of SCCs of length 0, we examined the parser's performance in retrieving dependents of verbs. Every constituent (whether an argument or adjunct) in an SCC generated by the parser is considered a dependent of that verb. SCCs of length 0 will be discounted because verbs that do not take any arguments or adjuncts have no dependents (footnote 3). In addition, this way of evaluating the extraction of SCCs also matches the practice in some NLP tasks such as semantic role labeling (Xue and Palmer, 2004). For the task of semantic role labeling, the total number of dependents correctly retrieved from the parser's output affects the accuracy level of the task.</Paragraph> <Paragraph position="1"> To do this, we calculated the number of dependents shared between each SCC proposed from the parser's output and its corresponding SCC proposed from Penn Treebank. [Footnote 3: We are aware that subjects are typically also considered dependents, but we did not include subjects in our experiments.]</Paragraph> <Paragraph position="3"> We based our calculation on a modified version of the Minimum Edit Distance Algorithm. 
Our algorithm works by creating a shared-dependents matrix with one column for each constituent in the target sequence (SCCs proposed from Penn Treebank) and one row for each constituent in the source sequence (SCCs proposed from the parser's output). Each cell shared-dependent[i,j] contains the number of constituents shared between the first i constituents of the target sequence and the first j constituents of the source sequence. Each cell can then be computed as a simple function of the three possible paths through the matrix that arrive there. The algorithm is illustrated in Table 3.</Paragraph> <Paragraph position="4"> Table 4 shows an example of how the algorithm works with NP-S-that-PP-in-INF as the target sequence and NP-NP-PP-in-ADVP-INF as the source sequence. The algorithm returns 3 as the number of dependents shared by the two SCCs (NP, PP-in and INF).</Paragraph> <Paragraph position="5"> We compared the performance of Bikel's parser in retrieving dependents from written and spoken language over all three models using Dependency Recall (DR) and Dependency Precision (DP). These metrics are defined as follows:</Paragraph> <Paragraph position="7"> The results of Bikel's parser in retrieving dependents are summarized in Figure 4. Overall, the parser achieves a better performance for WSJ over all three models, just the opposite of what has been observed for SCC extraction. Interestingly, removing punctuation from both the training and test data actually slightly improves the F-measure.</Paragraph> <Paragraph position="8"> This holds true for both WSJ and Switchboard.</Paragraph> <Paragraph position="9"> This Dependency F-measure differs in detail from similar measures in Xue and Palmer (2004). 
For present purposes all that matters is the relative values for WSJ and Switchboard.</Paragraph> </Section> </Section> <Section position="6" start_page="518" end_page="519" type="metho"> <SectionTitle> 4 Extraction of SCFs from Spoken </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="518" end_page="518" type="sub_section"> <SectionTitle> Language </SectionTitle> <Paragraph position="0"> Our experiments indicate that the SCCs generated by the parser from spoken language are as accurate as those generated from written texts. Hence, we would expect that the current technology for extracting SCFs, initially designed for written texts, should work equally well for spoken language.</Paragraph> <Paragraph position="1"> We previously built a system for automatically extracting SCFs from the spoken BNC, and reported accuracy comparable to previous systems that work with only written texts (Li and Brew, 2005). However, Korhonen (2002) has shown that a direct comparison of different systems is very difficult to interpret because of the variations in the number of targeted SCFs, test verbs, gold standards and in the size of the test data. For this reason, we apply our SCF acquisition system separately to a written and a spoken corpus of similar size from the BNC and compare the accuracy of the acquired SCF sets.</Paragraph> </Section> <Section position="2" start_page="518" end_page="518" type="sub_section"> <SectionTitle> 4.1 Overview </SectionTitle> <Paragraph position="0"> As noted above, previous studies on automatic extraction of SCFs from corpora usually proceed in two steps and we adopt this approach.</Paragraph> <Paragraph position="1"> 1. Hypothesis Generation: Identify all SCCs from the corpus data.</Paragraph> <Paragraph position="2"> 2. 
Hypothesis Selection: Determine which SCC is a valid SCF for a particular verb.</Paragraph> </Section> <Section position="3" start_page="518" end_page="519" type="sub_section"> <SectionTitle> 4.2 SCF Extraction System </SectionTitle> <Paragraph position="0"> We briefly outline our SCF extraction system for automatically extracting SCFs from corpora, which was based on the design proposed in Briscoe and Carroll (1997).</Paragraph> <Paragraph position="1"> 1. A Statistical Parser: Bikel's parser is used to parse input sentences.</Paragraph> <Paragraph position="2"> 2. An SCF Extractor: An extractor is used to extract SCCs from the parser's output. 3. An English Lemmatizer: MORPHA (Minnen et al., 2000) is used to lemmatize each verb.</Paragraph> <Paragraph position="3"> 4. An SCF Evaluator: An evaluator is used to filter out false SCCs based on their likelihood. An SCC generated by the parser and extractor may be a correct SCC, or it may contain an adjunct, or it may simply be wrong due to tagging or parsing errors. We therefore need an SCF evaluator capable of filtering out false cues. Our evaluator has two parts: the Binomial Hypothesis Test (Brent, 1993) and a back-off algorithm (Sarkar and Zeman, 2000).</Paragraph> <Paragraph position="4"> 1. The Binomial Hypothesis Test (BHT): Let p be the probability that scf_i occurs with a verb_j that is not supposed to take scf_i. If verb_j occurs n times and m of those times it co-occurs with scf_i, then the probability that the scf_i cues are false cues is estimated by the summation of the binomial distribution for m ≤ k ≤ n: $P(m+, n, p) = \sum_{k=m}^{n} \binom{n}{k} p^{k} (1-p)^{n-k}$</Paragraph> <Paragraph position="6"> If the value of P(m+,n,p) is less than or equal to a small threshold value, then the null hypothesis that verb_j does not take scf_i is extremely unlikely to be true. Hence, scf_i is very likely to be a valid SCF for verb_j. The 
The number of verb tokens 115,524 109,678 number of verb types 5,234 4,789 verb types seen more than 10 times 1,102 998 number of acquired SCFs 2,688 1,984 average number of SCFs per verb 2.43 1.99 value of m and n can be directly computed from the extractor's output, but the value of p is not easy to obtain. Following Manning (1993), we empirically determined the value of p. It was between 0.005 to 0.4 depending on the likelihood of an SCC being a valid SCF.</Paragraph> </Section> </Section> <Section position="7" start_page="519" end_page="519" type="metho"> <SectionTitle> 2. Back-off Algorithm: Many SCCs generated </SectionTitle> <Paragraph position="0"> by the parser and extractor tend to contain some adjuncts. However, for many SCCs, one of its subsets is likely to be the correct SCF. Table 5 shows some SCCs generated by the extractor and the corresponding SCFs.</Paragraph> <Paragraph position="1"> The Back-off Algorithm always starts with the longest SCC for each verb. Assume that this SCC fails the BHT. The evaluator then eliminates the last constituent from the rejected cue, transfers its frequency to its successor and submits the successor to the BHT again. In this way, frequency can accumulate and more valid frames survive the BHT.</Paragraph> <Section position="1" start_page="519" end_page="519" type="sub_section"> <SectionTitle> 4.3 Results and Discussion </SectionTitle> <Paragraph position="0"> We evaluated our SCF extraction system on written and spoken BNC. We chose one million word written corpus (WC) and a comparable spoken corpus (SC) from BNC. Table 6 provides relevant information on the two corpora. We only keep the verbs that occur at least 10 times in our training data.</Paragraph> <Paragraph position="1"> To compare the performance of our system on WC and SC, we calculated the type precision, type gold standard COMLEX Manually Constructed recall and F-measure. 
Type precision is the percentage of SCF types proposed by our system that are correct according to some gold standard, and type recall is the percentage of SCF types listed in the gold standard that our system proposes. We used the 14 verbs selected by Briscoe and Carroll (1997) (footnote 4) and evaluated our results for these verbs against the SCF entries in two gold standards: COMLEX (Grishman et al., 1994) and a manually constructed SCF set from the training data. It makes sense to use a manually constructed SCF set while calculating type precision and recall because some of the SCFs in a syntax dictionary such as COMLEX might not occur in the training data at all. We constructed separate SCF sets for the written and spoken BNC.</Paragraph> <Paragraph position="2"> The results are summarized in Table 7. As shown in Table 7, the accuracies achieved for WC and SC are comparable: our system achieves a slightly better result for WC when using COMLEX as the gold standard and for SC when using the manually constructed SCF set as the gold standard, suggesting that it is feasible to apply the current technology for automatically extracting SCFs to spoken language.</Paragraph> </Section> </Section> <Section position="8" start_page="519" end_page="520" type="metho"> <SectionTitle> 5 Conclusions and Future Work </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="519" end_page="520" type="sub_section"> <SectionTitle> 5.1 Use of Parser's Output </SectionTitle> <Paragraph position="0"> In this paper, we have shown that it is not necessarily true that statistical parsers always perform worse when dealing with spoken language.</Paragraph> <Paragraph position="1"> The conventional accuracy metrics for parsing (LR/LP) should not be taken as the only metrics in determining the feasibility of applying statistical parsers to spoken language. 
It is necessary to consider what information we want to extract from parsers' output and make use of.</Paragraph> <Paragraph position="2"> 1. Extraction of SCFs from Corpora: This task takes SCCs generated by the parser and extractor as input. Our experiments show that the SCCs generated for spoken language are as accurate as those generated for written language. We have also shown that it is feasible to apply the current SCF extraction technology to spoken language. [Footnote 4: ... begin, believe, cause, expect, find, give, help, like, move, produce, provide, seem and sway. We replaced sway with show because sway occurs fewer than 10 times in our training data.]</Paragraph> </Section> </Section> <Section position="9" start_page="520" end_page="520" type="metho"> <SectionTitle> 2. Semantic Role Labeling </SectionTitle> <Paragraph position="0"> This task usually operates on parsers' output and the number of dependents of each verb that are correctly retrieved by the parser clearly affects the accuracy of the task. Our experiments show that the parser achieves a much lower accuracy in retrieving dependents from spoken language than from written language. This seems to suggest that a lower accuracy is likely to be achieved for a semantic role labeling task performed on spoken language. We are not aware that this has yet been tried.</Paragraph> <Section position="1" start_page="520" end_page="520" type="sub_section"> <SectionTitle> 5.2 Punctuation and Speech Transcription Practice </SectionTitle> <Paragraph position="0"> Both our experiments and Roark's experiments show that parsing accuracy measured by LR/LP experiences a sharper decrease for WSJ than for Switchboard after punctuation is removed from training and test data. In spoken language, commas are largely used to delimit disfluency elements. As noted in Engel et al. (2002), statistical parsers usually condition the probability of a constituent on the types of its neighboring constituents. 
The way that commas are used in speech transcription seems to have the effect of increasing the range of neighboring constituents, thus fragmenting the data and making it less reliable. On the other hand, in written texts, commas serve as more reliable cues for parsers to identify phrasal and clausal boundaries.</Paragraph> <Paragraph position="1"> In addition, our experiment demonstrates that punctuation does not help much with extraction of SCCs from spoken language. Removing punctuation from both the training and test data results in roughly a 0.3% decrease in SR/SP. Furthermore, removing punctuation from both training and test data actually slightly improves the performance of Bikel's parser in retrieving dependents from spoken language. All these results seem to suggest that adding punctuation in speech transcription is of little help to statistical parsers, including at least three state-of-the-art statistical parsers (Collins, 1999; Charniak, 2000; Bikel, 2004). There may be other good reasons why someone who wants to build a Switchboard-like corpus should choose to provide punctuation, but there is no need to do so simply in order to help parsers.</Paragraph> <Paragraph position="2"> However, segmenting utterances into individual units is necessary because statistical parsers require sentence boundaries to be clearly delimited.</Paragraph> <Paragraph position="3"> Current statistical parsers are unable to handle an input string consisting of two sentences. For example, when presented with an input string as in (1) and (2), if the two sentences are separated by a period (1), Bikel's parser wrongly treats the second sentence as a sentential complement of the main verb like in the first sentence. As a result, the extractor generates an SCC NP-S for like, which is incorrect. The parser returns the same parse after we removed the period (2) and parsed the string again.</Paragraph> <Paragraph position="4"> (1) I like the long hair. 
It was back in high school.</Paragraph> <Paragraph position="5"> (2) I like the long hair It was back in high school.</Paragraph> <Paragraph position="6"> Hence, while adding punctuation in transcribing a Switchboard-like corpus is not of much help to statistical parsers, segmenting utterances into individual units is crucial for statistical parsers. In future work, we plan to develop a system capable of automatically segmenting speech utterances into individual units.</Paragraph> </Section> </Section> </Paper>