<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0905"> <Title>Verb Subcategorization Frequency Differences between Business-News and Balanced Corpora: The Role of Verb Sense Douglas Roland, Daniel Jurafsky, Lise Menn, Susanne Gahl, Elizabeth Elder and Chris</Title> <Section position="1" start_page="0" end_page="28" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We explore the differences in verb subcategorization frequencies across several corpora in an effort to obtain stable cross corpus subcategorization probabilities for use in norming psychological experiments.</Paragraph> <Paragraph position="1"> For the 64 single sense verbs we looked at, subcategorization preferences were remarkably stable between British and American corpora, and between balanced corpora and financial news corpora. Of the verbs that did show differences, these differences were generally found between the balanced corpora and the financial news data. We show that all or nearly all of these shifts in subcategorization are realised via (often subtle) word sense differences.</Paragraph> <Paragraph position="2"> This is an interesting observation in itself, and also suggests that stable cross corpus subcategorization frequencies may be found when verb sense is adequately controlled.</Paragraph> <Paragraph position="3"> Introduction Verb subcategorization probabilities play an important role in both computational linguistic applications (e.g. Carroll, Minnen, and Briscoe 1998, Charniak 1997, Collins 1996/1997, Joshi and Srinivas 1994, Kim, Srinivas, and Trueswell 1997, Stolcke et al. 1997) and psycholinguistic models of language processing (e.g. Boland 1997, Clifton et al. 1984, Ferreira &amp; McClure 1997, Fodor 1978, Garnsey et al. 1997, Jurafsky 1996, MacDonald 1994, Mitchell &amp; Holmes 1985, Tanenhaus et al. 
1990, Trueswell et al.</Paragraph> <Paragraph position="4"> 1993).</Paragraph> <Paragraph position="5"> Previous research, however, has shown that subcategorization probabilities vary widely in different corpora. Studies such as Merlo (1994), Gibson et al. (1996), and Roland &amp; Jurafsky (1997) have found subcategorization frequency differences between traditional corpus data and data from psychological experiments. Biber (1993) and Biber et al. (1998) have shown that word frequency, word sense (as defined by collocates), the distribution of synonymous words and the use of syntactic structures vary with corpus genre. Roland &amp; Jurafsky (1998, 2000 in press) showed that there were subcategorization frequency differences between various written and spoken corpora, and furthermore showed that these subcategorization frequency differences are caused by variation in word sense as well as genre and discourse type differences among the corpora.</Paragraph> <Paragraph position="6"> While the subcategorization probabilities in a computational language model can be adjusted to match a particular corpus, cross corpus differences in such probabilities pose an important problem when using corpora for norming psychological experiments. If each corpus generates a separate set of probabilities, which probabilities are the correct ones to use as a model of human language processing? In an attempt to use corpora to provide norming data for 64 verbs for experimental purposes, we investigate in detail how verb frequencies and verb subcategorization frequencies differ among three corpora: the British National Corpus (BNC), the Wall Street Journal corpus (WSJ), and the Brown Corpus (Brown). 
For the 64 verbs, we randomly selected a set of sentences from each corpus and hand-coded them for transitivity, passive versus active voice, and whether the selected usage was an instance of the most common sense of the verb.</Paragraph> <Paragraph position="7"> We then ask two questions: Do these verbs have the same subcategorization probabilities across corpora, and, when there are differences, what is the cause? If a set of factors causing the differences can be identified and controlled for, then a stable set of cross-corpus probabilities suitable for norming psychological experiments can be generated.</Paragraph> <Paragraph position="8"> While previous work has shown that differences between corpora do exist, and that word sense differences play a large role in realising these differences, much less is known about the effect of other factors on subcategorization variation across corpora. For example, are there gross subcategorization differences between British and American English? To what extent does the business-genre nature of the Wall Street Journal corpus affect subcategorization probabilities? Finally, while Roland and Jurafsky (2000 in press) suggested that sense differences played a major role in subcategorization biases, they were only able to test their hypothesis on a small number of verbs.</Paragraph> <Paragraph position="9"> Our eventual goal is an understanding of many levels of verb differences across corpora, including verb frequency, frequency of transitive versus intransitive uses, frequency of other subcategorization frames, and frequency of active versus passive use. This paper reports our preliminary results on the first two of these issues. Verb usage was surprisingly unaffected by differences between British and American English. Those differences that did occur seem mostly to be caused by differences in the distribution of verb senses across corpora. 
The business-genre nature of the Wall Street Journal corpus caused certain verbs to appear more often in particular senses that had a strong effect on their subcategorization frequencies. Even after controlling for the broad sense of the verb, we found subcategorization differences caused by &quot;micro-differences&quot; in sense, including quite specific arguments to the verb.</Paragraph> </Section> </Paper>