<?xml version="1.0" standalone="yes"?>
<Paper uid="P91-1027">
<Title>AUTOMATIC ACQUISITION OF SUBCATEGORIZATION FRAMES FROM UNTAGGED TEXT</Title>
<Section position="9" start_page="212" end_page="212" type="concl">
<SectionTitle> 4 CONCLUSIONS </SectionTitle>
<Paragraph position="0"> The ultimate goal of this work is to provide the NLP community with a substantially complete, automatically updated dictionary of subcategorization frames. The methods described above solve several important problems that had stood in the way of that goal. Moreover, the results obtained with those methods are quite encouraging.</Paragraph>
<Paragraph position="1"> Nonetheless, two obvious barriers still stand on the path to a fully automated SF dictionary: a decision algorithm that can handle random error, and techniques for detecting many more types of SFs.</Paragraph>
<Paragraph position="2"> Algorithms are currently being developed to resolve raw SF observations into genuine lexical properties and random error. The idea is to automatically generate statistical models of the sources of error. For example, purpose adjuncts like &quot;John quit to pursue a career in finance&quot; are quite rare, accounting for only two percent of the apparent infinitival complements. Furthermore, they are distributed across a much larger set of matrix verbs than the true infinitival complements, so any given verb should occur with a purpose adjunct extremely rarely. In a histogram sorting verbs by their apparent frequency of occurrence with infinitival complements, those that have in fact appeared with purpose adjuncts and not true subcategorized infinitives will be clustered at the low frequencies. The distributions of such clusters can be modeled automatically, and the models used to identify false positives.</Paragraph>
<Paragraph position="3"> The second requirement for automatically generating a full-scale dictionary is the ability to detect many more types of SFs. SFs involving certain prepositional phrases are particularly challenging. For example, while purpose adjuncts (mistaken for infinitival complements) are relatively rare, instrumental adjuncts, as in &quot;John hit the nail with a hammer&quot;, are more common. The problem, of course, is how to distinguish them from genuine, subcategorized PPs headed by with, as in &quot;John sprayed the lawn with distilled water&quot;. The hope is that a frequency analysis like the one planned for purpose adjuncts will work here as well, but how successful it will be, and, if successful, how large a sample size it will require, remain to be seen.</Paragraph>
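To make the error-modeling idea concrete, the following is a minimal Python sketch, assuming a binomial model of random error with a single known global error rate (such as the two percent figure for purpose adjuncts). The function names, the 0.02 default, and the 0.05 cutoff are illustrative assumptions, not details taken from this paper.

from math import comb

def p_false_positive(k: int, n: int, eps: float) -> float:
    """Chance of observing k or more apparent cues for a frame in n
    occurrences of a verb if every cue were random error at rate eps."""
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
               for i in range(k, n + 1))

def takes_frame(k: int, n: int, eps: float = 0.02, alpha: float = 0.05) -> bool:
    """Accept the frame only when the cue count is too high to be
    explained by random error alone (illustrative threshold)."""
    return alpha > p_false_positive(k, n, eps)

# One apparent infinitival complement in 100 occurrences of a verb is
# consistent with a ~2% purpose-adjunct error rate, so the frame is rejected:
print(takes_frame(1, 100))   # False
# Ten apparent complements in 100 occurrences are not, so the frame is kept:
print(takes_frame(10, 100))  # True

On such a model, the low-frequency cluster that the histogram reveals corresponds to exactly those verbs whose cue counts a random-error process could plausibly have produced.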
<Paragraph position="4"> The question of sample size leads back to an evaluation of the initial priorities, which favored simplicity, speed, and accuracy over efficient use of the corpus. There are various ways in which the high-priority criteria can be traded off against efficiency. For example, consider (2c): one might expect that the overwhelming majority of occurrences of &quot;is V-ing&quot; are genuine progressives, while a tiny minority are copula constructions. One might also expect that the occasional copula constructions are not concentrated around any one present participle but rather distributed randomly among a large population. If those expectations are true, then a frequency-modeling mechanism like the one being developed for adjuncts ought to prevent the mistaken copula analyses from doing any harm. In that case it might be worthwhile to admit &quot;is V-ing&quot;, where V is known to be a (possibly ambiguous) verb root, as a verb, independent of the Case Filter mechanism.</Paragraph>
</Section>
</Paper>
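As a closing illustration of the shortcut suggested in the final paragraph, here is a minimal, hypothetical Python sketch that admits &quot;is V-ing&quot; as an apparent progressive whenever the stem resolves to a known verb root, leaving rare copula false positives such as &quot;is interesting&quot; for the frequency model above to absorb. The VERB_ROOTS lexicon, the regular expression, and the de-suffixing rules are illustrative assumptions, not the paper's implementation.

import re

# Hypothetical mini-lexicon of known (possibly ambiguous) verb roots;
# a real run would consult the dictionary being acquired.
VERB_ROOTS = {"run", "interest", "pursue", "spray"}

IS_VING = re.compile(r"\bis\s+([a-z]+)ing\b", re.IGNORECASE)

def candidate_roots(stem):
    """Undo common -ing spelling changes: 'runn' -> 'run', 'pursu' -> 'pursue'."""
    yield stem
    if len(stem) > 1 and stem[-1] == stem[-2]:
        yield stem[:-1]   # consonant doubling
    yield stem + "e"      # dropped final e

def apparent_progressives(text):
    """Record every 'is V-ing' match whose stem resolves to a known verb
    root. Copula cases slip through here by design; screening them out
    is the error model's job."""
    roots = []
    for match in IS_VING.finditer(text):
        stem = match.group(1).lower()
        for root in candidate_roots(stem):
            if root in VERB_ROOTS:
                roots.append(root)
                break
    return roots

print(apparent_progressives("John is running, and the book is interesting."))
# ['run', 'interest'] -- 'interest' is the copula false positive that the
# error model would have to absorb.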