<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1529">
  <Title>Robust Extraction of Subcategorization Data from Spoken Language</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Subcategorization data has been crucial for various NLP tasks. Current method for automatic SCF acquisition usually proceeds in two steps: first, generate all SCF cues from a corpus using a parser, and then filter out spurious SCF cues with statistical tests. Previous studies on SCF acquisition have worked mainly with written texts; spoken corpora have received little attention. Transcripts of spoken language pose two challenges absent in written texts: uncertainty about utterance segmentation and disfluency.</Paragraph>
    <Paragraph position="1"> Roland &amp; Jurafsky (1998) suggest that there are substantial subcategorization differences between spoken and written corpora. For example, spoken corpora tend to have fewer passive sentences but many more zero-anaphora structures than written corpora. In light of such subcategorization differences, we believe that an SCF set built from spoken language may, if of acceptable quality, be of particular value to NLP tasks involving syntactic analysis of spoken language.</Paragraph>
  </Section>
class="xml-element"></Paper>