<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1064"> <Title>Creating a CCGbank and a wide-coverage CCG lexicon for German</Title> <Section position="3" start_page="0" end_page="505" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> A number of wide-coverage TAG, CCG, LFG and HPSG grammars (Xia, 1999; Chen et al., 2005; Hockenmaier and Steedman, 2002a; O'Donovan et al., 2005; Miyao et al., 2004) have been extracted from the Penn Treebank (Marcus et al., 1993), and have enabled the creation of wide-coverage parsers for English which recover local and non-local dependencies that approximate the underlying predicate-argument structure (Hockenmaier and Steedman, 2002b; Clark and Curran, 2004; Miyao and Tsujii, 2005; Shen and Joshi, 2005). However, many corpora (Böhmová et al., 2003; Skut et al., 1997; Brants et al., 2002) use dependency graphs or other representations, and the extraction algorithms that have been developed for Penn Treebank-style corpora may not be immediately applicable to these representations. As a consequence, research on statistical parsing with &quot;deep&quot; grammars has largely been confined to English. Free word order languages typically pose greater challenges for syntactic theories (Rambow, 1994), and the richer inflectional morphology of these languages creates additional problems both for the coverage of lexicalized formalisms such as CCG or TAG, and for the usefulness of dependency counts extracted from the training data.</Paragraph> <Paragraph position="1"> On the other hand, formalisms such as CCG and TAG are particularly suited to capture the crossing dependencies that arise in languages such as Dutch or German, and by choosing an appropriate linguistic representation, some of these problems may be mitigated.</Paragraph> <Paragraph position="2"> Here, we present an algorithm which translates the German Tiger corpus (Brants et al., 2002) into CCG derivations. 
Similar algorithms have been developed by Hockenmaier and Steedman (2002a) to create CCGbank, a corpus of CCG derivations (Hockenmaier and Steedman, 2005), from the Penn Treebank, by Çakıcı (2005) to extract a CCG lexicon from a Turkish dependency corpus, and by Moortgat and Moot (2002) to induce a type-logical grammar for Dutch.</Paragraph> <Paragraph position="3"> The annotation scheme used in Tiger is an extension of that used in the earlier, and smaller, German Negra corpus (Skut et al., 1997). Tiger is better suited for the extraction of subcategorization information (and thus the translation into &quot;deep&quot; grammars of any kind), since it distinguishes between PP complements and modifiers, and includes &quot;secondary&quot; edges to indicate shared arguments in coordinate constructions. Tiger also includes morphology and lemma information.</Paragraph> <Paragraph position="4"> Negra is also provided with a &quot;Penn Treebank&quot;-style representation, which uses flat phrase structure trees instead of the crossing dependency structures in the original corpus. This version has been used by Cahill et al. (2005) to extract a German LFG. However, Dubey and Keller (2003) have demonstrated that lexicalization does not help a Collins-style parser that is trained on this corpus, and Levy and Manning (2004) have shown that its context-free representation is a poor approximation to the underlying dependency structure. The resource presented here will enable future research to address the question of whether &quot;deep&quot; grammars such as CCG, which capture the underlying dependencies directly, are better suited to parsing German than linguistically inadequate context-free approximations.</Paragraph> <Paragraph position="5"> 1. Standard main clause</Paragraph> </Section></Paper>