<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1064">
  <Title>Creating a CCGbank and a wide-coverage CCG lexicon for German</Title>
  <Section position="4" start_page="505" end_page="505" type="metho">
    <SectionTitle>
2 German syntax and morphology
</SectionTitle>
    <Paragraph position="0"> Morphology German verbs are inflected for person, number, tense and mood. German nouns and adjectives are inflected for number, case and gender, and noun compounding is very productive.</Paragraph>
    <Paragraph position="1"> Word order German has three different word orders that depend on the clause type. Main clauses (1) are verb-second. Imperatives and questions are verb-initial (2). If a modifier or one of the objects is moved to the front, the word order becomes verb-initial (2). Subordinate and relative clauses are verb-final (3):  (1) a. Peter gibt Maria das Buch.</Paragraph>
    <Paragraph position="2"> Peter gives Mary the book.</Paragraph>
    <Paragraph position="3"> b. ein Buch gibt Peter Maria.</Paragraph>
    <Paragraph position="4"> c. dann gibt Peter Maria das Buch.</Paragraph>
    <Paragraph position="5"> (2) a. Gibt Peter Maria das Buch? b. Gib Maria das Buch! (3) a. dass Peter Maria das Buch gibt.</Paragraph>
    <Paragraph position="6"> b. das Buch, das Peter Maria gibt.</Paragraph>
    <Paragraph position="7"> Local Scrambling In the so-called &amp;quot;Mittelfeld&amp;quot; all orders of arguments and adjuncts are potentially possible. In the following example, all 5! permutations are grammatical (Rambow, 1994): (4) dass [eine Firma] [meinem Onkel] [die M&amp;quot;obel] [vor drei Tagen] [ohne Voranmeldung] zugestellt hat. that [a company] [to my uncle] [the furniture] [three  days ago] [without notice] delivered has.</Paragraph>
    <Paragraph position="8"> Long-distance scrambling Objects of embedded verbs can also be extraposed unboundedly within the same sentence (Rambow, 1994): (5) dass [den Schrank] [niemand] [zu reparieren] versprochen hat.</Paragraph>
    <Paragraph position="9"> that [the wardrobe] [nobody] [to repair] promised has.</Paragraph>
  </Section>
  <Section position="5" start_page="505" end_page="506" type="metho">
    <SectionTitle>
3 A CCG for German
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="505" end_page="506" type="sub_section">
      <SectionTitle>
3.1 Combinatory Categorial Grammar
</SectionTitle>
      <Paragraph position="0"> CCG (Steedman (1996; 2000)) is a lexicalized grammar formalism with a completely transparent syntax-semantics interface. Since CCG is mildly context-sensitive, it can capture the crossing dependencies that arise in Dutch or German, yet is efficiently parseable.</Paragraph>
      <Paragraph position="1"> In categorial grammar, words are associated with syntactic categories, such as CBD2C6C8 or B4CBD2C6C8B5BPC6C8 for English intransitive and transitive verbs. Categories of the form CGBPCH or CGD2CH are functors, which take an argument CH to their left or right (depending on the the direction of the slash) and yield a result CG. Every syntactic category is paired with a semantic interpretation (usually a AL-term).</Paragraph>
      <Paragraph position="2"> Like all variants of categorial grammar, CCG uses function application to combine constituents, but it also uses a set of combinatory rules such as composition (BU) and type-raising (CC). Non-orderpreserving type-raising is used for topicalization:  the use of additional &amp;quot;type-changing&amp;quot; rules to deal with complex adjunct categories (e.g. B4C6C8D2C6C8B5 B5 CBCJD2CVCLD2C6C8 for ing-VPs that act as noun phrase modifiers). Here, we also use a small number of such rules to deal with similar adjunct cases.</Paragraph>
    </Section>
    <Section position="2" start_page="506" end_page="506" type="sub_section">
      <SectionTitle>
3.2 Capturing German word order
</SectionTitle>
      <Paragraph position="0"> We follow Steedman (2000) in assuming that the underlying word order in main clauses is always verb-initial, and that the sententce-initial subject is in fact topicalized. This enables us to capture different word orders with the same lexical category (Figure 1). We use the features CBCJDABDCL and CBCJDAD0CPD7D8CL to distinguish verbs in main and subordinate clauses.</Paragraph>
      <Paragraph position="1"> Main clauses have the feature CBCJCSCRD0CL, requiring either asentential modifier with category CBCJCSCRD0CLBPCBCJDABDCL, a topicalized subject (CBCJCSCRD0CLBPB4CBCJDABDCLBPC6C8CJD2D3D1CLB5), or a type-raised argument (CBCJCSCRD0CLBPB4CBCJDABDCLD2CGB5), where CG can be any argument category, such as a noun phrase, prepositional phrase, or a non-finite VP.</Paragraph>
      <Paragraph position="2"> Here is the CCG derivation for the subordinate clause (CBCJCTD1CQCL)example: dass Peter Maria das Buch gibt</Paragraph>
      <Paragraph position="4"> For simplicity's sake our extraction algorithm ignores the issues that arise through local scrambling, and assumes that there are different lexical category for each permutation.</Paragraph>
      <Paragraph position="5">  Type-raising and composition are also used to deal with wh-extraction and with long-distance scrambling (Figure 2).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="506" end_page="509" type="metho">
    <SectionTitle>
4 Translating Tiger graphs into CCG
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="506" end_page="506" type="sub_section">
      <SectionTitle>
4.1 The Tiger corpus
</SectionTitle>
      <Paragraph position="0"> The Tiger corpus (Brants et al., 2002) is a publicly available  corpus of ca. 50,000 sentences (almost 900,000 tokens) taken from the Frankfurter Rundschau newspaper. The annotation is based on a hybrid framework which contains features of phrase-structure and dependency grammar. Each sentence is represented as a graph whose nodes are labeled with syntactic categories (NP, VP, S, PP, etc.) and POS tags. Edges are directed and labeled with syntactic functions (e.g. head, subject, accusative object, conjunct, appositive). The edge labels are similar to the Penn Treebank function tags, but provide richer and more explicit information. Only 72.5% of the graphs have no crossing edges; the remaining 27.5% are marked as dis- null Variants of CCG, such as Set-CCG (Hoffman, 1995) and Multimodal-CCG (Baldridge, 2002), allow a more compact lexicon for free word order languages.</Paragraph>
      <Paragraph position="1">  http://www.ims.uni-stuttgart.de/projekte/TIGER continuous. 7.3% of the sentences have one or more &amp;quot;secondary&amp;quot; edges, which are used to indicate double dependencies that arise in coordinated structures which are difficult to bracket, such as right node raising, argument cluster coordination or gapping. There are no traces or null elements to indicate non-local dependencies or wh-movement.</Paragraph>
      <Paragraph position="2"> Figure 2 shows the Tiger graph for a PP whose NP argument is modified by a relative clause.</Paragraph>
      <Paragraph position="3"> There is no NP level inside PPs (and no noun level inside NPs). Punctuation marks are often attached at the so-called &amp;quot;virtual&amp;quot; root (VROOT) of the entire graph. The relative pronoun is a dative object (edge label DA) of the embedded infinitive, and is therefore attached at the VP level. The relative clause itself has the category S; the incoming edge is labeled RC (relative clause).</Paragraph>
    </Section>
    <Section position="2" start_page="506" end_page="509" type="sub_section">
      <SectionTitle>
4.2 The translation algorithm
</SectionTitle>
      <Paragraph position="0"> Our translation algorithm has the following steps:</Paragraph>
      <Paragraph position="2"> else fail; else fail; else fail; 1. Creating a planar tree: After an initial pre- null processing step which inserts punctuation that is attached to the &amp;quot;virtual&amp;quot; root (VROOT) of the graph in the appropriate locations, discontinuous graphs are transformed into planar trees. Starting at the lowest nonterminal nodes, this step turns the Tiger graph into a planar tree without crossing edges, where every node spans a contiguous substring. This is required as input to the actual translation step, since CCG derivations are planar binary trees. If the first to the CXth child of a node CG span a contiguous substring that ends in the CYth word, and the B4CXB7BDB5th child spans a sub-string starting at CZ BQ CYB7BD, we attempt to move the first CX children of CG to its parent C8 (if the head position of C8 is greater than CX). Punctuation marks and adjuncts are simply moved up the tree and treated as if they were originally attached to C8. This changes the syntactic scope of adjuncts, but typically only VP modifiers are affected which could also be attached at a higher VP or S node without a change in meaning. The main exception  are extraposed relative clauses, which CCG treats as sentential modifiers with an anaphoric dependency. Arguments that are moved up are marked as extracted, and an additional &amp;quot;extraction&amp;quot; edge (explained below) from the original head is introduced to capture the correct dependencies in the CCG derivation. Discontinuous dependencies between resumptive pronouns (&amp;quot;place holders&amp;quot;, PH) and their antecedents (&amp;quot;repeated elements&amp;quot;, RE) are also dissolved.</Paragraph>
      <Paragraph position="3"> 2. Additional preprocessing: In order to obtain the desired CCG analysis, a certain amount of pre-processing is required. We insert NPs into PPs, nouns into NPs  , and change sentences whose  first element is a complementizer (dass, ob,etc.) into an SBAR (a category which does not exist in the original Tiger annotation) with S argu- null The span of nouns is given by the NK edge label. ment. This is necessary to obtain the desired CCG derivations where complementizers and prepositions take a sentential or nominal argument to their right, whereas they appear at the same level as their arguments in the Tiger corpus. Further pre-processing is required to create the required structures for wh-extraction and certain coordination phenomena (see below).</Paragraph>
      <Paragraph position="4"> In figure 2, preprocessing of the original Tiger graph (top) yields the tree shown in the middle (edge labels are shown as Penn Treebank-style function tags).</Paragraph>
      <Paragraph position="5">  We will first present the basic translation algorithm before we explain how we obtain a derivation which captures the dependency between the relative pronoun and the embedded verb.</Paragraph>
      <Paragraph position="6">  We treat reflexive pronouns as modifiers.</Paragraph>
      <Paragraph position="7">  3. The basic translation step Our basic transla null tion algorithm is very similar to Hockenmaier and Steedman (2005). It requires a planar tree without crossing edges, where each node is marked as head, complement or adjunct. The latter information is represented in the Tiger edge labels, and only a small number of additional head rules is required. Each individual translation step operates on local trees, which are typically flat.</Paragraph>
      <Paragraph position="8">  ) from right to left to create a left-branching tree. Thealgorithm starts atthe root category and recursively traverses the tree.</Paragraph>
      <Paragraph position="9">  The CCG category of complements and of the root of the graph is determined from their Tiger label. VPs are CBCJBMCLD2C6C8, where the feature CJBMCL distinguishes bare infinitives, zu-infinitives, passives, and (active) past participles. With the exception of passives, these features can be determined from the POS tags alone.</Paragraph>
      <Paragraph position="10">  the English CCGbank, our grammar ignores number and person agreement.</Paragraph>
      <Paragraph position="11"> Special cases: Wh-extraction and extraposition In Tiger, wh-extraction is not explicitly marked. Relative clauses, wh-questions and free relatives are all annotated as S-nodes,and the wh-word is a normal argument of the verb. After turning the graph into a planar tree, we can identify these constructions by searching for a relative pronoun in the leftmost child of an S node (which may be marked as extraposed in the case of extraction from an embedded verb). As shown in figure 2, we turn this S into an SBAR (a category which does not exist in Tiger) with the first edge as complementizer and move the remaining chil- null Eventive (&amp;quot;werden&amp;quot;) passive is easily identified by context; however, we found that not all stative (&amp;quot;sein&amp;quot;) passives seem to be annotated as such.</Paragraph>
      <Paragraph position="12">  In some contexts, measure nouns (e.g. Mark, Kilometer) lack case annotation.</Paragraph>
      <Paragraph position="13"> dren under a new S node which becomes the second daughter of the SBAR. The relative pronoun is the head of this SBAR and takes the S-node as argument. Its category is CBCJDAD0CPD7D8CL, since all clauses with a complementizer are verb-final. In order to capture the long-range dependency, a &amp;quot;trace&amp;quot; is introduced, and percolated down the tree, much like in the algorithm of Hockenmaier and Steedman (2005), and similar to GPSG's slash-passing (Gazdar et al., 1985). These trace categories are appended to the category of the head node (and other arguments are type-raised as necessary). In our case, the trace is also associated with the verb whose argument it is. If the span of this verb is within the span of a complement, the trace is percolated down this complement. When the VP that is headed by this verb is reached, we assume a canonical order of arguments in order to &amp;quot;discharge&amp;quot; the trace.</Paragraph>
      <Paragraph position="14"> If a complement node is marked as extraposed, it is also percolated down the head tree until the constituent whose argument it is is found. When another complement is found whose span includes the span of the constituent whose argument the extraposed edge is, the extraposed category is percolated down this tree (we assume extraction out of adjuncts is impossible).</Paragraph>
      <Paragraph position="15">  In order to capture the topicalization analysis, main clause subjects also introduce a trace. Fronted complements or subjects, and the first adjunct in main clauses are analyzed as described in figure 1.</Paragraph>
      <Paragraph position="16"> Special case: coordination - secondary edges Tiger uses &amp;quot;secondary edges&amp;quot; to represent the dependencies that arise in coordinate constructions such as gapping, argument cluster coordination and right (or left) node raising (Figure 3). In right (left) node raising, the shared elements are arguments or adjuncts that appear on the right periphery of the last, (or left periphery of the first) conjunct. CCG uses type-raising and composition to combine the incomplete conjuncts into one constituent which combines with the shared element: liest immer und beantwortet gerne jeden Brief.</Paragraph>
      <Paragraph position="17"> always reads and gladly replies to every letter.  In our current implementation, each node cannot have more than one forward and one backward extraposed element andoneforwardandonebackward trace. Itmaybepreferable to use list structures instead, especially for extraposition.  Complex coordinations: a Tiger graph with secondary edges  In order to obtain this analysis, we lift such shared peripheral constituents inside the conjuncts of conjoined sentences CS (or verb phrases, CVP) to new S (VP) level that we insert in between the CS and its parent.</Paragraph>
      <Paragraph position="18"> In argument cluster coordination (Figure 3), the shared peripheral element (aussprachen)isthe head.</Paragraph>
      <Paragraph position="19">  In CCG, the remaining arguments and adjuncts combine via composition and typeraising into a functor category which takes the category of the head as argument (e.g. a ditransitive verb), and returns the same category that would result from a non-coordinated structure (e.g. a VP). The result category of the furthest element in each conjunct is equal to the category of the entire VP (or sentence), and all other elements are type-raised and composed with this to yield a category which takes as argument a verb with the required subcat frame and returns a verb phrase (sentence). Tiger assumes instead that there are two conjuncts (one of which is headless), and uses secondary edges  W&amp;quot;ahrend has scope over the entire coordinated structure. to indicate the dependencies between the head and the elements in the distant conjunct. Coordinated sentences and VPs (CS and CVP) that have this annotation are rebracketed to obtain the CCG constituent structure, and the conjuncts are marked as argument clusters. Since the edges in the argument cluster are labeled with their correct syntactic functions, we are able to mimic the derivation during category assignment.</Paragraph>
      <Paragraph position="20"> In sentential gapping, the main verb is shared and appears in the middle of the first conjunct: (6) Er trinkt Bier und sie Wein.</Paragraph>
      <Paragraph position="21"> He drinks beer and she wine.</Paragraph>
      <Paragraph position="22"> Asin the English CCGbank, we ignore this construction, which requires a non-combinatory &amp;quot;decomposition&amp;quot; rule (Steedman, 1990).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML