<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1911"> <Title>Word order variation in German main clauses: A corpus analysis</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Ordering Principles </SectionTitle> <Paragraph position="0"> In this section, we describe the three ordering principles tested in this study. Whereas the first principle concerns the contextual dependencies of a sentence, the scope of the second and third principles is the sentence itself.</Paragraph> <Paragraph position="1"> In scrambling languages, the position of verb complements can reflect their connection to the preceding context. In these languages, discourse-new information tends to occur towards the end of a sentence, whereas discourse-old information is more likely to occur at the beginning (cf. Birner and Ward (1998)). Thus, the information structure of a sentence in a scrambling language such as German can reflect its fit within a given discourse (Selkirk (1984); Steedman (1991))1: When an object precedes a subject, the object is likely to be given and the subject new; when a subject precedes an object, the subject is likely to be given and the object new, although the canonical subject-first order is also expected when both complements are either new or given. We establish how often SVO and OVS main clauses in Negra adhere to this basic ordering pattern.</Paragraph> <Paragraph position="2"> German is also a language with definite and indefinite articles. According to a second linear ordering principle, definite NPs should tend to precede indefinite NPs. We also presume that definiteness is correlated with the information status of complements. As Chafe (1976) already pointed out, indefiniteness often goes together with newness, and definiteness with givenness or newness. 
Thus, information status can be partially encoded on the NP itself by the choice of article. If the correlation with givenness drives the positioning of definite and indefinite complements, we should find for both SVO and OVS sentences that definite complements tend to precede indefinite ones. Another possibility, however, is that definiteness is bound to grammatical functions (i.e. subjects are usually definite). In that case, we would expect to find a reversal of the ordering principle for definiteness in OVS sentences.</Paragraph> <Paragraph position="3"> A third common linear ordering principle states that pronouns tend to precede full NPs.</Paragraph> <Paragraph position="4"> As with definiteness, the use of pronouns can potentially encode information structure.</Paragraph> <Paragraph position="5"> Almost by definition, pronouns refer to an antecedent in the discourse and therefore represent given information (with the possible exception of indefinite pronouns). Whereas pronoun complements usually represent given information, full NP complements are not necessarily new. Again, if the correlation with givenness drives the positioning of pronominalized complements, we should find that pronouns tend to precede full NPs in both SVO and OVS sentences. If, however, other factors such as grammatical functions determine which complements are pronominalized, we might not find this tendency.</Paragraph> <Paragraph position="6"> 1 Different terms and concepts such as theme/rheme, background/focus and given/new are used in the literature to express information structure (for a recent overview see Kruijff-Korbayová and Steedman (2003)). Since annotations in the present study are based on single referents rather than parts of sentences, we distinguish between given and new.</Paragraph> 
</Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Corpus Analysis </SectionTitle> <Paragraph position="0"> The Negra corpus (Skut et al., 1998) is an annotated collection of 20,602 sentences (355,096 tokens) extracted from the German newspaper Frankfurter Rundschau. The syntactic structure of sentences is represented in dependency trees in which the nodes describe constituents and the edges between the nodes are labeled with grammatical functions expressing syntactic relations. For our study, we choose the 'Penn format', a transformation of the original Negra treebank, in which crossing edges and traces are omitted.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Data extraction </SectionTitle> <Paragraph position="0"> Using a Perl program and the tree-search program Tgrep2 (Rohde, 2002), all OVS sentences in Negra are extracted by looking for object-verb-subject sequences with the same depth of embedding. 
Object and subject themselves, as well as the sentences in which the OVS structure occurs, can be complex (see (3)).</Paragraph> <Paragraph position="1"> (3) OVS.</Paragraph> <Paragraph position="2"> Den Satz von der Vergangenheit, die noch nicht einmal vergangen sei, zitiert auch Peter Rühl in seinem wie stets gescheiten Begleittext zur jüngsten CD des Trompeters Frank Koglmann, einem der wichtigsten Musiker des europäischen Jazz.</Paragraph> <Paragraph position="3"> The sentence (ACC) about the past that has not even passed yet, cites also Peter Rühl (NOM) in his as always smart accompanying text to the latest CD of the trumpet player Frank Koglmann, one of the most important musicians in European jazz.</Paragraph> <Paragraph position="4"> Clausal objects with verbal constructions, as well as direct and indirect questions, are manually omitted from the list after extraction. A total of 625 OVS sentences are kept for analysis (3% of all sentences in Negra). Next, 625 comparable SVO sentences (out of 2773) are chosen.2 Since the total number of SVO sentences in Negra notably exceeds that of OVS sentences, only a subset of 625 SVO sentences is selected for practical reasons. Since the selection is random, we assume that findings within the subset are generally valid for SVO sentences. In addition, for each selected OVS and SVO sentence, the two immediately preceding sentences in Negra are extracted and serve as context to determine the information status (given or new) of complements.</Paragraph> <Paragraph position="5"> 2 We do not allow SVO sentences in which the object is a reflexive pronoun (e.g. &quot;er fürchtet sich&quot;, he is afraid), since reflexive pronouns are most unlikely to be fronted. In fact, none of our OVS sentences contained a fronted reflexive pronoun. 
Furthermore, reflexive pronouns express coreference within a clause, whereas we are interested in references across sentence boundaries.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Data Coding </SectionTitle> <Paragraph position="0"> For each extracted OVS and SVO sentence, the authors annotate information status, definiteness, and pronominalization of its verb complements. For complex subjects and objects, annotations are based on the semantic head of the complement (i.e. noun or pronoun if the semantic head coincides with the syntactic head).</Paragraph> <Paragraph position="1"> Definiteness of NPs is assigned on the basis of the determiner and information status of NPs on the basis of the noun. Whereas it seems obvious to base annotations on the head of complex complements, the decision is less clear when a complement contains more than one equivalent NP (i.e. coordinated NPs). Rather than annotating all NPs, for reasons of comparability we base annotations on the first NP only whenever a complement consists of a list of more than one NP. Furthermore, some SVO sentences contain both direct and indirect objects, though none of our OVS sentences do. For these SVO sentences, only the direct object is considered for the annotations. SVO sentences with a reflexive pronoun as indirect object form an exception; for these, annotations are based on the direct object (1.6% of all SVO sentences).</Paragraph> <Paragraph position="2"> Givenness. Two preceding context sentences are used to determine whether verb complements present new or given information.3 We code complements as given if they present accessible information (Lambrecht, 1994). Accessibility can be provided either textually or inferentially. Textual accessibility requires an explicit coreferential antecedent (i.e. the occurrence of the same lemma in the context). Inferentially accessible complements do not require an explicit antecedent. 
Such inferables (Prince, 1981) are assumed to be activated via bridging inferences (Clark, 1977) that logical relations such as synonymy, hyponymy, and meronymy can provide. In such cases, shared general knowledge of the relations between objects and their components or attributes is assumed.</Paragraph> <Paragraph position="3"> 3 A recent study by Morton (2000) showed for singular pronouns that the antecedents were available within two preceding sentences 98.7% of the time. His findings are similar to those reported by Hobbs (1976). Since the context in our study regularly introduces complements, we believe that it gives an adequate picture of the interaction between information status and word order.</Paragraph> <Paragraph position="4"> Whenever more specific knowledge is required to establish such a relation, however, complements are considered to be new. The distinction between general and specific knowledge is particularly hard to maintain, since it is often clearly not binary. For instance, geographic familiarity with the catchment area of the Frankfurter Rundschau is considered specific knowledge: In (4), &quot;Waldstadion&quot; is one of the local soccer stadiums in Frankfurt. Even though many local readers of the Frankfurter Rundschau will know this, it cannot be assumed to be known by all readers of the newspaper, and it is therefore coded as new information.</Paragraph> <Paragraph position="5"> (4) Frankfurt - Waldstadion Moreover, when two entities X and Y of a potentially larger group Z are considered equally specific, Y is coded as new information after X is mentioned in the context (see (5)). Here, &quot;Klassik&quot; and &quot;Jazz&quot; are two examples of music styles.</Paragraph> <Paragraph position="6"> (5) Klassik - Jazz classical music - jazz Constructions with &quot;es&quot; form a special case. They are almost exclusively used impersonally in sentences such as &quot;Karten gibt es&quot;, tickets are available. 
We then annotate &quot;es&quot; as new information.</Paragraph> <Paragraph position="7"> Definiteness. For all complements, we annotate whether they are definite or indefinite. We largely follow the classification suggested by Prince (1992). Markers of definite complements are definite articles, demonstrative articles, possessive articles, personal pronouns, and unmodified proper names. Markers of indefinite complements are indefinite articles, zero articles, quantifiers, and numerals. Note that all quantifiers are marked as indefinite, even though certain quantifiers such as &quot;all&quot; and &quot;every&quot; have been suggested in the literature to mark definite descriptions. Furthermore, certain syntactically indefinite DPs have been argued to be semantically definite, and certain syntactically definite DPs to be semantically indefinite. In our study, however, only formal syntactic properties are critical for the assignment of definiteness.</Paragraph> <Paragraph position="8"> Pronominalization. For the annotation of pronominalization, we check whether complements are realized as pronouns or full NPs.</Paragraph> </Section> </Section> </Paper>