File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/c00-1041_abstr.xml

Size: 19,127 bytes

Last Modified: 2025-10-06 13:41:35

<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1041">
  <Title>Argument Structure. Proceedings of the ARPA</Title>
  <Section position="2" start_page="0" end_page="282" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> The procedure of reconstruction of the underlying structure of sentences (in the process of tagging a very large corpus of Czech) is described, with a special attention paid to the conditions under which the reconstruction of ellipted nodes is carried out.</Paragraph>
    <Paragraph position="1"> 1. The tagging scenarios with different (degrees and types of) theoretical backgrounds have undergone a rather rapid development flom morphologically based part-of-speech (POS) tagging through treebanks capturing the surface syntactic structures of sentences to semantically oriented tagging models, taking into account the underlying structure of sentences and/or certain issues of the 'inner' semantics of lexical units and their collocations.</Paragraph>
    <Paragraph position="2"> One of the critical aspects of the tagging scenario capturing the underlying structure of the sentences is the 'depth' of the resulting tree structures; in other words, how far these structures differ from the surface structures. If we take for granted (as is the case in most of the syntactic treebanks) that every word of the (surface) sentence should have a node of its own in the surface tree structure, then this issue can in part be reformulated in terms of two subquestions: (i) which surface nodes are superfluous and should be 'pruned away', (ii) which nodes should be assumed to be deleted in the surface and should be 'restored' in the underlying structure (e.g. in forms of different kinds of dummy symbols, see Fillmore 1999).</Paragraph>
    <Paragraph position="3"> In our paper, we are concerned with the point (ii).</Paragraph>
    <Paragraph position="4"> 2. In the TG and post-TG writings, it is common to distinguish between two types of deletions: (a) ellipsis proper and (b) gapping. For both of them, it is crucial that the elliptical construction and its antecedent should be parallel and 'identical' at least in some features. The two types of ellipsis can be illustrated by examples (1) and (2), respectively.</Paragraph>
    <Paragraph position="5">  (1) Psal jenom flkoly, kterd chtal.</Paragraph>
    <Paragraph position="6"> lit. 'He-wrote only homework's which hewanted' null (2) Honza dal Marii rfiki a Petr Ida tulipfin.  lit. 'John gave Mary rose and Peter Ida tulip' For both types, a reconstruction in some way or another is necessary, if the tree structure is to capture the underlying structure of the sentences.</Paragraph>
    <Paragraph position="7">  3. The examples quoted in the previous section cover what Quirk et al. (1973, pp. 536-620) call 'ellipsis in the strict sense'; they view ellipsis as a purely surface  phenomenon: the recoverability of the ellipted words is always unique and 'fits' into the surface structure. They difl'erentiate ellipsis fiom 'semantic implication' which would cover e.g. such cases as (3) and (4):  (3) John wants to read.</Paragraph>
    <Paragraph position="8"> (4) Thanks.</Paragraph>
    <Paragraph position="9">  If (3) is 'reconstructed' as 'John wants John to read', then the two occurrences of 'John' are referentially different, which is not true about the interpretation of (3). With (4), it cannot be uniquely determined whether the full corresponding structure should be '1 owe you thanks' or 'I give you thanks' etc. 4. For tagging a corpus on the underlying level, it is clear that we cannot limit ourselves to the cases of ellipsis in tile strict sense but we have to broaden lhe notion of 'reconstruction' to cover both (i) deletions licensed by the grammatical i)roperties of sentence elements or sentence structure, and (it) deletions licensed only by the preceding context (be it co-text or context of situation). 4.1. In our analysis of a sample of Czech National Corpus, two situations may occur within the group (i): (a) Only the position itself that should be &amp;quot;filled&amp;quot; in the sentence structure is predetemfined (i.e. a sentence element is subcategorized for this position), but its lexical setting is Tree'.</Paragraph>
    <Paragraph position="10"> This is e.g. the case of the so-called pro-drop character of Czech, where the position of the subject of a verb is 'given', but it may be filled in dependence on the context.</Paragraph>
    <Paragraph position="11"> (5) Pi:edseda vlfidy i:ekl, ~e pf'edlo~i nfivrh na zmenu volebniho systdmu.</Paragraph>
    <Paragraph position="12"> 'The Prime-minister said that (0) will submit a proposal on tile change of the electoral system.' The 'dropped' subject of the verb pi:edlo2i 'will submit' may refer to the Primeminister, to the Govermnent, or to somebody else identifiable on the basis of the context.</Paragraph>
    <Paragraph position="13"> Here also belong cases of the semantically obligatory but deletable complementations of verbs: the Czech verb l)i:(/et 'to arrive' has as its obligatory complementation an Actor and a Directional &amp;quot;where-to&amp;quot; (the obligatoriness of the Directional complementation can be tested by a question test, see Panevovfi 1974; Sgall et al. 1986), which can be deleted on the surface; its reference is determined by the context.</Paragraph>
    <Paragraph position="14"> (6) Vlak pi~ijede v poledne.</Paragraph>
    <Paragraph position="15"> 'Tile train will arrive at noon.' The utterer of (6) deletes the Direction 'where-to' because s/he assumes that the hearer knows the referent.</Paragraph>
    <Paragraph position="16"> (b) Both the position and its 'filler' are predetermined.</Paragraph>
    <Paragraph position="17"> This is the case of e.g. the subject of the infinitival complement of the so-called verbs of control as in (7).</Paragraph>
    <Paragraph position="18">  (7) Pi&amp;quot;edseda vVldy slibil pi~edlo~it nfivrh na zm6nu volebniho systdmu.</Paragraph>
    <Paragraph position="19"> 'The Prime-minister promised to submit a  proposal on the change of the electoral system.' The identification of the underlying subject of the infinitive is 'controlled' by the Actor of the main verb, in our example it is 'the Prime-minister'.</Paragraph>
    <Paragraph position="20"> Another example of this class of deletions are the so-called General Participants (close to the English one or German man): General Actor in (8), General Patient in (9), or  General Addressee in (10).</Paragraph>
    <Paragraph position="21"> (8) Ta kniha byla u~ vydfina dvakrfit.</Paragraph>
    <Paragraph position="22"> 'The book has already been published twice.' (9) V nedeli obvykle pe~,u.</Paragraph>
    <Paragraph position="23"> 'On Sundays (I) usually bake.' (10) D6dei?ek 6asto vypravuie pohfidky. 'Grandfather often tells fairy-tales.'</Paragraph>
    <Section position="1" start_page="279" end_page="279" type="sub_section">
      <SectionTitle>
4.2 Within the group (ii), there belong cases
</SectionTitle>
      <Paragraph position="0"> of the so-called 'occasional ellipsis' conditioned by the context alone.</Paragraph>
      <Paragraph position="1"> We are aware that not everything in any position that is identifiable on the basis of the context can be deleted in Czech (as might be in an extreme way concluded from examples (11) through (14)). However, the conditions restricting the possibility of ellipsis in Czech seem to be less strict than  e.g. in English, as illustrated by (15): (11) Milujeme a ctime svdho u~itele.</Paragraph>
      <Paragraph position="2"> 'We love and honour our teacher.' (12) Marii j sem vid~l a slygel zpivat.</Paragraph>
      <Paragraph position="3"> lit. 'Mary-Acc. Aux-be saw and heard tosing' null '! saw and heard Mary singing.' (13) Jirka se v~era v hospod6 opil do n6moty a Honza dneska.</Paragraph>
      <Paragraph position="4"> lit. 'Jirka himself yesterday in pub drunk to death and Honza today.' 'In the pub, Jirka drunk himself to death yesterday and Honza today.' (14) Petr f-ikal Pavlovi, aby ~el ven, a Martin, aby zflstal doma.</Paragraph>
      <Paragraph position="5"> 'Peter told Pavel to go outside and Martin (told Pavel) to stay at home.' (15) (Potkaljsi veera Toma?) Potkal.</Paragraph>
      <Paragraph position="6"> '(Did you meet Tom yesterday?) Met'.</Paragraph>
      <Paragraph position="7"> 4.3 in addition to setting principles of which nodes need to be restored it is also important to say in which cases no restoration is desirable. Nodes are not restored in cases of: (a) accidental omission (due to emotion, excitement or insufficient command of language, see e.g. Hlavsa 1990); (b) unfinished sentences, which usually lack focus (unlike ellipsis where the 'missing' elements belong to topic); (c) sentences without a finite verb that can  be captured by a structure with a noun in its root (in these cases there are no empty positions, nothing can be really added). All these cases have no clear-cut boundaries, rather it is more appropriate to expect continual transitions.</Paragraph>
    </Section>
    <Section position="2" start_page="279" end_page="280" type="sub_section">
      <SectionTitle>
5.1 The Prague Dependency Tree Bank
</SectionTitle>
      <Paragraph position="0"> (PDT in the sequel), which has been inspired by the build-up of the Penn  (the corpus currently comprises about 100 million tokens of word forms). PDT comprises three layers of annotations: (i) the morphemic layer with about 3000 morphemic tag values; a tag is assigned to each word form of a sentence in the corpus and the process of tagging is based on stochastic procedures described by Haii6 and Hladkfi (1997); (ii) analytic tree structures (ATSs) with every word form and punctuation mark explicitly represented as a node of a rooted tree, with no additional nodes added (except for the root of the tree of every sentence) and with the edges of the tree corresponding to (surface) dependency relations; (iii) tectogrammatical tree structures (TGTSs) corresponding to the underlying sentence representations, again dependency-based.</Paragraph>
      <Paragraph position="1"> At present the PDT contains 100000 sentences (i.e. ATSs) tagged on the first two layers. As for the third layer, the input for the tagging procedure are the ATSs; this procedure is in its starting phase and is divided into (i) automatic preprocessing (see BOhmov~ and Sgall 2000) and (ii) the manual phase. The restoration of the syntactic information absent in the surface (morphemic) shape of the sentence (i.e. for which there are no nodes on the analytic level) is mostly (but not exclusively) done - null at least for the time being - in the manual phase of the transduction procedure. In this phase, the tagging of the topic-focus articulation is also performed (see Burfifiovfi, Hajieovfi and Sgall 2000).</Paragraph>
    </Section>
    <Section position="3" start_page="280" end_page="282" type="sub_section">
      <SectionTitle>
5.2 The reconstruction of deletions in
</SectionTitle>
      <Paragraph position="0"> TGTSs is guided by the following general principles: (i) All 'restored' nodes standing for elements deleted in the surface structure of the sentence but present in its underlying structure get marked by one of the following values in the attribute DEL: ELID: the 'restored' element stauds alone; e.g. the linearized TGTS (disregarding other than structural relations) for (16) is (16'). (Note: Every dependent item is enclosed in a pair of parenthesis. The capitalized abbreviations stand for dependency relations and are self-explaining; in our examples we use English lexical units to make the  ELEX: if the antecedent is an expanded head node and not all the deleted nodes belong to the obligatory colnplementations of the given node and as such not all are reconstructed, cf. e.g. the simplified TGTS</Paragraph>
      <Paragraph position="2"> EXPN: if the given node itself was not ellipted but some of its complementations were and are not restored (see the principle (iii)(b) below), cf. e.g. the simplified TGTS in (15') for (15) above, with nonreconstructed telnporal modification:  immediately to the left of their governor. (iii) The following cases are prototypical examples of restorations (for an easier reference to the above discussion of the types of deletions, the primed numbers of tile TGTSs refer to the example sentences in Section 4): (a) Restoration of nodes for complementations for which the head nodes (governors) are subcategorized. The assignment of the lexical labels is governed by the following principles: in pro-drop cases (5') (comparable to Fillmore's 1999 CNI - constructionally-licensed null instantiation) and with an obligatory but deletable complementation (6') (cf.</Paragraph>
      <Paragraph position="3"> Filhnore's definite null instantiation, DNI) the lexieal value corresponds to the respective pronoun; with grammatical coreference (control), the lexical value is Cor (7'); in both these cases, the lexical wdue of the antecedent is put into a special attribute of Coreference; in cases of general participants (cf. Filllnore's indefinite null instantiation - INI) the lexical value is Gen</Paragraph>
      <Paragraph position="5"> (b) Elipted optional complelnentations are not restored (see (13') above) unless they are governors of adjuncts.</Paragraph>
      <Paragraph position="6"> (c) For coordinated structures, the guiding principle says: whenever possible, give  precedence to a &amp;quot;constituent&amp;quot; coordination before a &amp;quot;sentential&amp;quot; one (more generally: &amp;quot;be as economical as possible&amp;quot;), thus examples like (t7) are not treated as sentential coordination (i.e. they are not transformed into structures corresponding to (17')).</Paragraph>
      <Paragraph position="7"> (17) Karel pfinesl Jan6 kv6tiny a knihu. 'Katel brought Jane flowers and a book.' (17') Karel pfinesl Jan6 kv6tiny a Katel pfinesl Jan6 knihu.</Paragraph>
      <Paragraph position="8"> 'Karel brought Jane flowers and Karel brought Jane a book.' A special symbol CO is introduced in the complex labels for the coordinated nodes to mark which nodes stand in the coordination relation and which modify the coordination as a whole (see (11')); the lexical value of the restored elements is copied from the antecedents (see (13') above): (11') ((we.ACT) (love.CO) and (honour.CO) (our teacher.PAT)) The analysis of (11') is to be preferred to sentential coordination with deletion also for its correspondence with the fact, that in Czech object can stand after coordinated verbs only if the semantic relation between the verbs allows for a unifying interpretation, as shown by cases, where the object must be repeated with each verb (compare the contrast between (18) and (19)).</Paragraph>
      <Paragraph position="9"> (I 8) Potkal jsem Petra, ale nepoznal jsem ho.</Paragraph>
      <Paragraph position="10"> 'I met Peter, but I didn't recognize him.' (19) ??Potkal, ale nepoznal jsem Petra. 'I met but didn't recognize Peter.' However, there are cases where the coordination has to be taken as sentential or at least at a higher level. As modal verbs are represented as gramatemes of the main verb, sentences as (20) have to be analysed as in (20'):  (20) Petr musel i cht61 pfijit.</Paragraph>
      <Paragraph position="11"> 'Peter had to and wanted to come.' (20') (Peter.ACT) (had-to-come.CO) and (wanted-to-come.ELID. CO)  Another case of a less strict adherence to the economy principle are sentences with double reading. Such a treatnaent then allows for a distinction to be made between the two readings, e.g. in (21), namely  between (a) 'villagers who are (both) old and sick' and (b) 'villagers who are sick (but not necessarily old) and villagers who ate old (but not necessarily sick)':  contribution is work in progress: the principles are set, but precisions are achieved as the annotators progress. There are many issues left for further investigation; let us mention just one of them, as an illustration. Both in (22) and in (23), the scope of 'mfilokdo' (few) is (at least on the preferential readings) wide ('there are few people such that...'); however, (24) is ambiguous: (i) there were few people such that gave P. a book and M. flowers, (ii) few people gave P. a book and few people gave M. flowers (not necessarily the same people). A similar ambiguity is exhibited by (25): (i) there was no such (single) person that would give P. a book and M. flowers, (ii) P. did not get a book and M. did not get flowers. However, there is no such ambiguity in (26).</Paragraph>
      <Paragraph position="12"> (22) Mfilokdo jf jablka a nejf ban,~ny. lit. 'Few eat apples and do-not-eat bananas' 'Few people eat apples and do not eat bananas.'  (23) Mfilokdo dal Petrovi knihu a Marii kvetiny ne.</Paragraph>
      <Paragraph position="13"> lil. 'Few gave Peter book and Mary flowers not' 'l~ew people gave Peter a book and did not give Mary flowcrs.' (24) Mfilokdo dal Petrovi knihu a Marii kv6tiny.</Paragraph>
      <Paragraph position="14"> lit. 'Few gave Peter book and Mary flowers' 'Few people gave Peter a book and Mary flowers.' (25) Nikdo nedal Petrovi knihu a Marii kv6tiny.</Paragraph>
      <Paragraph position="15"> lit. 'Nobody did-not-give Peter book and Mary flowers' 'Nobody gave Peter a book and Mary flowers.' (26) Petrovi nikdo nodal knihu a Marii kv6tiny.</Paragraph>
      <Paragraph position="16"> lit. 'Peter nobody did-not-give book and Mary flowers' 'To Peter, nobody gave a book and to Mary, flowers.' An explanation of this behaviour offers itself in terms of the interplay el! contrast in polarity and of topic-focus articulation: an element standing at Ille beginning of the sentence with a contras\[ in polarity carries a wide scopc ('few' in (22) and (23)); with sentences without such a contrast both wide scope and narrow scope interpretations are possible ('few' and 'nobody' in (24) and (25), respectively); (25) differs from (26) in that in the latter sentence, the element in contrastive topic is 'Peter' in the first conjunct and 'Mary' in the second, rather than 'nobody', and there is no contrast in polarity involved.</Paragraph>
      <Paragraph position="17"> The tagging scheme sketched in the previous sections offers only a single TGTS for the ambiguous structures instead of two, which is an undesirable result. However, if lhe explanation offered above is confl'onted with a larger amount of data and confirmed, lhe difference between the two interpretations could be captured either by means of a combination of tags for the restored nodes and for the topic-focus articulation or by diflerent structures for coordination: while example (22) supports the economical treatment of coordinate structures (the ACT modifying the coordination as whole), examples (24) through (26) seem to suggest that there may be cases where the other approach (sentential coordination with ellipsis) is more appropriate to capture the differences in meaning.</Paragraph>
      <Paragraph position="18"> Acknowledgement. The research reported on in this paper has bccn predominantly carried out within a project suplgorted by the Czech Grant Agency 405-96-K214 and in part by the Czech Ministry o17 Education VS 96-151.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML