File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-2708_metho.xml
Size: 19,200 bytes
Last Modified: 2025-10-06 14:09:23
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2708"> <Title>Prague Czech-English Dependency Treebank Any Hopes for a Common Annotation Scheme?</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 English to Czech Translation of Penn Treebank </SectionTitle> <Paragraph position="0"> When starting the PCEDT project, we chose the latter of two possible strategies: either the parallel annotation of already existing parallel texts, or the translation and annotation of an existing syntactically annotated corpus. The choice of the Penn Treebank as the source corpus was also pragmatically motivated: firstly it is a widely recognized linguistic resource, and secondly the translators were native speakers of Czech, capable of high quality translation into their native language.</Paragraph> <Paragraph position="1"> The translators were asked to translate each English sentence as a single Czech sentence and to avoid unnecessary stylistic changes of translated sentences. The translations are being revised on two levels, linguistic and factual. 
About half of the Penn Treebank has been translated so far (currently 21,628 sentences); the project aims at translating the whole Wall Street Journal part of the Penn Treebank. For the purpose of quantitative evaluation methods, such as NIST or BLEU, for measuring the performance of translation systems, we selected a test set of 515 sentences and had them retranslated from Czech into English by four different translation offices, two of them from the Czech Republic and two from the U.S.A.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Transformation of Penn Treebank Phrase Trees into Dependency Structure </SectionTitle> <Paragraph position="0"> The transformation algorithm from a phrase-structure topology into a dependency one, similar to the transformations described by Xia and Palmer (2001), works as follows: Terminal nodes of the phrase tree are converted to nodes of the dependency tree.</Paragraph> <Paragraph position="1"> Dependencies between nodes are established recursively: The root node of the dependency tree transformed from the head constituent of a phrase becomes the governing node. The root nodes of the dependency trees transformed from the left and right siblings of the head constituent are attached as the left and right children (dependent nodes) of the governing node, respectively.</Paragraph> <Paragraph position="2"> Nodes representing traces are removed and their children are reattached to the parent of the trace.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Preprocessing of Penn Treebank </SectionTitle> <Paragraph position="0"> Several preprocessing steps preceded the transformation into both analytical and tectogrammatical representations. Marking of Heads in English: The concept of the head of a phrase is important during the transformation described above. 
For marking head constituents in each phrase, we used Jason Eisner's scripts.</Paragraph> <Paragraph position="1"> Lemmatization of English: Czech is an inflective language, rich in morphology; therefore lemmatization (assigning base forms) is indispensable in almost any linguistic application. Mostly for reasons of symmetry with the Czech data and compatibility with the dependency annotation scheme, the English part was automatically lemmatized by the morpha tool (Minnen et al., 2001) using the manually assigned POS tags of the Penn Treebank.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Unique Identification </SectionTitle> <Paragraph position="0"> For technical reasons, a unique identifier is assigned to each sentence and to each token of the Penn Treebank.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 English Analytical Dependency Trees </SectionTitle> <Paragraph position="0"> This section describes the automatic process of converting the Penn Treebank annotation into the analytical representation. The structural transformation works as described above. Because the handling of coordination in the PDT differs from both the Penn Treebank annotation style and the output of Jason Eisner's head-assigning scripts, in the case of a phrase containing a coordinating conjunction (CC), we consider the rightmost CC as the head. 
The treatment of apposition is a more difficult task, since there is no explicit annotation of this phenomenon in the Penn Treebank; constituents of a noun phrase enclosed in commas or other delimiters (and not containing CC) are considered to be in apposition and the rightmost delimiter becomes the head.</Paragraph> <Paragraph position="1"> The information from both the phrase tree and the dependency tree is used for the assignment of analytical functions: Penn Treebank function tag to analytical function mapping: some function tags of a phrase tree correspond to analytical functions in an analytical tree and can be mapped to them: SBJ → Sb, {DTV, LGS, BNF, TPC, CLR} → Obj, {ADV, DIR, EXT, LOC, MNR, PRP, TMP, PUT} → Adv.</Paragraph> <Paragraph position="2"> Assignment of analytical functions using local context of a node: for assigning analytical functions to the remaining nodes, we use rules looking at the current node, its parent and grandparent, taking into account the POS and the phrase marker of the constituent in the original phrase tree headed by the node. For example, one rule assigns the analytical function Atr to every determiner, the rule mPOS=MD|pPOS=VB|mAF=AuxV assigns the analytical function AuxV to a modal verb headed by a verb, etc. The attribute mPOS representing the POS of a node is obligatory for every rule. The rules are examined primarily in the order of the longest prefix of the POS of the given node and secondarily in the order in which they are listed in the rule file. 
The ordering of rules is important, since the first matching rule found assigns the analytical function and the search is finished.</Paragraph> <Paragraph position="3"> Specifics of the PDT and Penn Treebank annotation schemes, mainly the markup of coordinations, appositions, and prepositional phrases, are handled separately: Coordinations and appositions: the analytical function that was originally assigned to the head of a coordination or apposition is propagated to its child nodes by attaching the suffix Co or Ap to them, and the head node gets the analytical function Coord or Apos, respectively.</Paragraph> <Paragraph position="4"> Prepositional phrases: the analytical function originally assigned to the preposition node is propagated to its child and the preposition node is labeled AuxP. Sentences in the PDT annotation style always contain a root node labeled AuxS, which is the only node in the dependency tree that does not correspond to any terminal of the phrase tree; the root node is inserted above the original root. While in the Penn Treebank the final punctuation is a constituent of the sentence phrase, in the analytical tree it is moved under the technical sentence root node.</Paragraph> <Paragraph position="5"> Compare the phrase structure and the analytical representation of a sample sentence from the Penn Treebank in Figures 1 and 2.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 English Tectogrammatical Dependency Trees </SectionTitle> <Paragraph position="0"> The transformation of Penn Treebank phrase trees into tectogrammatical representation consists of a structural transformation, and an assignment of a tectogrammatical functor and a set of grammatemes to each node. At the beginning of the structural transformation, the initial dependency tree is created by a general transformation procedure as described above. 
However, functional (synsemantic) words, such as prepositions, punctuation marks, determiners, subordinating conjunctions, certain particles, auxiliary and modal verbs, are handled differently. They are marked as &quot;hidden&quot; and information about them is stored in special attributes of their governing nodes (if they were to head a phrase, the head of the other constituent became the governing node in the dependency tree).</Paragraph> <Paragraph position="1"> The well-formedness of a tectogrammatical tree structure requires the valency frames to be complete: apart from nodes that are realized on the surface, there are several types of &quot;restored&quot; nodes representing the non-realized members of valency frames (cf. the pro-drop property of Czech and verbal condensations using gerunds and infinitives both in Czech and English). For a partial reconstruction of such nodes, we can use traces, which allow us to establish coreferential links, or restore general participants in the valency frames.</Paragraph> <Paragraph position="2"> For the assignment of tectogrammatical functors, we can use rules taking into consideration POS tags (e.g. PRP → APP), function tags (JJ → RSTR, JJR → CPR, etc.) and lemma (&quot;not&quot; → RHEM, &quot;both&quot; → RSTR).</Paragraph> <Paragraph position="3"> Grammateme Assignment: morphological grammatemes (e.g. tense, degree of comparison) are assigned to each node of the tectogrammatical tree. The assignment of the morphological attributes is based on Penn Treebank tags and reflects basic morphological properties of the language. 
At the moment, there are no automatic tools for the assignment of syntactic grammatemes, which are designed to capture detailed information about deep syntactic structure.</Paragraph> <Paragraph position="4"> The whole procedure is described in detail in Kučerová and Žabokrtský (2002).</Paragraph> <Paragraph position="5"> In order to gain a &quot;gold standard&quot; annotation, 1,257 sentences have been annotated manually (the 515 sentences from the test set are among them). These data are assigned morphological grammatemes (the full set of values) and syntactic grammatemes, and the nodes are reordered according to topic-focus articulation (information structure).</Paragraph> <Paragraph position="6"> The automatic transformation procedure described above, when compared with manually annotated trees, yields about 6% wrongly attached dependencies and 18% wrongly assigned functors.</Paragraph> <Paragraph position="7"> See Figure 3 for the manually annotated tectogrammatical representation of the sample sentence.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Automatic Annotation of Czech </SectionTitle> <Paragraph position="0"> The Czech translations of the Penn Treebank were automatically tokenized and morphologically tagged; each word form was assigned a base form (lemma) by the tagging tools of Hajič and Hladká (1998).</Paragraph> <Paragraph position="1"> Czech analytical parsing consists of a statistical dependency parser for Czech, either the Collins parser (Collins et al., 1999) or the Charniak parser (Charniak, 1999), both adapted to dependency grammar, and a module for automatic analytical function assignment (Žabokrtský et al., 2002).</Paragraph> <Paragraph position="2"> When building the tectogrammatical structure, the analytical tree structure is converted into the tectogrammatical one. These transformations are described by linguistic rules (Böhmová, 2001). 
Then, tectogrammatical functors are assigned by a C4.5 classifier (Žabokrtský et al., 2002).</Paragraph> <Paragraph position="3"> The test set of 515 sentences (which have been retranslated into English) has also been manually annotated on the tectogrammatical level.</Paragraph> <Paragraph position="4"> See Figures 4 and 5 for the automatic analytical and manual tectogrammatical annotation of the Czech translation of the sample sentence.</Paragraph> <Paragraph position="5"> The manual annotation of 1,257 English sentences on the tectogrammatical level was, to our knowledge, the first attempt of its kind, and was based especially on the instructions for tectogrammatical annotation of Czech. During the process of annotation, we have encountered both phenomena that do not occur in Czech at all, and phenomena whose counterparts in Czech occur rarely and therefore are not handled thoroughly by the guidelines for tectogrammatical annotation designed for Czech. To mention just a few, among the former belong the annotation of articles, certain aspects of the system of verbal tenses, and phrasal verbs. A specimen of a roughly corresponding phenomenon occurring both in Czech and English is the gerund. It is a very common means of condensation in English, but its counterpart in Czech (usually called the transgressive) has fallen out of use and is nowadays considered rather obsolete.</Paragraph> <Paragraph position="6"> The guidelines for Czech require the transgressive to be annotated with the functor COMPL. The reason why it is highly problematic to apply them straightforwardly to the annotation of English as well is that the English gerund has a much wider range of functions than the Czech transgressive. The gerund can be seen as a means of condensing subordinate clauses with in principle adverbial meaning (as it is analyzed in the phrase-structure annotation of the Penn Treebank). 
Since the range of functors with adverbial meaning is much more fine-grained, we deem it inappropriate to mark the gerund clauses in such a simple way on the tectogrammatical level.</Paragraph> <Paragraph position="7"> From the point of view of machine translation, the gerund constructions pose considerable difficulties because of the many syntactic constructions suitable as their translations, corresponding to their varied syntactic functions. We present two examples illustrating the issues mentioned above. Each example consists of three figures, the first one presenting the Penn Treebank annotation of a (in the second case simplified) sentence from the Penn Treebank, the second one giving its tentative tectogrammatical representation (according to the guidelines for Czech applied to English), and the third one containing the tectogrammatical representation of its translation into Czech (see Figures 1, 3, 5, and Figures 6, 7, 8).</Paragraph> <Paragraph position="8"> Note that in neither of the two examples is the Czech transgressive used as the translation of the English gerund; a coordination structure is used instead.</Paragraph> <Paragraph position="9"> On the other hand, we have also encountered phenomena in English whose Penn Treebank style of annotation is insufficient for a successful conversion into dependency representation.</Paragraph> <Paragraph position="10"> In English, the usage of constructions with nominal premodification is very frequent, and the annotation of such noun phrases in the Penn Treebank is often flat, grouping together several constituents without reflecting finer syntactic and semantic relations among them (see Figure 9 for an example of such a noun phrase). 
In fact, the possible syntactic and especially semantic relations between the members of the noun phrase can be highly ambiguous, but when translating such a noun phrase into Czech, we are usually not able to preserve the ambiguity and are forced to resolve it by choosing one of the readings (see Figure 10).</Paragraph> <Paragraph position="11"> Sometimes we may even be forced to insert new words explicitly expressing the semantic relations within the nominal group. An example of an English noun phrase and the tectogrammatical representation of its Czech translation with an inserted word &quot;podnikající&quot; ('operating') can be found in Figures 11 and 12.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Other Resources Included in PCEDT </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Reader's Digest Parallel Corpus </SectionTitle> <Paragraph position="0"> The Reader's Digest parallel corpus contains raw text in 53,000 aligned segments in 450 articles from the Reader's Digest, years 1993-1996. The Czech part is a free translation of the English version. The final selection of data has been done manually, excluding articles whose translations significantly differ (in length, culture-specific facts, etc.). Parallel segments on the sentential level have been aligned by Dan Melamed's aligning tool (Melamed, 1996). The topology is 1-1 (81%), 0-1 or 1-0 (2%), 1-2 or 2-1 (15%), 2-2 (1%), and others (1%).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Dictionaries </SectionTitle> <Paragraph position="0"> The PCEDT also comprises a translation dictionary compiled from three different Czech-English manual dictionaries: two of them were downloaded from the Web and one was extracted from the Czech and English EuroWordNets. 
Entry-translation pairs were filtered and weighted, taking into account the reliability of the source dictionary, the frequencies of the translations in Czech and English monolingual corpora, and the correspondence of the Czech and English POS tags. Furthermore, by training a GIZA++ (Och and Ney, 2003) translation model on the training part of the PCEDT extended by the manual dictionaries, we obtained a probabilistic Czech-English dictionary, more sensitive to the domain of financial news specific to the Wall Street Journal.</Paragraph> <Paragraph position="1"> The resulting Czech-English probabilistic dictionary contains 46,150 entry-translation pairs in its lemmatized version and 496,673 pairs of word forms in the version where for each entry-translation pair all the corresponding word form pairs have been generated.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Tools </SectionTitle> <Paragraph position="0"> SMT Quick Run is a package of scripts and instructions for building a statistical machine translation system from the PCEDT or any other parallel corpus. The system uses GIZA++ models and the ISI ReWrite decoder (Germann et al., 2001).</Paragraph> <Paragraph position="1"> TrEd is a graphical editor and viewer of tree structures. Its modular architecture allows easy handling of diverse annotation schemes; it has been used as the principal annotation environment for the PDT and PCEDT.</Paragraph> <Paragraph position="2"> Netgraph is a multi-platform client-server application for browsing, querying and viewing analytical and tectogrammatical dependency trees, either over the Internet or locally.</Paragraph> </Section> </Section> </Paper>
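The recursive step of the phrase-tree-to-dependency transformation outlined in Section 3 can be illustrated with a minimal sketch. It assumes that the head child of every phrase has already been marked and that word forms are unique within the sentence. The names `Phrase` and `to_dependencies` are hypothetical and do not come from the PCEDT tools; trace removal and the special treatment of coordination and apposition are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Phrase:
    """Phrase-tree node: a leaf carries a word, an inner node carries children."""
    label: str
    word: Optional[str] = None
    children: List["Phrase"] = field(default_factory=list)
    head: int = 0  # index of the head child (assumed to be marked beforehand)


def to_dependencies(node: Phrase, edges: Dict[str, Optional[str]]) -> str:
    """Return the word rooting the dependency tree built from `node`.

    `edges` maps each word to its governing word (None for the root).
    Words are assumed to be unique; a real tool would use token identifiers.
    """
    if node.word is not None:
        # A terminal node becomes a node of the dependency tree.
        edges.setdefault(node.word, None)
        return node.word
    roots = [to_dependencies(child, edges) for child in node.children]
    governor = roots[node.head]  # root of the head constituent governs
    for i, root in enumerate(roots):
        if i != node.head:
            edges[root] = governor  # sibling roots become dependents
    return governor


# Illustrative fragment of the sample sentence, with hand-marked heads.
tree = Phrase("S", head=1, children=[
    Phrase("NP", head=1, children=[Phrase("DT", "An"), Phrase("NN", "earthquake")]),
    Phrase("VP", head=0, children=[
        Phrase("VBD", "struck"),
        Phrase("NP", children=[Phrase("NNP", "California")]),
    ]),
])
edges: Dict[str, Optional[str]] = {}
root = to_dependencies(tree, edges)
print(root, edges)
```

Because the tokens keep their surface order, the roots of the left and right siblings of the head end up as left and right dependents of the governing node without any extra bookkeeping.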