<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2717"> <Title>XML-based Phrase Alignment in Parallel Treebanks</Title> <Section position="3" start_page="0" end_page="93" type="metho"> <SectionTitle> 2 Building the treebanks </SectionTitle> <Paragraph position="0"> Our parallel treebank contains the first two chapters of Jostein Gaarder's novel &quot;Sofie's World&quot; with about 500 sentences. In addition it contains 500 sentences from economy texts (a quarterly report by a multinational company as well as part of a bank's annual report).</Paragraph> <Paragraph position="1"> In creating the parallel treebank, we first annotated the monolingual treebanks with the ANNOTATE treebank editor. It includes Thorsten Brants' statistical Part-of-Speech Tagger and Chunker. The chunker follows the TIGER annotation guidelines for German (Brants and Hansen, 2002), which yield flat phrase structure trees. This means, for instance, no unary nodes, no &quot;unnecessary&quot; NPs (noun phrases) within PPs (prepositional phrases) and no finite VPs (verb phrases).</Paragraph> <Paragraph position="2"> Using a flat tree structure for manual treebank annotation has two advantages for the human annotator: fewer annotation decisions, and a better overview of the trees. This comes at the price of the trees not being complete from a linguistic point of view. Moreover, flat syntax trees are problematic for node alignment in a parallel treebank. 
We prefer to have &quot;deep trees&quot; to be able to draw the alignment on as many levels as possible; in fact, the more detailed the sentence structure is, the more expressive our alignment can become.</Paragraph> <Paragraph position="3"> As an example, let us look at the work flow for the German-Swedish parallel treebank.</Paragraph> <Paragraph position="4"> We first annotated the German sentences semi-automatically in the flat manner, and we then automatically deepened the flat syntax trees (Samuelsson and Volk, 2004).</Paragraph> <Paragraph position="5"> We annotated the Swedish sentences by first tagging them with a Part-of-Speech tagger trained on SUC (the Stockholm-Umeå Corpus). Since we did not have a Swedish treebank to train a Swedish chunker, we used a trick to apply the German chunker to Swedish sentences. We mapped the Swedish Part-of-Speech tags in the Swedish sentences to the corresponding German tags. Since the German chunker works on these tags, it then suggested constituents for the Swedish sentences, assuming they were German sentences. These experiments and the resulting time gain were reported in Volk and Samuelsson (2004). Upon completion of the Swedish treebank with flat syntax trees, we applied the same deepening method as for German, and we then converted the Part-of-Speech labels back to the Swedish labels.</Paragraph> <Paragraph position="6"> Finally, we annotated the English sentences according to the Penn Treebank guidelines. 
We trained the PoS tagger and the chunker on the Penn Treebank and integrated them into ANNOTATE.</Paragraph> <Paragraph position="7"> The English guidelines lead to complete trees so that the deepening step is not needed.</Paragraph> </Section> <Section position="4" start_page="93" end_page="93" type="metho"> <SectionTitle> 3 XML Representation of the Trees </SectionTitle> <Paragraph position="0"> After finishing the monolingual treebanks with ANNOTATE, the trees were exported from the accompanying SQL database and converted into TIGER-XML. TIGER-XML is a line-based (i.e.</Paragraph> <Paragraph position="1"> not nested and thus database-friendly) representation for graph structures, which includes syntax trees with node labels, edge labels, multiple features on the word level and even crossing edges. In a TIGER-XML graph each leaf (= token) and each node (= linguistic constituent) has a unique identifier which is prefixed with the sentence number. Leaves are numbered from 1 to 499 and nodes from 500 upwards (under the plausible assumption that no sentence will ever have more than 499 tokens). As can be seen in the following example, node 500 in sentence 12 is of the category PP (prepositional phrase). The phrase consists of word number 4, which is the preposition in, plus node 502, which in turn is marked as an NP (noun phrase) consisting of the words 5 and 6. It should be noted that the id attribute in the token lines serves a dual purpose as identifier and order marker. This makes it possible to represent crossing branches.</Paragraph> <Paragraph position="3"> This means that the token identifiers and constituent identifiers are used as pointers to represent the nested tree structure. This example thus represents the upper tree in figure 1.</Paragraph> <Paragraph position="4"> One might wonder why tree nesting is not directly mapped into XML nesting. But the requirement that the representation format must support crossing edges rules out this option. 
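A minimal sketch of this pointer-based encoding, restricted to the PP in sentence 12 described above, might look as follows. The word forms, PoS tags, and edge labels are placeholders assumed for illustration; the root attribute points at the PP only because the rest of the sentence is left out, and tokens 1 to 3 are omitted.

```xml
<s id="s12">
  <graph root="s12_500">
    <terminals>
      <!-- hypothetical tokens; only words 4 to 6 are shown -->
      <t id="s12_4" word="in" pos="APPR" />
      <t id="s12_5" word="einem" pos="ART" />
      <t id="s12_6" word="Garten" pos="NN" />
    </terminals>
    <nonterminals>
      <!-- node 500: PP consisting of word 4 plus node 502 -->
      <nt id="s12_500" cat="PP">
        <edge label="--" idref="s12_4" />
        <edge label="--" idref="s12_502" />
      </nt>
      <!-- node 502: NP consisting of words 5 and 6 -->
      <nt id="s12_502" cat="NP">
        <edge label="--" idref="s12_5" />
        <edge label="--" idref="s12_6" />
      </nt>
    </nonterminals>
  </graph>
</s>
```

Because the tree is expressed through idref pointers rather than through element nesting, a constituent may point to tokens that are not adjacent, which is what makes crossing branches representable.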
TIGER-XML is a powerful representation format and is typically used with constituent symbols on the nodes and functional information on the edge labels. This constitutes a combination of constituent structure and dependency structure information.</Paragraph> </Section> <Section position="5" start_page="93" end_page="94" type="metho"> <SectionTitle> 4 XML Representation of the Alignment </SectionTitle> <Paragraph position="0"> Phrase alignment can be regarded as an additional layer of information on top of the syntax structure. We use the unique node identifiers for the phrase alignment across parallel trees. We also use an XML representation for storing the alignment. The alignment file first stores the names of the treebank files and assigns identifiers to them.</Paragraph> <Paragraph position="1"> Every single phrase alignment is then stored with the tag align. Thus the entry in the following example represents the alignment of node 505 in sentence 13 of language one (German) to node 506 in sentence 14 of language two (Swedish).</Paragraph> <Paragraph position="2"> This representation allows phrase alignments within m:n sentence alignments, which we have used in our project. The XML also allows m:n phrase alignments, which, however, we have not used for reasons of simplicity and clarity. Two nodes are aligned if the words which they span convey the same meaning and could serve as translation units.</Paragraph> <Paragraph position="3"> The alignment format allows alignments to be specified between an arbitrary number of nodes, for example nodes from three languages. It also includes an attribute type, which we currently use to distinguish between exact and approximate alignments.</Paragraph> </Section> <Section position="6" start_page="94" end_page="94" type="metho"> <SectionTitle> 5 Our Tree Alignment Tool </SectionTitle> <Paragraph position="0"> After finishing the monolingual trees we want to align them on the phrase level. 
For this purpose we have developed a &quot;TreeAligner&quot;. This program is a graphical user interface to insert (or correct) alignments between pairs of syntax trees. The TreeAligner stands in the line of tools such as I*Link (Ahrenberg et al., 2002) or Cairo (Smith and Jahr, 2000), but it is especially tailored to visualize and align full syntax trees.</Paragraph> <Paragraph position="1"> The TreeAligner requires three input files: one TIGER-XML file with the trees from language one, another TIGER-XML file with the trees from language two, and the alignment file as described above. The alignment file might initially be empty when we want to start manual alignment from scratch, or it might contain automatically computed alignments for correction. The TreeAligner displays tree pairs with the trees in mirror orientation (one top-up and one top-down). See figure 1 for an example. This has the advantage that the alignment lines cross fewer parts of the lower tree. The trees are displayed with node labels and greyed-out edge labels. The PoS labels are omitted in the display since they are not relevant for the task.</Paragraph> <Paragraph position="2"> Each alignment is displayed as a dotted line between one node (or word) from each tree. Clicking on a node (or a word) in one tree and dragging the mouse pointer to a node (or a word) in the other tree inserts an alignment line. Figure 2 shows an example of a tree pair with alignment lines. Currently the TreeAligner supports two types of alignment, which are used to indicate exact translation correspondence vs. approximate translation correspondence. However, our experiments indicate that eventually more alignment types will be needed to precisely represent different translation deviations.</Paragraph> <Paragraph position="3"> Often one tree needs to be aligned to two trees in the other language. We therefore provide the option to scroll the trees independently. 
For instance, if we have aligned only a part of tree 20 from language one to tree 18 of language two, we may scroll to tree 19 of language two in order to align the remaining parts of tree 20. The TreeAligner is designed as a stand-alone tool (i.e. it is not prepared for collaborative annotation). It stores every alignment in an XML file (in the format described above) as soon as the user moves to a new tree pair. It has been tested on parallel treebanks with several hundred trees each.</Paragraph> </Section> </Paper>