<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3106"> <Title>Phrase-Based SMT with Shallow Tree-Phrases</Title> <Section position="3" start_page="39" end_page="39" type="intro"> <SectionTitle> 2 Tree-Phrases </SectionTitle> <Paragraph position="0"> We define a tree-phrase (TP) as a bilingual unit consisting of a fully lexicalized source treelet (TL) and a target phrase (EP), that is, the target words associated with the nodes of the treelet, in order. A treelet can be an arbitrary, fully lexicalized subtree of the parse tree associated with a source sentence. A phrase can be an arbitrary sequence of words. This includes the standard notion of phrase, popular in phrase-based SMT (Koehn et al., 2003; Vogel et al., 2003), as well as sequences of words that contain gaps (possibly of arbitrary size). In this study, we collected a repository of tree-phrases using a robust syntactic parser called SYNTEX (Bourigault and Fabre, 2000). SYNTEX identifies syntactic dependency relations between words.</Paragraph> <Paragraph position="1"> It takes as input a text processed by the TREETAGGER part-of-speech tagger. An example of the output SYNTEX produces for the source (French) sentence &quot;on a demandé des crédits fédéraux&quot; (request for federal funding) is presented in Figure 1.</Paragraph> <Paragraph position="2"> We parsed the source (French) part of our training bitext with SYNTEX (see Section 4.1). From this material, we extracted all dependency subtrees of depth 1 from the complete dependency trees found by SYNTEX. An elastic phrase is simply the list of tokens aligned with the words of the corresponding treelet, together with the respective offsets at which they were found in the target sentence (the first token of an elastic phrase always has an offset of 0).</Paragraph> <Paragraph position="3"> For instance, the two treelets in Figure 2 will be collected out of the parse tree in Figure 1. Note that the two words &quot;a&quot; and &quot;demandé&quot; (literally &quot;have&quot; and &quot;asked&quot;) from the original sentence have been merged together by SYNTEX to form a single token.</Paragraph> <Paragraph position="4"> These tokens are the ones we use in this study.</Paragraph> <Paragraph position="5"> with the first pair of structures listed in the example.</Paragraph> </Section> </Paper>