<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1062">
  <Title>Sydney, July 2006. © 2006 Association for Computational Linguistics. A DOM Tree Alignment Model for Mining Parallel Data from the Web</Title>
  <Section position="4" start_page="489" end_page="490" type="intro">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Parallel data available on the web have been an important knowledge source for machine translation. For example, Hong Kong Laws, an English-Chinese parallel corpus released by the Linguistic Data Consortium (LDC), was downloaded from the website of the Department of Justice of the Hong Kong Special Administrative Region.</Paragraph>
    <Paragraph position="1"> Recently, web mining systems have been built to automatically acquire parallel data from the web. Representative systems include PTMiner (Nie et al., 1999), STRAND (Resnik and Smith, 2003), BITS (Ma and Liberman, 1999), and PTI (Chen, Chau and Yeh, 2004). Given a bilingual website, these systems identify candidate parallel documents using pre-defined URL patterns. Content-based features are then employed to verify the candidates. In particular, HTML tag similarities have been exploited to verify parallelism between pages. However, these systems flatten the HTML tags into a string sequence instead of treating them as a hierarchical DOM tree. Tens of thousands of parallel documents have been acquired with accuracy over 90%.</Paragraph>
    <Paragraph position="2"> To support machine translation, parallel sentence pairs should be extracted from the parallel web documents. A number of techniques for aligning sentences in parallel corpora have been proposed.</Paragraph>
    <Paragraph position="3"> Gale &amp; Church (1991), Brown et al. (1991), and Wu (1994) used sentence length as the basic feature for alignment. Kay &amp; Roscheisen (1993) and Chen (1993) used lexical information for sentence alignment. Models combining length and lexical information were proposed by Zhao and Vogel (2002) and Moore (2002). Signal processing techniques have also been applied to sentence alignment (Church 1993; Fung &amp; McKeown 1994). Recently, much research attention has been paid to aligning sentences in comparable documents (Utiyama et al. 2003; Munteanu et al. 2004).</Paragraph>
    <Paragraph position="4"> The DOM tree alignment model is the key technique of our mining approach. Although, to our knowledge, this is the first work discussing DOM tree alignment, there is substantial research on syntactic tree alignment models for machine translation. For example, Wu (1997), Alshawi, Bangalore, and Douglas (2000), and Yamada and Knight (2001) have studied synchronous context-free grammars. This formalism requires isomorphic syntax trees for the source sentence and its translation. Shieber and Schabes (1990) present a synchronous tree adjoining grammar (STAG), which is able to align two syntactic trees at the linguistic minimal units. The synchronous tree substitution grammar (STSG) presented in (Hajic et al. 2004) is a simplified version of STAG that allows the tree substitution operation but prohibits tree adjunction.</Paragraph>
  </Section>
</Paper>