<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1062"> <Title>A DOM Tree Alignment Model for Mining Parallel Data from the Web</Title> <Section position="5" start_page="490" end_page="490" type="metho"> <SectionTitle> 3 A New Parallel Data Mining Scheme Supported by DOM Tree Alignment </SectionTitle> <Paragraph position="0"> Our new web parallel data mining scheme consists of the following steps: (1) Given a web site, the root page and the web pages directly linked from the root page are downloaded. Then, for each downloaded web page, all of its anchor texts (i.e. the hyperlinked words on the page) are compared with a list of predefined strings known to reflect translational equivalence among web pages (Nie et al. 1999). Such predefined trigger strings fall into two categories: (i) trigger words pointing to the English translation of a page, and (ii) trigger words pointing to the translation in the other language (e.g., Chinese). If both categories of trigger words are found, the web site is considered bilingual, and the candidate web page pairs are sent to Step 2 for parallelism verification.</Paragraph> <Paragraph position="1"> (2) Given a pair of plausibly parallel web pages, a verification module is called to determine whether the page pair is truly translationally equivalent.</Paragraph> <Paragraph position="2"> (3) For each verified pair of parallel web pages, a DOM tree alignment model is called to extract parallel text chunks and hyperlinks. (4) Sentence alignment is performed on each pair of parallel text chunks, and the resulting parallel sentences are saved in an output file.</Paragraph> <Paragraph position="3"> (5) For each pair of parallel hyperlinks, the corresponding pair of web pages is downloaded and sent to Step 2 for parallelism verification. The mining process stops when no more parallel hyperlinks are found.</Paragraph> <Paragraph position="4"> Our new mining scheme is iterative in nature. It fully exploits the information contained in the parallel data already found and uses it to pinpoint the locations holding more parallel data. This approach is based on our observation that parallel pages share similar structures holding parallel content, and that parallel hyperlinks refer to new parallel pages.</Paragraph> <Paragraph position="5"> By exploiting both the HTML tag similarity and the content-based translational equivalences, the DOM tree alignment model extracts parallel text chunks. By working on these parallel text chunks instead of the text of the whole web page, sentence alignment accuracy is improved by a large margin.</Paragraph> <Paragraph position="6"> In the next three sections, the three component techniques are introduced: the DOM tree alignment model, the sentence alignment model, and the candidate web page pair verification model.</Paragraph> </Section> <Section position="6" start_page="490" end_page="492" type="metho"> <SectionTitle> 4 DOM Tree Alignment Model </SectionTitle> <Paragraph position="0"> The Document Object Model (DOM) is an application programming interface for valid HTML documents. Using DOM, the logical structure of an HTML document is represented as a tree in which each node belongs to one of a set of pre-defined node types (e.g. Document, DocumentType, Element, Text, Comment, ProcessingInstruction).</Paragraph> <Paragraph position="1"> Among all these node types, the ones most relevant to our purpose are Element nodes (corresponding to HTML tags) and Text nodes (corresponding to the text content).</Paragraph>
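For concreteness, the following is a minimal Python sketch, not the authors' implementation, of how such a simplified document tree can be built: it keeps only Element and Text nodes and already folds in the simplifications described in the next paragraph, namely representing ALT attributes as text and merging a text leaf into its parent element. The DocNode and DocTreeBuilder names, the set of void tags, and the use of the standard-library HTMLParser are our own assumptions; a production system would likely rely on a more forgiving HTML parser such as lxml.html.

    from html.parser import HTMLParser

    class DocNode:
        """One node of the simplified document tree: an HTML tag plus merged text."""
        def __init__(self, tag, text=""):
            self.tag = tag          # N.l : the HTML tag label
            self.text = text        # N.t : text merged from text leaves or ALT attributes
            self.children = []      # N.C : child nodes in document order

    class DocTreeBuilder(HTMLParser):
        # Assumption: void tags that take no children and are not pushed on the stack.
        VOID = {"img", "br", "hr", "meta", "link", "input"}

        def __init__(self):
            super().__init__()
            self.root = DocNode("#root")
            self.stack = [self.root]

        def handle_starttag(self, tag, attrs):
            node = DocNode(tag)
            alt = dict(attrs).get("alt")
            if alt:                              # modification (ii): ALT becomes node text
                node.text = alt.strip()
            self.stack[-1].children.append(node)
            if tag not in self.VOID:
                self.stack.append(node)

        def handle_endtag(self, tag):
            if tag not in self.VOID and len(self.stack) > 1:
                self.stack.pop()

        def handle_data(self, data):
            data = " ".join(data.split())
            if data:                             # modification (iii): merge text into the parent element
                parent = self.stack[-1]
                parent.text = (parent.text + " " + data).strip()

    def build_doc_tree(html_source):
        """Parse an HTML string into the simplified Element/Text document tree."""
        builder = DocTreeBuilder()
        builder.feed(html_source)
        builder.close()
        return builder.root

For example, build_doc_tree('<p>Hello <img alt="logo"></p>') yields a single p node whose text is "Hello" and whose only child is an img node carrying the ALT text "logo".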
<Paragraph position="2"> To simplify the description of the alignment model, minor modifications of the standard DOM tree are made: (i) Only the Element nodes and Text nodes are kept in our document tree model. (ii) The ALT attribute is represented as a Text node in our document tree model. The ALT text is the textual alternative shown when an image cannot be displayed, and is therefore helpful for aligning images and hyperlinks. (iii) A Text node (which must be a leaf) and its parent Element node are combined into one node, to make the representation of the alignment model more concise. The above three modifications are exemplified in Fig. 1.</Paragraph> <Paragraph position="3"> Fig. 1 Difference between Standard DOM and Our Document Tree</Paragraph> <Paragraph position="4"> Despite these minor differences, our document tree is still referred to as a DOM tree throughout this paper.</Paragraph> <Section position="1" start_page="491" end_page="492" type="sub_section"> <SectionTitle> 4.1 DOM Tree Alignment </SectionTitle> <Paragraph position="0"> Similar to STSG, our DOM tree alignment model supports node deletion, insertion and substitution.</Paragraph> <Paragraph position="1"> In addition, both STSG and our DOM tree alignment model define the alignment as a hierarchy-preserving process, i.e. if node A is aligned with node B, then the children of A are either deleted or aligned with the children of B.</Paragraph> <Paragraph position="2"> But two major differences exist between STSG and our DOM tree alignment model: (i) our DOM tree alignment model requires the alignment to preserve the sequential order of siblings, i.e. if node A is aligned with node B, then the sibling nodes following A have to be either deleted or aligned with the sibling nodes following B; (ii) Hajic et al. (2004) present STSG in the context of language generation, while we search for the best alignment on the condition that both trees are given.</Paragraph> <Paragraph position="3"> To facilitate the presentation of the tree alignment model, the following symbols are introduced. Given an HTML document D, $T^D$ refers to the corresponding DOM tree; $N^D_i$ refers to the i-th node of $T^D$ (nodes are indexed in breadth-first order), and $T^D_i$ refers to the sub-tree rooted at $N^D_i$, so $N^D_1$ refers to the root of $T^D$ and $T^D = T^D_1$; $T^D_{[i,j]}$ refers to the forest consisting of the sub-trees rooted at the nodes $N^D_i$ through $N^D_j$.</Paragraph> <Paragraph position="4"> $N^D_i.t$ refers to the text of node $N^D_i$; $N^D_i.l$ refers to the HTML tag of $N^D_i$; $N^D_i.C_j$ refers to the j-th child of $N^D_i$; $N^D_i.C_{[m,n]}$ refers to the consecutive sequence of $N^D_i$'s children from $N^D_i.C_m$ to $N^D_i.C_n$; the sub-tree rooted at $N^D_i.C_j$ is written $N^D_i.TC_j$, and the forest rooted at $N^D_i.C_{[m,n]}$ is written $N^D_i.TC_{[m,n]}$. Finally, NULL refers to the empty node introduced for node deletion.</Paragraph> <Paragraph position="5"> To accommodate the hierarchical structure of the DOM tree, two different translation probabilities are defined: $\Pr(T^F_l \mid T^E_i, A)$, the probability of translating the sub-tree $T^E_i$ into the sub-tree $T^F_l$, and $\Pr(T^F_{[m,n]} \mid T^E_{[i,j]}, A)$, the probability of translating the forest $T^E_{[i,j]}$ into the forest $T^F_{[m,n]}$, both based on the alignment A.</Paragraph>
<Paragraph position="6"> The tree alignment A is defined as a mapping from target nodes onto source nodes or the null node.</Paragraph> <Paragraph position="7"> Given two HTML documents F (in French) and E (in English), the tree alignment task is defined as searching for the alignment that maximizes the following probability: $\hat{A} = \operatorname{argmax}_A \Pr(T^F \mid T^E, A)\,\Pr(A \mid T^E)$, where $\Pr(A \mid T^E)$ represents the prior knowledge of the alignment configurations.</Paragraph> <Paragraph position="8"> By introducing $p_d$, which refers to the probability of a source or target node deletion occurring in an alignment configuration, the alignment prior $\Pr(A \mid T^E)$ is assumed to take the following binomial form: $\Pr(A \mid T^E) = (1 - p_d)^{L}\, p_d^{M}$, where L is the count of non-empty alignments in A, and M is the count of source and target node deletions in A.</Paragraph> <Paragraph position="9"> As to $\Pr(T^F \mid T^E, A)$, we can estimate it as $\Pr(T^F \mid T^E, A) = \Pr(T^F_1 \mid T^E_1, A)$, and $\Pr(T^F_l \mid T^E_i, A)$ can be calculated recursively depending on the alignment configuration of A: (1) If $N^F_l$ is aligned with $N^E_i$, then $\Pr(T^F_l \mid T^E_i, A) = \Pr(N^F_l \mid N^E_i)\,\Pr(N^F_l.TC_{[1,K]} \mid N^E_i.TC_{[1,K']}, A)$, where K and K' are the degrees of $N^F_l$ and $N^E_i$. (2) If $N^F_l$ is deleted, and the children of $N^F_l$ are aligned with $T^E_i$, then $\Pr(T^F_l \mid T^E_i, A) = p_d\,\Pr(N^F_l.TC_{[1,K]} \mid T^E_i, A)$, where K is the degree of $N^F_l$. (3) If $N^E_i$ is deleted, and $T^F_l$ is aligned with the children of $N^E_i$, then $\Pr(T^F_l \mid T^E_i, A) = p_d\,\Pr(T^F_l \mid N^E_i.TC_{[1,K]}, A)$, where K is the degree of $N^E_i$.</Paragraph> <Paragraph position="10"> To complete the alignment model, the forest translation probability $\Pr(T^F_{[m,n]} \mid T^E_{[i,j]}, A)$ is to be estimated. As mentioned before, only the alignment configurations with unchanged node sequential order are considered valid. So $\Pr(T^F_{[m,n]} \mid T^E_{[i,j]}, A)$ is estimated recursively according to five alignment configurations of A. For instance, (4) if $T^F_m$ is aligned with $T^E_i$ and $T^F_{[m+1,n]}$ is aligned with $T^E_{[i+1,j]}$, then $\Pr(T^F_{[m,n]} \mid T^E_{[i,j]}, A) = \Pr(T^F_m \mid T^E_i, A)\,\Pr(T^F_{[m+1,n]} \mid T^E_{[i+1,j]}, A)$; the remaining configurations handle, in the same order-preserving way, the deletion of the leading sub-tree on either side and the alignment of a leading sub-tree against the children $N^E_i.TC_{[1,K]}$ of a deleted leading node, where K is the degree of $N^E_i$.</Paragraph> <Paragraph position="11"> Finally, the node translation probability is modeled as $\Pr(N^F_l \mid N^E_i) = \Pr(N^F_l.l \mid N^E_i.l)\,\Pr(N^F_l.t \mid N^E_i.t)$, i.e. the product of a tag mapping probability and a text translation probability, and the text translation probability $\Pr(t^F \mid t^E)$ is modeled using IBM Model I (Brown et al. 1993).</Paragraph> </Section> <Section position="2" start_page="492" end_page="492" type="sub_section"> <SectionTitle> 4.2 Parameter Estimation Using Expectation-Maximization </SectionTitle> <Paragraph position="0"> Our tree alignment model involves three categories of parameters: the text translation probability $\Pr(t^F \mid t^E)$, the tag mapping probability $\Pr(l \mid l')$, and the node deletion probability $p_d$.</Paragraph> <Paragraph position="1"> Conventional parallel data released by LDC are used to train IBM Model I for estimating the text translation probability $\Pr(t^F \mid t^E)$.</Paragraph> <Paragraph position="2"> One way to estimate $\Pr(l \mid l')$ and $p_d$ is to manually align nodes between parallel DOM trees and use the result as a training corpus for maximum likelihood estimation. However, this is a very time-consuming and error-prone procedure. In this paper, the inside-outside algorithm presented in (Lari and Young, 1990) is instead extended to train the parameters $\Pr(l \mid l')$ and $p_d$ by optimally fitting the existing parallel DOM trees.</Paragraph> </Section> <Section position="3" start_page="492" end_page="492" type="sub_section"> <SectionTitle> 4.3 Dynamic Programming for Decoding </SectionTitle> <Paragraph position="0"> It is observed that if two trees are optimally aligned, the alignments of their sub-trees must be optimal as well. In the decoding process, dynamic programming can therefore be applied to find the optimal tree alignment from the optimal alignments of the sub-trees, in a bottom-up manner.</Paragraph>
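As an illustration of this bottom-up dynamic programming, here is a hedged Python sketch. It is our own simplified reconstruction rather than the paper's pseudo-code: node_score is a toy stand-in for the node translation probability Pr(N_F | N_E), which the paper models with a tag mapping probability and IBM Model I, LOG_P_DEL is an assumed value for log p_d, unaligned sub-trees are charged a deletion cost per node, and the Node class simply mirrors the DocNode structure sketched earlier.

    import math
    from dataclasses import dataclass, field

    @dataclass
    class Node:                 # same shape as the DocNode sketched earlier
        tag: str
        text: str = ""
        children: list = field(default_factory=list)

    LOG_P_DEL = math.log(0.1)   # assumed node deletion probability p_d

    def node_score(f, e):
        """Toy stand-in for log Pr(N_F | N_E): tag match plus crude text overlap."""
        tag_part = 0.0 if f.tag == e.tag else math.log(0.01)
        ftoks, etoks = set(f.text.lower().split()), set(e.text.lower().split())
        overlap = len(ftoks & etoks) / max(1, len(ftoks | etoks))
        return tag_part + math.log(0.01 + 0.99 * overlap)

    def tree_size(t):
        return 1 + sum(tree_size(c) for c in t.children)

    def delete_score(trees):
        """Cost of leaving whole sub-trees unaligned (charged per node)."""
        return LOG_P_DEL * sum(tree_size(t) for t in trees)

    def align_trees(f, e, memo=None):
        """Best log-score for aligning the sub-tree rooted at f with the one rooted at e."""
        memo = {} if memo is None else memo
        key = (id(f), id(e))
        if key not in memo:
            memo[key] = max(
                # f aligned with e: score the roots, then align the child forests
                node_score(f, e) + align_forests(tuple(f.children), tuple(e.children), memo),
                # f deleted: its children compete against the whole sub-tree e
                LOG_P_DEL + align_forests(tuple(f.children), (e,), memo),
                # e deleted: f competes against e's children
                LOG_P_DEL + align_forests((f,), tuple(e.children), memo),
            )
        return memo[key]

    def align_forests(fs, es, memo):
        """Order-preserving alignment of two forests (sequences of sibling sub-trees)."""
        key = (tuple(map(id, fs)), tuple(map(id, es)))
        if key not in memo:
            if not fs or not es:
                score = delete_score(fs) + delete_score(es)
            else:
                score = max(
                    align_trees(fs[0], es[0], memo) + align_forests(fs[1:], es[1:], memo),
                    delete_score(fs[:1]) + align_forests(fs[1:], es, memo),
                    delete_score(es[:1]) + align_forests(fs, es[1:], memo),
                )
            memo[key] = score
        return memo[key]

Memoizing on sub-tree and forest identities is what turns the recursion into the bottom-up dynamic program described above; recovering the alignment itself additionally requires recording, at each max, which configuration won, and reading back-pointers off after the top-level call align_trees(root_F, root_E).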
<Paragraph position="1"> The decoding algorithm enumerates sub-tree pairs bottom up (a schematic version is sketched above): for each pair of sub-trees $T^F_l$ and $T^E_i$, the best alignment between the child forests $N^F_l.TC_{[1,K]}$ and $N^E_i.TC_{[1,K']}$ is computed from the already-computed best alignments of smaller sub-tree pairs. The cost of the algorithm is governed by the sizes and the degrees of the two trees, where the degree of a tree is defined as the largest degree of its nodes.</Paragraph> </Section> </Section> <Section position="7" start_page="492" end_page="492" type="metho"> <SectionTitle> 5 Aligning Sentences Using the Tree Alignment Model </SectionTitle> <Paragraph position="0"> To exploit the HTML structure similarities between parallel web documents, a cascaded approach is used in our sentence aligner implementation.</Paragraph> <Paragraph position="1"> First, text chunks associated with DOM tree nodes are aligned using the DOM tree alignment model. Then, for each pair of parallel text chunks, the sentence aligner described in (Zhao et al. 2002), which combines IBM Model I and the length model of (Gale and Church 1991) under a maximum likelihood criterion, is used to align parallel sentences.</Paragraph> </Section> <Section position="8" start_page="492" end_page="493" type="metho"> <SectionTitle> 6 Web Document Pair Verification Model </SectionTitle> <Paragraph position="0"> To verify whether a candidate web document pair is truly parallel, a binary maximum entropy classifier is used.</Paragraph> <Paragraph position="1"> Following (Nie et al. 1999) and (Resnik and Smith, 2003), three features are used: (i) file length ratio; (ii) HTML tag similarity; (iii) sentence alignment score.</Paragraph> <Paragraph position="2"> The HTML tag similarity feature is computed as follows: all of the HTML tags of a given web page are extracted and concatenated into a string. Then, a minimum edit distance between the two tag strings associated with the candidate pair is computed, and the HTML tag similarity score is defined as the ratio of the number of match operations to the total number of operations.</Paragraph> <Paragraph position="3"> The sentence alignment score is defined as the ratio of the number of aligned sentences to the total number of sentences in both files.</Paragraph> <Paragraph position="4"> Using these three features, the maximum entropy model is trained on 1,000 pairs of web pages manually labeled as parallel or non-parallel. The Iterative Scaling algorithm (Della Pietra, Della Pietra and Lafferty 1995) is used for the training.</Paragraph> </Section>
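To illustrate the HTML tag similarity feature just described, the sketch below extracts the tag sequence of each page and computes the ratio of match operations to the total number of operations along one minimum edit-distance alignment of the two sequences. This is a hedged illustration under our own assumptions: the TagExtractor, tag_sequence and tag_similarity names are ours, the paper does not specify the edit costs, so unit costs for insertion, deletion and substitution are assumed, and ties between optimal edit scripts are broken arbitrarily.

    from html.parser import HTMLParser

    class TagExtractor(HTMLParser):
        """Collects the HTML tags of a page in document order."""
        def __init__(self):
            super().__init__()
            self.tags = []
        def handle_starttag(self, tag, attrs):
            self.tags.append(tag)

    def tag_sequence(html_source):
        extractor = TagExtractor()
        extractor.feed(html_source)
        return extractor.tags

    def tag_similarity(a, b):
        """Ratio of match operations to total operations in a minimum edit script."""
        n, m = len(a), len(b)
        # Standard edit-distance table with unit costs (assumed).
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        # Backtrack one optimal edit script, counting match vs. total operations.
        i, j, matches, ops = n, m, 0, 0
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and
                    d[i][j] == d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)):
                if a[i - 1] == b[j - 1]:
                    matches += 1
                i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                i -= 1
            else:
                j -= 1
            ops += 1
        return matches / ops if ops else 1.0

    # Example (hypothetical inputs): score = tag_similarity(tag_sequence(html_en), tag_sequence(html_zh))

Together with the file length ratio and the sentence alignment score, such a value would be fed to the maximum entropy classifier described above.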
<Section position="9" start_page="493" end_page="494" type="metho"> <SectionTitle> 7 Experimental Results </SectionTitle> <Paragraph position="0"> The DOM tree alignment based mining system is used to acquire English-Chinese parallel data from the web. The mining procedure is initiated by acquiring a list of Chinese websites.</Paragraph> <Paragraph position="1"> We have downloaded about 300,000 URLs of Chinese websites from the web directories at cn.yahoo.com, hk.yahoo.com and tw.yahoo.com.</Paragraph> <Paragraph position="2"> Each website is sent to the mining system for English-Chinese parallel data acquisition. To ensure that the whole mining experiment finished on schedule, we stipulated that at most 10 hours be spent on mining each website. In total, 11,000 English-Chinese websites are discovered, from which 63,214 pairs of English-Chinese parallel web documents are mined. After sentence alignment, 1,069,423 pairs of English-Chinese parallel sentences are extracted in total.</Paragraph> <Paragraph position="3"> In order to compare system performance, 100 English-Chinese bilingual websites are also mined using the URL pattern-based mining scheme. Following (Nie et al. 1999; Ma and Liberman 1999; Chen, Chau and Yeh 2004), URL pattern-based mining consists of three steps: (i) host crawling for URL collection; (ii) candidate pair identification by pre-defined URL pattern matching; (iii) candidate pair verification. Based on these mining results, the quality of the mined data, the mining coverage and the mining efficiency are measured.</Paragraph> <Paragraph position="4"> First, we benchmarked the precision of the mined parallel documents. 3,000 pairs of English-Chinese candidate documents are randomly selected from the output of each mining system and are reviewed by human annotators. The document-level precision is shown in Table 1.</Paragraph> <Paragraph position="5"> The document-level mining precision solely depends on the candidate document pair verification module. The verification modules of both mining systems use the same features, and the only difference is that in the new mining system the sentence alignment score is computed with DOM tree alignment support. So the 3.7% improvement in document-level precision indirectly confirms the enhancement of sentence alignment.</Paragraph> <Paragraph position="6"> Secondly, the accuracy of the sentence alignment model is benchmarked as follows: 150 English-Chinese parallel document pairs are randomly taken from our mining results. All parallel sentence pairs in these document pairs are manually annotated by two annotators with cross-validation. We compared sentence alignment accuracy with and without DOM tree alignment support. Without tree alignment support, all the text in the web pages is extracted and sent to the sentence aligner for alignment. The benchmarks are shown in Table 2.</Paragraph> <Paragraph position="7"> With the DOM tree alignment support, the sentence alignment accuracy is greatly improved, by 7% in both precision and recall. We also observed that recall is lower than precision. This is because web pages tend to contain many short sentences (one or two words only) whose alignment is hard to identify due to the lack of content information.</Paragraph> <Paragraph position="8"> Although Table 2 benchmarks the accuracy of the sentence aligner, the quality of the final sentence pair output depends on many other modules as well, e.g. the document-level parallelism verification, the sentence breaker, the Chinese word breaker, etc. To further measure the quality of the mined data, 2,000 sentence pairs are randomly picked from the final output and manually classified into three categories: (i) exactly parallel; (ii) roughly parallel, i.e. two parallel sentences with missing words or erroneous additions; (iii) not parallel. Two annotators are assigned to this task with cross-validation. As shown in Table 3, 93.5% of the output sentence pairs are either exactly or roughly parallel.</Paragraph> <Paragraph position="9"> As we know, the absolute recall of a mining system is hard to estimate because it is impractical to evaluate all the parallel data held by a bilingual website. 
Instead, we compare the mining coverage and efficiency of the two systems.</Paragraph> <Paragraph position="10"> Although it downloads less data, the DOM tree based mining scheme increases the parallel data acquisition throughput by 32%. Furthermore, the ratio of downloaded page count per parallel pair is 2.26, which means the bandwidth usage is almost optimal.</Paragraph> <Paragraph position="11"> Another interesting topic is the complementarity between the two mining systems. As reported in Table 5, 1,797 pairs of parallel documents mined by the new scheme are not covered by the URL pattern-based scheme. So if both systems are used, the throughput can be further increased by 41%.</Paragraph> </Section> <Section position="10" start_page="494" end_page="494" type="metho"> <SectionTitle> 8 Discussion and Conclusion </SectionTitle> <Paragraph position="0"> Mining parallel data from the web is a promising way to overcome the knowledge bottleneck faced by machine translation. To build a practical mining system, three research issues should be fully studied: (i) the quality of the mined data, (ii) the mining coverage, and (iii) the mining speed.</Paragraph> <Paragraph position="1"> Exploiting DOM tree similarities helps with all three issues.</Paragraph> <Paragraph position="2"> Motivated by this observation, this paper presents a new web mining scheme for parallel data acquisition. A DOM tree alignment model is proposed to identify translationally equivalent text chunks and hyperlinks between two HTML documents. Parallel hyperlinks are used to pinpoint new parallel data and make parallel data mining a recursive process. Parallel text chunks are fed into the sentence aligner to extract parallel sentences.</Paragraph> <Paragraph position="3"> Benchmarks show that the sentence aligner supported by DOM tree alignment achieves a performance enhancement of 7% in both precision and recall. Besides, the new mining scheme reduces the bandwidth cost by 8 to 9 times on average compared with the URL pattern-based mining scheme. In addition, the new mining scheme is more general and reliable, and is able to mine more data. Using the new mining scheme alone, the mining throughput is increased by 32%, and when it is combined with the URL pattern-based scheme, the mining throughput is increased by 41%.</Paragraph> </Section> </Paper>