<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1010">
  <Title>Structure Alignment Using Bilingual Chunking</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Structure Alignment Using Bilingual Chunking
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Principle
</SectionTitle>
      <Paragraph position="0"> The chunks, which we will use, are extracted from the Treebank. When converting a tree to the chunk sequence, the chunk types are based on the syntactic category part of the bracket label. Roughly, a chunk contains everything to the left of and including the syntactic head of the constituent of the same name. Besides the head, a chunk also contains pre-modifiers, but no post-modifiers or arguments (Erik. 2000).</Paragraph>
      <Paragraph position="1"> Using chunk as the alignment structure, we can get around the problems such as PP attachment, structure mismatching across languages.</Paragraph>
      <Paragraph position="2"> Therefore, we can get high chunking accuracy.</Paragraph>
      <Paragraph position="3"> Using bilingual chunking, we can get both high chunking accuracy and high chunk alignment accuracy by making the SL chunking process and the TL chunking process constrain and improve each other.</Paragraph>
      <Paragraph position="4"> Our 'bilingual chunking' model for structure alignment comprises three integrated components: chunking models of both languages, and the crossing constraint; it uses chunk as the structure. (See Fig. 1) The crossing constraint requests a chunk in one language only correspond to at most one chunk in the other language. For instance, in Fig. 2 (the dashed lines represent the word alignments; the brackets indicate the chunk boundaries), the phrase &amp;quot;the first man&amp;quot; is a monolingual chunk, it, however, should be divided into &amp;quot;the first&amp;quot; and &amp;quot;man&amp;quot; to satisfy the crossing constraint. By  [the first ][man ][who][would fly across][ the channel] [a2a4a3 a5 ] [a6a8a7 ] [a9a8a5 a10a12a11 ] [a13 ] [a14 ] Fig. 2 the crossing constraint using crossing constraint, the illegal chunk candidates can be removed in the chunking process.</Paragraph>
      <Paragraph position="5"> The chunking models for both languages work successively under the crossing constraint.</Paragraph>
      <Paragraph position="6"> Usually, chunking involves two steps: (1) POS tagging, and (2) chunking. To alleviate effectively the influence of POS tagging deficiency to the chunking result, we integrate the two steps with a unified model for optimal solution. This integration strategy has been proven to be effective for base NP identification (Xun, Huang &amp; Zhou, 2001).</Paragraph>
      <Paragraph position="7"> Consequently, our model works in three successive steps: (1) word alignment between SL and TL sentences; (2) source language chunking; (3) target language chunking. Both (2) and (3) should work under the supervision of crossing constraints.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Crossing Constraint
</SectionTitle>
      <Paragraph position="0"> According to (Wu, 1997), crossing constraint can be defined in the following.</Paragraph>
      <Paragraph position="1"> For non-recursive phrases: Suppose two words w1 and w2 in language-1 correspond to two words v1 and v2 in language-2, respectively, and w1 and w2 belong to the same phrase of language-1. Then v1 and v2 must also belong to the same phrase of language-2.</Paragraph>
      <Paragraph position="2"> We can benefit from applying crossing constraint in the following three aspects: * Consistent chunking in the view of alignment. For example, in Fig. 2, &amp;quot;the first man&amp;quot; should be divided into &amp;quot;the first&amp;quot; and &amp;quot;man&amp;quot; for the consistency with the Chinese chunks &amp;quot;a15a17a16 a18 &amp;quot; and &amp;quot;a19 &amp;quot;, respectively.</Paragraph>
      <Paragraph position="3"> * Searching space reduction. The chunking space is reduced by ruling out those illegal fragments like &amp;quot;the first man&amp;quot;; and the alignment space is reduced by confining those legal fragments like &amp;quot;the first&amp;quot; only to correspond to the Chinese fragments &amp;quot;a15a17a16 &amp;quot; or &amp;quot;a15a17a16 a18 &amp;quot; based on word alignment anchors. * Time synchronous algorithms for structure alignment. Time synchronous algorithms cannot be used due to word permutation problem before.</Paragraph>
      <Paragraph position="4"> While under the crossing constraint, these algorithms (for example, dynamic programming) can be used for both chunking and alignment.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Mathematical Formulation
</SectionTitle>
      <Paragraph position="0"> Given an English sentence leee wwwe ,...,21= , its POS tag sequence is denoted by leeee tttT ,...,21= , where l is the sentence length. A sequence of chunks can be represented as:</Paragraph>
      <Paragraph position="2"> Where, ien denotes the thi chunk type of e , and 'l is the number of chunks in e .</Paragraph>
      <Paragraph position="3"> Similarly, for a Chinese sentence c</Paragraph>
      <Paragraph position="5"> Where, m denotes the word number of c, 'm is the number of Chinese chunks in c.</Paragraph>
      <Paragraph position="6"> Let bmi denote the thi positional tag, bmi can be begin of a chunk, inside a chunk, or outside any chunk.</Paragraph>
      <Paragraph position="7"> The most probable result is expressed as</Paragraph>
      <Paragraph position="9"> Where, A is the alignment between eB and cB .</Paragraph>
      <Paragraph position="10"> a refers to the crossing constraint. Equation (1) can be further derived into  In this formula, ),|( aeTp e aims to determine the best POS tag sequence for e .</Paragraph>
      <Paragraph position="11"> ),|,( aeTBp ee aims to determine the best chunk sequence from them. aeTBTcp eec ,,,,|( )aims to decide the best POS tag sequence for c based on the English POS sequence. ),,,,,|( aeTBTcBp eecc aims to decide the best Chinese chunking result based on the Chinese POS sequence and the English chunk sequence.</Paragraph>
      <Paragraph position="12"> Note that 1),,,,,,|( =aeTBTcBAp eecc In practice, in order to reduce the search space, only N-best results of each step are retained.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Determining the N-Best English POS
Sequences
</SectionTitle>
      <Paragraph position="0"> The HMM based POS tagging model (Kupiec 1992) with the trigram assumption is used to provide possible POS candidates for each word in terms of the N-best lattice.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Determining the N-best English
Chunking Result
</SectionTitle>
      <Paragraph position="0"> This step is to find the best chunk sequence based on the N-best POS lattice by decomposing the chunking model into two sub-models (1) inter-chunk model; (2) intra- chunk model.</Paragraph>
      <Paragraph position="1"> From equation (3), based on Bayes' rule, then</Paragraph>
      <Paragraph position="3"> Here, the crossing constraint a will remove those illegal candidates.</Paragraph>
      <Paragraph position="4"> The second part can be further derived based on two assumptions: (1) bigram for the English POS transition inside a chunk; (2) the first POS tag of a chunk only depends on the previous two tags. Thus</Paragraph>
      <Paragraph position="6"> Where, ix is the number of words that the thi English chunk contains. And 1,2, , [?][?] ieie tt refer to the two tags before 1,iet .</Paragraph>
      <Paragraph position="7"> The third part can be derived based on the assumption that an English word iew only depends on its POS tag ( iet ), chunk type ( 'ien ) it belongs to and its positional information ( iebm ) in the chunk, thus</Paragraph>
      <Paragraph position="9"> i' is the index of the chunk the word belongs to.</Paragraph>
      <Paragraph position="10"> Finally, from (4)(5)(6)(7), we arrive</Paragraph>
      <Paragraph position="12"/>
      <Paragraph position="14"> We assume the word translation probability is 1 since we are using the word alignment result.</Paragraph>
      <Paragraph position="15"> Comparing with a typical HMM based tagger, our model also utilizes the POS tag information in the other language.</Paragraph>
      <Paragraph position="16"> Obtaining the Best Chinese Chunking</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Obtaining the Best Chinese Chunking Result
</SectionTitle>
      <Paragraph position="0"> Similar to the English chunking model, the Chinese chunking model also includes (1) inter-chunk model; (2) intra-chunk model. They are simplified, however, because of limited training data.</Paragraph>
      <Paragraph position="1"> Using the derivation similar to equation (4)-(8), we can get (11) form equation (3) with the assumptions that (1) ),,,,,|( aeTBTcBp eecc depends only on cT , c anda ; (2) bigram for chunk type transition; (3) bigram for tag transition inside a chunk; (4) trigram for the POS tag transition between chunks, we get</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Model Estimation
</SectionTitle>
      <Paragraph position="0"> We use three kinds of resources for training and testing: a) The WSJ part of the Penn Treebank II corpus (Marcus, Santorini &amp; Marcinkiewics 1993). Sections 00-19 are used as the training data, and sections 20-24 as the test data. b) The HIT Treebank2, containing 2000 sentences. c) The HIT bilingual corpus3, containing 20,000 sentence-pairs (in general domain) annotated with POS and word alignment information.</Paragraph>
      <Paragraph position="1"> We used 19,000 sentence-pairs for training and 1,000 for testing. These 1000 sentence-pairs are manually chunked and aligned.</Paragraph>
      <Paragraph position="2"> From the Penn Treebank, English chunks were extracted with the conversion tool (http://lcg-www.uia.ac.be/conll2000/chunking).</Paragraph>
      <Paragraph position="3"> From the HIT Treebank, Chinese chunks were extracted with a conversion tool implemented by ourselves. We can obtain an English chunk bank and a Chinese chunk bank.</Paragraph>
      <Paragraph position="4"> With the chunk dataset obtained above, the parameters were estimated with Maximum Likelihood Estimation.</Paragraph>
      <Paragraph position="5"> The POS tag translation probability in equation (9) was estimated from c).</Paragraph>
      <Paragraph position="6"> The English part-of-speech tag set is the same with Penn Treebank. And the Chinese tag set is the same with HIT Treebank.</Paragraph>
      <Paragraph position="7"> 13 chunk types were used for English, which are the same with (Erik et al, 2000). 7 chunk types were used for Chinese, including BDP (adverb phrase), BNP (noun phrase), BAP (adjective  phrase), BVP (verb phrase), BMP (quantifier phrase), BPP (prepositional phrase) and O (words outside any other chunks).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>