<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2022">
  <Title>Using bilingual dependencies to align words in English/French parallel corpora</Title>
  <Section position="5" start_page="127" end_page="127" type="metho">
    <SectionTitle>
3 Corpora and parsers
</SectionTitle>
    <Paragraph position="0"> The syntax-based alignment was tested on three parallel corpora aligned at the sentence level: INRA, JOC and HLT. The first corpus was compiled at the National Institute for Agricultural Research (INRA) to enrich a bilingual terminology database used by translators. It comprises 6815 aligned sentences and mainly consists of research papers and popular-science texts.</Paragraph>
    <Paragraph position="1"> The JOC corpus was made available in the framework of the ARCADE project, which focused on the evaluation of parallel text alignment systems (Véronis &amp; Langlais, 2000). It contains written questions on a wide variety of topics addressed by members of the European Parliament to the European Commission, as well as the corresponding answers. It is made up of 8765 aligned sentences. The HLT corpus was used in the evaluation of word alignment systems described in (Mihalcea &amp; Pedersen, 2003). It contains 447 aligned sentences from the Canadian Hansards (Och &amp; Ney, 2000). The corpus processing was carried out by a French/English parser, SYNTEX (Fabre &amp; Bourigault, 2001). SYNTEX is a dependency parser whose input is a POS-tagged corpus, meaning that each word in the corpus is assigned a lemma and a grammatical tag. The parser identifies dependencies in the sentences of a given corpus, for instance subjects and direct and indirect objects of verbs.</Paragraph>
    <Paragraph position="2"> The parsing is performed independently in each language, yet the outputs are quite homogeneous since the syntactic dependencies are identified and represented in the same way in both languages.</Paragraph>
    <Paragraph position="3"> In addition to parsed English/French bitexts, the syntax-based alignment requires that pairs of anchor words be identified prior to propagation.</Paragraph>
  </Section>
  <Section position="6" start_page="127" end_page="127" type="metho">
    <SectionTitle>
4 Identification of anchor pairs
</SectionTitle>
    <Paragraph position="1"> To derive a set of words that are likely to be useful for initiating the propagation process, we implemented a widely used method of co-occurrence counts described notably in (Gale &amp; Church, 1991; Ahrenberg et al., 2000). For each source word w1 and target word w2, the Jaccard association score is computed as follows: j(w1, w2) = f(w1, w2) / (f(w1) + f(w2) - f(w1, w2)), where f(w1) and f(w2) are the occurrence counts of w1 and w2 and f(w1, w2) is their co-occurrence count. The Jaccard score is computed provided the number of overall occurrences of w1 and w2 is higher than 4, since statistical techniques have proved to be particularly efficient when aligning frequent units. The alignments are filtered according to the j(w1, w2) value and retained if this value is 0.2 or higher. Moreover, two further tests, based on cognate recognition and on a mutual correspondence condition, are applied.</Paragraph>
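    <Paragraph> As an illustration, the score and the two filters described above can be sketched in Python as follows. This is a minimal sketch, not the implementation used in the experiments: the function name, the representation of the bitext as lists of tokens, the counting of occurrences once per aligned sentence pair, and the reading of the frequency condition as applying to each word separately are all assumptions.

```python
from collections import Counter
from itertools import product


def jaccard_anchor_candidates(bitext, min_count=5, threshold=0.2):
    """Score source/target word pairs by sentence-level co-occurrence.

    bitext: list of (source_tokens, target_tokens) aligned sentence pairs.
    Returns {(w1, w2): score} for the pairs that pass both filters.
    """
    f_src, f_tgt, f_pair = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        # Assumption: a word is counted once per aligned sentence pair.
        src_types, tgt_types = set(src), set(tgt)
        f_src.update(src_types)
        f_tgt.update(tgt_types)
        f_pair.update(product(src_types, tgt_types))

    scores = {}
    for (w1, w2), joint in f_pair.items():
        # Frequency filter: each word must occur more than 4 times.
        if f_src[w1] >= min_count and f_tgt[w2] >= min_count:
            j = joint / (f_src[w1] + f_tgt[w2] - joint)
            # Score filter: keep pairs scoring 0.2 or higher.
            if j >= threshold:
                scores[(w1, w2)] = j
    return scores
```

    Counting each word once per sentence pair keeps the score between 0 and 1, since the denominator is then the number of sentence pairs containing w1 or w2. The cognate and mutual correspondence tests are not modelled here.</Paragraph>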
    <Paragraph position="2"> The identification of anchor pairs, consisting of words that are translation equivalents within aligned sentences, combines both the projection of the initial lexicon and the recognition of cognates for words that have not been taken into account in the lexicon. These pairs are used as the starting point of the propagation process.</Paragraph>
  </Section>
  <Section position="7" start_page="127" end_page="128" type="metho">
    <SectionTitle>
5 Syntax-based propagation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="127" end_page="128" type="sub_section">
      <SectionTitle>
5.1 Two types of propagation
</SectionTitle>
      <Paragraph position="0"> The syntax-based propagation may be performed in two different directions, as a given word is likely to be both governor and dependent with respect to other words. The first direction starts with dependent anchor words and propagates the alignment link to the governors (Dep-to-Gov propagation). The Dep-to-Gov propagation is a priori not ambiguous since one dependent is governed by at most one word. Thus, there is just one relation on which the propagation can be based. The second direction goes the opposite way: starting with governor anchor words, the alignment link is propagated to their dependents (Gov-to-Dep propagation). In this case, several relations may be available for the propagation, as a governor can have more than one dependent, so the propagation is potentially ambiguous. The ambiguity is particularly widespread when propagating from head nouns to their nominal and adjectival dependents. In Figure 2, there is one occurrence of the relation pcomp in English and two in French. Thus, it is not possible to determine a priori whether to propagate using the relations mod/pcomp2, on the one hand, and pcomp1/pcomp2', on the other hand, or mod/pcomp2' and pcomp1/pcomp2. Moreover, even if there is just one occurrence of the same relation in each language, it does not mean that the propagation is necessarily performed through the same relation, as shown in Figure 3. Note that the process is not iterative to date, so the number of words it can align depends on the initial number of anchor words per sentence.</Paragraph>
      <Paragraph position="1"> In the following sections, we describe the two types of propagation. The propagation patterns we rely on are given in the form CDep-rel-CGov, where CDep is the POS of the dependent, rel is the dependency relation and CGov is the POS of the governor. The anchor element is underlined and the one aligned by propagation is in bold.</Paragraph>
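      <Paragraph> To make the CDep-rel-CGov notation concrete, the Dep-to-Gov direction can be sketched in Python as follows. This is a hedged illustration only: the Dep record, the dep_to_gov function and the identification of words by their lemma (rather than by their position in the sentence) are assumptions made for the example and do not reflect the SYNTEX output format.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    dep: str      # dependent lemma
    dep_pos: str  # POS of the dependent (CDep)
    rel: str      # dependency relation
    gov: str      # governor lemma
    gov_pos: str  # POS of the governor (CGov)

    def pattern(self):
        return f"{self.dep_pos}-{self.rel}-{self.gov_pos}"


def dep_to_gov(anchor, src_deps, tgt_deps, patterns):
    """Propagate an anchor pair of dependents to their governors.

    anchor: (source_lemma, target_lemma) pair that is already aligned.
    Returns the governor pairs reached through the same pattern in
    both languages, restricted to the allowed patterns.
    """
    src_word, tgt_word = anchor
    links = []
    for s in src_deps:
        if s.dep != src_word or s.pattern() not in patterns:
            continue
        for t in tgt_deps:
            if t.dep == tgt_word and t.pattern() == s.pattern():
                links.append((s.gov, t.gov))
    return links


# "The net is then hauled..." / "Le filet est ensuite hale..."
src = [Dep("then", "Adv", "mod", "haul", "V")]
tgt = [Dep("ensuite", "Adv", "mod", "haler", "V")]
links = dep_to_gov(("then", "ensuite"), src, tgt, {"Adv-mod-V"})
# links == [("haul", "haler")]
```

      Because a dependent has at most one governor, each anchor pair yields at most one link in this direction when every word occurs once in the sentence, which is the sense in which Dep-to-Gov propagation is unambiguous.</Paragraph>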
    </Section>
    <Section position="2" start_page="128" end_page="128" type="sub_section">
      <SectionTitle>
5.2 Alignment of verbs
</SectionTitle>
      <Paragraph position="0"> Verbs are aligned according to eight propagation patterns.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="128" end_page="129" type="metho">
    <SectionTitle>
DEP-TO-GOV PROPAGATION TO ALIGN GOVERNOR VERBS
</SectionTitle>
    <Paragraph position="0"> The patterns are: Adv-mod-V (1), N-subj-V (2), N-obj-V (3), N-pcomp-V (4) and V-pcomp-V (5).</Paragraph>
    <Paragraph position="1"> (1) The net is then hauled to the shore. Le filet est ensuite halé à terre.</Paragraph>
    <Paragraph position="2"> (2) The fish are generally caught when they migrate from their feeding areas. Généralement les poissons sont capturés quand ils migrent de leur zone d'engraissement.</Paragraph>
    <Paragraph position="3"> (3) Most of the young shad reach the sea. La plupart des alosons gagne la mer.</Paragraph>
    <Paragraph position="4"> (4) The eggs are very small and fall to the bottom. Les oeufs de très petite taille tombent sur le fond. (5) X is a model which was designed to stimulate... X est un modèle qui a été conçu pour stimuler...</Paragraph>
    <Paragraph position="5"> GOV-TO-DEP PROPAGATION TO ALIGN DEPENDENT VERBS. The alignment links are propagated from the governors to the dependent verbs using three propagation patterns: V-pcomp-V (1), V-pcomp-N (2) and V-pcomp-Adj (3).</Paragraph>
    <Paragraph position="6"> (1) Ploughing tends to destroy the soil microaggregated structure. Le labour tend à rompre leur structure microagrégée.</Paragraph>
    <Paragraph position="7"> (2) The capacity to colonize the digestive mucosa... L'aptitude à coloniser le tube digestif... (3) An established infection is impossible to control. Toute infection en cours est impossible à maîtriser.</Paragraph>
    <Paragraph position="8"> [Figure 2: dependency relations in "outdoor use of water" and "utilisation en extérieur de l'eau"]</Paragraph>
    <Paragraph position="9"> [Figure 3: dependency relations in "reference product on the market" and "produit commercial de référence"]</Paragraph>
    <Section position="1" start_page="128" end_page="129" type="sub_section">
      <SectionTitle>
5.3 Alignment of adjectives and nouns
</SectionTitle>
      <Paragraph position="0"> The two types of propagation described in section 5.2 for use with verbs are also used to align adjectives and nouns. However, these latter categories cannot be treated in a fully independent way when propagating from head noun anchor words in order to align the dependents. The syntactic structure of noun phrases may be different in English and French, since they rely on a different type of composition to produce compounds and on the same one to produce free noun phrases. Thus, the potential ambiguity arising from the Gov-to-Dep propagation from head nouns mentioned in section 5.1 may be accompanied by variation phenomena affecting the category of the dependents. For instance, a noun may be rendered by an adjective, or vice versa: tax treatment profits is translated as traitement fiscal des bénéfices, so the noun tax is in correspondence with the adjective fiscal. The syntactic relations used to propagate the alignment links are thus different.</Paragraph>
      <Paragraph position="1"> In order to cope with the variation problem, the propagation is performed regardless of whether the syntactic relations are identical in both languages, and regardless of whether the POS of the words to be aligned are the same. To sum up, adjectives and nouns are aligned separately from each other by means of Dep-to-Gov propagation, or of Gov-to-Dep propagation provided that the governor is not a noun. They are not treated separately when aligning by means of Gov-to-Dep propagation from head noun anchor pairs.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="129" end_page="129" type="metho">
    <SectionTitle>
DEP-TO-GOV PROPAGATION TO ALIGN ADJECTIVES
</SectionTitle>
    <Paragraph position="0"> The propagation patterns involved are: Adv-mod-Adj (1), N-pcomp-Adj (2) and V-pcomp-Adj (3).</Paragraph>
    <Paragraph position="1"> (1) The white cedar exhibits a very common physical defect.</Paragraph>
    <Paragraph position="2"> Le Poirier-pays présente un défaut de forme très fréquent.</Paragraph>
    <Paragraph position="3"> (2) The area presently devoted to agriculture represents...</Paragraph>
    <Paragraph position="4"> La surface actuellement consacrée à l'agriculture représenterait...</Paragraph>
    <Paragraph position="5"> (3) Only four plots were liable to receive this input. Seulement quatre parcelles sont susceptibles de recevoir ces apports.</Paragraph>
    <Paragraph position="6"> DEP-TO-GOV PROPAGATION TO ALIGN NOUNS. Nouns are aligned according to the following propagation patterns: Adj-mod-N (1), N-mod-N/N-pcomp-N (2), N-pcomp-N (3) and V-pcomp-N (4).</Paragraph>
    <Paragraph position="7"> (1) Allis shad remain on the continental shelf. La grande alose reste sur le plateau continental. (2) Nature of micropollutant carriers. La nature des transporteurs des micropolluants.</Paragraph>
    <Paragraph position="8"> (3) The bodies of shad are generally fusiform. Le corps des aloses est généralement fusiforme. (4) Ability to react to light. Capacité à réagir à la lumière.</Paragraph>
  </Section>
  <Section position="10" start_page="129" end_page="129" type="metho">
    <SectionTitle>
UNAMBIGUOUS GOV-TO-DEP PROPAGATION TO ALIGN NOUNS
</SectionTitle>
    <Paragraph position="0"> The propagation is not ambiguous when dependent nouns are not governed by a noun.</Paragraph>
    <Paragraph position="1"> This is the case when considering the following three propagation patterns: N-subj|obj-V (1), N-pcomp-V (2) and N-pcomp-Adj (3).</Paragraph>
    <Paragraph position="2"> (1) The caterpillars can inoculate the fungus. Les chenilles peuvent inoculer le champignon.</Paragraph>
    <Paragraph position="3"> (2) The roots are placed in tanks. Les racines sont placées en bacs.</Paragraph>
    <Paragraph position="4"> (3) ...a fungus responsible for rot.</Paragraph>
    <Paragraph position="5"> ... un champignon responsable de la pourriture.</Paragraph>
  </Section>
  <Section position="11" start_page="129" end_page="130" type="metho">
    <SectionTitle>
POTENTIALLY AMBIGUOUS GOV-TO-DEP
PROPAGATION TO ALIGN NOUNS AND ADJECTIVES.
</SectionTitle>
    <Paragraph position="0"> Considering the potential ambiguity described in section 5.1, the algorithm which supports Gov-to-Dep propagation from head noun anchor words (n1, n2) takes into account three situations which are likely to occur.</Paragraph>
    <Paragraph position="1"> First, each of n1 and n2 has only one dependent, respectively dep1 and dep2, involving one of the mod or pcomp relations; dep1 and dep2 are aligned.</Paragraph>
    <Paragraph position="2"> the drained whey / le lactosérum d'égouttage ⇒ (drained, égouttage)</Paragraph>
    <Paragraph position="3"> Second, n1 has one dependent dep1 and n2 several {dep2_1, dep2_2, ..., dep2_n}, or vice versa. For each dep2_i, check if one of the possible alignments has already been performed, either by propagation or anchor word spotting. If such an alignment exists, remove the others (dep1, dep2_k) such that k ≠ i, or vice versa. Otherwise, retain all the alignments (dep1, dep2_i), or vice versa, without resolving the ambiguity.</Paragraph>
    <Paragraph position="4"> stimulant substances which are absent from... / substances solubles stimulantes absentes de... (stimulant, {soluble, stimulant, absent}); already_aligned(stimulant, stimulant) = 1 ⇒ (stimulant, stimulant)</Paragraph>
    <Paragraph position="5"> Third, both n1 and n2 have several dependents, {dep1_1, ..., dep1_m} and {dep2_1, ..., dep2_n}. For each pair (dep1_i, dep2_j), check if the alignment has already been performed. If such an alignment exists, remove the competing alignments (dep1_k, dep2_l) such that k ≠ i or l ≠ j. Otherwise, retain all the alignments (dep1_i, dep2_j) without resolving the ambiguity.</Paragraph>
    <Paragraph position="6"> unfair trading practices / pratiques commerciales déloyales (unfair, {commercial, déloyal}) (trading, {commercial, déloyal}); already_aligned(unfair, déloyal) = 1 ⇒ (unfair, déloyal) ⇒ (trading, commercial)</Paragraph>
    <Paragraph position="7"> a big rectangular net, which is lowered... / un vaste filet rectangulaire immergé... (big, {vaste, rectangulaire, immergé}) (rectangular, {vaste, rectangulaire, immergé}); already_aligned(rectangular, rectangulaire) = 1 ⇒ (rectangular, rectangulaire) ⇒ (big, {vaste, immergé})</Paragraph>
    <Paragraph position="8"> The implemented propagation algorithm has two major advantages. First, it permits the resolution of some alignment ambiguities by taking advantage of alignments that have been previously performed. Second, it allows the system to cope with the problem of non-correspondence between English and French syntactic structures: words can be aligned using different syntactic relations in the two languages, even when the categories of the words under consideration differ.</Paragraph>
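    <Paragraph> The three situations above can be covered by a single procedure, sketched below in Python. This is one possible reading of the algorithm, inferred from the worked examples, and not the original implementation: the function name, the use of lemma strings to identify words, and the rule that a known pair eliminates exactly the candidates reusing one of its two words are assumptions.

```python
def gov_to_dep_from_nouns(deps1, deps2, already_aligned):
    """Align the dependents of two aligned head nouns (n1, n2).

    deps1, deps2: dependents of n1 and n2 (mod or pcomp relations).
    already_aligned: set of (source, target) pairs found earlier,
    by anchor word spotting or by propagation.
    Returns the set of candidate pairs that are retained.
    """
    candidates = {(d1, d2) for d1 in deps1 for d2 in deps2}
    known = candidates.intersection(already_aligned)
    # A known pair rules out every other candidate that reuses one of
    # its two words; candidates sharing no word with it are kept.
    used_src = {d1 for d1, _ in known}
    used_tgt = {d2 for _, d2 in known}
    return {
        (d1, d2)
        for d1, d2 in candidates
        if (d1, d2) in known
        or (d1 not in used_src and d2 not in used_tgt)
    }


pairs = gov_to_dep_from_nouns(
    ["stimulant"],
    ["soluble", "stimulant", "absent"],
    {("stimulant", "stimulant")},
)
# pairs == {("stimulant", "stimulant")}
```

    When no candidate has been aligned beforehand, the function returns every candidate pair, which corresponds to retaining all the alignments without resolving the ambiguity.</Paragraph>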
    <Section position="1" start_page="130" end_page="130" type="sub_section">
      <SectionTitle>
5.4 Comparative evaluation
</SectionTitle>
      <Paragraph position="0"> The results achieved using the syntax-based alignment (sba) are compared to those obtained with the baseline provided by the IBM models implemented in the GIZA++ package (Och &amp; Ney, 2000) (Table 2 and Table 3). More precisely, we used the intersection of IBM-4 Viterbi alignments for both translation directions. Table 2 shows the precision assessed against a reference set of 1000 alignments manually annotated in the INRA and the JOC corpora respectively. It can be observed that the syntax-based alignment offers good accuracy, similar to that of the baseline.</Paragraph>
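      <Paragraph> The intersection used for the baseline can be illustrated with a short Python sketch. It assumes that each directional alignment is available as a set of word-position pairs; the function name and this representation are hypothetical and do not correspond to the GIZA++ output format.

```python
def intersect_alignments(src_to_tgt, tgt_to_src):
    """Keep only the links proposed in both translation directions.

    src_to_tgt: set of (i, j) links from the source-to-target model.
    tgt_to_src: set of (j, i) links from the target-to-source model.
    Returns the shared links as (i, j) pairs.
    """
    flipped = {(i, j) for j, i in tgt_to_src}
    return src_to_tgt.intersection(flipped)
```

      Keeping only the links found in both directions typically favours precision over recall, since a link proposed by a single model is discarded.</Paragraph>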
      <Paragraph position="1"> More complete results (precision, recall and F-measure) are presented in Table 3. They have been obtained using reference data from an evaluation of word alignment systems (Mihalcea &amp; Pedersen, 2003). It should be noted that the figures concerning the syntax-based alignment were assessed with respect to the annotations that do not involve empty words, since up to now we have focused only on content words. Whereas the baseline precision for the HLT corpus is comparable to the one reported in Table 2, the syntax-based alignment score decreases. Moreover, the difference between the two approaches is considerable with regard to recall. This may be due to the fact that our syntax-based alignment approach basically relies on isomorphic syntactic structures, i.e. structures in which the two following conditions are met: i) the relation under consideration is identical in both languages and ii) the words involved in the syntactic propagation have the same POS. Most of the cases of non-isomorphism, apart from the ones presented in section 5.1, are not taken into account.</Paragraph>
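      <Paragraph> For reference, the three measures can be computed from sets of alignment links as in the following Python sketch. It uses the standard definitions of precision, recall and F-measure over a single reference set; the function name is hypothetical, and any distinction between link types in the reference data is not modelled.

```python
def precision_recall_f(proposed, reference):
    """Score a set of proposed links against a set of reference links."""
    correct = len(proposed.intersection(reference))
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    # Balanced F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

      To restrict the evaluation to content words, as described above, the reference links involving empty words would be removed from the reference set before calling the function.</Paragraph>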
    </Section>
  </Section>
</Paper>