<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2207">
  <Title>Identifying correspondences between words: an approach based on a bilingual syntactic analysis of French/English parallel corpora</Title>
  <Section position="5" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Starting hypothesis
</SectionTitle>
    <Paragraph position="0"> We take as a starting point the hypothesis formulated by Debili and Zribi (1996) according to which &quot;paradigmatic connections can help to determine syntagmatic relations, and conversely&quot;.</Paragraph>
    <Paragraph position="1"> More precisely, the idea is that syntactic relations can be used to validate or invalidate existing alignment links, on the one hand, and to create new ones, on the other hand. The reasoning is as follows: if there is a pair of anchor words, i.e. if two words w1 (community) and w2 (communaute) are aligned at the sentence level, and if there is a syntactic relation standing between w1 (community) and another word (ban) on the one hand, and between w2 (communaute) and another word (interdire) on the other hand, then the alignment link is propagated from the anchor pair (community, communaute) to the words (ban, interdire). We call this procedure &quot;alignment by syntactic propagation&quot;.</Paragraph>
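    <Paragraph> The reasoning above can be illustrated with a minimal sketch. It is not the authors' implementation: the Dependency tuple, the propagate function and the requirement that the relation label be identical in both sentences are assumptions made for illustration, and the dependencies are entered by hand rather than produced by a parser.</Paragraph>
    <Paragraph><![CDATA[
```python
from typing import NamedTuple


class Dependency(NamedTuple):
    """One syntactic dependency (hypothetical representation)."""
    relation: str
    governor: str
    dependent: str


def propagate(anchor_pair, source_deps, target_deps):
    """Propagate the alignment link of an anchor pair to the words that are
    linked to the anchors by the same relation in both sentences."""
    w1, w2 = anchor_pair
    links = set()
    for d1 in source_deps:
        for d2 in target_deps:
            if d1.relation != d2.relation:
                continue  # assumption: the relation must match
            if d1.dependent == w1 and d2.dependent == w2:
                # anchors are dependents: align their governors
                links.add((d1.governor, d2.governor))
            elif d1.governor == w1 and d2.governor == w2:
                # anchors are governors: align their dependents
                links.add((d1.dependent, d2.dependent))
    return links


# "The Community banned imports of ivory." /
# "La Communaute a interdit l'importation d'ivoire."
english = [Dependency("SUBJECT", governor="ban", dependent="community")]
french = [Dependency("SUBJECT", governor="interdire", dependent="communaute")]

new_links = propagate(("community", "communaute"), english, french)
print(new_links)  # {('ban', 'interdire')}
```
]]></Paragraph>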
    <Paragraph position="6"> Our translation of the French version &lt;&lt; les liaisons paradigmatiques peuvent aider a determiner les relations syntagmatiques, et inversement &gt;&gt;.</Paragraph>
    <Paragraph position="7"> [Figure: the SUBJECT relation in &quot;The Community banned imports of ivory.&quot; and in &quot;La Communaute a interdit l'importation d'ivoire.&quot;]</Paragraph>
    <Paragraph position="8"> In the rest of this article, we describe the overall design and implementation of the syntactic propagation process and the results of applying it to two parsed French/English parallel corpora: INRA and JOC.</Paragraph>
  </Section>
  <Section position="8" start_page="1" end_page="4" type="metho">
    <SectionTitle>
4 Corpus processing
</SectionTitle>
    <Paragraph position="0"> The alignment by syntactic propagation was tested on two different parallel corpora aligned at the sentence level: INRA and JOC. The first corpus was compiled at the National Institute for Agricultural Research (INRA) to enrich a bilingual terminology database used by translators. It comprises about 300,000 words and mainly consists of research papers, popular-science papers and press releases.</Paragraph>
    <Paragraph position="1"> The JOC corpus was provided by the ARCADE project, a campaign devoted to the evaluation of parallel text alignment systems (Veronis and Langlais, 2000). It contains written questions on a wide variety of topics addressed by members of the European Parliament to the European Commission, together with the corresponding answers, published in the Official Journal of the European Community in nine official languages. A portion of about 400,000 words of the French and English parts was used in the framework of the ARCADE evaluation task.</Paragraph>
    <Paragraph position="2"> The corpus processing was carried out by a French/English parser: SYNTEX (Bourigault and Fabre, 2000; Frerot, Fabre and Bourigault, 2003). SYNTEX is a dependency parser whose input is a POS-tagged corpus, meaning that each word in the corpus is assigned a lemma and a grammatical tag. The parser identifies syntactic dependencies in the sentences of a given corpus, for instance subjects, direct and indirect objects of verbs. Once all syntactic dependencies have been identified, a set of words and phrases is extracted out of the corpus. Both versions of the parser, the French one and the English one, are being developed according to the same procedures and architecture. The parsing is performed independently in each language, yet the outputs are quite homogeneous since the syntactic dependencies are identified and represented in the same way in both languages. In this respect, the alignment method proposed is different from the ones developed by Wu (2000) and by Lin and Cherry (2003): the former is based on synchronous parsing while the latter uses a dependency tree generated only in the source language.</Paragraph>
    <Paragraph position="3"> We are grateful to A. Lacombe who allowed us to use this corpus for research purposes.</Paragraph>
  </Section>
  <Section position="9" start_page="4" end_page="4" type="metho">
    <SectionTitle>
INRA JOC
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="10" start_page="4" end_page="4" type="metho">
    <SectionTitle>
5 Identification of anchor pairs
</SectionTitle>
    <Paragraph position="0"> In addition to a parsed French/English corpus aligned at the sentence level, the syntactic alignment requires that pairs of anchor words be identified prior to propagation in order to start the process. In this study, we chose to extract a lexicon out of the corpus, the anchor pairs being located both by projecting the lexicon at the level of aligned sentences and by processing the identical and fuzzy cognates.</Paragraph>
    <Paragraph position="1"> To derive from the corpus a list of words which are likely to be used to initiate the syntactic propagation process, we implemented a widely used method, described notably in (Gale and Church, 1991; Ahrenberg, Andersson and Merkel, 2000), which is based on the assumption that the words which appear frequently in aligned text segments are potential translation equivalents. For each source (English) and target (French) unit, u1 and u2, the co-occurrences of (u1, u2) in aligned sentences are counted in comparison with their overall occurrences in the corpus, and then an association score is computed. In this study, we chose the Jaccard association score, which is calculated as follows: j(u1, u2) = f(u1, u2) / (f(u1) + f(u2) - f(u1, u2)), where f(u1, u2) is the number of co-occurrences of u1 and u2 in aligned sentences and f(u1) and f(u2) are their overall numbers of occurrences.</Paragraph>
    <Paragraph position="2"> The association score is computed provided the number of overall occurrences of u1 and u2 is higher than 4, since statistical techniques have proved to be particularly efficient when aligning frequent units. Moreover, the alignments are filtered according to the j(u1, u2) value, provided the latter is higher than 0.2. Then, two tests, based on cognate recognition and on the mutual correspondence condition (Altenberg, 1999), are applied so as to filter spurious associations out of the initial lexicon.</Paragraph>
    <Paragraph position="3"> The identification of anchor pairs, consisting of words which are translation equivalents within aligned sentences, combines both the projection of the initial lexicon and the recognition of cognates for words which have not been taken into account in the lexicon. These pairs are used as the starting point of the propagation process.</Paragraph>
    <Paragraph position="4"> Table [...] gives some characteristics of the two corpora: the number of aligned sentences, the overall number of anchor pairs identified, the average number of anchor pairs per sentence pair and the precision rate of the anchor pairs. It can be seen that a high number of anchor pairs per sentence has been identified for both corpora, with high accuracy.</Paragraph>
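    <Paragraph> The lexicon extraction step described in this section can be sketched as follows. This is an illustrative sketch only: the function name jaccard_lexicon and the toy corpus are hypothetical, the score is the standard Jaccard coefficient, occurrences are counted once per sentence, and the frequency threshold is assumed to apply to each of the two units separately. The cognate and mutual correspondence tests are not covered.</Paragraph>
    <Paragraph><![CDATA[
```python
from collections import Counter
from itertools import product


def jaccard_lexicon(aligned_sentences, min_occurrences=4, min_score=0.2):
    """Build a lexicon of (source unit, target unit) pairs scored with the
    Jaccard coefficient f12 / (f1 + f2 - f12)."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_units, tgt_units in aligned_sentences:
        src_set, tgt_set = set(src_units), set(tgt_units)
        src_freq.update(src_set)                     # f(u1)
        tgt_freq.update(tgt_set)                     # f(u2)
        pair_freq.update(product(src_set, tgt_set))  # f(u1, u2)

    lexicon = {}
    for (u1, u2), f12 in pair_freq.items():
        # assumption: each unit must occur more than min_occurrences times
        if src_freq[u1] <= min_occurrences or tgt_freq[u2] <= min_occurrences:
            continue
        score = f12 / (src_freq[u1] + tgt_freq[u2] - f12)
        if score > min_score:
            lexicon[(u1, u2)] = score
    return lexicon


# Toy corpus of lemmatised, sentence-aligned units (invented for the example).
corpus = (
    [(["fish", "net"], ["poisson", "filet"])] * 5
    + [(["fish"], ["poisson"])] * 5
    + [(["shad"], ["alose"])] * 2
)
lexicon = jaccard_lexicon(corpus)
print(lexicon[("fish", "poisson")])   # 1.0
print(("shad", "alose") in lexicon)   # False: both units are too rare
```
]]></Paragraph>
    <Paragraph> In this toy corpus the pair (fish, filet) still obtains a score of 0.5 and is kept, which shows why additional filters are applied to the initial lexicon.</Paragraph>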
  </Section>
  <Section position="11" start_page="4" end_page="4" type="metho">
    <SectionTitle>
6 Syntactic propagation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
6.1 Two types of propagation
</SectionTitle>
      <Paragraph position="0"> The syntactic propagation may be performed in two different directions. Indeed, a given word is likely to be both governor and dependent with respect to other words. The first direction consists in starting with dependent anchor words and propagating the alignment link to the governors (DepGov propagation). The DepGov propagation is a priori not ambiguous since one dependent is governed by at most one word. Thus, there is just one syntactic relation on which the propagation can be based. The syntactic structures are said to be parallel in English and French provided the two following conditions are met: i) the relation under consideration is identical in both languages and ii) the words involved in the syntactic propagation have the same POS. The second direction goes the opposite way: starting with governor anchor words, the alignment link is propagated to the dependents (GovDep propagation). In this case, several relations may be available to achieve the propagation, as it is possible for a governor to have more than one dependent, and so the propagation is potentially ambiguous. The ambiguity is particularly widespread when performing the GovDep propagation from head nouns to their nominal and adjectival dependents. Let us consider example (1). There is one occurrence of the relation PREP in English and two in French. Thus, it is not possible to determine a priori whether to propagate using the relations NN/PREP2, on the one hand, and PREP1/PREP2', on the other hand, or NN/PREP2' and PREP1/PREP2. Moreover, even if there is just one occurrence of the same relation in each language, it does not mean that the propagation is necessarily performed through the same relation, as shown in example (2).</Paragraph>
      <Paragraph position="1"> The precision was evaluated manually.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
6.2 Alignment of verbs
</SectionTitle>
      <Paragraph position="0"> Verbs are aligned according to eight propagation patterns: five for the DepGov propagation and three for the GovDep one.</Paragraph>
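      <Paragraph> The two propagation directions of section 6.1, together with the two parallelism conditions (identical relation, identical POS), can be sketched as follows. The Word and Dependency classes and the depgov and govdep functions are hypothetical names, and the dependencies are simplified by hand from example (3) of the DepGov verb patterns; the sketch only illustrates why GovDep propagation may return several candidate links where DepGov returns at most one per relation.</Paragraph>
      <Paragraph><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    lemma: str
    pos: str


@dataclass(frozen=True)
class Dependency:
    dependent: Word
    relation: str
    governor: Word


def depgov(anchor, source_deps, target_deps):
    """Anchors are dependents: propagate the link to their governors."""
    src, tgt = anchor
    return [
        (d1.governor, d2.governor)
        for d1 in source_deps if d1.dependent == src
        for d2 in target_deps if d2.dependent == tgt
        # conditions i) same relation and ii) same POS
        if d1.relation == d2.relation and d1.governor.pos == d2.governor.pos
    ]


def govdep(anchor, source_deps, target_deps):
    """Anchors are governors: propagate the link to their dependents."""
    src, tgt = anchor
    return [
        (d1.dependent, d2.dependent)
        for d1 in source_deps if d1.governor == src
        for d2 in target_deps if d2.governor == tgt
        if d1.relation == d2.relation and d1.dependent.pos == d2.dependent.pos
    ]


# Simplified dependencies for "... shad reach the sea" / "... alosons gagne la mer".
shad, sea, reach = Word("shad", "N"), Word("sea", "N"), Word("reach", "V")
aloson, mer, gagner = Word("aloson", "N"), Word("mer", "N"), Word("gagner", "V")
source = [Dependency(shad, "SUJ", reach), Dependency(sea, "OBJ", reach)]
target = [Dependency(aloson, "SUJ", gagner), Dependency(mer, "OBJ", gagner)]

depgov_links = depgov((sea, mer), source, target)
govdep_links = govdep((reach, gagner), source, target)
print([(a.lemma, b.lemma) for a, b in depgov_links])  # [('reach', 'gagner')]
print([(a.lemma, b.lemma) for a, b in govdep_links])  # [('shad', 'aloson'), ('sea', 'mer')]
```
]]></Paragraph>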
    </Section>
  </Section>
  <Section position="12" start_page="4" end_page="4" type="metho">
    <SectionTitle>
DEPGOV PROPAGATION TO ALIGN GOVERNOR VERBS.
</SectionTitle>
    <Paragraph position="0"> Five propagation patterns are used to align verbs: Adv-MOD-V (1), N-SUJ-V (2), N-OBJ-V (3), N-PREP-V (4) and V-PREP-V (5).</Paragraph>
    <Paragraph position="1"> (1) The net is then hauled to the shore.</Paragraph>
    <Paragraph position="2"> Le filet est ensuite hale a terre.</Paragraph>
    <Paragraph position="3"> (2) The fish are generally caught when they migrate from their feeding areas.</Paragraph>
    <Paragraph position="4"> Generalement les poissons sont captures quand ils migrent de leur zone d'engraissement.</Paragraph>
    <Paragraph position="5"> (3) Most of the young shad reach the sea.</Paragraph>
    <Paragraph position="6"> La plupart des alosons gagne la mer.</Paragraph>
    <Paragraph position="7"> (4) The eggs are very small and fall to the bottom. Les oeufs de tres petite taille tombent sur le fond. (5) X is a model which was designated to stimulate...</Paragraph>
    <Paragraph position="8"> X est un modele qui a ete concu pour stimuler...</Paragraph>
  </Section>
  <Section position="13" start_page="4" end_page="4" type="metho">
    <SectionTitle>
GOVDEP PROPAGATION TO ALIGN DEPENDENT VERBS.
</SectionTitle>
    <Paragraph position="0"> The alignment links are propagated from the governors to the dependent verbs using three propagation patterns: V-PREP-V (1), V-PREP-N (2) and V-...</Paragraph>
    <Paragraph position="1"> In the following sections, we describe precisely the implementation of the two types of propagation defined above in order to align verbs (section 6.2), on the one hand, and nouns and adjectives, on the other hand (section 6.3). To this end, we rely on different propagation patterns. Propagation patterns are given in the form CDep-REL-CGov, where CDep is the POS of the dependent, REL is the syntactic relation and CGov is the POS of the governor. The anchor element is underlined and the one aligned by propagation is bolded. For instance, the pattern N-SUJ-V corresponds to the propagation going from a noun anchor pair to the verbs through the subject relation.</Paragraph>
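    <Paragraph> The CDep-REL-CGov notation maps naturally onto a small data structure. The sketch below is only an illustration under stated assumptions: PropagationPattern, parse_pattern and licenses are hypothetical names, and composite notations such as N-NN-N/N-PREP-N would need extra handling that is not shown.</Paragraph>
    <Paragraph><![CDATA[
```python
from typing import NamedTuple


class PropagationPattern(NamedTuple):
    dep_pos: str    # CDep: POS of the dependent
    relation: str   # REL: syntactic relation
    gov_pos: str    # CGov: POS of the governor


def parse_pattern(text):
    """Parse a pattern written as CDep-REL-CGov, e.g. 'N-SUJ-V'."""
    dep_pos, relation, gov_pos = text.split("-")
    return PropagationPattern(dep_pos, relation, gov_pos)


# The five DepGov patterns listed in the text for aligning governor verbs.
DEPGOV_VERB_PATTERNS = [
    parse_pattern(p)
    for p in ("Adv-MOD-V", "N-SUJ-V", "N-OBJ-V", "N-PREP-V", "V-PREP-V")
]


def licenses(patterns, dep_pos, relation, gov_pos):
    """Return True if a dependency is covered by one of the patterns."""
    return PropagationPattern(dep_pos, relation, gov_pos) in patterns


print(parse_pattern("N-SUJ-V"))
print(licenses(DEPGOV_VERB_PATTERNS, "N", "SUJ", "V"))    # True
print(licenses(DEPGOV_VERB_PATTERNS, "Adj", "ADJ", "N"))  # False
```
]]></Paragraph>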
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
6.3 Alignment of adjectives and nouns
</SectionTitle>
      <Paragraph position="0"> As with verbs, the two types of propagation described in section 6.1 are used to align adjectives and nouns. However, these categories of words cannot be treated in a fully independent way when propagating from head noun anchor words in order to align the dependents. Indeed, the syntactic structure of noun phrases may be different in English and French, since the two languages rely on a different type of composition to produce compounds and on the same one to produce free noun phrases (Chuquet and Paillard, 1989). The potential ambiguity arising from the GovDep propagation from head nouns, mentioned in section 6.1, may then be accompanied by variation phenomena affecting the category of the dependents, called transposition (Vinay and Darbelnet, 1958; Chuquet and Paillard, 1989). For instance, a noun may be rendered by an adjective, or vice versa: tax treatment profits is translated by traitement fiscal des benefices, so the noun tax is in correspondence with the adjective fiscal. The syntactic relations used to propagate the alignment links are thus different.</Paragraph>
      <Paragraph position="1"> In order to cope with the variation problem, the propagation is performed whether the syntactic relations are identical in both languages or not, and if they are not, whether the categories of the words to be aligned are the same or not. To sum up, adjectives and nouns are aligned separately from each other by means of DepGov propagation, or of GovDep propagation provided that the governor is not a noun. They are not treated separately when aligning by means of GovDep propagation from head noun anchor pairs.</Paragraph>
    </Section>
  </Section>
  <Section position="14" start_page="4" end_page="4" type="metho">
    <SectionTitle>
DEPGOV PROPAGATION TO ALIGN ADJECTIVES.
</SectionTitle>
    <Paragraph position="0"> The propagation patterns involved are: Adv-MOD-Adj (1), N-PREP-Adj (2) and V-PREP-Adj (3).</Paragraph>
    <Paragraph position="1">  (1) The white cedar exhibits a very common physical defect.</Paragraph>
    <Paragraph position="2"> Le Poirier-pays presente un defaut de forme tres frequent.</Paragraph>
    <Paragraph position="3"> (2) The area presently devoted to agriculture represents...</Paragraph>
    <Paragraph position="4"> La surface actuellement consacree a l'agriculture representerait...</Paragraph>
    <Paragraph position="5"> (3) Only four plots were liable to receive this input.</Paragraph>
    <Paragraph position="6"> Seulement quatre parcelles sont susceptibles de recevoir ces apports.</Paragraph>
    <Paragraph position="7"> DEPGOV PROPAGATION TO ALIGN NOUNS. Nouns are aligned according to the following propagation patterns: Adj-ADJ-N (1), N-NN-N/N-PREP-N (2), N-PREP-N (3) and V-PREP-N (4).</Paragraph>
    <Paragraph position="8"> (1) Allis shad remain on the continental shelf. La grande alose reste sur le plateau continental. (2) Nature of micropolluant carriers.</Paragraph>
    <Paragraph position="9"> La nature des transporteurs des micropolluants. (3) The bodies of shad are generally fusiform. Le corps des aloses est generalement fusiforme. (4) Ability to react to light.</Paragraph>
    <Paragraph position="10"> Capacite a reagir a la lumiere.</Paragraph>
  </Section>
  <Section position="15" start_page="4" end_page="4" type="metho">
    <SectionTitle>
UNAMBIGUOUS GOVDEP PROPAGATION TO ALIGN NOUNS.
</SectionTitle>
    <Paragraph position="0"> The propagation is not ambiguous when dependent nouns are not governed by a noun. This is the case for the following three propagation patterns: N-SUJ|OBJ-V (1), N-PREP-V (2) and N-PREP-Adj (3).</Paragraph>
    <Paragraph position="1"> (1) The caterpillars can inoculate the fungus. Les chenilles peuvent inoculer le champignon. (2) The roots are placed in tanks.</Paragraph>
    <Paragraph position="2"> Les racines sont placees en bacs.</Paragraph>
    <Paragraph position="3"> (3) Botrysis, a fungus responsible for grey rot.  Botrysis, champignon responsable de la pourriture grise.</Paragraph>
  </Section>
  <Section position="16" start_page="4" end_page="4" type="metho">
    <SectionTitle>
POTENTIALLY AMBIGUOUS GOVDEP
PROPAGATION TO ALIGN NOUNS AND ADJECTIVES.
</SectionTitle>
    <Paragraph position="0"> As already explained in section 6.1, the propagation is potentially ambiguous when starting with head noun anchor words and trying to align the noun(s) and/or adjective(s) they govern.</Paragraph>
    <Paragraph position="1"> Considering this potential ambiguity, the algorithm which supports GovDep propagation from head noun anchor words (n1, n2) takes into account three situations which are likely to occur:</Paragraph>
    <Paragraph position="2"> 1. each of n1 and n2 has only one dependent, respectively reg1 and reg2, involving one of the following relations [...];</Paragraph>
    <Paragraph position="3"> 2. n1 has one dependent reg1 and n2 has several dependents {reg2_1, ..., reg2_n}, or vice versa. For each reg2_i, check if one of the possible alignments has already been performed, either by propagation or by anchor word spotting. If such an alignment exists, remove the others (reg1, reg2_k) such that k ≠ i, or vice versa. Otherwise, retain all the alignments (reg1, reg2_i), or vice versa, without solving the ambiguity;</Paragraph>
    <Paragraph position="4"> stimulant substances which are absent from... / substances solubles stimulantes absentes de...: (stimulant, {soluble, stimulant, absent}); already_aligned(stimulant, stimulant) = 1 =&gt; (stimulant, stimulant)</Paragraph>
    <Paragraph position="5"> 3. both n1 and n2 have several dependents, denoted reg1_i and reg2_j respectively. For each reg1_i and each reg2_j, check if one or several alignments have already been performed. If such alignments exist, remove all the alignments (reg1_k, reg2_l) such that k ≠ i or l ≠ j. Otherwise, retain all the alignments (reg1_i, reg2_j) without solving the ambiguity.</Paragraph>
    <Paragraph position="6"> unfair trading practices / pratiques commerciales deloyales: (unfair, {commercial, deloyal}), (trading, {commercial, deloyal}); already_aligned(unfair, deloyal) = 1 =&gt; (unfair, deloyal) =&gt; (trading, commercial)</Paragraph>
    <Paragraph position="7"> a big rectangular net, which is lowered... / un vaste filet rectangulaire immerge...: (big, {vaste, rectangulaire, immerge}), (rectangular, {vaste, rectangulaire, immerge}); already_aligned(rectangular, rectangulaire) = 1 =&gt; (rectangular, rectangulaire) =&gt; (big, {vaste, immerge})</Paragraph>
    <Paragraph position="8"> The implemented propagation algorithm has two major advantages. First, it makes it possible to resolve some alignment ambiguities by taking advantage of alignments which have been performed previously. Second, it copes with the problem of non-correspondence between English and French syntactic structures: words can be aligned using different syntactic relations in the two languages, even when the categories of the words under consideration are different.</Paragraph>
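    <Paragraph> The disambiguation step can be sketched as follows. This is a hedged reading of the procedure, modelled on the worked examples rather than on the authors' code: resolve_govdep is a hypothetical function, the dependents are given as plain lemma lists, French lemmas carry an _fr suffix only to keep the two sides distinct, and relation labels and categories are ignored.</Paragraph>
    <Paragraph><![CDATA[
```python
def resolve_govdep(source_dependents, target_dependents, already_aligned):
    """Candidate alignments between the dependents of two aligned head nouns.

    If some candidate pairs are already aligned (by propagation or anchor
    spotting), keep them and discard every other candidate that reuses one
    of their words; otherwise keep all candidates, leaving the ambiguity.
    """
    candidates = {(s, t) for s in source_dependents for t in target_dependents}
    confirmed = candidates & set(already_aligned)
    if not confirmed:
        return candidates  # ambiguity is not solved
    used_source = {s for s, _ in confirmed}
    used_target = {t for _, t in confirmed}
    remaining = {
        (s, t) for s, t in candidates
        if s not in used_source and t not in used_target
    }
    return confirmed | remaining


# unfair trading practices / pratiques commerciales deloyales
practices = resolve_govdep(
    ["unfair", "trading"], ["commercial_fr", "deloyal_fr"],
    already_aligned={("unfair", "deloyal_fr")},
)
print(sorted(practices))
# [('trading', 'commercial_fr'), ('unfair', 'deloyal_fr')]

# a big rectangular net, which is lowered / un vaste filet rectangulaire immerge
net = resolve_govdep(
    ["big", "rectangular"], ["vaste_fr", "rectangulaire_fr", "immerge_fr"],
    already_aligned={("rectangular", "rectangulaire_fr")},
)
print(sorted(net))
# [('big', 'immerge_fr'), ('big', 'vaste_fr'), ('rectangular', 'rectangulaire_fr')]
```
]]></Paragraph>
    <Paragraph> When each head noun has a single dependent and nothing is aligned yet, the function returns that single pair, which corresponds to the first situation.</Paragraph>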
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
6.4 Overall results
</SectionTitle>
      <Paragraph position="0"> Table 5 gives a summary of the results obtained by applying all propagation patterns to each corpus. It can be seen that the highest accuracy is achieved for the alignments corresponding to anchor pairs validated by the syntactic propagation (AP and PP): 99.7% and 99.8% precision for INRA and JOC respectively.</Paragraph>
      <Paragraph position="1"> The rates tend to decrease, to 88.5% and 86.1% respectively, for alignments established only by means of propagation, referred to as propagated pairs (PP), and are even lower (76.3%) for the anchor pairs which have not been confirmed by the propagation (AP). Furthermore, the new alignments produced account for less than 20% of all alignments, compared with approximately 50% for the confirmed ones. Finally, since the method aims at aligning content words, the recall is assessed in relation to their overall occurrences in the corpora.</Paragraph>
    </Section>
  </Section>
</Paper>