<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1032">
  <Title>A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora</Title>
  <Section position="2" start_page="0" end_page="236" type="metho">
    <SectionTitle>
2 Algorithm overview
</SectionTitle>
    <Paragraph position="0"> We treat the bilingual lexicon compilation problem as a pattern matching problem - each word shares some common features with its counterpart in the translated text. We try to find the best representations of these features and the best ways to match them. We ran the algorithm on a small Chinese/English parallel corpus of approximately 5760 unique English words.</Paragraph>
    <Paragraph position="1"> The outline of the algorithm is as follows: 1. Tag the English half of the parallel text.</Paragraph>
    <Paragraph position="2"> In the first stage of the algorithm, only English words which are tagged as nouns or proper nouns are used to match words in the Chinese text.</Paragraph>
    <Paragraph position="3">  2. Compute the positional difference vector of each word. Each of these nouns or proper nouns is converted from their positions in the text into a vector.</Paragraph>
    <Paragraph position="4"> 3. Match pairs of positional difference vectors~ giving scores. All vectors from English and Chinese are matched against each other by Dynamic Time Warping (DTW).</Paragraph>
    <Paragraph position="5"> 4. Select a primary lexicon using the scores.  A threshold is applied to the DTW score of each pair, selecting the most correlated pairs as the first bilingual lexicon.</Paragraph>
    <Paragraph position="6"> 5. Find anchor points using the primary lexicon. The algorithm reconstructs the DTW paths of these positional vector pairs, giving us a set of word position points which are filtered to yield anchor points. These anchor points are used for compiling a secondary lexicon.</Paragraph>
    <Paragraph position="7">  6. Compute a position binary vector for  each word using the anchor points. The remaining nouns and proper nouns in English and all words in Chinese are represented in a non-linear segment binary vector form from their positions in the text.</Paragraph>
    <Paragraph position="8"> 7. Match binary vectors to yield a secondary lexicon. These vectors are matched against each other by mutual information. A confidence score is used to threshold these pairs. We obtain the secondary bilingual lexicon from this stage.</Paragraph>
    <Paragraph position="9"> In Section 3, we describe the first four stages in our algorithm, cumulating in a primary lexicon. Section 4 describes the next anchor point finding stage. Section 5 contains the procedure for compiling the secondary lexicon.</Paragraph>
  </Section>
  <Section position="3" start_page="236" end_page="238" type="metho">
    <SectionTitle>
3 Finding high frequency bilingual word pairs
</SectionTitle>
    <Paragraph position="0"> word pairs When the sentence alignments for the corpus are unknown, standard techniques for extracting bilingual lexicons cannot apply. To make matters worse, the corpus might contain chunks of texts which appear in one language but not in its translation 1, suggesting a discontinuous mapping between some parallel texts.</Paragraph>
    <Paragraph position="1"> We have previously shown that using a vector representation of the frequency and positional information of a high frequency word was an effective way to match it to its translation (Fung &amp; McKeown 1994). Dynamic Time Warping, a pattern recognition technique, was proposed as a good way to match these 1This was found to be the case in the Japanese translation of the AWK manual (Church et al. 1993). The Japanese AWK was also found to contain different programming examples from the English version.</Paragraph>
    <Paragraph position="2"> vectors. In our new algorithm, we use a similar positional difference vector representation and DTW matching techniques. However, we improve on the matching efficiency by installing tagging and statistical filters. In addition, we not only obtain a score from the DTW matching between pairs of words, but we also reconstruct the DTW paths to get the points of the best paths as anchor points for use in later stages.</Paragraph>
    <Section position="1" start_page="236" end_page="236" type="sub_section">
      <SectionTitle>
3.1 Tagging to identify nouns
</SectionTitle>
      <Paragraph position="0"> Since the positional difference vector representation relies on the fact that words which are similar in meaning appear fairly consistently in a parallel text, this representation is best for nouns or proper nouns because these are the kind of words which have consistent translations over the entire text.</Paragraph>
      <Paragraph position="1"> As ultimately we will be interested in finding domain-specific terms, we can concentrate our effort on those words which are nouns or proper nouns first. For this purpose, we tagged the English part of the corpus by a modified POS tagger, and apply our algorithm to find the translations for words which are tagged as nouns, plural nouns or proper nouns only. This produced a more useful list of lexicon and again improved the speed of our program.</Paragraph>
    </Section>
    <Section position="2" start_page="236" end_page="236" type="sub_section">
      <SectionTitle>
3.2 Positional difference vectors
</SectionTitle>
      <Paragraph position="0"> According to our previous findings (Fung&amp; McKeown 1994), a word and its translated counterpart usually have some correspondence in their frequency and positions although this correspondence might not be linear. Given the position vector of a word p\[i\] where the values of this vector are the positions at which this word occurs in the corpus, one can compute a positional difference vector V\[i- 1\] where Vii- 1\] = p\[i\]- p\[i- 1\]. dim(V) is the dimension of the vector which corresponds to the occurrence count of the word.</Paragraph>
      <Paragraph position="1"> For example, if positional difference vectors for the word Governor and its translation in Chinese .~ are plotted against their positions in the text, they give characteristic signals such as shown in Figure 1. The two vectors have different dimensions because they occur with different frequencies. Note that the two signals are shifted and warped versions of each other with some minor noise.</Paragraph>
    </Section>
    <Section position="3" start_page="236" end_page="237" type="sub_section">
      <SectionTitle>
3.3 Matching positional difference vectors
</SectionTitle>
      <Paragraph position="0"> The positional vectors have different lengths which complicates the matching process. Dynamic Time Warping was found to be a good way to match word vectors of shifted or warped forms (Fung &amp; McKeown 1994). However, our previous algorithm only used the DTW score for finding the most correlated word pairs. Our new algorithm takes it one step further by backtracking to reconstruct the DTW paths and then automatically choosing the best points on these DTW paths as anchor points.</Paragraph>
      <Paragraph position="1">  For a given pair of vectors V1, V2, we attempt to discover which point in V1 corresponds to which point in V2 . If the two were not scaled, then position i in V1 would correspond to position j in V2 where j/i is a constant. If we plot V1 against V2, we can get a diagonal line with slope j/i. If they occurred the same number of times, then every position i in V1 would correspond to one and only one position j in V2. For non-identical vectors, DTW traces the correspondences between all points in V1 and V2 (with no penalty for deletions or insertions). Our DTW algorithm with path reconstruction is as follows:</Paragraph>
      <Paragraph position="3"> In our algorithm, we reconstruct the DTW path and obtain the points on the path for later use.</Paragraph>
      <Paragraph position="4"> The DTW path for Governor/~d~,~ is as shown in Figure 2.</Paragraph>
      <Paragraph position="5"> optimal path - (i, il,i2,... ,im-2,j) where in = ~n+l(in+l), n -- N- 1,N- 2,... ,1 with iN = j We thresholded the bilingual word pairs obtained from above stages in the algorithm and stored the more reliable pairs as our primary bilingual lexicon.</Paragraph>
    </Section>
    <Section position="4" start_page="237" end_page="238" type="sub_section">
      <SectionTitle>
3.4 Statistical filters
</SectionTitle>
      <Paragraph position="0"> If we have to exhaustively match all nouns and proper nouns against all Chinese words, the matching will be very expensive since it involves computing all possible paths between two vectors, and then backtracking to find the optimal path, and doing this for all English/Chinese word pairs in the texts. The complexity of DTW is @(NM) and the complexity of the matching is O(IJNM) where I is the number of nouns and proper nouns in the English text, J is the number of unique words in the Chinese text, N is the occurrence count of one English word and M the occurrence count of one Chinese word.</Paragraph>
      <Paragraph position="1"> We previously used some frequency difference constraints and starting point constraints (Fung &amp; McKeown 1994). Those constraints limited the  number of the pairs of vectors to be compared by DTW. For example, low frequency words are not considered since their positional difference vectors would not contain much information. We also apply these constraints in our experiments. However, there is still many pairs of words left to be compared. To improve the computation speed, we constrain the vector pairs further by looking at the Euclidean distance g of their means and standard deviations:</Paragraph>
      <Paragraph position="3"> If their Euclidean distance is higher than a certain threshold, we filter the pair out and do not use DTW matching on them. This process eliminated most word pairs. Note that this Euclidean distance function helps to filter out word pairs which are very different from each other, but it is not discriminative enough to pick out the best translation of a word.</Paragraph>
      <Paragraph position="4"> So for word pairs whose Euclidean distance is below the threshold, we still need to use DTW matching to find the best translation. However, this Euclidean distance filtering greatly improved the speed of this stage of bilingual lexicon compilation.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="238" end_page="240" type="metho">
    <SectionTitle>
4 Finding anchor points and eliminating noise
</SectionTitle>
    <Paragraph position="0"> eliminating noise Since the primary lexicon after thresholding is relatively small, we would like to compute a secondary lexicon including some words which were not found by DTW. At stage 5 of our algorithm, we try to find anchor points on the DTW paths which divide the texts into multiple aligned segments for compiling the secondary lexicon. We believe these anchor points are more reliable than those obtained by tracing all the words in the texts.</Paragraph>
    <Paragraph position="1"> For every word pair from this lexicon, we had obtained a DTW score and a DTW path. If we plot the points on the DTW paths of all word pairs from the lexicon, we get a graph as in the left hand side of Figure 3. Each point (i, j) on this graph is on the DTW path(vl, v2) where vl is from English words in the lexicon and v2 is from the Chinese words in the lexicon. The union effect of all these DTW paths shows a salient line approximating the diagonal. This line can be thought of the text alignment path. Its departure from the diagonal illustrates that the texts of this corpus are not identical nor linearly aligned.</Paragraph>
    <Paragraph position="2"> Since the lexicon we computed was not perfect, we get some noise in this graph. Previous alignment methods we used such as Church (1993); Fung &amp; Church (1994); Fung &amp; McKeown (1994) would bin the anchor points into continuous blocks for a rough alignment. This would have a smoothing effect. However, we later found that these blocks of anchor points are not precise enough for our Chinese/English corpus. We found that it is more advantageous to increase the overall reliability of anchor points by keeping the highly reliable points and discarding the rest.</Paragraph>
    <Paragraph position="3"> From all the points on the union of the DTW paths, we filter out the points by the following conditions: If the point (i, j) satisfies</Paragraph>
    <Paragraph position="5"> then the point (i, j) is noise and is discarded.</Paragraph>
    <Paragraph position="6"> After filtering, we get points such as shown in the right hand side of Figure 3. There are 388 highly reliable anchor points. They divide the texts into 388 segments. The total length of the texts is around 100000, so each segment has an average window size of 257 words which is considerably longer than a sentence length; thus this is a much rougher alignment than sentence alignment, but nonetheless we still get a bilingual lexicon out of it.</Paragraph>
    <Paragraph position="7">  The constants in the above conditions are chosen roughly in proportion to the corpus size so that the filtered picture looks close to a clean, diagonal line. This ensures that our development stage is still unsupervised. We would like to emphasize that if they were chosen by looking at the lexicon output as would be in a supervised training scenario, then one should evaluate the output on an independent test corpus.</Paragraph>
    <Paragraph position="8"> Note that if one chunk of noisy data appeared in text1 but not in text2, this part would be segmented between two anchor points (i, j) and (u, v). We know point i is matched to point j, and point u to point v, the texts between these two points are matched but we do not make any assumption about how this segment of texts are matched. In the extreme case where i -- u, we know that the text between j and v is noise. We have at this point a segment-aligned parallel corpus with noise elimination.</Paragraph>
    <Paragraph position="9"> 5 Finding low frequency bilingual word pairs Many nouns and proper nouns were not translated in the previous stages of our algorithm. They were not in the first lexicon because their frequencies were too low to be well represented by positional difference vectors.</Paragraph>
    <Section position="1" start_page="239" end_page="239" type="sub_section">
      <SectionTitle>
5.1 Non-linear segment binary vectors
</SectionTitle>
      <Paragraph position="0"> In stage 6, we represent the positional and frequency information of low frequency words by a binary vector for fast matching.</Paragraph>
      <Paragraph position="1"> The 388 anchor points (95,10), (139,131),..., (98809, 93251) divide the two texts into 388 non-linear segments. Textl is segmented by the points (95,139,..., 98586, 98809) and text2 is segmented by the points (10,131,..., 90957, 93251).</Paragraph>
      <Paragraph position="2"> For the nouns we are interested in finding the translations for, we again look at the position vectors. For example, the word prosperity occurred seven times in the English text. Its position vector is (2178, 5322,... ,86521,95341) . We convert this position vector into a binary vector V1 of 388 dimensions where VI\[i\] = 1 if prosperity occured within the ith segment, VI\[i\] -0 otherwise. For prosperity, VI\[i\] -- 1 where i = 20, 27, 41, 47,193,321,360. The Chinese translation for prosperity is ~!. Its position vector is (1955,5050,... ,88048). Its binary vector is V2\[i\] = 1 where i = 14, 29, 41, 47,193,275,321,360.</Paragraph>
      <Paragraph position="3"> We can see that these two vectors share five segments in common.</Paragraph>
      <Paragraph position="4"> We compute the segment vector for all English nouns and proper nouns not found in the first lexicon and whose frequency is above two. Words occurring only once are extremely hard to translate although our algorithm was able to find some pairs which occurred only once.</Paragraph>
    </Section>
    <Section position="2" start_page="239" end_page="240" type="sub_section">
      <SectionTitle>
5.2 Binary vector correlation measure
</SectionTitle>
      <Paragraph position="0"> To match these binary vectors V1 with their counterparts in Chinese V2, we use a mutual information score m.</Paragraph>
      <Paragraph position="2"> If prosperity and ~ occurred in the same eight segments, their mutual information score would be 5.6. If they never occur in the same segments, their m would be negative infinity. Here, for prosperity/~ ~, m = 5.077 which shows that these two words are indeed highly correlated.</Paragraph>
      <Paragraph position="3"> The t-score was used as a confidence measure. We keep pairs of words if their t &gt; 1.65 where</Paragraph>
      <Paragraph position="5"> For prosperity/~.~\]~, t = 2.33 which shows that their correlation is reliable.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML