<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0201">
  <Title>A Geometric Approach to Mapping Bitext Correspondence</Title>
  <Section position="5" start_page="0" end_page="109000" type="metho">
    <SectionTitle>
3. SIMR
</SectionTitle>
    <Paragraph position="0"> Most of SIMR's effort is spent searching for TPCs, one short chain at a time. The search for each chain begins in a small rectangular region of the bitext space, whose dimensions are proportional to those of the whole bitext space. Within this search rectangle, the search alternates between a generation phase and a recognition phase, which are described in more detail in Sections 3.1 and 3.2.</Paragraph>
    <Paragraph position="1"> Footnote 1: Since distances in the bitext space are measured in characters, the position of a token is defined to be the mean position of its characters.</Paragraph>
    <Paragraph position="2"> In the generation phase, SIMR generates all the points of correspondence that satisfy the supplied matching predicate (explained below). In the recognition phase, SIMR calls the chain recognition heuristic to search for suitable chains among the generated points. If no suitable chains are found, the search rectangle is proportionally expanded up and to the right and the generation-recognition cycle is repeated. The rectangle keeps expanding until at least one acceptable chain is found. If more than one chain is found, SIMR accepts the chain whose points are least dispersed around its least-squares line. Then, SIMR selects another region of the bitext space to search for the next chain.</Paragraph>
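    <Paragraph> The generation-recognition cycle can be sketched in Python as follows. This is only an illustrative sketch, not SIMR's implementation: the function name find_next_chain, the callables generate, recognize and dispersal, the growth factor, and the stopping rule for a rectangle that already covers the remaining bitext space are all assumptions introduced here.

```python
def find_next_chain(anchor, size, bitext_size, generate, recognize, dispersal, growth=1.5):
    """Expand a search rectangle from `anchor` until a chain is recognized.

    anchor      -- (x, y) lower-left corner of the search rectangle
    size        -- (width, height) of the initial rectangle
    bitext_size -- (max_x, max_y) extent of the bitext space
    generate    -- hypothetical callable: rectangle -> candidate points
    recognize   -- hypothetical callable: points -> list of acceptable chains
    dispersal   -- hypothetical callable: chain -> dispersal around its least-squares line
    """
    ax, ay = anchor
    width, height = size
    max_x, max_y = bitext_size
    while True:
        right = min(ax + width, max_x)
        top = min(ay + height, max_y)
        # Generation phase: all points inside the current rectangle.
        points = generate((ax, ay, right, top))
        # Recognition phase: look for acceptable chains among them.
        chains = recognize(points)
        if chains:
            # Several chains found: keep the least dispersed one.
            return min(chains, key=dispersal)
        if right == max_x and top == max_y:
            # Assumed stopping rule: nothing left to expand into.
            return None
        # Proportional expansion up and to the right.
        width *= growth
        height *= growth
```

Multiplying the width and the height by the same factor keeps the rectangle's proportions fixed, which is one simple way to realize the proportional expansion described above.</Paragraph>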
    <Paragraph position="3"> SIMR employs a simple heuristic to select regions of the bitext space to search. To a first approximation, TBMs are monotonically increasing functions. This means that if SIMR accepts a chain, it should look for others either above and to the right or below and to the left of the one it has just located. All SIMR needs is a place to start the trace, and a good place to start is at the beginning. The origin of the bitext space is always a TPC. So, the first search rectangle is anchored at the origin. Subsequent search rectangles are anchored at the top right corner of the previously found chain, as shown in Figure 2.</Paragraph>
    <Paragraph position="4"> Figure 2: The expanding-rectangle search strategy. The search rectangle is anchored at the top right corner of the previously found chain. Its diagonal remains parallel to the main diagonal.</Paragraph>
    <Paragraph position="5"> The expanding-rectangle search strategy makes SIMR robust in the face of TBM discontinuities. Figure 2 shows a segment of the TBM trace that contains a vertical gap (an omission in the text on the x-axis). As the search rectangle grows, it will eventually pick up the TBM's trail, even if the discontinuity is quite large (Melamed 1996).</Paragraph>
    <Paragraph position="6"> Section 3.8 explains why SIMR will not be led astray by false points of correspondence.</Paragraph>
    <Section position="1" start_page="2" end_page="109000" type="sub_section">
      <SectionTitle>
3.1 Point Generation
</SectionTitle>
      <Paragraph position="0"> A matching predicate is a heuristic for guessing whether a given point in the bitext space is a TPC. I have considered only token-based matching predicates, which can only return TRUE for a point (x, y) if x is the position of a token e on the x-axis and y is the position of a token f on the y-axis. For each such point, the matching predicate must decide whether e and f are likely to be mutual translations.</Paragraph>
      <Paragraph position="1"> Various knowledge sources can be brought to bear on the decision. The most universal knowledge source is a translation lexicon. Translation lexicons can be extracted from machine-readable bilingual dictionaries (MRBDs), in the rare cases where MRBDs are available. In other cases, they can be induced automatically using any of several existing methods (Dagan et al. 1993, Fung &amp; Church 1991, Melamed 1995). Since the matching predicate does not require perfect accuracy, the induced lexicons need not be perfect.</Paragraph>
      <Paragraph position="2"> When a large translation lexicon is not available, a small hand-constructed translation lexicon for the key terms in a given bitext may suffice to produce a rough map for that bitext.</Paragraph>
      <Paragraph position="3"> If the languages involved have similar alphabets, then it may be possible to construct a matching predicate with very little effort, using the method of cognates. Cognates are words with a common etymology and a similar meaning in different languages. The etymological similarity is often reflected in the words' orthography and/or pronunciation. Languages that are closely related will often share a large number of cognates. For example, in the non-technical Canadian Hansards (parliamentary debate transcripts available in English and French), cognates can be found for roughly one quarter of all text tokens (Melamed 1995). A cognate-based matching predicate will generate more points for more similar language pairs, and for text genres where more word borrowing occurs, such as technical texts.</Paragraph>
      <Paragraph position="4"> For English and French, such a matching predicate can generate enough points in the bitext space to obviate the need for a translation lexicon.</Paragraph>
      <Paragraph position="5"> Phonetic cognates can be used to map between language pairs with dissimilar alphabets, even when the languages are not closely related.</Paragraph>
      <Paragraph position="6"> When language L1 borrows a word from language L2, the word is usually written in L1 similarly to the way it sounds in L2. Thus, French and Russian /pɔrtmone/ are cognates, as are English /sɪstəm/ and Japanese /ʃisutemu/.</Paragraph>
      <Paragraph position="8"> Figure 3 (caption truncated): ... space, the true points of correspondence trace the true bitext map parallel to the main diagonal.</Paragraph>
      <Paragraph position="9"> For many languages, it is not difficult to construct an approximate mapping from the orthography to its underlying phonological form. Given such a mapping for L1 and L2, it is possible to identify cognates despite incomparable orthographies.</Paragraph>
      <Paragraph position="10"> SIMR was tested on French and English with two different matching predicates. The first matching predicate relies on orthographic cognates and a stop-list of closed-class words for both languages. SIMR judges the cognateness of each token pair by their Longest Common Subsequence Ratio (LCSR). The LCSR of a token pair is the number of characters that appear in the same order in both tokens divided by the length of the longer token (Melamed 1995). The common characters need not be contiguous. The matching predicate considers a token pair cognates if their LCSR exceeds a certain threshold. The LCSR threshold was optimized together with SIMR's other parameters, as described in Section 3.7. The stop-list of closed-class words made the matching predicate more accurate, because closed-class words are unlikely to have cognates. On the contrary, they often produce spurious matches. Examples for French and English include a, an, on and par.</Paragraph>
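      <Paragraph> The LCSR-based cognate test can be illustrated with a short Python sketch. The function names, the default threshold of 0.58 and the stop_words argument are assumptions made for illustration; the actual threshold was tuned together with SIMR's other parameters and is not stated in this section.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    The common characters must appear in the same order in both
    strings but need not be contiguous.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(e: str, f: str) -> float:
    """Longest Common Subsequence Ratio: LCS length over the longer token's length."""
    longer = max(len(e), len(f))
    return lcs_length(e, f) / longer if longer else 0.0


def is_cognate(e: str, f: str, threshold: float = 0.58, stop_words=frozenset()) -> bool:
    """Cognate test: LCSR above a threshold, with closed-class words excluded.

    The threshold default is a placeholder, not the tuned value.
    """
    if e in stop_words or f in stop_words:
        return False
    return lcsr(e, f) > threshold
```

For example, lcsr("government", "gouvernement") is 10/12, because all ten characters of the shorter token appear in order in the longer one.</Paragraph>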
      <Paragraph position="11"> The second matching predicate was just like the first, except that it also evaluated to TRUE whenever the input token pair appeared as an entry in a translation lexicon. The translation lexicon was automatically extracted from an MRBD (Cousin et al. 1991).</Paragraph>
    </Section>
    <Section position="2" start_page="109000" end_page="109000" type="sub_section">
      <SectionTitle>
3.2 Point Selection
</SectionTitle>
      <Paragraph position="0"> As illustrated in Figure 3, even short sequences of TPCs form characteristic patterns. In particular, TPCs have the following properties:
        * Linearity: TPCs tend to line up straight. Sets of points with a roughly linear arrangement are called chains.
        * Constant Slope: The slope of a TPC chain is rarely much different from the bitext slope.
        * Injectivity: No two points in a chain of TPCs can have the same x- or y-co-ordinates.</Paragraph>
      <Paragraph position="1"> SIMR exploits these properties to decide which chains in the scatterplot might be TPC chains. The chain recognition heuristic involves two threshold parameters: maximum point dispersal and maximum angle deviation. Each threshold is used to filter candidate chains. First, the linearity of each chain is judged by measuring the root mean squared distance of the chain's points from the chain's least-squares line. If this distance exceeds the maximum point dispersal threshold, the chain is rejected. Second, the angle of each chain's least-squares line is compared to the arctangent of the bitext slope. If the difference exceeds the maximum angle deviation threshold, the chain is rejected. Lastly, chains that lack the injectivity property are rejected.</Paragraph>
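      <Paragraph> The three filters can be sketched in Python as below. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, the distance from the least-squares line is taken to be the perpendicular distance (the text does not say which distance is meant), angles are in radians, and the injectivity test is run first only so that the line fit never divides by zero.

```python
import math


def is_acceptable_chain(points, bitext_slope, max_dispersal, max_angle_dev):
    """Apply the injectivity, linearity and slope filters to a candidate chain."""
    n = len(points)
    if 2 > n:
        return False  # a line cannot be fitted to fewer than two points
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    # Injectivity: no two points may share an x- or y-co-ordinate.
    if len(set(xs)) != n or len(set(ys)) != n:
        return False

    # Least-squares line y = slope * x + intercept.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Linearity: RMS perpendicular distance of the points from the line.
    norm = math.hypot(slope, 1.0)
    squared = sum(((slope * x - y + intercept) / norm) ** 2 for x, y in points)
    if math.sqrt(squared / n) > max_dispersal:
        return False

    # Constant slope: angle of the line versus arctangent of the bitext slope.
    angle_dev = abs(math.atan(slope) - math.atan(bitext_slope))
    return max_angle_dev >= angle_dev
```

A perfectly straight chain along the main diagonal passes all three tests, while a chain with two points in the same column is rejected before any line is fitted.</Paragraph>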
    </Section>
    <Section position="3" start_page="109000" end_page="109000" type="sub_section">
      <SectionTitle>
3.3 Reducing the Search Space
</SectionTitle>
      <Paragraph position="0"> In a region of the scatterplot containing n points, there are $2^n$ possible chains -- too many to search by brute force. The properties of TPCs listed above provide two ways to constrain the search.</Paragraph>
      <Paragraph position="1"> The Linearity property leads to a constraint on the chain size. Chains of only a few points are unreliable, because they often line up straight by coincidence. Chains that are too big will span too long a segment of the TBM to be well approximated by a line. SIMR chooses a fixed chain size k, $6 \le k \le 9$. Fixing the chain size at k reduces the number of candidate chains to $\binom{n}{k}$. For typical values of n and k, $\binom{n}{k}$ can still reach into the millions. The Constant Slope property suggests another constraint: SIMR should consider only chains that are roughly parallel to the main diagonal. Two lines are parallel if the perpendicular displacement between them is constant. So, if we want to find chains that are roughly parallel to the main diagonal, we should look for chains whose points all have roughly the same displacement from the main diagonal.</Paragraph>
      <Paragraph position="2"> Points with similar displacement can be grouped together by sorting, as illustrated in Figure 4.</Paragraph>
      <Paragraph position="3"> Figure 4 (caption truncated): ... numbered according to their displacement from the main diagonal. The chain most parallel to the main diagonal is always one of the contiguous subsequences of this ordering. For a fixed chain size of 6, there are 13 - 6 + 1 = 8 contiguous subsequences in this region of 13 points. Of these 8, subsequence 5 is the best chain.</Paragraph>
      <Paragraph position="4"> Then, chains that are most parallel to the main diagonal will be contiguous subsequences of the sorted point sequence. In a region of the scatterplot containing n points, there will be only $n - k + 1$ such subsequences of length k. Sorting the points by their displacement is the most computationally expensive step in the recognition process.</Paragraph>
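      <Paragraph> The sorting step can be sketched in a few lines of Python. The name candidate_chains is hypothetical, and displacement is assumed here to mean the signed perpendicular distance from the line through the origin whose slope is the bitext slope; the text above does not spell out the exact definition.

```python
import math


def candidate_chains(points, bitext_slope, k):
    """Return the contiguous length-k windows of the displacement-sorted points."""
    norm = math.hypot(bitext_slope, 1.0)

    def displacement(point):
        # Assumed definition: signed perpendicular distance from the main diagonal.
        x, y = point
        return (y - bitext_slope * x) / norm

    ordered = sorted(points, key=displacement)
    # n - k + 1 windows; the list is empty when there are fewer than k points.
    return [ordered[i:i + k] for i in range(len(ordered) - k + 1)]
```

For 13 points and k = 6 this yields 8 candidate chains, compared with the 1716 subsets of size 6 that an exhaustive search would have to examine.</Paragraph>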
      <Paragraph position="5"> SIMR's chain recognition heuristic accepts non-monotonic chains. This is a desirable property, because even languages with similar syntax, like French and English, have well-known differences in word order. For example, English (adjective, noun) pairs usually correspond to French (noun, adjective) pairs. Such inversions result in chains that contain a pattern like points 5 and 9 in Figure 4. SIMR has no problem accepting the inverted points, unlike bitext mapping algorithms that try to minimize the distance between TPCs.</Paragraph>
      <Paragraph position="6"> To my knowledge, no other bitext mapping algorithm allows non-monotonic map segments.</Paragraph>
      <Paragraph position="7"> You may wonder how SIMR will fare with languages that are less closely related, which have even more word order variation. This is an open question, but there is reason to be optimistic. To accommodate language pairs with vastly different word order, it may suffice for SIMR to increase the maximum point dispersal threshold, relaxing the linearity constraint on TPC chains.</Paragraph>
      <Paragraph position="8"> Figure 5 (caption truncated): ... correspondence that line up in rows and columns.</Paragraph>
    </Section>
    <Section position="4" start_page="109000" end_page="109000" type="sub_section">
      <SectionTitle>
3.4 Reducing Noise
</SectionTitle>
      <Paragraph position="0"> The Injectivity property also leads to a heuristic which reduces the number of candidate chains, although the chief aim of this heuristic is to increase the signal-to-noise ratio in the scatterplot.</Paragraph>
      <Paragraph position="1"> The heuristic was introduced after inspection of several scatterplots in bitext spaces revealed a recurring noise pattern. This noise pattern is illustrated in Figure 5. It consists of correspondence points that line up in rows or columns associated with frequent token types. Token types like the English article "a" can produce one or more correspondence points for almost every sentence in the opposite text. Since only one of these correspondence points can be correct, all but one of the points in each row and column are noise. It's difficult to measure exactly how much noise is generated by frequent tokens, and of course the proportion is different for every bitext. Visual inspection of some scatterplots indicated that frequent tokens are often responsible for the lion's share of the noise. Reducing this source of noise makes it much easier for SIMR to stay on track.</Paragraph>
      <Paragraph position="2"> Other bitext mapping algorithms mitigate this source of noise either by assigning lower weights to correspondence points associated with frequent token types (Church 1993) or by simply deleting frequent token types from the bitext (Dagan et al. 1993). However, a frequent token type can be rare in some parts of the text. In those parts, the token type can provide valuable clues to correspondence. On the other hand, many tokens of a relatively rare type can be concentrated in a short segment of the text, resulting in many false correspondence points. The varying concentration of identical tokens suggests that more localized noise filters would be more effective. SIMR's localized search strategy provides the perfect vehicle for a localized noise filter.</Paragraph>
      <Paragraph position="3"> Figure 6 (caption truncated): ... sentence A were switched during translation, resulting in a non-monotonic segment. To interpolate injective bitext maps, non-monotonic segments must be encapsulated in Minimum Enclosing Rectangles (MERs). A unique bitext map can then be interpolated by using the lower left and upper right corners of the MER (map M2), instead of using the non-monotonic correspondence points (function M1).</Paragraph>
      <Paragraph position="4"> The filter is based on another threshold parameter, the maximum point ambiguity level (MaxPAL). For each point p = (x, y), let X be the number of points in column x within the search rectangle, and let Y be the number of points in row y within the search rectangle. Then, ambiguity level of p = X + Y - 2.</Paragraph>
      <Paragraph position="5"> Thus, if p is the only point in its row and column, its ambiguity level is zero. SIMR ignores points whose ambiguity level exceeds the MaxPAL threshold. What makes this a localized filter is that only points within the search rectangle count towards each other's ambiguity level. This means that the ambiguity level of a given point can increase as the search rectangle expands; the set of points that SIMR ignores can change dynamically.</Paragraph>
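      <Paragraph> The ambiguity filter can be written directly from the definition above. In this Python sketch the name filter_ambiguous is hypothetical, and points is assumed to hold only the distinct points that lie inside the current search rectangle.

```python
from collections import Counter


def filter_ambiguous(points, max_pal):
    """Keep the points whose ambiguity level X + Y - 2 does not exceed max_pal."""
    col_counts = Counter(x for x, _ in points)  # X: points sharing column x
    row_counts = Counter(y for _, y in points)  # Y: points sharing row y
    return [
        (x, y)
        for x, y in points
        if max_pal >= col_counts[x] + row_counts[y] - 2
    ]
```

Because the counts are taken over the current rectangle only, the filter has to be re-run every time the rectangle expands, so a point kept in one cycle may be ignored in the next.</Paragraph>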
    </Section>
  </Section>
  <Section position="6" start_page="109000" end_page="109000" type="metho">
    <SectionTitle>
3.5 Interpolation
</SectionTitle>
    <Paragraph position="0"> A bitext map can be derived from a set of correspondence points by linear interpolation. The only complication is that linear interpolation is not well-defined for non-monotonic sets of points.</Paragraph>
    <Paragraph position="1"> It would be incorrect to simply connect the dots left to right, because the resulting function may not be one-to-one. To interpolate injective bitext maps, non-monotonic segments must be encapsulated in Minimum Enclosing Rectangles (MERs), as shown in Figure 6. A unique bitext map can be interpolated by using the lower left and upper right corners of the MER, instead of using the non-monotonic correspondence points.</Paragraph>
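    <Paragraph> One way to realize MER-based interpolation is sketched below in Python. The function names and the rule used to delimit non-monotonic segments (cut the x-sorted points wherever every earlier point lies below every later point) are assumptions made for illustration, and the points are assumed to have pairwise distinct x- and y-co-ordinates.

```python
def monotonic_breakpoints(points):
    """Replace each non-monotonic run of points by the corners of its MER."""
    pts = sorted(points)
    n = len(pts)
    # suffix_min_y[i] is the smallest y among pts[i:].
    suffix_min_y = [float("inf")] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix_min_y[i] = min(pts[i][1], suffix_min_y[i + 1])

    breakpoints = []
    start = 0
    prefix_max_y = float("-inf")
    for i, (_, y) in enumerate(pts):
        prefix_max_y = max(prefix_max_y, y)
        if suffix_min_y[i + 1] > prefix_max_y:
            # Every later point lies above every point so far: cut here.
            segment = pts[start:i + 1]
            if len(segment) == 1:
                breakpoints.append(segment[0])
            else:
                seg_xs = [p[0] for p in segment]
                seg_ys = [p[1] for p in segment]
                breakpoints.append((min(seg_xs), min(seg_ys)))  # lower left corner
                breakpoints.append((max(seg_xs), max(seg_ys)))  # upper right corner
            start = i + 1
    return breakpoints


def interpolate(breakpoints, x):
    """Linearly interpolate the map at x between consecutive breakpoints."""
    for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]):
        if x1 >= x >= x0:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the mapped range")
```

For the points (0, 0), (10, 10), (20, 30), (30, 20), (40, 40), the inverted pair is replaced by the MER corners (20, 20) and (30, 30), so the interpolated map is one-to-one.</Paragraph>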
    <Section position="1" start_page="109000" end_page="109000" type="sub_section">
      <SectionTitle>
3.6 Enhancements
</SectionTitle>
      <Paragraph position="0"> There are many possible enhancements to the algorithm outlined above. The following subsections describe but two of the more interesting extensions in the current implementation.</Paragraph>
      <Paragraph position="1"> Large Non-monotonic Segments. SIMR has no problem with small non-monotonic segments inside chains. However, the expanding rectangle search strategy can miss larger non-monotonic segments, which cannot fit inside one chain. If a more precise map is desired, these larger non-monotonic segments can be easily recovered during a second sweep through the bitext space.</Paragraph>
      <Paragraph position="3"> Figure 7 (caption truncated): ... during translation. Any non-monotonic segment of the TBM will occupy the intersection of a vertical gap and a horizontal gap in the monotonic first-pass map.</Paragraph>
      <Paragraph position="4"> Non-monotonic TBM segments result in a characteristic map pattern, as a consequence of the injectivity of bitext maps. In Figure 7, the vertical range of segment j corresponds to a vertical gap in SIMR's first-pass map. The horizontal range of segment j corresponds to a horizontal gap in SIMR's first-pass map. Similarly, any non-monotonic segment of the TBM will occupy the intersection of a vertical gap and a horizontal gap in the monotonic first-pass map. Furthermore, switched segments are almost always adjacent and relatively short. Therefore, to recover non-monotonic segments of the TBM, SIMR needs only to search gap intersections that are close to the first-pass map. There are usually very few such intersections that are also large enough to accommodate new chains, so the second-pass search requires only a small fraction of the computational effort of the first pass.</Paragraph>
      <Paragraph position="5"> Local Slope Variation. To ensure that SIMR rejects spurious chains, the maximum angle deviation threshold must be set low. However, like any heuristic filter, this one will reject some perfectly valid candidates. The injectivity of bitext maps enables a method for recovering some of the rejected valid chains. Valid chains that are rejected by the angle deviation filter sometimes occur between two accepted chains, as shown in Figure 8.</Paragraph>
      <Paragraph position="6"> Figure 8 (caption truncated): ... it has a highly deviant slope. Such chains can be recovered by re-searching regions between accepted chains. The slope of the local main diagonal can be quite different from the slope of the global main diagonal.</Paragraph>
      <Paragraph position="7"> In Figure 8, the slope of the TBM between the end of Chain C and the start of Chain D must be much closer to the slope of Chain X than to the slope of the main diagonal. Chain X should be accepted. When SIMR makes its second-pass search for non-monotonic segments, it also searches for sandwiched chains in any space between two accepted chains that is large enough to accommodate another chain. This subspace of the bitext space will have its own main diagonal. The slope of this local main diagonal can be quite different from the slope of the global main diagonal.</Paragraph>
      <Paragraph position="8"> Another source of local slope variation is "non-linguistic" text, such as white space or tables of numbers. Usually, such text is copied "as is" during translation, resulting in regions of bitext space where the slope of the TBM is exactly 1.</Paragraph>
      <Paragraph position="9"> The problem is that these regions can be large enough to severely skew the slope of the main diagonal. Thus, they can fool SIMR into searching the whole bitext space for TPC chains whose slope is close to 1, even though most of the bitext map between "linguistic" parts of the bitext has a very different slope. Sometimes, the translation of non-linguistic text is completely erratic, especially where white space is concerned. Not surprisingly, SIMR cannot perform well on such text.</Paragraph>
      <Paragraph position="10"> It should not be difficult to recognize bitext sections that consist of "non-linguistic" text. Then, SIMR will be better able to follow the variations in the slope of the TBM. This extension to SIMR is next in line.</Paragraph>
    </Section>
    <Section position="2" start_page="109000" end_page="109000" type="sub_section">
      <SectionTitle>
3.7 Evaluation
</SectionTitle>
      <Paragraph position="0"> The standard method of evaluating bitext mapping algorithms is to compare their output to a hand-constructed reference set of TPCs. Michel Simard of CITI graciously provided me with several such reference sets for French-English bitexts, including the same "easy" and "hard" Hansard bitexts that have been used to evaluate other bitext mapping and alignment algorithms in the literature (Church 1993, Simard et al. 1992, Dagan et al. 1993). A non-Hansard reference set was used for SIMR's development. All of SIMR's parameters, namely the thresholds for maximum point dispersal, maximum angle deviation, maximum point ambiguity, and the LCSR used in the matching predicate, as well as the fixed chain size, were simultaneously optimized on this data set using simulated annealing (Vidal 1993). Different parameter settings considered by the optimization process resulted in different bitext maps for the development bitext. Each set of parameter values was scored according to the root mean squared error between the resulting bitext map and the reference set of TPCs. The best-scoring set of parameter values was used to evaluate SIMR.</Paragraph>
      <Paragraph position="1"> SIMR was evaluated on the "easy" and "hard" Hansard bitexts. Note that these bitexts are so named because one was easier than the other for the alignment algorithm that was first evaluated on them. There is no a priori reason to believe that one or the other will be easier for SIMR. Table 1 compares SIMR's error distribution on these bitexts with that of the previous front-runner, char_align, as reported by Church (1993). SIMR's RMS error is lower by more than a factor of 4. SIMR is also much more robust: it rarely errs by more than half the length of an average sentence. Such robustness has enabled at least one new commercial-quality application: automatic detection of omissions in translations (Melamed 1996). This task was impossible until now, because it cannot tolerate even a few wild errors, such as those produced by an independent implementation of char_align (Simard 1995).</Paragraph>
      <Paragraph position="2"> Note that the error between a bitext map and each reference point can be defined as the horizontal distance, the vertical distance, or the distance perpendicular to the main diagonal. The latter distance will always be shortest, on average. Church (1993) did not specify which metric he used. Of the three possibilities, Table 1 conservatively reports the highest error estimates for SIMR. The lowest estimates for SIMR without the translation lexicon are an RMS error of 6.1 for the "easy" bitext and 5.4 for the "hard" bitext. With the translation lexicon, the lowest error estimates drop to 6.0 for the "easy" bitext and 4.6 for the "hard" bitext.</Paragraph>
    </Section>
    <Section position="3" start_page="109000" end_page="109000" type="sub_section">
      <SectionTitle>
3.8 Discussion
</SectionTitle>
      <Paragraph position="0"> One concern about greedy algorithms is that if they wander off track, they may not be able to find their way back. There is no guarantee that this will never happen with SIMR. However, there is evidence that it is extremely unlikely. First, SIMR can wander off the right track only if there is an alternative (wrong) track. The noise reduction heuristics mentioned in Section 3.4 ensure that very few points of correspondence can be generated away from the TBM trace. Those points that are generated are extremely unlikely to be sufficiently linear and to have the proper slope to fool the chain recognition heuristic. The fixed chain size parameter also plays a role. The longer the chain, the less probable it is that a set of false points of correspondence will take on a valid-looking arrangement.</Paragraph>
      <Paragraph position="1"> The development bitext used in the simulated annealing parameter optimization contained over 40000 words. During the optimization, SIMR occasionally veered off course when the fixed chain size was 5 or less. It rarely got lost with a fixed chain size of 6 and never with a fixed chain size of 7 or more. The optimal fixed chain size with respect to the RMS error metric was 9 when the translation lexicon was used, and 8 when it was not. The chances of 8 or 9 false points of correspondence satisfying the maximum point dispersal, maximum angle deviation, and maximum point ambiguity level thresholds are negligible.</Paragraph>
      <Paragraph position="2"> Finally, if SIMR does get lost, the resulting bitext map will contain telltale discontinuities. Such discontinuities can be automatically detected with high reliability (Melamed 1996). With this sanity check in place, manual verification should never be necessary.</Paragraph>
    </Section>
  </Section>
</Paper>