<?xml version="1.0" standalone="yes"?>
<Paper uid="J99-1003">
  <Title>Bitext Maps and Alignment via Pattern Recognition</Title>
  <Section position="3" start_page="0" end_page="108" type="metho">
    <SectionTitle>
2. Bitext Geometry
</SectionTitle>
    <Paragraph position="0"> Each bitext defines a rectangular bitext space, as illustrated in Figure 1. The lower left corner of the rectangle is the origin of the bitext space and represents the two texts' beginnings. The upper right corner is the terminus and represents the texts' ends. The line between the origin and the terminus is the main diagonal. The slope of the main diagonal is the bitext slope.</Paragraph>
    <Paragraph position="1"> Each bitext space is spanned by a pair of axes. The lengths of the axes are the lengths of the two component texts. The axes of a bitext space are measured in characters, because text lengths measured in characters correlate better than text lengths measured in tokens (Gale and Church 1991a). This correlation is important for geometric bitext mapping heuristics, such as those described in Section 4.4. Although the axes are measured in characters, I will argue that word tokens are the optimum level of analysis for bitext mapping. By convention, each token is assigned the position of its median character.</Paragraph>
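    <Paragraph>The axis conventions above can be sketched as follows. This is a minimal illustration under assumptions that are not in the original text: tokens are taken to be whitespace-delimited, character offsets are zero-based, and the helper names token_positions and bitext_slope are hypothetical. Under this reading of "median character", a token with an even number of characters receives a half-integer position.

```python
import re


def token_positions(text: str) -> list[tuple[str, float]]:
    """Return (token, position) pairs, placing each token at its median character.

    Assumption: tokens are maximal runs of non-whitespace characters and
    offsets are zero-based character indices into the text.
    """
    positions = []
    for match in re.finditer(r"\S+", text):
        first = match.start()      # offset of the token's first character
        last = match.end() - 1     # offset of the token's last character
        positions.append((match.group(), (first + last) / 2))
    return positions


def bitext_slope(x_text: str, y_text: str) -> float:
    """Slope of the main diagonal: y-axis length over x-axis length, in characters."""
    return len(y_text) / len(x_text)
```

For example, the two tokens of "the cat" would sit at positions 1.0 and 5.0 on their axis, and a bitext whose y-axis text is half as long as its x-axis text would have a bitext slope of 0.5.</Paragraph>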
    <Paragraph position="2"> Each bitext space contains a number of true points of correspondence (TPCs), other than the origin and the terminus. TPCs exist both at the coordinates of matching text units and at the coordinates of matching text unit boundaries. If a token at position p on the x-axis and a token at position q on the y-axis are translations of each other, then the coordinate (p, q) in the bitext space is a TPC. If a sentence on the x-axis ends at character r and the corresponding sentence on the y-axis ends at character s, then the coordinate (r + .5, s + .5) is a TPC. The .5 is added because it is the inter-sentence boundaries that correspond, rather than the last characters of the sentences. Similarly, TPCs arise from corresponding boundaries between paragraphs, chapters, list items, etc. Groups of TPCs with a roughly linear arrangement in the bitext space are called chains.</Paragraph>
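    <Paragraph>The two kinds of TPC described above can be written down directly. The sketch below is illustrative only: the function names token_tpc and boundary_tpcs are hypothetical, and the inputs (token positions and sentence-final character offsets that are already known to correspond) are assumed to be given rather than discovered.

```python
def token_tpc(p: float, q: float) -> tuple[float, float]:
    """TPC for two tokens that are translations of each other.

    p is the token's position on the x-axis, q its translation's
    position on the y-axis.
    """
    return (p, q)


def boundary_tpcs(x_ends: list[int], y_ends: list[int]) -> list[tuple[float, float]]:
    """TPCs for corresponding sentence boundaries.

    x_ends[i] and y_ends[i] are the offsets of the last characters of the
    i-th pair of corresponding sentences. The boundary lies just after the
    last character, hence the added 0.5 on each axis.
    """
    if len(x_ends) != len(y_ends):
        raise ValueError("each x-axis sentence needs a corresponding y-axis sentence")
    return [(r + 0.5, s + 0.5) for r, s in zip(x_ends, y_ends)]
```

For instance, if corresponding sentences end at characters 41 and 97 on the x-axis and at characters 38 and 90 on the y-axis, the boundary TPCs would be (41.5, 38.5) and (97.5, 90.5).</Paragraph>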
    <Paragraph position="3"> Bitext maps are injective (1-to-1) partial functions in bitext spaces. A complete set of TPCs for a particular bitext is the true bitext map (TBM). The purpose of a bitext mapping algorithm is to produce bitext maps that are the best possible approximations of each bitext's TBM.</Paragraph>
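    <Paragraph>One way to read the injectivity requirement is that no x-coordinate and no y-coordinate may appear in more than one point of a map. The check below is a minimal sketch of that reading; the name is_injective_map and the representation of a map as a collection of (x, y) pairs are assumptions made for illustration.

```python
def is_injective_map(points) -> bool:
    """Return True if the points form a 1-to-1 partial function.

    Each x-coordinate must map to at most one y-coordinate (function),
    and each y-coordinate must be reached from at most one x-coordinate
    (injective). Exact duplicate points are collapsed first.
    """
    unique = set(points)
    xs = {x for x, _ in unique}
    ys = {y for _, y in unique}
    return len(xs) == len(unique) and len(ys) == len(unique)
```

A candidate map such as [(1.0, 2.0), (5.0, 7.0)] passes this check, whereas [(1.0, 2.0), (1.0, 7.0)] does not, because position 1.0 on the x-axis would correspond to two positions on the y-axis.</Paragraph>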
  </Section>
</Paper>