<?xml version="1.0" standalone="yes"?>
<Paper uid="P94-1051">
  <Title>AUTOMATIC ALIGNMENT IN PARALLEL CORPORA</Title>
  <Section position="4" start_page="334" end_page="335" type="metho">
    <SectionTitle>
THE ALIGNMENT ALGORITHM
</SectionTitle>
    <Paragraph position="0"> Content words, unlike functional ones, might be interpreted as the bearers that convey information by denoting the entities and their relationships in the world. The notion of spreading the semantic load supports the idea that every content word should be represented as the union of all the parts of speech we can assign to it \[Basili 92\]. The postulated assumption is that a connection between two units of text is established if, and only if, the semantic load in one unit approximates the semantic load of the other.</Paragraph>
    <Paragraph position="1"> Based on the fact that the principal requirement in any translation exercise is meaning preservation across the languages of the translation pair, we define the semantic load of a sentence as the patterns of tags of its content words. Content words are taken to be verbs, nouns, adjectives and adverbs. The complexity of transfer in translation imposes the consideration of the number of content tags which appear in a tag pattern. By considering the total number of content tags the morphological derivation procedures observed across languages, e.g. the transfer of a verb into a verb+deverbal noun pattern, are taken into account. Morphological ambiguity problems pertaining to content words are treated by constructing ambiguity classes (acs) leading to a generalised set of content tags. It is essential here to clarify that in this approach no disambiguation module is prerequisite. The time breakdown for morphological tagging, without a disambiguator device, is according to \[Cutting 92\] in the order of 1000 ~tseconds per token. Thus, tens of megabytes of text may then be tagged per hour and high coverage can be obtained without prohibitive effort.</Paragraph>
    <Paragraph position="2"> Having identified the semantic load of a sentence, Multiple Linear Regression is used to build a quantitative model relating the content tags of the source language (SL) sentence to the response, which is assumed to be the sum of the counts of the corresponding content tags in the target language (TL) sentence. The regression model is fit to a set of sample data which has been manually aligned at sentence level. Since we intuitively believe that a simple summation over the SL content tag counts would be a rather good estimator of the response, we decide that the use of a linear model would be a cost-effective solution.</Paragraph>
    <Paragraph position="3"> The linear dependency of y (the sum of the counts of the content tags in the TL sentence) upon x i (the counts of each content tag category and of each ambiguity class over the SL sentence) can be stated as : Y=bo+b 1 x 1 /b2x2+b3x3 +--.+bnxn~ (I) where the unknown parameters {bi} are the regression coefficients, and s is the error of estimation assumed to be normally distributed with zero mean and variance 02 .</Paragraph>
    <Paragraph position="4"> In order to deal with different taggers and alternative tagsets, other configurations of (1), merging acs appropriately, are also recommended. For example, if an acs accounts for unknown words, we can use the fact that most unknown words are nouns or proper nouns and merge this category with nouns. We can also merge acs that are represented with only a few distinct words in the training corpus. Moreover, the use of relatively few acs (associated with content words) reduces the number of parameters  to be estimated, affecting the size of the sample and the time required for training.</Paragraph>
    <Paragraph position="5"> The method of least squares is used to estimate the regression coefficients in (1).</Paragraph>
    <Paragraph position="6"> Having estimated the b i and 0 2, the probabilistic score assigned to the comparison of two sentences across languages is just the area under the N(0,o 2) p.d.f., specified by the estimation error. This probabilistic score is utilised in a Dynamic Programming (DP) framework similar to the one described in \[Gale 91\]. The DP algorithm is applied to aligned paragraphs and produces the optimum alignment of sentences within the paragraphs.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML