<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2016">
  <Title>Cognates Can Improve Statistical Translation Models</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The method
</SectionTitle>
    <Paragraph position="0"> We experimented with three word similarity measures: Simard's condition, Dice's coefficient, and LCSR.</Paragraph>
    <Paragraph position="1"> Simard et al. (1992) proposed a simple condition for detecting probable cognates in French-English bitexts: two words are considered cognates if they are at least four characters long and their first four characters are identical. Dice's coefficient is defined as the ratio of twice the number of shared character bigrams to the total number of bigrams in both words. For example, colour and couleur share three bigrams (co, ou, and ur) out of five and six respectively, so their Dice's coefficient is 6/11, or about 0.55. The Longest Common Subsequence Ratio (LCSR) of two words is the length of their longest common subsequence divided by the length of the longer word. For example, LCSR(colour, couleur) = 5/7, or about 0.71, as their longest common subsequence is &quot;c-o-l-u-r&quot;.</Paragraph>
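The three measures can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, Dice's coefficient is assumed to take the standard form (twice the shared bigrams over the total bigram count, with repeated bigrams counted as a multiset), and LCSR is assumed to be the longest-common-subsequence length divided by the length of the longer word.

```python
from collections import Counter


def simard(a: str, b: str) -> bool:
    """Simard's condition: both words have four or more characters
    and their first four characters are identical."""
    return len(a) >= 4 and len(b) >= 4 and a[:4] == b[:4]


def bigrams(word: str) -> list:
    """Character bigrams of a word, in order of occurrence."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice(a: str, b: str) -> float:
    """Dice's coefficient over character bigrams (assumed standard form)."""
    count_a, count_b = Counter(bigrams(a)), Counter(bigrams(b))
    # Multiset intersection: a bigram is shared as often as it occurs in both words.
    shared = sum(min(count_a[g], count_b[g]) for g in count_a)
    total = sum(count_a.values()) + sum(count_b.values())
    return 2 * shared / total if total else 0.0


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: LCS length over the longer word's length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


print(simard("colour", "couleur"))  # first four characters differ
print(round(dice("colour", "couleur"), 2))
print(round(lcsr("colour", "couleur"), 2))
```

Simard's condition is binary, whereas Dice's coefficient and LCSR return graded values between 0 and 1, which is what allows candidate pairs to be ranked and thresholded.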
    <Paragraph position="2"> In order to identify a set of likely cognates in a tokenized and sentence-aligned bitext, each aligned segment is split into words, and all possible word pairings are stored in a file. Numbers and punctuation are not considered, since we feel that they warrant a more specific approach. After sorting and removing duplicates, the file represents all possible one-to-one word alignments of the bitext. Pairs that include English function words, or words shorter than the minimum length (usually set at four characters), are also removed. For each remaining word pair, a similarity measure is computed, and the file is sorted again, this time by the computed similarity value. If the measure returns a non-binary similarity value, true cognates are very frequent near the top of the list and become less frequent towards the bottom. The set of likely cognates is obtained by selecting all pairs with similarity above a certain threshold. Typically, lowering the threshold increases the recall of the set while decreasing its precision. Finally, one or more copies of the resulting set of likely cognates are concatenated with the training set.</Paragraph>
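The extraction procedure described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the authors' file-based implementation: the names likely_cognates, augment and FUNCTION_WORDS are hypothetical, the function-word list is a tiny placeholder, segments are assumed to be (English, French) pairs of whitespace-tokenized strings, tokens are lowercased, and any token that is not purely alphabetic is skipped as a number or punctuation mark (which also drops hyphenated words). LCSR is used as the similarity measure here, but any function returning a score would fit.

```python
# Hypothetical placeholder list; a real English function-word list would be much longer.
FUNCTION_WORDS = {"the", "this", "that", "with", "from", "have"}


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            same = ch_a == ch_b
            curr.append(prev[j - 1] + 1 if same else max(prev[j], curr[j - 1]))
        prev = curr
    longest = max(len(a), len(b))
    return prev[-1] / longest if longest else 0.0


def likely_cognates(segments, similarity, threshold, min_length=4):
    """Return (english, french, score) triples scoring at or above the threshold,
    sorted from most to least similar."""
    pairs = set()  # a set removes duplicate pairings across segments
    for english, french in segments:
        for e in english.lower().split():
            for f in french.lower().split():
                # Skip numbers and punctuation: keep alphabetic tokens only.
                if e.isalpha() and f.isalpha():
                    pairs.add((e, f))
    scored = [
        (e, f, similarity(e, f))
        for e, f in pairs
        if e not in FUNCTION_WORDS
        and len(e) >= min_length
        and len(f) >= min_length
    ]
    # Sort by descending similarity; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[2], item[0], item[1]))
    return [item for item in scored if item[2] >= threshold]


def augment(training_pairs, cognates, copies=1):
    """Concatenate the given number of copies of the cognate pairs with the training set."""
    return list(training_pairs) + [(e, f) for e, f, _ in cognates] * copies


segments = [("the colour of the government 1992 .", "la couleur du gouvernement 1992 .")]
for english, french, score in likely_cognates(segments, lcsr, threshold=0.58):
    print(english, french, round(score, 2))
```

The threshold of 0.58 in the usage example is arbitrary and chosen only for illustration; lowering it admits more pairs at the cost of more false cognates.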
  </Section>
</Paper>