<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2095">
  <Title>Using comparable corpora to solve problems difficult for human translators</Title>
  <Section position="4" start_page="739" end_page="741" type="metho">
    <SectionTitle>
2 Finding translations in comparable corpora
</SectionTitle>
    <Paragraph position="0"> The proposed model finds potential translation equivalents in four steps (sketched in code below):
1. expansion of words in the original expression using related words;
2. translation of the resultant set using existing bilingual dictionaries;
3. further expansion of the set using related words in the target language;
4. filtering of the set according to expressions frequent in the target language corpus.
In this study we use several comparable corpora for English and Russian, including large reference corpora (the BNC and the Russian Reference Corpus) and corpora of major British and Russian newspapers. All corpora used in the study are quite large: each is in the range of 100-200 million words (MW), so they provide enough evidence to detect such collocations as strong voice and clear defiance. Although the current study is restricted to the English-Russian pair, the methodology does not rely on any particular language. It can be extended to other languages for which large comparable corpora, POS-tagging and lemmatisation tools, and bilingual dictionaries are available. For example, we conducted a small study of translation between English and German using the Oxford German Dictionary and a 200 MW German corpus derived from the Internet (Sharoff, 2006).</Paragraph>
    <Section position="1" start_page="739" end_page="740" type="sub_section">
      <SectionTitle>
2.1 Query expansion
</SectionTitle>
      <Paragraph position="0"> The problem with using comparable corpora to find translation equivalents is that there is no obvious bridge between the two languages. Unlike aligned parallel corpora, comparable corpora provide a model for each individual language, while dictionaries, which can serve as a bridge, are inadequate for the task in question, because the problem we want to address involves precisely translation equivalents that are not listed there.</Paragraph>
      <Paragraph position="1"> Therefore, a specific query needs first to be generalised in order to then retrieve a suitable candidate from a set of candidates. One way to generalise the query is by using similarity classes, i.e. groups of words with lexically similar behaviour. In his work on distributional similarity (Lin, 1998) designed a parser to identify grammatical relationships between words. However, broad-coverage parsers suitable for processing BNC-like corpora are not available for many languages. Another, resource-light approach treats the context as a bag of words (BoW) and detects the similarity of contexts on the basis of collocations in a window of a certain size, typically 3-4 words, e.g. (Rapp, 2004). Even if using a parser can increase precision in identification of contexts in the case of long-distance dependencies (e.g. to cook Alice a whole meal), we can find a reasonable set of relevant terms returned using the BoW approach, cf. the results of human evaluation for English and German by (Rapp, 2004).</Paragraph>
      <Paragraph position="2">  For each source word s0 we produce a list of similar words: Th(s0) = s1,...,sN (in our tool we use N = 20 as the cutoff). Since lists of distributionally words can contain words irrelevant to the source word, we filter them to produce a more reliable similarity class S(s0) using the assumption that the similarity classes of similar words have common members: [?]w [?] S(s0),w [?] Th(s0)&amp;w [?]uniontextTh(si) This yields for experience the following similarity class: knowledge, opportunity, life, encounter, skill, feeling, reality, sensation, dream, vision, learning, perception, learn.1 Even if there is no requirement in the BoW approach that words in the similarity class are of the same part of speech, it happens quite frequently that most words have the same part of speech because of the similarity of contexts.</Paragraph>
    </Section>
    <Section position="2" start_page="740" end_page="740" type="sub_section">
      <SectionTitle>
2.2 Query translation and further expansion
</SectionTitle>
      <Paragraph position="0"> In the next step we produce a translation class by translating all words from the similarity class into the target language using a bilingual dictionary (T(w) for the translation of w). Then for Step 3 we have two options: a full translation class (TF) and a reduced one (TR).</Paragraph>
      <Paragraph position="1"> TF consists of similarity classes produced for all translations: S(T(S(s0))). However, this causes a combinatorial explosion. If a similarity class contains N words (the average figure is 16) and a dictionary lists on average M equivalents for a source word (the average figure is 11), this procedure outputs on average M x N2 words in the full translation class. For instance, the complete translation class for experience contains 998 words. What is worse, some words from the full translation class do not refer to the domain implied in the original expression because of the ambiguity of the translation operation. For instance, the word dream belongs to the similarity class of experience. Since it can be translated into Russian as a241a234a224a231a234a224 ('fairy-tale'), the latter Russian word will be expanded in the full translation class with words referring to legends and stories. In the later stages of the project, word sense disambiguation in corpora could improve precision of translation classes. However at the present stage we attempt to trade the recall of the tool for greater precision by translating words in the source similarity class, 1Ordered according to the score produced by the Singular Value Decomposition method as implemented by Rapp.</Paragraph>
      <Paragraph position="2"> and generating the similarity classes of translations only for the source word:</Paragraph>
      <Paragraph position="4"> This reduces the class of experience to 128 words.</Paragraph>
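The two options can be contrasted in code; sim_src, sim_tgt, and translate are the same hypothetical helpers as in the earlier sketch:

```python
def full_translation_class(s0, sim_src, sim_tgt, translate):
    """TF = S(T(S(s0))): expand, translate, then expand every translation.
    Grows as M x N^2 (998 words for 'experience')."""
    tf = set()
    for s in {s0} | set(sim_src(s0)):
        for t in translate(s):
            tf |= {t} | set(sim_tgt(t))
    return tf

def reduced_translation_class(s0, sim_src, sim_tgt, translate):
    """TR = T(S(s0)) | S(T(s0)): translate the whole source similarity class,
    but expand only the translations of s0 itself (128 words for 'experience')."""
    t_of_class = {t for s in {s0} | set(sim_src(s0)) for t in translate(s)}
    s_of_t = {w for t in translate(s0) for w in {t} | set(sim_tgt(t))}
    return t_of_class | s_of_t
```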
      <Paragraph position="5"> This step crucially relies on a wide-coverage machine readable dictionary. The bilingual dictionary resources we use are derived from the source file for the Oxford Russian Dictionary, provided by OUP.</Paragraph>
    </Section>
    <Section position="3" start_page="740" end_page="741" type="sub_section">
      <SectionTitle>
2.3 Filtering equivalence classes
</SectionTitle>
      <Paragraph position="0"> In the final step we check all possible combinations of words from the translation classes for their frequency in target language corpora.</Paragraph>
      <Paragraph position="1"> The number of elements in the set of theoretically possible combinations is usually very large:producttext Ti, where Ti is the number of words in the translation class of each word of the original MWE.</Paragraph>
      <Paragraph position="2"> This number is much larger than the set of word combinations which is found in the target language corpora. For instance, daunting experience has 202,594 combinations for the full translation class of daunting experience and 6,144 for the reduced one. However, in the target language corpora we can find only 2,256 collocations with frequency &gt; 2 for the full translation class and 92 for the reduced one.</Paragraph>
      <Paragraph position="3"> Each theoretically possible combination is generated and looked up in a database of MWEs (which is much faster than querying corpora for frequencies of potential collocations). The MWE database was pre-compiled from corpora using a method of filtering, similar to part-of-speech filtering suggested in (Justeson and Katz, 1995): in corpora each N-gram of length 2, 3 and 4 tokens was checked against a set of filters.</Paragraph>
      <Paragraph position="4"> However, instead of pre-defined patterns for entire expressions our filtering method uses sets of negative constraints, which are usually applied to the edges of expressions. This change boosts recall of retrieved MWEs and allows us to use the same set of patterns for MWEs of different length.</Paragraph>
      <Paragraph position="5"> The filter uses constraints for both lexical and part-of-speech features, which makes configuration specifications more flexible.</Paragraph>
      <Paragraph position="6"> The idea of applying a negative feature filter rather than a set of positive patterns is based on the observation that it is easier to describe undesirable features than to enumerate complete lists of patterns. For example, MWEs of any length ending with a preposition are undesirable (particles in  phrasal verbs, which are desirable, are tagged differently by the Tree Tagger, so there is no problem with ambiguity here). Our filter captures this fact by having a negative condition for the right edge of the pattern (regular expression /_IN$/), rather than enumerating all possible configurations which do not contain a preposition at the end. In this sense the filter is permissive: everything that is not explicitly forbidden is allowed, which makes the description more economical.</Paragraph>
      <Paragraph position="7"> The same MWE database is used for checking frequencies of multiword collocates for corpus queries. For this task, candidate N-grams in the vicinity of searched patterns are filtered using the same regular expression grammar of MWE constraints, and then their corpus frequency is checked in the database. Thus scores for multiword collocates can be computed from contingency tables similarly to single-word collocates. In addition, only MWEs with a frequency higher than 1 are stored in the database. This filters out most expressions that co-occur by chance. Table 1 gives an overview of the number of MWEs from the news corpus which pass the filter. Other corpora used in ASSIST (BNC and RRC) yield similar results. MWE frequencies for each corpus can be checked individually or joined together.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>