<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1065"> <Title>Measuring the Similarity between Compound Nouns in Different Languages Using Non-Parallel Corpora</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Extraction of Translations from Non-Parallel Corpora </SectionTitle> <Paragraph position="0"> In parallel corpora, positions and frequencies of translation equivalents are correlated; therefore, when we try to find translation equivalents in parallel corpora, this information provides valuable clues. In non-parallel corpora, on the other hand, positions and frequencies of words cannot be directly compared. Fung assumed that the co-occurring words of translation equivalents are similar, and compared distributions of the co-occurring words to acquire Chinese-English translations from comparable corpora (Fung, 1997). This method generates co-occurring-word vectors for target words, and judges a pair of words whose similarity is high to be translation equivalents. Rapp made German and English association word vectors and calculated the similarity of these vectors to find translations (Rapp, 1999). K. Tanaka and Iwasaki (1996) also assumed a resemblance between co-occurring words in a source language and those in a target language, and performed experiments to find irrelevant translations intentionally added to a dictionary.</Paragraph> <Paragraph position="1"> In fact, finding translation equivalents in non-parallel corpora is a very difficult problem, so it is not practical to acquire all kinds of translations from such corpora. Most technical terms are composed of known words, and we must collect these words to translate them correctly, because new terms can be created infinitely by combining several words. We focus on translations of compound nouns here.
First, we collect the translation candidates of a target compound, and then measure the similarity between them to choose an appropriate candidate.</Paragraph> <Paragraph position="2"> In many cases, translation pairs of compound nouns in different languages have corresponding component words, and these can be used as strong clues for finding the translations (Tanaka and Matsuo, 1999). However, these clues are sometimes insufficient for determining which is the best translation for a target compound noun when two or more candidates exist. For example, 営業利益 eigyo rieki, which means earnings before interest and taxes, can be paired with operating profits or business interest. Both pairs have common components, and we cannot judge which pair is better using only this information.</Paragraph> <Paragraph position="3"> A reasonable way to discriminate their meanings and usages is to look at the context in which the compound words appear. In the following examples, we can judge that operating profit is a numerical value and business interest is a kind of group.</Paragraph> <Paragraph position="4"> + ... its fourth-quarter operating profit will fall short of expectations ...</Paragraph> <Paragraph position="5"> + ... the powerful coalition of business interests is pumping money into advertisements ...</Paragraph> <Paragraph position="6"> Thus contextual information helps us discriminate words' categories. We use the distribution of co-occurring words to compare contexts.</Paragraph> <Paragraph position="7"> This paper describes a method of measuring semantic similarity between compound nouns in different languages in order to acquire compound noun translations from non-parallel corpora. We choose Japanese and English as the language pair. The English translation candidates of a Japanese compound cJ that are tested for similarity can be collected by the method proposed by T. Tanaka and Matsuo (1999).
The summary of the method is as follows; measuring the similarity in the third stage is the addition proposed here.</Paragraph> <Paragraph position="8"> 1. Collect English candidate translation equivalents CE from the corpus by part-of-speech (POS) patterns.</Paragraph> <Paragraph position="9"> 2. Make a translation candidate set TE by extracting from CE the compounds whose component words are related to the components of the Japanese compound cJ.</Paragraph> <Paragraph position="10"> 3. Select a suitable translation cE of cJ from TE by measuring the similarity between cJ and each element of TE.</Paragraph> <Paragraph position="11"> In the first stage, this method collects target candidates CE by extracting all units that are described by a set of POS templates. For example, candidate translations of Japanese compound nouns may be English noun-noun, adjective-noun, noun-of-noun, etc.</Paragraph> <Paragraph position="12"> T. Tanaka and Matsuo (2001) reported that 60% of Japanese compound nouns in a terminological dictionary are noun-noun or noun-suffix type, and 55% of English ones are noun-noun or adjective-noun. Next, the method selects the compound nouns whose component words correspond to those of the source-language compound cJ, and makes a set of translation candidates TE. The component words are connected by bilingual dictionaries and thesauri.
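The first two stages can be sketched in a few lines. The toy corpus, the toy bilingual dictionary, and all function names below are our own illustrations, not the authors' resources or code:

```python
# Sketch of stages 1-2 of candidate generation (illustrative only).

# Stage 1 templates: POS-tag pairs such as noun-noun and adjective-noun.
POS_PATTERNS = {("NN", "NN"), ("JJ", "NN")}

# Toy bilingual dictionary linking Japanese components to English words.
BILINGUAL = {
    "eigyo": {"business", "operating"},
    "rieki": {"profit", "profits", "interest", "gain"},
}

def collect_candidates(tagged_corpus):
    """Stage 1: collect adjacent word pairs whose POS tags match a template."""
    cands = set()
    for sent in tagged_corpus:
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if (t1, t2) in POS_PATTERNS:
                cands.add((w1, w2))
    return cands

def match_components(cj_components, candidates):
    """Stage 2: keep candidates all of whose words translate some component of cJ."""
    translations = set().union(*(BILINGUAL[c] for c in cj_components))
    return {cand for cand in candidates
            if all(w in translations for w in cand)}

corpus = [
    [("operating", "JJ"), ("profits", "NN")],
    [("business", "NN"), ("interest", "NN")],
    [("stock", "NN"), ("price", "NN")],
]
ce = collect_candidates(corpus)                # candidate set CE
te = match_components(["eigyo", "rieki"], ce)  # translation candidate set TE
```

On this toy input, TE keeps operating profits and business interest but discards stock price; the third stage then ranks the survivors by contextual similarity.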
For example, if cJ is 営業利益 eigyo rieki, the elements of TE are {business interest, operating profits, business gain}.</Paragraph> <Paragraph position="13"> The original method selects the most frequent candidate as the best one; however, this can be improved by using contextual information.</Paragraph> <Paragraph position="14"> Therefore we introduce the last stage; the proposed method calculates the similarity between the original compound noun cJ and its translation candidates by comparing their contexts, and chooses the most plausible translation cE.</Paragraph> <Paragraph position="15"> In this paper, we describe the method of selecting the best translation using contextual information.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Similarity between two compounds </SectionTitle> <Paragraph position="0"> Even in non-parallel corpora, translation equivalents are often used in similar contexts. Figure 1 shows parts of financial newspaper articles whose contents are unrelated to each other.</Paragraph> <Paragraph position="1"> In the Japanese article, 価格競争 kakaku-kyousou appears with 激化 gekika &quot;intensify&quot;, 営業 eigyo &quot;business&quot;, 利益 rieki &quot;profit&quot;, 予想 yosou &quot;prospect&quot;, etc. Its translation price competition is used with similar or relevant words (brutal, business, profitless, etc.), although the English article is not related to the Japanese one at all. We use the similarity of the co-occurring words of target compounds in different languages to measure the similarity of the compounds. Since co-occurring words in different languages cannot be directly compared, a bilingual dictionary is used as a bridge across the corpora. Some other co-occurring words have similar meanings or are related to the same concept, such as 利益 &quot;profit&quot; and</Paragraph> <Paragraph position="3"> (In particular, price competition of overseas travel has become intense.
The operating profits are likely to show a five hundred million yen deficit, although they were expected to show a surplus at first.) Price competition has become so brutal in a wide array of businesses (cellular phones, disk drives, personal computers) that some companies are stuck in a kind of profitless prosperity, selling loads of product at puny margins.</Paragraph> <Paragraph position="4"> margin, etc. These can be correlated by a thesaurus. The words frequently co-occurring with 価格競争 are listed in Table 1. Its translation price competition has more co-occurring words related to these words than the irrelevant word price control does. The more words can be related to each other in a pair of compounds, the more similar their meanings are.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Context representation </SectionTitle> <Paragraph position="0"> In order to denote the features of each compound noun, we use the context in the corpora. In this paper, context means information about the words that co-occur with the compound noun in the same sentence.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Word co-occurrence </SectionTitle> <Paragraph position="0"> Basically, the co-occurring words of the target word are used to represent its features. Such co-occurring words are divided into two groups in terms of syntactic dependence, and the groups are distinguished in comparing contexts.</Paragraph> <Paragraph position="1"> 1. Words that have syntactic dependence on the target word.</Paragraph> <Paragraph position="2"> (subject-predicate, predicate-object, modification, etc.) + ... fierce price competition by exporters ...</Paragraph> <Paragraph position="4"> + ... price competition was intensifying in this three months ...</Paragraph> <Paragraph position="5"> 2. Words that are syntactically independent of the target word.</Paragraph> <Paragraph position="6"> + ...
intense price competition caused margins to shrink ...</Paragraph> <Paragraph position="7"> The words classified into the first class represent the direct features of the target word: attribute, function, action, etc. We cannot distinguish the role using only POS, since it varies: attributes are not always represented by adjectives, nor actions by verbs (compare intense price competition with price competition is intensifying this month.).</Paragraph> <Paragraph position="8"> On the other hand, the words in the second class have indirect relations, e.g., association, with the target word. This type of word has more variation in the strength of the relation and includes noise; therefore, such words are distinguished from the words in the first class. For simplicity of processing, words that have dependent relations are detected by word-sequence templates, as shown in Figure 2. Kilgarriff and Tugwell collect pairs of words that have syntactic relations, e.g., subject-of, modifier-modifiee, etc., using finite-state techniques (Kilgarriff and Tugwell, 2001). The templates shown in Figure 2 are simplified versions for pattern matching. Therefore, the templates cannot detect all the dependent words; however, they can retrieve frequent dependent words that are relevant to a target compound.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Semantic co-occurrence </SectionTitle> <Paragraph position="0"> Since finding the exact translations of co-occurring words in unrelated corpora is harder than in parallel corpora, we also compare the contexts at a more abstract level. In the example of &quot;price competition&quot;, 電話 denwa &quot;telephone&quot; corresponds to fax in terms of communications equipment, as well as to its exact translation, &quot;telephone&quot;.</Paragraph> <Paragraph position="1"> We employ semantic attributes from Nihongo Goi-Taikei: A Japanese Lexicon (Ikehara et al., 1997) to abstract words.
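To illustrate the template-based detection of Section 4.1, dependent co-occurrences can be picked out of raw sentences with ordinary regular expressions. The two patterns below are our own simplified stand-ins, not the actual templates of Figure 2:

```python
import re

# Illustrative stand-ins for the word-sequence templates:
# an adjacent modifier before the target, and a predicate after "target is/was".
def dependent_words(sentence, target):
    deps = []
    m = re.search(r"(\w+) " + re.escape(target), sentence)
    if m:
        deps.append(m.group(1))  # modifier, as in "fierce price competition"
    m = re.search(re.escape(target) + r" (?:is|was) (\w+)", sentence)
    if m:
        deps.append(m.group(1))  # predicate, as in "... was intensifying"
    return deps
```

For example, `dependent_words("fierce price competition by exporters", "price competition")` yields `["fierce"]`; the remaining co-occurring words of the sentence would fall into the syntactically independent class.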
Goi-Taikei originated from a Japanese analysis dictionary for the Japanese-English MT system ALT-J/E (Ikehara et al., 1991). This lexicon has about 2,700 semantic attributes in a hierarchical structure (maximum 12 levels), and these attributes are attached to three hundred thousand Japanese words. In order to abstract English words, the bilingual dictionary for ALT-J/E was used. This dictionary has the same semantic attributes as Goi-Taikei for pairs of Japanese and English words. We use the 397 attributes in the upper 5 levels to ignore slight differences between lower nodes. If a word has two or more semantic attributes, an attribute for the word is selected as follows.</Paragraph> <Paragraph position="2"> 1. For each set of co-occurring words, sum up the frequency of all attributes that are attached to the words.</Paragraph> <Paragraph position="3"> 2. For each word, the most frequent attribute is chosen. As a result, each word has a unique attribute.</Paragraph> <Paragraph position="4"> 3. Sum up the frequency of the attribute chosen for each word.</Paragraph> <Paragraph position="5"> In the following example, each word has one or more semantic attributes at first. The number of words that have each attribute is counted: three for [374], and one each for [494] and [437]. As the attribute [374] appears more frequently than [494] among all words in the corpus, [374] is selected for &quot;bank&quot;.</Paragraph> <Paragraph position="6"> bank : [374: enterprise/company],[494: embankment] store : [374: enterprise/company] hotel : [437: lodging facilities],[374: enterprise/company]</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Context vector </SectionTitle> <Paragraph position="0"> A simple representation of context is a set of co-occurring words for a target word.
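The three-step attribute-selection procedure of Section 4.2 can be sketched as follows; the attribute table is the toy bank/store/hotel example above, and the function name is our own illustration:

```python
from collections import Counter

# Toy word-to-attribute table from the bank/store/hotel example (illustrative).
ATTRS = {
    "bank":  ["374: enterprise/company", "494: embankment"],
    "store": ["374: enterprise/company"],
    "hotel": ["437: lodging facilities", "374: enterprise/company"],
}

def select_attributes(words):
    # Step 1: sum the frequency of every attribute over the co-occurring words.
    counts = Counter(a for w in words for a in ATTRS[w])
    # Step 2: each word keeps its most frequent attribute, so it becomes unique.
    chosen = {w: max(ATTRS[w], key=lambda a: counts[a]) for w in words}
    # Step 3: sum the frequency of the chosen attribute of each word.
    return chosen, Counter(chosen.values())

chosen, totals = select_attributes(["bank", "store", "hotel"])
# chosen["bank"] is "374: enterprise/company", as in the example above.
```

Ambiguous words are thus resolved toward the attribute that dominates the whole co-occurrence set, which is what makes the abstracted contexts comparable across words.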
As the strength of relevance between a target compound noun t and its co-occurring word r, the feature value of r, γw(t, r), is defined by the log-likelihood ratio (Dunning, 1993) as follows.</Paragraph> <Paragraph position="2"> where f(t) and f(r) are the frequencies of compound noun t and co-occurring word r, respectively, f(t, r) is the co-occurrence frequency of t and r, and N is the total frequency of all words in the corpus.</Paragraph> <Paragraph position="3"> The context of a target compound t can be represented by the following vector (context word vector 1, cw1), whose elements are the feature values of t and its co-occurring words ri.</Paragraph> <Paragraph position="5"> to all vectors of the same language. Moreover, the translation matrix T, described in K. Tanaka and Iwasaki (1996), can convert a vector into another vector whose elements are aligned in the same order as those of the other language (Tcw).</Paragraph> <Paragraph position="6"> The element tij of T denotes the conditional probability that a word ri in the source language is translated into a word rj in the target language.</Paragraph> <Paragraph position="7"> We discriminate between words that have syntactic dependence and those that do not, because the strengths of the relations are different, as mentioned in Section 4.1. In order to intensify the value of dependent words, f(t, r) in equation (3) is replaced with the following f′(t, r), using the weight w determined by the frequency of dependence.</Paragraph> <Paragraph position="9"/> <Paragraph position="11"> Here, fd(t, r) is the frequency with which word r depends on t. The constant is determined experimentally, and the later evaluation is done with const = 2. We define a modified vector (context word vector 2, cw2), which is a version of cw1.</Paragraph> <Paragraph position="12"> Similarly, another context vector is also defined for the semantic attributes to which co-occurring words belong, by using the following feature value γa instead of γw (context attribute vector, ca).
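The log-likelihood ratio (Dunning, 1993) used as the feature value can be computed from the four counts f(t, r), f(t), f(r), and N. The following is a standard formulation of Dunning's statistic, offered as a sketch rather than as the paper's exact equation:

```python
from math import log

def llr(f_tr, f_t, f_r, n):
    """Dunning's log-likelihood ratio for the 2x2 contingency table
    derived from f(t, r), f(t), f(r), and the corpus size N."""
    k11 = f_tr                   # r occurring with t
    k12 = f_r - f_tr             # r occurring without t
    k21 = f_t - f_tr             # t occurring without r
    k22 = n - f_t - f_r + f_tr   # neither t nor r

    def h(*ks):
        # sum of k * log(k / total), skipping empty cells
        total = sum(ks)
        return sum(k * log(k / total) for k in ks if k > 0)

    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))
```

When t and r co-occur exactly as often as independence predicts, for example llr(1, 10, 10, 100), the statistic is near zero; strongly associated pairs score high, which is what makes it a useful feature value for the context vectors.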
La in equation (8) is the semantic-attribute version of L in equation (2); f(t, r) and f(t) are replaced with f(a, r) and f(a), respectively, where a indicates an attribute of a word.</Paragraph> <Paragraph position="14"> 営業利益 and its translation operating profit, and an irrelevant word, business interest. Each item corresponds to an element of context vector cw or ca, and the words in the same row are connected by a dictionary. The high-γw words in the class of &quot;independent words&quot; include words associated indirectly with the target word.</Paragraph> <Paragraph position="16"> Some of these words are valid clues for connecting the two contexts; however, others are not very important. On the other hand, words in the class of &quot;dependent words&quot; are directly related to the target word, e.g., increase of operating profits, estimate operating profits. The variety of these words is limited in comparison to the &quot;independent words&quot; class, whereas they can often be effective clues.</Paragraph> <Paragraph position="17"> More co-occurring words of operating profits that mark high γw are linked to those of 営業利益 than to those of business interest. As for semantic attributes, operating profit shares more upper attributes with 営業利益 than business interest does. The similarity Sw(ts, tt) between compound nouns ts in the source language and tt in the target language is defined by the context word vectors and the translation matrix T as follows.</Paragraph> <Paragraph position="19"/> Similarly, the semantic attribute based similarity Sa(ts, tt) is defined as follows.</Paragraph> <Paragraph position="21"/> </Section> </Section> </Paper>