<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1106"> <Title>Lower and higher estimates of the number of &quot;true analogies&quot; between sentences contained in a large multilingual corpus</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4.2 Results </SectionTitle> <Paragraph position="0"> Out of a total of 2,384,202 English analogies on the level of form, 238,135 are shared with Chinese.</Paragraph> <Paragraph position="1"> They involve 25,554 sentences. Consequently, 10% of the English analogies of form may be thought to be analogies of form and meaning, i.e., &quot;true analogies&quot;, when relying only on Chinese.</Paragraph> <Paragraph position="2"> Between English and Japanese, the number of analogies in common is 336,287 (involving 24,674 sentences), which represents 14% of the English analogies. An example is given in Figure 2.</Paragraph> <Paragraph position="3"> Between Chinese and Japanese, very similar figures are obtained: the number of analogies in common between these two languages is 329,429 (involving 25,127 sentences).</Paragraph> <Paragraph position="4"> Taking the intersection of Chinese, English and Japanese leads to a figure of 68,164 &quot;true analogies&quot;, involving 13,602 different sentences.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Discussion </SectionTitle> <Paragraph position="0"> Although the number of analogies dropped from 2.5 million analogies of form in English down to fewer than 70,000 when intersecting with Chinese and Japanese, one cannot say that the obtained figure is small.</Paragraph> <Paragraph position="1"> The average number of &quot;true analogies&quot; per sentence over the whole corpus is: 68,164 / 162,318 = 0.42. 
In other words, in this corpus, one sentence is involved in about half a &quot;true analogy&quot; on average, taking it for granted that the linguistic differences between Chinese, English and Japanese filter real oppositions in meaning out of the oppositions captured by analogies of form.</Paragraph> <Paragraph position="2"> The number of sentences involved in at least one analogy is 13,602, so that more than one tenth of the sentences of the corpus are in an immediate analogical relation with other sentences of the corpus. Such a figure is not negligible.</Paragraph> <Paragraph position="3"> Averaging over those sentences involved in at least one analogy gives the figure of 162,318 / 13,602 = 11.93 &quot;true analogies&quot;, which indicates that, on average, there are about twelve different ways to obtain these sentences by analogy with other sentences.</Paragraph> <Paragraph position="4"> It is questionable whether those analogies that were lost in the successive intersections were really not analogies on the meaning level. In fact, our impression is that the experiment yielded a figure which is excessively low. An inspection by hand convinced us that almost all analogies which were discarded would have been considered by a human evaluator as &quot;true analogies&quot;. Figure 1 shows two such examples. The problem is that the corresponding translations in other languages did not make an analogy of form. Other ways of phrasing them could have made valid analogies of form. 
Consequently, the low number of translation equivalents available in our corpus is responsible for the low number of &quot;true analogies&quot; found by this method.</Paragraph> <Paragraph position="5"> 5 A higher estimate: translation by enforcement of &quot;true analogies&quot;</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Method </SectionTitle> <Paragraph position="0"> The corpus we used is rather poor in translation equivalents, or paraphrases: an English sentence gets only 1.20 equivalent sentences on average when translated into Chinese, and only 1.52 into Japanese. If we want a more accurate estimate of the number of &quot;true analogies&quot; in English, then our problem becomes that of increasing the number of possible translations of English sentences into Chinese and Japanese, i.e., increasing the number of paraphrases in Chinese and Japanese.</Paragraph> <Paragraph position="1"> To address this problem, we adopted a view which is the opposite of our previous one. We decided to enforce &quot;true analogies&quot;: given an analogy of form in a first language, we forced it, when possible, to be reflected by an analogy of form in the second language. This should yield an estimate of the number of analogies in common between two languages which, if not necessarily more accurate, will at least be a higher estimate.</Paragraph> <Paragraph position="3"> To do so, the formula mentioned in section 3.1 is used in production, i.e., D2 is generated from the three sentences A2, B2 and C2 whenever possible.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> Using the method described above, we automatically produced Chinese translations for those English sentences of the corpus which are involved in at least one analogy of form. 
This yielded an average of 51 different candidate sentences. In all, 48,351 of the 53,250 sentences could be translated.</Paragraph> <Paragraph position="1"> Doing the same for Japanese, the average number of different sentences is higher: 174, for 47,702 translated sentences (see footnote 14). (For the reader to judge, Figure 3 shows examples of Japanese-to-English translations, rather than English-to-Japanese.) The obtained translations were added to the corpus so as to increase the number of paraphrases in Chinese and Japanese. Then all counts were redone, and the new figures are listed under the title &quot;Higher estimate&quot; in Table 2.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Discussion </SectionTitle> <Paragraph position="0"> The new figure of 1,507,380 analogies for 49,052 sentences involved should be compared with the previous figures for the lower estimate. It is much higher, but it seems closer to the impression one gets when screening the analogies: analogies of form which are not analogies of meaning are very rare.</Paragraph> <Paragraph position="1"> However, the sentences that were obtained by enforcing analogies and then included in the corpus are not always valid sentences. Figure 3 shows some such examples.</Paragraph> <Paragraph position="2"> Footnote 14: Here again, we suspect the cause of the difference to be the interchangeable use of kanji and hiragana.</Paragraph> <Paragraph position="3"> Future work should thus consider the problem of filtering the automatically obtained translations in some way, for example with N-gram statistical models. After such filtering, the counts should be redone. However, the problem with such filtering is that it may lose the morphological productivity of analogy.</Paragraph> </Section> </Section> </Paper>