<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2040"> <Title>TotalRecall: A Bilingual Concordance for Computer Assisted Translation and Language Learning</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Aligning the corpus </SectionTitle> <Paragraph position="0"> Central to TotalRecall is a bilingual corpus and a set of programs that analyze it to produce a translation memory database. Currently, we are working with a collection of Chinese-English articles from Sinorama magazine. A large bilingual collection of Studio Classroom English lessons will be added in the near future. That will allow us to offer bilingual texts in both translation directions and at different levels of difficulty. For now, the articles from Sinorama are quite useful on their own, covering a wide range of topics and reflecting the personalities, places, and events in Taiwan over the past three decades.</Paragraph> <Paragraph position="1"> The concordance database is composed of bilingual sentence pairs that are mutual translations. In addition, there are tables that record further information, including the source of each sentence pair, metadata, and phrase- and word-level alignment. With that additional information, TotalRecall provides various functions, including 1. viewing the full text of the source with a single click. 2. highlighting the translation counterpart of the query word or phrase.</Paragraph> <Paragraph position="2"> 3. ranking that is pedagogically useful for translation and language learning.</Paragraph> <Paragraph position="3"> We are currently running an experimental prototype with Sinorama articles, dated mainly from 1995 to 2002. There are approximately 50,000 bilingual sentences and over 2 million words in total. 
We also plan to continuously update the database with newer material from Sinorama magazine so that the concordance is kept current and relevant.</Paragraph> <Paragraph position="4"> The bilingual texts that go into TotalRecall must be rearranged and structured. We describe the main steps below:</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Sentence Alignment </SectionTitle> <Paragraph position="0"> After parsing each article from its file and putting it into the database, we need to segment the articles into sentences and align them into pairs of mutual translations. While the length-based approach (Church and Gale 1991) to sentence alignment produces surprisingly good results for the close language pair of French and English, with success rates well over 96%, it does not fare as well for distant language pairs such as English and Chinese.</Paragraph> <Paragraph position="1"> Work on sentence alignment of English and Chinese texts (Wu 1994) indicates that the lengths of English and Chinese texts are not as highly correlated as in the French-English task, leading to lower success rates (85-94%) for length-based aligners.</Paragraph> <Paragraph position="2"> Table 1. Chinese collocation candidates extracted. The shaded collocation pairs are selected based on competition between whole-phrase log likelihood ratio and word-based translation probability. Unshaded items 7 and 8 are not selected because they conflict with previously chosen bilingual collocations, items 2 and 3.</Paragraph> <Paragraph position="3"> Simard, Foster, and Isabelle (1992) pointed out that cognates in two close languages such as English and French can be used to measure the likelihood of mutual translation. However, for the English-Chinese pair, there are no orthographic, phonetic, or semantic cognates readily recognizable by the computer. 
Therefore, the cognate-based approach is not applicable to the Chinese-English task.</Paragraph> <Paragraph position="4"> At first, we used the length-based method for sentence alignment. The average precision of the aligned sentence pairs was about 95%. We are now switching to a new alignment method based on punctuation statistics. Although punctuation marks make up only a small share of a text (less than 15% on average), they provide valid additional evidence that helps achieve a high degree of alignment precision. It turns out that punctuation marks are telling evidence for sentence alignment if we go beyond hard matching of punctuation marks and take into account their intrinsic sequencing in an ordered comparison. Experimental results show that the punctuation-based approach outperforms the length-based approach, with precision rates approaching 98%.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Phrase and Word Alignment </SectionTitle> <Paragraph position="0"> After sentences and their translation counterparts are identified, we proceed to carry out finer-grained alignment at the phrase and word levels.</Paragraph> <Paragraph position="1"> We employ part-of-speech patterns and statistics from a parallel corpus. The preferred syntactic patterns are obtained from idioms and collocations in the machine-readable English-Chinese version of the Longman Dictionary of Contemporary English.</Paragraph> <Paragraph position="2"> Phrases matching the patterns are extracted from aligned sentences in a parallel corpus. Those phrases are subsequently matched up via cross-linguistic statistical association. Statistical association between whole phrases and between the words within the phrases are used jointly to link a collocation with its counterpart collocation in the other language. See Table 1 for an example of extracting bilingual collocations. 
The word- and phrase-level information is kept in a relational database for use in processing queries, highlighting translation counterparts, and ranking citations. Sections 3 and 4 give more details.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Queries </SectionTitle> <Paragraph position="0"> The goal of the TotalRecall system is to allow a user to look for instances of specific words or expressions. For this purpose, the system provides two text boxes in which the user can enter queries in either or both of the languages involved. We offer some special expressions for users to specify the following queries: * Exact single word query - W. For instance, enter &quot;work&quot; to find citations that contain &quot;work,&quot; but not &quot;worked&quot;, &quot;working&quot;, &quot;works.&quot; * Exact single lemma query - W+. For instance, enter &quot;work+&quot; to find citations that contain &quot;work&quot;, &quot;worked&quot;, &quot;working&quot;, &quot;works.&quot; * Exact string query. For instance, enter &quot;in the work&quot; to find citations that contain the three words &quot;in,&quot; &quot;the,&quot; &quot;work&quot; in a row, but not citations that contain the three words in any other arrangement.</Paragraph> <Paragraph position="1"> * Conjunctive and disjunctive query. For instance, enter &quot;give+ advice+&quot; to find citations that contain &quot;give&quot; and &quot;advice.&quot; It is also possible to specify the distance between &quot;give&quot; and &quot;advice,&quot; so that they form a verb-object (VO) construction. Similarly, enter &quot;hard |difficult |tough&quot; to find citations that involve difficulty in doing, understanding, or bearing something, using any of the three words.</Paragraph> <Paragraph position="2"> Once a query is submitted, TotalRecall displays the results on Web pages. 
Each result appears as a pair of segments, usually one sentence each in English and Chinese, in side-by-side format. The words matching the query are highlighted, and a &quot;context&quot; hypertext link is included in each row. If this link is selected, a new page appears displaying the original document of the pair.</Paragraph> <Paragraph position="3"> If the user so wishes, she can scroll through the following or preceding pages of context in the original document.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Ranking </SectionTitle> <Paragraph position="0"> It is well known that the typical user usually has no patience to go beyond the first or second page returned by a search engine. Therefore, ranking the results and putting the most useful information on the first one or two pages is of paramount importance for search engines. This is also true for a concordance.</Paragraph> <Paragraph position="1"> Experiments with a focus group indicate that the following ranking strategies are important: * Citations with a translation counterpart should be ranked first.</Paragraph> <Paragraph position="2"> * Citations with a frequent translation counterpart should appear before ones with a less frequent translation counterpart. * Citations with the same translation counterpart should be shown in clusters by default. The full cluster can be called up on demand.</Paragraph> <Paragraph position="3"> * Ranking by nonlinguistic features should also be provided, including date, sentence length, query position in citations, etc.</Paragraph> <Paragraph position="4"> With various ranking options available, users can choose the one that is most convenient and productive for the work at hand.</Paragraph> </Section> </Paper>