<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-3139">
  <Title>A Statistical Approach to Machine Aided Translation of Terminology Banks</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. The Problem of Root Acquisition
</SectionTitle>
    <Paragraph position="0"> Suppose that we have a large amount of terms through a manual or automatic lexical acquisition process. In these terms, there is always certain degree of redundancy in the form of repeated occurrence of certain general or domain specific roots in different words (or words in noun-noun compounds). In order to take advantage of the redundancy and reduce the effort of translating these terms, there is the need for discovering the roots automatically. So given a set of terms, we are supposed to produce a list of roots that appear more than twice in the terminology bank. For example, given acidimeter acidity amide antibiotic antiblocking cyanoacrylate gloss glossmeter hydroxybonzylmoth hydrometer mildew mildewiclde polyacryl polyacrylamide polyacrylonitdle polyacrylsulfone acrylalkyd pacrylate polyacrylate polyamide polyol polytributyltinacrylate we are suppose to produce acryl, amide, amine, anti, block, cide, gloss, meter, mildew, hydro, el, poly After hand translation, we get</Paragraph>
    <Paragraph position="2"> Now we are in a position to translate the original terminology bank by the composition of the translated roo~:</Paragraph>
    <Paragraph position="4"/>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Root Acquisition
</SectionTitle>
    <Paragraph position="0"> A root can be anywhere between one and up to I 1 characters (such as phosphazene in pho~phazene, polyaryloxyphosphazene, and polyphosphazene). To carry out a statistical analysis on a letter by letter basis would mean searching for scarce roots (102-103) in a very large search space (1015). However, a root can be either from one to 3 syllables long and there are but about some 2,000 syllables. So if we analyze the data as syllables, the search space is drastically reduced (1010). So, we choose to hyphenate words in the terminology bank first and extract only roots that are made of I to 3 syllables.</Paragraph>
    <Paragraph position="1"> If we had in advance the appearing frequency of the syllables and roots in the terminology bank, we could simply use them to compute the most likely hyphenation or dissection. After the whole term banks are hyphenated and dissected, we can then not only produce the list of the most likely roots in the terminology bank, but also produce the frequency count of each syllable or root. However, in most cases, we do not have the frequency count of syllables and roots in the first place, a dilemma.</Paragraph>
    <Paragraph position="2"> Both hyphenation and root dissection are attacked using the EM algorithm (Dempster et at. 1977). In brief, the EM algorithm for the root dissection problem works like this: given some initial estimate of the root probability, any dissection of all the terms in the terminology bank into roots can be evaluated according to this set of initial root probability. We can compute tile most likely dissection of terms into roots using tile initial root probabilities. We then re-estimate the probability of any root according to this dissection.</Paragraph>
    <Paragraph position="3"> Repeated applications of the process lead to probability that assign ever greater probability to correct dissection of term into roots. This algorithm leads to a local but acceptable maximum.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1. Hyphenation
</SectionTitle>
      <Paragraph position="0"> Previous methods Ibr hyphenation are all based on rules about the nature of characters (consonant or vowel) and can only achieve about 90% hit rate (Knuth, 1985; Smith, 1989). The other 10% is done using an exception dictionary. These hyphenation algorithms are not feasible for our purpose because of the low rate and reliance on an exception dictionary. Therefore, we have developed a statistical approach to hyphenation. Tile idea is to collect frequency count of syllables in correctly hyphenated words. Then we use the frequency to estimate the likelihood of a syllable in  prob - probability of optimal dissection at a position prev - previOus dissecting position 1. Hyphenate Word into n syllables.</Paragraph>
      <Paragraph position="1"> Word = S 1 S 2 ... S n 2. prob\[0\] = 1; prey\[0\] = 0; 3. Fori:=ltondo4&amp;5 4. j&amp;quot; = max prob\[i-j\] x RootProb(Ri)</Paragraph>
      <Paragraph position="3"> 6. Compute Pos by tracing back prev links starting from prev\[n\].</Paragraph>
      <Paragraph position="4"> a possible hyphenation and choose the hyphenation that consists of a most likely sequence of syllables. The optimization process is done through a dynamic programming algorithm described in Algorithm 1.</Paragraph>
      <Paragraph position="5"> 3,2. Root Dissection  Chie can set the initial estimate of the probability of single-, bi-, and tri-syllablC/ roots as follows:</Paragraph>
      <Paragraph position="7"> ac. 1 act. 1 ad47 aer6 af7 ag59 age. 15air4 air. 2 a1141 al. 26 am 49 an 106 an. 8 ance. 7 and. 2  abi 3 abil 3 able.9 abra 6 absorb 2 absorp 9 accel 8 accep 3 ace 40 acene. 4 aci 7 acid. 177 acous 2 acri 3 acro 3</Paragraph>
      <Paragraph position="9"> for is a single-syllable root R = S = Bigram(S i $2), for a bi-syllable root R = SIS 2</Paragraph>
      <Paragraph position="11"> for a tri-syllable root R = S18283, The root dissection is done using Algorithm 2 which is similar to the hyphenation algorithm.</Paragraph>
      <Paragraph position="12"> ACTES DE COLING-92. NANTES, 23-28 AOIYf 1992 9 2 3 PROC. OF COLING-92, NANTES, AUO. 23-28, 1992 a 2 a. 2 able. 4 abrasion. 3abrasive. 3 absorption. 7 accelo 4 aceno. 2 acetal. 6 acetate. 2 acetic. 4 aceto 3 acety 5 acetyl. 2 acid. 177 acidi 22 acoustic. 2 acridine. 2 acryl 3 acryl. 2 acrylat 4 Figure 4. Roots extracted after the first iteration plastic 5 plasticisation plasticised plasticiser plasticity PO 6 antipode epichlopohyddn pored porosity poteniometric polari 4 polarisation polarity polarization poly 302 polyacetal polyacrolein polyacrylamide polyacrylate polymer 28 copolymedzation polymedsation copolymedsate polymer. 14 prepolymer terpolymer photopolymer biopolymer port. 4 export import support position, 5 composition decomposition pre 18 prechrome precipitate precipitated precipitation prene. 5 chloroprene polychloroprene pr/3 prileshajev primary pdmedess Figure 5. Roots extracted after the last iteration</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Experimental Results
</SectionTitle>
    <Paragraph position="0"> The experiment has been carried out on a personal computer running a C-H- compiler under DOS 5.0. The terminology bank consists of more than 4,000 lines of chemical terms compiled by a leading chemical company hi Germany for internal use. Each line consists of from 1 to 5 words and a word can be any where from 1 to 15 syllables long or 2 to 31 characters long.</Paragraph>
    <Paragraph position="1"> The initial syllable probabilities used in the hyphenation algorithm are the appearance counts of some 1,800 distinct syllables in a partially hyphenated data, which is the result of running Latex (Knuth 1986) on the terminology bank itself.</Paragraph>
    <Paragraph position="2"> The root dissection algorithm uses the syllable probability and bigram of syllables to start the EM algoritlun. Small segments of the bigram and root probabilities produced in the first iteration are shown in Figure 2 and Figure 3 respectively.</Paragraph>
    <Paragraph position="3"> To facilitate human translation, in the last iteration, we produce the exemplary words along side with the root found. A small segment is shown in Figure 5.</Paragraph>
    <Paragraph position="4"> Following the terminology of research in information retrieval, we can evaluate the performance of this root extraction method: precision = number of correct roots number of roots found number of correct roots recall = number of actual roots These two numbers can be calculated for all appearances of roots or for the set of distinct roots respectively. We have extracted 223 distinct and more frequently occurring root and 203 of them are valid roots. To analyze precision and recall for all occurrences, we have randomly sampled 100 terms, in which a domain expert identified 237 roots and our algorithm split into 195 valid roots in 226 proposed roots. Thus, counting all occurrences of root, the precision and recall rates are as follows: precision recall 86.3%=(195/226) 82.3%=(195/237) If distinct roots are counted, the precision and recall rates are as follows: precision recall 91.0%=(203/223) Not available</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML