An empirical lnethod for identifying and translating technical 
terminology 
Sayori Shimohata 
Research & Development Group, 
Oki Electric Industry Co., Ltd. 
Crystal Tower 1-2-27 Shirolni, 
Chuo-ku, Osaka 540-6025 Japan 
shimohat a245 ~oki. co.j p 
Abstract 
This paper describes a. method for retrieving 
patterns of words a.nd expressions frequently 
used in a. specific dom a.in and building a. dictio- 
nary for ma.chine translatiou(MT). The method 
uses an untagged text corpus in retrieving word 
sequences a.nd simplified pa.rt-of-speech tern- 
plates in identifying their synta.ctic ca.tegories. 
The pa.per presents e×perimenta.l results for a.p- 
plying the words and expressions to a pattern- 
based ma.chine translation system. 
1 Introduction 
Th.ere has been a. continuous interest in corpus- 
based approa.ches which retrieve words and ex- 
pressions in connection with a specific domain 
(we call them technical terms herea.fter). They 
may correspond to syntactic phra.ses or compo- 
nents of syntactic relationships and ha.ve been 
found useful in various application area.s, in- 
cluding inibrmation e×tra.ction, text sumlna.- 
riza.tion, and ma.chine tra.nsla.tion. Am.ong oth- 
ers, a. knowledge of technica\] terminology is in- 
dispensa.ble for machine tra.nsla.tion beca.use us- 
age and mea.ning of technica.1 terms a.re often 
quite different from their literal interpreta.tion. 
One a.pproa.ch for identifying technical termi- 
nology is a. rule-ba.sed a.pproa.eh which learns 
l.oca.1 syntactic patterns from a training cor- 
pus. A variety of methods ha.ve been developed 
within this fra.mework, (Ra.msha.w, 1995) (Arga.- 
mon et al., 1999) (Ca.rdie and Pierce, 1.999) a.nd 
achieved good results for the considered ta.sk. 
Surprisingly, though, little work ha.s been d.e- 
voted to lea.rning local syntactic pa.tterns be- 
sides noun phrases. Another drawback of this 
a.pproach is tha.t it requires substa.ntiM training 
corpora, in many cases with pa.rt-of-speech tags. 
An. alternative approa.ch is a. statistical one 
which retrieves recurrent word sequences as 
co\]loca.tiolls (Sma.dja., 1993)(Ha.runo et a.1., 
1996)(Shimolla.ta et a.1., :1997). This a.pproach 
is robust and pra.ctical because it uses t)lain 
text corpora, without a.ny inibrmation depen- 
dent on a la.ngua.ge. Unlike the former N)- 
proa.ch, this a.pproach extra.cts va.rious types of 
local pa.tterns a.t the same time. Therefore, 
post-processing, such as part of speech ta.gging 
and syntactic category identifica.tion, is neces- 
sary when we a.pply them to NLP applica.tions. 
This pa.per presents a. method for identify- 
ing technicM terms froni a. corpus and a.pl)ly- 
ing them to a. ma.chine tra.nsla.tion system. The 
proposed method retrieves local pa.tterns by uti- 
lizing the n-gram statistics a.nd identifies their 
syntactic categories with. simple pa.rt-ofspeech 
teml)la.tes. We ma.ke 3. ma.chine trans\]a.tion dic- 
tiona.ry from the retrieved patterns and tra.ns- 
late documents in the Sa.lne doma.in a.s the orig- 
inal corpus. 
In the next section, we briefly describe a 
pa.ttern-based machine translation. The follow- 
ing section explains how th.e proposed method 
works in detail. We th.en present experimenta.l 
results a.nd conclude with a discussion. 
2 Pattern-based MT system 
h pattern-ha.seal MT system uses a set of bilin- 
gua.1 pa.tterns(CFG rules) (Abeille et a.l., 1990) 
(Ta.keda., 1.996) (Shimohata. et a.l., 1.999). In the 
pa.rsing process, the engine performs a. CFG- 
parsing for a.n input sentence and rewrites trees 
by a.pplying the source pa.tterns. 3'erminals 
and non-terminals are processed under the sa.me 
fra.lnework but lexicalized pa.tterns ha.re priority 
over symbolized pa.tterns 1 A plausible parse 
We define a symbolized pattern as a pattern with- 
out a. terminal and ~L lexicalizcd pattern as that with 
more than one terminal, we prepares 1000 symbolized 
patterns a.nd 130,000 lexicalizcd patterns as a system 
782 
tree will be selected among possible parse trees 
by the number of l)atterns applied. Then the 
pa.rse tree is tr~msferred into target  by 
using target patterns which correspond to the 
source patterns. 
Figure 1 shows an example of translation 
patterns between Fmglish and .lapanese. Each 
C1 G rule) has col English pattern(a left-half ' ,' ' 
responding aal)anese pattern(a right-half CFG 
rule). Non-terminals are bracketed with in- 
dex numbers which represents correspondence 
of non-terminals between the source and target 
pattern. 
S *--\[I:NP\] \[2:VP\] S ~--\[I:NP\] ~(subj) \[2:VP\] 
NP ~a \[I:NP\] NP *-\[I:NP\] 
VP ~---\[I:VT\] \[2:NP\] VP '~--\[2:NP\] ,~(dobj) \[I:VT\] 
VP +-take \[I:NP\] VP ~---\[I:NP\] ~(dobj)nj-7~("do") 
VP ~-- take a bath VP "*-J~=t:t~("bath") \[5("in") ,,'~'7~("enter") 
V ~-- take V ',-'~7~("take") 
N '--" bath N ~--J:~,~("l)ath") 
Figure 1: translation l)atterns 
The pattern ibrmat is simple but highly de- 
scriptive. It can represent complicated linguis- 
tic phenomena and even correspondences be- 
tween the s with quite different struc- 
tures, l)'urthermore, a.l\] the knowledge necessary 
fl)r the translation, whether syntactic or lexical, 
are compiled in the same pattern tbrmat. Ow- 
ing to these fea.tures, we can easily apply the 
retrieved technical terms to a real MT system. 
3; Algorithm 
1,'igure 2 shows an outline of the l)roposed 
nlethod. The inpu t is an untagged :~nonolingu al 
corpus, while the output is a dolnain dictionary 
for machine translation. The process is con> 
prised of 3 phases: retrieving local patterns, as- 
signing their syntactic categories with part-of- 
speech(POS) templates, and making translation 
patterns. The dictionary is used when an MT 
system translates a text in the same domain as 
the corpus. 
We assume that the input is an English cor- 
pus and the dictionary is used for an English- 
Japanese MT system. In the remainder of this 
section, we will explain each phase in detail with 
English and Japanese examples. 
dictiona.ry. 
3.1 Retrieving local patterns 
We have ah'eady proposed a method for retriev- 
ing word sequences (Shimohata et al., 1997). 
This method generates all n-character (or n- 
word) strings appearing in a text and tilters 
out ffagl-nenta.1 strings with the distribution of 
words adjacent to the strings. This is based 
on the idea. that adjacent words are widely dis- 
tributed if the string is meaningful, m~d are lo- 
calized if the string is a substring of a meaning- 
ful string. 
The method introduces entropy value to mea- 
sure the word distribution. Let the string t)e 
8tr, the adjacent words Wl...w,~, and the fre- 
quency of str frcq(.slr). The probability of each 
possible adjacent word p(wi) is then: 
p(wi)- frcq(wi) 
frcq(str) (\]) 
At ttla,t time~ the entropy of ,~tr H(.qtr) is de- 
tined a.s: 
tl(,t,.) = (2) 
i=1 
Calculating the entropy of both sides of ,qtr, 
the lower one is used as ll(,tr). Then the 
strings whose entropy is larger than a given 
threshold are retrieved as local pattexns. 
3.2 Identifying syntactic categories 
Since the strings are just word sequences, the 
l)rocess gives tllem syntactic categories. For 
each str .str~ 
1. assign pa.rt-ofspeech tags tl, ... t~. to the 
coH\]ponent words Wl, ... /vr~ 
2. match tag sequence tl, ... t,~ with part-of- 
speech templates 7~ 
3. give sir corresponding syntactic category 
,5'6'i, it' it matches Ti 
3.2.1 Assigning part-of-speech tags 
The process uses a simplified part-of speech set 
shown in table 1. l?unction words are assigned 
as they are, while content words except for ad- 
verb are fallen into only one part of speech 
word. Four kinds of words "be", "do", "'not", 
and "to" are assigned to speciM tags be, do, 
not, and to respectively. 
There are several reasons to use the simplitied 
POS tags: 
783 
Retrieve local patterns 
---* Identify syntactic categories 
Make translation patterns 
5 )n 
Figure 2: outline 
POS tag part of speech 
art 
adv 
aux 
eonj 
det 
prep 
prn 
punc 
be 
do 
not 
to 
word 
article 
adverb 
auxiliary verb 
conjunction 
determiner 
preposition 
pronoun 
punctuation 
. do~ 
"~Ot" 
"to" 
th.e others 
Table 1: part-of-speech tags 
• it may sometimes be difl3cult to identify 
precise parts of speech in such a local pat- 
tern. 
• words are often used beyond parts of speech 
in technical terminology 
• it is eml)irically found that word sequences 
retrieved through n-gram statistics have 
distributional concentration on several syn- 
tactic categories. 
Theretbre, we think the simplified POS tags are 
sufficient to identify syntactic categories. 
The word sequence w~, ... w,~ is represented 
for a part-of-speech tag sequence tl, ... ti. Fig- 
ure 3 shows examples of POS tagging. Italic 
the fuel tank 
art word word 
do this step • 
do det,prn word punc 
to oprn the 
to word art 
Figure 3: examples of POS tagging 
lines are given word sequences and bold lines 
are POS tag sequences. If a word falls into two 
or more parts of speech, all possible POSs wi\]\] 
be assigned like "this" in the second example. 
3.2.2 Matching POS templates 
The process identifies a syntactic category(SC) 
of sir by checking if str's tag sequence tl, ... 
tn matches a given POS template 7}. If they 
784 
match, str is given a syntactic category ,5'Ci 
corresponding to 5/). Table 2 shows examt)les 
of I)OS teml)la.tes and corresl)onding SCs 2 
SC POS template 
N 
N÷prep 
VT 
V-ed 
V 
1,'UNC 
(.,'0 (wo,.d l (.o,q) , (,,)o,,d) 
(.,'0 (wo,.d) + (pw, I~o)(,,,'0. (.,u.~ 
I~.o Iv,',,,) * (.,o,,d) + (.,.t) (~) (wo,.d) + (v,'~v)(.,'0: 
(.u. I~,o I l,,',,)(,~o,.a) 
((.'~ \[ .*,.. I ~o.j Ida* I *','¢*, I v,',,)+ 
If SC is N, delete art and generate: 
NP '-- st,- 
NP +-str 
If SC is VT, delete (aux\[tolprn) and art and generate: 
VP (-- str \[ 1 :NP\] 
VP ~-- \[I:NP\] ~(dobj) st*" "ej-~Cdo" ) 
If SC is v, delete (auxltolprn) generate: 
V +--sO" 
V *-- str ~("do") 
3'M)le 2: POS telnplates ~md corresponding SCs 
The templa£es are described in the l'orm of 
regula.r expressions(Rl~;) a. The first templ~te 
in table 2, for exanrple, :m~tches a string whose 
tag sequence begins with an article, contains 0 
or m ore rel)etitions of content word s or conj u n c- 
tions, a.nd ends with a content word. "the fuel 
ta,nk" in tigure 3 is applied to this templa.tes aald 
given a SC "N". 
3.3 Making translation patterns 
The process converts the strings into transla- 
tion l)a.tterns. The l)roblem here is that we need 
to generate bilingual translation l)al;terns from 
monolingua\] strings. We use heuristic rules on 
borr0wing word s from foreign \]angu ages ..1 
l!'igure 4 is an example of conversion rides tbr 
generating English-Jal)anese translation pa.t- 
terns. To give an exa.mple, "to open tile" in 
figure 3, whose SC is vT, is converted into the 
following patterns in accorda.nce with the sec- 
ond rule in figure 4. 
Figure d: conversion rules for generttting trans- 
lation l)a.tterns 
4 Evaluation 
VVe have tested our algorithln in building a 
doma.in dictionary and malting a. translation 
with it. A corpus used in the exl)eriment is a 
COml)uter nlanual comprising 167,023 words (in 
22,0d i sentences). 
The corl)us contains 24,7137 n-grooms which 
appear more than twice. Among them, 7,6116 
strings are extracted over the entropy threshold 
1. Table 3 is a list of top 20 strings (except 
for single words and function word sequences) 
retrieved from the test c()rptlS. 
These strings a.re c~tego:rized into 1,239 POS 
patterns. Table 4 is a. list of to I) 10 POS l)at;- 
terns aim the numl)ers of strings classitied into 
thenl, hi this experiment, the top 10 POS pat- 
terns a.ccount for dg.d % of a.ll 1'OS patterns. It 
substantiates the fa.ct that the retrieved strings 
tend to concentr~te in certa.in POS patterns. 
VP ~--open \[I:NP\] 
VP *--\[I:NP\] :~(dobj) open ~7~("do") 
2 Note that tile POS templates are strongly dependent 
on tile features of n-gram strings. 
a ,.,, causes tile resulting RP, to match 0 or more rep- 
etitions of the preceding I{E. "+" causes the resulting 
RE to match I or more rel)etitions of the preceding RI!'. 
"1:" creates a RE exl)ression that will match either right 
o,: left of "l"- "(...)" indicates the start and end of ~L 
group. 
4 In Japanese, foreign words, especially in technical 
terminology, are often used as they are in katakana (tiLe 
phonetic spelling for foreign words) followed by function 
words which indicate their parts of speech For example, 
English verbs are followed by "suru", a verb wliich means 
"do" in English. 
frcq POS 
1886 
553 
368 
229 
160 
158 
121. 
1.08 
101 
81 
Wol:d 
word word 
art word 
art word word 
word prep 
word art 
word word word 
to word 
prep art word 
prep word 
Table 4: top 10 P()S p~tterns 
785 
.lI(str) freq(atr) .st," IIH(, , -) freq(str) str 
5.51 
4.48 
4.4:6 
3.92 
3.79 
3.76 
3.67 
3.58 
3.56 
3.55 
247 
1499 
100 
106 
163 
309 
297 
36 
169 
180 
see also 
the server 
click OK . 
use this function 
the function 
the following 
the file 
in the Server Manager , 
using the 
CGI programs 
3.55 
3.54 
3.46 
3.46 
3 A4 
3.36 
3.29 
3.23 
3.22 
3.22 
552 
209 
209 
168 
172 
192 
132 
213 
71 
575 
the client 
use tim 
the user 
click the 
the catalog agent 
the request 
on page 
a specified 
if you want to 
your server 
Table 3: top 20 strings 
In the matching process, we prepared 15 tem- 
plates and 6 SCs. Table 5 is a result of SC 
identification. 2,462 strings(32.3 %) are not 
lnatched to any templates. The table indicates 
that most strings retrieved in this method are 
identified as N and NP. It is quite reasonable 
because the majority of the technical terms are 
supposed to be nouns and noun phrases. 
improved in parsing 104 
improved in word selection 467 
about the same 160 
same 21.2 
not imt)roved 57 
total 1000 
SC number of patterns 
NP 
N+prep 
VP 
VP+prep 
VT 
V 
722 
200 
32 
10 
177 
78 
Table 5: result of SC identification 
The retrieved translation patterns total 
1,21.9. Figure 5 shows an example of transla- 
tion patterns retrieved by our method. 
We, then, converted them to an MT dictio- 
nary and made a translation with and without 
it. Table 6 summarizes the evaluation results 
translating randomly selected 1.,000 sentences 
fi'om the test corpus. Compared with the trans- 
lations without the dictionary, the translations 
with the dictionary improved 571 in parsing and 
word selection. 
Figure 6 illustrates changes in translations. 
Each column consists of an input sentence, a 
translation without the dictionary, and a trans- 
lation with the dictionary. Bold English words 
Table 6: Translation evaluation results 
correspond to underlined a apanese. 
First two examples show improvement in 
word selection. The transl ations of" map(verb)" 
and "exec" are changed from word-for-word 
transla.tions to non-translation word sequences. 
Although "to make a map" and "exective" are 
not wrong translations, they are irrelevant in 
the computer manual context. On the contrary, 
the domain dictionary reduces confltsion caused 
by the wrong word selection. 
Wrong parsing and incomplete p~rsing are 
also reduced as shown in the next two exam- 
ples. In the third example, "Next" should be a 
noun, while it is usually used as an adverb. The 
domain dictionary solved the syntactic ambi- 
guity properly because it has exclusive priority 
over system dictionaries. In the forth example, 
"double-click" is an unknown word which could 
cause incomplete parsing. But the phrase was 
parsed as a verb correctly. 
The last one is an wrong example of Japanese 
verb selection. That was a main cause of er- 
rors and declines. The reason why the un- 
desirable Japanese verbs were selected is that 
786 
NP *- fiflly-qualified domain name 
NP ~ text search engine 
NP ~ access log for\[1 :NP\] 
VP *-- save \[I:NP\] 
V ~ deallocate 
NP +-- fully-qualified domain name 
NP ~ text search engine 
NP ~- \[I:NP\] (/)("of") access log 
VP ~ \[I:NP\] ~(dobj)save "-4-7-o("do '') 
V ~ deallocate 71"~("do") 
l!'igure 5: tile retrieved transla.tion patterns 
Type the URL prefix you want to nmp. 
&tgtztaqnap \[.,("perform mal~¢Zt,'~URL prefix ~"('ff\[..C'a2L:kl,~'o 
The exee tag allows an IITML file to execute an arbitrary progranl on the server; 
~("exeetive's lag") \[~+)---z {---~ HTML 7741bJa{{fc,~tgJr21 q~.la{ 
exec ~ It: IITML 7741bh~ server 0){\]~,-~¢g'fft:lq~Ja~gt~{~gT~O){~ag ; 
Type the full name of your server, and then click Next. 
.:.,~ It_ @~j~) i~tgtza)+Y--/l--cDB~tg~*jd-Jb-C~,Jv~btg~bXo 
~t3tz(T) server O)~tg~@~4-3U~ Next ,~ click I~t3~U~o 
Go to the Control Panel and dot,ble-elick the Services icon. 
Cont,ol l'anel .'x{~ta2;~t,x, ~5-eJ-a%\[~- ~tE.("double-") \[~ Services 74n>~ 
p IJ'yO~j-7~ ("elicld') o 
Control Panel -'x{~'g Services icon ,~double-click L,("double-click")td2~b~o 
Selling additional document directories 
~I~\]N0) F'z~)tb'b-~4 D~JbIJ"~N<("put, place") 7"~ 
:~\]JlllY) document \[Z directory ~gT~("assi~~U_~ 
Figure 6: example sentences in the test corl)us 
the method added deta.ult semantic intbrmation 
to the retrieved nouns and noun phrases. We 
hope to overcome it by a. model tha.t cla.ssilies 
noun pllrases, for example using verb-noun or 
a,djective-n ou :n relation s. 
5 Related work 
As mentioned in section 1, there are two ap- 
proaches in corpus-based technica.l term re- 
trievah a rule-based approach and a statistical 
a~pproach. Major ditlhre:nces between the two 
3,re: 
• the former uses a tagged corlItls while the 
latter uses an untagged one. 
• the former retrieves words and phrases with 
a designated syntactic category while the 
bttter :retrieves that with various syntactic 
categories at the same time. 
Our method uses the latter ~pproa, ch because 
we think it more practical both in resources and 
in applications. 
For colnparison~ we refer here to Smadja's 
method (1993) because this method and the 
proposed method have much in connnon. In 
both cases, technicaJ terms are retrieved from 
a.n untagged corpus with n-gram statistics and 
given syntactic categories for NI,P applica.tions. 
The methods are diflhrent in that Sma.dja uses a 
787 
parser for syntactic category identification while 
we use POS templates. A parser may add more 
precise syntactic category than I?OS templates. 
However, we consider it not to be critical under 
the specific condition that the variety of input 
patterns is very small. In terms of portability, 
the proposed method has an advantage. Actu- 
ally, adding POS templates is not so time con- 
suming as developing a parser. 
We have applied the translation patterns re- 
trieved by this method to a real MT system. 
As a result, 57.1. % of translations were im- 
proved with 1,219 translation patterns. To our 
knowledge, little work has gone into quantify- 
ing its effectiveness to NLP applications. We 
recognize that the method leaves room for im- 
provement in making translation patterns. We, 
therefore, plan to introduce techniques for find- 
ing translational equivalent from bilingual cor- 
pora (Me\]amed, 1998) to our method. 
6 Conclusion 
We have presented a method for identifying 
technical terminology and building a domain 
dictionary tbr MT. Applying the method to 
technical manuM in English yielded positive re- 
suits. We have found that the proposed method 
would dramatically improve the performance of 
translation. In the future work, we plan to in- 
vestigate the availability of POS patterns which 
are not categorized into any SCs. 

References 

Abeille A., Schabes Y., and .loshi A. K. 
1.990. "Using Lexicalized Tags for Machine 
Translation". In Proceicdings of the lnticrna- 
tional Gbnficricncic on Computational Linguis- 
tics(COLIN@, pages 1-6. 

Argamon, S., l)agan, I., and Krymolowski, 
YuvM. 1999. A Memory-Based Approach 
to Learning Shallow Natural Language Pat- 
terns. In Procicedirtgs of the 17th COLING 
and the 36th Anmtal Meeting of A CL, pages 
67-73. 

Cardie, C. and Pierce, D. 1.999. The Role of 
Lexicalization and Pruning for Base Noun 
Phrase Grammars In Proceedings of the 
16th National Conference on Artificial Inticl- 
Iigencc, pages 423-430. 

Haruno, M., Ikehara, S., and Yamazaki, T. 
1996. Learning BilinguM Collocations by 
Word-Level Sorting. In Proceedings of the 
16th COL1NG, pages 525 530. 

Melamed,I.D. 1998. Empirical Methods for MT 
Lexicon l)evelopment In Gerber, L. and Far- 
well, 1). Eds. Machine IYanslation and the 
Information Soup, Springer-Verlag. 

Ramshaw, L.A., and Marcus, M.P. 1995. 
Text; Chunking using Transformation-Based 
Learning In P~vcccdings of the 3rd Workshop 
on Very La,~qic Corpora , pages 82-94:. 

Shimohata,S., Sugio,T., and Nagata,J. 1997. 
Retrieving Collocations by Co-occurrences 
and Word Order Constraints. In Proceedings 
of thic 35th Annual Mcicting of ACL, pages 
476-481.. 

Shimohata, S. et al. 1999. "Machine Trans- 
lation System PENSEE: System Design and 
Implenlentation," In 1)roicicedings of Machine 
Translation Summit VII, pp.380-384. 

Smadja,l?.A. 1993. Retrieving Collocations 
fl'om Text: Xtract. In Cbmputational Lin- 
guistics, 19(1), pages 143 177. 

Takeda K. 1996. "Pattern-Based Context-lhee 
Grammars for Machine Translation". In Pro- 
ceedings of the 3/tth Annual Meeting of A CL, 
pages 14:4-151. 
