Representation and Processing of Chinese Nominals and 
Compounds 
Evelyne Viegas, Wanying Jin, Ron Dolan and Stephen Beale 
New Mexico State University 
Computing Research Laboratory 
Las Cruces, NM 88003, USA 
vi egas, wanying, ton, sb©crl, nmsu. edu 
Abstract 
1In this paper, we address representation issues of 
Chinese nominals. In particular, we look at lexical 
rules as a conceptual tool to link forms with the same 
semantics as is the case between nominalisations and 
the forms they are derived from. We also address 
Chinese compounds, illustrating how to recover im- 
plicit semantic relations in nominal compounds. Fi- 
nally, we show how to translate Chinese nontinals 
within a knowledge-based framework. 
1 Introduction 
In this paper, we present results of a theoretical and 
an applied investigation, within a knowledge base 
framework, on the building and processing of com- 
putational semantic lexicons, as reflected by experi- 
ments done on Spanish, English and Chinese, with a 
large scale application on Spanish. The multilingual 
dictionaries making process (Viegas and Raskin, 
1998) has been tested and attested for Mikrokos- 
mos, a machine translation system (Nirenburg et 
al., 1996) from Spanish and Chinese to English. 2 
Here, we focus on Chinese nominals and compounds 
in terms of representation and processing. 
In Section2, we briefly present the information 
carried inside Mikrokosmos lexicons. In Section 3, 
we show how a semantic-based transcategorial ap- 
proach is best fitted to account for nominalisations 
and their derived forms. Formally, we use the con- 
ceptual tool of lexical rules as described in (Viegas 
et al., 1996). In Section 4, we address the Irans- 
lation of Chinese nominal compounds into English 
using semantic information and word order informa- 
tion. We show the advantage of a transcategorial 
approach to lexicon representation and investigate 
some trade-offs between an interlingua and transfer 
approach to nominal compounding. 
XThis work has been supported in part by DoD under 
contract number MDA-904-92-C-5189. 
2The interested reader can visit the Mikrokosmos site 
at http://crl.nmsu.edu/Research/Projects/mikro/. 
2 A Brief Overview on the Structure 
of Mikrokosmos Lexicons 
In Mikrokosmos, the lexical information is dis- 
tributed among various levels, relevant to phonol- 
ogy, orthography, morphology, syntax, semantics, 
syntax-semantic linking, stylistics, paradigmatic and 
syntagmatic information, and also database type 
management information. 3 
Each entry consists of a list of words, stored in the 
lexicon independently of their POS (the verb and 
noun form of walk are under the same superentry). 
Each word meaning is identified by a unique iden- 
tificator, or lexeme (Onyshkevych and Nirenburg, 
1994). Homonyms and all meaning shifts of polyse- 
mous words are listed under one single superentry. 4 
We illustrate in Figure 1 relevant aspects, for this 
paper, of a lexicon entry via the description of two 
senses of the Chinese word ~ (activity): WorkAc- 
tivity and Exercise, which are well defined sym- 
bols or concepts in the Mikrokosmos ontology as de- 
scribed in (Mahesh, 1996). 
Word meanings in Mikrokosmos are represented 
partly in the lexicon and partly in the ontology. We 
have strived to achieve an intermediate grain size of 
meaning representation in both the lexicon and the 
ontology: many word senses have direct mappings 
to concepts in the ontology; many others must be 
decomposed and mapped indirectly through compo- 
sition and modification of ontological concepts. We 
have developed a set of guidelines and a training 
methodology that results in acceptable quality and 
uniformity in lexical and ontological representations 
(Mahesh, 1996; Viegas and Raskin, 1998). In prin- 
ciple, the separation between ontology and lexicon 
is as follows: language-neutral meanings are stored 
in the former; language-specific information in the 
latter. 
We keep the number of concepts well below the 
number of lexical items for a given language, such 
aDeLails on these zones can be found in (Viegas and 
Raskin, 1998; Meyer et al., 1990). 
4See (Weinreich, 1964), (Fillmore, 1971), (Cruse, 
1993), (Pustejovsky, 1995) for interesting accounts on 
homography/polysemy. See (Bouillon et al., 1992), 
(Busa, 1996) for accounts on nominals. 
20 
'~-N1 
"cat: N .\] dfn: 
events concerning the jobs people do eng.transl: 
activity 
\[root: \[~\]1 syn: \[semRep: o~J 
semRep: \[~ WorkActivity\] 
:~-N2 cat: 
dfn: 
eng.transl: 
syn: 
.semRep: 
N 
activity for nlalntalnlng health 
exercise 
Figure 1: Partial Entry for the Chinese word N~. 
that, for instance, the concept Ingest can be lex- 
icalised as ~; (eat.) or ~ (drink) according to the 
constraints put in the lexicon on the theme: Food 
for eat and Liquid for drink respectively. 
3 Role of Lexico-Semantic Rules 
This section deals with the use of morpho-semantic 
lexical rules (MSLRs) in the process of large-scale 
acquisition. The advantage of MSLRs is twofold: 
first, they can be considered as a means to reduce the 
number of lexicon entry types, and generally to make 
the acquisition process faster and cheaper; second, 
they can enhance the results of analysis processing 
by creating new entries for unknown words from the 
lexicon, found in corpora. Lexical rules have been 
addressed by many researchers. Here we apply Vie- 
gas et al. (1996) methodology to Chinese. 
Briefly, applying MSLRs to the Spanish entry 
comprar (buy), our MSLR generator produced au- 
tomatically 26 new entries (comprador-N1 (buyer), 
comprable-Adj (buyable), etc). This includes cre- 
ating new syntax, semantics and syntax-semantic 
mappings with correct subcategorisations and also 
the right semantics; for instance, the lexical entry for 
cornprable will have the subcategorisation for pred- 
icative and attributive adjectives and the semantics 
adds the attribute "FeasibilityAttribute" to the ba- 
sic meaning "Buy" of comprar. The form list gen- 
erated by the morpho-semantic generator is checked 
against MRDs, dictionaries and corpora: only the 
forms found in them are submitted to the acquisi- 
tion process. 
MSLRs constitute a powerful conceptual tool to 
extend a core lexicon from a monolingual viewpoint. 
We applied the same methodology to Chinese. The 
rules are language independent, what is language de- 
pendent is the morphemes to which they can apply. 
For Chinese, we do not have to worry about develop- 
ing a morpho-semantic generator as the productiv- 
ity in morphology is poor, if one excepts compounds 
characters in which semantics is not compositional 
(see the example of ~ (glucose) below). In this 
latter case, we acquired the entry manually. So rules 
are used to link nominalisations to verbs, and vice 
versa, meaning that once verbs have been acquired, 
nominal derivations can be produced automatically 
using rules. 
(1) a. ~-Vl(affirm) 
b. j~-Nl(affirmation) 
We present below the entry for ~-N1 (affirma- 
tion) after the application of the LR2event rule on 
~-V1 (affirm). 
#0= lkey:~, gram:lpos: N\], semRep:#sem, synSenn:\[gram:#ol, semRep:#t\], \[grarn:#o2, semRep:#a\], 
lexRule: LR2event\[root :\[key: ~ ~, gram:\[pos:V, subc:NPVNP\], 
semRep:#sem=\[narne:Assert, agent:#a, therne:#t\]\], vn:#0\]\] 
In our corpus (from the Chinese newspaper Xin- 
hun Daily), we found that 166 nouns could be de- 
rived from 351 verbs; that is, almost 47% of verbs 
can produce nouns. From an acquisition viewpoint, 
it is cheaper to use the mechanism of lexical rules to 
automatically produce nouns from verbs, with the 
same semantics, this is due to our transcategorial 
approach to semantics, where the same piece of se- 
mantics can be lexicalised as either a Noun or Verb. 
4 A Transcategorial Approach 
Compounding in Chinese is a common phenomenon 
(Jin, 1994; Jin and Chen, 1995; Palmer and Wu, 
1995). It is mainly used to combine i) characters 
whose semantics is different and non compositional, 
and ii) sequences of nouns. 
In i) we create entries for single characters and 
entries for combined characters, (e.g., (2)): 
(2) a. ~j (grape) 
b.   (sugar) 
c. ~j~j~ (glucose) 
In this paper we are concerned with ii) only. In 
the following, we investigate three ways to trans- 
late Chinese nominal compounds into English, using 
word order information, semantic information and 
co-occurring information in syntactic, semantic and 
transfer approaches, respectively. 
4.1 A. Syntactic A.pproach 
Compounds proliferate in Chinese. The head of 
the compound seems to be easily identified as the 
last noun in the sequence, and therefore in the 
task of translating Chinese compounds into English 
compounds, where English also makes use of com- 
pounds as opposed to say French, one could adopt a 
transfer-based approach, where each Chinese noun 
is translated into English in the same sequence: 1~ 
~ (application software); ~$.~ q~ ~.~ (data 
management system). It gets a bit more complex 
21 
when there is a large sequence of nouns in English, 
whereas it is still acceptable and normal ill Chi- 
nese. In our corpus we found compounds contain- 
ing up to 6 nouns: ~ ~ ~ t~1~ ~ 
(~ (military theory test database management sys- 
tem) (the management system of database for test- 
ing military theory). In these cases, it is difficult 
to comprehend the compound in English and some 
"linking information" is needed to break the com- 
pound and make it understandable in English. This 
is where the semantics comes in, as one needs to un- 
derstand the underlying relationships between the 
nouns, and identify "sub-heads" inside the Chinese 
compounds, which will become the heads of En- 
glish smaller compounds linked via relations. For 
instance, in ~t1~ ~ ~ ~ '~ ~ (mili- 
tary theory test database management system)(the 
management system of database for testing military 
theory) one might want to "break" the Chinese com- 
pound into smaller English compounds "manage- 
ment system," " database" and "military theory" 
with a relation "test" between the last compounds 
(the management system of database for testing mil- 
itary theories). 
4.2 A Lexlco-Semantic Approach 
We now show examples of how the semantics can 
help identify sub-heads inside the Chinese compound 
(the head of the Chinese compound is the last noun). 
Second, we show how a transcategorial approach can 
help go from an NN compound ~: ~ (economy 
policy) in Chinese to AdjN constructions (economic 
policy) in English. Finally, we show how nominal 
mismatches are dealt with as a generation issue. 
For illustration purposes, we will mainly consider 
compounds composed of two nouns; however, this 
semantic approach applies to more than 2 nouns. 
Lexemes can be mapped to Objects (O) (';Car" 
car), Events (E) ("Explode" explosion), Relations 
(R) ("Utilizes" use) or Attributes (A) ("ColourAt- 
tribute" colour). In the case of NNs, we have 14 
combinations allowed (RR and AA do not seem to 
co-occur), where E, O and R can be heads, with the 
following hierarchy of headhood: 
E>R>O 
When the semantics of the NN is expressed with 
a combination of identical types (e.g. EE or OO), 
the semantic analyser must score the constraints be- 
tween the two nouns to find the head. Sometimes it 
is possible to find the semantic relation linking the 
two nouns in the ontological entry of the nouns, as 
in the example OO below. 
(O0) Object - Object 
\[np \[mod ~\[L (n, ji4suan4jil, computer, Computer)\] 
\[n ~ 7~ (n, ji.lshu4, technology, Technology)\]\] 
Here, both nouns are typed as O, and therefore we 
need a mechanism to assign tile head. The genera- 
tor must identify the underlying relation between the 
Os. This can be done by searching for a relation R 
in the ontology shared by" the 20s, such as "applied- 
to" with a domain which is in an ISA relationship 
with "technology" and a range also in an ISA rela- 
tionship with "computer". Needless to say that this 
approach is knowledge intensive, and in case we do 
not have this type of knowledge encoded we rely on a 
transfer-based approach following tlle Chinese word 
order. Here, we could successfully generate technol- 
ogy about computer and computer technology, with a 
preference on the latter. 
(OR) Object - Relation 
lnp \[mod ~: (n, hang2, business, AreaOfExpertise)\] 
\[n ~ (n, zhang3, leader, HeadOf)\]\] 
"HeadOf" is a relation and therefore the head, 
as the other noun is an O. The generator can lex- 
icalise this as leader of business or business leader 
via a rule; the latter is assigned a preference in ab- 
sence of modifiers such that we can still generate 
the leader of a big business instead of big business 
leader. Note that we do not need to use the hierar- 
chy in the case of only two Ns to identify the head 
because, the head is the last noun in a Chinese com- 
pound; we showed this example, in case it entered 
in a larger compound such as "business leader ma- 
jor office" where one might want to break it as "the 
major office" of "the business leader". 
(EA) Event - Attribute 
\[np \[mod \[I~ (n, gonglzuo4, work, ~VorkActivity)\] 
\[n ~ (n, fanglshi4, style, StyleAttribute)\]\] 
Here, E is the head and this semantics is lexi- 
calised as way of working or work style, with again 
a preference on the latter. 
(OR) illustrates our transcategorial approach: 
\[np lmod ~g (n, jinglji4, economy, Economy)\] 
\[n ~ (n, xiao4)'i4, benefit, BenefitFrom)\]\] 
Here is a case where our transcategorial approach 
to lexicon representation helps in generating an 
AdjN construction economic benefit for an NN Chi- 
nese compound; this is due to the fact that both 
economy and economic share the same semantics, 
and thus the generator will present both possibil- 
ities; moreover, they co-occur in English whereas 
economy and benefit do not. The head is easily iden- 
tified in R "BenefitFrom" and as such the compound 
could also be generated as benefit to economy. 
(OEE) illustrates a phrase 
lap lmod 
\[mod ~t~\]~ (abbr, kelji4, science&technology, Science)\] 
\[n ~ ~#~ In, gonglguanl, attack-key-problem, Solve+Att)\]\] 
\[n ~J~,~ (n, ji4hua4, plan, PlanningEvent)\]\] 
22 
This NNN compound presents a case of mismatch 
between Chinese and English, it can be paraphrased 
as: plan to solve key problems in science and technol- 
ogy. Here, a transfer-based approach would fail to 
translate adequately, as ~ (attack-key-problem) 
must be expressed as an expression equivalent to 
solving important problems, and as such the follow- 
ing English compound science technology solving key 
problem plans must be broken into smaller com- 
pounds with explicit relations between them. s 
These examples illustrate why a semantic ap- 
proach is preferable, and sometimes necessary, to 
translate Chinese compounds into English. How- 
ever, as discussed earlier, 1) this approach is knowl- 
edge intensive, and 2) English compounding seems 
to follow the same Chinese word order regularly 
enough so that we consider using a transfer approach 
as a back-up to generation. 
4.3 A Transfer-based Approach 
Semantics can be expensive to use so we also rely 
on a transfer-based approach as a back-up method 
when semantics fails to give us the semantic relation 
between the nouns. We can do this because English 
allows compounding (whereas for French and Span- 
ish, a transfer approach would be more problematic 
as compounding is not as productive and relations 
must be identified). However, as we noticed previ- 
ously, it can become difficult in English to get the 
meaning of a large compound, it is therefore better 
to "break" the compound into 2 or 3 compounds. 
We hope to bypass part of this problem by using 
co-occurring information in a transfer approachfi 
\[np 
\[mod ~.~JL (n, ji4suandjil, computer, Computer)\] 
lmod \[mod 
\[mod ~J~ (n, juntshi4, military-affairs, MiiitaryActivity)\] 
\[n ~i~ (n, li31un4, theory, Theory)\]\] 
\[n :9~ (n, kao3he2, test, Examination)\]\] 
\[n ~J~ (n, ti2ku4, database, Database)\]\] 
computer database for test of ~heory of military 
affairs 
In this case, only co-occurring information will sig- 
nal the generator to link "computer" to "database" 
to produce "computer database"; this information 
must be encoded in the lexicon, as we show in next 
section. 
\[np 
lmod lmod 
\[rood 
\[mod ~i~ (n, junlshi4, military-affaim, MilitaryAetiwty)\] 
Sin "Solve+Attitude", Attitude reflects the impor- 
tance attached to the event. 
6We saw that in a semantic approach the headhood 
hierarchy provides a good clue to break a compound. 
\[n ~ (n, li31un4, theory, Theory)\]\] 
\[n :~J~ (n, kao3he2, test, Examination)\]\] . 
\[n ~J\[~ (abbr, ti2ku4, text database, Database)\]\] \[np 
\[mod ~ (n, guan31i3, management, ManagementAetivity)\] 
\[n ~ (n, xi4tong3, system, System)\]\]\] 
Following the Chinese word order seems to be ac- 
ceptable in English, to produce military theory test 
database management system. However, a better 
translation might be the management system of a 
database for testing military theory, in which case, 
relations between nouns must be made explicit, us- 
ing the semantic information found in the ontological 
concepts in a semantic approach. 
5 Processing of Chinese Nominals 
and Nominal Compounds 
We utilise an efficient constraint-based control mech- 
anism called Hunter-Gatherer (HG) (Beale, 1997) 
to process Chinese nominals and compounds. This 
mechanism has been successfully applied to the anal- 
ysis of Spanish and generation of English. We refer 
to (Beale et al., 1995) for details on how the seman- 
tic analyser works, and (Beale et at., 1997) on how 
the generator works. 
In this paper, we are interested in showing how 
HG allows us to mark certain compositions as being 
dependent on each other: once we \]lave two lexicon 
entries that we know go together, from either syntac- 
tic, lexical, or semantic viewpoints, HG will ensure 
that they are correctly treated. HG gives preference 
to words which "co-occur" together, from any of the 
above viewpoints. The analyser simply needs to de- 
tect the co-occurrence and add the constraint that 
the corresponding senses be used together. 
In the case of "computer database," the lexicon 
entry for "database" encodes the syntagmatic re- 
lation (LSFSyn) which keeps the semantics of the 
nouns compositional and signals the processor (anal- 
yser or generator) to consider the nouns as syntac- 
tically linked: 
#O=\[key: "~ \]~", 
reh lsyntagmati¢: LSFSyn \[base: #0, 
co-occur: \[key: " ~ ~\[ ~", sense: n i, ...\]\]\]\] 
We provide below the example of a Chinese sen- 
tence, its English translation and relevant parts of 
the result of the semantic analysis, showing the anal- 
ysis of the compound ~ ~ "tackle-key-problem". 
Chinese Sentence Example i,~ 4` ~ ~ ~ I~1 
m~.~,~ J-:}.¢ ~ \[\] 3 4 4`$ff~ Z 0 0 0 ~ 
23 
English Literal Transla- 
tion This classifier attack-key-problem big project 
by State-Maritime-Bureau direct whole country 34 
classifier units adjective-marker 2000 more classifier 
scientific-and-technical personnel participate attack- 
key-problem, is one classifier including 7 classifier 
tasks, 39 classifier special topics adjective-marker 
large-scale engineering application project. 
English Translation This project which deals 
with important problems, directed by the State Mar- 
itime Bureau and in which participated more than 
2,000 scientific and technical personnel from 34 units 
throughout the country, was a large-scale engineer- 
ing application project including seven tasks and 39 
special topics. 
Partial Text Meaning Representation 
SOLVE-219 
LOCATION : SPEAKER 
TIME : SPEAKER-TIME 
RELATION : R.ESEARCH-216 
STRENGTH-ATTRIBUTE: 0.9 
THEME : PROBLEM-220 
PROBLEM-220 
THEME.OF : SOLVE-219 
ATTITUDE-225 
ATTRIBUTED-TO : SPEAKER 
ATTITUDE.VALUE : 1 
TYPE : SALIENCY 
ATTITUDE.SCOPE : PROBLEM-220 
6 Conclusions - Perspectives 
In this paper, we showed the advantage of adopt- 
ing a transcategorial (semantic-based) approach to 
relate verbs with their nominalisations. We showed 
how to use lexico-semantic rules to relate different 
forms carrying the same semantics. These rules can 
be applied at run time in analysis, thus facilitating 
a syntactico-semantic recovery for unknown words. 
Concerning compounds, we have shown that we can- 
not avoid a semantic approach if we want a high 
quality translation, because of the number of nouns 
which can enter into a Chinese compound making 
it difficult to get the meaning of the compound in 
English. Thus, breaking the compound necessitates 
an understanding of the Chinese compound. How- 
ever, we have suggested transfer-like approach for 
Chinese to English translation with the use of co- 
occurrences to "signal" privileged lexical links (com- 
puter database). 
We have illustrated that by considering tile in- 
formation in the lexicon as constraints, the linguis- 
tic difference between pure compositionality and co- 
occurrent information becomes a virtual difference 
for ttG. 

References 
S. Beale, S. Nirenburg and K. Mahesh. 1995. 
Semantic Analysis in the Mikrokosmos Machine 
Translation Project. In Proc. of the $nd Sympo- 
sium on NLP, Bangkok. 
S. Beale. 1997. HUNTER-GATHERER: Applying 
Constraint Satisfaction, Branch-and-Bound and 
Solution Synthesis to Computational Semantics. 
Ph.D. Diss., Carnegie Mellon University. 
S. Beale, E. Viegas and S. Nirenburg. 1997. Break- 
ing Down Barriers: The Mikrokosmos Genera- 
tor. In Proc. of the NLP Pacific Rim Symposium, 
Phuket. 
P. Bouillon, K. BSsefelt and G. Russell. 1992. Com- 
pounds Nouns in a Unification-Based MT System. 
In Proc. of the 3rd ANLP, Trento. 
F. Busa. 1996. Compositionality and the Semantics 
of Nominals. PhD. Disser. Brandeis University. 
A. Cruse. 1993. Towards a Theory of Polysemy. 
Building Lexicons for Machine Translation. TR. 
SS-93-2, Stanford: AAAI Press. 
C. Fillmore. 1971. Types of Lexical Information. In 
Steinberg and Jakobovitz. 
W. Jin and L. Chen. 1995. Identify Unknown 
Words in Chinese Corpus. In Proc. of the 3rd 
NLP Pacific-Rim Symposium, Vol. 1, Seoul. 
W. Jin. 1994. Chinese Segmentation Disambigua- 
tion. In Proc. of COLING, Japan. 
K. Mahesh. 1996. Ontology Development: Ideol- 
ogy and Methodology. TR. MCCS-96-292, NMSU, 
CKL. 
I. Meyer, B. Onyshkevych and L. Carlson. 1990. 
Lezicographic Principles and Design for KBMT. 
TR. CMU-CMT-90-118, CMU. 
S. Nirenburg, S. Beale, S. Helmreich, K. Mahesh, 
E. Viegas, and R. Zajac. 1996. Two principles 
and six techniques for rapid MT development. In 
Proc. of the 2nd AMTA. 
M. Palmer and Z. Wu. 1995. Verb Semantics for 
English-Chinese Translation. In Machine Trans- 
lation, Volume 10, Nos 1-2. 
J. Pustejovsky. 1995. TILe Generative Lexicon. MIT 
Press. 
E. Viegas, B. Onyshkevych, V. Raskin and S. Niren- 
burg. 1996. From Submit to Submitted via Sub- 
mission: on Lexical Rules in Large-scale Lexicon 
Acquisition. In Proc. of the 34th ACL, CA. 
E. Viegas and V. Raskin. 1998. Computational 
Semantic Lexicon Acquisition: Methodology and 
Guidelines. TR. MCCS-98-315. NMSU: CRL. 
U. Weinreich. 1964. Webster's Third: A Critique of 
its Semantics. In International Journal of Amer- 
ican Linguistics 30: 405-409. 
