I 
I 
I 
I 
i 
I 
i: 
AN ANNOTATED CORPUS IN JAPANESE USING 
TESNI\] RE'S STRUCTURAL SYNTAX 
Yves LEPAGE§, ANDO Shin-Ichi t, AKAMINE Susumu~, IIDA Hitoshi§ 
§ATR Interpreting Telecommunications Research Labs, Kyoto, Japan 
{lepage, iida}@i~l, air. co.jp 
NEC, ~C&C Media and :~Human Media Research Labs, Kanagawa, Japan 
ando@ccm, cl. nec. co. jp, akamineelml, cl. nec. co. jp 
INTRODUCTION 
TesniAre's attention to covering a maximal num- 
ber of syntactic phenomena explains the im- 
pressive number of languages - "timco hominem 
unius linguae" - cited in the El~raents de syn- 
taxe structurale. Although Japanese is correctly 
classified as a strongly centripetal language ac- 
cording to linear survey (relevd lin~aire, p. 33), 
no examples of Japanese are cited. Conse- 
quentl); we have endeavored to apply Tesni~re's 
ideas to Japanese by manually constructing the 
linguistic structures for more than six thousand 
sentences of a corpus of hotel reservation con- 
versations. 
In fact, Tesni~re's grammatical ideas, and 
among them, the most original ones, fit well to 
Japanese as they give simple and insightful de- 
scriptions of some usually controversial gram- 
matical phenomena (ergative constructions, na- 
adjectives). 
After describing the different types and cate- 
gories of words, we will focus on the three phe- 
nomena to which, according to Tesni~re, all syn- 
tactical phenomena reduce: connection, junc- 
tion and transference. From the representa- 
tional point of view, we will introduce corre- 
spondence intervals to code which part of the 
surface text corresponds to which nodes or sub- 
trees. 
1 WORDS 
We have taken the character (kana or kanji), 
which is the physical unit of a Japanese text, as 
the unit of measure of the length of a section of 
text. With the convention of starting at posi- 
tion O, we locate any piece of text, and hence 
words, using an interval notation. Note that 
there is no word separator (or blank spaces) in 
Japanese. In the following sentence 1, the word 
t~m is located by the interval \[3_5\] and the word 
~.'C by \[6.9\]. This notation will be used in 
correspondences (Section 2.2). 
11 ~ 12 ~ 13o 
Could I get a room upstairs ¢. 
1.1 Species and Categories of Words 
The differentiation between: content words, 
which are associated with a concept, and func- 
tion words, which express syntactical informa- 
tion was not difficult to apply to Japanese. 
1.1.1 Content Words 
Some examples of content words include :~.~ 
(yoyaku, reservation), ~L~ (okureru, to be 
late), ~ (takai, expensive), ~ (tyokusetu, 
directly). Tesni~re distinguishes between two 
categories of content words: processes and sub- 
stances, which are, for explanation purposes, 
usually exemplified by verbs and nouns, respec- 
tively, in Indo-European languages. This is also 
consistent with Japanese. 
These two categories are in turn divided into: 
concrete and abstract categories, which opposes 
the concrete notion of processes and substances 
to their abstract attributes, and gives rise to the 
following categorisation for content words (see 
also (Starosta 88), Tesni~re's notations is shown 
in capitals). 
concrete 
abstract 
substmlces processes 
substantive verbal 
O ! 
adjectival adverbial 
A E 
l Except when mentioned, examples axe from tile tree- 
baak. 
109 
It is to be noted that, in the case of Japanese, 
two categories of words are variable in relation 
to aspect and negation: abstract substances 
(A) and concrete processes (I), which are re- 
spectively (i-)adjectives and verbs in terms of 
Japanese grammars. 
Now, some classes of words, which pose prob- 
lems in Japanese grammar books written in En- 
gush, such as the so-called na-adjectives (W~" 
(sizuka, quiet)), and the Sino-Japanese nouns- 
verbs formed in conjunction with use of the 
Japanese verb -J-70 (suru, to do), can easily 
be categorised as nouns (O). This is consistent 
with w'hat is taught in Japanese schools, (see 
Appendix B), their syntactical behaviour be- 
ing prefectly described by transference (see Sec- 
tion 4). 
1.1.2 Function Words 
Grammatical tools, the role of which is to ei- 
ther make explicit, or change the category of a 
content word, or to define relationships between 
words, are called function words. These words 
will appear in eztenso in structural representa- 
tions. 
In Japanese, many can be easily identified, 
such as, 7~ (ga, t~, nominative case post° 
particule), 69 (no, J~{~'J, genitive case post- 
particle), 69"~ (node, ~l~'J, equivalent to 
subordinate conjunction), 7)~ (ka, ~.~, end of 
interrogative sentence particle), 3"70 (suru, +)" 
"~B~, support verb for Sino-Japanese nouns), 
etc. 
Of course, some function words can also be 
content words in a different context. For in- 
stance, the verb "J-70 (suru), is either the sup- 
port verb for Sino-Japanese nouns, (a function 
word in that case), or the verb "to do" (a con- 
tent word). 
2 CONNECTION 
Tesni~re speaks of connection to describe the re- 
lations between words in a sentence in terms of 
their subordination relations. This concept in- 
cludes predicate-argument or governor-modifier 
relations as w-eU as predicate-circumstantial re- 
lations (Eldments, p. 14). 
The study of sentences, which is the 
proper object of structural syntaz is es- 
sentially the study of its structure, i.e. 
the hierarchy of its connections. 
2.1 Tree Representation: Stemmas 
Tesni~re was the first to propose, in 1934, 
to systematically use graphical representations 2 
which he called sternmas, for representing this 
hierarchy (Tesni~re 34). However, these stem- 
mas were more than simple trees. Although, 
we will show that the introduction of correspon- 
dence makes it possible to encode Tesni~re's rep- 
resentations using just trees. 
Basic connections are those which link con- 
crete notions with their abstract attributes 
(Figure 1). 
yasui hoteru hayaku 
cheap hotel quickly 
tukimasu 
arrive 
Figure 1: Basic connections. 
By replacing content words with their class 
(O, I, A, E) "virtual" stemmas (on the right) 
can be derived from the "real" ones (on the left). 
2.2 Correspondences 
To explicitly indicate which word, or more 
specifically, especially in the case of Japanese, 
which chunk of text corresponds to which node 
in the stemma, we adopted the use of correspon- 
dences (Boitet and Zaharin 88). 
We note two kinds of correspondence: 
• words-to-node, and 
• sentence parts-to-complete 
substring-to-subtree. 
subtree, or 
Constraints Correspondences are noted by 
intervals, as introduced above, and are governed 
by three constraints (Lepage 94). 
• global correspondence: an entire tree 
corresponds to an entire sentence; 
2He acknowledged that two Russian linguists used 
trees in 1930 to explain some syntactic phenomenon, but, 
unlike Tesni~re, the use of trees was not pivotal in their 
explanations. 
110 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
• inclusion: a subtree which is part of an- 
other subtree T, must correspond to a sub- 
string in the substring corresponding to T; 
• membership: a node in a subtree T, must 
correspond to words members of the sub- 
string corresponding to T. 
I 
12 ta 
0 0 0_2/0..2 3_6/3_6 
o~2 
tyousyoku 
:breakfast' 
narimasu 
:become' 
t:t 3 Y~U~" e t: r 
ha beturyoukin ni 
TOP 'extra charge' LOC 
Breakfast is not inchtded. 
Figure 2: A sentence and its associated stemma. 
In Figure 2, on each node of the stemma, 
two intervals stand for the words-node and the 
substring-subtree correspondences, in that or- 
der. The entire sentence 3 extends from 0 to 11, 
as indicated by the root. This root is a verb, de- 
noted as I, and is located in position 7 to 11: f.¢ 
19 ~ 3"- (narimasu). Similarly, the node labelled 
t: (hi) corresponds as a word to the case-maker 
~:, which extends from 6 to 7 in the sentence. 
The entire subtree dominated by the node cor- 
responds to the phrase ~lJ~: (beturyoukin 
hi) which extends from 3 to 7. 
Discontinuous Intervals Discontinuous in- 
tervals are possible. In Figure 3, the deverbative 
noun ~b~ (negai, request) from ~li') (negau, to 
ask for) takes an accusative argument extend- 
ing from 0 to 4, ~3~1~" (o+namae wo, your 
"~Refer to Table A in Appendix for notations used in 
glosses. 
111 
name). Because the honorific prefix ~3 + (o+) 
can only be applied to a noun, obtained by at- 
taching the suffix + b~ (+i) (transference, see 
Section 4), the subtree dominated by the ver- 
bal root corresponds to a non-connex substring 
\[0_4\]+\[5.6\] in the surface form: ~3~'k ... ~1I. 
/-3+ I 
4-5/4.5 5-610-4+5-6 
~+ 
0_1/0_1 
o namae wo o negai simasu 
HON 'name' ACC HON 'request' 'do' 
What is your name, please? 
Figure 3: A case of a discontinuous interval. 
2.3 Predicate-Argument Structures 
Free-Order- Subject A main feature of de- 
pendency structures, to which Tesni~re's rep- 
resentations pertains, is that they do not pro- 
vide any preferred position to the subject (see 
Fourquet's foreword to (Grdciano and Schu- 
reacher 96), and (Zemb 78), p. 393, for a dis- 
cussion). This corresponds particularly well 
with our data because the free ordering of case- 
marked phrases (not words) is a property of 
Japanese, which makes dependency grammars 
more adequate in its description 4. For exam- 
4(Mel'~uk 88) and (Starv6ta 88), among others have 
already commented that constituency structures are 
English-oriented representations into which some lin- 
guists try desperately to cast other languages. An il- 
lustration is (Gunji 87). After a ten-page discussion, 
and despite an honest acknowledgment that there is ab- 
solutely no basis for this, he draws the conclusion that a 
preferred position for the subject, as a left sister of the 
pie, the two following propositions are equally 
valid, where location and subject have been ex- 
changed. 
rokunin ga hitoheya ni ireru 
'6-people' NOM 'l-room' LOC 'can-enter' 
a room that can accommodate 6 people 
--g~ ~z ~A ~ 
hitoheya ni rokunin ga 
'l-room' LOC '6-people' NOM 
L. fL ;b . . . 
ireru... 
'can-enter' 
a room that can accommodate 6 people 
Omission Moreover, in Japanese, the omis- 
sion of any of the case-marked phrases is pos- 
sible. One can perfectly imagine a situation 
where a traveler first announces that he is in 
a group of 6 people, and then merely utters the 
following sentence: 
--~ ~z X.~,~ 
hitoheya ni ireru 
'one-room' LOC 'can-enter' 
a room that can accommodate 6 people 
This sentence has no subject, and yet it is un- 
ambiguously understood as a request for a room 
which can accommodate 6 people altogether. 
Ergative Constructions Moreover, the 
search for the "real subject", as opposed to the 
syntactical subject, is meaningless in depen- 
dency representations of ergative constructions. 
Such constructions exist in Japanese 5 with 
a range of adjectives, such as, ~ L.~ ~ (hosii) 
(20 occurrences in our corpus), or verbal 
forms in -t~,~ (tai) (around 310 occurrences 
in the corpus), or the so-calhd "passive" or 
"medio-passive" verbs, such as, gP.. 1o (mieru, 
c.f. Fr. se voir). 
verb, has to be postulated for Japanese, because.., it is 
so in English. 
5However, the ergative case does not exist in 
Japanese, and it would be difficult to call Japanese an 
ergative language (see (Mel'euk 88), p. 250-253, for def- 
initions concerning ergativity). 
112 
~a0~ $~ ~b~,~ ... 
snityuumegane ga hosii 
'goggles' NOM 'want' 
I want goggles 
Figure 4: An ergative construction. 
Auxiliary Verbs In an original and interest- 
ing discussion, Tesni~re advocates that the sub- 
ject and the object of a French passd composdof 
a transitive verb, do not both link to the past 
participle. He shows that some clues indicate 
that the subject links to the auxiliary, while the 
object should be linked with the past participle. 
Similar analysis seems particularly well adapted 
to some Japanese constructions too, not because 
of the agreement in gender-number, but because 
of case semantics. 
For instance, in the following sentence, the 
subject, postal code, cannot be considered the 
subject of the verb, to write °. 
yuubinbangou ga kaite aru 
'postal code' NOM 'write' 'is' 
The postal code is written (e.g. on an enve- 
lope) 
However, changing the auxiliary, ab~ (aru) 
into b~ 7o (iru) implies a change in the case of 
postal code. 
yuubinbangou wo kaite iru 
'postal code' ACC 'write' 'is' 
The postal code is being written (e.g. by Lu- 
cien) - Somebody is writing the postal code. 
This convinced us to adopt Tesni~re's analy- 
sis, where the subject is linked with the auxil- 
iary (Figure 5). 
¢1~'~ (kaite) is a non-conclusive, pending, form of 
the verb 8 < (kaku), which is translated in English by 
"writing ~ or "written" according to the context. 
I 
I 
I 
I 
I, 
I 
I 
I 
I 
I 
I 
IT 
ET ~3 
~ 8-1i/0-5+8-10 
I + ~ ~ 
5_7/5.8 7.8/7.8 4.5/0. 5 
0 o.4/o_4 
yuubinbangou ga kaite aru 
'postal code' NOM 'write' 'is' 
The postal code is written 
Figure 5: Auxiliary dependency. 
3 JUNCTION 
Junction gathers the facts of coordination, and 
factorisation. 
Junction words in Japanese include words 
such as ~ (to, and for nouns), ~ (ya, or for 
nouns), L (si, or for verbs), ~;t 2" (kedo, but). 
We propose to represent them with one node 
bearing a special label: we prefix and suffix by 
- the function word. Accordingly, we can easily 
represent cap junctions as in Figure 6. 
ikkai to tikaikk~i ni 
'first-floor' 'and' 'basement floor' LOC 
On the first and basement floors 
Figure 6: A case of a cap junction. 
On the other hand, in cup cases, the same de- 
pendent shares several governors. A tree can be 
'~factored" by using a special node, V, bearing 
113 
the same correspondences as its root. Figure 7 
is a slightly modified corpus sentence. 
A -T- I 12_13/0.13 13-15/13.15 
|:t ET 
11.12/0.12 ~7 
. 
V 11.12/0_12 
A +< 15-16/15-1616A7/16A7 
0 \[\]~,:r- 4~ ~ 7"PX ,~ - ~t, 11 
kokusai ekisupuresu meeru 
'International express mail? 
~< ,r o~T2, 
hayaku tukimasu 
'quickly' 'arrive' 
International express mail is cheap, and it ar- 
rives quickly 
I~t12 ~<'C15 
ha yasukute 
TOP 'cheap-and' 
Figure 7: A case of a cup junction. 
Because of junctions, a structure representing 
a sentence may be a forest. This is a signifi- 
cant difference to constituency representations, 
but conforms with Tesni~re's description (e.g. 
p. 649). Figure 7 is such an example. 
4 TRANSFERENCE 
Transference r, in essence, consists in transfer- 
ring to a content word of a given category the 
function or role of another category. Accord- 
ing to Tesni~re, it is precisely this transference 
which aUows a speaker of any language to never 
be stopped by the fact that a needed concept 
does not fit, by category, into the role required 
at a given point in an utterance. 
Transferer Transference applies to a content 
word, called the transferee. It is performed by 
a transferer, which may be: 
• a function word 69 (no, of), ~:- (hi, to), T 
(sum, to do), etc. 
7Here, we follow the recommendation of Tesni~re him- 
self to render the French word translation with this En- 
glish term especially coined for the meaning here. 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
I 
l 
some morphological device ¢ < (ku, adver- 
bial form of adjectives), ¢ "C (re, pending 
form of verbs), etc. 
no mark at all (the so-called "relatives" of 
Japanese are in fact transferences: a verb 
is transferred into an adjective without any 
marker). In this case, we indicate the trans- 
ferer node by ¢. 
As a result of transference, the category of 
the content word has been transformed into an- 
other category so that it can play the role of the 
resulting category. For instance: 
,-k,,'Y-Jt~:O ~ ,~,-Y-Jp0):A 
hoteru hoteru no 
hotel of the hotel 
Representation Depending on the position 
of the transferer, left and right transferences 
have to be distinguished. In Japanese, the 
transferer is predominantly on the right of the 
transferee. We represent the transference with 
the help of a 3-node subtree to render Tesni~re's 
capital T notation: 
• the mother node bears the target category, 
followed (or preceded) by T if the transferer 
is on the right of the transferee in the sen- 
tence, (usual case in Japanese), or on the 
left.; 
• the left (or right) daughter bears the trans- 
feree, represented by its category; 
• the other daughter bears the transferer, £e. 
the function word in extenso. 
A mother node does not correspond to any 
word in the surface text so it bears an empty 
interval (denoted as \[n_n\], with any u). How- 
ever, as the root of a subtree, it represents the 
sum of the intervals of all its subtrees. 
Na-Adjectlves A class of Sino-Japanese 
nouns exists in Japanese, extended in contem- 
poranean Japanese by a full range of English- 
Japanese nouns (Sells 96) (~--- ~' f,c (yuniiku- 
na. unique), 7 I/.:p -~ :~ fo¢ (huressyu-na, fresh)), 
which could be semantically interpreted as ad- 
jectives, but follow a specific syntactical be- 
haviour, different from standard adjectives end- 
ing in ~ (i) (Appendix C). They are the so- 
called na-adjectives in Japanese grammar books 
AT 
0 a) 0_3/0_3 3.4/3.4 
o ~" "Y'~V 3 o~ 4 
Figure 8: Representation of transference. 
written in English, although in Japanese termi- 
nology they are described as noun-adjectives. 
In attributive positions, these words require 
a special function word, tx (na). We analysed 
t.c as a transferer of nouns (0) into adjectives 
(A). This view meets that of (Kuwae 89), vol 1, 
p. 185, who considers that, "t~ (da) is the only 
variable word in Japanese for which there exists 
a determinative form, t,c (ha), distinct from the 
conclusive form." 
CONCLUSION 
We have presented a tree-bank of 6553 sentences 
of Japanese conversations in the domain of ho- 
tel reservations , which uses Tesni~re's structural 
syntax framework. Correspondences between 
surface texts and trees are ensured by means 
of intervals. 
It has long been felt in the NLP Japanese 
community that a dependency approach fits 
well to the description of Japanese. The 
privileged place for the subject in con- 
stituency descriptions generates artificial prob- 
lems, whereas, dependency allows simple and 
direct description of phenomena like, for in- 
stance, ergative constructions. 
Moreover, Tesni~re's original ideas give a 
clear insight to some area. For instance, the 
attachment of arguments under auxiliaries bet- 
ter renders case semantics. Also, transference 
permits a simple analysis of "na-adjectives", 
which respects the feeling of native speakers of 
Japanese. 
114 
A 
B 
Grammatical Labels Used in 
Glosses 
symbol particle or 
example 
TOP l;t (ha) 
NOM 7) t (ga) 
ACC ~ (wo) 
LOC ~: (ni) 
"~ (de) 
HON ~3 + (o+) honorific 
+ (go+) 
topicalisatioff" 
nominative 
accusative 
locative 
Structural Syntax Categories and 
Japanese School Grammar Classes 
symbol class example 
0 ~ , ~-~t~ (~L~) 
nouns :~ (#~J) ~" (~$~) 
-t-~L;~9 * :,' (~$) 
adj. .1: ~ Lb~ (~) 
verbs fz 19 it-J- (+iT) 
~ Lt: (+to) 
adv. 12~" (~,\]~) 
C I-Adjectives and "Na-Adjectives" 
polite 
takai desu 
it is expensive 
sizuka desu 
it is quiet 
predicative 
, \[familiar 
takai 
(it's) erpen. 
sire 
~h~fY. 
sizuka da 
(it's) quiet 
attributive 
takai heya desu 
it is an expensive 
rooffl 
sizuka na heya desu 
it is a quiet room 

References 
Christian Boitet and Zaharin Yusoff 
Representation trees and string-tree corre- 
spondences 
Proceedings of COLING-88, Budapest, 1988, 
pp 59-64. 

Gunji Takao 
Japanese Phrase Structure Grammar 
D. Reidel Publishing Company, 1987. 

Gertrud Gr~ciano und Helmut Schumacher 
(Herausgegeben yon) 
Lucien Tesni~re - Syntaxe structurale et 
operations mentales 
Max Niemeyer Verlag, Tfibingen, 1996. 

In Japanese, 1989. 

Kunio Kuwae 
Cours de japonais 
vol. 1 & 2, L'Asiath~que, Paris, 1989. 

Yves Lepage 
Te~ts and Structures - Pattern-matching and 
Distances 
ATR report TR-IT-0049, Kyoto, March 1994. 

Igor A. Mel'~uk 
Dependency Syntax: Theory and Practice 
State University of New York Press, 1988. 

In Chinese, 1989. 

Peter Sells 
IVhot Happens When A Word is Borrowed 
handout of communication, ATR-ITL, July 
llth, 1996. 

Stanley Starosta 
The Case for Lezicase 
Pinter Publishers, London and New York, 
1988. 

Lucien Tesnitre 
Comment eonstruire une syntaze 
Bulletin de la Facultd des Lettres de Stras- 
bourg, 12 e annie, n o 7, mai-juin 1934, 
pp. 219-229. 

Luden Tesni~re 
Elements de syntaze structurale 
Klincksieck, Paris, 1959. 

Jean Marie Zemb 
Ven2leichende Grammatik FranzSsisch- 
Deutseh 
Compamison de deux systdmes - Tell 1 
Dudenverlag, 1978. 
