  
 
Automatic Linguistic Analysis for Language Teachers:  
The Case of Zeros 
 
MITSUKO YAMURA-TAKEI 
Graduate School of Information Sciences 
Hiroshima City University  
3-4-1 Ozuka-higashi, Asaminami-ku,  
Hiroshima, JAPAN 731-3194 
yamuram@nlp.its.hiroshima-cu.ac.jp 
 
MIHO FUJIWARA 
Department of Japanese and Chinese 
Willamette University 
900 State Street, Salem,  
OR. USA 97301 
mfujiwar@willamette.edu 
MAKOTO YOSHIE 
Graduate School of Information Sciences 
Hiroshima City University  
yoshie@nlp.its.hiroshima-cu.ac.jp 
TERUAKI AIZAWA 
Faculty of Information Sciences 
Hiroshima City University  
aizawa@its.hiroshima-cu.ac.jp 
 
 
 
Abstract 
This paper presents a Natural Language
Processing-based linguistic analysis tool that
we have developed for teachers of Japanese as a
Second Language.  The program, Zero Detector
(ZD), aims to promote effective instruction of
zero anaphora by making invisible zeros visible,
on the basis of a hypothesis about ideal
conditions for second language acquisition.
ZD takes Japanese written
narrative discourse as input and provides the 
zero-specified texts and their underlying 
structures as output.  We evaluated ZD’s 
performance in terms of its zero detecting 
accuracy.  We also present an experimental 
report of its validity for practical use.  As a 
result, ZD has proven to be pedagogically 
feasible in terms of its accuracy and its impact 
on effective instruction. 
 
Introduction 
Natural Language Processing (NLP) is an 
emerging technology with a variety of real-world 
applications.  Computer-Assisted Language 
Learning/Teaching (CALL/CALT) is one area 
that NLP techniques can contribute to.  Such 
techniques range from indexing and concor-
dancing to morphological processing with 
on-demand dictionary look-ups and syntactic 
processing with diagnostic error analysis, to 
name a few.  But little work has been done on 
discourse-level phenomena, including anaphora. 
Zero anaphora or zero pronouns (henceforth 
zeros) are referential noun phrases (NPs) that are 
not overtly expressed in Japanese discourse.  
These NPs can be omitted if they are recoverable 
from a given context or relevant knowledge.  
Zeros are common in Japanese, and they make it
difficult for Japanese as a Second Language (JSL)
learners both to comprehend discourse accurately
and to produce natural-sounding Japanese.  Some learners
fail to understand a passage correctly because of 
the difficulty of identifying zeros and/or their 
antecedents.  Other learners produce grammati-
cally correct but still unnatural-sounding Japa-
nese due to overuse or underuse of zeros. 
Yet, very few textbooks provide systematic 
instruction or intensive exercises to overcome 
these difficulties with zeros.  Consequently 
many Japanese language teachers rely on their 
intuitions when explaining zeros.  Intuition is a 
conventional tool in teaching one’s native lan-
guage, but from a student’s perspective, a 
well-developed systematic method of instruction 
can be more convincing.  From a teacher’s
standpoint as well, a systematic analysis of zeros
is helpful in preparing teaching materials and
evaluating students’ performance.
  
 
Analysis of zeros can be divided into three 
phases: zero identification, zero interpretation 
and zero production.  This paper focuses on the 
first phase and proposes a method of systemati-
cally identifying the presence of zeros in order 
that teachers might provide effective instruction 
of zeros, based on some pedagogical principles 
from relevant second language acquisition (SLA) 
theory.  We regard teachers as primary users of 
the program and aim to help them enhance their 
instruction.  We implemented the program and 
evaluated its potential benefits for language 
teachers. 
In Sections 1 and 2 we discuss the peda-
gogical assumptions from SLA theory that moti-
vate our program design, and present the linguis-
tic assumptions from which our heuristics were 
drawn. Section 3 provides an overview of our 
system implementation.  In Section 4, we pre-
sent the results of evaluation from the viewpoints 
of both the accuracy and the empirical validity of 
the program.  We conclude with a discussion of 
possible future work. 
1 Pedagogical Assumptions 
There have been many studies about how people 
learn foreign languages and what is responsible 
for successful language learning. 
Recent SLA theory has progressed beyond
Krashen’s (e.g., 1982) emphasis on automatic
processes of acquisition.  Empirical research
has shown that learners’ consciousness-raising
through explicit instruction does contribute to
successful second language learning (see Norris
& Ortega, 2000 for a comprehensive review).
Chapelle (1998) reviewed seven hypotheses 
about ideal SLA conditions that are relevant for 
CALL program design.  At the top of her list is 
that “the linguistic characteristics of target lan-
guage input need to be made salient” (p. 23).  
Effective input enhancement, by prompting 
learners to notice particular learning items, with 
highlighting for example, plays a significant role 
in facilitating acquisition.  We conjecture that 
this salience effect can also be realized by mak-
ing zeros visible. 
2 Linguistic Assumptions 
Japanese is a head-final language.  A sentence 
or a clause is headed by a predicate, which takes 
a set of arguments and adjuncts.  Predicates in 
Japanese include verbs, adjectives, nominal
adjectives and the copula, and usually consist of
a core predicate and some auxiliary elements.
Arguments are classified into three types: Topic
Phrase (TP), headed by the topic marker wa; Focus
Phrase (FP), headed by focus particles such as mo,
koso, dake, sae and shika; and Kase Phrase (KP),
headed by the case particles ga, wo, ni, e, to,
yori, de, kara and made.  We regard adjuncts as
non-particle-headed phrases.
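
As a rough illustration of this classification, the following Python sketch assigns a phrase type from the particle that heads a romanized phrase.  It is not part of ZD: the function name, the hyphen-segmented input format and the treatment of unknown endings as adjuncts are assumptions made for illustration, and the focus particle list contains only the particles named above.

```python
# Particle inventories as listed in the text; the focus list is the
# non-exhaustive set named there.
TOPIC_PARTICLES = {"wa"}
FOCUS_PARTICLES = {"mo", "koso", "dake", "sae", "shika"}
CASE_PARTICLES = {"ga", "wo", "ni", "e", "to", "yori", "de", "kara", "made"}


def classify_phrase(phrase: str) -> str:
    """Return 'TP', 'FP', 'KP' or 'Adjunct' for a romanized phrase.

    Assumption: the heading particle follows the last hyphen, as in
    'Satsuki-wa'.  Phrases without a known particle are treated as
    adjuncts (non-particle-headed phrases).
    """
    _, separator, last = phrase.rpartition("-")
    if not separator:
        return "Adjunct"
    if last in TOPIC_PARTICLES:
        return "TP"
    if last in FOCUS_PARTICLES:
        return "FP"
    if last in CASE_PARTICLES:
        return "KP"
    return "Adjunct"


for phrase in ["komatta Satsuki-wa", "gennin-wo", "sassoku"]:
    print(phrase, "->", classify_phrase(phrase))
```

A real system would read the particle from the morphological analysis rather than from hyphens in romanized text.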
We define zeros as unexpressed obligatory 
arguments of a core predicate.  What is 
“obligatory” is the next question to arise.  
Obligatoriness is a controversial issue, and there 
is no set agreement among linguists on its 
definition.  Somers (1984) proposed a six-level 
scale of valency binding that reflects the degree 
of closeness of an element to the predicate.  The 
levels are (i) integral complements, (ii) 
obligatory complements, (iii) optional 
complements, (iv) middles, (v) adjuncts and (vi) 
extraperipherals.  Ishiwata (1999) suggests that
in Japanese group (i) is often treated as part of
idioms and is not omissible, and that nominative
–ga and accusative –wo fall into category (ii),
while dative –ni belongs to (iii).
In light of this, we assume that the obligatory
arguments that can be zero-pronominalized are
phrases headed by nominative ga, accusative wo,
and ni, excluding dative ni in an indirect object
position.
3 Zero Detector 
Zero Detector (henceforth ZD) is an automatic 
zero identifying tool, which takes Japanese writ-
ten narrative texts as input and provides the 
zero-specified texts and their underlying struc-
tures as output.  This aims to draw learners’ and 
teachers’ attention to zeros, by making these 
invisible elements visible in effectively enhanced 
formats. 
3.1 System Overview 
ZD employs a rule-based approach, with theo-
retically sound heuristics.  Our heuristics are 
drawn from the linguistic assumptions described 
in Section 2. 
ZD reuses and integrates two existing natu-
ral language analysis tools and an electronic dic-
tionary, none of which were intended for a lan-
guage learning purpose, into its architecture, 
attempting to make the best possible use of their
capabilities for our purpose.  Morphological
analysis is done by ChaSen 2.2.8 (NAIST, Ma-
tsumoto, Y. et al., 2001), and dependency struc-
ture analysis by CaboCha 0.21 (NAIST, Kudo, 
K., 2001).  The Goi-Taikei Valency Dictionary 
(hereafter GTVD; Ikehara et al., 1997) serves as 
a source for valency pattern search. 
The flow of the system is illustrated in Fig-
ure 1. 
 
 
[Flow diagram]
INPUT: Text
Clause Splitter: Morphological Analysis, Clause Splitting
OUTPUT(A): Split Clauses
(Manual Correction)
Revised Split Clauses
Zero Detector: Morphological Analysis, Dependency Structure Analysis,
  Zero Detection (with Valency Dictionary), Zero Insertion
OUTPUT(B): Clause Structure Frames
OUTPUT(C): Predicate-Argument Structures with Zeros
OUTPUT(D): Zero-inserted Text
 
Figure 1: Flow diagram of zero detecting processes 
 
 
3.2 ZD Output 
As shown in Figure 1, ZD produces four differ-
ent types of output: (A) split clauses, (B) clause 
structure frames, (C) predicate-argument struc-
tures with zeros, and (D) zero-inserted texts.  
We will show how these outputs are structured 
using the example text in Figure 2. 
 
 
komatta   Satsuki-wa     sassoku  
in trouble  Satsuki-TOP    immediately 
 
gennin-wo   shirabe-sase-ta. 
cause-ACC  investigate-CAUSATIVE-PAST 
 
 
“Satsuki, who was in trouble, immediately had 
(someone) investigate its cause.” 
 
Figure 2: An example input text 
 
First, output (A) provides a text divided into 
clauses, each consisting of one and only one 
predicate and its arguments.  Some predicates 
are simplex, while others are complex, consisting 
of more than one core predicate (i.e., verb, adjective).
Several complex predicates (e.g., tabeta-koto-ga-aru,
ate-experience-subject marker-have, “have eaten”) are
predefined as simplex to avoid excessive clause splitting.
The clauses
are labelled with their clause types: independent 
(main), dependent (coordinated/subordinated) or 
embedded (relative/nominal/quoted).  A clause 
serves as the basic unit for the zero detecting 
operation.  In this study, embedded clauses are 
excluded from this operation and are left within 
their superordinate clauses.  An example output 
(A) is given in Figure 3.
 
 
 
 
  
 
 
[[komatta EC(RC)] Satsuki-wa sassoku

gennin-wo shirabe-sase-ta. IC]
 
Figure 3: Split clauses [1]
 
 
Once the text is split into clauses, each 
clause is analysed for its dependency structure 
and then converted into its clause structure frame.  
The noun phrases which depend on the predicate 
are extracted, and then classified into phrase 
types (TP, FP and KP) according to their accom-
panying particles.  An example of this frame, 
i.e., output (B), is given in Figure 4. 
 
 
Input: komatta Satsuki-wa sassoku gennin-wo 
shirabe-sase-ta. 
 
Paragraph#: 2 
Sentence#: 4 
Clause#: 5 
Clause Type: Independent with EC(RC) 
  ----------------------------------------------------- 
  [Predicate] : shirabe-sase-ta. 
    Core:       shiraberu   verb
    Auxiliary:  saseru      verb
                ta          auxiliary verb
                .
    Voice: causative 
    Empathy: 
    Conjunction: 
  ----------------------------------------------------- 
  [Argument] : 
    Topic Phrase:  komatta Satsuki-wa 
      Topic-Case:  N1-ga 
    Focus Phrase:  <none> 
      Focus-Case:  <none> 
    Kase Phrase:  gennin-wo 
    Pre-copula: <none> 
  [Adjunct] :  sassoku 
 
Figure 4: A clause structure frame 
 
This frame also includes the result of 
valency checking, as in Figure 5, and zero iden-
tifying processes, as in Figure 6, at the bottom. 
 
                                                      
[1] Here, we use the acronyms: IC for Independent
Clause, EC for Embedded Clause, and RC for
Relative Clause.
 
Valency Selected: N1 ga  N2 wo 
 
Valency Obligatory: N1 ga  N2 wo 
 
Valency Changed: N1 ga   N2 wo  N3 ni 
 
Figure 5: Valency checking 
 
A core predicate is checked against GTVD 
to search for its syntactic valency pattern.  
GTVD is a semantic valency dictionary, origi-
nally designed for transfer-based Japa-
nese-to-English machine translation, so it in-
cludes as many valency pattern entries for each 
predicate as are necessary for effective transfer.  
The entries are ordered according to expected 
frequency of occurrence.  We took the naïve 
approach of selecting the first-ranking entry from 
the listing for each core predicate (i.e., ‘Valency
Selected’ in Figure 5).
The next step is to apply the definition of
‘obligatoriness’ described in Section 2 to refine
the selected valency pattern (‘Valency Obligatory’
in Figure 5).  Any case other than ga, wo, or ni
that appears within the first three case slots of
the selected valency pattern is excluded.  If a
ni-case still remains in the third case slot, it is
also deleted.  In most cases, these operations
leave us with one of two valency patterns:
(i) N1-ga N2-wo, or (ii) N1-ga N2-ni.
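
The selection and refinement steps can be sketched in Python as follows.  This is an illustration under stated assumptions, not the ZD code: a valency pattern is represented simply as an ordered list of case particles, the example entries are invented rather than taken from GTVD, and "the third case slot" is read as the third slot that survives the exclusion step.

```python
OBLIGATORY_CASES = ("ga", "wo", "ni")


def select_valency(entries):
    """Pick the first-ranking pattern from a frequency-ordered entry list."""
    return entries[0]


def obligatory_valency(selected):
    """Reduce a selected pattern to its obligatory case slots.

    Only the first three slots are considered.  Cases other than ga, wo
    and ni are dropped; a ni-case left in the third slot is dropped too.
    """
    kept = [case for case in selected[:3] if case in OBLIGATORY_CASES]
    if len(kept) == 3 and kept[2] == "ni":
        kept = kept[:2]
    return kept


# Hypothetical dictionary entries for one predicate, most frequent first.
entries = [["ga", "wo"], ["ga", "ni", "wo"]]
selected = select_valency(entries)
print("Valency Selected:", selected)
print("Valency Obligatory:", obligatory_valency(selected))
```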
Then, a valency changing operation is done 
in the case of causatives or passives.  When an 
auxiliary verb is added to the core predicate in 
the causative or passive construction, the verb 
then requires three arguments.  In the causative 
case, these are a ga-marked causer, a wo-marked 
object and a ni-marked causee.  The valency 
changing operation adds the boxed valent, N3 ni, 
in Figure 5 (Valency Changed) because the voice 
slot is marked as causative in Figure 4. 
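
A minimal Python sketch of the causative case of this operation is given below.  The function name and the list-of-particles representation are assumptions for illustration.  The passive is left unchanged here because the text does not spell out its argument slots, and the sketch assumes that no ni slot is added when one is already present.

```python
def change_valency(obligatory, voice):
    """Return the valency pattern after the voice-dependent change.

    Only the causative is modelled: a ni-marked causee slot is appended
    when the pattern has no ni slot yet.  Any other voice value returns
    an unchanged copy in this sketch.
    """
    changed = list(obligatory)
    if voice == "causative" and "ni" not in changed:
        changed.append("ni")
    return changed


# The running example: shiraberu (N1 ga, N2 wo) with causative voice.
print(change_valency(["ga", "wo"], "causative"))
```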
 
 
Valency Selected: N1 ga  N2 wo 
 
Valency Obligatory: N1 ga  N2 wo 
 
Valency Changed: N1 ga  N2 wo  N3 ni 
 
Zero: N3 ni 
 
Figure 6: Zero identifying 
 
  
 
Now that the valency pattern for the given 
predicate is assigned, it is checked against overt 
arguments listed in the frame.  The valent N2 is 
matched with the overt argument gennin-wo and 
removed from the zero candidates, as shown in 
Figure 6. 
Case-less elements, such as TP and FP, also 
need to have their canonical case markers re-
stored.  This is done by assigning the first re-
maining valent to TP and/or FP.  This is based 
on the linguistic fact that subjects are more likely 
to be topicalized or focused than objects.  In the 
example, TP, Satsuki-wa, is assigned ga case.  
The assigned case slot N1-ga is then matched 
with Satsuki-wa (ga) and is also deleted. 
Finally, the remaining valent, if any, is as-
sumed to be a zero (i.e., N3 ni in Figure 6). 
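
The three matching steps just described can be sketched in Python as below.  The function and parameter names are hypothetical, and the sketch assumes that the frame supplies the case particles of the overt Kase Phrases and a count of overt Topic and Focus Phrases.

```python
def identify_zeros(valency, overt_cases, n_caseless):
    """Return the valents left unmatched, i.e. the assumed zeros.

    valency     -- ordered case slots, e.g. ['ga', 'wo', 'ni']
    overt_cases -- case particles of overt Kase Phrases, e.g. ['wo']
    n_caseless  -- number of overt Topic/Focus Phrases in the clause
    """
    remaining = list(valency)
    # 1. Remove valents matched by overt case-marked arguments.
    for case in overt_cases:
        if case in remaining:
            remaining.remove(case)
    # 2. Each TP/FP is assigned the first remaining valent, which is
    #    then removed as well.
    for _ in range(n_caseless):
        if remaining:
            remaining.pop(0)
    # 3. Whatever is left is assumed to be a zero.
    return remaining


# Running example: gennin-wo is overt and Satsuki-wa is a Topic Phrase.
print(identify_zeros(["ga", "wo", "ni"], ["wo"], 1))
```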
Once zeros are identified, ZD decides where 
to insert the identified zeros in the original text, 
by keeping canonical ordering as listed in the 
valency pattern.  An example of the predicate- 
(obligatory) argument structure from Figure 6, 
with the identified zero, is presented in Figure 7.  
This is output (C).  Here, the restored case 
marking particle is presented in parentheses. 
 
 
*komatta Satsuki-wa (ga) 
 
*gennin-wo 
 
*[   ni] 
 
*shirabe-sase-ta. 
 
Figure 7: Predicate-argument structure with zeros 
 
Finally, ZD outputs the original series of 
clauses with zeros inserted in the most plausible 
positions, along with adjuncts, output (D), as in 
Figure 8. 
 
 
komatta Satsuki-wa sassoku gennin-wo [   ni] 
 
shirabe-sase-ta. 
 
Figure 8: Zero-specified text 
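
One way to realize this insertion step is sketched below in Python.  The placement rule is our assumption, since the text only states that canonical valency ordering is kept: each zero is placed right after the last phrase whose case precedes it in the valency pattern, or clause-initially if there is none.  The predicate is assumed to be the final phrase.

```python
def insert_zeros(phrases, valency, zeros):
    """Insert zero placeholders into a clause, following valency order.

    phrases -- (text, case) pairs in surface order; case is the restored
               case of an argument, or None for adjuncts and for the
               predicate, which is assumed to come last
    valency -- ordered case slots, e.g. ['ga', 'wo', 'ni']
    zeros   -- cases identified as zeros, e.g. ['ni']
    """
    result = list(phrases)
    for zero in sorted(zeros, key=valency.index):
        earlier = set(valency[: valency.index(zero)])
        position = 0  # default: clause-initial
        for i, (_, case) in enumerate(result[:-1]):
            if case in earlier:
                position = i + 1
        result.insert(position, (f"[   {zero}]", zero))
    return " ".join(text for text, _ in result)


clause = [
    ("komatta Satsuki-wa", "ga"),
    ("sassoku", None),
    ("gennin-wo", "wo"),
    ("shirabe-sase-ta.", None),
]
print(insert_zeros(clause, ["ga", "wo", "ni"], ["ni"]))
```

For the running example this reproduces the string shown in Figure 8.  Other clauses may call for different placement rules.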
 
These outputs can later be converted into 
the form of a slide presentation or hard-copy 
handouts, etc., depending on how they are used 
by teachers. 
4 Evaluation 
The purpose of the evaluation was to assess the 
validity of ZD output for practical use in a lan-
guage learning/teaching setting.  In the follow-
ing subsections, we evaluate ZD’s performance 
in terms of its accuracy and then present an ex-
perimental report of its validity for educational 
use. 
4.1 Performance 
First, we compared the ZD output with human 
judgements.  The test corpus consisted of two 
reading selections from a JSL textbook and one 
student written narrative monologue, all of which 
were representative samples for lower intermedi-
ate level Japanese.  Five subjects (native speak-
ers of Japanese and trained natural language re-
searchers) served as our human zero detectors.  
They were asked to intuitively identify missing 
arguments in each clause.  We used average 
human performance as a baseline against which 
to evaluate ZD output.  Here, zeros detected by
at least three of the five subjects were regarded
as average human performance.
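
The three-out-of-five criterion amounts to a simple vote count per clause, as in the following Python sketch.  The function name and the sample judgements are invented for illustration and are not data from the experiment.

```python
from collections import Counter


def majority_zeros(judgements, threshold=3):
    """Return the zeros marked by at least `threshold` subjects.

    judgements -- one set of case labels per subject for a single
                  clause, e.g. {'ga', 'ni'}
    """
    votes = Counter(case for marked in judgements for case in marked)
    return {case for case, count in votes.items() if count >= threshold}


# Hypothetical judgements from five subjects for one clause.
sample = [{"ni"}, {"ni"}, {"ni", "ga"}, set(), {"ga"}]
print(majority_zeros(sample))
```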
As Table 1 shows, ZD achieved a 73% 
per-clause matching rate with human output.  
That number represents the ratio of the number 
of exact matches between the two outputs over 
the total number of clauses. 
 
Table 1: Per-clause matching rates
              # of clauses   # of matched
Reading (1)        30          22 (73%)
Reading (2)        25          18 (72%)
Writing            23          17 (74%)
Total              78          57 (73%)
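
The per-clause matching rate can be computed as in the Python sketch below, which also recomputes the percentages in Table 1 from its raw counts.  The function name and the set-per-clause representation are assumptions for illustration.

```python
def matching_rate(zd_zeros, human_zeros):
    """Fraction of clauses where ZD and the human baseline agree exactly.

    Both arguments are equal-length lists holding one set of zero case
    labels per clause; an empty set means no zero in that clause.
    """
    matched = sum(zd == human for zd, human in zip(zd_zeros, human_zeros))
    return matched / len(human_zeros)


# Raw counts from Table 1: (number of clauses, number of matched clauses).
table1 = {"Reading (1)": (30, 22), "Reading (2)": (25, 18), "Writing": (23, 17)}
total = tuple(sum(column) for column in zip(*table1.values()))
for name, (clauses, matched) in {**table1, "Total": total}.items():
    print(f"{name}: {matched}/{clauses} = {round(100 * matched / clauses)}%")
```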
 
A closer examination of each case element
(ga, wo, ni) is given in Table 2.  The
label ‘matched’ includes both cases where ZD
and the human subjects detect a zero and cases
where neither detects one.  The accuracy (89%
average) is high enough for the ZD output to be
put into practical use as a learning aid, without
an excessive load on teachers for post-editing
output errors.  Releasing teachers from having to
spend an enormous amount of time on the tedious
work of analysing educational materials is one of
the biggest advantages of computerizing linguistic
analysis.
 
References

Chapelle, Carol A. (1998). Multimedia CALL:
Lessons to be learned from research on instructed
SLA. Language Learning and Technology, vol.2,
no.1, pp.22-34.

Grosz, B. J., A. K. Joshi and S. Weinstein. (1995).
Centering: A framework for modelling the local
coherence of discourse. Computational Linguistics,
21/2, pp. 203-225.

Ikehara, S., M. Miyazaki, S. Shirai, A. Yokoo, H.
Nakaiwa, K. Ogura and Y. Hayashi (1997).
Goi-Taikei: A Japanese Lexicon, 5 volumes,
Iwanami Shoten, Tokyo.

Krashen, S. (1982). Principles and Practice in
Second Language Acquisition. Pergamon, Oxford.

NAIST, Kudo, K. (2001). CaboCha 0.21.
http://cl.aist-nara.ac.jp/~taku-ku/software/cabocha/

NAIST, Matsumoto, Y. et al. (2001). ChaSen 2.2.8.
http://chasen.aist-nara.ac.jp/

Ishiwata, T. (1999). Gendai GengoRiron to Kaku,
Hituzi Shobo, Tokyo.

Norris, J. M. and L. Ortega (2000). Effectiveness of
L2 instruction: A research synthesis and quantitative
meta-analysis. Language Learning 50 (3),
pp.417-528.

Somers, H. L. (1984). On the validity of the complement-
adjunct distinction in valency grammar. Linguistics
22, pp. 507-53.

