Proceedings of the ACL Student Research Workshop, pages 139–144,
Ann Arbor, Michigan, June 2005. ©2005 Association for Computational Linguistics
Corpus-Oriented Development of Japanese HPSG Parsers
Kazuhiro Yoshida
Department of Computer Science,
University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033
kyoshida@is.s.u-tokyo.ac.jp
Abstract
This paper reports the corpus-oriented de-
velopment of a wide-coverage Japanese
HPSG parser. We first created an HPSG
treebank from the EDR corpus by us-
ing heuristic conversion rules, and then
extracted lexical entries from the tree-
bank. The grammar developed using this
method attained wide coverage that could
hardly be obtained by conventional man-
ual development. We also trained a statis-
tical parser for the grammar on the tree-
bank, and evaluated the parser in terms of
the accuracy of semantic-role identifica-
tion and dependency analysis.
1 Introduction
In this study, we report the corpus-oriented development of a
Japanese HPSG parser using the EDR Japanese corpus (2002).
Although several researchers have attempted to utilize linguistic
grammar theories, such as LFG (Bresnan and Kaplan, 1982), CCG
(Steedman, 2001) and HPSG (Pollard and Sag, 1994), for parsing
real-world texts, such attempts have rarely been successful,
because manual development of wide-coverage linguistically
motivated grammars involves years of labor-intensive effort.
Corpus-oriented grammar development is a gram-
mar development method that has been proposed as
a promising substitute for conventional manual de-
velopment. In corpus-oriented methods, a treebank
of a target grammar is constructed first, and various
grammatical constraints are extracted from the treebank.
Previous studies reported that wide-coverage grammars can be
obtained at low cost by using this method (Hockenmaier and
Steedman, 2002; Miyao et al., 2004). The treebank can also be
used for training statistical disambiguation models, and hence
we can construct a statistical parser for the extracted grammar.
The corpus-oriented method enabled us to develop a Japanese
HPSG parser with semantic information, whose coverage on
real-world sentences is 95.3%. This high coverage allowed us to
evaluate the parser in terms of the accuracy of dependency
analysis on real-world texts, an evaluation measure previously
used for more statistically oriented parsers.
2 HPSG
Head-Driven Phrase Structure Grammar (HPSG) is
classified as a lexicalized grammar (Schabes et al.,
1988). It attempts to model linguistic phenomena
by interactions between a small number of grammar
rules and a large number of lexical entries. Figure
1 shows an example of an HPSG derivation of a
Japanese sentence ‘kare ga shinda,’ which means,
‘He died.’ In HPSG, linguistic entities such as words
and phrases are represented by typed feature struc-
tures called signs, and the grammaticality of a sen-
tence is verified by applying grammar rules to a se-
quence of signs. The sign of a lexical entry encodes
the type and valence (i.e. restriction on the types of
phrases that can appear around the word) of a corresponding
word. Grammar rules of HPSG consist of schemata and principles:
the former enumerate possible patterns of phrase structures, and
the latter basically control the inheritance of daughters'
features by the parent.

  complement_head: [HEAD verb, SPR <>, COMPS <>]
    [2] specifier_head: [HEAD PP"ga", SPR <>, COMPS <>]
      [1] "kare" (he): [HEAD noun, SPR <>, COMPS <>]
      "ga" (NOM):      [HEAD PP"ga", SPR <[1] noun>, COMPS <>]
    "shinda" (died): [HEAD verb, SPR <>, COMPS <[2] PP"ga">]
Figure 1: Example of HPSG analysis.

In the current example, the lexical entry for
“shinda” is of the type verb, as indicated in its
HEAD, and its COMPS feature restricts its preced-
ing phrase to be of the type PP“ga”. The HEAD
feature of the root node of the derivation is inher-
ited from the lexical entry for “shinda”, because
complement-head structures are head-final, and the
head feature principle states that the HEAD feature
of a phrase must be inherited from its head daughter.
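
The derivation in Figure 1 can be imitated with a minimal Python sketch. It is only an illustration under strong simplifying assumptions: signs are reduced to a HEAD value plus SPR and COMPS lists, categories are plain strings, and unification is replaced by string equality. The names `Sign`, `specifier_head`, and `complement_head` are hypothetical and do not come from the actual grammar implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sign:
    """A drastically simplified sign: HEAD plus SPR and COMPS valence lists."""
    head: str
    spr: List[str] = field(default_factory=list)
    comps: List[str] = field(default_factory=list)
    surface: str = ""


def specifier_head(spec: Sign, head: Sign) -> Optional[Sign]:
    """Combine a specifier with a following head that selects it via SPR."""
    if head.spr and head.spr[0] == spec.head:
        # Head feature principle: the parent's HEAD comes from the head daughter.
        return Sign(head.head, head.spr[1:], list(head.comps),
                    f"{spec.surface} {head.surface}")
    return None


def complement_head(comp: Sign, head: Sign) -> Optional[Sign]:
    """Combine a complement with a following head that selects it via COMPS."""
    if comp.spr or comp.comps:
        return None  # assumption: the complement itself must be saturated
    if head.comps and head.comps[0] == comp.head:
        return Sign(head.head, list(head.spr), head.comps[1:],
                    f"{comp.surface} {head.surface}")
    return None


# Lexical entries for 'kare ga shinda' (He died), following Figure 1.
kare = Sign("noun", surface="kare")
ga = Sign('PP"ga"', spr=["noun"], surface="ga")
shinda = Sign("verb", comps=['PP"ga"'], surface="shinda")

pp = specifier_head(kare, ga)           # "kare ga", HEAD PP"ga"
sentence = complement_head(pp, shinda)  # "kare ga shinda", HEAD verb
```

In this sketch the root sign inherits HEAD verb from "shinda" and ends up with empty valence lists, while a bare noun such as "kare" is rejected as a complement of "shinda" because it is not a ga-marked PP.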
There are several implementations of Japanese
HPSG grammars. JACY (Siegel and Bender, 2002)
is a hand-crafted Japanese HPSG grammar that pro-
vides semantic information as well as linguistically
motivated analysis of complex constructions. How-
ever, the evaluation of the grammar has not been
done on domain-independent real-world texts such
as newspaper articles. Although Bond et al. (2004)
attempted to improve the coverage of the JACY
grammar through the development of an HPSG tree-
bank, they limited the target of their treebank an-
notation to short sentences from dictionary defini-
tions. SLUNG (Mitsuishi et al., 1998) is an HPSG
grammar whose coverage on real-world sentences
is about 99%, but the grammar is underspecified,
which means that the constraints of the grammar are
not sufficient for conducting semantic analysis. By employing
corpus-oriented development, we aim to develop a wide-coverage
HPSG parser that enables semantic analysis of real-world texts.

  sign
    SYNSEM synsem
      LOCAL local
        CAT cat
          HEAD head
          MOD [LEFT synsem, RIGHT synsem]
          BAR phrase/chunk
          VAL
            SPR local
            COMPS [AGENT local, OBJECT local, GOAL local]
        CONT content
Figure 2: Sign of the grammar.

3 Grammar Design
First, we provide a brief description of some char-
acteristics of Japanese. Japanese is head final, and
phrases are typically headed by function words. Ar-
guments of verbs usually have no fixed order (this
phenomenon is called scrambling) and are freely
omitted. Arguments’ semantic relations to verbs
are chiefly determined by their head postpositions.
For example, ‘boku/I ga/NOM kare/he wo/ACC ko-
roshi/kill ta/DECL’ (I killed him) can be paraphrased
as ‘kare wo boku ga koroshi ta,’ without changing
the meaning.
The case alternation phenomenon must also be
taken into account. Case alternation is caused by
special auxiliaries “(sa)se” and “(ra)re,” which are
causative and passive auxiliaries, respectively, and
the verbs change their subcategorization behavior
when they are combined with these auxiliaries.
The following sections describe the design of our grammar. In
particular, the treatment of the scrambling and case alternation
phenomena is described in detail.
3.1 Fundamental Phrase Structures
Figure 2 presents the basic structure of signs of our
grammar. The HEAD feature specifies phrasal cat-
egories, the MOD feature represents restrictions on
the left and right modifiees, and the VAL feature encodes
valence information. (For the explanation of the BAR feature,
see the description of the promotion schema below.)¹ For some
types of phrases, additional features are specified as HEAD
features.

Table 1: Schemata and their uses.
schema name      common use of the rule
specifier-head   PP or NP + postposition
                 VP + verbal ending
                 NP + suffix
complement-head  argument (PP/NP) + verb
compound-noun    NP + NP
modifier-head    modifier + head
head-modifier    phrase + punctuation
promotion        promotes chunks to phrases

Now, we provide a detailed explanation of the de-
sign of the schemata and how the features in Figure
2 work. The following descriptions are also summa-
rized in Table 1.
specifier-head schema Words are first concatenated by this
schema to construct basic word chunks. Postpositional phrases
(PPs), which consist of postpositions and preceding phrases, are
the most typical example of specifier-head structures. For
postpositions, we specify a head feature PFORM, with the
postposition's surface string as its value, in addition to the
features in Figure 2, because differences between postpositions
play a crucial role in disambiguating the semantic structures of
Japanese. For example, the postposition ‘wo’ has a PFORM feature
whose value is “wo,” and it accepts an NP as its specifier. As a
result, a PP such as “kare wo” inherits the PFORM value “wo”
from ‘wo.’
The schema is also used when VPs are constructed from verbs and
their endings (or, sometimes, auxiliaries; see also Section 3.2).
complement-head schema This schema is used
for combining VPs with their subcategorized argu-
ments (see Section 3.2 for details).
compound-noun schema Because nouns can be
freely concatenated to form compound nouns, a spe-
cial schema is used for compound nouns.
modifier-head schema This schema is for modifiers and their
heads. Binary structures that cannot be captured by the above
three schemata are also considered to be modifier-head
structures.²

¹ The CONTENT feature, which should contain information about
the semantic contents of syntactic entities, is ignored in the
current implementation of the grammar.

head-modifier schema This schema is used when
the modifier-head schema is not appropriate. In the
current implementation, it is used for a phrase and
its following punctuation.
promotion schema This unary schema changes
the value of the BAR feature from chunk to phrase.
The distinction between these two types of constituents serves
to rule out certain spurious ambiguities. For example,
‘kinou/yesterday koroshi/kill ta/DECL’ can be analyzed in two
different ways, i.e. ‘(kinou (koroshi ta))’ and ‘((kinou
koroshi) ta).’ The latter analysis is prevented by
restricting “kinou”’s modifiee to be a phrase, and
“ta”’s specifier to be a chunk, and by assuming “ko-
roshi” to be a chunk.
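
The effect of the BAR feature on the ‘kinou koroshi ta’ example can be traced with a small Python sketch. It assumes, for illustration only, that a constituent is just a (category, BAR) pair and that the specifier-head structure built with “ta” is again a chunk; the function names are hypothetical.

```python
from typing import Optional, Tuple

# A constituent is (category, bar), where bar is "chunk" or "phrase".
Constituent = Tuple[str, str]


def promote(c: Optional[Constituent]) -> Optional[Constituent]:
    """Unary promotion schema: change BAR from chunk to phrase."""
    return None if c is None else (c[0], "phrase")


def attach_ta(spec: Optional[Constituent]) -> Optional[Constituent]:
    """Specifier-head with the ending 'ta': the specifier must be a verbal chunk."""
    if spec is None or spec != ("verb", "chunk"):
        return None
    return ("verb", "chunk")  # assumption: the result is still a word chunk


def attach_kinou(head: Optional[Constituent]) -> Optional[Constituent]:
    """Modifier-head with 'kinou': the modifiee must be a phrase."""
    if head is None or head[1] != "phrase":
        return None
    return (head[0], "phrase")


koroshi: Constituent = ("verb", "chunk")

# (kinou (koroshi ta)): build the chunk first, promote it, then modify it.
analysis_1 = attach_kinou(promote(attach_ta(koroshi)))

# ((kinou koroshi) ta): 'kinou' forces a phrase, which 'ta' cannot take.
analysis_2 = attach_ta(attach_kinou(promote(koroshi)))
```

Only the first bracketing yields a constituent; the second returns `None` because “ta” refuses a specifier whose BAR value is phrase.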
3.2 Scrambling and Case Alternation
Scrambling causes problems in designing a Japanese
HPSG grammar, because original HPSG, designed
for English, specifies the subcategorization frame of
a verb as an ordered list, and the semantic roles of
arguments are determined by their order in the com-
plement list.
Our implementation treats the complement fea-
ture as a list of semantic roles. Semantic roles for
which verbs subcategorize are agent, object, and
goal.³ Correspondingly, we assume three subtypes
of the complement-head schema: the agent-head,
object-head, and goal-head schemata. When verbs
take their arguments, arguments receive semantic
roles which are permitted by the subcategorization
of verbal signs. We do not restrict the order of
application of the three types of complement-head
schemata, so that a single verbal lexical entry can
accept arguments that are scrambled in arbitrary order. In
Figure 3, “kare ga” is a ga-marked PP, so it is analyzed as an
agent of “koro(su).”⁴
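
The order-free treatment of complements can be sketched as follows. This is a simplified illustration, not the grammar's implementation: the valence of a verb is assumed to be a dictionary from semantic roles to the PFORM of the PP that fills them, and `assign_roles` is a hypothetical helper that stands in for repeated application of the agent-head, object-head, and goal-head schemata.

```python
from typing import Dict, List, Optional, Tuple

# Role-keyed valence of "korosu" (kill), following Figure 3:
# the agent is a ga-marked PP and the object is a wo-marked PP.
KOROSU: Dict[str, str] = {"AGENT": "ga", "OBJECT": "wo"}


def assign_roles(args: List[Tuple[str, str]],
                 valence: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Assign a semantic role to each (word, postposition) argument.

    Arguments may come in any order and may be omitted. Each role is
    filled at most once; None is returned if an argument gets no role.
    """
    remaining = dict(valence)
    roles: Dict[str, str] = {}
    for word, pform in args:
        role = next((r for r, p in remaining.items() if p == pform), None)
        if role is None:
            return None
        roles[role] = word
        del remaining[role]
    return roles


canonical = assign_roles([("boku", "ga"), ("kare", "wo")], KOROSU)
scrambled = assign_roles([("kare", "wo"), ("boku", "ga")], KOROSU)
omitted = assign_roles([("kare", "wo")], KOROSU)
```

Both argument orders receive the same role assignment from a single lexical entry, and an omitted agent simply leaves its role unfilled. Two ga-marked arguments are rejected, which mirrors the assumption that one semantic role is occupied by at most one syntactic entity.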
Case alternation is caused by special auxiliaries “(sa)se” and
“(ra)re.” For instance, in ‘boku/I ga/NOM kare/he ni/DAT
korosa/kill re/PASSIVE ta/DECL’ (I was killed by him), “korosa”
takes a “ga”-marked PP as an object and a “ni”-marked PP as an
agent, though without “(ra)re,” it takes a “ga”-marked PP as an
agent and a “wo”-marked PP as an object.

² The current implementation of the grammar treats complex
structures such as relative clause constructions and
coordinations in the same way as simple modification.
³ These are the three roles most commonly found in EDR.
⁴ We assume that a single semantic role cannot be occupied by
more than one syntactic entity. This assumption is sometimes
violated in EDR's annotation, causing failures in grammar
extraction.

  comp_head
    [1] "kare ga" (he-NOM): [HEAD PP"ga"]
    "korosu" (kill):        [HEAD verb, AGENT [1] PP"ga", OBJECT PP"wo"]
Figure 3: Verb and its argument.

  [HEAD verb,
   SPR <verb [HEAD PASSIVE plus, COMPS [1]]>,
   COMPS [1]]
Figure 4: Lexical sign of “(ra)re”.

We consider auxiliaries as a special type of
verbs which do not have their own subcategorization frames.
They inherit the subcategorization frames of verbs.⁵ To capture
the case alternation
phenomenon, each verb has distinct lexical entries
for its passive and causative uses. This distinc-
tion is made by binary valued HEAD features, PAS-
SIVE and CAUSATIVE. The passive (causative) aux-
iliary restricts the value of its specifier’s PASSIVE
(CAUSATIVE) feature to be plus, so that it can only
be combined with properly case-alternated verbal
lexical entries.
Figure 4 presents the lexical sign of the passive
auxiliary “(ra)re.” Our analysis of an example sen-
tence is presented in Figure 5. Note that the passive
auxiliary “re(ta)” requires the value of the PASSIVE
feature of its specifier be plus, and hence “koro(sa)”
cannot take the same lexical entry as in Figure 3.
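
The interaction between the passive auxiliary and case-alternated lexical entries can be illustrated with a short Python sketch. The `VerbEntry` class, the two entries for the stem “koros”, and `attach_passive` are hypothetical simplifications; the PFORM values are taken from the active and passive examples discussed above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VerbEntry:
    """A verbal lexical entry with a binary PASSIVE head feature."""
    stem: str
    passive: bool            # value of the HEAD feature PASSIVE
    valence: Dict[str, str]  # semantic role -> PFORM of the argument


# Distinct entries for the active and passive uses of the same verb.
LEXICON: List[VerbEntry] = [
    VerbEntry("koros", False, {"AGENT": "ga", "OBJECT": "wo"}),
    VerbEntry("koros", True, {"AGENT": "ni", "OBJECT": "ga"}),
]


def attach_passive(entry: VerbEntry) -> Optional[Dict[str, str]]:
    """Combine a verb with the passive auxiliary '(ra)re'.

    The auxiliary requires PASSIVE plus on its specifier and has no
    subcategorization frame of its own: it inherits the verb's frame.
    """
    if not entry.passive:
        return None
    return dict(entry.valence)


results = [attach_passive(entry) for entry in LEXICON]
```

Only the passive entry survives the combination, so in the resulting verbal sign a ni-marked PP is the agent and a ga-marked PP is the object.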
⁵ The control phenomena caused by auxiliaries are currently
unsupported in our grammar.

  comp_head
    [3] "kare ga" (he-NOM): [HEAD PP"ga"]
    (korosa + reta):        [HEAD verb, AGENT PP"ni", OBJECT [3] PP"ga"]
      "korosa" (kill):  [HEAD verb [PASSIVE plus],
                         AGENT [1] PP"ni", OBJECT [2] PP"ga"]
      "reta" (PASSIVE): [HEAD verb,
                         SPR <verb [HEAD PASSIVE plus, AGENT [1], OBJECT [2]]>,
                         AGENT [1], OBJECT [2]]
Figure 5: Example of passive construction.

4 Grammar Extraction from EDR
The EDR Japanese corpus consists of 207802 sentences, mainly
from newspapers and magazines. The annotation of the corpus
includes word segmentation, part-of-speech (POS) tags, phrase
structure annotation, and semantic information.
The heuristic conversion of the EDR corpus into
an HPSG treebank consists of the following steps. A
sentence ‘((kare/NP-he wo/PP-ACC) (koro/VP-kill
shi/VP-ENDING ta/VP-DECL))’ ([I] killed him) is
used as a running example in some of the steps.
Phrase type annotation Phrase type labels such
as NP and VP are assigned to non-terminal nodes.
Because Japanese is head final, the label of the right-
most daughter of a phrase is usually percolated to its
parent. After this step, the example sentence will be
‘((PP kare/NP wo/PP) (VP koro/VP shi/VP ta/VP)).’
Assign head features The types of head features
of terminal nodes are determined, chiefly from their
phrase types. Features specific to some categories,
such as PFORM, are also assigned in this step.
Binarization Phrases for which EDR employs flat
annotation are converted into binary structures. The
binarized phrase structure of the example sentence
will be ‘((kare wo) ((koro shi) ta)).’
Assign schema names Schema names are as-
signed according to the patterns of phrase structures.
For instance, a phrase structure which consists of
PP and VP is identified as a complement-head struc-
ture, if the VP’s argument and the PP are coindexed.
In the example sentence, ‘kare wo’ is annotated as
the object of ‘koro’ in EDR, so the object-head schema is
applied to the root node of the derivation.
Inverse schema application The consistency of
the derivation of the obtained HPSG treebank is ver-
ified by applying the schemata to each node of the
derivation trees in the treebank.
Lexicon Extraction Lexical entries are extracted
from the terminal nodes of the obtained treebank.
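
Two of the conversion steps above, phrase type annotation and binarization, can be sketched in Python on the example sentence. The tree encoding and function names are assumptions made for illustration, and the left-branching direction of binarization is inferred from the example ‘((kare wo) ((koro shi) ta))’ rather than stated as a general rule of the conversion.

```python
from typing import List, Tuple, Union

# A tree is either a (word, label) leaf or a list of subtrees.
Tree = Union[Tuple[str, str], List["Tree"]]


def phrase_label(tree: Tree) -> str:
    """Percolate the label of the right-most daughter up to the parent."""
    if isinstance(tree, tuple):
        return tree[1]
    return phrase_label(tree[-1])


def binarize(tree: Tree) -> Tree:
    """Turn flat phrases into left-branching binary structures."""
    if isinstance(tree, tuple):
        return tree
    daughters = [binarize(d) for d in tree]
    result = daughters[0]
    for daughter in daughters[1:]:
        result = [result, daughter]
    return result


def bracket(tree: Tree) -> str:
    """Render a tree as a bracketed string of surface words."""
    if isinstance(tree, tuple):
        return tree[0]
    return "(" + " ".join(bracket(d) for d in tree) + ")"


sentence: Tree = [[("kare", "NP"), ("wo", "PP")],
                  [("koro", "VP"), ("shi", "VP"), ("ta", "VP")]]

labels = [phrase_label(phrase) for phrase in sentence]
binarized = bracket(binarize(sentence))
```

Here `labels` holds the phrase types of the two top-level constituents, and `binarized` is the bracketing obtained after the flat VP has been converted into nested binary structures.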
5 Disambiguation Model
We also train disambiguation models for the grammar using the
obtained treebank. We employ log-linear models (Berger et al.,
1996) for the disambiguation. The probability of a parse $t$ of
a sentence $s$ is calculated as follows:

$$p(t \mid s) = \frac{\exp\left(\sum_i \lambda_i f_i(t)\right)}{\sum_{t'} \exp\left(\sum_i \lambda_i f_i(t')\right)}$$

where $f_i$ are feature functions, $\lambda_i$ are the strengths
of the feature functions, and $t'$ spans all possible parses of
$s$. We employ Gaussian MAP estimation (Chen and Rosenfeld,
1999) as a criterion for optimizing $\lambda_i$. An algorithm
proposed by Miyao and Tsujii (2002) provides an efficient
solution to this optimization problem.
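
As a concrete illustration of the log-linear model, the following Python sketch computes the conditional probability of each candidate parse by explicit enumeration. The feature names and weights are invented for the example, and enumerating all parses is only feasible for toy inputs; it does not reflect the efficient estimation algorithm cited above.

```python
import math
from typing import Dict, List


def parse_probabilities(parses: List[Dict[str, float]],
                        weights: Dict[str, float]) -> List[float]:
    """Return the log-linear probability of each candidate parse.

    Each parse is given as a mapping from feature names to feature values;
    features without a weight contribute nothing to the score.
    """
    scores = [sum(weights.get(name, 0.0) * value
                  for name, value in features.items())
              for features in parses]
    # Subtract the maximum score before exponentiating for numerical stability.
    best = max(scores)
    exps = [math.exp(score - best) for score in scores]
    normalizer = sum(exps)
    return [e / normalizer for e in exps]


# Hypothetical weights and two competing parses of one sentence.
weights = {"agent_is_ga_pp": 1.0, "long_distance_attachment": -0.5}
parses = [
    {"agent_is_ga_pp": 1.0},
    {"long_distance_attachment": 1.0},
]
probs = parse_probabilities(parses, weights)
```

The probabilities sum to one over the candidate parses, and the ratio between two parses depends only on the difference of their weighted feature sums.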
6 Experiments
Because the aim of our research is to construct a
Japanese parser that can extract semantic informa-
tion from real-world texts, we evaluated our parser
in terms of its coverage and semantic-role identifica-
tion accuracy. We also compare the accuracy of our
parser with that of an existing statistical dependency
analyzer, in order to investigate the necessity of fur-
ther improvements to our disambiguation model.
The following experiments were conducted using the EDR Japanese
corpus. An HPSG grammar was extracted from 51951 sentences of
the corpus,⁶ and the same set of sentences was used as a
training set for the disambiguation model. 47767 sentences
(91.9%) of the training set were successfully converted into an
HPSG treebank, from which we extracted lexical entries.
When we constructed a lexicon from the extracted lexical
entries, we reserved the lexical entry templates of infrequent
words as default templates for unknown words of each POS, in
order to achieve sufficient coverage. The threshold for
‘infrequent’ words was set to 30 on the basis of preliminary
experiments.

⁶ We could not use the entire corpus for the experiments,
because of the limitation of computational resources.
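
The default-template mechanism for unknown words can be sketched as below. This is one possible reading of the description, stated as an assumption: a word counts as infrequent if it occurs fewer than `threshold` times in the treebank, infrequent words stay in the lexicon, and their templates are additionally pooled per POS. The entry format and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Entry = Tuple[str, str, str]  # (word, POS, lexical entry template name)


def build_lexicon(entries: Iterable[Entry], threshold: int = 30):
    """Build a word lexicon plus per-POS default templates.

    Templates of words seen fewer than `threshold` times are reserved
    as defaults for unknown words of the same POS.
    """
    entries = list(entries)
    frequency = Counter(word for word, _, _ in entries)
    lexicon: Dict[str, Set[str]] = defaultdict(set)
    defaults: Dict[str, Set[str]] = defaultdict(set)
    for word, pos, template in entries:
        lexicon[word].add(template)
        if frequency[word] < threshold:
            defaults[pos].add(template)
    return lexicon, defaults


def lookup(word: str, pos: str, lexicon, defaults) -> Set[str]:
    """Return the templates of a known word, or the POS defaults otherwise."""
    if word in lexicon:
        return lexicon[word]
    return defaults.get(pos, set())


# Toy treebank: one frequent verb and one infrequent verb.
toy_entries = ([("korosu", "verb", "agent_object")] * 40
               + [("shinu", "verb", "agent_only")] * 2)
lexicon, defaults = build_lexicon(toy_entries)
unknown_templates = lookup("taberu", "verb", lexicon, defaults)
```

With this toy input, an unseen verb falls back to the template of the infrequent verb, while the frequent verb keeps only its own template.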
We used 2079 EDR sentences as a test set. (Another set of 2078
sentences was used as a development set.) The test set was also
converted into an HPSG treebank, and the conversion was
successful for 1913 sentences. (We will call the obtained HPSG
treebank the “test treebank.”)
As features of the log-linear model, we extracted the following
from each sign in the derivation trees: the POS of the head, the
template name of the head, the surface string of the head and
its ending, the punctuation contained in the phrase, and the
distance between the heads of the daughters. These features are
used in combination.
The coverage of the parser⁷ on the test set was 95.3%
(1982/2079). Though it is still below the coverage achieved by
SLUNG (Mitsuishi et al., 1998), our grammar has richer
information that enables semantic analysis, which is lacking in
SLUNG.
We evaluated the parser in terms of its accuracy
in identifying semantic roles of arguments of verbs.
For each phrase which is in a complement-head relation with
some VP, a semantic role is assigned according to the type⁸ of
the complement-head structure. The performance of our parser on
the test treebank was 63.8%/57.8% in precision/recall of
semantic roles.
As most studies on syntactic parsing of Japanese have focused on
bunsetsu-based dependency analysis, we also attempted an
evaluation in this framework.⁹ In order to evaluate our parser
by bunsetsu dependency, we converted the phrase structures of
EDR and the output of our parser into dependency structures of
the right-most content word of each bunsetsu. Bunsetsu
boundaries of the EDR sentences were determined by using simple
heuristic rules. The dependency accuracies and the sentential
accuracies of our parser and Kanayama et al.'s analyzer are
shown in Table 2. (Failure sentences are not counted in
calculating the accuracies.) Our results were still
significantly lower than those of Kanayama et al., which are the
best reported dependency accuracies on EDR.

⁷ Coverage of the parser can be somewhat lower than that of the
grammar, because we employed a beam thresholding technique
proposed by Tsuruoka et al. (2004).
⁸ As described in Section 3.2, there are three types of
complement-head structures.
⁹ Bunsetsu is a widely accepted syntactic unit of Japanese,
which usually consists of a content word followed by a function
word.

                         accuracy (dependency)  accuracy (sentence)  # failure
(Kanayama et al., 2000)  88.6% (23078/26062)    46.9% (1560/3326)    1.4% (46/3372)
This paper               85.0% (13201/15524)    37.4% (705/1887)     1.4% (26/1913)
Table 2: Accuracy of dependency analysis.
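
The percentages in Table 2 and the parser coverage can be recomputed from the raw counts with a few lines of Python. The sketch below only re-derives numbers already reported in this section; the dictionary keys and the helper `percentage` are names chosen for the example.

```python
from typing import Dict, Tuple

# (numerator, denominator, reported percentage), copied from Table 2
# and from the coverage figure reported in this section.
REPORTED: Dict[str, Tuple[int, int, float]] = {
    "Kanayama et al. dependency": (23078, 26062, 88.6),
    "Kanayama et al. sentence": (1560, 3326, 46.9),
    "Kanayama et al. failure": (46, 3372, 1.4),
    "This paper dependency": (13201, 15524, 85.0),
    "This paper sentence": (705, 1887, 37.4),
    "This paper failure": (26, 1913, 1.4),
    "Parser coverage": (1982, 2079, 95.3),
}


def percentage(numerator: int, denominator: int) -> float:
    """Return the ratio as a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


# Collect every entry whose recomputed value differs from the reported one.
mismatches = {
    name: percentage(n, d)
    for name, (n, d, reported) in REPORTED.items()
    if percentage(n, d) != reported
}
```

The sentence-level denominators are also consistent with the exclusion of failure sentences: 3372 - 46 = 3326 for Kanayama et al. and 1913 - 26 = 1887 for this paper.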
This experiment revealed that the accuracy of our parser
requires further improvement, although our grammar achieved
high coverage. We expect that incorporating grammar rules for
complex structures that are ignored in the current
implementation (e.g. control, relative clause, and coordination
constructions) will improve the accuracy of the parser. In
addition, we should investigate whether the semantic analysis
our parser provides can contribute to the performance of more
application-oriented tasks such as information extraction.
7 Conclusion
We developed a Japanese HPSG grammar by means of the
corpus-oriented method, and the grammar achieved high coverage,
which we consider to be nearly sufficient for real-world
applications. However, the accuracy of the parser in terms of
dependency analysis was significantly lower than that of the
existing parser. We expect that the accuracy can be improved
through further elaboration of the grammar design and
disambiguation method.
References
Adam L. Berger, Stephen Della Pietra, and Vincent
J. Della Pietra. 1996. A Maximum Entropy Approach
to Natural Language Processing. Computational Lin-
guistics, 22(1).
Francis Bond, Sanae Fujita, Chikara Hashimoto, Kaname
Kasahara, Shigeko Nariyama, Eric Nichols, Akira
Ohtani, Takaaki Tanaka, and Shigeaki Amano. 2004.
The Hinoki Treebank: A Treebank for Text Under-
standing. In Proc. of IJCNLP-04.
J. Bresnan and R. M. Kaplan. 1982. Introduction:
Grammars as mental representations of language. In
The Mental Representation of Grammatical Relations.
MIT Press.
S. Chen and R. Rosenfeld. 1999. A Gaussian prior for
smoothing maximum entropy models. In Technical
Report CMUCS.
Julia Hockenmaier and Mark Steedman. 2002. Acquir-
ing Compact Lexicalized Grammars from a Cleaner
Treebank. In Proc. of Third LREC.
Hiroshi Kanayama, Kentaro Torisawa, Yutaka Mitsuishi,
and Jun’ichi Tsujii. 2000. A Hybrid Japanese Parser
with Hand-crafted Grammar and Statistics. In Proc. of
the 18th COLING, volume 1.
Yutaka Mitsuishi, Kentaro Torisawa, and Jun’ichi Tsujii.
1998. HPSG-Style Underspecified Japanese Grammar
with Wide Coverage. In Proc. of the 17th COLING–
ACL.
Yusuke Miyao and Jun’ichi Tsujii. 2002. Maximum En-
tropy Estimation for Feature Forests. In Proc. of HLT
2002.
Yusuke Miyao, Takashi Ninomiya, and Jun’ichi Tsujii.
2004. Corpus-oriented Grammar Development for
Acquiring a Head-driven Phrase Structure Grammar
from the Penn Treebank. In Proc. of IJCNLP-04.
National Institute of Information and Communications
Technology. 2002. EDR Electronic Dictionary Ver-
sion 2.0 Technical Guide.
Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase
Structure Grammar. The University of Chicago Press.
Y. Schabes, A. Abeille, and A. K. Joshi. 1988. Pars-
ing Strategies with ’Lexicalized’ Grammars: Applica-
tion to Tree Adjoining Grammars. In Proc. of the 12th
COLING.
Melanie Siegel and Emily M. Bender. 2002. Ef-
ficient Deep Processing of Japanese. In Proc. of
the 3rd Workshop on Asian Language Resources and
International Standardization. COLING 2002 Post-
Conference Workshop, August 31.
Mark Steedman. 2001. The Syntactic Process. MIT
Press.
Yoshimasa Tsuruoka, Yusuke Miyao, and Jun’ichi Tsu-
jii. 2004. Towards efficient probabilistic HPSG pars-
ing: integrating semantic and syntactic preference to
guide the parsing. In Proc. of IJCNLP-04 Workshop:
Beyond shallow analyses - Formalisms and statistical
modeling for deep analyses.
