The Hinoki Treebank: Working Toward Text Understanding
Francis Bond, Sanae Fujita, Chikara Hashimoto, 
Kaname Kasahara, Shigeko Nariyama,y Eric Nichols,y
Akira Ohtani,z Takaaki Tanaka, Shigeaki Amano
NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation
 Kobe Shoin Women’s University yNAIST zOsaka Gakuin University
fbond, sanae, kaname, takaaki, amanog@cslab.kecl.ntt.co.jp  chashi@sils.shoin.ac.jp,
yferic-n, shigekog@is.naist.jp zohtani@utc.osaka-gu.ac.jp
Abstract
In this paper we describe the construction of a
new Japanese lexical resource: the Hinoki treebank.
The treebank is built from dictionary de nition sen-
tences, and uses an HPSG based Japanese grammar
to encode the syntactic and semantic information.
We show how this treebank can be used to extract
thesaurus information from de nition sentences in
a language-neutral way using minimal recursion se-
mantics.
1 Introduction
In this paper we describe the current state of a new
lexical resource: the Hinoki treebank. The motiva-
tion and initial construction was described in detail
in Bond et al. (2004a). The ultimate goal of our re-
search is natural language understanding | we aim
to create a system that can parse text into some use-
ful semantic representation. Ideally this would be
such that the output can be used to actually update
our semantic models. This is an ambitious goal, and
this paper does not present a completed solution,
but rather a road-map to the solution, with some
progress along the way.
The mid-term goal is to build a thesaurus from
dictionary de nition sentences and use it to enhance
a stochastic parse ranking model that combines syn-
tactic and semantic information. In order to do this
the Hinoki project is combining syntactic annotation
with word sense tagging. This will make it possible
to test the use of similarity and/or class based ap-
proaches together with symbolic grammars and sta-
tistical models. Our aim in this is to alleviate data
sparseness. In the Penn Wall Street Journal tree-
bank (Taylor et al., 2003), for example, the words
stocks and skyrocket never appear together. How-
ever, the superordinate concepts capital ( stocks)
and move upward ( skyrocket) often do.
We are constructing the ontology from the ma-
chine readable dictionary Lexeed (Kasahara et al.,
2004). This is a hand built self-contained lexicon:
it consists of headwords and their de nitions for the
most familiar 28,000 words of Japanese. This set
is large enough to include most basic level words
and covers over 75% of the common word tokens
in a sample of Japanese newspaper text. In order
to make the system self sustaining we base the  rst
growth of our treebank on the dictionary de nition
sentences themselves. We then train a statistical
model on the treebank and parse the entire lexicon.
From this we induce a thesaurus. We are currently
tagging the de nition sentences with senses. We will
then use this information and the thesaurus to build
a model that combines syntactic and semantic in-
formation. We will also produce a richer ontology
| for example extracting selectional preferences. In
the last phase, we will look at ways of extending our
lexicon and ontology to less familiar words.
In this paper we present the results from treebank-
ing 38,900 dictionary sentences. We also highlight
two uses of the treebank: building the statistical
models and inducing the thesaurus.
2 The Lexeed Semantic Database of
Japanese
The Lexeed Semantic Database of Japanese consists
of all Japanese words with a familiarity greater than
or equal to  ve on a seven point scale (Kasahara et
al., 2004). This gives 28,000 words in all, with 46,347
di erent senses. De nition sentences for these sen-
tences were rewritten to use only the 28,000 familiar
words (and some function words). The de ning vo-
cabulary is actually 16,900 di erent words (60% of
all possible words). An example entry for  rst two
senses of the word a0a2a1a4a3a6a5a8a7 doraib a \driver" is
given in Figure 1, with English glosses added (un-
derlined features are those added by Hinoki).
3 The Hinoki Treebank
The structure of our treebank is inspired by the Red-
woods treebank of English in which utterances are
parsed and the annotator selects the best parse from
the full analyses derived by the grammar (Oepen et
al., 2002). We had four main reasons for selecting
this approach. The  rst was that we wanted to de-
velop a precise broad-coverage grammar in tandem
with the treebank, as part of our research into nat-
ural language understanding. Treebanking the out-
2
66
66
66
66
66
66
4
Index a0a2a1a4a3a6a5a8a7 doraib a
POS noun Lexical-type noun-lex
Familiarity 6.5 [1{7]
Sense 1
2
4
Definition a9a11a10 /a12 /a13a15a14a17a16a8a18 /a19a21a20 / a22a24a23a26a25a2a27a26a28 /a19 /a29a26a30 /a31a8a32 /a33 \A tool for inserting and removing screws ."
Hypernym a31a8a32 1 equipment \tool"
Sem. Class h942:tooli ( 893:equipment)
3
5
Sense 2
2
4
Definition a34a36a35a38a37 /a12 /a39a8a40 /a29a26a30 /a41 /a33 \Someone who drives a car ."
Hypernym a41 1 hito \person"
Sem. Class h292:driveri ( 4:person)
3
5
3
77
77
77
77
77
77
5
Figure 1: Entry for the Word doraib a \driver" (with English glosses)
put of the parser allows us to immediately identify
problems in the grammar, and improving the gram-
mar directly improves the quality of the treebank in
a mutually bene cial feedback loop (Oepen et al.,
2004).
The second reason is that we wanted to anno-
tate to a high level of detail, marking not only
dependency and constituent structure but also de-
tailed semantic relations. By using a Japanese gram-
mar (JACY: Siegel and Bender (2002)) based on a
monostratal theory of grammar (HPSG: Pollard and
Sag (1994)) we could simultaneously annotate syn-
tactic and semantic structure without overburdening
the annotator. The treebank records the complete
syntacto-semantic analysis provided by the HPSG
grammar, along with an annotator’s choice of the
most appropriate parse. From this record, all kinds
of information can be extracted at various levels
of granularity. In particular, traditional syntactic
structure (e.g., in the form of labeled trees), de-
pendency relations between words and full meaning
representations using minimal recursion semantics
(MRS: Copestake et al. (1999)). A simpli ed exam-
ple of the labeled tree, MRS and dependency views
for the de nition of a0 a1 a3 a5 a7 2 doraib a \driver" is
given in Figure 2.
The third reason was that we expect the use of the
grammar as a base to aid in enforcing consistency |
all sentences annotated are guaranteed to have well-
formed parses. Experience with semi-automatically
constructed grammars, such as the Penn Treebank,
shows many inconsistencies remain (around 4,500
types estimated by Dickinson and Meurers (2003))
and the treebank does not allow them to be identi-
 ed automatically.
The last reason was the availability of a reason-
ably robust existing HPSG of Japanese (JACY), and
a wide range of open source tools for developing
the grammars. We made extensive use of the LKB
(Copestake, 2002), a grammar development environ-
ment, in order to extend JACY to the domain of
de ning sentences. We also used the extremely e -
cient PET parser (Callmeier, 2000), which handles
grammars developed using the LKB, to parse large
test sets for regression testing, treebanking and  -
nally knowledge acquisition. Most of our develop-
ment was done within the [incr tsdb()] pro ling en-
vironment (Oepen and Carroll, 2000). The existing
resources enabled us to rapidly develop and test our
approach.
3.1 Creating and Maintaining the Treebank
The construction of the treebank is a two stage pro-
cess. First, the corpus is parsed (in our case using
JACY with the PET parser), and then the annotator
selects the correct analysis (or occasionally rejects
all analyses). Selection is done through a choice of
discriminants. The system selects features that dis-
tinguish between di erent parses, and the annotator
selects or rejects the features until only one parse
is left. The number of decisions for each sentence
is proportional to log2 of the number of parses, al-
though sometimes a single decision can reduce the
number of remaining parses by more or less than
half. In general, even a sentence with 5,000 parses
only requires around 12 decisions.
Because the disambiguating choices made by the
annotators are saved, it is possible to update the
treebank when the grammar changes (Oepen et al.,
2004). Although the trees depend on the grammar,
re-annotation is only necessary in cases where either
the parse has become more ambiguous, so new de-
cisions have to be made, or existing rules or lexical
items have changed so much that the system cannot
reconstruct the parse.
One concern that has been raised with Redwoods
style treebanking is the fact that the treebank is tied
to a particular implementation of a grammar. The
ability to update the treebank alleviates this concern
to a large extent. A more serious concern is that it is
only possible to annotate those trees that the gram-
mar can parse. Sentences for which no analysis had
been implemented in the grammar or which fail to
parse due to processing constraints are left unan-
notated. This makes grammar coverage an urgent
issue. However, dictionary de nition sentences are
more repetitive than newspaper text. In addition,
there is little reference to outside context, and Lex-
UTTERANCE
NP
VP N
PP V
N CASE-P V V
a0a2a1a4a3 a5 a6a4a7 a8a10a9 a11
jid osha o unten suru hito
car acc drive do person
Parse Tree
hh0; x1fh0 :proposition rel(h1)
h1 :hito(x1) \person"
h2 :u def(x1; h1; h6)
h3 :jidosha(x2) \car"
h4 :u def(x2; h3; h7)
h5 :unten(e1; x1; x2)gi \drive"
MRS
fx1 :
e1 :unten(arg1 x1 : hito; arg2 x2 : jidosha)
r1 :proposition(marg e1 : unten)
Dependency
Figure 2: Parse Tree, Simpli ed MRS and Dependency Views for a0 a1 a3 a5 a7 2 doraib a \driver"
eed has a  xed de ning vocabulary. This makes it a
relatively easy domain to work with.
We extended JACY by adding the de ning vo-
cabulary, and added some new rules and lexical-
types (more detail is given in Bond et al. (2004a)).1
Almost none of the rules are speci c to the dic-
tionary domain. The grammatical coverage over
all sentences when we began to treebank was 84%,
and it is currently being increased further as we
work on the grammar. We have now treebanked
all de nition sentences for words with a familiar-
ity greater than or equal to 6.0. This came to
38,900 sentences with an average length of 6.7
words/sentence. The extended JACY grammar is
available for download from www.dfki.uni-sb.de/
~siegel/grammar-download/JACY-grammar.html.
4 Applications
The treebanked data and grammar have been tested
in two ways. The  rst is in training a stochastic
model for parse selection. The second is in building
a thesaurus from the parsed data.
4.1 Stochastic Parse Ranking
Using the treebanked data, we built a stochastic
parse ranking model with [incr tsdb()]. The ranker
uses a maximum entropy learner to train a PCFG
over the parse derivation trees, with the current node
as a conditioning feature. The correct parse is se-
lected 61.7% of the time (training on 4,000 sentences
and testing on another 1,000; evaluated per sen-
tence). More feature-rich models using parent and
grandparent nodes along with models trained on the
MRS representations have been proposed and imple-
mented with an English grammar and the Redwoods
treebank (Oepen et al., 2002). We intend to include
such features, as well as adding our own extensions
to train on constituent weight and semantic class.
1We bene ted greatly from advice from the main JACY
developers: Melanie Siegel and Emily Bender.
4.2 Knowledge Acquisition
We selected dictionary de nitions as our  rst corpus
in order to use them to acquire lexical and ontolog-
ical knowledge. Currently we are classifying hyper-
nym, hyponym, synonym and domain relationships
in addition to linking senses to an existing ontol-
ogy. Our approach is described in more detail in
Bond et al. (2004b). The main di erence between
our research and earlier approaches, such as Tsu-
rumaru et al. (1991), is that we are fully parsing
the input, not just using regular expressions. Pars-
ing sentences to a semantic representation (Minimal
Recursion Semantics, Copestake et al. (1999)) has
three advantages. The  rst is that it makes our
knowledge acquisition somewhat language indepen-
dent: if we have a parser for some language that
can produce MRS, and a dictionary, the algorithm
can easily be ported. The second reason is that we
can go on to use the same system to acquire knowl-
edge from non-dictionary sources, which will not be
as regular as dictionaries and thus harder to parse
using only regular expressions. Third, we can more
easily acquire knowledge beyond simple hypernyms,
for example, identifying synonyms through common
de nition patterns (Tsuchiya et al., 2001).
To extract hypernyms, we parse the  rst de ni-
tion sentence for each sense. The parser uses the
stochastic parse ranking model learned from the Hi-
noki treebank, and returns the MRS of the  rst
ranked parse. Currently, 84% of the sentences can
be parsed. In most cases, the word with the highest
scope in the MRS representation will be the hyper-
nym. For example, for doraib a1 the hypernym is a12
a13 d ogu \tool" and for doraib a
2 the hypernym is
a11 hito \person" (see Figure 1). Although the ac-
tual hypernym is in very di erent positions in the
Japanese and English de nition sentences, it takes
the highest scope in both their semantic representa-
tions.
For some de nition sentences (around 20%), fur-
ther parsing of the semantic representation is nec-
essary. For example, a14a16a15 1 ana is de ned as ana:
The abbreviation of \announcer" (translated to En-
glish). In this case abbreviation has the highest
scope but is an explicit relation. We therefore parse
to  nd its complement and extract the relationship
abbreviation(ana1,announcer1). The semantic repre-
sentation is largely language independent. In order
to port the extraction to another language, we only
have to know the semantic relation for abbreviation.
We evaluate the extracted pairs by comparison
with an existing thesaurus: Goi-Taikei (Ikehara et
al., 1997). Currently 58.5% of the pairs extracted
for nouns are linked to nodes in the Goi-Taikei on-
tology (Bond et al., 2004b). In general, we are ex-
tracting pairs with more information than the Goi-
Taikei hierarchy of 2,710 classes. In particular, many
classes contain a mixture of class names and instance
names: a0a2a1 buta niku \pork" and a1 niku \meat"
are in the same class, as are a0 a1a4a3 percussion in-
strument \drum" and a5a7a6a7a8 dagakki \percussion
instrument", which we can now distinguish.
5 Conclusion and Further Work
In this paper we have described the current state of
the Hinoki treebank. We have further showed how
it is being used to develop a language-independent
system for acquiring thesauruses from machine-
readable dictionaries.
We are currently concentrating on three tasks.
The  rst is improving the coverage of the grammar,
so that we can parse more sentences to a correct
parse. The second is improving the knowledge ac-
quisition and learning other information from the
parsed de ning sentences | in particular lexical-
types, semantic association scores, meronyms, and
antonyms. The third task is adding the knowledge
of hypernyms into the stochastic model.
With the improved the grammar and ontology, we
will use the knowledge learned to extend our model
to words not in Lexeed, using de nition sentences
from machine-readable dictionaries or where they
appear within normal text. In this way, we can grow
an extensible lexicon and thesaurus from Lexeed.

References
Francis Bond, Sanae Fujita, Chikara Hashimoto,
Kaname Kasahara, Shigeko Nariyama, Eric Nichols,
Akira Ohtani, Takaaki Tanaka, and Shigeaki Amano.
2004a. The Hinoki treebank: A treebank for text
understanding. In Proceedings of the First Interna-
tional Joint Conference on Natural Language Process-
ing (IJCNLP-04). Springer Verlag. (in press).
Francis Bond, Eric Nichols, Sanae Fujita, and Takaaki
Tanaka. 2004b. Acquiring an ontology for a funda-
mental vocabulary. In COLING 2004, Geneva. (to
appear).
Ulrich Callmeier. 2000. PET - a platform for experi-
mentation with e cient HPSG processing techniques.
Natural Language Engineering, 6(1):99{108.
Ann Copestake, Dan Flickinger, Carl Pollard, and
Ivan A. Sag. 1999. Minimal recursion semantics:
An introduction. (manuscript http://www-csli.
stanford.edu/~aac/papers/newmrs.ps).
Ann Copestake. 2002. Implementing Typed Feature
Structure Grammars. CSLI Publications.
Markus Dickinson and W. Detmar Meurers. 2003. De-
tecting inconsistencies in treebanks. In Proceedings
of the Second Workshop on Treebanks and Linguistic
Theories, V axj o, Sweeden.
Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio
Yokoo, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi
Ooyama, and Yoshihiko Hayashi. 1997. Goi-Taikei
| A Japanese Lexicon. Iwanami Shoten, Tokyo. 5
volumes/CDROM.
Kaname Kasahara, Hiroshi Sato, Francis Bond, Takaaki
Tanaka, Sanae Fujita, Tomoko Kanasugi, and
Shigeaki Amano. 2004. Construction of a Japanese se-
mantic lexicon: Lexeed. SIG NLC-159, IPSJ, Tokyo.
(in Japanese).
Stephan Oepen and John Carroll. 2000. Performance
pro ling for grammar engineering. Natural Language
Engineering, 6(1):81{97.
Stephan Oepen, Kristina Toutanova, Stuart Shieber,
Christoper D. Manning, Dan Flickinger, and Thorsten
Brant. 2002. The LinGO redwoods treebank: Mo-
tivation and preliminary applications. In 19th In-
ternational Conference on Computational Linguistics:
COLING-2002, pages 1253{7, Taipei, Taiwan.
Stephan Oepen, Dan Flickinger, and Francis Bond.
2004. Towards holistic grammar engineering and test-
ing | grafting treebank maintenance into the gram-
mar revision cycle. In Beyond Shallow Analyses |
Formalisms and Satitistical Modelling for Deep Anal-
ysis (Workshop at IJCNLP-2004), Hainan Island.
(http://www-tsujii.is.s.u-tokyo.ac.jp/bsa/).
Carl Pollard and Ivan A. Sag. 1994. Head Driven Phrase
Structure Grammar. University of Chicago Press,
Chicago.
Melanie Siegel and Emily M. Bender. 2002. E cient
deep processing of Japanese. In Procedings of the 3rd
Workshop on Asian Language Resources and Interna-
tional Standardization at the 19th International Con-
ference on Computational Linguistics, Taipei.
Ann Taylor, Mitchel Marcus, and Beatrice Santorini.
2003. The Penn treebank: an overview. In Anne
Abeill e, editor, Treebanks: Building and Using Parsed
Corpora, chapter 1, pages 5{22. Kluwer Academic
Publishers.
Masatoshi Tsuchiya, Sadao Kurohashi, and Satoshi Sato.
2001. Discovery of de nition patterns by compressing
dictionary sentences. In Proceedings of the 6th Natu-
ral Language Processing Paci c Rim Symposium, NL-
PRS2001, pages 411{418, Tokyo.
Hiroaki Tsurumaru, Katsunori Takesita, Itami Katsuki,
Toshihide Yanagawa, and Sho Yoshida. 1991. An ap-
proach to thesaurus construction from Japanese lan-
guage dictionary. In IPSJ SIGNotes Natural Lan-
guage, volume 83-16, pages 121{128. (in Japanese).
