Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006, pages 94–97,
Sydney, July 2006. c©2006 Association for Computational Linguistics
Constructing an English Valency Lexicon∗
Jiˇr´ı Semeck´y, Silvie Cinkov´a
Institute of Formal and Applied Linguistics Af liation
Malostransk·e n·am est· 25
CZ11800 Prague 1
Czech Republic
(semecky,cinkova)@ufal.mff.cuni.cz
Abstract
This paper presents the English valency
lexicon EngValLex, built within the Func-
tional Generative Description framework.
The form of the lexicon, as well as the
process of its semi-automatic creation is
described. The lexicon describes valency
for verbs and also includes links to other
lexical sources, namely PropBank. Basic
statistics about the lexicon are given.
The lexicon will be later used for anno-
tation of the Wall Street Journal section
of the Penn Treebank in Praguian for-
malisms.
1 Introduction
The creation of a valency lexicon of English verbs
is part of the ongoing project of the Prague En-
glish Dependency Treebank (PEDT). PEDT is be-
ing built from the Penn Treebank - Wall Street
Journal section by converting it into dependency
trees and providing it with an additional deep-
syntactic annotation layer, working within the lin-
guistic framework of the Functional Generative
Description (FGD)(Sgall et al., 1986).
The deep-syntactic annotation in terms of FGD
pays special attention to valency. Under va-
lency we understand the ability of lexemes (verbs,
nouns, adjectives and some types of adverbs) to
combine with other lexemes. Capturing of valency
is pro table in Machine Translation, Information
Extraction and Question Answering since it en-
ables the machines to correctly recognize types of
∗ The research reported in this paper has been partially
supported by the grant of Grant Agency of the Czech Repub-
lic GA405/06/0589, the project of the Information Society
No. 1ET101470416, and the grant of the Grant Agency of
the Charles University No. 372/2005/A-INF/MFF.
events and their participants even if they can be
expressed by many different lexical items. A va-
lency lexicon of verbs is inevitable for the project
of the Prague English Dependency Treebank as a
supporting tool for the deep-syntactic corpus an-
notation.
We are not aware of any lexical source from
which such a lexicon could be automatically de-
rived in the desired quality. Manual creation of
gold-standard data for computational applications
is yet very time-consuming and expensive. Hav-
ing this in mind, we decided to adapt the already
existing lexical source PropBank (M. Palmer and
D. Gildea and P. Kingsbury, 2005) to FGD, mak-
ing it comply with the structure of the origi-
nal Czech valency lexicons VALLEX (  Zabokrtsk·y
and Lopatkov·a, 2004) and PDT-VALLEX (J. Haji c
et al., 2003), which have been designed for the
deep-syntactic annotation of the Czech FGD-
based treebanks (The Prague Dependency Tree-
bank 1.0 and 2.0) (J. Haji c et al., 2001; Haji c,
2005). Manual editing follows the automatic pro-
cedure. We are reporting on a work that is still
ongoing (which is though nearing completion).
Therefore this paper focuses on the general con-
ception of the lexicon as well as on its technical
solutions, while it cannot give a serious evaluation
of the completed work yet.
The paper is structured as follows. In Section 2,
we present current or previous related projects in
more detail. In Section 3, we introduce the formal
structure of the EngValLex lexicon. In Section 4,
we describe how we semi-automatically created
the lexicon and describe the annotation tool. Fi-
nally in Section 5, we state our outlooks for the
future development and uses of the lexicon.
94
2 Valency Lexicon Construction
2.1 FGD
The Functional Generative Description (FGD)
(Sgall et al., 1986) is a dependency-based for-
mal strati cational language description frame-
work that goes back to the functional-structural
Prague School. For more detail see (Panevov·a,
1980) and (Sgall et al., 1986). The theory of FGD
has been implemented in the Prague Dependency
Treebank project (Sgall et al., 1986; Haji c, 2005).
FGD captures valency in the underlying syntax
(the so-called tectogrammatical language layer). It
enables listing of complementations (syntactically
dependent autosemantic lexemes) in a valency lex-
icon, regardless of their surface (morphosyntac-
tic) forms, providing them with semantic labels
(functors) instead. Implicitly, a complementation
present in the tectogrammatical layer can either be
directly rendered by the surface shape of the sen-
tence, or it is omitted but can be inferred from the
context or by common knowledge. A valency lex-
icon describes the valency behavior of a given lex-
eme (verb, noun, adjective or adverb) in the form
of valency frames.
2.2 Valency within FGD
A valency frame in the strict sense consists of in-
ner participants and obligatory free modi cations
(see e.g. (Panevov·a, 2002)). Free modi cations
are prototypically optional and do not belong to
the valency frame in the strict sense though some
frames require a free modi cation (e.g. direction
in verbs of movement). Free modi cations have
semantic labels (there are some more than 40 in
PDT) and they are distributed according to seman-
tic judgments of the annotators. FGD introduces
 ve inner participants. Unlike free modi cations,
inner participants cannot be repeated within one
frame. They can be obligatory as well as optional
(which is to be stated by the judgment on gram-
maticality of the given sentence and by the so-
called dialogue test, (Panevov·a, 1974 75)). Both
the obligatory and the optional inner participants
belong to the valency frame in the strict sense.
Like the free modi cations, the inner participants
have semantic labels according to the cognitive
roles they typically enter: ACT (Actor), PAT (Pa-
tient), ADDR (Addressee), ORIG (Origin) and
EFF (Effect). Syntactic criteria are used to identify
the  rst two participants ACT and PAT ( shifting ,
see (Panevov·a, 1974 75)). The other inner partic-
ipants are identi ed semantically; i.e. a verb with
one inner participant will have ACT, a verb with
two inner participants will have ACT and PAT re-
gardless the semantics and a verb with three and
more participants will get the label assigned by the
semantic judgment.
2.3 The Prague Czech-English Dependency
Treebank
In order to develop a state-of-the-art machine
translation system we are aiming at a high-quality
annotation of the Penn Treebank data in a for-
malism similar to the one developed for PDT.
When building PEDT we can draw on the suc-
cessfully accomplished Prague Czech-English De-
pendency Treebank 1.0 (J. Cu r· n and M.  Cmejrek
and J. Havelka and J. Haji c and V. Kubo n and Z.
 Zabokrtsk·y, 2004) (PCEDT).
PCEDT is a Czech-English parallel corpus,
consisting of 21,600 sentences from the Wall
Street Journal section of the Penn Treebank 3
corpus and their human translations to Czech. The
Czech data was automatically morphologically
analyzed and parsed by a statistical parser on
the analytical (i.e. surface-syntax) layer. The
Czech tectogrammatical layer was automatically
generated from the analytical layer. The En-
glish analytical and tectogrammatical trees were
derived automatically from the Penn Treebank
phrasal trees.
2.4 The Prague English Dependency
Treebank
The Prague English Dependency Treebank
(PEDT) stands for the data from Wall Street Jour-
nal section of the Penn Treebank annotated in the
PDT 2.0 shape. EngValLex is a supporting tool
for the manual annotation of the tectogrammatical
layer of PEDT.
3 Lexicon Structure
On the topmost level, EngValLex consists of word
entries, which are characterized by lemmas. Verbs
with a particle (e.g. give up) are treated as separate
word entries.
Each word entry consists of a sequence of
frame entries, which roughly correspond to indi-
vidual senses of the word entry and contain the
valency information.
95
Figure 1: EngValLex editor: the list of words and frames
Each frame entry consists of a sequence of va-
lency slots, a sequence of example sentences and
a textual note. Each valency slot corresponds to a
complementation of the verb and is described by
a tectogrammatical functor de ning the relation
between the verb and the complementation, and a
form de ning the possible surface representations
of the functor. Valency slots can be marked as op-
tional, if not, they are considered to be obligatory.
The form is listed in round brackets following
the functor name. Surface representations of func-
tors are basically de ned by combination of mor-
phological tags and lemmas. Yet to save annota-
tors’ effort, we have introduced several abbrevia-
tions that substitute some regularly co-occurring
sequences. E.g. the abbreviation n means ‘noun
in the subjective case’ and is de ned as follows:
NN:NNS:NP:NPS
meaning one of the Penn Treebank part-of-
speech tags: NN, NNS, NP and NPS (colon delim-
its variants). Abbreviation might be de ned recur-
sively.
Apart from describing only the daughter node
of the given verb, the surface representation can
describe an entire analytical subtree whose top-
most node is the daughter of the given verb node.
Square brackets are used to indicate descendant
nodes. Square brackets allow nesting to indicate
the dependency relations among the nodes of a
given subtree. For example, the following state-
ment describes a particle to whose daughter node
is a verb.
to.TO[VB]
The following statement is an example of a def-
inition of three valency slots and their correspond-
ing forms:
ACT(.n) PAT(to.TO[VB])
LOC(at.IN)
The ACT (Actor) can be any noun in the sub-
jective case (the abbreviation n), the PAT (Patient)
can be a particle to with a daughter verb, and the
LOC (Locative) can be the preposition at.
Moreover, EngValLex contains links to external
data sources (e.g. lexicons) from words, frames,
valency slots and example sentences.
The lexicon is stored in an XML format which
is similar to the format of the PDT-VALLEX lexi-
con used in the Prague Dependency Treebank 2.0.
4 Creating the Lexicon
The lexicon was automatically generated from
PropBank using XSLT templates. Each PropBank
example was expanded in a single frame in the
destination lexicon. When generating the lexicon,
we have kept as many back links to PropBank as
possible. Namely, we stored links from frames
to Propbank rolesets, links from valency slots to
PropBank arguments and links from examples to
PropBank examples. Rolesets were identi ed by
the roleset id attribute. Arguments were identi ed
by the roleset id, the name and the function of the
96
role. Examples were identi ed by the roleset id
and their name.
After the automatic conversion, we had 8,215
frames for 3,806 words.
Tectogrammatical functors were assigned semi-
automatically according to hand-written rules,
which were conditioned by PropBank arguments.
It was yet clear from the beginning that manual
corrections would be necessary as the relations of
Args to functors varied depending on linguistic de-
cisions1.
The annotators were provided with an anno-
tation editor created on the base of the PDT-
VALLEX editor. Apart from interface for editing
EngValLex, the tool contains integrated viewers
of PropBank and VerbNet, which allows of ine
browsing of the lexicons. Those viewers can be
run as a stand-alone application as well and are
published freely on the web2. The editor allows
the annotator to create, delete, and modify word
entries, and frame entries. Links to PropBank can
be set up, if necessary.
Figure 1 displays the main window of the editor.
The left part of the window shows list of words.
The central part shows the list of the frames con-
cerning the selected verb.
For the purpose of annotation, we divided the
lexicon into 1,992  les according to the name of
PropBank rolesets (attribute name of the XML el-
ement roleset), and the  les are annotated sepa-
rately. When the annotation is  nished, the  les
will be merged again. Currently, we have about
80% of the lexicon annotated, which already con-
tains the most dif cult cases.
5 Outlook
We have annotated the major part of EngValLex.
In the  nal version, a small part of the lexicon will
be annotated by a second annotator in order to de-
termine the inter-annotator agreement.
The annotation of the Prague English Depen-
dency Treebank on the tectogrammatical level will
be started soon and we will use EngValLex for as-
signing valency frames to verbs. The annotation
1E.g. Arg0 typically corresponds to ACT, and Arg1 to
PAT when they co-occur. Yet, a roleset including an inchoa-
tive sentence (The door.ARG1 opened.) and a causative sen-
tence (John.Arg0 opened the door.Arg1) will be split into two
FGD frames. The causative frame will keep Arg0→ACT and
Arg1→PAT whereas the inchoative will get Arg1→ACT.
2 http://ufal.mff.cuni.cz/∼semecky/
software/{propbank|verbnet}viewer/
will be based on the same theoretical background
as the Prague Dependency Treebank.
Due to the PropBank links in EngValLex, we
will be able to automatically derive frame anno-
tation of PEDT from PropBank annotation of the
Penn Treebank.
As the Wall Street Journal sentences are manu-
ally translated into Czech, we will be able to ob-
tain their Czech tectogrammatical representations
automatically using state-of-art parsers.
A solid platform for testing Czech-English and
English-Czech machine translation will be given.
In the future we will also try to improve the trans-
lation by mapping the Czech PDT-ValLex to the
English EngValLex.

References
J. Haji c, 2005. Complex Corpus Annotation: The
Prague Dependency Treebank, pages 54 73. Veda
Bratislava, Slovakia.
J. Cu r· n and M.  Cmejrek and J. Havelka and J. Haji c
and V. Kubo n and Z.  Zabokrtsk·y. 2004. Prague
Czech-English Dependency Treebank Version 1.0.
(LDC2004T25).
J. Haji c et al. 2001. Prague Dependency Treebank 1.0.
LDC2001T10, ISBN: 1-58563-212-0.
J. Haji c et al. 2003. PDT-VALLEX: Creating a Large-
coverage Valency Lexicon for Treebank Annotation.
In Proceedings of The Second Workshop on Tree-
banks and Linguistic Theories, volume 9.
M. Palmer and D. Gildea and P. Kingsbury. 2005. The
Proposition Bank: An Annotated Corpus of Seman-
tic Roles. Computational Linguistics, 31(1):71.
J. Panevov·a. 1974 75. On verbal Frames in Functional
Generative Description. Prague Bulletin of Mathe-
matical Linguistics (PBML), 22, pages 3 40, Part II,
PBML 23, pages. 17 52.
J. Panevov·a. 1980. Formy a funkce ve stavbˇe ˇcesk´e
vˇety [Forms and functions in the structure of the
Czech sentence]. Academia, Prague, Czech Rep.
J. Panevov·a. 2002. Sloveso: centrum v ety; va-
lence: centr·aln· pojem syntaxe. In Aktu´alne ot´azky
slovenskej syntaxe, pages x1 x5.
P. Sgall, E. Haji cov·a, and J. Panevov·a. 1986. The
Meaning of the Sentence in its Semantic and Prag-
matic Aspects. Academia, Prague, Czech Repub-
lic/Reidel Publishing Company, Netherlands.
Z.  Zabokrtsk·y and M. Lopatkov·a. 2004. Valency
Frames of Czech Verbs in VALLEX 1.0. In Fron-
tiers in Corpus Annotation. Proceedings of the
Workshop of the HLT/NAACL Conference, pages
70 77, May 6, 2004.
