Categorial Type Logic meets Dependency Grammar to annotate
an Italian Corpus
R. Bernardi
KRDB,
Free University of Bolzano Bozen,
P.zza Domenicani, 3
39100 Bolzano Bozen,
Italy,
bernardi@inf.unibz.it
A. Bolognesi and F. Tamburini
CILTA,
University of Bologna,
P.zza San Giovanni in Monte, 4,
I-40124, Bologna,
Italy,
{bolognesi,tamburini}@cilta.unibo.it
M. Moortgat
UiL OTS,
Utrecht University,
Trans 10,
3512 JK, Utrecht,
The Netherlands
Moortgat@phil.uu.nl
Abstract
In this paper we present work in progress on the
annotation of an Italian Corpus (CORIS) devel-
oped at CILTA (University of Bologna). We in-
duce categorial type assignments from a depen-
dency treebank (Torino University treebank,
TUT) and use the obtained categories with an-
notated dependency relations to study the dis-
tributional behavior of Italian words and reach
an empirically founded part-of-speech classifica-
tion.
1 Introduction
The work reported on here is part of a project1
aimed at annotating the CORIS/CODIS 100-
million-words synchronic corpus of contempo-
rary Italian with linguistic information: first
part-of-speech tagging for the complete corpus,
and in a later stage syntactic analysis for a sub-
corpus.
We have been investigating existing Italian
treebanks in order to assess their potential use-
fulness for bootstrapping the CORIS/CODIS an-
notation tasks. We are aware of two such tree-
banks that are relevant to our purposes: the
TUTcorpusdeveloped at Torino University, and
ISST (Italian Syntactic-Semantic treebank) de-
veloped under the national program SI-TAL by
a consortium of companies and research centers
coordinated by the “Consorzio Pisa Ricerche”
(CPR)2.
ISST is a multi-layered corpus, annotated at
the syntactic and lexico-semantic levels. A user
interface is provided to explore the corpus. The
ISST corpus is rather competitive in terms of
size: it counts 305,547 word tokens (Monte-
magni et al., 2003). A drawback is that the
corpus is not publicly available yet. The TUT
1The project is funded under the FIRB 2001 action.
2The parnters of the consortium were: ILC-
CNR/CPR, Venice University/CVR, ITC-IRST, “Tor
Vergata” University/CERTIA, and Synthema.
corpus is rather small, consisting only of 38,653
words. There is no user interface for TUT but
the corpus is downloadable (http://www.di.
unito.it/~tutreeb/). Despite its small size,
TUT can serve as a training corpus for creating
larger annotated resources.
Our goal is to annotate CORIS with part-of-
speech (PoS) tags and semi-automatically build
a treebank for a fragment of it. To achieve this
two-fold task we start from TUT exploiting its
information on dependency relations. We are
still in a preliminary phase of the project and
so far attention has been focused on the first
task. However, the work done in this phase is
expected to play a role in our second task too.
In Section 2, we will describe in more detail
the problems which arise when working out the
first task. In Section 3, we briefly introduce the
formalisms we work with. In Section 4, we ex-
plain how we encode dependency relations into
categorial type assignments (CTAs), and how we
automatically induce these types from TUT de-
pendency structures. Finally, in Section 5 we
draw some preliminary conclusions and briefly
describe our action list.
2 PoS tagging for Italian
Before embarking on our first task, we have
studied the current situation with respect to
PoS tagging for Italian. Italian is one of the
languages for which a set of annotation guide-
lines has been developed in the context of the
EAGLES project (Expert Advisory Group on
Language Engineering Standards (Monachini,
1995)). Several research groups have worked on
PoS annotation in practice (for example, Torino
University, Xerox and Venice University).
We have compared the tag sets used by these
groups with Monachini’s guidelines. From this
comparison, it results that though there is a
general agreement on the main parts of speech
to be used3, considerable divergence exists when
it comes to the actual classification of Italian
words with respect to these main PoS classes.
The classes for which differences of opinion are
most evident are adjectives, determiners and
adverbs. For instance, words like molti (tr.
many) have been classified as “indefinite deter-
miners” by Monachini, “plural quantifiers” by
Xerox, “indefinite adjectives” by the Venice and
Turin groups. This simple example shows that
the choice of PoS tags is already influenced by
the linguistic theory adopted in the background.
This theoretical bias will then influence the kind
of conclusions one can draw from the annotated
corpus.
Our aim is to derive an empirically founded
PoS classification, making no a priori assump-
tions about the PoS classes to be distinguished.
Our background assumptions are minimal and,
we hope, uncontroversial: we assume that
we have access to head-dependent (H-D) and
functor-argument (F-A) relations in our mate-
rial. We encode the H-D and F-A information
into categorial type formulas. These formulas
then serve as “labels/tags” from which we ob-
tain the desired empirically founded PoS classi-
fication by means of a clustering algorithm.
To bootstrap the process of type induction,
we transform the TUT corpus into a simpli-
fied dependency treebank. The transformation
keeps the bare dependency relations but re-
moves the more theory-laden annotation. In
Section 4, we describe how we use the simpli-
fied dependency treebank for our distributional
study of Italian PoS classification. First, we
briefly look at H-D and F-A relations as they
occur in the TUT corpus and in Categorial Type
Logic (CTL).
3 Dependency and
functor-argument relations
3.1 Dependency structures in TUT
The Turin University Treebank (TUT) is a cor-
pus of Italian sentences annotated by specifying
relational structures augmented with morpho-
syntactic information and semantic role (hence-
forth ARS) in a monostratal dependency-based
representation. The treebank in its current re-
lease includes 38,653 words and 1,500 sentences
3The standard classification consists of nouns, verbs,
adjectives, determiners, articles, adverbs, prepositions,
conjunctions, numerals, interjections, punctuation and
a class of residual items which differs from project to
project.
from the Italian civil law code, the national
newspapers La Stampa and La Repubblica, and
from various reviews, newspapers, novels, and
academic papers.
The ARS schema consists of i) morpho-
syntactic, ii) functional-syntactic and iii) se-
mantic components, specifying part-of-speech,
grammatical relations, and thematic role infor-
mation, respectively. The reader is referred
to (Bosco, 2003) for a detailed description of the
TUT annotation schema. An example is given
below (tr. “The first steps have not been encour-
aging”). In this example, the node TOP-VERB is
the root of the whole structure4.
************** FRASE ALB-71 **************
1 I (IL ART DEF M PL)
[6;VERB-SUBJ]
2 primi (PRIMO ADJ ORDIN M PL)
[3;ADJC+ORDIN-RMOD]
3 approcci (APPROCCIO NOUN COMMON M PL)
[1;DET+DEF-ARG]
4 non (NON ADV NEG)
[6;ADVB-RMOD]
5 sono (ESSERE VERB AUX IND PRES INTR 3 PL)
[6;AUX+TENSE]
6 stati (ESSERE VERB MAIN PART PAST INTR PL M)
[0;TOP-VERB]
7 esaltanti (ESALTANTE ADJ QUALIF ALLVAL PL)
[6;VERB-PREDCOMPL+SUBJ]
8 . (#\. PUNCT) [6;END]
Because we are interested in extracting
dependency relations, we can focus on the
functional-syntactic component of the TUT an-
notation, where information relating to gram-
matical relations (heads and dependents) is en-
coded.
The TUT annotation schema for depen-
dents makes a primary distinction between
(a) functional and (b) non-functional tags,
for dependents that can and that can-
not be assigned thematic roles, respectively.
These two classes are further divided into
(a’) arguments (ARG) and modifiers (MOD)
and (b’), AUX, COORDINATOR, INTERJECTION,
CONTIN, EMPTYCOMPL, EMPTYLOC, SEPARATOR
and VISITOR5; and furthermore, the arguments
4The top nodes used in TUT are TOP-VERB,
TOP-NOUN, TOP-CONJ, TOP-ART, TOP-NUM, TOP-PRON,
TOP-PHRAS and TOP-PREP
5The labels that require some explanation are: (i)
CONTIN, (ii) EMPTYCOMPL, (iii) EMPTYLOC and (iv) VISITOR.
They are used for expressions that (i) introduce a part
of an expression with a non-compositional interpreta-
tion (e.g. locative or idiomatic expressions and denom-
inative structures: “Arriv`o [prima]H [de]D ll’alba”, lit.
tr. “(She) arrived ahead of the daybreak”); (ii) link a re-
(ARG) and modifiers(MOD) are sub-divided as fol-
lowing
ARG
SUBJ OBJ INDOBJ INDCOMPL PREDCOMPL
MODIFIER
RMOD
RELCLR RMODPRED
APPOSITION
RELCLA
3.2 Categorial functor-argument
structures
Categorial Type Logic (CTL) (Moortgat, 1997)
is a logic-based formalism belonging to the fam-
ily of Categorial Grammars (CG). In CTL, the
type-forming operations of CG are viewed as
logical connectives. As the slogan “Parsing-
as-Deduction” suggests, such a view makes it
possible to do away with combinatory syn-
tactic rules altogether; establishing the well-
formedness of an expression becomes a process
of deduction in the logic of the type-forming
connectives.
The basic distinction expressed by the cat-
egorial type formulas is the Fregean opposi-
tion between complete and incomplete expres-
sions. Complete expressions are categorized by
means of atomic type formulas; grammatical-
ity judgements for expressions with an atomic
type do not require further contextual informa-
tion. Typical examples of atomic types would
be ‘sentence’ (s) and ‘noun’ (n). Incomplete ex-
pressions are categorized by means of fractional
type formulas; the denominators of these frac-
tions indicate the material that has to be found
in the context in order to obtain a complete ex-
pression of the type of the numerator.
Definition 3.1 (Fractional type formulas)
Given a set of basic types ATOM, the set of
types TYPE is the smallest set such that:
i. if A∈ATOM, then A∈TYPE;
ii. if A and B ∈ TYPE, then A/B and
B\A∈TYPE.
There are different ways of presenting valid
type computations. In a Natural Deduction for-
mat, we write Γ turnstileleft A for a demonstration that
flexive personal pronoun with particular verbal head (e.g.
“La porta [si]D [apre]H”, lit. tr. “the door (it) opens”);
(iii) link a pronoun with a verbal head introducing a sort
of metaphorical location of the head (e.g. “In Albania
[ci]D [sono]H molti problemi”, lit. tr. “In Albany there
are many problems”.); (iv) mark the extraction of a part
of a structure (e.g. “Cos`ı devi vedere questo argomento”,
lit. tr. “This way (you) must see this topic”).
the structure Γ has the type A. The statement
AturnstileleftA is axiomatic. Each of the connectives /
and\has an Elimination rule and an Introduc-
tion rule. Below, we give these inference rules
for / (incompleteness to the right). The cases
for\(incompleteness to the left) are symmetric.
Given structures Γ and ∆ of types A/B and B
respectively, the Elimination rule builds a com-
pound structure Γ◦∆ of type A. The Introduc-
tion rule allows one to take apart a compound
structure Γ◦B into its immediate substructures.
ΓturnstileleftA/B ∆turnstileleftB
Γ◦∆turnstileleftA /E
Γ◦BturnstileleftA
ΓturnstileleftA/B /I
Notice that the language of fractional types
is essentially higher-order: the denominator of
a fraction does not have to be atomic, but can
itself be a fraction. The Introduction rules are
indispensable if one is interested in capturing
the full set of theorems of the type calculus.
Classical CG (in the style of Ajdukiewicz and
Bar-Hillel) uses only the Elimination rules, and
hence has restricted inferential capacities. It is
impossible in classical CG to obtain the validity
A turnstileleft B/(A\B), for example. Still, the classi-
cal CG perspective will be useful to realize our
aim of automatically inducing type assignments
from structured data obtained from the TUT
corpus thanks to the type resolution algorithm
explained below.
Type inference algorithms for classical CG
have been studied by (Buszkowski and Penn,
1990). The structured data needed by their
type inference algorithms are so-called functor-
argument structures (fa-structures). An fa-
structure for an expression is a binarybranching
tree; the leaf nodes are labeled by lexical expres-
sions (words), the internal nodes by one of the
symbols triangleleftsld (for structures with the functor as
the left daughter) or trianglerightsld (for structures with the
functor as the right daughter).
To assign types to the leaf nodes of an fa-
structure, one proceeds in a top-down fashion.
The type of the root of the structure is fixed (for
example: s). Compound structures are typed as
follows:
- to type a structure Γ triangleleftsld ∆ as A, type Γ as
A/B and ∆ as B;
- to type a structure Γ trianglerightsld ∆ as A, type Γ as
B and ∆ as B\A.
If a word occurs in different structural environ-
ments, the typing algorithm will produce dis-
tinct types. The set of type assignments to a
word can be reduced by factoring: one identi-
fies type assignments that can be unified. For
an example, compare the structured input be-
low:
a. Claudia trianglerightsld parla
b. Claudia trianglerightsld (parla trianglerightsld bene)
Assuming a goal type s, from (a) we obtain the
assignments
Claudia : A,parla : A\s
and from (b)
Claudia : C,parla : B,bene : B\(C\s)
Factoring leads to the identifications A = C,
B = (A\s), producing for “bene” the modifier
type (A\s)\(A\s).
3.3 From TUT dependency structures
to categorial types
To accomplish our aims, we will have an oc-
casion to use two extensions of the basic cate-
gorial machinery outlined in the section above:
a generalization of the type language to multi-
ple modes of composition, and the addition of
structural rules of inference to the logical rules
of slash Elimination and Introduction.
Multimodal composition The intuitions
underlying the distinction between heads and
dependentsin Dependency Grammars (DG) and
between functors and arguments in CG often
coincide, but there are also cases where they
diverge (Venneman, 1977). In the particu-
lar case of the TUT annotation schema, we
see that for all instances of dependents la-
beled as ARG (or one of its sublabels), the
DG head/dependent articulation coincides with
the CG functor/argument asymmetry. But for
DG modifiers, or dependents without thematic
roles of the class AUX (auxiliary)6 there is a
mismatch between dependency structure and
functor-argument structure. Modifiers would be
functors in terms of their categorial type: func-
tors where the numerator and the denominator
are identical. This makes them into ‘identities’
for the fractional multiplication, which explains
their optionality and the possibility of iteration.
AUX elements in DG would count as morpholog-
ical modifiers of the head verbs. From the CG
point of view, they would be typed as functors
6And also COORDINATOR, INTERJECTION.
with non-identical numerator and denomina-
tor, distinguishing them that way from optional
modifiers, and capturing the fact that they are
indispensable to build a complete grammatical
structure.
To reconcile the competing demands of the
head-dependent and functor-argument classifi-
cation, we make use of the type calculus pro-
posed in (Moortgat and Morrill, 1991), which
treats dependency and functor-argument rela-
tions as two orthogonal dimensions of linguistic
organization. Instead of one composition oper-
ation ◦, the system of (Moortgat and Morrill,
1991) has two: ◦l for structures where the left
daughter is the head, and ◦r for right-headed
structures. The two composition operations
each have their slash and backslash operations
for the typing of incomplete expressions:
- A/lB: a functor looking for a B to the right
to form an A; the functor is the head, the
argument the dependent;
- A/rB: a functor looking for a B to theright
to form an A; the argument is the head, the
functor the dependent;
- B\lA: a functor looking for a B to the left
to form an A; the argument is the head,
the functor the dependent;
- B\rA: a functor looking for a B to the left
to form an A; the functor is the head, the
argument the dependent.
The type inference algorithm of (Buszkowski
and Penn, 1990) can be straightforwardly
adapted to the multimodal situation. The inter-
nal nodes of the fa-structures now are labeled
with a fourfold distinction: as before, the tri-
angle points to the functor daughter of a con-
stituent; in the case of the black triangle, the
functor daughter is the head constituent, in the
case of the white triangle, the functor daughter
is the dependent.
ad ah fd fh
fh triangleleftsld
fd triangleleft
ah triangleright
ad trianglerightsld
The type-inference clauses can be adapted ac-
cordingly.
- to type a structure Γ triangleleftsld ∆ as A, type Γ as
A/lB and ∆ as B;
- to type a structure Γ triangleleft ∆ as A, type Γ as
A/rB and ∆ as B.
- to type a structure Γ trianglerightsld ∆ as A, type ∆ as
B\rA and Γ as B;
- to type a structure Γ triangleright ∆ as A, type ∆ as
B\lA and Γ as B.
Structural reasoning The dependency rela-
tions in the TUT corpus abstract from surface
word order. When we induce categorial type
formulas from these dependency relations, as we
will see in Section 4.1, the linear order imposed
by ’/’ and ’\’ in the obtained formulas will not
always be compatible with the observable sur-
face order. Incompatibilities will arise, specif-
ically, in the case of non-projective dependen-
cies. Where such mismatches occur, the induced
types will not be immediately useful for parsing
— the longer term subtask of the project dis-
cussed here.
To address this issue, we can extend the in-
ference rules of our categorial logic with struc-
tural rules. The general pattern of these rules
is: infer Γprime turnstileleft A from Γ turnstileleft A, where Γprime is some
rearrangement of the constituents of Γ. These
rules, in other words, characterize the structural
deformations under which type assignment is
preserved. Structural rules can be employed
in two ways in CTL (see (Moortgat, 2001) for
discussion). In an on-line use, they actually
manipulate structural configurations during the
parsing process. Such on-line use can be very
expensive computationally. Used off-line, they
play a role complementary to the factoring op-
eration, producing a number of derived lexical
type-assignments from some canonical assign-
ment. With the derived assignments, parsing
can then proceed without altering the surface
structure.
As indicated in the introduction, the use of
CTL in the construction of a treebank for a part
of the CILTA corpus belongs to a future phase
of our project. For the purposes of this paper
we must leave the exact nature of the required
structural rules, and the trade-off between off-
line and on-line uses, as a subject for further
research.
4 A distributional study of Italian
part-of-speech tagging
In order to annotate the CORIS corpus with a
theory-neutral set of PoS tags, we plan to carry
out a distributional study of its lexicon.
Early approaches to this problem were based
on the hypothesis that if two words are syn-
tactically and semantically different, they will
appear in different contexts. There are a num-
ber of studies that, starting from this hypoth-
esis, have built automatic or semi-automatic
procedures for clustering words (Brill and Mar-
cus, 1992; Pereira et al., 1993; Martin et al.,
1998), especially in the field of cognitive sci-
ences (Redington et al., 1998; Gobet and Pine,
1997; Clark, 2000). They examine the distribu-
tional behaviour of some target words, compar-
ing the lexical distribution of their respective
collocates using quantitative measures of distri-
butional similarity (Lee, 1999).
In (Brill and Marcus, 1992) it is given a semi-
automatic procedure that, starting from lexical
statistical data collected from a large corpus,
aims to arrange target words in a tree (more
precisely a dendrogram), instead of clustering
them automatically. This procedure requires a
linguistic examination of the resulting tree, in
order to identify the word classes that are most
appropriate to describe the phenomenon under
investigation. In this sense, they use a semi-
automatic word-class generator method.
A similar procedure has been applied on Ital-
ian in (Tamburini et al., 2002). The novelty of
this work is that it derives the distributional in-
formation on words from a very basic set of PoS
tags, namely nouns, verbs and adjectives. This
method, completely avoiding the sparseness of
the data affecting Brill and Marcus’ method,
uses general information about the distribution
of lexical words to study the internal subdivi-
sions of the set of grammatical words, and re-
sults more stable than the method based only
on lexical co-occurrence.
The main drawback of these techniques is the
limited context of analysis. Collecting informa-
tion from a defined context, typically two or
three words will invariably miss syntactic de-
pendencies longer than the context interval. To
overcome this problem we propose to exploit the
expressivity of CTAs (with encoded core depen-
dency relations, as we saw in the section above)
by applying the clustering algorithms on them.
Below we sketch how we intend to induce CTAs
from the TUT dependency treebank, and the
clustering method we plan to implement. The
whole procedure can be summarized by the pic-
ture below.
Treebank conversion−−−−−−−→ CTL structures
↓ type resolution
PoS tagset clustering←−−−−−− Categorial Types
4.1 Inducing categorial types from TUT
The first step is to reduce the distinctions
encoded in the TUT treebank to bare head-
dependent relations: the ARG type on the one
hand, and the MOD and AUX types on the other.
These relations are converted into fa-structures
built by means of the dependency-sensitive op-
erators triangleleftsld, trianglerightsld, triangleleft , triangleright .
By means of example, we consider some sim-
ple sentences exemplifying the different rela-
tions.
Figure 1 shows a head-dependent structure
in which edges represent head-dependent rela-
tions and each edge points to the dependent of
each relation. In this example, each H-D rela-
tion agrees with the F-A relation, i.e. each head
corresponds to a functor and the dependents are
all labeled as arguments (or sub-tags of it).7
Alan
0
Alan
SUBJ
mangia
1
eats
TOP_VERB
la
2
the
OBJ
mela
3
apple
ARG
SUBJ OBJ ARG
Figure 1: ARG: Functor and Head coincide
Figure 2 adds to the example from figure 1
the use of qualifying adjectives, which is an ex-
ample of a modifier, and past tense auxiliaries.
Considering the relation between “mela”(apple)
and “rossa” (red), and between “ha” (has) and
“mangiato” (eaten), we have the dependency
trees in Figure 2.
In the first case, the noun is the head and
the adjective is the dependent, but from the
functor-argument perspective, the adjective (in
general, the modifier) is the incomplete functor
component. A similar discrepancy is observed
for the auxiliary and the main verb, where the
auxiliary should be classified as the incomplete
functor, but as the dependent element with re-
spect to the main verb. In this case the absence
7The example follows TUT practice in designating the
determiner as the head of the noun phrases. We are
aware of the fact that this is far from controversial in
the dependency community. In preprocessing TUT be-
fore type inference, we have the occasion to adjust such
debatable decisions, and representational issues such as
the use of empty categories, for which there is no need
in a CTL framework.
Alan
0
Alan
SUBJ
ha
1
AUX
mangiato
2
ate
TOP_VERB
la
3
the
OBJ
mela
4
apple
ARG
SUBJ AUX OBJ ARG
Figure 2: MOD and AUX: Functors as Dependents
of the auxiliary would result in an ungrammat-
ical sentence. The relations of MOD and AUX ex-
hibit a different behavior than ARG, and hence
are depicted with different arcs.
Our simple example sentences could be con-
verted into the following fa-structures:
- Allen trianglerightsld (mangia triangleleftsld (la triangleleftsld mela)
- Allen trianglerightsld (mangia triangleleftsld (la triangleleftsld (mela triangleright rossa))
- Allen trianglerightsld ((ha triangleleft mangiato) triangleleftsld (la triangleleftsld mela))
The second step is to run the Buszkowski-
Penn type-inference algorithm (in its extended
form, discussed above) on the fa-structures ob-
tained from TUT, and to reduce the lexicon
by factoring (identification of unifiable assign-
ments) and (in a later phase) structural deriv-
ability. Fixing the goal type for these examples
as s, we obtain the following type assignments
from the fa-structures given above:
Allan A
mangia (A\rs)/lB
la B/lC
mela C
rossa C\lC
ha ((A\rs)/lB)/rD
mangiato D
Notice that from the output in our tiny sam-
ple, we have no information allowing us to iden-
tify the argument assignments A and B. No-
tice also that from an fa-structure which takes
together “ha mangiato” in a constituent, we
obtain a type assignment for “mangiato” that
does not express its incompleteness anymore
— instead, the combination with the auxil-
iary expresses this. This is already an exam-
ple where structural reasoning can play a role:
compare the above analysis with the type so-
lution one would obtain by starting from an
fa-structure which takes “mangiato la mela”
as a constituent, which yields a type solution
(A\rs)/rE for the auxiliary, and E/lB for the
head verb. We are currently experimenting with
the effect of different constituent groupings on
the size of the induced type lexicon.
4.2 Clustering Algorithms
Once we have induced the categorial type as-
signments for the TUT lexicon, the last step of
our first task is to divide it into clusters so to
study the distributional behavior of the corre-
sponding lexical entries. The advantage of us-
ing categorial types as objects of the clustering
algorithm is that they represent long distance
dependencies as well as limited distributional
information. Thus the categorial types become
the basic elements of syntactic information as-
sociated with lexical entries and the basic “dis-
tributional fingerprints” used in the clustering
process.
Every clustering process is based on a no-
tion of “distance” between the objects involved
in the process. We should define an appropri-
ate metric among categorial types. We believe
that a crucial role will be played by the depen-
dency relation encoded into the types by means
of compositional modes.
Currently, we are studying the application of
proper distance measures considering types as
trees and adapting the theoretical results on
tree metrics to our problem. The algorithm for
computing the tree-edit distance (Shasha and
Zhang, 1997), designed for generic trees, ap-
pears to be a good candidate for clustering in
categorial-type domain. What remains to be
done is to experiment the algorithm and fine-
tune the metrics to our purpose.
5 Conclusions and Further Research
In this paper we have presented work in progress
devoted to the syntactic annotation of a large
Italian corpus. We have just started working in
this direction and the biggest part of the work
has still to be done. We are currently evaluating
the TUT encoding of dependency information,
and identifying areas that allow optimization
from the point of view of CTL type induction.
A case in point is the heavy reliance of TUT
on empty elements and/or traces, which con-
flicts with our desire for an empirically-based
and theory-neutral representation of linguistic
dependencies. It seems that the trace artifact
can be avoided if one properly exploits the more
expressive category concept of CTL, allowing
product types for asyndetic constructions, and
higher-order typesfor multiple dependencies. In
parallel, we are looking for other sources of de-
pendency information for Italian, in order to
complement the rather small TUT database we
have at our disposal now.
6 Acknowledgments
Our thanks go to FIRB 2001 project
RBNE01H8RS coordinated by prof. R. Rossini
Favretti for the funding supports. Thanks to
L. Surace and C. Seidenari for the detailed
comparison on Italian PoS classifications.

References
C. Bosco. 2003. A grammatical relation system
for treebank annotation. Ph.D. thesis, Com-
puter Science Department, Turin University.
E. Brill and M. Marcus. 1992. Tagging an un-
familiar text with minimal human supervi-
sion. In Proceedings of the Fall Symposium
on Probabilistic Approaches to Natural Lan-
guage, pages 10–16, Cambridge. MA: Ameri-
can Association for Artificial Intelligence.
W. Buszkowski and G. Penn. 1990. Categorial
grammars determined from linguistic data by
unification. Studia Logica, 29:431–454.
A. Clark. 2000. Inducing syntactic categories
by context distribution clustering. In Pro-
ceedings of CoNLL-2000 and LLL-2000 Con-
ference, pages 94–91, Lisbon, Portugal.
F. Gobet and J. Pine. 1997. Modelling the ac-
quisition of syntactic categories. In Proceed-
ings of the 19th Annual Meeting of the Cog-
nitive Science Society, pages 265–270.
L. Lee. 1999. Measures of distributional simi-
larity. In Proceedings of the 37th ACL, pages
25–32, College Park, MD.
S. Martin, J. Liermann, and H. Ney. 1998. Al-
gorithms for bigram and trigram word clus-
tering. Speech Communication, 24:19–37.
M. Monachini. 1995. ELM-IT: An Italian in-
carnation of the EAGLES-TS. definition of
lexicon specification and classification guide-
lines. Technical report, Pisa.
S. Montemagni, F. Barsotti, M. Battista,
N. Calzolari, O. Corazzari, A. Lenci, A. Zam-
polli, F. Fanciulli, M. Massetani, R. Raf-
faelli, R. Basili, M. T. Pazienza, D. Saracino,
F. Zanzotto, N. Mana, F. Pianesi, and R. Del-
monte, 2003. Building and using parsed cor-
pora, chapter Building the Italian Syntactic-
Semantic Treebank, pages 189–210. Lan-
guage and Speech series. Kluwer, Dordrecht.
M. Moortgat and G. Morrill. 1991. Heads
and phrases. Type calculus for dependency
and constituent structure. Technical report,
Utrecht.
M. Moortgat. 1997. Categorial type logics.
In J. van Benthem and A. ter Meulen, edi-
tors, Handbook of Logic and Language, pages
93–178. The MIT Press, Cambridge, Mas-
sachusetts.
Michael Moortgat. 2001. Structural equa-
tions in language learning. In P. de Groote,
G. Morrill, and C. Retor´e, editors, Logical As-
pects of Computational Linguistics, volume
2099 of Lecture Notes in Artificial Intelli-
gence, pages 1–16, Berlin. Springer.
F. Pereira, T. Tishby, and L. Lee. 1993. Dis-
tributional clustering of English words. In
Proceedings of the 31st ACL, pages 183–190,
Columbus, Ohio.
M. Redington, N. Chater, and S. Finch. 1998.
Distributional information: a powerfulcue for
acquiring syntactic categories. Cognitive Sci-
ence, 22(4):425–469.
D. Shasha and D. Zhang. 1997. Approximate
tree pattern matching. In A. Apostolico and
Z. Galig, editors, Pattern matching algo-
rithms. Oxford University Press.
F. Tamburini, C. De Santis, and Zamuner E.
2002. Identifying phrasal connectives in Ital-
ian using quantitative methods. In S. Nuc-
corini, editor, Phrases and Phraseology -Data
and Description. Berlin: Peter Land.
T. Venneman. 1977. Konstituenz und Depen-
denz in einigen neueren Grammatiktheorien.
Sprachwissenschaft, 2:259–301.
