Using ‘smart’ bilingual projection to feature-tag a monolingual dictionary
Katharina Probst
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
kathrin@cs.cmu.edu
Abstract
We describe an approach to tagging a monolin-
gual dictionary with linguistic features. In par-
ticular, we annotate the dictionary entries with
parts of speech, number, and tense information.
The algorithm uses a bilingual corpus as well
as a statistical lexicon to find candidate train-
ing examples for specific feature values (e.g.
plural). Then a similarity measure in the space
defined by the training data serves to define a
classifier for unseen data. We report evaluation
results for a French dictionary, while the ap-
proach is general enough to be applied to any
language pair.
In a further step, we show that the proposed
framework can be used to assign linguistic
roles to extracted morphemes, e.g. noun plu-
ral markers. While the morphemes can be
extracted using any algorithm, we present a
simple algorithm for doing so. The emphasis
hereby is not on the algorithm itself, but on the
power of the framework to assign roles, which
are ultimately indispensable for tasks such as
Machine Translation.
1 Introduction and motivation
The Machine Translation community has recently under-
gone a major shift of focus towards data-driven tech-
niques. Among these techniques, example-based (e.g.
(Brown, 1997)) and statistical (e.g. (Brown et al.,
1990; Brown et al., 1993)) are best known and stud-
ied. They aim at extracting information from bilingual
text and building translation systems automatically. This
empirical approach overcomes the development bottle-
neck that traditional transfer- and interlingua-based ap-
proaches face. What used to take years of human devel-
opment time can now be achieved in a fraction of the time
with similar accuracy. However, in studying such empir-
ical approaches and the output of the resulting systems,
there have been calls for the re-incorporation of more lin-
guistic intuition and/or knowledge. One notable example
in this context is (Yamada and Knight, 2001; Yamada and
Knight, 2002), who introduce syntactic knowledge into
their statistical translation model. Our approach goes in a
similar direction. The AVENUE system (Carbonell et al.,
2002) infers syntactic transfer rules similar to the ones a
human grammar writer would produce. The training data
is bilingual text, and learning is facilitated by the usage
of linguistic information (e.g. parses, feature informa-
tion). We focus primarily on resource-rich/resource-poor
situations, i.e. on language pairs where resources such as a
parser are available for one of the languages, but not for
the other. It is outside the scope of
this paper to describe our rule learning approach. The
interested reader should refer to (Carbonell et al., 2002;
Probst, 2002).
From the brief description above it should become ap-
parent that heavy reliance on feature-tagged dictionaries
and/or parsers becomes a new bottleneck for Machine
Translation development. Our work focuses on target
languages for which a dictionary does exist, but whose
entries are initially not tagged with linguistic feature
values, so that the dictionary is a mere word list (which is
what Example-based Machine Translation and Statistical
Machine Translation systems use most frequently). Hav-
ing the feature values can become very useful in transla-
tion. For example, if the English sentence contains a plu-
ral noun, the system can ensure that this word is translated
into a plural noun in the target language (if the learned
rule requires this).
Despite the importance of the feature tags, we cannot
afford to build such a rich dictionary by hand. Moreover,
we cannot even rely on the availability of experts that can
write morphological rules for a given language. Rather,
we wish to develop an algorithm that is general enough
that it can be applied to any language pair and does not
require knowledge of the target language’s morphology.
In this paper, we explore the following features: parts
of speech (pos), number on nouns, adjectives, and verbs,
and tense on verbs. Furthermore, the process is fully au-
tomatic, thus eliminating the need for human expertise
in a given language. Our main idea is based on using a
bilingual corpus between English and a target language.
In our experiments, we report results for French as the tar-
get language. We annotate the English side of the corpus
with pos tags, using the Brill tagger (Brill, 1995). We
further utilize a statistical bilingual (English→French)
dictionary in order to find candidate translations for
particular English words. Our work falls in line with the
bilingual analysis described in (Yarowsky and Ngai, 2001;
Yarowsky et al., 2001). While we use a different ap-
proach and tackle a different problem, the major reason-
ing steps are the same. (Yarowsky and Ngai, 2001) aim
at pos tagging a target language corpus using English pos
tags as well as estimation of lexical priors (i.e. what pos
tags can a word have with what probability) and a tag
sequence model. The authors further report results on
matching inflected verb forms in the target language with
infinitive verbs, as well as on noun phrase chunking. In
all three cases, the information on the English side is used
to infer linguistic information on the target language side.
Our work follows the same idea.
2 Tagging the target language dictionary
with pos
In a first step, we tag the target language dictionary en-
tries with likely pos information. It is important to note
that this is the first step in the process. The following
steps, aiming at tagging entries with features such as
number, are based on the possible pos assigned to the
French entries.
We would like to emphasize clearly that the goal of our
work is not ‘traditional’ pos tagging. Rather, we would
like to have the target language dictionary tagged with
likely pos tags, possibly more than one per word1.
Having said this, we follow in principle the algorithm
proposed by (Yarowsky and Ngai, 2001) to estimate lex-
ical priors. We first find the most likely corresponding
French word for each English word. Then we project the
English pos onto the French word. While it is clear that
words do not always translate into words of the same pos,
the basic idea is that overall they are likely to transfer into
the same pos most of the time. Using a large corpus will
then give us averaged information on how often a word is
the most likely correspondent of a noun, a verb, etc.
Footnote 1: Each of the pos assignments is also annotated with
a probability. The probabilities are not actually used in the
work described here, but they can be utilized in the rule
learning system.
In this section, we restrict our attention (again, fol-
lowing (Yarowsky and Ngai, 2001)) to five ‘core’ pos, N
(noun), V (verb), J (adjective), R (adverb), and I (prepo-
sition or subordinating conjunction). The algorithm was
further only evaluated on N, V, and J, first because they
are the most likely pos (so more reliable estimates can
be given), and second because the remainder of the paper
only deals with these three pos.
In preparation, we use the Brill tagger (Brill, 1995)
to annotate the English part of the corpus with pos tags.
Suppose we have an aligned bilingual sentence pair $S_E$ – $S_F$.
The algorithm then proceeds as follows: for each English word
$e_i$ in sentence $S_E$ tagged with one of the core tags, look up
all words in the statistical English→French dictionary that are
proposed as translations of it. Then pick as a likely translation
the word $f_i$ with the highest probability in the statistical
dictionary that also occurs in the French sentence $S_F$. We then
simply add up the number of times that a given word corresponds
to an English word of tag $t_j$, denoted by $c(t_j, f_i)$. This
information is used to infer $P(t_j \mid f_i)$:

$$P(t_j \mid f_i) = \frac{c(t_j, f_i)}{\sum_{k} c(t_k, f_i)}$$

where the sum runs over the core pos tags $t_k$.
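The counting procedure above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the lexicon layout (`lexicon[english][french] = probability`), and the assumption that the Brill tags have already been mapped to the five core tags are all hypothetical.

```python
from collections import Counter, defaultdict

CORE_TAGS = {"N", "V", "J", "R", "I"}


def likely_translation(e_word, french_words, lexicon):
    """Return the highest-probability translation of e_word that also
    occurs in the French sentence, or None if there is none."""
    candidates = [
        (prob, f_word)
        for f_word, prob in lexicon.get(e_word, {}).items()
        if f_word in french_words
    ]
    return max(candidates)[1] if candidates else None


def estimate_pos_priors(sentence_pairs, lexicon):
    """sentence_pairs: iterable of (tagged_english, french) where
    tagged_english is a list of (word, core_tag) and french a list of words.
    Returns priors[f_word][tag] as relative frequencies."""
    counts = defaultdict(Counter)  # counts[f_word][tag]
    for tagged_english, french in sentence_pairs:
        french_words = set(french)
        for e_word, tag in tagged_english:
            if tag not in CORE_TAGS:
                continue
            f_word = likely_translation(e_word, french_words, lexicon)
            if f_word is not None:
                counts[f_word][tag] += 1
    return {
        f_word: {tag: c / sum(tag_counts.values()) for tag, c in tag_counts.items()}
        for f_word, tag_counts in counts.items()
    }
```

Each French word ends up with a distribution over the tags of the English words it was paired with, which is the relative-frequency estimate described in the text.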
Given the way our algorithm estimates the probabil-
ities of pos for each French word, it is clear that some
noise is introduced. Therefore, any pos will be assigned
a non-zero probability by the algorithm for each French
word. However, as was noted by (Yarowsky and Ngai,
2001), most words tend to have at most two pos. In an
attempt to balance out the noise introduced by the al-
gorithm itself, we do not want to assign more than two
possible pos to each word. Thus, for each French word
we only retain the two most likely tags and rescale their
probabilities so that they sum to 1. Denote $t_1$ as the most
likely tag for word $f_i$, and $t_2$ as the second most likely
tag for word $f_i$. Then

$$P(t_1 \mid f_i)_{rescaled} = \frac{P(t_1 \mid f_i)}{P(t_1 \mid f_i) + P(t_2 \mid f_i)}$$

and

$$P(t_2 \mid f_i)_{rescaled} = \frac{P(t_2 \mid f_i)}{P(t_1 \mid f_i) + P(t_2 \mid f_i)}.$$
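As a small illustration of the rescaling step, the following Python sketch keeps the two most likely tags of a word and renormalizes them. The function name and the dictionary representation of the tag distribution are assumptions made for this example.

```python
def keep_top_two(tag_probs):
    """Keep the two most likely tags and rescale them to sum to 1.

    tag_probs maps a pos tag to its estimated probability for one word.
    """
    top = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
    total = sum(prob for _, prob in top)
    return {tag: prob / total for tag, prob in top}
```

For a word with estimates N: 0.6, V: 0.3, J: 0.1, the adjective reading is dropped and the remaining two probabilities are divided by 0.9.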
In order to have a target language dictionary tagged
with pos, we use the statistical bilingual dictionary and
extract all French words. If a French word was encoun-
tered during the pos training, it is assigned the one or two
most likely tags (together with the probabilities). Other-
wise, the word remains untagged, but is retained in the
target language dictionary.
In a second round of experiments, we slightly altered our
algorithm. Instead of only extracting the most likely French
correspondent, we take into account the correspondence
probability as assigned by the statistical bilingual
dictionary. Instead of simply counting how many times a French
word corresponds to, say, a noun, the counts are weighted by
the probability of each of the correspondences. The remainder
of the algorithm proceeds as before.

                      Nouns    Verbs    Adjectives
No Probabilities      79.448   81.265   70.707
With Probabilities    78.272   80.279   71.809

Table 1: Accuracies (in percent) of pos estimation for nouns,
verbs, and adjectives, evaluated on 2500 French dictionary
entries.
We tested the algorithm on parts of the Hansard data,
200000 sentence pairs between English and French. The
evaluation was done on 2500 French lexicon entries,
which were hand-tagged with pos. For each automatically
assigned pos, we check whether this
assignment was also given in the hand-developed partial
dictionary. Partial matches are allowed, so that if the al-
gorithm assigned one correct and one incorrect pos, the
correctly assigned label is taken into account in the ac-
curacy measure. Table 1 shows the results of the tagging
estimates. It can be seen that due to the relative rarity of
adjectives, the estimates are less reliable than for nouns
and verbs. Further, the results show that incorporating the
probabilities from the bilingual lexicon does not result in
consistent estimation improvement. A possible explana-
tion is that most of the French words that are picked as
likely translations are highly ranked and correspond to
the given English word with similar probabilities.
(Yarowsky and Ngai, 2001) propose the same algo-
rithm as the one proposed here for their estimation of
lexical priors, with the exception that they use automatic
word alignments rather than our extraction algorithm for
finding corresponding words. Since estimating lexical priors
is merely an intermediate step for (Yarowsky and Ngai, 2001),
they do not report evaluation results for this
step. Further experiments should show what impact the
usage of automatic alignments has on the performance of
the estimation algorithm.
3 A feature value classifier
In this section, we describe the general algorithm of train-
ing a classifier that assigns a feature value to each word
of a specific core pos. The following sections will detail
how this algorithm is applied to different pos and differ-
ent features. The algorithm is general enough so that it
can be applied to various pos/feature combinations. The
extraction of training examples is the only part of the pro-
cess that changes when applying the algorithm to a differ-
ent pos and/or feature.
3.1 Extraction of training data
Although the following sections will describe in greater
detail how training data is obtained for each pos/feature
combination, the basic approach is outlined here. As in
the previous section, we use the sentence-aligned bilin-
gual corpus in conjunction with the statistical bilingual
dictionary to extract words that are likely to exhibit a fea-
ture. In the previous section, this feature was a particular
pos tag. Here, we focus on other features, such as plural.
For instance, when looking for plural nouns, we extract
plural nouns from the English sentences (they are tagged
as such by the Brill tagger, using the tag ‘NNS’). We then
extract the French word in the corresponding sentence
that has the highest correspondence probability with the
English word according to the statistical bilingual dictio-
nary. This process again ensures that most (or at least
a significant portion) of the extracted French words exhibit
the feature in question. In principle, the purpose
of the classifier training is then to determine what all (or
most) of the extracted words have in common and what
sets them apart.
3.1.1 Tagging of tense on verbs
The first feature we wish to add to the target language
lexicon is tense on verbs. More specifically, we restrict
our attention to PAST vs. NON-PAST. This is a pragmatic
decision: the tagged lexicon is to be used in the context
of Machine Translation, and the most common two tenses
that Machine Translation systems encounter are past and
present. In the future, we may investigate a richer tense
set.
In order to tag tense on verbs, we proceed in principle
as was described before when estimating lexical priors.
We consider each word $e_i$ in the English corpus that is
tagged as a past tense verb. Then we find the likely
correspondence on the French side, $f_i$, by considering the
list of French words that correspond to the given English
word, starting from the pair with the highest correspondence
probability (as obtained from the bilingual lexicon). The first
French word from the top of the list that also occurs in the
French sentence is extracted and added to the training set:

$$f_i = \arg\max_{f_k,\ 1 \le k \le m} P(f_k \mid e_i),$$

where $m$ is the number of French words in the lexicon and the
maximum is taken only over words $f_k$ that occur in the French
sentence.
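The extraction of past-tense training words can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the English side carries Penn Treebank tags as produced by the Brill tagger, past tense verbs are identified by the tag `VBD`, and the lexicon layout (`lexicon[english][french] = probability`) and function name are hypothetical.

```python
def extract_training_words(sentence_pairs, lexicon, english_tags=frozenset({"VBD"})):
    """Collect likely French correspondents of English words carrying one
    of the given tags.

    sentence_pairs: iterable of (tagged_english, french), where
    tagged_english is a list of (word, tag) and french a list of words.
    """
    training = []
    for tagged_english, french in sentence_pairs:
        french_words = set(french)
        for e_word, tag in tagged_english:
            if tag not in english_tags:
                continue
            # Walk down the translation list from the most probable entry
            # and take the first one that occurs in the French sentence.
            ranked = sorted(
                lexicon.get(e_word, {}).items(), key=lambda kv: kv[1], reverse=True
            )
            for f_word, _ in ranked:
                if f_word in french_words:
                    training.append(f_word)
                    break
    return training
```

Passing a different tag set (for instance `{"NNS"}` for plural nouns) reuses the same routine for other pos/feature combinations.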
3.1.2 Tagging of number on nouns, adjectives, and
verbs
Further, we tag nouns with number information.
Again, we restrict our attention to two possible values:
SINGULAR vs. PLURAL. Not only does this make sense
from a pragmatic standpoint (i.e. if the Machine Trans-
lation system can correctly determine whether a word
should be singular or plural, much is gained); it also al-
lows us to train a binary classifier, thus simplifying the
problem.
The extraction of candidate French plural nouns is
done as expected: we find the likely French correspon-
dent of each English plural noun (as specified by the En-
glish pos-tagger), and add the French words to the train-
ing set.
However, when tagging number on adjectives and
verbs, things are less straightforward, as these features
are not marked in English and thus the information cannot
be obtained from the English pos-tagger. In the case of
verbs, we look for the first noun left of the candidate verb.
More specifically, we consider an English verb from the
corpus only if the closest noun to the left is tagged for plu-
ral. This makes intuitive sense linguistically, as in many
cases the verb will follow the subject of a sentence.
For adjectives, we apply a similar strategy. As most ad-
jectives (in English) appear directly before the noun that
they modify, we consider an adjective only if the closest
noun to the right is in the plural. If this is the case, we
extract the likely French correspondent of the adjective as
before.
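The two heuristics for verbs and adjectives can be sketched as below. This is only an illustration: it assumes Penn Treebank tags on the English side, treats `NN`, `NNS`, `NNP`, and `NNPS` as nouns and only `NNS` as "tagged for plural", and uses hypothetical function names.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
PLURAL_TAG = "NNS"


def nearest_noun_tag(tags, start, step):
    """Tag of the closest noun to the left (step=-1) or right (step=+1)."""
    i = start + step
    while 0 <= i < len(tags):
        if tags[i] in NOUN_TAGS:
            return tags[i]
        i += step
    return None


def plural_candidates(tagged_english):
    """Return (word, core_pos) pairs considered for plural training data."""
    tags = [tag for _, tag in tagged_english]
    selected = []
    for i, (word, tag) in enumerate(tagged_english):
        if tag.startswith("VB") and nearest_noun_tag(tags, i, -1) == PLURAL_TAG:
            selected.append((word, "V"))  # verb whose closest left noun is plural
        elif tag.startswith("JJ") and nearest_noun_tag(tags, i, +1) == PLURAL_TAG:
            selected.append((word, "J"))  # adjective whose closest right noun is plural
    return selected
```

The likely French correspondent of each selected English word would then be extracted in the same way as for nouns.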
3.2 Projection into a similarity space of characters
The extracted words are then re-represented in a space that is
similar in concept to a vector space. This process is done as
follows: Let

$$F_{pos,feature} = \{f_1, f_2, \ldots, f_n\}$$

denote the set of French words that have been extracted as
training data for a particular pos/feature combination. For
notational convenience, we will usually refer to
$F_{pos,feature}$ as $F$ in the remainder of the paper. The
reader is however reminded that each $F$ is associated with a
particular pos/feature combination. Let

$$maxlength = \max_{f_i \in F} |f_i|$$

denote the length of the longest word in $F_{pos,feature}$.
Then we project all words in this set into a space of
$maxlength$ dimensions, where each character index represents a
dimension. This implies that for the longest word (or all words
of length $maxlength$), each character is one dimension. For
shorter words, the projection will contain empty dimensions.
Our idea is based on the fact that in many languages, the most
common morphemes are either prefixes or suffixes. We are
interested in comparing what most words in $F$ begin or end in,
rather than emphasizing the root part, which tends to occur
inside the word. Therefore, we simply assign an empty value
('-') to those dimensions for short words that are in the
middle of the word. A word $f_i$ such that $|f_i| < maxlength$
is split in the middle and its characters are assigned to the
dimensions of the current space from both ends. In case
$|f_i| = 2k + 1, k \in \mathbb{N}$, we double the character at
position $\lceil |f_i| / 2 \rceil$, so that it can potentially
be part of a suffix or a prefix.

For example, if
$F_{pos,feature} = \{droits, ils, femmes, \ldots, orateurs\}$,
then the corresponding space will be represented as follows:
d r o - - i t s
i l - - - - l s
f e m - - m e s
...
o r a t e u r s
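The projection can be sketched in a few lines of Python. The function names are assumptions made for this illustration, and the sketch only covers words that are no longer than the longest training word.

```python
def project(word, maxlength, empty="-"):
    """Project a word into a space of maxlength character dimensions.

    Characters are assigned from both ends and the middle is padded with
    the empty value. For odd-length words shorter than maxlength, the
    middle character is doubled so it appears in both halves.
    """
    n = len(word)
    if n > maxlength:
        raise ValueError("word is longer than the space")
    if n == maxlength:
        return word
    left = word[: (n + 1) // 2]   # includes the middle character if n is odd
    right = word[n // 2:]         # also includes it, which doubles it
    return left + empty * (maxlength - len(left) - len(right)) + right


def build_space(words):
    """Project every training word into the space of the longest word."""
    maxlength = max(len(w) for w in words)
    return [project(w, maxlength) for w in words]
```

Applied to droits, ils, femmes, and orateurs, this yields the rows shown in the example above (without the spaces between characters).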
3.3 Similarity measure
In order to determine what the common feature between most of
the words in $F_{pos,feature}$ is, we define a similarity
measure between any two words as represented in the space.
We want our similarity measure to have certain prop-
erties. For instance, we want to ‘reward’ (consider as in-
creasing similarity) if two words have the same character
in a dimension. By the same token, a different charac-
ter should decrease similarity. Further, the empty char-
acter should not have any effect, even if both words have
the empty character in the same dimension. Counting a shared
empty character as a match would simply make short words look
similar, which is clearly not a desired effect.
We therefore define our similarity measure as a measure related
to the inner product of two vectors
$\langle x, y \rangle = \sum_{d=1}^{k} x_d y_d$, where $k$ is the
number of dimensions. Note however two differences: first, the
product $x_d y_d$ is dependent on the specific vector pair. It
is defined as

$$x_d y_d = \begin{cases} 1, & x_d = y_d,\ x_d \neq \text{'-'} \\ 0, & \text{otherwise} \end{cases}$$

Second, we must normalize the measure by the number of
dimensions. This will become important later in the process,
when certain dimensions are ignored and we do not always
compute the similarity over the same number of dimensions. The
similarity measure then looks as follows:

$$sim(x, y) = \frac{\sum_{d=1}^{k} x_d y_d}{k}.$$
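A direct Python transcription of the similarity measure is given below as a sketch. It assumes both words have already been projected into the same space (equal-length strings with '-' as the empty value); the `skip` parameter, a hypothetical name, stands for dimensions that are ignored, so that the normalization uses only the dimensions actually compared.

```python
def sim(x, y, skip=frozenset(), empty="-"):
    """Normalized count of dimensions in which x and y share a
    non-empty character. Dimensions listed in skip are ignored."""
    dims = [d for d in range(len(x)) if d not in skip]
    matches = sum(1 for d in dims if x[d] == y[d] and x[d] != empty)
    return matches / len(dims)
```

For the projected words "dro--its" and "fem--mes", only the final "s" matches, so the similarity over all eight dimensions is 1/8; the shared empty positions contribute nothing.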
Note that when all dimensions are considered, $k$ will
correspond to $maxlength$. The similarity measure is computed
for each pair of words $f_i, f_j \in F_{pos,feature}$,
$i \neq j$. Then the average is computed. This number can be
regarded as a measure of the incoherence of the space:

$$incoh_F = \frac{\sum_{f_i, f_j \in F,\ i \neq j} sim(f_i, f_j)}{\frac{1}{2} n (n - 1)}$$

where each unordered pair of words is counted once.
Although it seems counterintuitive to define an
incoherence measure as opposed to a coherence measure,
calling the measure an incoherence measure fits with the
intuition that low incoherence corresponds to a coherent
space.
4 Run-time classification
4.1 Perturbing and unifying dimensions
The next step in the algorithm is to determine what influ-
ence the various dimensions have on the coherence of the
space. For each dimension, we determine its impact: does
it increase or decrease the coherence of the space. To this
end, we compute the incoherence of the space with one
dimension blocked out at a time. We denote this new in-
coherence measure as before, but with an additional sub-
script to indicate which dimension was blocked out, i.e.
disregarded in the computation of the incoherence. Thus,
for each dimension $i$, $1 \le i \le maxlength$, we obtain a
new measure $incoh_{F,i}$. Two things should be noted: first,
$incoh_{F,i}$ measures the incoherence of the space without
dimension $i$. Further, the normalization of the similarity
metric becomes important now, if we want to be able to
compare the incoherence measures.
In essence, the impact of a dimension is perturbing if
disregarding it increases the incoherence of the space.
Similarly, it is unifying if its deletion decreases the in-
coherence of the space. The impact of a dimension is
measured as follows:
$$imp_{F,i} = \frac{incoh_{F,i} - incoh_F}{incoh_F}$$
We then conjecture that those dimensions whose im-
pact is positive (i.e. disregarding it results in an increased
incoherence score) are somehow involved in marking the
feature in question. These dimensions, together with their
impact scores $imp_{F,i}$, are retained in a set

$$D_F = \{ i \mid 1 \le i \le maxlength,\ imp_{F,i} > 0 \}.$$

The set $D_F$ is used for classification as described in
the following section.
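The computation of the impact scores can be sketched as follows. This Python sketch reads the definitions literally: the incoherence measure is taken to be the average pairwise similarity of the projected training words, and the impact of a dimension is the relative change of that measure when the dimension is blocked out. All function names are assumptions, dimensions are indexed from 0, and the sketch assumes at least two training words and a non-zero incoherence over all dimensions.

```python
from itertools import combinations


def sim(x, y, skip=frozenset(), empty="-"):
    """Normalized count of shared non-empty characters, ignoring skip."""
    dims = [d for d in range(len(x)) if d not in skip]
    return sum(1 for d in dims if x[d] == y[d] and x[d] != empty) / len(dims)


def incoherence(space, skip=frozenset()):
    """Average similarity over all unordered pairs of projected words."""
    pairs = list(combinations(space, 2))
    return sum(sim(x, y, skip) for x, y in pairs) / len(pairs)


def impacts(space):
    """Relative change of the measure when each dimension is blocked out."""
    base = incoherence(space)
    return {
        d: (incoherence(space, frozenset({d})) - base) / base
        for d in range(len(space[0]))
    }


def retained_dimensions(space):
    """Dimensions with positive impact, together with their scores."""
    return {d: imp for d, imp in impacts(space).items() if imp > 0}
```

Because every similarity is normalized by the number of dimensions actually compared, the scores obtained with one dimension blocked out remain comparable to the score over all dimensions.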
4.2 Classification of French dictionary entries
From the start we have aimed at tagging the target lan-
guage dictionary with feature values. Therefore, it is
clearly not enough to determine which dimensions in the
space carry information about a given feature. Rather,
we use this information to classify words from the target
language dictionary.
To this end, all those words in the target language dic-
tionary that are of the pos in question are classified us-
ing the extracted information (the reader is reminded that
the system learns a classifier for a particular pos/feature
combination). For a given word $f_{test}$, we first project
the word into the space defined by the training set. Note
that it can happen that $|f_{test}| > maxlength$, i.e. that
$f_{test}$ is longer than any word encountered during
training. In this case, we delete enough characters from the
middle of the word to fit it into the space defined by the
training set. Again, this is guided by the intuition that
often morphemes are marked at the beginning and/or the
end of words. While the deletion of characters (and thus
elimination of information) is theoretically a suboptimal
procedure, it has a negligible effect at run-time.
After we project $f_{test}$ into the space, we compute the
incoherence of the combined space defined by the set denoted by
$\{F, f_{test}\} = F \cup f_{test}$ as follows, where the
similarity is computed as above and $n$ again denotes the size
of the set $F$:

$$incoh_{F \cup f_{test}} = \frac{\sum_{f_i \in F} sim(f_{test}, f_i)}{n}$$

In words, the test word $f_{test}$ is compared to each word
$f_i$ in the set $F$.

In the following, all dimensions $i \in D_F$ are blocked out in
turn, and $incoh_{F \cup f_{test}, i}$ is computed, i.e. the
incoherence of the set $\{F, f_{test}\}$ with one of the
dimensions blocked out. As before, the impact of a dimension is
defined by

$$imp_{F \cup f_{test}, i} = \frac{incoh_{F \cup f_{test}, i} - incoh_{F \cup f_{test}}}{incoh_{F \cup f_{test}}}$$
Finally, the word $f_{test}$ is classified as 'true' (i.e. as
exhibiting the feature) if blocking out the dimensions in $D_F$
has a greater total impact on incoherence than it had when the
incoherence measures were computed on the training set alone.
Thus, the final decision rule is:

$$f_{test} = \begin{cases} true, & \sum_{i \in D_F} imp_{F \cup f_{test}, i} > \sum_{i \in D_F} imp_{F,i} \\ false, & \text{otherwise} \end{cases}$$

In practice, this decision rule has the following impact: If,
for instance, we wish to tag nouns with plural information, a
word $f_{test}$ will be tagged with plural if classified as
true, with singular if classified as false.
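The decision rule can be sketched in Python as below. This is an illustration under several assumptions: the training words and the test word are already projected into the same space, `training_impacts` (a hypothetical name) maps each retained dimension to its impact on the training set, dimensions are indexed from 0, and a test word that shares no character with any training word is simply classified as false, a case the text does not discuss.

```python
def sim(x, y, skip=frozenset(), empty="-"):
    """Normalized count of shared non-empty characters, ignoring skip."""
    dims = [d for d in range(len(x)) if d not in skip]
    return sum(1 for d in dims if x[d] == y[d] and x[d] != empty) / len(dims)


def incoh_with_test(space, test, skip=frozenset()):
    """Average similarity between the test word and each training word."""
    return sum(sim(test, x, skip) for x in space) / len(space)


def classify(space, test, training_impacts):
    """Return True if the summed impact of the retained dimensions on the
    combined space exceeds their summed impact on the training set."""
    base = incoh_with_test(space, test)
    if base == 0:
        return False  # assumption: no overlap with the training data at all
    test_total = sum(
        (incoh_with_test(space, test, frozenset({d})) - base) / base
        for d in training_impacts
    )
    return test_total > sum(training_impacts.values())
```

A word classified as true would then receive the feature value the classifier was trained for (for example plural), and the complementary value otherwise.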
5 Experimental results
                       No Probs   With Probs
N: Pl vs. Sg            95.584     95.268
J: Pl vs. Sg            97.143     97.037
V: Pl vs. Sg            85.075     85.019
V: Past vs. Non-Past    72.832     73.043

Table 2: Accuracies (in percent) of tagging nouns, adjectives,
and verbs with plural or singular, and tagging verbs with past
vs. non-past, based on two dictionaries that were tagged with
pos automatically, one of which used the probabilities of the
translation dictionary for pos estimation.

As with pos estimation, we tested the feature tagging
algorithms on parts of the Hansards, namely on 200000
sentence pairs English-French. Accuracies were obtained
from 2500 French dictionary entries that were not only
hand-tagged with pos, but also with tense and number as
appropriate. Table 2 summarizes the results. As men-
tioned above, we tag nouns, adjectives, and verbs with
PLURAL vs. SINGULAR values, and additionally verbs
with PAST vs. NON-PAST information. In order to ab-
stract away from pos tagging errors, the algorithm is only
evaluated on those words that were assigned the appro-
priate pos for a given word. In other words, if the test set
contains a singular noun, it is looked up in the automati-
cally produced target language dictionary. If this dictio-
nary contains the word as an entry tagged as a noun, the
number assignment to this noun is checked. If the clas-
sification algorithm assigned singular as the number fea-
ture, the algorithm classified the word successfully, other-
wise not. This way, we can disregard pos tagging errors.
When estimating pos tags, we produced two separate
target language dictionaries, one where the correspondence
probabilities in the bilingual English→French dictionary were
ignored, and one where they were used to weight the
correspondences. Here, we report results for both of those
dictionaries. Note that the only impact of a different
dictionary (automatically tagged with pos
tags) is that the test set is slightly different, given our eval-
uation method as described in the previous paragraph.
The fact that evaluating on a different dictionary has no
consistent impact on the results only shows that the algo-
rithm is robust on different test sets.
The overall results are encouraging. It can be seen that
the algorithm very successfully tags nouns and adjectives
for plural versus singular. In contrast, tagging verbs is
somewhat less reliable. This can be explained by the
fact that French tags number in verbs differently in dif-
ferent tenses. In other words, the algorithm is faced with
more inflectional paradigms, which are harder to learn
because the data is fragmented into different patterns of
plural markings.
A similar argument explains the lower results for past
versus non-past marking. French has several forms of
past, each with different inflectional paradigms. Further,
different groups of verbs inflect for tense differently,
fragmenting the data further.
6 Morpheme role assignment
While in this work we use the defined space merely for
classification, our approach can also be used for assigning
roles to morphemes. Various morpheme extraction algo-
rithms can be applied to the data. However, the main ad-
vantage of our framework is that it presents the morphol-
ogy algorithm of choice with a training set for particular
linguistic features. This means that whatever morphemes
are extracted, they can immediately be assigned their lin-
guistic roles, such as number or tense. Work on morphology
learning generally pays little attention to role assignment or
excludes it entirely. While mere morpheme extraction is useful
and sufficient for certain tasks (such as root finding and
stemming), for Machine Translation and other tasks in-
volving deeper syntactic analysis it is not enough to find
the morphemes, unless they are also assigned roles. If,
for instance, we are to translate a word for which there is
no entry in the bilingual dictionary, but by stripping off
the plural morpheme, we can find a corresponding (sin-
gular) word in the other language, we can ensure that the
target language word is turned into the plural by adding
the appropriate plural morpheme.
In this section, we present one possible algorithm for
extracting morphemes in our framework. Other, more so-
phisticated, unsupervised morphology algorithms, such
as (Goldsmith, 1995), are available and can be applied
here. Staying within our framework ensures the addi-
tional benefit of immediate role assignment.
Another strength of our approach is that we make no
assumption about the contiguity of the morphemes. Re-
lated work on morphology generally makes this assump-
tion (e.g. (Goldsmith, 1995)), with the notable exception
of (Schone and Jurafsky, 2001). While in the current ex-
periments the non-contiguity possibility is not reflected
in the results, it can become important when applying the
algorithm to other languages such as German.
We begin by conjecturing that most morphemes will
not be longer than four characters, and learn only pat-
terns up to that length. Our algorithm starts by extract-
ing all patterns in the training data of up to four charac-
ters, restricting its attention, however, to the dimensions
in $D_F$. If $D_F$ contains more than 4 dimensions, the
algorithm works only with those 4 dimensions
that had the greatest impact. All 1, 2, 3, and 4 character
combinations that occur in the training data are listed to-
gether with how often they occur. The reasoning behind
this is that those patterns that occur most frequently in
the training data are likely those ‘responsible’ for mark-
ing the given feature.
However, it is not straightforward to determine auto-
matically how long a morpheme is. For instance, consider
the English morpheme ‘-ing’ (the gerund morpheme).
The algorithm will extract the patterns ‘i__’, ‘_n_’, ‘__g’,
‘in_’, ‘i_g’, ‘_ng’, and ‘ing’, where ‘_’ marks an unfilled
position. If we based the morpheme
extraction merely on the frequency of the patterns, the
algorithm would surely extract one of the single letter
patterns, since they are guaranteed to occur at least as
many times as the longer patterns. More likely, they will
occur more frequently. To overcome this difficulty, we
apply a subsumption filter: if a shorter pattern is
subsumed by a longer one, we no longer consider it.
Subsumption is defined as follows: suppose pattern $p_i$
appears with frequency $f_{p_i}$, whereas pattern $p_j$
appears with frequency $f_{p_j}$, and that $p_i$ is shorter
than $p_j$. Then $p_i$ is subsumed by $p_j$ if
$$\frac{f_{p_j}}{f_{p_i}} \geq \theta,$$
where $\theta$ is a fixed threshold.
The algorithm repeatedly checks for subsumption until
no more subsumptions are found, at which point the re-
maining patterns are sorted by frequency. It then outputs
the most frequent patterns. The cutoff value (i.e. how far
down the list to go) is a tunable parameter. In our
experiments, we set the cutoff to a probability of 0.05,
where frequencies are converted to probabilities by
dividing each count by the sum of all patterns’ frequencies.
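The filtering and ranking steps can be sketched as follows. Several details are assumptions made for illustration only: patterns are equal-length strings in which ‘_’ marks an unfilled position, a longer pattern can subsume a shorter one only if it agrees with it on every filled position, the threshold value 0.5 in the example is arbitrary, and probabilities are computed over the patterns that survive the filter. The function names are hypothetical.

```python
BLANK = "_"  # assumption: marks an unfilled position in a pattern


def filled(pattern):
    """Number of positions in the pattern that carry a character."""
    return sum(ch != BLANK for ch in pattern)


def covers(longer, shorter):
    """True if `longer` fills more positions and agrees with `shorter` on its own."""
    return (
        len(longer) == len(shorter)
        and filled(longer) > filled(shorter)
        and all(s == BLANK or s == l for s, l in zip(shorter, longer))
    )


def subsumption_filter(counts, threshold):
    """Drop shorter patterns whose frequency is largely explained by a longer one."""
    remaining = dict(counts)
    changed = True
    while changed:  # repeat until no more subsumptions are found
        subsumed = {
            s
            for s in remaining
            for l in remaining
            if covers(l, s) and remaining[l] / remaining[s] >= threshold
        }
        changed = bool(subsumed)
        for s in subsumed:
            del remaining[s]
    return remaining


def rank(remaining, cutoff=0.05):
    """Sort surviving patterns by frequency and keep those at or above the cutoff."""
    total = sum(remaining.values())
    ranked = sorted(remaining.items(), key=lambda kv: kv[1], reverse=True)
    return [(p, c / total) for p, c in ranked if c / total >= cutoff]


toy_counts = {"i__": 10, "_n_": 9, "__g": 8, "in_": 6, "i_g": 6,
              "_ng": 7, "ing": 6, "__s": 5, "e__": 1}
survivors = subsumption_filter(toy_counts, threshold=0.5)  # threshold is an assumption
for pattern, prob in rank(survivors, cutoff=0.05):
    print(pattern, round(prob, 3))
```

In this toy example all single- and two-letter fragments of ‘ing’ are removed because the full pattern accounts for at least half of their occurrences, while ‘__s’ and ‘e__’ survive because no longer pattern covers them.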
The patterns are listed simply as arrays of four characters
(or fewer if fewer dimensions were selected). It should
be noted that the characters are listed in the order of the
dimensions. This, however, does not mean that the pat-
terns have to be contiguous. For instance, if dimension 1
has a unifying effect, and so do dimensions 14, 15, and
16, the patterns are listed as 4-character combinations in
increasing order of the dimensions.
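A small sketch makes the listing convention concrete. It assumes a pattern is stored as a mapping from dimension index to character (zero-based here, whereas the text counts dimensions from 1). The German participle example and the function names are illustrative assumptions and are not taken from our experiments.

```python
def list_pattern(assignments):
    """Return the pattern's dimensions and characters in increasing dimension order."""
    dims = sorted(assignments)
    return dims, "".join(assignments[d] for d in dims)


def matches(vector, assignments):
    """True if the word vector carries the pattern's character in every listed dimension."""
    return all(
        d < len(vector) and vector[d] == ch for d, ch in assignments.items()
    )


# A non-contiguous pattern: first two and last dimension of a 6-character vector.
participle = {5: "t", 0: "g", 1: "e"}
print(list_pattern(participle))
print(matches("gesagt", participle), matches("sagten", participle))
```

The listed string ‘get’ reads as if it were contiguous, but the accompanying dimension indices show that its characters come from the two ends of the word.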
For illustration purposes, Table 3 lists several patterns
that were extracted for past tense marking on verbs.² All
highly-ranked extracted patterns contained only letters in
the last two dimensions, so that only those two dimen-
sions are shown in the table.
Further investigation and the development of a more
sophisticated algorithm for extracting patterns should
enable us to collapse some of the patterns into one. For
instance, the patterns ‘ée’ and ‘és’ should be considered
special cases of ‘é_’ (é followed by an unfilled position).
Note further that the algorithm extracted the pattern ‘_s’,
because many verbs in the French passé composé were
marked for plural. To overcome this difficulty, a more
complex morphology algorithm should combine findings
from different pos/feature combinations. This has been
left for future investigation.
² Note that no morphemes for the imparfait were extracted.
This is an artifact of the training data, which contains very
few instances of the imparfait.
dimension N-1    dimension N
é                e
é                s
é                _
_                é
_                s
Table 3: Sample morpheme patterns extracted for past
tense markers on verbs (‘_’ marks an unfilled position).
For this run, N = 16. Only the last two dimensions are
shown. No extracted pattern involved any of the other
dimensions.
7 Discussion and conclusion
We have presented an approach to tagging a monolingual
dictionary with linguistic features such as pos, number,
and tense. We use a bilingual corpus and the English pos
tags to extract information that can be used to infer the
feature values for the target language.
We have further argued that our approach can be used
to infer the morphemes that mark the linguistic features
in question and to assign the morphemes linguistic mean-
ing. While various frameworks for unsupervised mor-
pheme extraction have been proposed, many of them
more sophisticated than ours, the main advantage of this
approach is that the annotation of morphemes with their
meaning is immediate. We believe that this is an important
contribution, as role assignment becomes indispensable
for tasks such as Machine Translation.
One area of future investigation is the improvement
of the classification algorithm. We have only presented
one approach to classification. In order to apply estab-
lished algorithms such as Support Vector Machines, we
will have to adapt our algorithm to extract a set of likely
positive examples as well as a set of likely negative ex-
amples. This will be the next step in our process, so that
we can determine the performance of our system when
using various well-studied classification methods.
This paper represents our first steps in bilingual feature
annotation. In the future, we will investigate tagging tar-
get language words with gender and case. This informa-
tion is not available in English, so it will be a more chal-
lenging problem. The extracted training data will have
to be fragmented based on what has already been learned
about other features.
We believe that our approach can be useful for any ap-
plication that can gain from linguistic information in the
form of feature tags. For instance, our system (Carbonell
et al., 2002) infers syntactic transfer rules, but it relies
heavily on the existence of a fully-inflected, tagged target
language dictionary. With the help of the work described
here we can obtain such a dictionary for any language
for which we have a bilingual, sentence-aligned corpus.
Other approaches to Machine Translation as well as ap-
plications like shallow parsing could also benefit from
this work.

References
Eric Brill. 1995. Transformation-based error-driven
learning and natural language processing: A case
study in part-of-speech tagging. Computational
Linguistics, 21(4):543-565.
Peter Brown, J. Cocke, S.D. Pietra, V.D. Pietra, F. Jelinek,
J. Lafferty, R. Mercer, and P. Roossin. 1990. A statistical
approach to Machine Translation. Computational
Linguistics, 16(2):79-85.
Peter Brown, S.D. Pietra, V.D. Pietra, and R. Mer-
cer. 1993. The mathematics of statistical Machine
Translation: Parameter estimation. Computational
Linguistics, 19(2):263-311.
Ralf Brown. 1997. Automated Dictionary Extraction for
‘Knowledge-Free’ Example-Based Translation. Pro-
ceedings of the 7th International Conference on Theo-
retical and Methodological Issues in Machine Transla-
tion (TMI-97), pp. 111-118.
Jaime Carbonell, Katharina Probst, Erik Peterson, Chris-
tian Monson, Alon Lavie, Ralf Brown, and Lori Levin.
2002. Automatic Rule Learning for Resource-Limited
MT. Proceedings of the 5th Biennial Conference of the
Association for Machine Translation in the Americas
(AMTA-02).
John Goldsmith. 1995. Unsupervised Learning of the
Morphology of a Natural Language. Computational
Linguistics, 27(2):153-198.
Katharina Probst. 2002. Semi-Automatic Learning of Transfer
Rules for Machine Translation of Low-Density Lan-
guages. Proceedings of the Student Session at the 14th
European Summer School in Logic, Language and In-
formation (ESSLLI-02).
Patrick Schone and Daniel Jurafsky. 2001. Knowledge-
Free Induction of Inflectional Morphologies. Proceed-
ings of the Second Meeting of the North American
Chapter of the Association for Computational Linguis-
tics (NAACL-01).
Kenji Yamada and Kevin Knight. 2002. A Decoder for
Syntax-Based Statistical MT. Proceedings of the 40th
Anniversary Meeting of the Association for Computa-
tional Linguistics (ACL-02).
Kenji Yamada and Kevin Knight. 2001. A Syntax-Based
Statistical Translation Model. Proceedings of the 39th
Annual Meeting of the Association for Computational
Linguistics (ACL-01).
David Yarowsky and Grace Ngai. 2001. Inducing Mul-
tilingual POS Taggers and NP Bracketers via Robust
Projection Across Aligned Corpora. Proceedings
of the Second Meeting of the North American Chapter
of the Association for Computational Linguistics
(NAACL-01), pp. 200-207.
David Yarowsky, Grace Ngai, and Richard Wicentowski.
2001. Inducing Multilingual Text Analysis Tools via
Robust Projection across Aligned Corpora. Proceed-
ings of the First International Conference on Human
Language Technology Research (HLT-01).
