Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 20–27,
Sydney, July 2006. ©2006 Association for Computational Linguistics
Using Information about Multi-word Expressions
for the Word-Alignment Task
Sriram Venkatapathy1
Language Technologies Research Center,
Indian Institute of
Information Technology,
Hyderabad, India.
sriramv@linc.cis.upenn.edu
Aravind K. Joshi
Department of Computer and
Information Science and Institute for
Research in Cognitive Science,
University of Pennsylvania, PA, USA.
joshi@linc.cis.upenn.edu
Abstract
It is well known that multi-word expressions are problematic in natural
language processing. Previous literature has suggested that information
about their degree of compositionality can be helpful in various
applications, but this has not been demonstrated empirically. In this
paper, we propose a framework in which information about multi-word
expressions can be used in the word-alignment task. We show that even
simple features like point-wise mutual information are useful for the
word-alignment task in English-Hindi parallel corpora. The alignment
error rate which we achieve (AER = 0.5040) is lower (a relative decrease
in AER of roughly 9%) than the best alignment error rate of the
state-of-the-art models (Och and Ney, 2003) (best AER = 0.5518) on the
English-Hindi dataset.
1 Introduction
In this paper, we show that measures representing the compositionality
of multi-word expressions can be useful for tasks such as Machine
Translation, and specifically for word alignment. We use an online
learning framework called MIRA (McDonald et al., 2005; Crammer and
Singer, 2003) to train a discriminative model for the word-alignment
task (Taskar et al., 2005; Moore, 2005). The discriminative model makes
use of features which represent the compositionality of multi-word
expressions.
1 At present visiting the Institute for Research in Cognitive
Science, University of Pennsylvania, PA, USA.
Multi-word expressions (MWEs) are those whose structure and meaning
cannot be derived from their component words, as they occur
independently. Examples include conjunctions such as ‘as well as’
(meaning ‘including’), idioms like ‘kick the bucket’ (meaning ‘die’),
phrasal verbs such as ‘find out’ (meaning ‘search’) and compounds like
‘village community’. They can be defined roughly as idiosyncratic
interpretations that cross word boundaries (Sag et al., 2002).
A large number of MWEs have standard syntactic structure but are
semantically non-compositional. Here, we consider the class of verb
based expressions (in which the verb is the head of the phrase), which
occur very frequently. This class of verb based multi-word expressions
includes verbal idioms and support-verb constructions, among others.
For example, ‘take place’ is an MWE but ‘take a gift’ is not.
In the past, various measures have been suggested for measuring the
compositionality of multi-word expressions. Some of these are mutual
information (Church and Hanks, 1989), distributed frequency (Tapanainen
et al., 1998) and the Latent Semantic Analysis (LSA) model (Baldwin et
al., 2003). Even though these measures have been shown to represent
compositionality quite well, compositionality itself has not yet been
shown to be useful in any application. In this paper, we explore the
possibility of using information about the compositionality of (verb
based) MWEs for the word-alignment task. In this preliminary work, we
use simple measures (such as point-wise mutual information) to measure
compositionality.
The paper is organized as follows. In section 2, we discuss the
word-alignment task with respect to the class of multi-word expressions
of interest in this paper. In section 3, we show empirically the
behavior of verb based expressions in a parallel corpus (English-Hindi
in our case). We then discuss our alignment algorithm in section 4. In
section 5, we describe the features which we have used in our training
model. Section 6 discusses the training algorithm and section 7 presents
the results of our discriminative model for the word-alignment task.
Related work and the conclusion follow in sections 8 and 9 respectively.
2 Task: Word alignment of verbs and
their dependents
The task is to align the verbs and their dependents (arguments and
adjuncts) in the source language sentence (English) with words in the
target language sentence (Hindi). The dependents of the verbs in the
source sentence are represented by their head words. Figure 1 shows an
example of the type of multi-word expressions which we consider for
alignment.
[Figure 1: Example of MWEs we consider. Dependency structure of the
sentence ‘The cycling event took place in Philadelphia’: the verb ‘took’
with dependents ‘event’ (subj), ‘place’ (obj) and ‘Philadelphia’
(prep_in).]
In the above example, the goal is to align the words ‘took’, ‘event’,
‘place’ and ‘Philadelphia’ with the corresponding word(s) in the target
language sentence (which is not parsed) using a discriminative approach.
The advantage of using the discriminative approach for alignment is that
it allows the use of various compositionality based features which are
crucial for aligning these expressions. Figure 2 shows the appropriate
alignment of the expression in Figure 1 with the words in the target
language. The pair (take place) in English, a verb and one of its
dependents, is aligned with a single verbal unit in Hindi.
[Figure 2: Alignment of Verb based expression. The dependency structure
of Figure 1 aligned with the words of the Hindi sentence ‘Philadelphia
mein saikling kii pratiyogitaa hui’.]
It is essential to obtain the syntactic roles for dependents in the
source language sentence as they are required for computing the
compositionality value between the dependents and their verbs. The
syntactic roles on the source side are obtained by applying simple rules
to the output of a dependency parser. The dependency parser which we
used in our experiments is a stochastic TAG based dependency parser
(Shen, 2006). A sentence could have one or more verbs. We would like to
align all the expressions represented by those verbs with words in the
target language.
3 Behavior of MWEs in parallel corpora
In this section, we briefly discuss the complexity of the alignment
problem posed by verb based MWEs. From the word aligned sentence pairs,
we compute the fraction of times a source sentence verb and its
dependent are aligned together with the same word in the target language
sentence: we count the number of such verb-dependent pairs and divide it
by the total number of dependents. The total size of our word aligned
corpus is 400 sentence pairs, which includes both training and test
sentences. The total number of dependents present in these sentences is
2209. The total number of verb-dependent pairs which are aligned with
the same word in the target language is 193. Hence, the percentage of
such occurrences is about 9% (193/2209), which is a substantial
fraction.
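The fraction above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `merged_fraction` and the toy alignment data are hypothetical, and a pair is assumed to count as merged when the verb and its dependent share at least one target word.

```python
def merged_fraction(pairs):
    """Fraction of verb-dependent pairs whose alignments share a target word.

    pairs: iterable of (verb_targets, dependent_targets), each a set of
    target-word positions (hypothetical representation).
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    merged = sum(1 for verb_t, dep_t in pairs if verb_t & dep_t)
    return merged / len(pairs)


# Toy data (hypothetical): only the first dependent shares the verb's target word.
toy = [({5}, {5}), ({5}, {4}), ({5}, {0})]
toy_fraction = merged_fraction(toy)

# The counts reported in the text: 193 merged pairs out of 2209 dependents.
reported_fraction = 193 / 2209
print(f"{reported_fraction:.4f}")
```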
4 Alignment algorithm
In this section, we describe the algorithm for aligning verbs and their
dependents in the source language sentence with the words in the target
language. Let $V$ be the number of verbs and $A$ be the number of
dependents. Let the number of words in the target language be $N$. If we
explore all the ways in which the $V + A$ words in the source sentence
can be aligned with words in the target language before choosing the
best alignment, the total number of possibilities is $N^{V+A}$. This is
computationally very expensive. Hence, we use a Beam-search algorithm to
obtain the K-best alignments.
Our algorithm has three main steps.
1. Populate the Beam: Use the local features
(which largely capture the co-occurrence
information between the source word and the
target word) to determine the K-best
alignments of verbs and their dependents with
words in the target language.
2. Re-order the Beam: Re-order the above
alignments using more complex features
(which include the global features and the
compositionality based feature(s)).
3. Post-processing: Extend the alignment(s) of
the verb(s) (on the source side) to include
words which can be part of the verbal unit
on the target side.
For a source sentence, let the verbs and dependents be denoted by
$s_{ij}$. Here $i$ is the index of the verb ($1 \le i \le V$). The
variable $j$ is the index of the dependents ($0 \le j \le A$), except
that $j = 0$ is used to represent the verb itself. Let the source
sentence be denoted as $S = \{s_{ij}\}$ and the target sentence by
$T = \{t_n\}$. The alignment from a source sentence $S$ to a target
sentence $T$ is defined as the mapping
$\bar{a} = \{a_{ijn} \mid a_{ijn} \equiv (s_{ij} \rightarrow t_n), \forall i, j\}$.
A beam is used to store a set of K-best alignments between a source
sentence and the target sentence. It is represented using the symbol
$B$, where $B_k$ ($0 \le k \le K$) is used to refer to a particular
alignment configuration.
4.1 Populate the Beam
The task in this step is to obtain the K-best candidate alignments using
local features. The local features mainly contain the co-occurrence
information between a source and a target word and are independent of
other alignment links or words in the sentences. Let the local feature
vector be denoted as $f_L(s_{ij}, t_k)$. The score of a particular
alignment link is computed by taking the dot product of the weight
vector $W$ with the local feature vector (of the words connected by the
alignment link). Hence, the local score will be
$score_L(s_{ij}, t_k) = W \cdot f_L(s_{ij}, t_k)$
The total score of an alignment configuration is computed by adding the
scores of the individual links in the alignment configuration. Hence,
the alignment score will be
$score_{La}(\bar{a}, S, T) = \sum_{\forall s_{ij} \in S \,\&\, (s_{ij} \rightarrow t_k) \in \bar{a}} score_L(s_{ij}, t_k)$
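The two scores just defined can be illustrated with a short Python sketch. It assumes, purely for illustration, that feature vectors are stored as sparse dictionaries from feature name to value; the weights and feature values shown are hypothetical and are not taken from the experiments.

```python
def dot(weights, features):
    """Dot product of a weight dict with a sparse feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def local_score(weights, link_features):
    """Score of one alignment link: W . f_L(s_ij, t_k)."""
    return dot(weights, link_features)


def configuration_score(weights, links):
    """Score of an alignment configuration: sum of its link scores.

    links: list of sparse feature dicts, one per alignment link.
    """
    return sum(local_score(weights, features) for features in links)


# Hypothetical weights and local feature values for two alignment links.
weights = {"DiceWords": 2.0, "Dict": 1.0}
links = [{"DiceWords": 0.5, "Dict": 1.0}, {"DiceWords": 0.25}]
total = configuration_score(weights, links)
```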
We propose an algorithm of order $O((V + A) N \log(N) + K)$ to compute
the K-best alignment configurations. First, the local scores of each
verb and its dependents are computed for each word in the target
sentence and stored in a local beam denoted by $b_{ij}$. The local beams
corresponding to all the verbs and dependents are then sorted. This
operation has the complexity $(V + A) N \log(N)$.
The goal now is to pick the K-best configura-
tions of alignment links. A single slot in the local
beam corresponds to one alignment link. We de-
fine a boundary which partitions each local beam
into two sets of slots. The slots above the bound-
ary represent the slots which have been explored
by the algorithm while slots below the boundary
have still to be explored. Figure 3 shows the
boundary which cuts across the local beams.
[Figure 3: Boundary. The boundary cuts across the local beams b(i,j),
from which the alignment beam B is filled.]
We keep on modifying the boundary until all the K slots in the alignment
beam are filled with the K-best configurations. At the beginning of the
algorithm, the boundary is a straight line passing through the top of
all the local beams. The top slot of the alignment beam at the beginning
represents the combination of alignment links with the best local
scores.
The next slot $b_{ij}[p]$ (from the set of unexplored slots) to be
included in the boundary is the slot which has the least difference in
score from the score of the slot at the top of its local beam. That is,
we pick the slot $b_{ij}[p]$ such that
$score(b_{ij}[p]) - score(b_{ij}[1])$ is the least in magnitude among
all the unexplored slots (or alignment links). Trivially, $b_{ij}[p-1]$
was already a part of the boundary.
When the slot $b_{ij}[p]$ is included in the boundary, various
configurations, which now contain $b_{ij}[p]$, are added to the
alignment beam. The new configurations are the same as the ones which
previously contained $b_{ij}[p-1]$ but with the replacement of
$b_{ij}[p-1]$ by $b_{ij}[p]$. The above procedure ensures that the
alignment configurations are K-best and are sorted according to the
scores obtained using local features.
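For illustration, the K-best combinations of sorted local beams can also be enumerated with a standard best-first search over a priority queue. The Python sketch below uses that stand-in technique rather than the boundary procedure described above; the function name `k_best_configurations` and the toy scores are hypothetical.

```python
import heapq


def k_best_configurations(local_beams, k):
    """Enumerate the k highest-scoring combinations of alignment links.

    local_beams: one list per source word (verb or dependent), holding
    (score, target_index) slots. Returns (total_score, [target_index, ...])
    tuples in descending order of total score.
    """
    beams = [sorted(beam, key=lambda slot: -slot[0]) for beam in local_beams]
    if not beams or any(not beam for beam in beams):
        return []

    def total(positions):
        return sum(beams[w][p][0] for w, p in enumerate(positions))

    start = tuple(0 for _ in beams)      # top slot of every local beam
    heap = [(-total(start), start)]      # max-heap via negated scores
    seen = {start}
    results = []
    while heap and len(results) < k:
        negated, positions = heapq.heappop(heap)
        targets = [beams[w][p][1] for w, p in enumerate(positions)]
        results.append((-negated, targets))
        # Successors: move one local beam down by a single slot.
        for w in range(len(beams)):
            if positions[w] + 1 < len(beams[w]):
                nxt = positions[:w] + (positions[w] + 1,) + positions[w + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-total(nxt), nxt))
    return results


# Toy local beams (hypothetical integer scores) for two source words.
toy_beams = [[(9, 2), (5, 0)], [(8, 1), (7, 2)]]
top3 = k_best_configurations(toy_beams, 3)
```

Because every successor replaces exactly one slot by the next slot in its local beam, its total score can never exceed that of its parent, so configurations are popped in non-increasing order of total local score.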
4.2 Re-order the beam
We now use global features to re-order the beam.
The global features look at the properties of the en-
tire alignment configuration instead of alignment
links locally.
The global score is defined as the dot product of the weight vector and
the global feature vector.
$score_G(\bar{a}) = W \cdot f_G(\bar{a})$
The overall score is calculated by adding the local score and the global
score.
$score(\bar{a}) = score_{La}(\bar{a}) + score_G(\bar{a})$
The beam is now sorted based on the overall scores of each alignment.
The alignment configuration at the top of the beam is taken as the best
alignment between the source sentence and the target sentence.
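The re-ordering step can be sketched in Python as follows. This is a minimal illustration under assumptions: an alignment is represented as a tuple of target positions, the global feature function `toy_global_features` and the weight value are hypothetical, and only a single overlap-style feature is used.

```python
def rerank(beam, weights, global_features):
    """Re-order a beam by local score plus global score.

    beam: list of (local_score, alignment) entries.
    global_features: function mapping an alignment to a sparse feature dict.
    """
    rescored = []
    for local, alignment in beam:
        feats = global_features(alignment)
        global_score = sum(weights.get(n, 0.0) * v for n, v in feats.items())
        rescored.append((local + global_score, alignment))
    rescored.sort(key=lambda entry: entry[0], reverse=True)
    return rescored


def toy_global_features(alignment):
    """Hypothetical global feature: fires when two source words share a target."""
    has_overlap = len(set(alignment)) < len(alignment)
    return {"Overlap": 1.0 if has_overlap else 0.0}


# Beam sorted by local score; the top entry aligns both words to target 2.
beam = [(17.0, (2, 2)), (16.0, (2, 1))]
reranked = rerank(beam, {"Overlap": -5.0}, toy_global_features)
```

In this toy case the negative weight on the overlap feature moves the second configuration to the top of the beam.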
4.3 Post-processing
The first two steps in our alignment algorithm compute alignments such
that one verb or dependent on the source language side is aligned with
only one word on the target side. But, in the case of compound verbs in
Hindi, the verb in English should be aligned to all the words which
represent the compound verb in Hindi. For example, in Figure 4, the verb
“lost” is aligned to both ‘khoo’ and ‘dii’.
[Figure 4: Case of compound verb in Hindi. The English words ‘I’,
‘lost’, ‘Shyam’s’ and ‘book’ aligned with the Hindi sentence ‘mainee
Shyam ki kitaaba khoo dii’.]
Our alignment algorithm would have aligned “lost” only to ‘khoo’. Hence,
we look at the window of words after the word which is aligned to the
source verb and check if any of them is a verb which has not been
aligned with any word in the source sentence. If this condition is
satisfied, we align the source verb to these words too.
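The post-processing rule can be sketched in Python as below. The window size, the part-of-speech tag names and the function name `extend_verb_alignment` are assumptions made for this illustration; the text does not specify them.

```python
def extend_verb_alignment(aligned_index, target_tags, already_aligned, window=2):
    """Extend a source verb's alignment over a following compound verb.

    aligned_index: target position the source verb is aligned to.
    target_tags: POS tag per target word ("V" marks a verb; assumed tag set).
    already_aligned: target positions linked to some source word.
    window: number of following words inspected (assumed value).
    """
    links = [aligned_index]
    end = min(len(target_tags), aligned_index + 1 + window)
    for position in range(aligned_index + 1, end):
        if target_tags[position] == "V" and position not in already_aligned:
            links.append(position)
    return links


# Figure 4 example: 'mainee Shyam ki kitaaba khoo dii' (tags are hypothetical).
tags = ["PRON", "N", "PSP", "N", "V", "V"]
# 'lost' is aligned to 'khoo' (position 4); 'dii' (position 5) is unaligned.
lost_links = extend_verb_alignment(4, tags, already_aligned={0, 1, 3, 4})
```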
5 Parameters
As the number of training examples (294 sentences) is small, we choose
to use very representative features. Some of the features which we used
in this experiment are as follows.
5.1 Local features ($F_L$)
The local features which we consider are mainly co-occurrence features.
These features estimate the likelihood of a source word aligning to a
target word based on the co-occurrence information obtained from a large
sentence aligned corpus (footnote 1).
1. DiceWords: Dice Coefficient of the source
word and the target word
$DCoeff(s_{ij}, t_k) = \frac{2 \cdot Count(s_{ij}, t_k)}{Count(s_{ij}) + Count(t_k)}$
where $Count(s_{ij}, t_k)$ is the number of times
the word $t_k$ was present in the translation of
sentences containing the word $s_{ij}$ in the
parallel corpus.
2. DiceRoots: Dice Coefficient of the lemmatized
forms of the source and target words.
It is important to consider this feature
because the English-Hindi parallel corpus is not
large and co-occurrence information can be
learnt effectively only after we lemmatize the
words.
3. Dict: Whether there exists a dictionary entry
from the source word $s_{ij}$ to the target word
$t_k$. For English-Hindi, we used a dictionary
available at IIIT-Hyderabad, India.
4. Null: Whether the source word $s_{ij}$ is aligned
to nothing in the target language.
Footnote 1: 50K sentence pairs originally collected as part of the TIDES
MT project and later refined at IIIT-Hyderabad, India.
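As an illustration of the DiceWords feature, the following Python sketch computes Dice coefficients from a sentence aligned corpus. It assumes that counts are taken once per sentence pair (a word repeated within a sentence is counted once); the function name `dice_table` and the toy corpus are hypothetical.

```python
from collections import Counter


def dice_table(sentence_pairs):
    """Build a Dice coefficient function from (source_words, target_words) pairs."""
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src_words, tgt_words in sentence_pairs:
        src_set, tgt_set = set(src_words), set(tgt_words)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        # A target word co-occurs with every source word of the same pair.
        joint_count.update((s, t) for s in src_set for t in tgt_set)

    def dice(source_word, target_word):
        denom = src_count[source_word] + tgt_count[target_word]
        if denom == 0:
            return 0.0
        return 2.0 * joint_count[(source_word, target_word)] / denom

    return dice


# Toy sentence aligned corpus (hypothetical transliterated Hindi).
corpus = [
    (["he", "ran"], ["vaha", "bhaagaa"]),
    (["he", "slept"], ["vaha", "soyaa"]),
]
dice = dice_table(corpus)
```

Applying the same function to lemmatized corpora would give the DiceRoots variant.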
5.2 Global features
The following are the four global features which
we have considered:
• AvgDist: The average distance between the
words in the target language sentence which
are aligned to the verbs in the source language
sentence. AvgDist is then normalized
by dividing it by the number of words in
the target language sentence. If the average
distance is small, it means that the verbs in
the source language sentence are aligned with
words in the target language sentence which
are located close to each other, relative
to the length of the target language sentence.
This feature expresses the distribution of
predicates in the target language.
• Overlap: This feature stores the count of
pairs of verbs in the source language sentence
which align with the same word in the target
language sentence. Overlap is normalized by
dividing it by the total number of pairs of verbs.
This feature is used to discourage overlaps
among the words which are alignments of
verbs in the source language sentence.
• MergePos: This feature can be considered as
a compositionality based feature. The part
of speech tag of a dependent is essential to
determine the likelihood of the dependent
aligning with the same word in the target
language sentence as the word to which its verb
is aligned.
This binary feature is active when the
alignment links of a dependent and its verb
merge. For example, in Figure 5, the feature
‘merge RP’ will be active (that is, merge RP
= 1).
[Figure 5: Example of MergePos feature. English ‘He/N’, ‘ran/V’ and
‘away/RP’ aligned with the Hindi sentence ‘vaha bhaaga gayaa’.]
• MergeMI: This is a compositionality based
feature which associates point-wise mutual
information (in addition to the POS
information) with the cases where the dependents
have the same alignment in the target
language as their verbs. This feature, which
records the compositionality value (represented
by point-wise mutual information in
our experiments), is active if the alignment
links of a dependent and its verb merge.
The mutual information (MI) is classified
into three groups depending on its absolute
value. If the absolute value of the mutual
information, rounded to the nearest integer, is in the
range 0-2, it is considered LOW. If the value
is in the range 3-5, it is considered MEDIUM
and if it is above 5, it is considered HIGH.
The feature “merge RP HIGH” is active in
the example shown in Figure 6.
[Figure 6: Example of MergeMI feature. The same example as in Figure 5
(‘He/N’, ‘ran/V’, ‘away/RP’ and ‘vaha bhaaga gayaa’), with MI = HIGH.]
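The global features listed above can be sketched in Python as follows. Several details are assumptions made for this illustration: AvgDist is taken as the mean absolute distance over all pairs of verb alignments, the feature name is joined with underscores, and Python's built-in `round` (which rounds exact halves to the nearest even integer) is used for the MI value.

```python
from itertools import combinations


def avg_dist(verb_targets, target_length):
    """Mean pairwise distance between verb alignments, normalized by length."""
    pairs = list(combinations(verb_targets, 2))
    if not pairs or target_length <= 0:
        return 0.0
    mean_distance = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return mean_distance / target_length


def overlap(verb_targets):
    """Fraction of verb pairs aligned to the same target word."""
    pairs = list(combinations(verb_targets, 2))
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)


def mi_bucket(mi):
    """Map a mutual information value to LOW, MEDIUM or HIGH."""
    magnitude = round(abs(mi))
    if magnitude <= 2:
        return "LOW"
    if magnitude <= 5:
        return "MEDIUM"
    return "HIGH"


def merge_mi_feature(dependent_pos, mi):
    """Name of the MergeMI feature fired when dependent and verb links merge."""
    return f"merge_{dependent_pos}_{mi_bucket(mi)}"


# Hypothetical values: three verbs aligned to target positions 1, 5 and 9.
example_avg_dist = avg_dist([1, 5, 9], target_length=10)
example_overlap = overlap([3, 3, 7])
example_feature = merge_mi_feature("RP", 6.2)
```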
6 Online large margin training
For parameter optimization, we have used an online large margin
algorithm called MIRA (McDonald et al., 2005; Crammer and Singer, 2003).
We describe the training algorithm that we used very briefly. Our
training set is an English-Hindi word aligned parallel corpus. We get
the verb based expressions in English by running a dependency parser
(Shen, 2006). Let the number of sentence pairs in the training data be
$m$. We have $\{S_q, T_q, \hat{a}_q\}$ for training, where $q \le m$ is
the index number of the sentence pair $\{S_q, T_q\}$ in the training set
and $\hat{a}_q$ is the gold alignment for the pair $\{S_q, T_q\}$. Let
$W$ be the weight vector which has to be learnt, and $W_i$ be the weight
vector after the end of the $i^{th}$ update. To avoid over-fitting, $W$
is obtained by averaging over all the weight vectors $W_i$.
A generic large margin algorithm is defined as follows for the training
instances $\{S_q, T_q, \hat{a}_q\}$:
1. Initialize $W_0$, $W$, $i$
2. for p: 1 to NIterations
3. for q: 1 to m
4. Get K-best predictions $\alpha_q = \{a_1, a_2, \ldots, a_k\}$
for the training example $(S_q, T_q, \hat{a}_q)$ using
the current model $W_i$ and applying steps
1 and 2 of section 4. Compute $W_{i+1}$ by
updating $W_i$ based on $(S_q, T_q, \hat{a}_q, \alpha_q)$.
5. $W = W + W_{i+1}$
6. i = i + 1
7. $W = \frac{W}{NIterations \cdot m}$
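The averaging loop can be sketched generically in Python. The decoder and the update rule are passed in as functions, since their details are described elsewhere; `train_averaged`, `decode_k_best` and `update` are hypothetical names, and the sketch accumulates each weight vector immediately after it has been updated.

```python
def train_averaged(examples, decode_k_best, update, n_features, n_iterations):
    """Online training loop that returns the average of all weight vectors.

    decode_k_best(weights, example) -> list of K-best predictions.
    update(weights, example, predictions) -> new weight vector (list of floats).
    """
    weights = [0.0] * n_features
    total = [0.0] * n_features
    for _ in range(n_iterations):
        for example in examples:
            predictions = decode_k_best(weights, example)
            weights = update(weights, example, predictions)
            # Accumulate the freshly updated vector for averaging.
            total = [t + w for t, w in zip(total, weights)]
    n_updates = n_iterations * len(examples)
    if n_updates == 0:
        return weights
    return [t / n_updates for t in total]


# Dummy decoder and update rule (hypothetical), used only to show the averaging.
averaged = train_averaged(
    examples=[None, None],
    decode_k_best=lambda weights, example: [],
    update=lambda weights, example, predictions: [w + 1.0 for w in weights],
    n_features=1,
    n_iterations=2,
)
```

With the dummy update, the successive weight vectors are 1, 2, 3 and 4, so the averaged weight is 2.5.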
The goal of MIRA is to minimize the change in $W_i$ such that the score
of the gold alignment $\hat{a}$ exceeds the score of each of the
predictions in $\alpha$ by a margin which is equal to the number of
mistakes in the predictions when compared to the gold alignment. While
computing the number of mistakes, the mistakes due to the mis-alignment
of the head verb could be given greater weight, thus prompting the
optimization algorithm to give greater importance to verb related
mistakes and thereby improving overall performance.
Step 4 in the algorithm mentioned above can be substituted by the
following optimization problem:
minimize $\| W_{i+1} - W_i \|$
s.t. $\forall k$, $score(\hat{a}_q, S_q, T_q) - score(a_{q,k}, S_q, T_q) \ge Mistakes(a_{q,k}, \hat{a}_q, S_q, T_q)$
The above optimization problem is converted to the Dual form using one
Lagrangian multiplier for each constraint. In the Dual form, the
Lagrangian multipliers are solved using Hildreth’s algorithm. Here, the
prediction of $\alpha$ is similar to the prediction of K-best classes in
a multi-class classification problem. Ideally, we need to consider all
the possible classes and assign margin constraints based on every class.
But, here the number of such classes is exponential and thus we restrict
ourselves to the K-best classes.
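A minimal Python sketch of such an update is given below. It uses a simple cyclic form of Hildreth's procedure (one Lagrange multiplier per constraint, updated in turn) and is not necessarily the exact variant used in the experiments. The inputs are assumptions: `diffs[k]` is the feature vector of the gold alignment minus that of the k-th prediction, and `losses[k]` is the corresponding number of mistakes.

```python
def hildreth_update(weights, diffs, losses, iterations=100, tol=1e-8):
    """Smallest-change update satisfying w . diffs[k] >= losses[k] for all k.

    weights: current weight vector (list of floats).
    diffs: list of feature difference vectors, f(gold) - f(prediction_k).
    losses: required margins, one per constraint.
    """
    w = list(weights)
    alphas = [0.0] * len(diffs)                      # Lagrange multipliers
    norms = [sum(x * x for x in d) for d in diffs]   # squared norms
    for _ in range(iterations):
        max_step = 0.0
        for k, diff in enumerate(diffs):
            if norms[k] == 0.0:
                continue  # identical features: constraint cannot be changed
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Multipliers must stay non-negative.
            new_alpha = max(0.0, alphas[k] + (losses[k] - margin) / norms[k])
            step = new_alpha - alphas[k]
            if step != 0.0:
                w = [wi + step * di for wi, di in zip(w, diff)]
                alphas[k] = new_alpha
                max_step = max(max_step, abs(step))
        if max_step < tol:
            break
    return w


# Toy problem (hypothetical): two predictions with one and two mistakes.
updated = hildreth_update([0.0, 0.0], diffs=[[1.0, 0.0], [0.0, 2.0]], losses=[1.0, 2.0])
unchanged = hildreth_update([3.0, 0.0], diffs=[[1.0, 0.0]], losses=[1.0])
```

In the toy problem both margin constraints are met with equality after the update, and a weight vector that already satisfies its constraint is left unchanged.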
7 Results on word-alignment task
7.1 Dataset
We have divided the 400 word aligned sentence pairs into a training set
consisting of 294 sentence pairs and a test set consisting of 106
sentence pairs. The source sentences are all dependency parsed (Shen,
2006) and only the verbs and their dependents are considered for both
training and testing our algorithm. Our training algorithm requires that
each of the source words is aligned to at most one target word. For
this, we use simple heuristics to convert the training data to the
appropriate format. For the words aligned to a source verb, the first
verb is chosen as the gold alignment. For the words aligned to any
dependent which is not a verb, the last content word is chosen as the
alignment link. For test data, we do not make any modifications and the
final output from our alignment algorithm is compared with the original
test data.
7.2 Experiments with Giza
We evaluated our discriminative approach by comparing it with the
state-of-the-art Giza++ alignments (Och and Ney, 2003). The metric that
we have used for the comparison is the Alignment Error Rate (AER). The
results shown below also contain Precision, Recall and F-measure.
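For reference, these metrics can be computed as in the Python sketch below. It assumes that every gold link is a sure link, in which case AER reduces to one minus the F-measure; this appears consistent with the values reported in this section, but it is an assumption of the sketch. The function name and the toy links are hypothetical.

```python
def alignment_metrics(predicted, gold):
    """Precision, recall, F-measure and AER for sets of (source, target) links.

    Assumes all gold links are sure links, so AER = 1 - F-measure.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        f_measure = 0.0
    else:
        f_measure = 2.0 * precision * recall / (precision + recall)
    total_links = len(predicted) + len(gold)
    aer = 1.0 - 2.0 * correct / total_links if total_links else 0.0
    return precision, recall, f_measure, aer


# Toy links (hypothetical): two of three predicted links are in the gold set.
predicted_links = {(0, 0), (1, 2), (2, 1)}
gold_links = {(0, 0), (1, 1), (2, 1), (3, 3)}
precision, recall, f_measure, aer = alignment_metrics(predicted_links, gold_links)
```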
Giza++ was trained using an English-Hindi aligned corpus of 50000
sentence pairs. In Table 1, we report the results of the GIZA++
alignments run in both directions (English to Hindi and Hindi to
English). We also show the results of the intersected model.
             Prec.  Recall  F-meas.  AER
Eng → Hin    0.45   0.38    0.41     0.5874
Hin → Eng    0.46   0.27    0.34     0.6584
Intersected  0.82   0.19    0.31     0.6892
Table 1: Results of GIZA++ - Original dataset
We then lemmatize the words on both the source and target sides of the
parallel corpus and run Giza++ again. As the English-Hindi dataset of
50000 sentence pairs is relatively small, we expect lemmatizing to
improve the results. Table 2 shows the results. As we hoped, the results
after lemmatizing the word forms are better than those without.
             Prec.  Recall  F-meas.  AER
Eng → Hin    0.52   0.40    0.45     0.5518
Hin → Eng    0.53   0.30    0.38     0.6185
Intersected  0.82   0.23    0.36     0.6446
Table 2: Results of GIZA++ - lemmatized set
7.3 Experiments with our model
We trained our model using the training set of 294 word aligned sentence
pairs. For training the parameters, we used a beam size of 3 and a
number of iterations equal to 3. Table 3 shows the results when we used
only the basic local features (DiceWords, DiceRoots, Dict and Null) to
train and test our model.
               Prec.  Recall  F-meas.  AER
Local Feats.   0.47   0.38    0.42     0.5798
Table 3: Results using the basic features
When we add the global features (AvgDist, Overlap), we obtain the AER
shown in Table 4.
               Prec.  Recall  F-meas.  AER
+ AvgD., Ove.  0.49   0.39    0.43     0.5689
Table 4: Results using the features - AvgDist, Overlap
Now, we add the transition probabilities obtained from the experiments
with Giza++ as features in our model. Table 5 contains the results.
                Prec.  Recall  F-meas.  AER
+ Giza++ prob.  0.54   0.44    0.49     0.5155
Table 5: Results using the Giza++ probabilities
The compositionality related features are now added to our
discriminative model to see if there is any improvement in performance.
Table 6 shows the results obtained by adding one feature at a time.
                Prec.  Recall  F-meas.  AER
+ MergePos      0.54   0.45    0.49     0.5101
+ MergeMI       0.55   0.45    0.50     0.5045
Table 6: Results using the compositionality based features
We observe that there is an improvement in the AER when the
compositionality based features are used (AER = 0.5045), showing that
compositionality based features aid the word-alignment task.
8 Related work
Various measures have been proposed in the past to measure the
compositionality of multi-word expressions of various types. Some of
them are Frequency, Point-wise mutual information (Church and Hanks,
1989), Distributed frequency of object (Tapanainen et al., 1998),
Distributed frequency of object using verb information (Venkatapathy and
Joshi, 2005), Similarity of object in verb-object pair using the LSA
model (Baldwin et al., 2003; Venkatapathy and Joshi, 2005) and Lexical
and Syntactic fixedness (Fazly and Stevenson, 2006). These features have
largely been evaluated by the correlation of the compositionality value
predicted by these measures with the gold standard value suggested by
human judges. It has been shown that the correlation of these measures
is higher than that of simple baseline measures, suggesting that these
measures represent compositionality quite well. But compositionality as
such has not been used in any specific application yet.
In this paper, we have suggested a framework for using the
compositionality of multi-word expressions for the word-alignment task.
State-of-the-art systems for word alignment use generative models like
GIZA++ (Och and Ney, 2003; Brown et al., 1993). Discriminative models
have been tried recently for word alignment (Taskar et al., 2005; Moore,
2005) as these models give the ability to harness a variety of complex
features which cannot be provided in the generative models. In our work,
we have used the compositionality of multi-word expressions to predict
how they align with the words in the target language sentence.
For parameter optimization for the word-alignment task, Taskar,
Lacoste-Julien and Klein (Taskar et al., 2005) used a large margin
approach by factoring the structure level constraints to constraints at
the level of an alignment link. We cannot do such a factorization
because the scores of alignment links in our case are not computed in a
completely isolated manner. We use an online large margin approach
called MIRA (McDonald et al., 2005; Crammer and Singer, 2003) which fits
well with our framework. MIRA has previously been used by McDonald,
Pereira, Ribarov and Hajic (McDonald et al., 2005) for learning the
parameter values in the task of dependency parsing.
It should be noted that previous word-alignment experiments such as
those of Taskar, Lacoste-Julien and Klein (Taskar et al., 2005) have
been done with very large datasets and with languages between which
there is little word-order variation. Our dataset is small at present
and there is substantial word order variation between the source and
target languages.
9 Conclusion and future work
In this paper, we have proposed a discriminative approach for using the
compositionality information about verb-based multi-word expressions for
the word-alignment task. For training our model, we used an online large
margin algorithm (McDonald et al., 2005). For predicting the alignment
given a model, we proposed a K-best beam search algorithm to make our
prediction algorithm computationally feasible.
We have investigated the usefulness of simple features such as
point-wise mutual information for the word-alignment task in an
English-Hindi bilingual corpus. We have shown that by adding the
compositionality based features to our model, we obtain a decrease in
AER from 0.5155 to 0.5045. Our overall results are better than those
obtained using the GIZA++ models (Och and Ney, 2003).
In the future, we will experiment with more advanced compositionality
based features. But this would require a larger dataset for training and
we are working towards building such a large dataset. Also, we would
like to conduct similar experiments on other language pairs (e.g.
English-French) and compare the results with the state-of-the-art
results reported for those languages.

References
Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows.
2003. An empirical model of multiword expression decomposability. In
Francis Bond, Anna Korhonen, Diana McCarthy, and Aline Villavicencio,
editors, Proceedings of the ACL 2003 Workshop on Multiword Expressions:
Analysis, Acquisition and Treatment, pages 89–96.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and
Robert L. Mercer. 1993. The mathematics of statistical machine
translation: Parameter estimation. In Computational Linguistics.
Kenneth Church and Patrick Hanks. 1989. Word association norms, mutual
information, and lexicography. In Proceedings of the 27th Annual Meeting
of the Association for Computational Linguistics.
Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms
for multiclass problems. In Journal of Machine Learning Research.
Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a
lexicon of verb phrase idiomatic combinations. In Proceedings of the
European Chapter of the Association for Computational Linguistics,
Trento, Italy, April.
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005.
Non-projective dependency parsing using spanning tree algorithms. In
Proceedings of Human Language Technology Conference and Conference on
Empirical Methods in Natural Language Processing, pages 523–530,
Vancouver, British Columbia, Canada, October. Association for
Computational Linguistics.
Robert C. Moore. 2005. A discriminative framework for bilingual word
alignment. In Proceedings of Human Language Technology Conference and
Conference on Empirical Methods in Natural Language Processing, pages
81–88, Vancouver, British Columbia, Canada, October. Association for
Computational Linguistics.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of
various statistical alignment models. In Computational Linguistics.
Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan
Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In
Proceedings of CICLing 2002.
Libin Shen. 2006. Statistical LTAG Parsing. Ph.D. thesis.
Pasi Tapanainen, Jussi Piitulainen, and Timo Järvinen. 1998. Idiomatic
object usage and support verbs. In 36th Annual Meeting of the
Association for Computational Linguistics.
Ben Taskar, Simon Lacoste-Julien, and Dan Klein. 2005. A discriminative
matching approach to word alignment. In Proceedings of Human Language
Technology Conference and Conference on Empirical Methods in Natural
Language Processing, pages 73–80, Vancouver, British Columbia, Canada,
October. Association for Computational Linguistics.
Sriram Venkatapathy and Aravind Joshi. 2005. Measuring the relative
compositionality of verb-noun (V-N) collocations by integrating
features. In Proceedings of Human Language Technology Conference and
Conference on Empirical Methods in Natural Language Processing, pages
899–906, Vancouver, British Columbia, Canada, October. Association for
Computational Linguistics.
