Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X),
pages 45–52, New York City, June 2006. ©2006 Association for Computational Linguistics
Investigating Lexical Substitution Scoring for Subtitle Generation
Oren Glickman and Ido Dagan
Computer Science Department
Bar Ilan University
Ramat Gan, Israel
{glikmao,dagan}@cs.biu.ac.il
Mikaela Keller and Samy Bengio
IDIAP Research Institute
Martigny,
Switzerland
{mkeller,bengio}@idiap.ch
Walter Daelemans
CNTS
Antwerp, Belgium
walter.daelemans@ua.ac.be
Abstract
This paper investigates an isolated setting
of the lexical substitution task of replac-
ing words with their synonyms. In par-
ticular, we examine this problem in the
setting of subtitle generation and evaluate
state of the art scoring methods that pre-
dict the validity of a given substitution.
The paper evaluates two context indepen-
dent models and two contextual models.
The major findings suggest that distribu-
tional similarity provides a useful comple-
mentary estimate for the likelihood that
two WordNet synonyms are indeed substi-
tutable, while proper modeling of contex-
tual constraints is still a challenging task
for future research.
1 Introduction
Lexical substitution - the task of replacing a word
with another one that conveys the same meaning -
is a prominent task in many Natural Language Pro-
cessing (NLP) applications. For example, in query
expansion for information retrieval a query is aug-
mented with synonyms of the original query words,
aiming to retrieve documents that contain these syn-
onyms (Voorhees, 1994). Similarly, lexical substi-
tutions are applied in question answering to identify
answer passages that express the sought answer in
different terms than the original question. In natu-
ral language generation it is common to seek lex-
ical alternatives for the same meaning in order to
reduce lexical repetitions. In general, lexical sub-
stitution aims to preserve a desired meaning while
coping with the lexical variability of expressing that
meaning. Lexical substitution can thus be viewed
within the general framework of recognizing entail-
ment between text segments (Dagan et al., 2005), as
modeling entailment relations at the lexical level.
In this paper we examine the lexical substitu-
tion problem within a specific setting of text com-
pression for subtitle generation (Daelemans et al.,
2004). Subtitle generation is the task of generat-
ing target language TV subtitles for video recordings
of a source language speech. The subtitles should
be of restricted length, which is often shorter than
the full translation of the original speech, yet they
should maintain as much as possible the meaning
of the original content. In a typical (automated)
subtitling process the original speech is first trans-
lated fully into the target language and then the tar-
get translation is compressed to optimize the length
requirements. One of the techniques employed in
the text compression phase is to replace a target lan-
guage word in the original translation with a shorter
synonym of it, thus reducing the character length of
the subtitle. This is a typical lexical substitution
task, which resembles similar operations in other
text compression and generation tasks (e.g. (Knight
and Marcu, 2002)).
This paper investigates the task of assigning like-
lihood scores for the correctness of such lexical sub-
stitutions, in which words in the original translation
are replaced with shorter synonyms. In our experi-
ments we use WordNet as a source of candidate syn-
onyms for substitution. The goal is to score the like-
lihood that the substitution is admissible, i.e. yielding a valid sentence that preserves the original meaning. The focus of this paper is thus to utilize the subtitling setting in order to investigate lexical substitution models in isolation, unlike most previous literature in which this sub-task has been embedded in larger systems and was not evaluated directly.
We examine four statistical scoring models, of
two types. Context independent models score the
general likelihood that the original word is “replace-
able” with the candidate synonym, in an arbitrary
context. That is, they aim to filter out relatively bizarre synonyms, often of rare senses, which are abundant in WordNet but are unlikely to yield valid substitutions. Contextual models score the “fitness” of the
replacing word within the context of the sentence, in
order to filter out synonyms of senses of the original
word that are not the right sense in the given context.
We set up an experiment using actual subti-
tling data and human judgements and evaluate the
different scoring methods. Our findings suggest
the dominance, in this setting, of generic context-
independent scoring. In particular, considering dis-
tributional similarity amongst WordNet synonyms
seems effective for identifying candidate substitu-
tions that are indeed likely to be applicable in actual
texts. Thus, while distributional similarity alone is known to be too noisy as a sole basis for meaning-preserving substitutions, its combination with WordNet allows reducing the noise caused by the many WordNet synonyms that are unlikely to correspond to valid substitutions.
2 Background and Setting
2.1 Subtitling
Automatic generation of subtitles is a summariza-
tion task at the level of individual sentences or occa-
sionally of a few contiguous sentences. Limitations on viewers' reading speed, and on how much of the screen can be filled with text without the image becoming too cluttered, are the constraints that dynamically determine the amount of compression in characters that should be achieved in transforming the transcript into subtitles. Subtitling is not a trivial
task, and is expensive and time-consuming when ex-
perts have to carry it out manually. As for other NLP
tasks, both statistical (machine learning) and linguistic knowledge-based techniques have been considered for this problem. Examples of the former are
(Knight and Marcu, 2002; Hori et al., 2002), and of
the latter are (Grefenstette, 1998; Jing and McKe-
own, 1999). A comparison of both approaches in
the context of a Dutch subtitling system is provided
in (Daelemans et al., 2004). The required sentence
simplification is achieved either by deleting mate-
rial, or by paraphrasing parts of the sentence into
shorter expressions with the same meaning. As a
special case of the latter, lexical substitution is often
used to achieve a compression target by substituting
a word by a shorter synonym. It is on this subtask
that we focus in this paper. Table 1 provides a few examples. E.g., by substituting “happen” by “occur” (example 3), one character is saved without affecting the sentence meaning.
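As an illustration, the lexical substitution step of such a compression phase can be sketched as follows. The synonym table and validity check here are hypothetical placeholders for illustration only, not part of the actual MUSA system:

```python
# Hypothetical table of shorter synonym candidates (source -> target)
shorter_synonyms = {"happen": "occur", "answer": "reply"}

def compress(sentence, is_valid):
    """Replace each word with a shorter synonym when the substitution is valid."""
    out = []
    for word in sentence.split():
        cand = shorter_synonyms.get(word)
        if cand is not None and len(cand) < len(word) and is_valid(sentence, word, cand):
            out.append(cand)
        else:
            out.append(word)
    return " ".join(out)

# Accepting every candidate saves one character on "happen" -> "occur"
print(compress("what would happen if we delay the movement", lambda s, w, c: True))
```

The `is_valid` callback is exactly where the scoring models investigated in this paper would plug in.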
2.2 Experimental Setting
The data used in our experiments was collected in
the context of the MUSA (Multilingual Subtitling of
Multimedia Content) project (Piperidis et al., 2004)1
and was kindly provided for the current study. The
data was provided by the BBC in the form of Hori-
zon documentary transcripts with the corresponding
audio and video. The data for two documentaries
was used to create a dataset consisting of sentences
from the transcripts and the corresponding substitu-
tion examples in which selected words are substi-
tuted by a shorter WordNet synonym. More concretely, a substitution example thus consists of an original sentence s = w1 ... wi ... wn, a specific source word wi in the sentence, and a target (shorter) WordNet synonym w′ to substitute the source. See
Table 1 for examples. The dataset consists of 918
substitution examples originating from 231 different
sentences.
An annotation environment was developed to al-
low efficient annotation of the substitution examples
with the classes true (admissible substitution, in the
given context) or false (inadmissible substitution).
About 40% of the examples were judged as true.
Part of the data was annotated by an additional an-
notator to compute annotator agreement. The Kappa score turned out to be 0.65, corresponding to “Substantial Agreement” (Landis and Koch, 1977). Since
some of the methods we are comparing need tuning, we held out a random subset of 31 original sentences (with 121 corresponding examples) for development and kept for testing the resulting 797 substitution examples from the remaining 200 sentences.

1 http://sinfos.ilsp.gr/musa/

id | sentence | source | target | judgment
1 | The answer may be found in the behaviour of animals. | answer | reply | false
2 | ...and the answer to that was - Yes | answer | reply | true
3 | We then wanted to know what would happen if we delay the movement of the subject's left hand | happen | occur | true
4 | (same sentence as 3) | subject | topic | false
5 | (same sentence as 3) | subject | theme | false
6 | people weren't laughing they were going stone sober. | stone | rock | false
7 | if we can identify a place where the seizures are coming from then we can go in and remove just that small area. | identify | place | false
8 | my approach has been the first to look at the actual structure of the laugh sound. | approach | attack | false
9 | He quickly ran into an unexpected problem. | problem | job | false
10 | today American children consume 5 times more Ritalin than the rest of the world combined | consume | devour | false

Table 1: Substitution examples from the dataset along with their annotations
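For reference, the agreement figure can be computed with a minimal Cohen's kappa implementation. The annotator judgments below are illustrative, not the actual annotations:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label distribution
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative judgments (True = admissible substitution)
ann1 = [True, True, False, False, True, False, False, True]
ann2 = [True, False, False, False, True, False, True, True]
print(cohens_kappa(ann1, ann2))
```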
3 Compared Scoring Models
We compare methods for scoring lexical substitu-
tions. These methods assign a score which is ex-
pected to correspond to the likelihood that the syn-
onym substitution results in a valid subtitle which
preserves the main meaning of the original sentence.
We examine four statistical scoring models, of
two types. The context independent models score
the general likelihood that the source word can be
replaced with the target synonym regardless of the
context in which the word appears. Contextual mod-
els, on the other hand, score the fitness of the target
word within the given context.
3.1 Context Independent Models
Even though synonyms are substitutable in theory, in practice there are many rare synonyms whose likelihood of substitution is very low, as they are substitutable only in obscure contexts. For example, although there are contexts in which the word job is a synonym of the word problem2, this is not typically the case, and overall job is not a good target substitution for the source problem (see example 9 in Table 1). For this reason synonym thesauri such as WordNet tend to be rather noisy for practical purposes, raising the need to score such synonym substitutions and accordingly prioritize substitutions that are more likely to be valid in an arbitrary context.
2 WordNet lists job as a possible member of the synset for a
state of difficulty that needs to be resolved, as might be used in
sentences like “it is always a job to contact him”
As representative approaches for addressing this
problem, we chose two methods that rely on statisti-
cal information of two types: supervised sense dis-
tributions from SemCor and unsupervised distribu-
tional similarity.
3.1.1 WordNet based Sense Frequencies
(semcor)
The obvious reason that a target synonym cannot
substitute a source in some context is if the source
appears in a different sense than the one in which
it is synonymous with the target. This means that, a priori, synonyms of frequent senses of a source word are more likely to provide correct substitutions than synonyms of the word's infrequent senses.
To estimate such likelihood, our first measure is
based on sense frequencies from SemCor (Miller et
al., 1993), a corpus annotated with WordNet senses.
For a given source word u and target synonym v the
score is calculated as the percentage of occurrences
of u in SemCor for which the annotated synset con-
tains v (i.e. u’s occurrences in which its sense is
synonymous with v). This corresponds to the prior
probability estimate that an occurrence of u (in an
arbitrary context) is actually a synonym of v. There-
fore it is suitable as a prior score for lexical substi-
tution.3
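The prior estimate can be sketched as follows, assuming a toy sense-annotated corpus of (word, synset) pairs. The data below is illustrative, not SemCor itself:

```python
# Toy sense-annotated corpus: (word, annotated synset) occurrence pairs.
# Synsets are frozensets of synonymous words (illustrative data).
corpus = [
    ("answer", frozenset({"answer", "reply", "response"})),
    ("answer", frozenset({"answer", "solution"})),
    ("answer", frozenset({"answer", "solution"})),
    ("problem", frozenset({"problem", "job"})),
    ("problem", frozenset({"problem", "trouble"})),
]

def semcor_score(u, v, corpus):
    """Fraction of u's occurrences whose annotated synset contains v."""
    occurrences = [syn for w, syn in corpus if w == u]
    if not occurrences:
        return 0.0
    return sum(v in syn for syn in occurrences) / len(occurrences)

print(semcor_score("answer", "reply", corpus))   # 1 of 3 occurrences of "answer"
print(semcor_score("problem", "job", corpus))    # 1 of 2 occurrences of "problem"
```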
3.1.2 Distributional Similarity (sim)
The SemCor based method relies on a supervised approach and requires a sense-annotated corpus.

3 Note that WordNet semantic distance measures such as those compared in (Budanitsky and Hirst, 2001) are not applicable here, since they measure similarity between synsets rather than between synonymous words within a single synset.

Our
second method uses an unsupervised distributional
similarity measure to score synonym substitutions.
Such measures are based on the general idea of
Harris’ Distributional Hypothesis, suggesting that
words that occur within similar contexts are seman-
tically similar (Harris, 1968).
As a representative of this approach we use Lin's dependency-based distributional similarity database. Lin's database was created using the particular distributional similarity measure in (Lin, 1998), applied to a large corpus of news data (64 million words).4
Two words obtain a high similarity score if they oc-
cur often in the same contexts, as captured by syn-
tactic dependency relations. For example, two verbs will be considered similar if they have large common sets of modifying subjects, objects, adverbs, etc.
Distributional similarity does not capture directly
meaning equivalence and entailment but rather a
looser notion of meaning similarity (Geffet and Da-
gan, 2005). It is typical that non substitutable words
such as antonyms or co-hyponyms obtain high sim-
ilarity scores. However, in our setting we apply the similarity score only to WordNet synonyms, for which it is known a priori that they are substitutable in some contexts. Distributional similarity may thus
capture the statistical degree to which the two words
are substitutable in practice. In fact, it has been
shown that prominence in similarity score corre-
sponds to sense frequency, which was suggested as
the basis for an unsupervised method for identifying
the most frequent sense of a word (McCarthy et al.,
2004).
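Lin's measure itself weights shared dependency contexts by mutual information; as a simplified sketch of the general idea, here is a cosine similarity over toy dependency-context count vectors. The data and the cosine weighting are illustrative, not Lin's actual measure:

```python
import math
from collections import Counter

# Toy dependency-context counts (illustrative): context = (relation, word)
contexts = {
    "happen": Counter({("subj", "accident"): 3, ("subj", "event"): 2, ("adv", "often"): 1}),
    "occur":  Counter({("subj", "accident"): 2, ("subj", "event"): 3, ("adv", "rarely"): 1}),
    "devour": Counter({("obj", "prey"): 4}),
}

def cosine_sim(u, v):
    """Cosine similarity between the dependency-context vectors of u and v."""
    cu, cv = contexts[u], contexts[v]
    dot = sum(cu[c] * cv[c] for c in cu)          # Counter returns 0 for missing keys
    nu = math.sqrt(sum(x * x for x in cu.values()))
    nv = math.sqrt(sum(x * x for x in cv.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine_sim("happen", "occur"))    # high: many shared syntactic contexts
print(cosine_sim("happen", "devour"))   # zero: no shared contexts
```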
3.2 Contextual Models
Contextual models score lexical substitutions based
on the context of the sentence. Such models
try to estimate the likelihood that the target word
could potentially occur in the given context of the
source word and thus may replace it. More con-
cretely, for a given substitution example consist-
ing of an original sentence s = w1 ...wi ...wn,
and a designated source word wi, the contextual
models we consider assign a score to the substi-
tution based solely on the target synonym v and
the context of the source word in the original sen-
4 Available at http://www.cs.ualberta.ca/~lindek/downloads.htm
tence, {w1,...,wi−1,wi+1,...,wn}, which is rep-
resented in a bag-of-words format.
This setting has apparently not been investigated much in the context of lexical substitution in the NLP literature. We chose to evaluate two recently proposed
models that address exactly the task at hand: the first
model was proposed in the context of lexical model-
ing of textual entailment, using a generative Naïve
Bayes approach; the second model was proposed
in the context of machine learning for information
retrieval, using a discriminative neural network ap-
proach. The two models were trained on the (unannotated) sentences of the BNC 100 million word corpus (Burnard, 1995) in bag-of-words format. The corpus was broken into sentences, tokenized and lemmatized, and stop words and tokens appearing only once were removed.
once were removed. While training of these models
is done in an unsupervised manner, using unlabeled
data, some parameter tuning was performed using
the small development set described in Section 2.
3.2.1 Bayesian Model (bayes)
The first contextual model we examine is the one
proposed in (Glickman et al., 2005) to model tex-
tual entailment at the lexical level. For a given tar-
get word this unsupervised model takes a binary text
categorization approach. Each vocabulary word is
considered a class, and contexts are classified as to
whether the given target word is likely to occur in
them. Taking a probabilistic Naïve-Bayes approach
the model estimates the conditional probability of the target word given the context, based on corpus co-occurrence statistics. We adapted and implemented
this algorithm and trained the model on the sen-
tences of the BNC corpus.
For a bag-of-words context C = {w1, ..., wi−1, wi+1, ..., wn} and target word v, the Naïve Bayes estimate for the conditional probability that the word v may occur in a given context C is as follows:

P(v|C) = P(C|v)P(v) / (P(C|v)P(v) + P(C|¬v)P(¬v))
       ≈ P(v) ∏w∈C P(w|v) / (P(v) ∏w∈C P(w|v) + P(¬v) ∏w∈C P(w|¬v))   (1)
where P(w|v) is the probability that a word w ap-
pears in the context of a sentence containing v and
correspondingly P(w|¬v) is the probability that w
appears in a sentence not containing v. The prob-
ability estimates were obtained from the processed
BNC corpus as follows:
P(w|v) = |occurrences of w in sentences containing v| / |words in sentences containing v|

P(w|¬v) = |occurrences of w in sentences not containing v| / |words in sentences not containing v|
To avoid 0 probabilities these estimates were
smoothed by adding a small constant to all counts
and normalizing accordingly. The constant value
was tuned using the development set to maximize
average precision (see Section 4.1). The estimated
probability, P(v|C), was used as the confidence
score for each substitution example.
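A minimal sketch of this estimation follows; the toy corpus and smoothing constant are illustrative, whereas the paper's model was trained on the full BNC with a tuned constant:

```python
import math

# Toy lemmatized, stop-word-filtered corpus (illustrative)
sentences = [
    {"seizure", "remove", "area"},
    {"seizure", "identify", "place"},
    {"laugh", "sound", "structure"},
    {"laugh", "people", "sober"},
]

def naive_bayes_score(v, context, sentences, alpha=0.1):
    """P(v | C) via Naive Bayes with add-alpha smoothed word probabilities."""
    vocab = set().union(*sentences)
    pos = [s for s in sentences if v in s]       # sentences containing v
    neg = [s for s in sentences if v not in s]   # sentences not containing v
    p_v = len(pos) / len(sentences)

    def p_word(w, sents):
        total = sum(len(s) for s in sents)
        count = sum(w in s for s in sents)
        return (count + alpha) / (total + alpha * len(vocab))

    # Log-space products over the context words, per equation (1)
    log_pos = math.log(p_v) + sum(math.log(p_word(w, pos)) for w in context)
    log_neg = math.log(1 - p_v) + sum(math.log(p_word(w, neg)) for w in context)
    m = max(log_pos, log_neg)
    e_pos, e_neg = math.exp(log_pos - m), math.exp(log_neg - m)
    return e_pos / (e_pos + e_neg)

print(naive_bayes_score("seizure", {"remove", "place"}, sentences))
```

Words that co-occur with v in the corpus push the score toward 1; words that only occur in sentences without v push it toward 0.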
3.2.2 Neural Network Model (nntr)
As a second contextual model we evaluated the
Neural Network for Text Representation (NNTR)
proposed in (Keller and Bengio, 2005). NNTR is
a discriminative approach which aims at modeling
how likely a given word v is in the context of a piece
of text C, while learning a more compact represen-
tation of reduced dimensionality for both v and C.
NNTR is composed of 3 Multilayer Perceptrons, noted mlpA(), mlpB() and mlpC(), connected as follows:

NNTR(v,C) = mlpC[mlpA(v), mlpB(C)].
mlpA(v) and mlpB(C) project respectively the
vector space representation of the word and text
into a more compact space of lower dimensionality.
mlpC() takes as input the new representations of v
and C and outputs a score for the contextual rele-
vance of v to C.
As training data, couples (v,C) from the BNC cor-
pus are provided to the learning scheme. The target
training value for the output of the system is 1 if v is
indeed in C and -1 otherwise. The hope is that the
neural network will be able to generalize to words
which are not in the piece of text but are likely to be
related to it.
In essence, this model is trained by minimizing
the weighted sum of the hinge loss function over
negative and positive couples, using stochastic Gradient Descent (see (Keller and Bengio, 2005) for further details). The small held out development set of
the substitution dataset was used to tune the hyper-
parameters of the model, maximizing average preci-
sion (see Section 4.1). For simplicity mlpA() and
mlpB() were reduced to Perceptrons. The output
size of mlpA() was set to 20, mlpB() to 100 and the
number of hidden units of mlpC() was set to 500.
There are a couple of important conceptual differences between the discriminative NNTR model and the generative Bayesian model described above.
First, the relevancy of v to C in NNTR is inferred
in a more compact representation space of reduced
dimensionality, which may enable a higher degree
of generalization. Second, in NNTR we are able to
control the capacity of the model in terms of num-
ber of parameters, enabling better control to achieve
an optimal generalization level with respect to the
training data (avoiding over or under fitting).
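To make the architecture concrete, here is an untrained forward pass with toy dimensions; all sizes and the tanh activations are illustrative (the paper used output sizes of 20 and 100 with 500 hidden units, and reduced mlpA and mlpB to Perceptrons), and the hinge-loss training loop is omitted:

```python
import math
import random

random.seed(0)

def linear(x, layer):
    W, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def tanh_vec(x):
    return [math.tanh(v) for v in x]

def make_layer(n_in, n_out):
    W = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

# Toy dimensions for illustration
VOCAB, DA, DB, DH = 6, 3, 4, 5
mlpA = make_layer(VOCAB, DA)           # word projection
mlpB = make_layer(VOCAB, DB)           # context projection
mlpC_hidden = make_layer(DA + DB, DH)  # joint hidden layer of mlpC
mlpC_out = make_layer(DH, 1)           # scalar relevance score

def nntr(v_onehot, c_bow):
    a = tanh_vec(linear(v_onehot, mlpA))
    b = tanh_vec(linear(c_bow, mlpB))
    h = tanh_vec(linear(a + b, mlpC_hidden))   # list concat joins the two codes
    return linear(h, mlpC_out)[0]

v = [1, 0, 0, 0, 0, 0]   # one-hot target word
C = [0, 1, 1, 0, 1, 0]   # bag-of-words context
print(nntr(v, C))
```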
4 Empirical Results
4.1 Evaluation Measures
We compare the lexical substitution scoring methods using two evaluation measures, offering two different perspectives of evaluation.
4.1.1 Accuracy
The first evaluation measure is motivated by simulating a decision step of a subtitling system, in which the best scoring lexical substitution is selected for each given sentence. Such a decision may correspond
to a situation in which each single substitution may
suffice to obtain the desired compression rate, or
might be part of a more complex decision mecha-
nism of the complete subtitling system. We thus
measure the resulting accuracy of subtitles created
by applying the best scoring substitution example
for every original sentence. This provides a macro
evaluation style since we obtain a single judgment
for each group of substitution examples that corre-
spond to one original sentence.
In our dataset 25.5% of the original sentences
have no correct substitution examples and for 15.5%
of the sentences all substitution examples were an-
notated as correct. Accordingly, the (macro aver-
aged) accuracy has a lower bound of 0.155 and up-
per bound of 0.745.
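The macro accuracy computation can be sketched as follows; the scores and gold judgments below are illustrative:

```python
# Illustrative substitution examples: (sentence_id, model_score, gold_judgment)
examples = [
    (1, 0.9, True), (1, 0.4, False),
    (2, 0.7, False), (2, 0.2, True),
    (3, 0.5, True),
]

def macro_accuracy(examples):
    """Fraction of sentences whose best-scoring substitution is judged correct."""
    best_per_sentence = {}
    for sid, score, gold in examples:
        best = best_per_sentence.get(sid)
        if best is None or score > best[0]:
            best_per_sentence[sid] = (score, gold)
    judged = best_per_sentence.values()
    return sum(gold for _, gold in judged) / len(best_per_sentence)

print(macro_accuracy(examples))   # 2 of 3 sentences get a correct best substitution
```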
4.1.2 Average Precision
As a second evaluation measure we compare the
average precision of each method over all the exam-
ples from all original sentences pooled together (a
micro averaging approach). This measures the po-
tential of a scoring method to ensure high precision
for the high scoring examples and to filter out low-
scoring incorrect substitutions.
Average precision is a single figure measure com-
monly used to evaluate a system’s ranking ability
(Voorhees and Harman, 1999). It is equivalent to the
area under the uninterpolated recall-precision curve,
defined as follows:

average precision = Σ_{i=1..N} P(i)T(i) / Σ_{i=1..N} T(i)
P(i) = Σ_{k=1..i} T(k) / i   (2)
where N is the number of examples in the test
set (797 in our case), T(i) is the gold annotation
(true=1, false=0) and i ranges over the examples
ranked by decreasing score. An average precision
of 1.0 means that the system assigned a higher score
to all true examples than to any false one (perfect
ranking). A lower bound of 0.26 on our test set cor-
responds to a system that ranks all false examples
above the true ones.
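Equation (2) corresponds directly to the following computation; the scored examples below are illustrative:

```python
def average_precision(scored):
    """scored: (score, gold) pairs; gold is 1 for true, 0 for false examples."""
    ranked = sorted(scored, key=lambda x: -x[0])   # rank by decreasing score
    hits, precision_sum = 0, 0.0
    for i, (_, gold) in enumerate(ranked, start=1):
        hits += gold
        if gold:
            precision_sum += hits / i   # P(i), accumulated only at true examples
    return precision_sum / hits if hits else 0.0

scored = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 0), (0.2, 1)]
print(average_precision(scored))
```

Dividing the summed P(i) by the number of true examples is exactly the ratio of the two sums in equation (2), since T(i) zeroes out the false examples in the numerator.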
4.2 Results
Figure 1 shows the accuracy and average precision
results of the various models on our test set. The random baseline and corresponding significance levels
were achieved by averaging multiple runs of a sys-
tem that assigned random scores. As can be seen in
the figures, the models’ behavior seems to be con-
sistent in both evaluation measures.
Overall, the distributional similarity based
method (sim) performs much better than the
other methods. In particular, Lin’s similarity
also performs better than semcor, the other
context-independent model. Generally, the context
independent models perform better than the contex-
tual ones. Between the two contextual models, nntr
is superior to Bayes. In fact the Bayes model is not
significantly better than random scoring.
4.3 Analysis and Discussion
When analyzing the data we identified several rea-
sons why some of the WordNet substitutions were
judged as false. In some cases the source word as
appearing in the original sentence is not in a sense
for which it is a synonym of the target word. For ex-
ample, in many situations the word answer is in the
sense of a statement that is made in reply to a ques-
tion or request. In such cases, such as in example 2
from Table 1, answer can be successfully replaced
with reply yielding a substitution which conveys the
original meaning. However, in situations such as in
example 1 the word answer is in the sense of a gen-
eral solution and cannot be replaced with reply. This
is also the case in examples 4 and 5 in which subject
does not appear in the sense of topic or theme.
Having an inappropriate sense, however, is not the only reason for incorrect substitutions. In example 8
approach appears in a sense which is synonymous
with attack and in example 9 problem appears in a
sense which is synonymous with a quite uncommon
use of the word job. Nevertheless, these substitu-
tions were judged as unacceptable since the desired
sense of the target word after the substitution is not
very clear from the context. In many other cases, such as in example 7, though semantically correct, the substitution was judged as incorrect due to stylistic considerations.
Finally, there are cases, such as in example 6
in which the source word is part of a collocation
and cannot be replaced with semantically equivalent
words.
When analyzing the mistakes of the distributional
similarity method it seems as if many were not nec-
essarily due to the method itself but rather to imple-
mentation issues. The online source we used con-
tains only the top most similar words for any word.
In many cases substitutions were assigned a score of
zero since they were not listed among the top scoring
similar words in the database. Furthermore, the cor-
pus that was used for training the similarity scores consisted of news articles in American English spelling, and therefore does not always supply good scores for words with British spelling in our BBC dataset (e.g. analyse, behavioural, etc.).
The similarity based method seems to perform
better than the SemCor based method since, as noted
above, even when the source word is in the appropriate sense it is not necessarily substitutable with the target.

Figure 1: Accuracy and Average Precision Results

For this reason we hypothesize that applying Word Sense Disambiguation (WSD) methods to classify the specific WordNet sense of the source and
target words may have only a limited impact on per-
formance.
Overall, context independent models seem to per-
form relatively well since many candidate synonyms
are a priori not substitutable. This demonstrates that
such models are able to filter out many quirky Word-
Net synonyms, such as problem and job.
Fitness to the sentence context seems to be a less frequent factor, and one that is not trivial to model. Local
context (adjacent words) seems to play more of a
role than the broader sentence context. However,
these two types of contexts were not distinguished in
the bag-of-words representations of the two contex-
tual methods that we examined. It will be interesting
to investigate in future research using different fea-
ture types for local and global context, as commonly
done for Word Sense Disambiguation (WSD). Yet,
it would still remain a challenging task to correctly
distinguish, for example, the contexts for which an-
swer is substitutable by reply (as in example 2) from
contexts in which it is not (as in example 1).
So far we have investigated separately the perfor-
mance of context independent and contextual mod-
els. In fact, the accuracy performance of the (con-
text independent) sim method is not that far from
the upper bound, and the analysis above indicated a
rather small potential for improvement by incorpo-
rating information from a contextual method. Yet,
there is still substantial room for improvement in the ranking quality of this model, as measured by average precision, and it is possible that a smart combination with a high-quality contextual model would
yield better performance. In particular, we would
expect that a good contextual model will identify the cases in which, for a potentially good synonym pair, the source word appears in a sense that is not substitutable with the target, such as in examples 1, 4 and
5 in Table 1. Investigating better contextual models
and their optimal combination with context indepen-
dent models remains a topic for future research.
5 Conclusion
This paper investigated an isolated setting of the lex-
ical substitution task, which has typically been em-
bedded in larger systems and not evaluated directly.
The setting allowed us to analyze different types of
state of the art models and their behavior with re-
spect to characteristic sub-cases of the problem.
The major conclusion that seems to arise from
our experiments is the effectiveness of combining a
knowledge based thesaurus such as WordNet with
distributional statistical information such as (Lin,
1998), overcoming the known deficiencies of each
method alone. Furthermore, modeling the a pri-
ori substitution likelihood captures the majority of
cases in the evaluated setting, mostly because Word-
Net provides a rather noisy set of substitution candi-
dates. On the other hand, successfully incorporating local and global contextual information, similarly to WSD methods, remains a challenging task for future research. Overall, scoring lexical substitutions
is an important component in many applications and
we expect that our findings are likely to be broadly
applicable.

References
[Budanitsky and Hirst2001] Alexander Budanitsky and
Graeme Hirst. 2001. Semantic distance in word-
net: An experimental, application-oriented evalua-
tion of five measures. In Workshop on WordNet and
Other Lexical Resources: Second Meeting of the North
American Chapter of the Association for Computa-
tional Linguistics, pages 29–34.
[Burnard1995] Lou Burnard. 1995. Users Reference
Guide for the British National Corpus. Oxford Uni-
versity Computing Services, Oxford.
[Daelemans et al.2004] Walter Daelemans, Anja Höthker, and Erik Tjong Kim Sang. 2004. Automatic sentence simplification for subtitling in Dutch and English. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 1045–1048.
[Dagan et al.2005] Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.
[Geffet and Dagan2005] Maayan Geffet and Ido Dagan.
2005. The distributional inclusion hypotheses and lex-
ical entailment. In Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguis-
tics (ACL’05), pages 107–114, Ann Arbor, Michigan,
June. Association for Computational Linguistics.
[Glickman et al.2005] Oren Glickman, Ido Dagan, and
Moshe Koppel. 2005. A probabilistic classifica-
tion approach for lexical textual entailment. In AAAI,
pages 1050–1055.
[Grefenstette1998] Gregory Grefenstette. 1998. Produc-
ing Intelligent Telegraphic Text Reduction to Provide
an Audio Scanning Service for the Blind. pages 111–
117, Stanford, CA, March.
[Harris1968] Zellig Harris. 1968. Mathematical Structures of Language. New York: Wiley.
[Hori et al.2002] Chiori Hori, Sadaoki Furui, Rob Malkin, Hua Yu, and Alex Waibel. 2002. Automatic speech summarization applied to English broadcast news speech. Volume 1, pages 9–12.
[Jing and McKeown1999] Hongyan Jing and Kathleen R.
McKeown. 1999. The decomposition of human-
written summary sentences. In SIGIR ’99: Proceed-
ings of the 22nd annual international ACM SIGIR con-
ference on Research and development in information
retrieval, pages 129–136, New York, NY, USA. ACM
Press.
[Keller and Bengio2005] Mikaela Keller and Samy Bengio. 2005. A neural network for text representation. In Włodzisław Duch, Janusz Kacprzyk, and Erkki Oja, editors, Artificial Neural Networks: Biological Inspirations, ICANN 2005: 15th International Conference, Warsaw, Poland, September 11–15, 2005, Proceedings, Part II, volume 3697 of Lecture Notes in Computer Science, page 667. Springer-Verlag GmbH.
[Knight and Marcu2002] Kevin Knight and Daniel
Marcu. 2002. Summarization beyond sentence
extraction: a probabilistic approach to sentence
compression. Artif. Intell., 139(1):91–107.
[Landis and Koch1977] J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33:159–174.
[Lin1998] Dekang Lin. 1998. Automatic retrieval and
clustering of similar words. In Proceedings of the
17th international conference on Computational lin-
guistics, pages 768–774, Morristown, NJ, USA. Asso-
ciation for Computational Linguistics.
[McCarthy et al.2004] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2004. Finding predominant senses in untagged text. In ACL, pages 280–288, Morristown, NJ, USA. Association for Computational Linguistics.
[Miller et al.1993] George A. Miller, Claudia Leacock,
Randee Tengi, and Ross T. Bunker. 1993. A semantic
concordance. In HLT ’93: Proceedings of the work-
shop on Human Language Technology, pages 303–
308, Morristown, NJ, USA. Association for Compu-
tational Linguistics.
[Piperidis et al.2004] Stelios Piperidis, Iason Demiros,
Prokopis Prokopidis, Peter Vanroose, Anja Höthker,
Walter Daelemans, Elsa Sklavounou, Manos Kon-
stantinou, and Yannis Karavidas. 2004. Multimodal
multilingual resources in the subtitling process. In
Proceedings of the 4th International Language Re-
sources and Evaluation Conference (LREC 2004), Lis-
bon.
[Voorhees and Harman1999] Ellen M. Voorhees and
Donna Harman. 1999. Overview of the seventh text
retrieval conference. In Proceedings of the Seventh
Text REtrieval Conference (TREC-7). NIST Special
Publication.
[Voorhees1994] Ellen M. Voorhees. 1994. Query expan-
sion using lexical-semantic relations. In SIGIR ’94:
Proceedings of the 17th annual international ACM SI-
GIR conference on Research and development in infor-
mation retrieval, pages 61–69, New York, NY, USA.
Springer-Verlag New York, Inc.
