<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2907"> <Title>Investigating Lexical Substitution Scoring for Subtitle Generation</Title> <Section position="5" start_page="46" end_page="48" type="metho"> <SectionTitle> 3 Compared Scoring Models </SectionTitle> <Paragraph position="0"> We compare methods for scoring lexical substitutions. These methods assign a score which is expected to correspond to the likelihood that the synonym substitution results in a valid subtitle which preserves the main meaning of the original sentence.</Paragraph> <Paragraph position="1"> We examine four statistical scoring models, of two types. The context independent models score the general likelihood that the source word can be replaced with the target synonym regardless of the context in which the word appears. Contextual models, on the other hand, score the fitness of the target word within the given context.</Paragraph> <Section position="1" start_page="46" end_page="47" type="sub_section"> <SectionTitle> 3.1 Context Independent Models </SectionTitle> <Paragraph position="0"> Even though synonyms are substitutable in theory, in practice there are many rare synonyms for which the likelihood of substitution is very low and will be substitutable only in obscure contexts. For example, although there are contexts in which the word job is a synonym of the word problem2, this is not typically the case and overall job is not a good target substitution for the source problem (see example 9 in Table 1). For this reason synonym thesauruses such as WordNet tend to be rather noisy for practicalpurposes, raisingtheneedtoscoresuchsynonym substitutions and accordingly prioritize substitutions that are more likely to be valid in an arbitrary context. null 2WordNet lists job as a possible member of the synset for a state of difficulty that needs to be resolved, as might be used in sentences like &quot;it is always a job to contact him&quot; As representative approaches for addressing this problem, we chose two methods that rely on statistical information of two types: supervised sense distributions from SemCor and unsupervised distributional similarity.</Paragraph> <Paragraph position="1"> (semcor) The obvious reason that a target synonym cannot substitute a source in some context is if the source appears in a different sense than the one in which it is synonymous with the target. This means that a priori, synonymsoffrequentsensesofasourceword are more likely to provide correct substitutions than synonyms of the word's infrequent senses.</Paragraph> <Paragraph position="2"> To estimate such likelihood, our first measure is based on sense frequencies from SemCor (Miller et al., 1993), a corpus annotated with Wordnet senses.</Paragraph> <Paragraph position="3"> For a given source word u and target synonym v the score is calculated as the percentage of occurrences of u in SemCor for which the annotated synset contains v (i.e. u's occurrences in which its sense is synonymous with v). This corresponds to the prior probability estimate that an occurrence of u (in an arbitrary context) is actually a synonym of v. 
<Paragraph position="4"> Our second method uses an unsupervised distributional similarity measure to score synonym substitutions. Such measures are based on the general idea of Harris' Distributional Hypothesis, suggesting that words that occur within similar contexts are semantically similar (Harris, 1968).</Paragraph> <Paragraph position="5"> As a representative of this approach we use Lin's dependency-based distributional similarity database.</Paragraph> <Paragraph position="6"> Lin's database was created using the particular distributional similarity measure in (Lin, 1998), applied to a large corpus of news data (64 million words).</Paragraph> <Paragraph position="7"> Two words obtain a high similarity score if they often occur in the same contexts, as captured by syntactic dependency relations. For example, two verbs will be considered similar if they have large common sets of modifying subjects, objects, adverbs, etc.</Paragraph> <Paragraph position="8"> Distributional similarity does not directly capture meaning equivalence and entailment but rather a looser notion of meaning similarity (Geffet and Dagan, 2005). It is typical that non-substitutable words such as antonyms or co-hyponyms obtain high similarity scores. However, in our setting we apply the similarity score only to WordNet synonyms, for which it is known a priori that they are substitutable in some contexts. Distributional similarity may thus capture the statistical degree to which the two words are substitutable in practice. In fact, it has been shown that prominence in similarity score corresponds to sense frequency, which was suggested as the basis for an unsupervised method for identifying the most frequent sense of a word (McCarthy et al., 2004).</Paragraph> </Section> <Section position="2" start_page="47" end_page="48" type="sub_section"> <SectionTitle> 3.2 Contextual Models </SectionTitle> <Paragraph position="0"> Contextual models score lexical substitutions based on the context of the sentence. Such models try to estimate the likelihood that the target word could potentially occur in the given context of the source word and thus may replace it. More concretely, for a given substitution example consisting of an original sentence s = w1 ... wi ... wn and a designated source word wi, the contextual models we consider assign a score to the substitution based solely on the target synonym v and the context of the source word in the original sentence, {w1,...,wi-1,wi+1,...,wn}, which is represented in a bag-of-words format.</Paragraph> <Paragraph position="1"> Apparently, this setting has not been investigated much in the context of lexical substitution in the NLP literature. We chose to evaluate two recently proposed models that address exactly the task at hand: the first model was proposed in the context of lexical modeling of textual entailment, using a generative Naïve Bayes approach; the second model was proposed in the context of machine learning for information retrieval, using a discriminative neural network approach. The two models were trained on the (unannotated) sentences of the BNC 100 million word corpus (Burnard, 1995) in bag-of-words format. The corpus was broken into sentences, tokenized and lemmatized; stop words and tokens appearing only once were removed. While training of these models is done in an unsupervised manner, using unlabeled data, some parameter tuning was performed using the small development set described in Section 2.</Paragraph>
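As an illustration of this preprocessing step only, a sketch using NLTK; the exact tokenizer, lemmatizer and stop-word list used by the authors are not specified here and are assumptions:

```python
from collections import Counter

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


def preprocess(sentences, language="english"):
    """Turn raw sentences into bag-of-words contexts:
    tokenize, lemmatize, drop stop words and tokens occurring only once."""
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words(language))

    lemmatized = [
        [lemmatizer.lemmatize(tok.lower()) for tok in word_tokenize(s)]
        for s in sentences
    ]
    # Corpus-wide counts, used to discard hapax legomena.
    counts = Counter(tok for sent in lemmatized for tok in sent)

    return [
        {tok for tok in sent if tok not in stop and counts[tok] > 1}
        for sent in lemmatized
    ]
```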
<Paragraph position="2"> The first contextual model we examine is the one proposed in (Glickman et al., 2005) to model textual entailment at the lexical level. For a given target word this unsupervised model takes a binary text categorization approach. Each vocabulary word is considered a class, and contexts are classified as to whether the given target word is likely to occur in them. Taking a probabilistic Naïve Bayes approach, the model estimates the conditional probability of the target word given the context based on corpus co-occurrence statistics. We adapted and implemented this algorithm and trained the model on the sentences of the BNC corpus.</Paragraph> <Paragraph position="3"> For a bag-of-words context C = {w1,...,wi-1,wi+1,...,wn} and target word v, the Naïve Bayes estimate of the conditional probability that v may occur in the given context C is as follows:

P(v|C) = \frac{P(v) \prod_{w \in C} P(w|v)}{P(v) \prod_{w \in C} P(w|v) + P(\neg v) \prod_{w \in C} P(w|\neg v)} \qquad (1)

where P(w|v) is the probability that a word w appears in the context of a sentence containing v and, correspondingly, P(w|\neg v) is the probability that w appears in a sentence not containing v. The probability estimates were obtained from the processed BNC corpus as follows:

P(w|v) = \frac{|\{s : w \in s \wedge v \in s\}|}{|\{s : v \in s\}|} \qquad P(w|\neg v) = \frac{|\{s : w \in s \wedge v \notin s\}|}{|\{s : v \notin s\}|}

where s ranges over the sentences of the corpus. To avoid zero probabilities these estimates were smoothed by adding a small constant to all counts and normalizing accordingly. The constant value was tuned using the development set to maximize average precision (see Section 4.1). The estimated probability, P(v|C), was used as the confidence score for each substitution example.</Paragraph> <Paragraph position="8"> As a second contextual model we evaluated the Neural Network for Text Representation (NNTR) proposed in (Keller and Bengio, 2005). NNTR is a discriminative approach which aims at modeling how likely a given word v is in the context of a piece of text C, while learning a more compact representation of reduced dimensionality for both v and C. NNTR is composed of three Multilayer Perceptrons, denoted mlpA(), mlpB() and mlpC(), connected as follows: NNTR(v,C) = mlpC[mlpA(v),mlpB(C)].</Paragraph> <Paragraph position="9"> mlpA(v) and mlpB(C) project the vector-space representations of the word and the text, respectively, into a more compact space of lower dimensionality. mlpC() takes as input the new representations of v and C and outputs a score for the contextual relevance of v to C.</Paragraph> <Paragraph position="10"> As training data, pairs (v,C) from the BNC corpus are provided to the learning scheme. The target training value for the output of the system is 1 if v is indeed in C and -1 otherwise. The hope is that the neural network will be able to generalize to words which are not in the piece of text but are likely to be related to it.</Paragraph> <Paragraph position="11"> In essence, this model is trained by minimizing the weighted sum of the hinge loss function over negative and positive pairs, using stochastic Gradient Descent (see (Keller and Bengio, 2005) for further details). The small held-out development set of the substitution dataset was used to tune the hyper-parameters of the model, maximizing average precision (see Section 4.1). For simplicity, mlpA() and mlpB() were reduced to Perceptrons. The output size of mlpA() was set to 20, that of mlpB() to 100, and the number of hidden units of mlpC() to 500.</Paragraph>
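A sketch of this kind of architecture in PyTorch (an assumption of convenience; the original work predates it), using the dimensions stated above. The one-hot word encoding, bag-of-words context encoding, vocabulary size and the unweighted hinge loss are illustrative simplifications:

```python
import torch
import torch.nn as nn


class NNTR(nn.Module):
    """NNTR(v, C) = mlpC[mlpA(v), mlpB(C)]: two projection networks
    feed a scoring network that rates the relevance of word v to text C."""

    def __init__(self, vocab_size, word_dim=20, context_dim=100, hidden=500):
        super().__init__()
        # mlpA and mlpB reduced to single-layer Perceptrons, as described above.
        self.mlpA = nn.Linear(vocab_size, word_dim)
        self.mlpB = nn.Linear(vocab_size, context_dim)
        self.mlpC = nn.Sequential(
            nn.Linear(word_dim + context_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, v_onehot, c_bow):
        a = self.mlpA(v_onehot)   # compact representation of the word
        b = self.mlpB(c_bow)      # compact representation of the context
        return self.mlpC(torch.cat([a, b], dim=-1)).squeeze(-1)


def hinge_loss(score, target):
    # target is +1 if v occurs in C, -1 otherwise; the per-class weighting
    # of positive and negative pairs is omitted here for brevity.
    return torch.clamp(1.0 - target * score, min=0.0).mean()
```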
<Paragraph position="12"> There are two important conceptual differences between the discriminative NNTR model and the generative Bayesian model described above.</Paragraph> <Paragraph position="13"> First, the relevance of v to C in NNTR is inferred in a more compact representation space of reduced dimensionality, which may enable a higher degree of generalization. Second, in NNTR we are able to control the capacity of the model in terms of the number of parameters, enabling better control to achieve an optimal generalization level with respect to the training data (avoiding over- or under-fitting).</Paragraph> </Section> </Section> <Section position="6" start_page="48" end_page="49" type="metho"> <SectionTitle> 4 Empirical Results </SectionTitle> <Section position="1" start_page="48" end_page="49" type="sub_section"> <SectionTitle> 4.1 Evaluation Measures </SectionTitle> <Paragraph position="0"> We compare the lexical substitution scoring methods using two evaluation measures, offering two different perspectives of evaluation.</Paragraph> <Paragraph position="1"> The first evaluation measure is motivated by simulating a decision step of a subtitling system, in which the best scoring lexical substitution is selected for each given sentence. Such a decision may correspond to a situation in which a single substitution suffices to obtain the desired compression rate, or it might be part of a more complex decision mechanism of the complete subtitling system. We thus measure the resulting accuracy of subtitles created by applying the best scoring substitution example for every original sentence. This provides a macro evaluation style, since we obtain a single judgment for each group of substitution examples that correspond to one original sentence.</Paragraph> <Paragraph position="2"> In our dataset 25.5% of the original sentences have no correct substitution examples, and for 15.5% of the sentences all substitution examples were annotated as correct. Accordingly, the (macro averaged) accuracy has a lower bound of 0.155 and an upper bound of 0.745.</Paragraph> <Paragraph position="3"> As a second evaluation measure we compare the average precision of each method over all the examples from all original sentences pooled together (a micro averaging approach). This measures the potential of a scoring method to ensure high precision for the high scoring examples and to filter out low-scoring incorrect substitutions.</Paragraph> <Paragraph position="4"> Average precision is a single-figure measure commonly used to evaluate a system's ranking ability (Voorhees and Harman, 1999). It is equivalent to the area under the uninterpolated recall-precision curve, defined as follows:

\text{AveragePrecision} = \frac{1}{\sum_{i=1}^{N} T(i)} \sum_{i=1}^{N} T(i) \cdot \frac{\sum_{k=1}^{i} T(k)}{i} \qquad (2)

where N is the number of examples in the test set (797 in our case), T(i) is the gold annotation (true=1, false=0) and i ranges over the examples ranked by decreasing score. An average precision of 1.0 means that the system assigned a higher score to all true examples than to any false one (perfect ranking). A lower bound of 0.26 on our test set corresponds to a system that ranks all false examples above the true ones.</Paragraph> </Section> </Section> </Paper>