<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2026"> <Title>Lexical Chains and Sliding Locality Windows in Content-based Text Similarity Detection</Title> <Section position="4" start_page="150" end_page="151" type="metho"> <SectionTitle> 3 Lexical Chains in Content Similarity Detection </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="150" end_page="150" type="sub_section"> <SectionTitle> 3.1 Corpus </SectionTitle> <Paragraph position="0"> The experiments in this paper were performed on a corpus consisting of chapters from translations of four books (Table 1) that cover a variety of topics. Many of the chapters from each book deal with similar topics; therefore, fine-grained content analysis is required to identify chapters that are derived from the same original chapter.</Paragraph> </Section> <Section position="2" start_page="150" end_page="151" type="sub_section"> <SectionTitle> 3.2 Computing Lexical Chains </SectionTitle> <Paragraph position="0"> Our approach to calculating lexical chains uses the nouns, verbs, and adjectives present in WordNet 2.0. We first extract such words from each chapter in the corpus and represent each chapter as a set of these word instances {I_1, ..., I_N}. Each word instance I_n has a set of possible interpretations in WordNet: either the synsets of the instance or their hypernyms. Given these interpretations, we apply a slightly modified version of the algorithm by Silber and McCoy [7] to automatically disambiguate nouns, verbs, and adjectives, i.e., to select the correct interpretation for each instance. Silber and McCoy's algorithm computes all of the scored metachains for all senses of each word in the document and attributes the word to the metachain to which it contributes the most. During this process, the algorithm computes the contribution of a word to a given chain by considering 1) the semantic relations between the synsets of the words that are members of the same metachain, and 2) the distance between their respective instances in the discourse. Our approach uses these two parameters, with a minor modification: Silber and McCoy measure distance in terms of paragraphs, which suits prose; we measure distance in terms of sentences in order to handle both dialogue and prose.</Paragraph> <Paragraph position="1"> [Figure 1: A sample text after eliminating words that are not nouns, verbs, or adjectives and after identifying lexical chains (represented by WordNet synset IDs). Note that kitchen and bathroom are represented by the same synset ID, which corresponds to the synset ID of their common hypernym &quot;room&quot;; {kitchen, bathroom} is a lexical chain. Ties are broken in favor of hypernyms.]</Paragraph> <Paragraph position="2"> Following Silber and McCoy, we allow different types of conceptual relations to contribute differently to each lexical chain, i.e., the contribution of each word to a lexical chain depends on its semantic relation to the chain (see Table 2). After scoring, the concepts that are dominant in the text segment are identified, and each word is represented by only the WordNet ID of the synset (or the hypernym/hyponym set) that best fits its local context. Figure 1 gives an example of the resulting intermediate representation: the interpretation S_n found for each word instance I_n is used to represent each chapter as C = {S_1, ..., S_N}.</Paragraph>
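To make this step concrete, the following Python sketch (using NLTK's WordNet interface) assigns each content word to the candidate synset, or direct hypernym, that groups it with the most nearby instances. It is a heavy simplification of Silber and McCoy's metachain scoring: the relation-specific weights of Table 2 are omitted, and the distance decay over sentences is an illustrative placeholder, not the paper's formula.

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def interpretations(word, pos):
    """Candidate interpretations of a word: its synsets plus their
    direct hypernyms (ties are broken in favor of hypernyms)."""
    cands = set()
    for syn in wn.synsets(word, pos=pos):
        cands.add(syn)
        cands.update(syn.hypernyms())
    return cands

def build_chains(sentences):
    """sentences: list of sentences, each a list of (word, pos) pairs with
    pos in {wn.NOUN, wn.VERB, wn.ADJ}. Returns one synset ID per word
    instance -- the lexical chain label used in the intermediate
    representation."""
    instances = [(w, p, i) for i, sent in enumerate(sentences)
                 for (w, p) in sent]
    # For every candidate synset, record which instances could join it.
    members = defaultdict(list)
    for k, (w, p, i) in enumerate(instances):
        for syn in interpretations(w, p):
            members[syn].append((k, i))
    labels = []
    for k, (w, p, i) in enumerate(instances):
        best, best_score = None, -1.0
        for syn in interpretations(w, p):
            # Contribution decays with distance measured in sentences,
            # the modification made to handle dialogue as well as prose.
            score = sum(1.0 / (1 + abs(i - j))
                        for (k2, j) in members[syn] if k2 != k)
            if score > best_score:
                best, best_score = syn, score
        labels.append(best.name() if best else None)
    return labels
```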
</Section> <Section position="3" start_page="151" end_page="151" type="sub_section"> <SectionTitle> 3.3 Determining the Locality Window </SectionTitle> <Paragraph position="0"> After computing the lexical chains, we created a representation of each text by substituting the corresponding lexical chain for each noun, verb, and adjective in each document. We omitted the remaining parts of speech from the documents (see Figure 1 for a sample intermediate representation). We then obtained ordered and unordered n-grams of lexical chains from this representation.</Paragraph> <Paragraph position="1"> Ordered n-grams consist of n consecutive lexical chains extracted from the text; they preserve the original order of the lexical chains in the text. The corresponding unordered n-grams disregard this order, so the resulting representation may be sorted or unsorted, depending on the selected method. N-grams are extracted from the text using sliding locality windows and provide what we call &quot;attribute vectors&quot;. The attribute vector for ordered n-grams has the form C = {(e_1, ..., e_n), (e_2, ..., e_{n+1}), ..., (e_{m-n+1}, ..., e_m)}, where e_m is the last lexical chain in the chapter. For unordered n-grams, the attribute vector has the form C = {sort[(e_1, ..., e_n)], sort[(e_2, ..., e_{n+1})], ..., sort[(e_{m-n+1}, ..., e_m)]}, where sort[...] indicates alphabetical sorting of the chains (rather than the actual order in which the chains appear in the text).</Paragraph> <Paragraph position="2"> We evaluated similarity between pairs of book chapters using the cosine of the attribute vectors of n-grams of lexical chains (sliding locality windows of width n). We varied the width of the sliding locality windows from two to five elements.</Paragraph> </Section> </Section>
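As a concrete illustration of these attribute vectors, the sketch below builds ordered and unordered n-gram vectors with a sliding locality window and compares them with the cosine measure used in the evaluation. The function names are ours, not the paper's.

```python
from collections import Counter
from math import sqrt

def ngram_vector(chains, n, ordered=True):
    """Slide a locality window of width n over a chapter's sequence of
    lexical chain labels and count the resulting n-grams."""
    grams = [tuple(chains[i:i + n]) for i in range(len(chains) - n + 1)]
    if not ordered:
        # Unordered variant: alphabetical sorting inside each window.
        grams = [tuple(sorted(g)) for g in grams]
    return Counter(grams)

def cosine(u, v):
    """Cosine of the angle between two sparse attribute vectors."""
    dot = sum(c * v[g] for g, c in u.items() if g in v)
    norm = sqrt(sum(c * c for c in u.values())) * \
           sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Usage: similarity of two chapters under unordered tri-grams
# (a sliding locality window of width 3):
# sim = cosine(ngram_vector(chapter1, 3, ordered=False),
#              ngram_vector(chapter2, 3, ordered=False))
```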
<Section position="5" start_page="151" end_page="153" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> We used cosine similarity as the distance metric: we computed the cosine of the angle between the attribute vectors of each pair of documents in the corpus and ranked the pairs by this score. We identified the top n most similar pairs (also referred to as the &quot;selection level of n&quot;) and considered them to be similar in content.</Paragraph> <Paragraph position="1"> We calculated similarity between pairs of documents in several different ways, evaluated these approaches with the standard information retrieval measures, i.e., precision, recall, and f-measure, and compared our results with two baselines. The first baseline measured the similarity of documents with tf*idf-weighted keywords; the second used the cosine of unweighted lexical chains (unigrams of lexical chains).</Paragraph> <Paragraph position="2"> The corpus of parallel translations provides data that can be used as ground truth for content similarity: corresponding chapters from different translations of the same original title are considered similar in content, e.g., chapter 1 of translation 1 of Madame Bovary is similar in content to chapter 1 of translation 2 of Madame Bovary.</Paragraph> <Paragraph position="3"> Figure 2 shows the f-measure of the different methods for measuring similarity between pairs of chapters using ordered lexical chains, unordered lexical chains, and the baselines. These graphs present the results when the top 100 to 1,600 most similar pairs in the corpus are considered similar in content and the rest are considered dissimilar (selection levels of 100 to 1,600). The total number of chapter pairs is approximately 1,000,000. Of these, 1,080 pairs (475 unique chapters with 2 or 3 translations each) are considered similar for evaluation purposes.</Paragraph> <Paragraph position="4"> The results indicate that four similarity measures gave the best performance: tri-grams, quadri-grams, penta-grams, and hexa-grams of unordered lexical chains. The peak f-measure, at the selection level of 1,100 chapter pairs, was 0.981. Chi-squared tests performed on the f-measures (when the top 1,100 pairs were considered similar) were significant at p = 0.001.</Paragraph> <Paragraph position="5"> Closer analysis of the graphs in Figure 2 shows that, at the optimal selection level, n-grams of ordered lexical chains of length greater than four significantly outperformed the baseline at p = 0.001, while n-grams of ordered lexical chains of length less than or equal to four were significantly outperformed by the baseline at the same p. A similar observation cannot be made for the n-grams of unordered lexical chains; for these n-grams, the performance degradation appears at n = 7, i.e., the corresponding curves have a steeper negative incline than the baseline. After the cut-off point of 1,100 chapter pairs, the performance of all algorithms declines. This is due to the evaluation method we have chosen: although the cut-off for the similarity judgement can be increased, the number of chapter pairs that are in fact similar does not change, so at high cut-off values many dissimilar pairs are considered similar, degrading performance.</Paragraph> <Paragraph position="6"> Figures 2a and 2b show that some of the lexical chain representations do not outperform the tf*idf-weighted baseline. A comparison of Figures 2a and 2b shows that, for n < 5, n-grams of ordered lexical chains perform worse than n-grams of unordered lexical chains. This indicates that, between different translations of the same book, the order of the chains changes significantly, but the chains within contiguous regions (locality windows) of the texts remain similar.</Paragraph> <Paragraph position="7"> Interestingly, ordered n-grams of length 3 to 5 perform significantly better than unordered n-grams of the same length. This implies that, during translation, the order of the content words does not change enormously within spans of three to five lexical chain elements; allowing flexible order for the lexical chains (i.e., unordered lexical chains) in these n-grams therefore hurts performance by admitting many false positives. For longer n-grams to be successful, however, the order of the lexical chains has to be flexible.</Paragraph> </Section>
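The selection-level evaluation described above can be sketched as follows. This is a hypothetical re-implementation using the standard definitions of precision, recall, and f-measure, not the authors' code.

```python
def evaluate(scored_pairs, gold_pairs, selection_level):
    """scored_pairs: [((doc_a, doc_b), cosine_score), ...] for all chapter
    pairs; gold_pairs: set of frozenset({doc_a, doc_b}) pairs known to be
    parallel translations. The top `selection_level` ranked pairs are
    predicted similar; the rest are predicted dissimilar."""
    ranked = sorted(scored_pairs, key=lambda x: x[1], reverse=True)
    predicted = {frozenset(pair) for pair, _ in ranked[:selection_level]}
    tp = len(predicted & gold_pairs)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold_pairs)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```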
<Section position="6" start_page="153" end_page="153" type="metho"> <SectionTitle> 5 Future Work </SectionTitle> <Paragraph position="0"> Currently, our similarity measures do not employ any weighting scheme for n-grams, i.e., every n-gram is given the same weight. For example, the n-gram &quot;be it as it has been&quot; in lexical chain form corresponds to the synsets for the words be, have, and be; the trigram of these lexical chains does not convey significant meaning. On the other hand, the n-gram &quot;the lawyer signed the heritage&quot; is converted into the trigram of the lexical chains of lawyer, sign, and heritage. This trigram is more meaningful than the trigram be have be, but in our scheme both trigrams receive the same weight. As a result, two documents that share the trigram be have be will look as similar as two documents that share the trigram lawyer sign heritage. This problem can be addressed in two ways: using a 'stop word' list to filter such expressions completely, or giving different weights to n-grams based on the number of their occurrences in the corpus.</Paragraph> </Section>
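The second option could look like the sketch below: an idf-style weighting that down-weights n-grams occurring in many documents. This is our illustration of the proposal; the paper does not specify a weighting formula.

```python
from collections import Counter
from math import log

def ngram_idf(doc_vectors):
    """doc_vectors: one Counter of n-grams per document. Returns an
    idf-style weight per n-gram: frequent, uninformative n-grams such as
    (be, have, be) occur in many documents and get weights near zero."""
    n_docs = len(doc_vectors)
    df = Counter()
    for vec in doc_vectors:
        df.update(set(vec))  # document frequency: count each doc once
    return {g: log(n_docs / df[g]) for g in df}

def reweight(vec, idf):
    """Scale a document's n-gram counts by idf before the cosine step."""
    return {g: c * idf.get(g, 0.0) for g, c in vec.items()}
```
</Paper>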