Lexical Chains and Sliding Locality Windows in Content-based Text Similarity Detection

2 Related Work

Lexical chains (LC) link lexical items that are conceptually related to each other, for example through hyponymy or synonymy relations. Such conceptual relations have previously been used in evaluating cohesion, e.g., by Halliday and Hasan [2, 3]. Barzilay and Elhadad [1] used lexical chains for text summarization: they identified important sentences in a document by retrieving strong chains. Silber and McCoy [7] extended this work with an algorithm that is linear in time and space and efficiently identifies lexical chains in large documents. Their algorithm first creates a text representation in the form of metachains, i.e., chains that capture all possible lexical chains in the document. It then applies a scoring algorithm to identify the lexical chains most relevant to the document, eliminates unnecessary overhead information from the metachains, and selects the lexical chains that represent the document. Our method for building lexical chains follows this algorithm.

N-gram based language models, i.e., models that divide text into n-word (or n-character) strings, are widely used in natural language processing. In plagiarism detection, the overlap of n-grams between two documents has been used to determine whether one document plagiarizes another [4]. In general, n-grams capture local relations; in our case, they capture local relations between lexical chains and between the concepts those chains represent.

Three main streams of research in content similarity detection are: 1) shallow, statistical analysis of documents, 2) analysis of rhetorical relations in texts [5], and 3) deep syntactic analysis [8]. Shallow methods include little linguistic information and provide only a rough model of content, while approaches based on syntactic analysis generally require significant computation. Our approach strikes a compromise between these two extremes: it uses WordNet as a source of low-cost linguistic information for building lexical chains that can help detect content similarity.
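The paper gives no code, but a rough Python sketch of the metachain idea behind Silber and McCoy's approach may help: every noun is tentatively attached to every chain it could join, and a score then selects the chains that best represent the document. The relatedness test (a noun's own senses plus their direct hypernyms) and the scoring by distinct members are illustrative simplifications assumed here, not the original algorithm, which relies on precomputed WordNet indices to remain linear in time and space.

```python
# Hedged sketch of metachain construction; requires NLTK's WordNet corpus
# (nltk.download("wordnet")). Not the algorithm of Silber and McCoy [7].
from collections import defaultdict
from nltk.corpus import wordnet as wn


def candidate_heads(noun):
    """Synsets a noun could chain under: its own senses plus their direct hypernyms."""
    heads = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        heads.add(synset)
        heads.update(synset.hypernyms())
    return heads


def build_metachains(nouns):
    """Attach every noun to every chain (head synset) it could belong to."""
    metachains = defaultdict(list)
    for noun in nouns:
        for head in candidate_heads(noun):
            metachains[head].append(noun)
    return metachains


def strong_chains(metachains, min_members=2):
    """Crude scoring: keep chains with enough distinct members, strongest first."""
    scored = [(len(set(members)), head, members)
              for head, members in metachains.items()
              if len(set(members)) >= min_members]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    nouns = ["car", "automobile", "truck", "bicycle", "road"]
    for score, head, members in strong_chains(build_metachains(nouns))[:3]:
        print(score, head.name(), members)
```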
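As an illustration of the n-gram overlap measure mentioned above (a generic baseline, not the specific method of [4]), the following sketch computes the fraction of one document's word n-grams that also occur in another:

```python
# Minimal word n-gram containment between two documents.
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams (as tuples) in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(doc_a: str, doc_b: str, n: int = 3) -> float:
    """Fraction of doc_a's n-grams that also occur in doc_b (0.0 to 1.0)."""
    grams_a = word_ngrams(doc_a, n)
    grams_b = word_ngrams(doc_b, n)
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "a quick brown fox jumps over a sleeping dog"
    print(ngram_containment(a, b, n=3))  # ~0.43: 3 of a's 7 trigrams also in b
```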
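For concreteness, here is a minimal sketch of the kind of low-cost WordNet lookup this relies on; the function name and the relatedness criterion are assumptions for illustration only. Two nouns are treated as related if they share a sense (synonymy) or if a sense of one lies on the hypernym path of the other (hyponymy/hypernymy).

```python
# Hedged illustration of a WordNet relatedness check, assuming NLTK's interface.
from nltk.corpus import wordnet as wn


def related_nouns(word_a: str, word_b: str) -> bool:
    """True if the two nouns are linked by synonymy or a hypernym path in WordNet."""
    senses_a = set(wn.synsets(word_a, pos=wn.NOUN))
    senses_b = set(wn.synsets(word_b, pos=wn.NOUN))
    if senses_a & senses_b:                      # shared sense: synonyms
        return True
    closure_a = {h for s in senses_a for h in s.closure(lambda x: x.hypernyms())}
    closure_b = {h for s in senses_b for h in s.closure(lambda x: x.hypernyms())}
    # a sense of one word lies on the other's hypernym path: hyponymy/hypernymy
    return bool(senses_a & closure_b) or bool(senses_b & closure_a)


if __name__ == "__main__":
    print(related_nouns("car", "vehicle"))   # True: car is a hyponym of vehicle
    print(related_nouns("car", "banana"))    # False: no link in the noun hierarchy
```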