File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/w06-0205_intro.xml
Size: 5,531 bytes
Last Modified: 2025-10-06 14:03:47
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-0205"> <Title>Automatic Knowledge Representation using a Graph-based Algorithm for Language-Independent Lexical Chaining</Title> <Section position="3" start_page="0" end_page="36" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Lexical Chains are powerful representations of documents compared to broadly used bag-of-words representations. In particular, they have successfully been used in the field of Automatic Text Summarization(BarzilayandElhadad, 1997). However, until now, Lexical Chaining algorithms have only been proposed for English as they rely on linguistic resources such as Thesauri (Morris and Hirst, 1991) or Ontologies (Barzilay and Elhadad, 1997; Hirst and St-Onge, 1997; Silber and McCoy, 2002; Galley and McKeown, 2003).</Paragraph> <Paragraph position="1"> Morris and Hirst (1991) were the first to propose the concept of Lexical Chains to explore the discourse structure of a text. However, at the time of writing their paper, no machine-readable thesaurus was available so they manually generated Lexical Chains using Roget's Thesaurus (Roget, 1852).</Paragraph> <Paragraph position="2"> A first computational model of Lexical Chains is introduced by Hirst and St-Onge (1997). Their biggest contribution to the study of Lexical Chains is the mapping of WordNet (Miller, 1995) relations and paths (transitive relationships) to (Morris and Hirst, 1991) word relationship types. However, their greedy algorithm does not use a part-of-speech tagger. Instead, the algorithm only selects those words that contain noun entries in WordNet to compute LexicalChains. But, asBarzilayandElhadad(1997) point at, the use of a part-of-speech tagger could eliminate wrong inclusions of words such as read, which has both noun and verb entries in WordNet.</Paragraph> <Paragraph position="3"> So, Barzilay and Elhadad (1997) propose the first dynamic method to compute Lexical Chains. They argue that the most appropriate sense of a word can only be chosen after examining all possible Lexical Chain combinations that can be generated from a text. Because all possible senses of the word are not taken into account, except at the time of insertion, potentially pertinent context information that is likely to appear after the word is lost. However, this method of retaining all possible interpretations until the end of the process, causes the exponential growth of the time and space complexity.</Paragraph> <Paragraph position="4"> As a consequence, Silber and McCoy (2002) propose a linear time version of (Barzilay and Elhadad, 1997) lexical chaining algorithm. In particular, (Silber and McCoy, 2002)'s implementation creates a structure, called meta-chains, that implicitly stores all chain interpretations without actually creating them, thus keeping both the space and time usage of the program linear.</Paragraph> <Paragraph position="5"> Finally, Galley and McKeown (2003) propose a chaining method that disambiguates nouns prior to the processing of Lexical Chains. Their evaluation shows that their algorithm is more accurate than (Barzilay and Elhadad, 1997) and (Silber and Mc-Coy, 2002) ones.</Paragraph> <Paragraph position="6"> One common point of all these works is that Lexical Chains are built using WordNet as the standard linguistic resource. Unfortunately, systems based on static linguistic knowledge bases are limited. First, such resources are difficult to find. 
<Paragraph position="6"> One common point of all these works is that Lexical Chains are built using WordNet as the standard linguistic resource. Unfortunately, systems based on static linguistic knowledge bases are limited. First, such resources are difficult to find. Second, they are largely obsolete by the time they become available.</Paragraph>
<Paragraph position="7"> Third, linguistic resources capture a particular form of lexical knowledge which is often very different from the sort needed to relate specific words or sentences. In particular, WordNet is missing many explicit links between intuitively related words.</Paragraph>
<Paragraph position="8"> Fellbaum (1998) refers to such obvious omissions in WordNet as the &quot;tennis problem&quot;: nouns such as nets, rackets and umpires are all present, but WordNet provides no links between these related tennis concepts.</Paragraph>
<Paragraph position="9"> In order to solve these problems, we propose to automatically construct, from a collection of documents, a lexico-semantic knowledge base whose purpose is to identify cohesive lexical relationships between words based on corpus evidence. This hierarchical lexico-semantic knowledge base is built with the Pole-Based Overlapping Clustering algorithm (Cleuziou et al., 2004), which clusters words with similar meanings and allows words with multiple meanings to belong to different clusters. The second step of the process automatically extracts Lexical Chains from texts based on our knowledge base. For that purpose, we propose a new greedy algorithm, which can be seen as an extension of the algorithms of Hirst and St-Onge (1997) and Barzilay and Elhadad (1997), that allows polysemous words to belong to different chains, thus breaking the &quot;one word/one concept per document&quot; paradigm (Gale et al., 1992)1. In particular, it implements Lin's (1998) information-theoretic definition of similarity as the relatedness criterion for the attribution of words to Lexical Chains2.</Paragraph>
1 This characteristic can be interesting for multi-topic documents such as web news stories. Indeed, in this case, there may be different topics in the same document as different news stories appear. In some way, it follows the idea of Krovetz (1998).
</Section>
</Paper>
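As a concrete illustration of such a relatedness criterion, the sketch below computes Lin's (1998) information-theoretic similarity between two words from the self-information of the descriptive features they share, and uses it as a threshold test for letting a word join a chain. The feature sets, counts, helper names (lin_similarity, may_join_chain) and the threshold value are illustrative assumptions only; they do not reproduce the paper's clustered knowledge base or its parameter settings.

    # Sketch of Lin's (1998) similarity, sim(A, B) = 2*I(common) / (I(A) + I(B)),
    # where I sums the self-information -log P(f) of the features describing a word.
    # All data below is toy data invented for illustration.
    import math

    def lin_similarity(features_a, features_b, feature_counts, total):
        """Information-theoretic similarity between two feature sets."""
        def info(features):
            return sum(-math.log(feature_counts[f] / total) for f in features)
        denominator = info(features_a) + info(features_b)
        return 2 * info(features_a & features_b) / denominator if denominator else 0.0

    def may_join_chain(word, chain, features, feature_counts, total, threshold=0.2):
        """A word may be attributed to a chain if it is sufficiently similar to
        some member; a polysemous word can pass this test for several chains."""
        return any(
            lin_similarity(features[word], features[m], feature_counts, total) >= threshold
            for m in chain
        )

    if __name__ == "__main__":
        feature_counts = {"sport": 4, "court": 2, "string": 1}   # toy corpus counts
        total = sum(feature_counts.values())
        features = {
            "racket": {"sport", "court", "string"},
            "net": {"sport", "court"},
            "umpire": {"sport"},
        }
        print(lin_similarity(features["racket"], features["net"], feature_counts, total))
        print(may_join_chain("umpire", ["racket", "net"], features, feature_counts, total))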