File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/c02-1063_intro.xml
Size: 3,285 bytes
Last Modified: 2025-10-06 14:01:24
<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1063"> <Title>Hierarchical Orderings of Textual Units</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Text representation is a central task for approaches to text classification or categorization.</Paragraph> <Paragraph position="1"> They require a format that allows words, texts, and thematic categories to be semantically related. The majority of approaches to automatic learning from texts use the vector space or bag of words model. Although there is much research on alternative formats, whether phrase- or hyperonym-based, their effects seem to be small (Scott and Matwin, 1999). More seriously, Riloff (1995) argues that the bag of words model ignores morphological and syntactic information, which she found to be essential for solving some categorization tasks. An alternative to the vector space model is offered by semantic spaces, which have been proposed as a high-dimensional format for representing relations of semantic proximity. Relying on sparse knowledge resources, they have proved efficient in cognitive science (Kintsch, 1998; Landauer and Dumais, 1997), computational linguistics (Rieger, 1984; Schütze, 1998), and information retrieval. Although semantic spaces are an alternative to the vector space model, they leave unanswered the question of how to explore and visualize similarities of signs mapped onto them.</Paragraph> <Paragraph position="2"> When texts are represented as points in semantic space, this question refers to the exploration of their implicit, content-based relations. Several methods for solving this task have been proposed, ranging from simple lists via minimal spanning trees to cluster analysis as part of scatter/gather algorithms (Hearst and Pedersen, 1996). 
Representing a sign's environment in space by means of lists runs the risk of successively ordering semantically or thematically diverse units. Lists neglect the poly-hierarchical structure of semantic spaces, which may induce divergent thematic progressions starting from the same polysemous unit.</Paragraph> <Paragraph position="3"> Although clustering is an alternative to lists, it seeks a global, possibly nested partition in which clusters represent sets of objects that are indistinguishable with respect to the cluster criterion. In contrast, we present cohesion trees (CT) as a data structure in which single objects are hierarchically ordered on the basis of lexical cohesion. CTs, whose field of application is the management of search results in IR, shift the perspective from sets of clustered objects to cohesive paths of interlinked signs.</Paragraph> <Paragraph position="4"> The paper is organized as follows: the next section presents alternative text representation models as extensions of the semantic space approach. They are used in section (3) as a background for the discussion of cohesion trees. Both types of models, i.e. the text representation models and cohesion trees as a tool for hierarchically traversing semantic spaces, are evaluated in section (4). Finally, section (5) gives some conclusions and outlines prospects for future work.</Paragraph> </Section> </Paper>