<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0625">
  <Title>Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning</Title>
  <Section position="4" start_page="203" end_page="204" type="relat">
    <SectionTitle>
3 Related Work
</SectionTitle>
    <Paragraph position="0"> Although there is related empirical research on determining text similarity, primarily in the information retrieval community, there are two major differences between the goals of this earlier work and the problem we address in this paper.
(a) An OH-58 helicopter, carrying a crew of two, was on a routine training orientation when contact was lost at about 11:30 a.m. Saturday (9:30 p.m. EST Friday).
(b) &quot;There were two people on board,&quot; said Bacon. &quot;We lost radar contact with the helicopter about 9:15 EST (0215 GMT).&quot;
(c) An OH-58 U.S. military scout helicopter made an emergency landing in North Korea at about 9.15 p.m. EST Friday (0215 GMT Saturday), the Defense De [...] corpus, topic 11).</Paragraph>
    <Paragraph position="1"> First, the notion of similarity as defined in the previous section is more restrictive than the traditional definition of similarity [Anderberg 1973; Willet 1988]. Standard notions of similarity generally involve creating a vector or profile of characteristics of a text fragment and then computing, on the basis of frequencies, the distance between vectors to determine conceptual distance [Salton and Buckley 1988; Salton 1989]. Features typically include stemmed words, although sometimes multi-word units and collocations have been used [Smeaton 1992], as well as typological characteristics, such as thesaural features. The distance between the vector for one text (usually a query) and the vector for another (usually a document) then determines closeness or similarity [van Rijsbergen 1979]. In some cases, the texts are represented as vectors of sparse n-grams of word occurrences and learning is applied over those vectors [Schapire and Singer 1999]. However, since our definition of similarity is oriented to the small-segment goal, we make more fine-grained distinctions. Thus, a set of passages that would probably go into the same class by standard IR criteria would be further separated by our methods. Second, we have developed a method that functions over pairs of small units of text, so the size of the input text to be compared is different. This differs from document-to-document or query-to-document comparison. A closely related problem is that of matching a query to the relevant segment from a longer document [Callan 1994; Kaszkiel and Zobel 1998], which primarily involves determining which segment of a longer document is relevant to a query, whereas our focus is on which segments are similar to each other. In both cases, we have less data to compare, and thus have to explore additional or more informative indicators of similarity.</Paragraph>
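    <Paragraph position="2"> The standard vector-based notion of similarity described above can be sketched as follows. This is a minimal illustration only, not the method of any of the cited works: it assumes whitespace tokenization without stemming, raw term frequencies as the features, and cosine similarity as the closeness measure. The names term_vector and cosine_similarity are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words frequency vector from lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Assumed example inputs: a short query and a longer document-like text.
query = term_vector("helicopter emergency landing")
doc = term_vector("the helicopter made an emergency landing in North Korea")
score = cosine_similarity(query, doc)
```

A measure of this kind rewards any overlap in vocabulary, which is why passages on the same topic tend to fall into the same class under standard IR criteria even when they do not describe the same facts.</Paragraph>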
  </Section>
</Paper>