
<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1213">
  <Title>Automatically generating hypertext in newspaper articles by computing semantic relatedness</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Lexical chains
</SectionTitle>
    <Paragraph position="0"> A lexical chain (Morris and Hirst, 1991) is a sequence of semantically related words in a text. For example, ifa text contained the words apple and fruit, they would appear in a chain together, since apple is a kind of fruit. Each word in a text may appear in only one chain, but a document will contain many chains, each of which captures a portion of the cohesive structure of the document. Cohesion Green 101 Automatically generating hypertext Stephen J. Green (1998) Automatically generating hypertext in newspaper articles by computing semantic relatedness. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language is what, as Halliday and Hasan (1976) put it, helps a text &amp;quot;hang together as a whole&amp;quot;. The lexical chains contained in a text will tend to delineate the parts of the text that are &amp;quot;about&amp;quot; the same thing. Morris and Hirst showed that the organization of the lexical chains in a document mirrors, in some sense, the discourse structure of that document.</Paragraph>
    <Paragraph position="1"> The lexical chains in a text can be identified using any lexical resource that relates words by their meaning. Our current lexical chainer (based on the one described by St-Onge, 1995) uses the WordNet database (Beckwith et al., 199 I). The WordNet database is composed of synonym sets or synsets. Each synset contains one or more words that have the same meaning. A word may appear in many synsets, depending on the number of senses that it has.</Paragraph>
    <Paragraph position="2"> Synsets can be connected to each other by several different types of links that indicate different relations. For example, two synsets can be connected by a hypernym link, which indicates that the words in the source synset are instances of the words in the target synset.</Paragraph>
    <Paragraph position="3"> For the purposes of chaining, each type of link between WordNet synsets is assigned a direction of up, down, or horizontal. Upward links correspond to generalization: for example, an upward link from apple to fruit indicates that fruit is more general than apple. Downward links correspond to specialization: for example, a link from fruit to apple would have a downward direction. Horizontal links are very specific specializations. For example, the antonymy relation in WordNet is given a direction of horizontal, since it specializes the sense of a word very accurately, that is, if a word and its antonym appear in a text, the two words are very likely being used in the senses that are antonyms.</Paragraph>
    <Paragraph position="4"> Given these types of links, three kinds of relations are built between words: Extra strong An exwa strong relation is said to exist between repetitions of the same word: i.e., term repetition. null Strong A strong relation is said to exist between words that are in the same WordNet synset (i.e., words that are synonymous). Strong relations are also said to exist between words that have synsets connected by a single horizontal link or words that have synsets connected by a single IS-A or INCLUDES relation.</Paragraph>
    <Paragraph position="5"> Regular A regular relation is said&amp;quot; to exist between two words when there is at least one allowable path between a synset containing the first word and a synset containing the second word in the WordNet database. A path is allowable if it is short (less than n links, where n is typically 3 or 4) and adheres to  three rules: 1. No other direction may precede an upward link.</Paragraph>
    <Paragraph position="6"> 2. No more than one change of direction is allowed. null 3. A horizontal link may be used to move from  an upward to a downward direction.</Paragraph>
    <Paragraph position="7"> When a word is processed during chaining, it is initially associated with all of the synsets of which it is a member. When the word is added to a chain, the chainer attempts to find connections between the synsets associated with the new word and the synsets associated with words that are already in the chain. Synsets that can be connected are retained and all others are discarded.</Paragraph>
    <Paragraph position="8"> The result of this processing is that, as the chains are built, the words in the chains are progressively sensedisambiguated. When an article has been chained, a description of the chains contained in the document is written to a file. Table 1 shows some of the chains that were recovered from an article about the trend towards &amp;quot;virtual parenting&amp;quot; (Shellenbarger, 1995). In this table, the numbers in parentheses show the number of occurrences of a particular word.</Paragraph>
    <Paragraph position="9"> The process of lexical chaining is not perfect, but if we wish to process articles quickly, then we must accept some errors or at least bad decisions. In our sample article, for example, chain 1 is a conglomeration of words that would have better been separated into different chains. This is a side effect of the current implementation of the lexical chainer, but even with these difficulties, we are able to perform useful tasks. We expect to address some of these problems in subsequent versions of the chainer, hopefully with no loss in efficiency.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="22806" type="metho">
    <SectionTitle>
3 Building links within an article
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="22806" type="sub_section">
      <SectionTitle>
3.1 Analyzing the lexical chains
</SectionTitle>
      <Paragraph position="0"> Newspaper articles are written so that one may stop reading at the end of any paragraph and feel as though one has read a complete unit. For this reason, it is natural to choose to use paragraphs as the nodes in our hypertext.</Paragraph>
      <Paragraph position="1"> Table 1 showed the lexical chains recovered from a news article about the trend towards &amp;quot;virtual parenting&amp;quot;. Figure 1 shows the second and eighth paragraphs of this article with the words that participate in lexical chains tagged with their chain numbers. We will use this particular article to illustrate the process of building intra-article links. The first step in the process is to determine how important each chain is to each paragraph in an article. We judge the importance of a chain by calculating the fraction of the content words of the paragraph that are in that chain. We refer to this fraction as the density of that chain in that paragraph. The density of chain c in paragraph p, dc,p, is defined as:</Paragraph>
      <Paragraph position="3"/>
      <Paragraph position="5"/>
      <Paragraph position="7"> Although no one is pushing 12 virtual-reality headgear 16 as a substitute I for parents I, many I technical ad campaigns 13 are promoting cellular phones ~, faxes ~ , computers I and pagers to&amp;quot; l working I parents ! as a way of bridging separations 17 from their kids I . A recent promotion 13 by A T &amp; T and Residence 2 Inns 7 in the United States 6 , for example 3, suggests that business 3 travellers I with young j children use video 3 and audio tapes ~, voice 3 mail 3, videophones and E-mail to stay 3 connected, including kissing ~ the kids I good night 21 by phone 22.</Paragraph>
      <Paragraph position="8"> More advice 3 from advertisers t: Business s travellers I can dine with their kids t by speakerL phone or &amp;quot;tuck them in&amp;quot; by cordless phone z2. Separately, a management I0 newsletter 24 recommends faxing your child I when you have to break 17 a promise 3 to be home 2 or giving 12 a young I child I a beeper to make him feel ~ more secure when left &amp;quot;s alone.</Paragraph>
      <Paragraph position="9">  where wc,p is the number of words from chain c that appear in paragraph p and w v is the number of content words (i.e., words that are not stop words) in p. For example, if we consider paragraph two of our sample article, we see that there are 9 words from chain 1. We also note that there are 48 content words in the paragraph. So, in this case the density of chain 1 in paragraph I, dr,z, is 9 4-g = 0.19.</Paragraph>
      <Paragraph position="10"> The result of these calculations is that each paragraph in the article has associated with it a vector of chain densities, with an element for each of the chains in the article.</Paragraph>
      <Paragraph position="11"> Table 2 shows these chain density vectors for the chains shown in table I. Note that an empty element indicates a density of 0.</Paragraph>
    </Section>
    <Section position="2" start_page="22806" end_page="22806" type="sub_section">
      <SectionTitle>
3.2 Determining paragraph links
</SectionTitle>
      <Paragraph position="0"> As we said earlier, the parts of a document that are about the same thing, and therefore related, will tend to contain the same lexical chains. Given the chain density vectors that we described above, we need to develop a method to determine the similarity of the sets of chains contained in each paragraph. The second stage of paragraph linking, therefore, is to compute the similarity between the paragraphs of the article by computing the similarity between the chain density vectors representing them. We can compute these similarities using any one of 16 similarity coefficients that we have taken from Ellis et al. (1994).</Paragraph>
      <Paragraph position="1"> This similarity is computed for each pair of chain density vectors, giving us a symmetric p x p matrix of similaritie s, where p is the number of paragraphs in the article. From this matrix we can calculate the mean and the standard deviation of the paragraph similarities.</Paragraph>
      <Paragraph position="2"> The next step is to decide which paragraphs should be linked, on the basis of the similarities computed in the previous step. We make this decision by looking at how the similarity of two paragraphs compares to the mean paragraph similarity across the entire article. Each similarity between two paragraphs i and j, si,j, is converted</Paragraph>
      <Paragraph position="4"> to a z-score, zi,j. If two paragraphs are more similar than a threshold given in terms of a number of standard deviations, then a link is placed between them. The result is a symmetric adjacency matrix where a 1 indicates that a link should be placed between two paragraphs. Figure 3 shows the adjacency matrix that is produced when a z-score threshold of 1.0 is used to compute the links for our virtual parenting example.</Paragraph>
      <Paragraph position="5"> Once we have decided which paragraphs should be linked, we need to be able to produce a representation of the hypertext that can be used for browsing. In the current system, there are two ways to output the HTML representation of an article. The first simply displays all of the links that were computed during the last stage of the process described above. The second is more complicated, showing only some of the links. The idea is that links between physically adjacent paragraphs should be omitted so that they do not clutter the hypertext.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="22806" end_page="22806" type="metho">
    <SectionTitle>
4 Building links between articles
</SectionTitle>
    <Paragraph position="0"> While it is useful to be able to build links within articles, for a large scale hypertext, links also need to be placed between articles. You will recall from section 2 that the output of the lexical chainer is a list of chains, each chain consisting of one or more words. Each word in a chain has associated with it one or more synsets. These synsets indicate the sense of the word as it is being used in this chain. An example of the kind of output produced by the ehainer is shown in table 4, which shows a portion of the chains extracted from an article (Gadd, 1995b) about cuts in staffat children's aid societies due to a reduction in provincial grants. Table 5 shows a portion of another set of chains, this time from an article (Gadd, 1995a) describing the changes in child-protection agencies, due in part to budget cuts.</Paragraph>
    <Paragraph position="1"> It seems quite clear that these two articles are related, and that we would like to place a link from one to the other. It is also clear that the words in these two articles display both of the linguistic factors that affect IR performance, namely synonymy and polysemy. For example, the first set of chains contains the word abuse, while the second set contains the synonym maltreatment. Similarly, the first set of chains includes the word kid, while the second contains child. The word abuse in the first article has been disambiguated by the lexieal chainer into the &amp;quot;cruel or inhuman treatment&amp;quot; sense, as has the word maltreatment from the second article. We once again note that the lexieal chaining process is not perfect: for example, both texts contain the word abuse, but it has been d.isambiguated into different senses-- in the first article, it is meant in the sense of &amp;quot;ill-treatment&amp;quot;, while in the second it is meant in the sense of &amp;quot;verbal abuse&amp;quot;.</Paragraph>
    <Paragraph position="2"> Although the articles share a large number of words, by missing the synonyms or by making incorrect (or no) judgments about different senses, a traditional IR system might miss the relation between these documents or rank them as less related than they really are. Aside from the problems of synonymy and polysemy, we can see that there are also more-distant relations between the words of these two articles. For example, the second set of chains</Paragraph>
    <Paragraph position="4"> contains the word maltreatment while the first set contains the related word child abuse (a kind of maltreatment) as well as the repetition of child abuse.</Paragraph>
    <Paragraph position="5"> We can build these inter-article links by determining the similarity of the two sets of chains contained in two articles. In essence, we wish to perform a kind of cross-document chaining.</Paragraph>
    <Section position="1" start_page="22806" end_page="22806" type="sub_section">
      <SectionTitle>
4.1 Synset weight vectors
</SectionTitle>
      <Paragraph position="0"> We can represent each document in a database by two vectors. Each vector will have an element for each synset in WordNet. An element in the first vector will contain a weight based on the number of occurrences of that particular synset in the words of the chains contained in the document. An element in the second vector will contain a weight based on the number of occurrences of that particular synset when it is one link away from a synset associated with a word in the chains. We will call these vectors the member and linked synset vectors, or simply the member and linked vectors, respectively.</Paragraph>
      <Paragraph position="1"> The weight of a particular synset in a particular document is not based solely on the frequency of that synset in the document, but also on how frequently that term appears throughout the database. The synsets that are the most heavily weighted in a document are the ones that appear frequently in that document but infrequently in the entire database. The weights are calculated using the standard ff-idf weighting function: Wik =- sf ik&amp;quot; log(N/nk) ~/Y~= t (sf ij) 2. (log(N lnj) )2 where sfik is the frequency of synset k in document i, N is the size of the document collection, n, is the number of documents in the collection that contain synset k, and s is the number of synsets in all documents. Note that this equation incorporates the normalization of the synset weight vectors.</Paragraph>
      <Paragraph position="2"> The weights are calculated independently for the member and linked vectors. We do this because the linked vectors introduce a large number of synsets that do not necessarily appear in the original chains of an article, and should therefore not influence the frequency counts of the member synsets. Thus, we make a distinction between Green 105 Automatically generating hypertext strong links that occur due to synonymy, and strong links that occur due to IS-A or INCLUDES relations. The similarity between two documents, DI and/32, is then determined by calculating three cosine similarities:  1. The similarity of the member vectors of DI and/)2; 2. The similarity of the member vector of Dl and linked vector olD2; and 3. The similarity of the linked vector of Di and the member vector of D2.</Paragraph>
      <Paragraph position="3">  Clearly, the first similarity measure (the membermember similarity) is the most important, as it will capture extra-strong relations as well as strong relations between synonymous words. The last two measures (the member-linked similarities) are less important as they capture strong relations that occur between synsets that are one link away from each other. If we enforce a threshold on these measures of relatedness, then we ensure that there are several connections between two articles, since each element of the vectors will contribute only a small part of the overall similarity.</Paragraph>
    </Section>
    <Section position="2" start_page="22806" end_page="22806" type="sub_section">
      <SectionTitle>
4.2 Building inter-article links
</SectionTitle>
      <Paragraph position="0"> Once we have built a set of synset weight vectors for a collection of documents, the process of building links between articles is relatively simple. Given an article that we wish to build links from, we can compute the similarity between the article's symet weight vectors and the vectors of all other documents. Documents whose member vectors exceed a given threshold of similarity will have a link placed between them. Our preliminary work shows that a threshold of 0.15 will include most related documents while excluding many unrelated documents.</Paragraph>
      <Paragraph position="1"> This is almost exactly the methodology used in vector-space IR systems such as SMART, with the difference being that for each pair of documents we are calculating three separate similarity measures. The best way to cope with these multiple measurements seems to be to rank related documents by the sum of the three similarities.</Paragraph>
      <Paragraph position="2"> The sum of the three similarities can lie, theoretically, anywhere between 0 and 3. In practice, the sum is usually less than 1. For example, the average sum of the three similarities when running the vectors of a single article against 5,592 other articles is 0.039.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML