<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2153"> <Title>Hypertext Authoring for Linking Relevant Segments of Related Instruction Manuals</Title> <Section position="3" start_page="0" end_page="929" type="metho"> <SectionTitle> 2 Similarity Calculation </SectionTitle> <Paragraph position="0"> To decide automatically whether two segments should be linked, we need to calculate a semantic similarity between them. The best-known method of calculating similarity in IR is the vector space model based on tf * idf values. For idf, namely inverse document frequency, we use the segment instead of the document as the counting unit. The definition of idf in our system is the following.</Paragraph> <Paragraph position="1"> $\mathrm{idf}(t) = \log \frac{\#\text{ of segments in the manual}}{\#\text{ of segments in which } t \text{ occurs}} + 1$ Then a segment is described as a vector in a vector space. Each dimension of the vector space corresponds to a term used in the manual. The value of a vector in the dimension corresponding to the term t is the tf * idf value of t. The similarity of two segments is the cosine of the two vectors corresponding to these segments. The cosine similarity based on tf * idf serves as the baseline in the evaluation of the similarity measures we propose in the rest of this section.</Paragraph> <Paragraph position="2"> As the first expansion of the definition of tf * idf, we use the case information of each noun. In Japanese, case information is easily identified by case particles such as ga (nominative marker), o (accusative marker), and ni (dative marker), which are attached just after a noun. As the second expansion, we use not only nouns (with case information) but also verbs, because verbs give important information about the action a user performs in operating a system. As the third expansion, we use co-occurrence information of nouns and verbs in a sentence, because the combination of nouns and a verb gives an outline of what the sentence describes. 
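As an illustration of the baseline defined earlier in this section, the segment-level idf and the cosine measure could be sketched as follows. This is only a minimal sketch under stated assumptions: segments are taken to be already tokenized into lists of terms, the + 1 is read as being added outside the logarithm, and the function names segment_idf, tfidf_vector, and cosine are hypothetical names chosen for this example rather than names used by the system.

```python
import math
from collections import Counter

def segment_idf(segments):
    """idf(t) = log(N / number of segments containing t) + 1.

    Assumption: the + 1 is added outside the logarithm.
    """
    n = len(segments)
    # Count each term once per segment in which it occurs.
    seg_freq = Counter(t for seg in segments for t in set(seg))
    return {t: math.log(n / k) + 1.0 for t, k in seg_freq.items()}

def tfidf_vector(segment, idf):
    """Sparse tf * idf vector of one segment, stored as a dict."""
    tf = Counter(segment)
    return {t: f * idf[t] for t, f in tf.items()}

def cosine(u, v):
    """Cosine of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy segments (hypothetical data, already tokenized).
segments = [["tape", "insert", "tape"], ["tape", "eject"], ["timer", "set"]]
idf = segment_idf(segments)
vectors = [tfidf_vector(seg, idf) for seg in segments]
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Segments that share no term get a cosine of zero, so only pairs with overlapping vocabulary can become link candidates under this baseline.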
The problem at this point is how to reflect co-occurrence information in the tf * idf based vector space model. We investigate two methods for this, namely, 1. dimension expansion of the vector space, and 2. modification of the tf value within a segment.</Paragraph> <Paragraph position="3"> In the following, we describe these two methods in detail.</Paragraph> <Section position="1" start_page="929" end_page="929" type="sub_section"> <SectionTitle> 2.1 Dimension Expansion </SectionTitle> <Paragraph position="0"> This method adds extra dimensions to the vector space in order to express co-occurrence information. It is described more precisely as the following procedure.</Paragraph> <Paragraph position="1"> 1. Extract the case information (the case particle in Japanese) from each noun phrase. Extract the verb from each clause.</Paragraph> <Paragraph position="2"> 2. Suppose there are n noun phrases with a case particle in a clause. Enumerate every combination of 1 to n noun phrases with their case particles. Then we have $\sum_{k=1}^{n} {}_{n}C_{k} = 2^{n} - 1$ combinations.</Paragraph> <Paragraph position="3"> 3. Calculate tf * idf for every combination together with the corresponding verb, and use these as new extra dimensions of the original vector space. For example, consider the sentence &quot;An end user learns the programming language.&quot; Then, in addition to the dimensions corresponding to every noun phrase like &quot;end user&quot;, we introduce new dimensions corresponding to co-occurrence information such as:</Paragraph> <Paragraph position="5"> We calculate the tf * idf of each of these combinations and use it as the vector value in the corresponding dimension. 
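The enumeration in step 2 of the procedure above can be sketched as follows. This is a hedged illustration, not the system's actual code: the function name expanded_dimensions, the tuple representation of a dimension, and the case labels attached to the example clause are assumptions made for this example.

```python
from itertools import combinations

def expanded_dimensions(noun_case_pairs, verb):
    """One dimension key per non-empty subset of (noun, case) pairs, joined with the verb.

    The (subset, verb) tuple used as a key is a hypothetical representation.
    """
    dims = []
    for k in range(1, len(noun_case_pairs) + 1):
        for combo in combinations(noun_case_pairs, k):
            dims.append((combo, verb))
    return dims

# Clause taken from the example sentence; the case labels are assumed here.
clause = [("end user", "NOMINAL"), ("programming language", "ACCUSATIVE")]
for dim in expanded_dimensions(clause, "learn"):
    print(dim)
```

With n noun phrases the function returns 2^n - 1 keys, so the number of extra dimensions grows quickly with clause length; each key would then receive its own tf * idf value in the expanded space.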
The similarity calculation based on the cosine measure is done on this expanded vector space.</Paragraph> </Section> <Section position="2" start_page="929" end_page="929" type="sub_section"> <SectionTitle> 2.2 Modification of tf value </SectionTitle> <Paragraph position="0"> The other method we propose for reflecting co-occurrence information in similarity is modification of the tf value within a segment. Takaki and Kitani (1996) report that co-occurrence of word pairs contributes to IR performance for Japanese newspaper articles.</Paragraph> <Paragraph position="1"> In our method, we modify the tf of pairs of co-occurring words that occur in both of two segments, say $d_A$ and $d_B$, in the following way. Suppose that a term $t_k$, namely a noun or a verb, occurs f times in the segment $d_A$. Then the modified $tf'(d_A, t_k)$ is defined by the following formula.</Paragraph> <Paragraph position="3"> where cw and cw' are scores of importance for the co-occurrence of the words $t_k$ and $t_c$. Intuitively, cw and cw' are counterparts of tf * idf for co-occurrence of words and for co-occurrence of (noun, case information) pairs, respectively. cw is defined by the following formula.</Paragraph> <Paragraph position="4"> $cw(d_A, t_k, p, t_c)$</Paragraph> <Paragraph position="6"> where $\alpha(d_A, t_k, p, t_c)$ is a function expressing how near $t_k$ and $t_c$ occur, p denotes the pth occurrence of $t_k$ in the segment $d_A$, and $\beta(t_k, t_c)$ is a normalized frequency of co-occurrence of $t_k$ and $t_c$. 
Each of them is defined as follows.</Paragraph> <Paragraph position="7"> $\alpha(d_A, t_k, p, t_c) = d(d_A, t_k, p) - \mathrm{dist}(d_A, t_k, p, t_c)$</Paragraph> <Paragraph position="9"> where the function $\mathrm{dist}(d_A, t_k, p, t_c)$ is the distance, counted in words, between the pth $t_k$ within $d_A$ and $t_c$.</Paragraph> <Paragraph position="10"> $d(d_A, t_k, p)$ is the threshold of distance within which two words are regarded as co-occurring.</Paragraph> <Paragraph position="11"> Since our system focuses only on co-occurrences within a sentence, $\alpha(d_A, t_k, p, t_c)$ is calculated for pairs of word occurrences within a sentence. As a result, $d(d_A, t_k, p)$ is the number of words in the sentence in question. $atf(t_k)$ is the total number of occurrences of $t_k$ within the manual we deal with.</Paragraph> <Paragraph position="12"> $rtf(t_k, t_c)$ is the total number of co-occurrences of $t_k$ and $t_c$ within a sentence. $\gamma(t_k, t_c)$ is the inverse document frequency (in this case, &quot;inverse segment frequency&quot;) of $t_c$ co-occurring with $t_k$, and is defined as follows.</Paragraph> <Paragraph position="13"> $\gamma(t_k, t_c) = \log\left(\frac{N}{df(t_c)}\right)$ where N is the number of segments in a manual, and $df(t_c)$ is the number of segments in which $t_c$ occurs with $t_k$.</Paragraph> <Paragraph position="14"> $M(d_A)$ is the length of the segment $d_A$ counted in morphological units, and is used to normalize cw. C is a weight parameter for cw. We adopt the value of C that optimizes the 11-point precision, as described later.</Paragraph> <Paragraph position="15"> The other modification factor cw' is defined in almost the same way as cw. The difference is that cw is calculated for each noun, whereas cw' is calculated for each combination of a noun and its case information. Therefore, cw' is calculated for each (noun, case) pair like (user, NOMINAL). 
In other words, in the calculation of cw', two pairs ( noun-1, case-1 ) and ( noun-2, case-2 ), like (user, NOMINAL) and (program, ACCUSATIVE), are regarded as a co-occurrence only when they occur within the same sentence.</Paragraph> <Paragraph position="16"> Now that we have defined cw and cw', we return to the formula that defines tf'. In the definition of tf', $T_c(t_k, d_A, d_B)$ is the set of words that occur in both $d_A$ and $d_B$. The values of cw and cw' are summed over all occurrences of $t_k$ in $d_A$. Namely, we add up all the cw and cw' values whose $t_c$ is included in $T_c(t_k, d_A, d_B)$ to calculate tf'.</Paragraph> </Section> </Section> <Section position="4" start_page="929" end_page="932" type="metho"> <SectionTitle> 3 Implementation and Experimental </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="929" end_page="929" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> Our system has the following input and output.</Paragraph> <Paragraph position="1"> The input is an electronic manual text, which can be written in plain text, LaTeX, or HTML. The output is a hypertext in HTML format.</Paragraph> <Paragraph position="2"> A browser such as Netscape that can display text written in HTML is needed. Our system consists of four sub-systems shown in Figure 1.</Paragraph> <Paragraph position="3"> Keyword Extraction Sub-System: In this sub-system, a morphological analyzer segments the input text and extracts all nouns and verbs, which are to be keywords. We use Chasen 1.04b (Matsumoto et al., 1996) as the morphological analyzer for Japanese texts. Noun and case-information pairs are also made in this sub-system. 
If the dimension expansion described in 2.1 is used, the new dimensions are introduced here.</Paragraph> <Paragraph position="4"> tf * idf Calculation Sub-System: This sub-system calculates the tf * idf of the keywords extracted by the Keyword Extraction Sub-System.</Paragraph> </Section> <Section position="2" start_page="929" end_page="932" type="sub_section"> <SectionTitle> Similarity Calculation Sub-System </SectionTitle> <Paragraph position="0"> This sub-system calculates the similarity, represented by the cosine, of every pair of segments based on the tf * idf values calculated above. If the modification of tf values described in 2.2 is used, the modified tf, namely tf', is calculated in this sub-system. Hypertext Generator: This sub-system translates the given input text into a hypertext in which pairs of segments having high similarity, that is, a high cosine value, are linked. The similarity of each pair is associated with its link for the user-friendly display described in the following. In the example display on a browser, the screen is divided into four parts. The upper left and upper right parts each show a distinct part of the manual text. In the lower left (right) part, the titles of the segments that are relevant to the segment displayed in the upper left (right) part are displayed in descending order of similarity. Since these titles are linked to the corresponding segment texts, if we click one of them in the lower left (right) part, the text of the hyperlinked segment is instantly displayed in the upper right (left) part, and the titles of its relevant segments are displayed in the lower right (left) part. 
By browsing along the links displayed in the lower parts, a user who wants relevant information about what she/he is reading in the text displayed in an upper part can easily access the segments in which that information is likely to be written.</Paragraph> <Paragraph position="1"> Now we describe the evaluation of our proposed methods with recall and precision, defined as follows. $\mathrm{recall} = \frac{\#\text{ of retrieved pairs of relevant segments}}{\#\text{ of pairs of relevant segments}}$, $\mathrm{precision} = \frac{\#\text{ of retrieved pairs of relevant segments}}{\#\text{ of retrieved pairs of segments}}$ The first experiment is done on a large manual of APPGALLARY (Hitachi, 1995), which is 2.5 MB in size. This manual is divided into two volumes. One is a tutorial manual for novices that contains 65 segments. The other is a help manual for advanced users that contains 2479 segments. If we try to find the relevant segments between those in the tutorial manual and those in the help manual, the number of possible pairs of segments is 65 * 2479 = 161135. This number is too large for humans to extract all relevant segments manually. We therefore examine the 200 highest-ranked pairs of segments by hand (the work was done by two students in the engineering department of our university) to extract pairs of relevant segments. [Table 1: number of all pairs and number of relevant pairs for each pair of manuals. A-B: 1056 pairs, 65 relevant. A-C: 896 pairs, 60 relevant. B-C: 924 pairs, 47 relevant.] The guideline for selecting pairs of relevant segments is: 1. Two segments explain the same operation or the same terminology.</Paragraph> <Paragraph position="2"> 2. One segment explains an abstract concept and the other explains that concept in terms of concrete operations. Figure 3 shows the recall and precision for numbers of selected pairs of segments, where those pairs are sorted in descending order of cosine similarity using normal tf * idf of all nouns. This result indicates that pairs of relevant segments are concentrated in the high-similarity area. 
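The recall and precision defined above, measured over the top-ranked pairs of a similarity-sorted list, could be computed along the following lines. This is a minimal sketch under assumptions: pairs are represented as tuples of segment identifiers, the relevant pairs are given as a set of human judgments, and the function name recall_precision_at and the toy identifiers are hypothetical.

```python
def recall_precision_at(ranked_pairs, relevant_pairs, cutoff):
    """Recall and precision over the first `cutoff` pairs of a ranked list.

    ranked_pairs: segment pairs sorted by descending similarity.
    relevant_pairs: set of pairs judged relevant by hand.
    """
    retrieved = ranked_pairs[:cutoff]
    hits = sum(1 for pair in retrieved if pair in relevant_pairs)
    recall = hits / len(relevant_pairs) if relevant_pairs else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Toy data with hypothetical segment identifiers.
ranked = [("t1", "h3"), ("t1", "h9"), ("t2", "h3"), ("t4", "h7")]
relevant = {("t1", "h3"), ("t2", "h3"), ("t5", "h1")}
for cutoff in (1, 2, 4):
    print(cutoff, recall_precision_at(ranked, relevant, cutoff))
```

Sweeping the cutoff over the ranked list yields one recall and precision point per number of selected pairs, which is the kind of curve the text describes.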
In fact, the pairs of segments within the top 200 pairs are almost all relevant ones.</Paragraph> <Paragraph position="3"> The second experiment is done on three small manuals for three models of video cassette recorder (MITSUBISHI, 1995c; MITSUBISHI, 1995a; MITSUBISHI, 1995b) produced by the same company. We examine all pairs of segments that come from distinct manuals, and extract relevant pairs of segments according to the same guideline as in the first experiment; the work was again done by two students of the engineering department of our university. The numbers of segments are 32 for manual A (MITSUBISHI, 1995c), 33 for manual B (MITSUBISHI, 1995a), and 28 for manual C (MITSUBISHI, 1995b). The numbers of relevant pairs of segments are shown in Table 1. We show the 11-point precision averages for these methods in Table 2. Each recall-precision curve, namely Keyword, dimension N, cw+cw' tf, and Normal Query, corresponds to one of the methods described in the previous section. We define each more precisely in the following.</Paragraph> <Paragraph position="4"> Keyword: Using tf * idf for all nouns and verbs occurring in a pair of manuals. This is the baseline data.</Paragraph> <Paragraph position="5"> dimension N: The dimension expansion method described in section 2.1. In this experiment, we use only noun-noun co-occurrences.</Paragraph> <Paragraph position="6"> cw+cw' tf: The modification of tf value method described in section 2.2. In this experiment, we use only noun-verb co-occurrences.</Paragraph> <Paragraph position="7"> Normal Query: This is the same as Keyword except that the vector values for one manual are all set to 0 or 1, and the vector values for the other manual are tf * idf.</Paragraph> <Paragraph position="8"> In the rest of this section, we consider the results shown above point by point.</Paragraph> <Paragraph position="9"> The effect of using tf * idf information of both segments. We consider the effect of using tf *
idf for both of the segments whose similarity we calculate. For comparison, we ran the Normal Query experiment, where tf * idf is used as the vector value for one segment and 1 or 0 is used as the vector value for the other segment. This is the typical situation in IR. In our system, we calculate the similarity of two segments that are already given, which makes it possible to use tf * idf for both segments. As shown in Table 2, Keyword outperforms Normal Query.</Paragraph> <Paragraph position="10"> The effect of using co-occurrence information. The same types of operation are generally described in relevant segments, and the same type of operation very probably consists of the same action and equipment. This is why using co-occurrence information in the similarity calculation magnifies similarities between relevant segments. Comparing dimension expansion and modification of tf, the latter outperforms the former in precision at almost all recall rates. The modification of tf value method also shows better results than dimension expansion in the 11-point precision averages shown in Table 2 for the A-C and B-C manual pairs. As for the normalization factor C of the modification of tf value method, the smaller C becomes, the less the tf value changes and the more similar the result becomes to the baseline case in which only tf is used. Conversely, the bigger C becomes, the more incorrect pairs get high similarity, and the precision deteriorates in the low-recall area. As a result, there is an optimum value of C, which we selected experimentally for each pair of manuals and which is shown in Table 2.</Paragraph> </Section> </Section> </Paper>