Hypertext Authoring for Linking Relevant Segments of 
Related Instruction Manuals 
Hiroshi Nakagawa and Tatsunori Mori and Nobuyuki Omori and Jun Okamura 
Department of Computer and Electronic Engineering, Yokohama National University 
Tokiwadai 79-5, Hodogaya, Yokohama, 240-8501, JAPAN 
E- mail: nakagawa@ n aklab, dnj. ynu. ac.j p, { mori, ohmori ,j un } @forest. dnj. ynu. ac.j p 
Abstract 
Recently manuals of industrial products become 
large and often consist of separated volumes. In 
reading such individual but related manuals, we 
must consider the relation among segments, which 
contain explanations of sequences of operation. In 
this paper, we propose methods for linking relevant 
segments in hypertext authoring of a set of related 
manuals. Our method is based on the similarity 
calculation between two segments. Our experimen- 
tal results show that the proposed method improves 
both recall and precision comparing with the con- 
ventional tf. idf based method. 
1 Introduction 
In reading traditional paper based manuals, we 
should use their indices and table of contents in or- 
der to know where the contents we want to know are 
written. In fact, it is not an easy task especially for 
novices. Recent years, electronic manuals in a form 
of hypertext like Help of Microsoft Windows became 
widely used. Unfortunately it is very expensive to 
make a hypertext manual by hand especially in case 
of a large volume of manual which consists of sev- 
eral separated volumes. In a case of such a large 
manual, the same topic appears at several places in 
different volumes. One of them is an introductory 
explanation for a novice. Another is a precise ex- 
planation for an advanced user. It is very useful to 
jump from one of them to another of them directly 
by just clicking a button of mouse in reading a man- 
ual text on a browser like NetScape. This type of 
access is realized by linking them in hypertext for- 
mat by hypertext authoring. 
Automatic hypertext authoring has been focused 
on in these years, and much work has been done. For 
instance, Basili et al. (1994) use document struc- 
tures and semantic information by means of natural 
language processing technique to set hyperlinks on 
plain texts. 
The essential point in the research of automatic 
hypertext authoring is the way to find semantically 
relevant parts where each part is characterized by 
a number of key words. Actually it is very similar 
with information retrieval, IR henceforth, especially 
with the so called passage retrieval (Salton et al., 
1993). J.Green (1996) does hypertext authoring of 
newspaper articles by word's lexical chains which are 
calculated using WordNet. Kurohashi et al. (1992) 
made a hypertext dictionary of the field of infor- 
mation science. They use linguistic patterns that 
are used for definition of terminology as well as the- 
saurus based on words' similarity. Furner-Hines and 
Willett (1994) experimentally evaluate and compare 
the performance of several human hyper linkers. In 
general, however, we have not yet paid enough at- 
tention to a full-automatic hyper linker system, that 
is what we pursue in this paper. 
The new ideas in our system are the following 
points: 
1. Our target is a multi-volume manual that de- 
scribes the same hardware or software but is dif- 
ferent in their granularity of descriptions from 
volume to volume. 
2. In our system, hyper links are set not between 
an anchor word and a certain part of text but 
between two segments, where a segment is a 
smallest formal unit in document, like a sub- 
subsection of ~TEX if no smaller units like 
subsubsubsection are used. 
3. We find pairs of relevant segments over two 
volumes, for instance, between an introductory 
manual for novices and a reference manual for 
advanced level users about the same software or 
hardware. 
4. We use not only tf.idf based vector space model 
but also words' co-occurrence information to 
measure the similarity between segments. 
2 Similarity Calculation 
We need to calculate a semantic similarity between 
two segments in order to decide whether two of them 
are linked, automatically. The most well known 
method to calculate similarity in IR is a vector space 
model based on tf • idf value. As for idf, namely 
inverse document frequency, we adopt a segment in- 
929 
stead of document in the definition of idf. The def- 
inition of idf in our system is the following. 
of segments in the manual 
idf(t) = log ~ of segments in which t occurs + 1 
Then a segment is described as a vector in a vector 
space. Each dimension of the vector space consists 
of each term used in the manual. A vector's value 
of each dimension corresponding to the term t is 
its tf • idf value. The similarity of two segments is 
a cosine of two vectors corresponding to these two 
segments respectively. Actually the cosine measure 
similarity based on tf. idf is a baseline in evaluation 
of similarity measures we propose in the rest of this 
section. 
As the first expansion of definition of tf • idf, we 
use case information of each noun. In Japanese, case 
information is easily identified by the case particle 
like ga( nominal marker ), o( accusative marker ), 
hi( dative marker ) etc. which are attached just af- 
ter a noun. As the second expansion, we use not only 
nouns (+ case information) but also verbs because 
verbs give important information about an action a 
user does in operating a system. As the third expan- 
sion, we use co-occurrence information of nouns and 
verbs in a sentence because combination of nouns 
and a verb gives us an outline of what the sentence 
describes. The problem at this moment is the way 
to reflect co-occurrence information in tf. idf based 
vector space model. We investigate two methods for 
this, namely, 
1. Dimension expansion of vector space, and 
2. Modification of tf value within a segment. 
In the following, we describe the detail of these two 
methods. 
2.1 Dimension Expansion 
This method is adding extra-dimensions into the 
vector space in order to express co-occurrence in- 
formation. It is described more precisely as the fol- 
lowing procedure. 
1. Extracting a case information (case particle in 
Japanese) from each noun phrase. Extracting a 
verb from a clause. 
2. Suppose be there n noun phrases with a case 
particle in a clause. Enumerating every combi- 
nation of 1 to n noun phrases with case particle. 
12 
Then we have E nCk combinations. 
6=1 
3. Calculating tf • idf for every combination with 
the corresponding verb. And using them as new 
extra dimensions of the original vector space. 
For example, suppose a sentence "An end user 
learns the programming language." Then in ad- 
dition to dimensions corresponding to every noun 
phrase like "end user", we introduce the new di- 
mensions corresponding to co-occurrence informa- 
tion such as: 
• (VERB, learn) (NOMNINAL end user) (AC- 
CUSATIVE programming language) 
• (VERB, learn) (NOMNINAL end user) 
• (VERB, learn) (ACCUSATIVE programming 
language) 
We calculate tf. idf of each of these combinations 
that is a value of vector corresponding to each of 
these combinations. The similarity calculation based 
on cosine measure is done on this expanded vector 
space. 
2.2 Modification of tf value 
Another method we propose for reflecting co- 
occurrence information to similarity is modification 
of tf value within a segment. (Takaki and Kitani, 
1996) reports that co-occurrence of word pairs con- 
tributes to the IR performance for Japanese news 
paper articles. 
In our method, we modify tf of pairs of co- 
occurred words that occur in both of two segments, 
say dA and dB, in the following way. Suppose that a 
term tk, namely noun or verb, occurs f times in the 
segment da. Then the modified tf'(da, tk) is defined 
as the following formula. 
tf'(dA, tk) = t f(da, tk) 
1 
+ Z E cw(dA,tk,p, tc) 
teETc(tk,da,dB)P =1 
1 
"}- E E Cw'(da,tk,p, tc) 
tcGTc( tk ,dA,dB ) P =1 
where cw and cw' are scores of importance for co- 
occurrence of words, tk and t~. Intuitively, cw and 
cw' are counter parts of tf. idf for co-occurrence of 
words and co-occurrence of (noun case-information), 
respectively, cw is defined by the following formula. 
cw(dA, tk, p, to) 
a(dA,~k,p,t~) X ~(tk,t~) X 7(tk,/c) X C 
M(dA) 
where c~(da, tk, p, to) is a function expressing how 
near tkand t~ occur, p denotes that pth tk's occur- 
rence in the segment dA, and fl(tk,t¢) is a normal- 
ized frequency of co-occurrence of ¢~ and ¢~. Each 
of them is defined as follows. 
a(dA, tk, p, t~) = d(dA, tk, p) - dist(dA, tk, p, t~) 
d(dA, tk, p) 
930 
rtf(t~,t¢) 
~(tk,t~)- atf(tk) 
where the function dist(da, tk,p, to) is a distance 
between pth t~ within da and tc counted by word. 
d(da,tk,p) shows the threshold of distance within 
which two words are regarded as a co-occurrence. 
Since, in our system, we only focus on co-occurrences 
within a sentence, a(da,tk,p,t~) is calculated for 
pairs of word occurrences within a sentence. As a 
result, d(dA,tk,p) is a number of words in a sen- 
tence we focus on. atf(tk) is a total number of 
tk's occurrences within the manual we deal with. 
rtf(tk, t~) is a total number of co-occurrences of tk 
and tc within a sentence. 7(t~, to) is an inverse doc- 
ument frequency ( in this case "inverse segment fre- 
quency") of te which co-occurs with tk, and defined 
as follows. 
N 7(tk, fc) = lOg( d-~c ) ) 
where N is a number of segments in a manual, 
and dr(to) is a number segments in which tc occurs 
with tk. 
M(da) is a length of segment da counted in mor- 
phological unit, and used to normalize cw. C is a 
weight parameter for cw. Actually we adopt the 
value of C which optimizes 1 lpoint precision as de- 
scribed later. 
The other modification factor cw' is defined in al- 
most the same way as cw is. The difference between 
cw and cw' is the following, cw is calculated for 
each noun. On the other hand, cw' is calculated for 
each combination of noun and its case information. 
Therefore, cw I is calculated for each ( noun, case ) 
like (user, NOMINAL). In other words, in calcula- 
tion of cw', only when ( noun-l, case-1 ) and ( noun- 
2, case-2 ), like (user NOMINAL) and (program AC- 
CUSATIVE), occur within the same sentence, they 
are regarded as a co-occurrence. 
Now we have defined cw and cw'. Then back to 
the formula which defines tf'. In the definition of 
tf', Tc(tk, dA, dB) is a set of word which occur in 
both of dA and dB. Therefore cws and cw's are 
summed up for all occurrences of tk in dA. Namely 
we add up all cws and cw% whose tc is included in 
T~(tk, dA, dn) to calculate tf'. 
3 Implementation and Experimental 
Results 
Our system has the following inputs and outputs. 
Input is an electronic manual text which can be 
written in plain text,I~TEXor HTML) 
Output is a hypertext in HTML format. 
Electronic Manuals manual A manual B 
WO~as-~red2o~S ~ ~Ke),word~xtra~ = 
.... "4 tf i~\[cutatlon 
, 
Slrnllafl~/Calculation based on Vector Space Mode 
1 \[ 
Hypeaext Unk Genarator 
I OUTPUT 
HYPERTEXT 
~ orphological Ana~s System 
manual A manual B 
Figure h Overview of our hypertext generator 
We need a browser like NelScape that can display 
a text written in HTML. Our system consists of four 
sub-systems shown in Figure 1. 
Keyword Extraction Sub-System In this sub- 
system, a morphological analyzer segments out 
the input text, and extract all nouns and verbs 
that are to be keywords. We use Chasen 1.04b 
(Matsumoto et al., 1996) as a morphological 
analyzer for Japanese texts. Noun and Case- 
information pairs are also made in this sub- 
system. If you use the dimension expansion de- 
scribed in 2.1, you introduce new dimensions 
here. 
tf- idf Calculation Sub-System 
This sub-system calculates tf • idf of extracted 
keywords by Keyword Extraction Sub-System. 
Similarity Calculation Sub-System This sub- 
system calculates the similarity that is repre- 
sented by cosine of every pair of segments based 
on tf • idf values calculated above. If you use 
modifications of tf values described in 2.2, you 
calculated modified t f, namely tf' in this sub- 
system. 
Hypertext Generator This sub-system trans- 
lates the given input text into a hypertext in 
which pairs of segments having high similarity, 
say high cosine value, are linked. The similarity 
of those pairs are associated with their links for 
user friendly display described in the following 
We show an example of display on a browser in 
Figure 2. The display screen is divided into four 
parts. The upper left and upper right parts show 
a distinct part of manual text respectively. In the 
lower left (right) part, the title of segments that 
are relevant to the segment displayed on the upper 
left (right) part are displayed in descending order of 
931 
1 
FS-Ze FAir V~w Go Booka~pa 0pt~orm D~ZU3ry WJz~:l~ H~p 
.__v_J- -2_J .-.I m 
Location: IIhtt~ ://~. forest, dr,,j. Ynu. ,etc. 5p/+SuxVjum_ch~frame+ htqL~ 
~hat" s ~1 ~t'~ ~?1 Ikstlnati°nsl Net Search I l~opl¢l Soft,zre I 
E JUMAN ~-- 
ChaSen 1.0 'r'~.6r~ 
l- -- J b~R~L~ t~ ~ =k ~ t~.ANSt 
ttl L,~. 
-t- 
• P JUM AN 2~l ;PJ'~ JUM~N 3~) ~ 
. 
Tr 
F JUMAN 2.0 7)'+~> 
JUMAN 3.0 ,r',,..CT'~:~m. 
r:. 
~9 
~o~8~g 
n~- i'a. -~ l't L: "~ I, • 35 l~l.~'~"lt!$ L < IlI~'tF • ,I I 
_ ~,:X,_ 
, 
Figure 2: The use of this system 
similarity. Since these titles are linked to the cor- 
responding segment text, if we click one of them in 
the lower left (right) part, the hyperlinked segment's 
text is instantly displayed on the upper right (left) 
part, and its relevant segments' title are displayed 
on the lower right (left) part. By this type of brows- 
ing along with links displayed on the lower parts, 
if a user wants to know relevant information about 
what she/he is reading on the text displayed on the 
upper part, a user can easily access the segments in 
which what she/he wants to know might be written 
in high probability. 
Now we describe the evaluation of our proposed 
methods with recall and precision defined as follows. 
recall = ~ of retrieved pairs of relevant segments 
precision= 
of pairs of relevant segments 
of retrieved pairs of relevant segments 
II of retrieved pairs of segments 
The first experiment is done for a large manual 
of APPGALLARY(Hitachi, 1995) which is 2.5MB 
large. This manual is divided into two volumes. One 
is a tutorial manual for novices that contains 65 seg- 
ments. The other is a help manual for advanced 
users that contains 2479 segments. If we try to find 
the relevant segments between ones in the tutorial 
manual and ones in the help manual, the number of 
possible pairs of segments is 161135. This number 
is too big for human to extract all relevant segment 
manually. Then we investigate highest 200 pairs of 
segments by hand, actually by two students in the 
engineering department of our university to extract 
pairs of relevant segments. The guideline of selection 
of pairs of relevant segments is: 
0.9 
08 
0.7 
0.6 
0.5 
04 
03 
0.2 
0.t 
0 
Precision - - - 
Recall -- 
20 40 60 80 100 120 140 t60 180 200 
Rank~ 
Figure 3: Recall and precision of generated hyper- 
links on large-scale manuals 
Table 1: Manual combinations and number of right 
correspondences of segments 
pairofm  uals ,,AoB AO+ BO+ 
of all pairs II 1056 896 924 
of relevant pairs 65 60 47 
1. Two segments explain the same operation or the 
same terminology. 
2. One segment explains an abstract concept and 
the other explains that concept in concrete op- 
eration. 
Figure 3 shows tim recall and precision for num- 
bers of selected pairs of segments where those pairs 
are sorted in descending order of cosine similarity 
value using normal tf • idf of all nouns. Tiffs result 
indicates that pairs of relevant segments are concen- 
trated in high similarity area. In fact, the pairs of 
segments within top 200 pairs are almost all relevant 
ones. 
The second experiment is done for three 
small manuals of three models of video cas- 
sette recorder(MITSUBISHI, 1995c; MITSUBISHI, 
1995a; MITSUBISHI, 1995b) produced by the same 
company. We investigate all pairs of segments 
that appear in the distinct manuals respectively, 
and extract relevant pairs of segment according 
to the same guideline we did in the first experi- 
ment by two students of the engineering depart- 
ment of our university. The numbers of segments 
are 32 for manual A(MITSUBISHI, 1995c), 33 for 
manual B(MITSUBISHI, 1995a) and 28 for manual 
C(MITSUBISHI, 1995b), respectively. The number 
of relevant pairs of segments are shown ill Table 1. 
We show the 11 points precision averages for these 
methods in Table 2. Each recall-precision curve, 
say Keyword, dimension N, cw+cw' tf, and Normal 
Query, corresponds to the methods described in the 
previous section. We describe the more precise defi- 
nition of each in the following. 
932 
Table 2: 11 point average of precision for each 
method and combination 
Method ACVB A¢~C BvvC 
Keyword 0.678 0.589 0.549 
cw+cw' tf 0.683 0.625 0.582 
C 0.1 0.6 1.3 
dimension N 0.684 0.597 0.556 
Normal Query 0.692 0.532 0.395 
Keyword: Using tf. idf for all nouns and verbs 
occuring in a pair of manuals. This is the baseline 
data. 
dimension N: Dimension Expansion method de- 
scribed in section 2.1. In this experiment, we use 
only noun-noun co-occurrences. 
cw+cw' tf: Modification of tf value method de- 
scribed in section2.2. In this experiment, we use 
only noun-verb co-occurrences. 
Normal Query: This is the same as Keyword ex- 
cept that vector values in one manual are all set to 
0 or 1, and vector values of the other manual are 
tf . id/. 
In the rest of this section, we consider the results 
shown above point by point. 
The effect of using tf. idf information of both 
segments 
We consider the effect of using tf. idf of two seg- 
ments that we calculate similarity. For comparison, 
we did the experiment Normal Query where tf.idf 
is used as vector value for one segment and 1 or 0 
is used as vector value for the other segment. This 
is a typical situation in IR. In our system, we calcu- 
late similarity of two segments .already given. That 
makes us possible using tf • idf for both segments. 
As shown in Table 2, Keyword outperforms Nor- 
mal Query. 
The effect of using co-occurrence information 
The same types of operation are generally de- 
scribed in relevant segments. The same type ofop- 
eration consists of the same action and equipment 
in high probability. This is why using co-occurrence 
information in similarity calculation magnifies sim- 
ilarities between relevant segments. Comparing di- 
mension expansion and modification of t f, the latter 
outperforms the former in precision for almost all 
recall rates. Modification of tf value method also 
shows better results than dimension expansion in 11 
point precision average shown in Table 2 for A-C 
and B-C manual pairs. As for normalization factor 
C of modification of tf value method, the smaller 
C becomes, the less tf value changes and the more 
similar the result becomes with the baseline ease in 
which only tf is used. On the contrary, the bigger C 
becomes, the more incorrect pairs get high similar- 
ity and the precision deteriorates in low recall area. 
As a result, there is an optimum C value, which we 
selected experimentally for each pair of manuals and 
is shown in Table 2 respectively. 
4 Conclusions 
We proposed two methods for calculating similarity 
of a pair of segments appearing in distinct manuals. 
One is Dimension Expansion method, and the other 
is Modification of tf value method. Both of them 
improve the recall and precision in searching pairs of 
relevant segment .This type of calculation of similar- 
ity between two segments is useful in implementing 
a user friendly manual browsing system that is also 
proposed and implemented in this research. 

References 
Roberto Basili, Fabrizio Grisoli, and Maria Teresa 
Pazienza. 1994. Might a semantic lexicon support 
hypertextual authoring? In 4th ANLP, pages 
174-179. 
David Elhs. Jonathan Furner-Hines and Peter Wil- 
lett. 1994. On the measurement of inter-linker 
consistency and retrieval effectiveness in hyper- 
text databases. In SIGIR '94, pages 51-60. 
Hitachi, 1995. How to use the APPGALLERY, 
APPGALLERY On-Line Help. Hitachi Limited. 
Stephen J.Green. 1996. Using lexcal chains to build 
hypertext links in newspaper articles. In Proceed- 
ings of AAAI Workshop on Knowledge Discovery 
in Databases, Portland, Oregon. 
S. Kurohashi, M. Nagao, S. Sato, and M. Murakami. 
1992. A method of automatic hypertext construc- 
tion from an encyclopedic dictionary of a specific 
field. In 3rd ANLP, pages 239-240. 
Yuji Matsumoto, Osamu Imaichi, Tatsuo Ya- 
mashita, Akira Kitauchi, and Tomoaki Imamura. 
1996. Japanese morphological analysis system 
ChaSen manual (version 1.0b4). Nara Institute of 
Science and Technology, Nov. 
MITSUBISHI, 1995a. MITSUBISHI Video Tape 
Recorder HV-BZ66 Instruction Manual. 
MITSUBISHI, 1995b. MITSUBISHI Video Tape 
Recorder HV-F93 Instruction Manual. 
MITSUBISHI, 1995c. MITSUBISHI Video Tape 
Recorder HV-FZ62 Instruction Manual. 
Gerard Salton, J. Allan, and Chris Buckley. 1993. 
Approaches to passage retrieval in full text infor- 
mation systems. In SIGIR '93, pages 49-58. 
Toru Takaki and Tsuyoshi Kitani. 1996. Rele- 
vance ranking of documents using query word co- 
occurrences (in Japanese). IPSJ SIG Notes 96-FI- 
41-8, IPS Japan, April. 
