Preprint No. 4 Classification: IR 2.3 
Automatic Processing of Foreign Language Documents 
G. Salton* 
Abstract 
Experiments conducted over the last few years with the SMART docu- 
ment retrieval system have shown that fully automatic text processing 
methods using relatively simple linguistic tools are as effective for pur- 
poses of document indexing, classification, search, and retrieval as the 
more elaborate manual methods normally used in practice. Up to now, all 
experiments were carried out entirely with English language queries and docu- 
ments. 
The present study describes an extension of the SMART procedures to 
German language materials. A multi-lingual thesaurus is used for the ana- 
lysis of documents and search requests, and tools are provided which make 
it possible to process English language documents against German queries, 
and vice versa. The methods are evaluated, and it is shown that the effec- 
tiveness of the mixed language processing is approximately equivalent to 
that of the standard process operating within a single language only. 
1. Introduction 
For some years, experiments have been under way to test the effectiveness
of automatic language analysis and indexing methods in information
retrieval. Specifically, document and query texts are processed fully
automatically, and content identifiers are assigned using a variety of
linguistic tools, including word stem analysis, thesaurus look-up, phrase
recognition, statistical term association, syntactic analysis, and so on.
The resulting concept identifiers assigned to each document and search
request are then matched, and the documents whose identifiers are
sufficiently close to the queries are retrieved for the user's attention.
*Department of Computer Science, Cornell University, Ithaca, N. Y. 14850.
This study was supported in part by the National Science Foundation under
grant GN-750.
The automatic analysis methods can be made to operate in real-time -- 
while the customer waits for an answer -- by restricting the query-document 
comparisons to only certain document classes, and interactive user-controlled 
search methods can be implemented which adjust the search request during the 
search in such a way that more useful, and less useless, material is retrieved 
from the file. 
The experimental evidence accumulated over the last few years indi- 
cates that retrieval systems based on automatic text processing methods -- 
including fully automatic content analysis as well as automatic document 
classification and retrieval -- are not in general inferior in retrieval effec- 
tiveness to conventional systems based on human indexing and human query 
formulation. 
One of the major objections to the practical utilization of the 
automatic text processing methods has been the inability automatically to 
handle foreign language texts of the kind normally stored in documentation 
and library systems. Recent experiments performed with document abstracts 
and search requests in French and German appear to indicate that these ob- 
jections may be groundless. 
In the present study, the SMART document retrieval system is used 
to carry out experiments using as input foreign language documents and 
queries. The foreign language texts are automatically processed using a 
thesaurus (synonym dictionary) translated directly from a previously avail- 
able English version. Foreign language query and document texts are looked- 
up in the foreign language thesaurus and the analyzed forms of the queries 
and documents are then compared in the standard manner before retrieving the 
highly matching items. The language analysis methods incorporated into the 
SMART system are first briefly reviewed. Thereafter, the main procedures 
used to process the foreign language documents are described, and the retrie- 
val effectiveness of the English text processing methods is compared with 
that of the foreign language material. 
2. The SMART System 
SMART is a fully-automatic document retrieval system operating on 
the IBM 7094 and 360 model 65. Unlike other computer-based retrieval systems, 
the SMART system does not rely on manually assigned key words or index terms 
for the identification of documents and search requests, nor does it use 
primarily the frequency of occurrence of certain words or phrases included 
in the texts of documents. Instead, an attempt is made to go beyond simple 
word-matching procedures by using a variety of intellectual aids in the form 
of synonym dictionaries, hierarchical arrangements of subject identifiers, 
statistical and syntactic phrase generation methods and the like, in order 
to obtain the content identifications useful for the retrieval process. 
Stored documents and search requests are then processed without any 
prior manual analysis by one of several hundred automatic content analysis 
methods, and those documents which most nearly match a given search request 
are extracted from the document file in answer to the request. The system 
may be controlled by the user, in that a search request can be processed 
first in a standard mode; the user can then analyze the output obtained and, 
depending on his further requirements, order a reprocessing of the request 
under new conditions. The new output can again be examined and the process 
iterated until the right kind and amount of information are retrieved. [1,2,3] 
SMART is thus designed as an experimental automatic retrieval system 
of the kind that may become current in operational environments some years 
hence. The following facilities, incorporated into the SMART system for 
purposes of document analysis may be of principal interest: 
a) a system for separating English words into stems and affixes 
(the so-called suffix 's' and stem thesaurus methods) which 
can be used to construct document identifications consisting 
of the stems of words contained in the documents; 
b) a synonym dictionary, or thesaurus, which can be used to 
recognize synonyms by replacing each word stem by one or 
more "concept" numbers; these concept numbers then serve as 
content identifiers instead of the original word stems; 
c) a hierarchical arrangement of the concepts included in the 
thesaurus which makes it possible, given any concept number, 
to find its "parents" in the hierarchy, its "sons", its 
"brothers", and any of a set of possible cross references; 
the hierarchy can be used to obtain more general content 
identifiers than the ones originally given by going up in 
the hierarchy, more specific ones by going down, and a set of 
related ones by picking up brothers and cross-references; 
d) statistical procedures to compute similarity coefficients 
based on co-occurrences of concepts within the sentences of 
a given collection; the related concepts, determined by 
statistical association, can then be added to the originally 
available concepts to identify the various documents; 
e) syntactic analysis methods which make it possible to compare 
the syntactically analyzed sentences of documents and search 
requests with a pre-coded dictionary of syntactic structures 
("criterion trees") in such a way that the same concept number 
is assigned to a large number of semantically equivalent, but 
syntactically quite different constructions; 
f) statistical phrase matching methods which operate like the 
preceding syntactic phrase procedures, that is, by using a 
preconstructed dictionary to identify phrases used as content 
identifiers; however, no syntactic analysis is performed in 
this case, and phrases are defined as equivalent if the concept 
numbers of all components match, regardless of the syntactic 
relationships between components; 
g) a dictionary updating system, designed to revise the several 
dictionaries included in the system: 
i) word stem dictionary 
ii) word suffix dictionary 
iii) common word dictionary (for words to be deleted 
during analysis) 
iv) thesaurus (synonym dictionary) 
v) concept hierarchy 
vi) statistical phrase dictionary 
vii) syntactic ("criterion") phrase dictionary. 
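The hierarchy facility of item c) can be pictured with a small sketch. The concept numbers, the PARENT table, and the function names below are hypothetical and serve only as an illustration; the actual SMART data structures are not described in this text.

```python
# Hypothetical toy hierarchy: child concept number -> parent concept number.
# These numbers are invented for illustration and are not SMART concepts.
PARENT = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3}


def parents(concept):
    """Return the parent of a concept as a list (empty for a root)."""
    parent = PARENT.get(concept)
    return [] if parent is None else [parent]


def sons(concept):
    """Return all concepts directly below the given concept."""
    return sorted(c for c, p in PARENT.items() if p == concept)


def brothers(concept):
    """Return the concepts that share a parent with the given concept."""
    parent = PARENT.get(concept)
    if parent is None:
        return []
    return [c for c in sons(parent) if c != concept]


def generalize(concepts):
    """Replace each concept by its parent where one exists (going up)."""
    result = set()
    for c in concepts:
        result.update(parents(c) or [c])
    return sorted(result)
```

Going up (generalize) yields broader identifiers, sons() yields more specific ones, and brothers() yields related ones, mirroring the three uses named in item c).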
The operations of the system are built around a supervisory system 
which decodes the input instructions and arranges the processing sequence 
in accordance with the instructions received. The SMART system organization 
makes it possible to evaluate the effectiveness of the various processing 
methods by comparing the outputs produced by a variety of different runs. 
This is achieved by processing the same search requests against the same 
document collections several times, and making judicious changes in the 
analysis procedures between runs. In each case, the search effectiveness is 
evaluated by presenting paired comparisons of the average performance over 
many search requests for two given search and retrieval methodologies. 
3. The Evaluation of Language Analysis Methods 
Many different criteria may suggest themselves for measuring the 
performance of an information system. In the evaluation work carried out with 
the SMART system, the effectiveness of an information system is assumed to 
depend on its ability to satisfy the users' information needs by retrieving 
wanted material, while rejecting unwanted items. Two measures have been 
widely used for this purpose, known as recall and precision, and representing 
respectively the proportion of relevant material actually retrieved, and the 
proportion of retrieved material actually relevant. [3] (Ideally, all rele- 
vant items should be retrieved, while at the same time, all nonrelevant items 
should be rejected, as reflected by perfect recall and precision values equal 
to 1). 
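The two measures can be written out directly from the definitions just given. The function name and the sample document numbers below are illustrative assumptions, not part of the SMART system.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = relevant items retrieved / all relevant items
    precision = relevant items retrieved / all retrieved items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: four documents retrieved, three relevant in the file.
recall, precision = recall_precision([1, 2, 3, 4], [2, 4, 5])
```

In this hypothetical case two of the three relevant documents are retrieved, and two of the four retrieved documents are relevant.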
It should be noted that both the recall and precision figures achie- 
vable by a given system are adjustable, in the sense that a relaxation of 
the search conditions often leads to high recall, while a tightening of the 
search criteria leads to high precision. Unhappily, experience has shown 
that on the average recall and precision tend to vary inversely since the 
retrieval of more relevant items normally also leads to the retrieval of 
more irrelevant ones. In practice, a compromise is usually made, and a per- 
formance level is chosen such that much of the relevant material is retrieved, 
while the number of nonrelevant items which are also retrieved is kept within 
tolerable limits. 
In theory, one might expect that the performance of a retrieval sys- 
tem would improve as the language analysis methods used for document and 
query processing become more sophisticated. In actual fact, this turns out 
not to be the case. A first indication of the fact that retrieval effec- 
tiveness does not vary directly with the complexity of the document or query 
analysis was provided by the output of the Aslib-Cranfield studies. This 
project tested a large variety of indexing languages in a retrieval envir- 
onment, and came to the astonishing conclusion that the simplest type of 
indexing language would produce the best results. [4] Specifically, three 
types of indexing languages were tested, called respectively single terms 
(that is, individual terms, or concepts assigned to documents and queries), 
controlled terms (that is, single terms assigned under the control of the 
well-known EJC Thesaurus of Engineering and Scientific Terms), and finally 
simple concepts (that is, phrases consisting of two or more single terms). 
The results of the Cranfield tests indicated that single terms are more 
effective for retrieval purposes than either controlled terms, or complete 
phrases. [4] 
These results might be dismissed as being due to certain peculiar 
test conditions if it were not for the fact that the results obtained with 
the automatic SMART retrieval system substantially confirm the earlier Cran- 
field output. [3] Specifically, the following basic conclusions can be 
drawn from the main SMART experiments: 
a) the simplest automatic language analysis procedure consisting 
of the assignment to queries and documents of weighted word 
stems originally contained in these documents, produces a 
retrieval effectiveness almost equivalent to that obtained 
by intellectual indexing carried out manually under controlled 
conditions; [3,5] 
b) use of a thesaurus look-up process, designed to recognize 
synonyms and other term relations by replacing the original word 
stems by the corresponding thesaurus categories, improves the 
retrieval effectiveness by about ten percent in both recall and 
precision; 
c) additional, more sophisticated language analysis procedures, 
including the assignment of phrases instead of individual 
terms, the use of a concept hierarchy, the determination 
of syntactic relations between terms, and so on, do not, on 
the average, provide improvements over the standard thesaurus 
process. 
An example of a typical recall-precision graph produced by the SMART 
system is shown in Fig. 1, where a statistical phrase method is compared 
with a syntactic phrase procedure. In the former case, phrases are assigned 
as content identifiers to documents and queries whenever the individual 
phrase components are all present within a given document; in the latter case, 
the individual components must also exhibit an appropriate syntactic rela- 
tionship before the phrase is assigned as an identifier. The output of Fig. 1 
shows that the use of syntax degrades performance (the ideal performance 
region is in the upper right-hand corner of the graph where both the recall 
and the precision are close to 1). Several arguments may explain the output 
of Fig. 1: 
a) the inadequacy of the syntactic analyzer used to generate 
syntactic phrases; 
b) the fact that phrases are often appropriate content identi- 
fiers even when the phrase components are not syntactically 
related in a given context (e.g. the sentence "people who 
need information, require adequate retrieval services" is 
adequately identified by the phrase "information retrieval", 
even though the components are not related in the sentence); 
c) the variability of the user population which makes it unwise 
to overspecify document content; 
d) the ambiguity inherent in natural language texts which may 
work to advantage when attempting to satisfy the information 
needs of a heterogeneous user population with diverse infor- 
mation needs. 
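The statistical phrase method described above (a phrase is assigned whenever all of its components are present, with no syntactic check) can be sketched as follows. The phrase dictionary, the concept numbers 101, 102, and 900, and the function name are hypothetical; the sketch only mirrors the "information retrieval" example of item b).

```python
# Hypothetical phrase dictionary: phrase concept number -> component concepts.
# 101 stands for "information", 102 for "retrieval", 900 for the phrase
# "information retrieval"; none of these numbers come from SMART.
PHRASES = {900: frozenset({101, 102})}


def statistical_phrases(doc_concepts, phrases=PHRASES):
    """Return the phrase concepts whose components all occur in the text.

    No syntactic relation between the components is required, which is
    what distinguishes this from a syntactic phrase procedure.
    """
    present = set(doc_concepts)
    return sorted(pid for pid, parts in phrases.items() if parts <= present)
```

Under these assumptions, a sentence whose concepts include both 101 and 102 receives phrase 900 even if the two words are not syntactically related, as in the example sentence of item b).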
[Fig. 1: recall-precision graph; precision on the vertical axis, recall on
the horizontal axis, ideal performance region in the upper right-hand corner]

          Precision
Recall    Statistical phrases    Syntactic phrases
 0.1            .960                   .938
 0.3            .834                   .776
 0.5            .769                   .735
 0.7            .706                   .625
 0.9            .546                   .467

Comparison Between Statistical and Syntactic Phrases 
(averages over 17 queries) 
Fig. 1 
Most likely a combination of some of the above factors is responsible 
for the fact that relatively simple content analysis methods are generally 
preferable in a retrieval environment to more sophisticated methods. The 
foreign language processing to be described in the remainder of this study 
must be viewed in the light of the foregoing test results. 
4. Multi-lingual Thesaurus 
The multi-lingual text processing experiment is motivated by the 
following principal considerations: 
a) in typical American libraries up to fifty percent of the stored 
materials may not be in English; about fifty percent of the 
material processed in a test at the National Library of Medi- 
cine in Washington was not in English (of this, German accounted 
for about 25%, French for 23%, Italian for 13%, Russian for 
11%, Japanese for 6%, Spanish for 5%, and Polish for 5%); [6] 
b) in certain statistical text processing experiments carried 
out with foreign language documents, the test results were 
about equally good for German as for English; [7] 
c) simple text processing methods appear to work well for English, 
and there is no a priori reason why they should not work 
equally well for another language. 
The basic multi-lingual system used for test purposes is outlined 
in Fig. 2. Document (or query) texts are looked up in a thesaurus and 
reduced to "concept vector" form; query vectors and document vectors are then 
compared, and document vectors sufficiently similar to the query are 
withdrawn from the file. In order to ensure that mixed language input is 
properly processed, the thesaurus must assign the same concept categories, no 
matter what the input language. The SMART system therefore utilizes a 
multi-lingual thesaurus in which one concept category corresponds both to 
a family of English words, or word stems, and to their German translation. 

[Fig. 2: outline of the basic multi-lingual system]
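A minimal sketch of such a shared-category look-up is given below. The entries are loosely modeled on the excerpt of Fig. 3 (the pairing of English and German words with concept numbers is an assumption of this sketch), and the table layout and function name are hypothetical rather than taken from SMART.

```python
from collections import Counter

# Hypothetical two-language thesaurus: each language maps words to the
# same concept numbers, so vectors are comparable across languages.
THESAURUS = {
    "english": {"ACTIVE": 234, "ACTIVITY": 234, "USAGE": 234,
                "INTELLECTUAL": 239, "MENTAL": 239,
                "ACTUAL": 240, "REAL": 240},
    "german": {"AKTIV": 234, "AKTIVITAET": 234, "TAETIGKEIT": 234,
               "GEISTIG": 239, "PRAXIS": 240},
}


def concept_vector(words, language):
    """Look up words and return (concept -> count, words not found)."""
    table = THESAURUS[language]
    vector = Counter()
    not_found = []
    for word in words:
        concept = table.get(word.upper())
        if concept is None:
            not_found.append(word)
        else:
            vector[concept] += 1
    return dict(vector), not_found


english, _ = concept_vector(["mental", "activity"], "english")
german, _ = concept_vector(["geistig", "Taetigkeit"], "german")
```

Because both tables assign the same concept numbers, the two hypothetical inputs above reduce to the same concept vector, which is the property the mixed-language process relies on.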
A typical thesaurus excerpt is shown in Fig. 3, giving respectively 
concept numbers, English word class, and corresponding German word class. 
This thesaurus was produced by manually translating into German an origi- 
nally available English version. Tables 1 and 2 show the results of the 
thesaurus look-up operation for the English and German versions of query 
QB 13. The original query texts in three languages (English, French, and 
German) are shown in Fig. 4. It may be seen that seven out of nine "English" 
concepts are common with the German concept vector for the same query. In 
view of this, one may expect that the German query processed against the 
German thesaurus could be matched against English language documents as 
easily as the English version of the query. Tables 1 and 2 also show that 
more query words were not found during look-up in the German thesaurus than 
in the English one. This is due to the fact that only a preliminary incom- 
plete version of the German thesaurus was available at run time. 
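The overlap between the two versions of query QB 13 can be checked with a short sketch. The concept numbers and weights are as read from Tables 1 and 2, which are partly illegible in the available copy, so they should be treated as assumptions. The cosine correlation is also an assumption: the text does not state which matching function is applied.

```python
import math

# Concept -> weight, as read from Tables 1 and 2 (assumed readings).
ENGLISH_QB13 = {3: 12, 19: 12, 33: 12, 49: 12, 65: 12,
                147: 12, 207: 12, 267: 12, 345: 12}
GERMAN_QB13 = {3: 12, 19: 12, 21: 4, 33: 6, 45: 4, 64: 4,
               65: 12, 68: 12, 147: 6, 207: 12, 267: 12}


def common_concepts(a, b):
    """Return the concept numbers present in both vectors."""
    return sorted(set(a) & set(b))


def cosine(a, b):
    """Cosine correlation between two sparse concept vectors."""
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


shared = common_concepts(ENGLISH_QB13, GERMAN_QB13)
similarity = cosine(ENGLISH_QB13, GERMAN_QB13)
```

With these readings the shared list contains the seven common concepts mentioned in the text.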
5. Foreign Language Retrieval Experiment 
To test the simple multi-lingual thesaurus process, two collections 
of documents in the area of library science and documentation (the Ispra 
collection) were processed against a set of 48 search requests in the 
documentation area. The English collection consisted of 1095 document 
abstracts, whereas the German collection contained only 468 document 
abstracts. The overlap between the two collections included 50 common 
documents. All 48 queries were originally available in English; they were 
manually translated into German by a native German speaker. The English 
queries were then processed against both the English and the German 
collections (runs E-E and E-G), and the same was done for the translated 
German queries (runs G-E and G-G, respectively). Relevance assessments were 
made for each English document abstract with respect to each English query 
by a set of eight American students in library science, and the assessors 
were not identical to the users who originally submitted the search requests. 
The German relevance assessments (German documents against German queries), 
on the other hand, were obtained from a different, German speaking, assessor. 

Concept  English word class                   German word class
230      ART                                  ARCHITEKTUR
231      INDEPEND                             SELBSTAENDIG, UNABHAENGIG
232      ASSOCIATIVE
233      DIVIDE
234      ACTIVE, ACTIVITY, USAGE              AKTIV, AKTIVITAET, TAETIGKEIT
235      CATHODE, CRT, DIODE, FLYING-SPOT,    DIODE, VERZWEIGER
         RAY, RELAIS, RELAY, SCANNER, TUBE
236      REDUNDANCY, REDUNDANT
237      CHARGE, ENTER, ENTRY, INSERT, POST   EINGANG, EINGEGANGEN, EINGEGEBEN,
                                              EINSATZ, EINSTELLEN, EINTRAGUNG
238      MULTI-LEVEL, MULTILEVEL
239      INTELLECT, INTELLECTUAL, INTELLIG,   GEISTIG
         MENTAL, NON-INTELLECTUAL
240      ACTUAL, PRACTICE, REAL               PRAXIS
Excerpt from Multi-Lingual Thesaurus 
Fig. 3 

English Query QB 13 
Concepts   Weights   Thesaurus Category 
  3 /        12      computer, processor 
 19 /        12      automatic, semiautomatic 
 33 /        12      analyze, analyzer, analysis, etc. 
 49          12      compendium, compile, deposit 
 65 /        12      authorship, originator 
147 /        12      discourse, language, linguistic 
207 /        12      area, branch, subfield 
267 /        12      concordance, keyword-in-context, KWIC 
345          12      bell 
  *                  anonymous, lettres 
/ common concept with German query 
* words not found in thesaurus 
Thesaurus Look-up for English Query QB 13 
Table 1 

German Query QB 13 
Concepts   Weights   Thesaurus Category 
  3 /        12      Computer, Datenverarbeitung 
 19 /        12      Automatisch, Kybernetik 
 21           4      Artikel, Presse, Zeitschrift 
 33 /         6      Analyse, Sprachenanalyse 
 45           4      Herausgabe, Publikation 
 64           4      Buch, Heft, Werk 
 65 /        12      Autor, Verfasser 
 68          12      Literatur 
147 /         6      Linguistik, Sprache 
207 /        12      Arbeitsgebiet, Fach 
267 /        12      Konkordanz, KWIC 
  *                  schoenen, hilfreich, vermutlich, 
                     anonymen, zusammenzustellen 
/ common concept with English query 
* words not found in thesaurus 
Thesaurus Look-up for German Query QB 13 
Table 2 

SFIND Q13B AUTHORS 
IN WHAT WAYS ARE COMPUTER SYSTEMS BEING 
APPLIED TO RESEARCH IN THE FIELD OF THE 
BELLES LETTRES ? HAS MACHINE ANALYSIS OF 
LANGUAGE PROVED USEFUL FOR INSTANCE, IN 
DETERMINING PROBABLE AUTHORSHIP OF 
ANONYMOUS WORKS OR IN COMPILING 
CONCORDANCES ? 
DANS QUEL SENS LES CALCULATEURS 
SONT-ILS APPLIQUES A LA RECHERCHE DANS 
LE DOMAINE DES BELLES-LETTRES ? EST-CE 
QUE L'ANALYSE AUTOMATIQUE DES TEXTES A 
ETE UTILE, PAR EXEMPLE, POUR DETERMINER 
L'AUTEUR PROBABLE D'OUVRAGES ANONYMES OU 
POUR FAIRE DES CONCORDANCES ? 
INWIEWEIT WERDEN COMPUTER-SYSTEME ZUR 
FORSCHUNG AUF DEM GEBIET DER SCHOENEN 
LITERATUR VERWENDET ? HAT SICH 
MASCHINELLE SPRACHENANALYSE ALS 
HILFREICH ERWIESEN, UM Z.B. DIE 
VERMUTLICHE AUTORENSCHAFT BEI ANONYMEN 
WERKEN ZU BESTIMMEN ODER UM KONKORDANZEN 
ZUSAMMENZUSTELLEN ? 
Query QB 13 in Three Languages 
Fig. 4 
The principal evaluation results for the four runs using the thesaurus 
process are shown in Fig. 5, averaged over 48 queries in each case. 
It is clear from the output of Fig. 5 that the cross-language runs, E-G 
(English queries - German documents) and G-E (German queries - English 
documents), are not substantially inferior to the corresponding output within 
a single language (G-G and E-E, respectively), the difference being of the 
order of 0.02 to 0.03 for a given recall level. On the other hand, both 
runs using the German document collection are inferior to the runs with the 
English collection. 

[Fig. 5: recall-precision results for the four thesaurus runs (E-E, E-G,
G-E, G-G), averaged over 48 queries]

The output of Fig. 5 leads to the following principal conclusions: 
a) the query processing is comparable in both languages; for if 
this were not the case, then one would expect one set of 
query runs to be much less effective than the other (that is, 
either E-E and E-G, or else G-G and G-E); 
b) the language processing methods (that is, thesaurus categories, 
suffix cut-off procedures, etc.) are equally effective in 
both cases; if this were not the case, one would expect one 
of the single language runs to come out very poorly, but 
neither E-E, nor G-G came out as the poorest run; 
c) the cross-language runs are performed properly, for if this 
were not the case, one would expect E-G and G-E to perform 
much less well than the runs within a single language; since 
this is not the case, the principal conclusion is then 
obvious that documents in one language can be matched against 
queries in another nearly as well as documents and queries 
in a single language; 
d) the runs using the German document collection (E-G and G-G) 
are less effective than those performed with the English 
collection; the indication is then apparent that some 
characteristic connected with the German document collection 
itself - for example, the type of abstract, or the language 
of the abstract, or the relevance assessments - requires 
improvement; the effectiveness of the cross-language 
processing, however, is not at issue. 
The foreign language analysis is summarized in Table 3. 
6. Failure Analysis 
Since the query processing operates equally well in both languages, 
while the German document collection produces a degraded performance, it 
becomes worthwhile to examine the principal differences between the two 
document collections. These are summarized in Table 4. The following 
principal distinctions arise: 
a) the organization of the thesaurus used to group words or 
word stems into thesaurus categories; 
b) the completeness of the thesaurus in terms of words included 
in it; 
c) the type of document abstracts included in the collection; 
d) the accuracy of the relevance assessments obtained from the 
collections. 

Translation Problem         Corresponding Observation      Observation
                                                           Confirmed
Poor query processing       E-E and E-G much better        No
or poor translation         than G-E and G-G, or
                            vice-versa
Poor language processing    Either E-E or G-G much         No
                            poorer than cross-language
                            runs
Poor cross-language         Both E-G and G-E poorer        No
processing                  than other runs
Poor processing of one      Either E-G and G-G, or         Yes
document collection         else G-E and E-E simul-
                            taneously poor
E-E: English queries - English documents 
E-G: English queries - German documents 
G-E: German queries - English documents 
G-G: German queries - German documents 
Analysis of Foreign Language Processing 
Table 3 

Characteristics of Collections               Document Collection
                                             English     German
Number of document abstracts                 1095        468
Number of documents common to                50          50
both collections
Number of queries used in test               48          48
Number of relevance assessors                8           1
Number of common relevance                   0           0
assessors
Generality of collection                     0.013       0.029
(number of relevant documents
over total number of documents
in collection)
Average number of word occurrences           6.5         15.5
not found in the thesaurus
during look-up of document
abstracts
Characteristics of Document Collections 
Table 4 
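The reasoning of Table 3 can be expressed as a small decision sketch over one performance figure per run. The input values, the margin of 0.05 used to decide what counts as "much poorer", and the function name are all hypothetical; the measured values of Fig. 5 are not reproduced here.

```python
def diagnose(p, margin=0.05):
    """Apply the four checks of Table 3 to one figure per run.

    p maps run labels ("E-E", "E-G", "G-E", "G-G") to an average
    performance value; margin is an assumed threshold for "much poorer".
    """
    ee, eg, ge, gg = p["E-E"], p["E-G"], p["G-E"], p["G-G"]
    return {
        # One query language much better than the other in both collections.
        "poor query processing or translation":
            min(ee, eg) - max(ge, gg) > margin
            or min(ge, gg) - max(ee, eg) > margin,
        # A single-language run much poorer than both cross-language runs.
        "poor language processing":
            min(eg, ge) - min(ee, gg) > margin,
        # Both cross-language runs poorer than both single-language runs.
        "poor cross-language processing":
            min(ee, gg) - max(eg, ge) > margin,
        # Both runs on one collection poorer than both runs on the other.
        "poor processing of one document collection":
            min(ee, ge) - max(eg, gg) > margin
            or min(eg, gg) - max(ee, ge) > margin,
    }


# Hypothetical figures chosen only to resemble the pattern described above.
result = diagnose({"E-E": 0.50, "E-G": 0.37, "G-E": 0.48, "G-G": 0.40})
```

With these invented figures only the last check holds, which corresponds to the single "Yes" entry of Table 3.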
Concerning first the organization of the multi-lingual thesaurus, 
it does not appear that any essential difficulties arise on that account. 
This is confirmed by the fact that the cross-language runs operate satis- 
factorily, and by the output of Fig. 6 (a) comparing a German word stem 
run (using standard suffix cut-off and weighting procedures) with a German 
thesaurus run. It is seen that the German thesaurus improves performance 
over word stems for the German collection in the same way as the English 
thesaurus was seen earlier to improve retrieval effectiveness over the Eng- 
lish word stem analysis. [2,3] 
The other thesaurus characteristic - that is its completeness - 
appears to present a more serious problem. Table 4 shows that only approx- 
imately 6.5 English words per document abstract were not included in the 
English thesaurus, whereas over 15 words per abstract were missing from 
the German thesaurus. Obviously, if the missing words turn out to be 
important for content analysis purposes, the German abstracts will be more 
difficult to analyze than their English counterparts. A brief analysis 
confirms that many of the missing German words, which do not therefore pro- 
duce concept numbers assignable to the documents, are indeed important for 
content identification. Fig. 7, listing the words not found for document 
0059 shows that 12 out of 14 missing words appear to be important for the 
analysis of that document. It would therefore seem essential that a more 
complete thesaurus be used under operational conditions and for future 
experiments. 
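The completeness figure of Table 4 (average number of word occurrences per abstract not found in the thesaurus) could be computed along the following lines. The function name, the tokenized input format, and the sample data are assumptions made for illustration; the sketch ignores suffix cut-off and common word deletion.

```python
def average_not_found(abstracts, thesaurus_words):
    """Average number of word occurrences per abstract missing from the
    thesaurus. Each abstract is a list of already tokenized words."""
    if not abstracts:
        return 0.0
    known = {w.upper() for w in thesaurus_words}
    missing = sum(
        1 for words in abstracts for w in words if w.upper() not in known
    )
    return missing / len(abstracts)


# Hypothetical two-abstract collection and three-word thesaurus.
sample = [["Sprache", "Analyse", "Lochkarte"], ["Lochkarte", "Kartei"]]
rate = average_not_found(sample, ["SPRACHE", "ANALYSE", "LITERATUR"])
```

A higher value for one collection, as reported for the German abstracts, points to a less complete thesaurus for that language.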
The other two collection characteristics, including the type of 
abstracts and the accuracy of the relevance judgments, are more difficult 
to assess, since these are not subject to statistical analysis. It is a 
fact that for some of the German documents informative abstracts are not 
available. For example, the abstract for document 028, included in Fig. 8, 
indicates that the corresponding document is a conference proceedings; very 
little is known about the subject matter of the conference, but the 
document was nevertheless judged relevant to six different queries (nos. 17, 
27, 31, 32, 52, and 53) dealing with subjects as diverse as "behavioral 
studies of information system users" (query 17), and "the study of machine 
translation" (query 27). One might quarrel with such relevance assessments, 
and with the inclusion of such documents in a test collection, particularly 
since Fig. 6 (b) shows that the German queries operate more effectively 
with the English collection (using English relevance assessments) than with 
the German assessments. However, earlier studies using a variety of 
relevance assessments with the same document collection have shown that 
recall-precision results are not affected by ordinary differences in 
relevance assessments. [8] For this reason, it would be premature to assume 
that the performance differences are primarily due to distinctions in the 
relevance assessments or in the collection make-up. 

[Fig. 6: recall-precision graphs; (a) German word stem run compared with
German thesaurus run; (b) German queries processed against the English and
German collections]
[Fig. 7: words not found in the thesaurus for German document 0059]
[Fig. 8: abstract of German document 028]

7. Conclusion 
An experiment using a multi-lingual thesaurus in conjunction with 
two different document collections, in German and English respectively, has 
shown that cross-language processing (for example, German queries against 
English documents) is nearly as effective as processing within a single 
language. Furthermore, a simple translation of thesaurus categories appears 
to produce a document content analysis which is equally effective in 
English as in German. In particular, differences in morphology (for example, 
in the suffix cut-off rules), and in language ambiguities do not seem to 
cause a substantial degradation when moving from one language to another. 
For these reasons, the automatic retrieval methods used in the SMART system 
for English appear to be applicable also to foreign language material. 
Future experiments with foreign language documents should be carried 
out using a thesaurus that is reasonably complete in all languages, and 
with identical query and document collections for which the same relevance 
judgments may then be applicable across all runs. 

References 

[1] G. Salton and M. E. Lesk, The SMART Automatic Document Retrieval 
System - An Illustration, Communications of the ACM, Vol. 8, No. 6, 
June 1965. 

[2] G. Salton, Automatic Information Organization and Retrieval, McGraw 
Hill Book Company, New York, 1968, 514 pages. 

[3] G. Salton and M. E. Lesk, Computer Evaluation of Indexing and Text 
Processing, Journal of the ACM, Vol. 15, No. 1, January 1968. 

[4] C. W. Cleverdon and E. M. Keen, Factors Determining the Performance 
of Indexing Systems, Vol. 1: Design, Vol. 2: Test Results, Aslib 
Cranfield Research Project, Cranfield, England, 1966. 

[5] G. Salton, A Comparison Between Manual and Automatic Indexing Methods, 
American Documentation, Vol. 20, No. 1, January 1969. 

[6] F. W. Lancaster, Evaluation of the Operating Efficiency of Medlars, 
Final Report, National Library of Medicine, Washington, January 1969. 

[7] J. H. Williams, Computer Classification of Documents, FID-IFIP 
Conference on Mechanized Documentation, Rome, June 1967. 

[8] M. E. Lesk and G. Salton, Relevance Assessments and Retrieval System 
Evaluation, Information Storage and Retrieval, Vol. 4, No. 4, October 
1968. 
