<?xml version="1.0" standalone="yes"?> <Paper uid="C69-0401"> <Title>Automatic Processing of Foreign Language Documents</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. The SMART System </SectionTitle> <Paragraph position="0"> SMART is a fully-automatic document retrieval system operating on the IBM 7094 and 360 model 65. Unlike other computer-based retrieval systems, the SMART system does not rely on manually assigned key words or index terms for the identification of documents and search requests, nor does it use primarily the frequency of occurrence of certain words or phrases included in the texts of documents. Instead, an attempt is made to go beyond simple word-matching procedures by using a variety of intellectual aids in the form of synonym dictionaries, hierarchical arrangements of subject identifiers, statistical and syntactic phrase generation methods, and the like, in order to obtain the content identifications useful for the retrieval process.</Paragraph> <Paragraph position="1"> Stored documents and search requests are then processed without any prior manual analysis by one of several hundred automatic content analysis methods, and those documents which most nearly match a given search request are extracted from the document file in answer to the request. The system may be controlled by the user, in that a search request can be processed first in a standard mode; the user can then analyze the output obtained and, depending on his further requirements, order a reprocessing of the request under new conditions. The new output can again be examined and the process iterated until the right kind and amount of information are retrieved. [1,2,3] SMART is thus designed as an experimental automatic retrieval system of the kind that may become current in operational environments some years hence.
The following facilities, incorporated into the SMART system for purposes of document analysis, may be of principal interest: a) a system for separating English words into stems and affixes (the so-called suffix 's' and stem thesaurus methods) which can be used to construct document identifications consisting of the stems of words contained in the documents; b) a synonym dictionary, or thesaurus, which can be used to recognize synonyms by replacing each word stem by one or more &quot;concept&quot; numbers; these concept numbers then serve as content identifiers instead of the original word stems; c) a hierarchical arrangement of the concepts included in the thesaurus which makes it possible, given any concept number, to find its &quot;parents&quot; in the hierarchy, its &quot;sons&quot;, its &quot;brothers&quot;, and any of a set of possible cross references; the hierarchy can be used to obtain more general content identifiers than the ones originally given by going up in the hierarchy, more specific ones by going down, and a set of related ones by picking up brothers and cross-references; d) statistical procedures to compute similarity coefficients based on co-occurrences of concepts within the sentences of a given collection; the related concepts, determined by statistical association, can then be added to the originally available concepts to identify the various documents; e) syntactic analysis methods which make it possible to compare the syntactically analyzed sentences of documents and search requests with a pre-coded dictionary of syntactic structures (&quot;criterion trees&quot;) in such a way that the same concept number is assigned to a large number of semantically equivalent, but syntactically quite different constructions; f) statistical phrase matching methods which operate like the preceding syntactic phrase procedures, that is, by using a preconstructed dictionary to identify phrases used as content identifiers; however, no syntactic analysis is
performed in this case, and phrases are defined as equivalent if the concept numbers of all components match, regardless of the syntactic relationships between components; g) a dictionary updating system, designed to revise the several dictionaries included in the system: i) word stem dictionary ii) word suffix dictionary iii) common word dictionary (for words to be deleted during analysis) iv) thesaurus (synonym dictionary) v) concept hierarchy vi) statistical phrase dictionary vii) syntactic (&quot;criterion&quot;) phrase dictionary.</Paragraph> <Paragraph position="2"> The operations of the system are built around a supervisory system which decodes the input instructions and arranges the processing sequence in accordance with the instructions received. The SMART system's organization makes it possible to evaluate the effectiveness of the various processing methods by comparing the outputs produced by a variety of different runs. This is achieved by processing the same search requests against the same document collections several times, and making judicious changes in the analysis procedures between runs. In each case, the search effectiveness is evaluated by presenting paired comparisons of the average performance over many search requests for two given search and retrieval methodologies.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. The Evaluation of Language Analysis Methods </SectionTitle> <Paragraph position="0"> Many different criteria may suggest themselves for measuring the performance of an information system. In the evaluation work carried out with the SMART system, the effectiveness of an information system is assumed to depend on its ability to satisfy the users' information needs by retrieving wanted material, while rejecting unwanted items.
Two measures have been widely used for this purpose, known as recall and precision, and representing respectively the proportion of relevant material actually retrieved, and the proportion of retrieved material actually relevant. [3] (Ideally, all relevant items should be retrieved, while at the same time, all nonrelevant items should be rejected, as reflected by perfect recall and precision values equal to 1.)</Paragraph> <Paragraph position="1"> It should be noted that both the recall and precision figures achievable by a given system are adjustable, in the sense that a relaxation of the search conditions often leads to high recall, while a tightening of the search criteria leads to high precision. Unhappily, experience has shown that on the average recall and precision tend to vary inversely, since the retrieval of more relevant items normally also leads to the retrieval of more irrelevant ones. In practice, a compromise is usually made, and a performance level is chosen such that much of the relevant material is retrieved, while the number of nonrelevant items which are also retrieved is kept within tolerable limits.</Paragraph> <Paragraph position="2"> In theory, one might expect that the performance of a retrieval system would improve as the language analysis methods used for document and query processing become more sophisticated. In actual fact, this turns out not to be the case. A first indication of the fact that retrieval effectiveness does not vary directly with the complexity of the document or query analysis was provided by the output of the Aslib-Cranfield studies. This project tested a large variety of indexing languages in a retrieval environment, and came to the astonishing conclusion that the simplest type of indexing language would produce the best results.
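As an illustration of the two measures, the following sketch (in modern Python, not part of the original SMART system) computes recall and precision for a hypothetical run; all document identifiers are invented for the example.

```python
def recall_precision(retrieved, relevant):
    """Recall: proportion of relevant material actually retrieved.
    Precision: proportion of retrieved material actually relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were retrieved
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision

# Hypothetical run: 10 relevant documents, 8 retrieved, 6 in common.
relevant = {1, 2, 3, 4, 5, 6, 11, 12, 13, 14}
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}
r, p = recall_precision(retrieved, relevant)
print(r, p)  # 0.6 0.75
```

Relaxing the search condition (retrieving more documents) can only raise recall, while precision typically falls; this is the inverse relation between the two measures discussed in the text.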
[4] Specifically, three types of indexing languages were tested, called respectively single terms (that is, individual terms, or concepts assigned to documents and queries), controlled terms (that is, single terms assigned under the control of the well-known EJC Thesaurus of Engineering and Scientific Terms), and finally simple concepts (that is, phrases consisting of two or more single terms).</Paragraph> <Paragraph position="3"> The results of the Cranfield tests indicated that single terms are more effective for retrieval purposes than either controlled terms, or complete phrases. [4] These results might be dismissed as being due to certain peculiar test conditions if it were not for the fact that the results obtained with the automatic SMART retrieval system substantially confirm the earlier Cranfield output. [3] Specifically, the following basic conclusions can be drawn from the main SMART experiments: a) the simplest automatic language analysis procedure, consisting of the assignment to queries and documents of weighted word stems originally contained in these documents, produces a retrieval effectiveness almost equivalent to that obtained by intellectual indexing carried out manually under controlled conditions; [3,5] b) use of a thesaurus look-up process, designed to recognize synonyms and other term relations by replacing the original word stems by the corresponding thesaurus categories, improves the retrieval effectiveness by about ten percent in both recall and precision; c) additional, more sophisticated language analysis procedures, including the assignment of phrases instead of individual terms, the use of a concept hierarchy, the determination of syntactic relations between terms, and so on, do not, on the average, provide improvements over the standard thesaurus process.</Paragraph> <Paragraph position="4"> An example of a typical recall-precision graph produced by the SMART system is shown in Fig.
1, where a statistical phrase method is compared with a syntactic phrase procedure. In the former case, phrases are assigned as content identifiers to documents and queries whenever the individual phrase components are all present within a given document; in the latter case, the individual components must also exhibit an appropriate syntactic relationship before the phrase is assigned as an identifier. The output of Fig. 1 shows that the use of syntax degrades performance (the ideal performance region is in the upper right-hand corner of the graph, where both the recall and the precision are close to 1). Several arguments may explain the output of Fig. 1: a) the inadequacy of the syntactic analyzer used to generate syntactic phrases; b) the fact that phrases are often appropriate content identifiers even when the phrase components are not syntactically related in a given context (e.g. the sentence &quot;people who need information, require adequate retrieval services&quot; is adequately identified by the phrase &quot;information retrieval&quot;, even though the components are not related in the sentence); c) the variability of the user population, which makes it unwise to overspecify document content; d) the ambiguity inherent in natural language texts, which may work to advantage when attempting to satisfy the information needs of a heterogeneous user population with diverse information needs.</Paragraph> <Paragraph position="5"> These considerations may account for the fact that relatively simple content analysis methods are generally preferable in a retrieval environment to more sophisticated methods. The foreign language processing to be described in the remainder of this study must be viewed in the light of the foregoing test results.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4.
Multi-lingual Thesaurus </SectionTitle> <Paragraph position="0"> The multi-lingual text processing experiment is motivated by the following principal considerations: a) in typical American libraries up to fifty percent of the stored materials may not be in English; about fifty percent of the material processed in a test at the National Library of Medicine in Washington was not in English (of this, German accounted for about 25%, French for 23%, Italian for 13%, Russian for 11%, Japanese for 6%, Spanish for 5%, and Polish for 5%); [6] b) in certain statistical text processing experiments carried out with foreign language documents, the test results were about equally good for German as for English; [7] c) simple text processing methods appear to work well for English, and there is no a priori reason why they should not work equally well for another language.</Paragraph> <Paragraph position="1"> The basic multi-lingual system used for test purposes is outlined in Fig. 2. Document (or query) texts are looked up in a thesaurus and reduced to &quot;concept vector&quot; form; query vectors and document vectors are then compared, and document vectors sufficiently similar to the query are withdrawn from the file. In order to ensure that mixed language input is properly processed, the thesaurus must assign the same concept categories, no matter what the input language. The SMART system therefore utilizes a multi-lingual thesaurus in which one concept category corresponds both to a family of English words, or word stems, and to their German translation. A typical thesaurus excerpt is shown in Fig. 3, giving respectively concept numbers, English word class, and corresponding German word class. This thesaurus was produced by manually translating into German an originally available English version. Tables 1 and 2 show the results of the thesaurus look-up operation for the English and German versions of query QB 13.
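The look-up and comparison just described can be sketched as follows; the bilingual entries and concept numbers below are invented stand-ins for the actual thesaurus of Fig. 3, and cosine correlation is used as an assumed similarity measure.

```python
import math
from collections import Counter

# Toy bilingual thesaurus: English and German words map to the SAME
# concept number (entries are hypothetical, after the style of Fig. 3).
THESAURUS = {
    "automatic": 103, "automatische": 103,
    "information": 101, "informationen": 101,
    "retrieval": 102, "wiederauffindung": 102,
    "library": 104, "bibliothek": 104,
}

def concept_vector(text):
    """Reduce a text to concept-vector form: each word found in the
    thesaurus contributes its concept number; unknown words are skipped."""
    return Counter(THESAURUS[w] for w in text.lower().split() if w in THESAURUS)

def cosine(u, v):
    """Cosine correlation between two concept vectors."""
    dot = sum(u[c] * v[c] for c in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# An English query matched directly against a German document text:
query = concept_vector("automatic information retrieval")
doc = concept_vector("automatische wiederauffindung von informationen in der bibliothek")
print(round(cosine(query, doc), 3))  # 0.866
```

Because both languages land in the same concept space, queries in one language can be matched against documents in the other without any translation step at search time.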
The original query texts in three languages (English, French, and German) are shown in Fig. 4. It may be seen that seven out of nine &quot;English&quot; concepts are common with the German concept vector for the same query. In view of this, one may expect that the German query processed against the German thesaurus could be matched against English language documents as easily as the English version of the query. Tables 1 and 2 also show that more query words were not found during look-up in the German thesaurus than in the English one. This is due to the fact that only a preliminary, incomplete version of the German thesaurus was available at run time.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. Foreign Language Retrieval Experiment </SectionTitle> <Paragraph position="0"> To test the simple multi-lingual thesaurus process, two collections of documents in the area of library science and documentation (the Ispra collection) were processed against a set of 48 search requests in the documentation area. The English collection consisted of 1095 document abstracts, whereas the German collection contained only 468 document abstracts. The overlap between the two collections included 50 common documents. All 48 queries were originally available in English; they were manually translated</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> </SectionTitle> <Paragraph position="0"> into German by a native German speaker. The English queries were then processed against both the English and the German collections (runs E-E and E-G), and the same was done for the translated German queries (runs G-E and G-G, respectively). Relevance assessments were made for each English document abstract with respect to each English query by a set of eight American students in library science, and the assessors were not identical to the users who originally submitted the search requests.
The German relevance assessments (German documents against German queries), on the other hand, were obtained from a different, German-speaking, assessor.</Paragraph> <Paragraph position="1"> The principal evaluation results for the four runs using the thesaurus process are shown in Fig. 5, averaged over 48 queries in each case.</Paragraph> <Paragraph position="2"> It is clear from the output of Fig. 5 that the cross-language runs, E-G (English queries - German documents) and G-E (German queries - English documents), are not substantially inferior to the corresponding output within a single language (G-G and E-E, respectively), the difference being of the order of 0.02 to 0.03 for a given recall level. On the other hand, both runs using the German document collection are inferior to the runs with the English collection.</Paragraph> <Paragraph position="3"> The output of Fig. 5 leads to the following principal conclusions: a) the query processing is comparable in both languages; for if this were not the case, then one would expect one set of query runs to be much less effective than the other (that is, either E-E and E-G, or else G-G and G-E); b) the language processing methods (that is, thesaurus categories, suffix cut-off procedures, etc.)
are equally effective in both cases; if this were not the case, one would expect one of the single language runs to come out very poorly, but</Paragraph> <Paragraph position="5"> neither E-E, nor G-G came out as the poorest run; c) the cross-language runs are performed properly, for if this were not the case one would expect E-G and G-E to perform much less well than the runs within a single language; since this is not the case, the principal conclusion is then obvious that documents in one language can be matched against queries in another nearly as well as documents and queries in a single language; d) the runs using the German document collection (E-G and G-G) are less effective than those performed with the English collection; the indication is then apparent that some characteristic connected with the German document collection itself - for example, the type of abstract, or the language of the abstract, or the relevance assessments - requires improvement; the effectiveness of the cross-language processing, however, is not at issue.</Paragraph> <Paragraph position="6"> The overall performance of the language analysis is summarized in Table 3.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6. Failure Analysis </SectionTitle> <Paragraph position="0"> Since the query processing operates equally well in both languages, while the German document collection produces a degraded performance, it becomes worthwhile to examine the principal differences between the two document collections. These are summarized in Table 4.
The following principal distinctions arise: a) the organization of the thesaurus used to group words or word stems into thesaurus categories; b) the completeness of the thesaurus in terms of words included in it; c) the type of document abstracts included in the collection; d) the accuracy of the relevance assessments obtained from the collections.</Paragraph> <Paragraph position="1"> Concerning first the organization of the multi-lingual thesaurus, it does not appear that any essential difficulties arise on that account. This is confirmed by the fact that the cross-language runs operate satisfactorily, and by the output of Fig. 6 (a), comparing a German word stem run (using standard suffix cut-off and weighting procedures) with a German thesaurus run. It is seen that the German thesaurus improves performance over word stems for the German collection in the same way as the English thesaurus was seen earlier to improve retrieval effectiveness over the English word stem analysis. [2,3] The other thesaurus characteristic - that is, its completeness - appears to present a more serious problem. Table 4 shows that only approximately 6.5 English words per document abstract were not included in the English thesaurus, whereas over 15 words per abstract were missing from the German thesaurus. Obviously, if the missing words turn out to be important for content analysis purposes, the German abstracts will be more difficult to analyze than their English counterparts. A brief analysis confirms that many of the missing German words, which do not therefore produce concept numbers assignable to the documents, are indeed important for content identification. Fig. 7, listing the words not found for document 0059, shows that 12 out of 14 missing words appear to be important for the analysis of that document.
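The &quot;words not found&quot; statistic of Table 4 amounts to a simple count per abstract; the sketch below illustrates it, with a hypothetical thesaurus word list and common word list (mirroring the common word dictionary of Section 2) invented for the example.

```python
# Hypothetical word lists; a real run would use the full dictionaries.
THESAURUS_WORDS = {"information", "retrieval", "automatic", "document", "system"}
COMMON_WORDS = {"of", "the", "a", "in", "using"}  # deleted before look-up

def missing_words(abstract):
    """Return the content words of an abstract that fail thesaurus look-up."""
    words = abstract.lower().replace(",", " ").replace(".", " ").split()
    return [w for w in words
            if w not in COMMON_WORDS and w not in THESAURUS_WORDS]

abstract = "Automatic retrieval of document surrogates, using facet classification."
print(missing_words(abstract))  # ['surrogates', 'facet', 'classification']
```

The higher this count per abstract, the fewer concept numbers the document receives, which is why the less complete German thesaurus handicaps the German collection.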
It would therefore seem essential that a more complete thesaurus be used under operational conditions and for future experiments.</Paragraph> <Paragraph position="2"> The other two collection characteristics, including the type of abstracts and the accuracy of the relevance judgments, are more difficult to assess, since these are not subject to statistical analysis. It is a fact that for some of the German documents informative abstracts are not available. For example, the abstract for document 028, included in Fig. 8, indicates that the corresponding document is a conference proceedings; very little is known about the subject matter of the conference, but the document was nevertheless judged relevant to six different queries (nos. 17, 27, 31, 32, 52, and 53) dealing with subjects as diverse as &quot;behavioral studies of information system users&quot; (query 17) and &quot;the study of machine translation&quot; (query 27). One might quarrel with such relevance assessments, and with the inclusion of such documents in a test collection, particularly since Fig. 6 (b) shows that the German queries operate more effectively with the English collection (using English relevance assessments) than with the German assessments. However, earlier studies using a variety of relevance assessments with the same document collection have shown that recall-precision results are not affected by ordinary differences in relevance assessments. [8] For this reason, it would be premature to assume that the performance differences are primarily due to distinctions in the relevance assessments or in the collection make-up.</Paragraph> </Section> </Paper>