File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/92/c92-2069_metho.xml
Size: 32,467 bytes
Last Modified: 2025-10-06 14:12:53
<?xml version="1.0" standalone="yes"?> <Paper uid="C92-2069"> <Title>A LINEAR LEAST SQUARES FIT MAPPING METHOD FOR INFORMATION RETRIEVAL FROM NATURAL LANGUAGE TEXTS YIMING YANG</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> A LINEAR LEAST SQUARES FIT MAPPING METHOD FOR INFORMATION RETRIEVAL FROM NATURAL LANGUAGE TEXTS YIMING YANG CHRISTOPHER G. CHUTE </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> This paper describes a unique method for mapping natural language texts to canonical terms that identify the contents of the texts. This method learns empirical associations between free-form texts and canonical terms from human-assigned matches and determines a Linear Least Squares Fit (LLSF) mapping function which represents weighted connections between words in the texts and the canonical terms. The mapping function enables us to project an arbitrary text to the canonical term space where the &quot;transformed&quot; text is compared with the terms, and similarity scores are obtained which quantify the relevance between the text and the terms. This approach has superior power to discover synonyms or related terms and to preserve the context sensitivity of the mapping. We achieved a rate of 84% in both recall and precision with a testing set of 6,913 texts, outperforming other techniques including string matching (15%), morphological parsing (17%) and statistical weighting (21%).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> A common need in natural language information retrieval is to identify the information in free-form texts using a selected set of canonical terms, so that the texts can be retrieved by conventional database techniques using these terms as keywords. In medical classification, for example, original diagnoses written by physicians in patient records need to be classified into canonical disease categories which are specified for the purposes of research, quality improvement, or billing. We will use medical examples for discussion although our method is not limited to medical applications.</Paragraph> <Paragraph position="1"> String matching is a straightforward solution to automatic mapping from texts to canonical terms. Here we use &quot;term&quot; to mean a canonical description of a concept, which is often a noun phrase. Given a text (a &quot;query&quot;) and a set of canonical terms, string matching counts the common words or phrases in the text and the terms, and chooses the term containing the largest overlap as most relevant. Although it is a simple and therefore widely used technique, a poor success rate (typically 15% - 20%) is observed [1]. String-matching-based methods suffer from the problems known as &quot;too little&quot; and &quot;too many&quot;. As an example of the former, high blood pressure and hypertension are synonyms but a straightforward string matching cannot capture the equivalence in meaning because there is no common word in these two expressions.
On the other hand, there are many terms which do share some words with the query high blood pressure, such as high head at term, fetal blood loss, etc.; these terms would be found by a string matcher although they are conceptually distant from the query. Human-defined synonyms or terminology thesauri have been tried as a semantic solution for the &quot;too little&quot; problem [2] [3]. It may significantly improve the mapping if the right set of synonyms or thesaurus is available. However, as Salton pointed out [4], there is &quot;no guarantee that a thesaurus tailored to a particular text collection can be usefully adapted to another collection. As a result, it has not been possible to obtain reliable improvements in retrieval effectiveness by using thesauruses with a variety of different document collections&quot;.</Paragraph> <Paragraph position="2"> Salton has addressed the problem from a different angle, using statistics of word frequencies in a corpus to estimate word importance and reduce the &quot;too many&quot; irrelevant terms [5]. The idea is that &quot;meaningful&quot; words should count more in the mapping while unimportant words should count less. Although word counting is technically simple and this idea is commonly used in existing information retrieval systems, it inherits the basic weakness of surface string matching. That is, words used in queries but not occurring in the term collection have no effect on the mapping, even if they are synonyms of important concepts in the term collection.</Paragraph> <Paragraph position="3"> Besides, these word weights are determined regardless of the contexts where words have been used, so the lack of sensitivity to contexts is another weakness.</Paragraph> <Paragraph position="4"> We focus our efforts on an algorithmic solution for achieving the functionality of terminology thesauri and semantic weights without requiring human effort in identifying synonyms. We seek to capture such knowledge through samples representing its usage in various contexts, e.g. diagnosis texts with expert-assigned canonical terms collected from the Mayo Clinic patient record archive. We propose a numerical method, a &quot;Linear Least Squares Fit&quot; mapping model, which enables us to obtain mapping functions based on the large collection of known matches and then use these functions to determine the relevant canonical terms for an arbitrary text.</Paragraph> [Figure 1. (a) Text/term pairs and the matrix representation, matrix A and matrix B (as far as recoverable from the garbled figure: high grade carotid ulceration / artery rupture; high grade glioma / malignant neoplasm; stomach rupture / gastric injury). (b) An LLSF solution W of the linear system WA = B, whose columns correspond to the source words carotid, glioma, grade, high, rupture, stomach, ulceration.] <Paragraph position="5"> 2. Computing an LLSF mapping function We consider a mapping between two languages, i.e. from a set of texts to a set of canonical terms. We call the former the source language and the latter the target language. For convenience we refer to an item in the source language (a diagnosis) as &quot;text&quot;, and an item in the target language (a canonical description of a disease category) as &quot;canonical term&quot; or &quot;term&quot;.
We use &quot;text&quot; or &quot;term&quot; in a loose sense, in that it may be a paragraph, a sentence, one or more phrases, or simply a word. Since we do not restrict the syntax, there is no difference between a text and a term; both of them are treated as a set of words.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 A numerical representation of texts </SectionTitle> <Paragraph position="0"> In mathematics, there are well-established numerical methods to approximate unknown functions using known data. Applying this idea to our text-to-term mapping, the known data are text/term pairs and the unknown function we want to determine is a correct (or nearly correct) text-to-term mapping for not only the texts included in the given pairs, but also for the texts which are not included. We need a numerical representation for such a computation.</Paragraph> <Paragraph position="1"> Vectors and matrices have been used for representing natural language texts in information retrieval systems for decades [5]. We employ such a representation in our model as shown in Figure 1 (a). Matrix A is a set of texts, matrix B is a set of terms; each column in A represents an individual text and the corresponding column of B represents the matched term. Rows in these matrices correspond to words and cells contain the numbers of times words occur in corresponding texts or terms.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 The mapping function </SectionTitle> <Paragraph position="0"> Having matrix A and B, we are ready to compute the mapping function by solving the equation WA = B where W is the unknown function. The solution W, if it exists, should satisfy all the given text/term pairs, i.e. the equation $W\vec{a}_i = \vec{b}_i$ holds for $i = 1, \dots, k$, where k is the number of text/term pairs, $\vec{a}_i$ ($n \times 1$) is a text vector, a column of A; $\vec{b}_i$ ($m \times 1$) is a term vector, the corresponding column in B; n is the number of distinct source words and m is the number of distinct target words.</Paragraph> <Paragraph position="1"> Solving WA = B can be straightforward using techniques of solving linear equations if the system is consistent. Unfortunately the linear system WA = B does not always have a solution because there are only $m \times n$ unknowns in W, but the number of given vector pairs may be arbitrarily large and form an inconsistent system. The problem therefore needs to be modified as a Linear Least Squares Fit which always has at least one solution.</Paragraph> <Paragraph position="2"> Definition 1. The LLSF problem is to find W which minimizes the sum</Paragraph> <Paragraph position="3"> $$\sum_{i=1}^{k} \|W\vec{a}_i - \vec{b}_i\|_2^2 = \|WA - B\|_F^2$$ </Paragraph> <Paragraph position="4"> where $\vec{e}_i \stackrel{\mathrm{def}}{=} W\vec{a}_i - \vec{b}_i$ is the mapping error of the $i$th text/term pair; the notation $\|\cdot\|_2$ is the vector 2-norm, defined as $\|\vec{v}\|_2^2 = \sum_{i=1}^{m} v_i^2$ where $\vec{v}$ is $m \times 1$; $\|\cdot\|_F$ is the Frobenius matrix norm, defined as</Paragraph> <Paragraph position="5"> $$\|M\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{k} m_{ij}^2$$ </Paragraph> <Paragraph position="6"> and M is $m \times k$.</Paragraph> <Paragraph position="7"> The meaning of the LLSF problem is to find the mapping function W that minimizes the total mapping errors for a given text/term pair collection (the &quot;training set&quot;).
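The representation of Section 2.1 and the objective of Definition 1 can be made concrete with a short sketch. It is illustrative only and not the paper's implementation (which was written in C++ and Perl); NumPy, the helper name build_matrices, and the three training pairs (reconstructed from the Figure 1 example) are our assumptions.

```python
# Illustrative sketch only: build the word-by-text matrix A and the word-by-term
# matrix B of Section 2.1.  Rows are distinct words, columns are the training
# pairs, and cells hold the number of times a word occurs in a text or term.
import numpy as np

def build_matrices(pairs):
    """Return (A, B, source_vocab, target_vocab) for a list of (text, term) pairs."""
    src_vocab = sorted({w for text, _ in pairs for w in text.split()})
    tgt_vocab = sorted({w for _, term in pairs for w in term.split()})
    A = np.zeros((len(src_vocab), len(pairs)))   # n x k
    B = np.zeros((len(tgt_vocab), len(pairs)))   # m x k
    for j, (text, term) in enumerate(pairs):
        for w in text.split():
            A[src_vocab.index(w), j] += 1
        for w in term.split():
            B[tgt_vocab.index(w), j] += 1
    return A, B, src_vocab, tgt_vocab

# Training pairs as we read them off Figure 1 (a reconstruction, not verbatim).
pairs = [
    ("high grade carotid ulceration", "artery rupture"),
    ("high grade glioma", "malignant neoplasm"),
    ("stomach rupture", "gastric injury"),
]
A, B, src_vocab, tgt_vocab = build_matrices(pairs)
print(src_vocab)          # ['carotid', 'glioma', 'grade', 'high', 'rupture', 'stomach', 'ulceration']
print(A.shape, B.shape)   # (7, 3) and (6, 3): n = 7 source words, m = 6 target words, k = 3 pairs
```

With A and B in this form, Definition 1 is an ordinary linear least squares problem over the rows of W; the SVD-based recipe for solving it is given in Section 2.3 below.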
The underlying semantics of the transformation $\vec{b}_i = W\vec{a}_i$ is to &quot;translate&quot; the meaning of each source word in the text into a set of target words with weights, and then linearly combine the translations of individual words to obtain the translation of the whole text. Figure 1 (b) is the W obtained from matrix A and B in (a). The columns of W correspond to source words, the rows correspond to target words, and the cells are the weights of word-to-word connections between the two languages. A little algebra will show that vector $\vec{b}_i = W\vec{a}_i$ is the sum of the column vectors in W which correspond to the source words in the text.</Paragraph> <Paragraph position="8"> The weights in W are optimally determined according to the training set. Note that the weights do not depend on the literal meanings of words. For example, the source word glioma has positive connections of 0.5 to both the target words malignant and neoplasm, showing that these different words are related to a certain degree. On the other hand, rupture is a word shared by both the source language and the target language, but the source word rupture and the target word rupture have a connection weight of 0 because the two words do not co-occur in any of the text/term pairs in the training set. Negative weight is also possible for words that do not co-occur and its function is to preserve the context sensitivity of the mapping. For example, high grade in the context of high grade carotid ulceration does not lead to a match with malignant neoplasm, as it would if it were used in the context high grade glioma, because this ambiguity is cancelled by the negative weights. Readers can easily verify this by adding the corresponding column vectors of W for these two different contexts.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 The computation </SectionTitle> <Paragraph position="0"> A conventional method for solving the LLSF is to use singular value decomposition (SVD) [6] [7]. Since mathematics is not the focus of this paper, we simply outline the computation without proof.</Paragraph> <Paragraph position="1"> Given matrix A ($n \times k$) and B ($m \times k$), the computation of an LLSF for WA = B consists of the following steps: (1) Compute an SVD of A, yielding matrices U, S and V: if $n > k$, decompose A such that $A = USV^T$; if $n < k$, decompose the transpose $A^T$ such that $A^T = VSU^T$, where U ($n \times p$) and V ($k \times p$) contain the left and right singular vectors, respectively, and $V^T$ is the transpose of V; S is a diagonal matrix ($p \times p$) which contains the p non-zero singular values $s_1 \geq s_2 \geq \dots \geq s_p > 0$ and $p \leq \min(k, n)$; (2) Compute the mapping function $W = BVS^{-1}U^T$, where $S^{-1} = \mathrm{diag}(1/s_1, 1/s_2, \dots, 1/s_p)$. 3. Mapping arbitrary queries to canonical terms The LLSF mapping consists of the following steps: (1) Given an arbitrary text (a &quot;query&quot;), first form a query vector, $\vec{x}$, in the source vector space.</Paragraph> <Paragraph position="2"> A query vector is similar to a column of matrix A, whose elements contain the numbers of times source words occur in the query. A query may also contain some words which are not in the source language; we ignore these words because no meaningful connections with them are provided by the mapping function.
As an example, the query severe stomach ulceration is converted into the vector $\vec{x} = (0\ 0\ 0\ 0\ 0\ 1\ 1)$.</Paragraph> <Paragraph position="3"> (2) Transform the source vector $\vec{x}$ into $\vec{y} = W\vec{x}$ in the target space.</Paragraph> <Paragraph position="4"> In our example, $\vec{y} = W\vec{x} = (0.375\ 0.5\ 0.5\ {-0.25}\ {-0.25}\ 0.375)$. Differing from text vectors in A and term vectors in B, the elements (coefficients) of $\vec{y}$ are not limited to non-negative integers. These numbers show how the meaning of a query distributes over the words in the target language.</Paragraph> <Paragraph position="5"> (3) Compare query-term similarity for all the term vectors and find the relevant terms.</Paragraph> <Paragraph position="6"> In linear algebra, cosine-theta (or dot-product) is a common measure for obtaining vector similarity. It is also widely accepted by the information retrieval community using vector-based techniques because of the reasonable underlying intuition: it captures the similarity of texts by counting the similarity of individual words and then summarizing them. We use the cosine value to evaluate query-term similarity, defined as below. Definition 2. Let $\vec{y} = (y_1, y_2, \dots, y_m)$ be the query vector in the target space and $\vec{v} = (v_1, v_2, \dots, v_m)$ be a term vector in the target space; then</Paragraph> <Paragraph position="7"> $$\cos\theta(\vec{y}, \vec{v}) = \frac{\sum_{i=1}^{m} y_i v_i}{\sqrt{\sum_{i=1}^{m} y_i^2}\,\sqrt{\sum_{i=1}^{m} v_i^2}}.$$ </Paragraph> <Paragraph position="8"> In order to find the closest match, we need to compare with all the term vectors. We use C to denote the matrix of these vectors, distinct from matrix B which represents the term collection in the training set. In general only a subset of terms are contained in a training set, so C has more columns than the unique columns of B. Furthermore, C could have more rows than B because of the larger vocabulary. However, since only the words in B have meaningful connections in the LLSF mapping function, we use the words in B to form a reduced target language and trim C into the same rows as B. Words not in the reduced target language are ignored.</Paragraph> <Paragraph position="9"> An exhaustive comparison of the query-term similarity values provides a ranked list of all the terms with respect to a query. A retrieval threshold can be chosen for drawing a line between relevant and irrelevant. Since relevance is often a relative concept, the choice of the threshold is left to the application or experiment.</Paragraph> <Paragraph position="10"> A potential weakness of this method is that the term vectors in matrix C are all surface-based (representing word occurrence frequency only) and are not affected by the training set or the mapping function. This weakness can be attenuated by a refined mapping method using a reverse mapping function R which is an LLSF solution of the linear system RB = A. The refinement is described in a separate paper [8].</Paragraph> </Section> </Section>
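The SVD recipe of Section 2.3 and the three retrieval steps of Section 3 fit into one short end-to-end sketch. Again this is only an illustration under our own assumptions (NumPy instead of the paper's C++ matrix library, the tiny training set reconstructed from Figure 1, and the training terms standing in for the full term matrix C); under these assumptions it reproduces the transformed query vector quoted above.

```python
# Illustrative end-to-end sketch of Sections 2.3 and 3 (not the authors' code).
import numpy as np

pairs = [("high grade carotid ulceration", "artery rupture"),
         ("high grade glioma", "malignant neoplasm"),
         ("stomach rupture", "gastric injury")]
src_vocab = sorted({w for text, _ in pairs for w in text.split()})
tgt_vocab = sorted({w for _, term in pairs for w in term.split()})

def counts(texts, vocab):
    """Word-count matrix: column j holds the counts of texts[j] over vocab."""
    M = np.zeros((len(vocab), len(texts)))
    for j, text in enumerate(texts):
        for w in text.split():
            if w in vocab:                      # out-of-vocabulary words are ignored
                M[vocab.index(w), j] += 1
    return M

A = counts([text for text, _ in pairs], src_vocab)    # n x k
B = counts([term for _, term in pairs], tgt_vocab)    # m x k

# Section 2.3: SVD of A, then W = B V S^{-1} U^T over the non-zero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
p = int(np.sum(s > 1e-10))
W = B @ Vt[:p].T @ np.diag(1.0 / s[:p]) @ U[:, :p].T  # m x n mapping function

# Section 3, steps (1)-(3): query vector, projection, cosine ranking against C.
x = counts(["severe stomach ulceration"], src_vocab)[:, 0]
y = W @ x                                             # approx. (0.375 0.5 0.5 -0.25 -0.25 0.375)

C = counts([term for _, term in pairs], tgt_vocab)    # candidate term vectors
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ranking = sorted(((cosine(y, C[:, j]), term) for j, (_, term) in enumerate(pairs)), reverse=True)
print(ranking[0][1])                                  # top choice: 'gastric injury'
```

The query word severe falls outside the source vocabulary and is simply dropped, exactly as step (1) prescribes; on a realistic collection W is computed once offline and reused for every query, as described in the next section.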
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. The results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The primary test </SectionTitle> <Paragraph position="0"> We tested our method with texts collected from patient records of Mayo Clinic. The patient records include diagnoses (DXs) written by physicians, operative reports written by surgeons, etc. The original texts need to be classified into canonical categories and about 1.5 million patient records are coded by human experts each year. We arbitrarily chose the cardiovascular disease subset from the 1990 surgical records for our primary test. After human editing to separate these texts from irrelevant parts in the patient records and to clarify the one-to-one correspondence between DXs and canonical terms, we obtained a set of 6,913 DX/term pairs. The target language consists of 376 canonical names of cardiovascular diseases as defined in the classification system ICD-9-CM [9]. A simple preprocessing was applied to remove punctuation and numbers, but no stemming or removal of non-discriminative words was used.</Paragraph> <Paragraph position="1"> We split the 6,913 DXs into two halves, called &quot;odd-half&quot; and &quot;even-half&quot;. The odd-half was used as the training set, the even-half was used as queries, and the expert-assigned canonical terms of the even-half were used to evaluate the effectiveness of the LLSF mapping.</Paragraph> <Paragraph position="2"> We used conventional measures in the evaluation: recall and precision, defined as $$\text{recall} = \frac{\text{terms retrieved and relevant}}{\text{total terms relevant}}, \qquad \text{precision} = \frac{\text{terms retrieved and relevant}}{\text{total terms retrieved}}.$$ For the query set of the even-half, we had a recall rate of 84% when the top choice only was counted and 96% recall among the top five choices. We also tested the odd-half, i.e. the training set itself, as queries and had a recall of 92% with the top choice and 99% with the top five. In our testing set, each text has one and only one relevant (or correct) canonical term, so the recall is always the same as the precision at the top choice.</Paragraph> <Paragraph position="3"> Our experimental system is implemented as a combination of C++, Perl and UNIX shell programming.</Paragraph> <Paragraph position="4"> For SVD, currently we use a matrix library in C++ [10] which implements the same algorithm as in LINPACK [11]. A test with 3,457 pairs in the training set took about 4.45 hours on a SUN SPARCstation 2 to compute the mapping functions W and R. Since the computation of the mapping function is only needed once until the data collection is renewed, a real time response is not required.
Term retrieval took 0.45 sec or less per query and was satisfactory for practical needs.</Paragraph> <Paragraph position="5"> Two person-days of human editing were needed for preparing the testing set of the 6,913 DXs.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 The comparison </SectionTitle> <Paragraph position="0"> For comparing our method with other approaches, we did additional tests with the same query set, the even-half (3,456 DXs), and matched it against the same term set, the 376 ICD-9-CM disease categories.</Paragraph> <Paragraph position="1"> For the test of a string matching method, we formed one matrix for all the 3,456 texts and the 376 terms, and used the cosine measure for computing the similarities.</Paragraph> <Paragraph position="2"> Only a 15% recall and precision rate was obtained at the top choice threshold.</Paragraph> <Paragraph position="3"> For testing the effect of linguistic canonicalization, we employed a morphological parser developed by the Evans group at CMU [12] (and refined by our group by adding synonyms) which covers over 10,000 lexical variants.</Paragraph> <Paragraph position="4"> We used it as a preprocessor which converted lexical variants to word roots, expanded abbreviations to full spellings, recognized non-discriminative categories such as conjunctions and prepositions and removed them, and converted synonyms into canonical terms. Both the texts and the terms were parsed, and then the string matching as mentioned above was applied. The recall (and precision) rate was 17% (i.e. only 2% improvement), indicating that lexical canonicalization does not solve the crucial part of the problem; obviously, very little information was captured. Although synonyms were also used, they were a small collection and not especially favorable for the cardiovascular diseases.</Paragraph> <Paragraph position="5"> For testing the effectiveness of statistical weighting, we ran the SMART system (version 10) developed at Cornell by Salton's group on our testing set. Two weighting schemes, one using term frequency and another using a combination of term frequency and &quot;inverse document frequency&quot;, were tested with default parameters; 20% and 21% recall rates (top choice) were obtained, respectively. An interactive scheme using user feedback for improvement is also provided in SMART, but our tests did not include that option.</Paragraph> <Paragraph position="6"> For further analysis we checked the vocabulary overlap between the query set and the term set. Only 20% of the source words were covered by the target words, which partly explains the unsatisfactory results of the above methods. [Table 1 notes, partially recovered: (2) the &quot;odd-half&quot; (3,457 DXs) was used as the training set in the LLSF tests, which formed a source language including 945 distinct words and a target language (reduced) including 376 unique canonical terms and 224 distinct words; (3) the refined mapping method mentioned in Section 3 was used in the LLSF tests.]
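The recall and precision figures reported throughout Section 4 (top choice and top five choices) can be computed with a small helper like the one below; the function names are ours and the snippet is a sketch of the measure only, not code from the paper.

```python
# Sketch of the top-n evaluation used in Section 4 (hypothetical helper names).
# Each query has exactly one correct canonical term, so recall equals precision
# when only the top choice is retrieved.
def hit_within(ranked_terms, correct_term, n):
    """1.0 if the correct term appears among the first n ranked terms, else 0.0."""
    return 1.0 if correct_term in ranked_terms[:n] else 0.0

def recall_at(results, n):
    """results: list of (ranked_terms, correct_term) pairs, one per query."""
    return sum(hit_within(ranked, correct, n) for ranked, correct in results) / len(results)

# Example with a single query whose correct term is ranked first.
results = [(["gastric injury", "artery rupture", "malignant neoplasm"], "gastric injury")]
print(recall_at(results, 1), recall_at(results, 5))   # 1.0 1.0
```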
[Figure 2. Sample results of the DX-to-term mapping using the LLSF and a string matching method; columns: diagnosis written by physician, term found by a string matching, term found by the LLSF mapping; bold: word effective in the string matching.] Since they are all surface-based approaches, only 20% of the query words were effectively used and roughly 80% of the information was ignored.</Paragraph> <Paragraph position="7"> These approaches share a common weakness in that they cannot capture the implicit meaning of words (or only captured a little), and this seems to be a crucial problem.</Paragraph> <Paragraph position="8"> The LLSF method, on the other hand, does not have such disadvantages. First, since the training set and the query set were from the same data collection, a much higher vocabulary coverage of 67% was obtained.</Paragraph> <Paragraph position="9"> Second, the 67% source words were further connected to their synonyms or related words by the LLSF mapping, according to the matches in the training set. Not only word co-occurrence, but also the contexts (sets of words) where the words have been used, were taken into account in the computation of weights; these connections were therefore context-sensitive. As a result, the 67% word coverage achieved an 84% recall and precision rate (top choice), outperforming the other methods by 63% or more. Table 1 summarizes these tests.</Paragraph> <Paragraph position="10"> Figure 2 shows some sample results where each query is listed with the top choice by the LLSF mapping and the top choice by the string matching. All the terms chosen by the LLSF mapping agreed with expert-assigned matches. It is evident that the LLSF mapping successfully captures the semantic associations between the different surface expressions whereas the string matching failed completely or missed important information.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5. Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Impact to computational linguistics </SectionTitle> <Paragraph position="0"> Recognizing word meanings or underlying concepts in natural language texts is a major focus in computational linguistics, especially in applied natural language processing such as information retrieval. Lexico-syntactic approaches have had limited achievement because lexical canonicalization and syntactic categorization cannot capture much information about the implicit meaning of words and surface expressions. Knowledge-based approaches using semantic thesauri or networks, on the other hand, lead to the fundamental question about what should be put in a knowledge base. Is a general knowledge base for unrestricted subject areas realistic? If unlikely, then what should be chosen for a domain-specific or application-specific knowledge base?
Is there a systematic way to avoid the ad hoc decisions or the inconsistency that have often been involved in human development of semantic classes and the relationships between them? No clear answers have been given for these questions.</Paragraph> <Paragraph position="1"> The LLSF method gives an effective solution for capturing semantic implications between surface expressions. The word-to-word connections between two languages capture synonyms and related terms with respect to the contexts given in the text/term pairs of the training set.</Paragraph> <Paragraph position="2"> Furthermore, by taking a training set from the same data collection as the queries, the knowledge (semantics) is self-restricted, i.e. domain-specific, application-specific and user-group-specific. No symbolic representation of the knowledge is involved nor necessary, so subjective decisions by humans are avoided. As a result, the 67%-69% improvement over the string matching and the morphological parsing is evidence of our assertions.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Difference from other vector-based methods </SectionTitle> <Paragraph position="0"> The use of vector/matrix representation, cosine measure and SVD makes our approach look similar to other vector-based methods, e.g. Salton's statistical weighting scheme and Deerwester's Latent Semantic Indexing (LSI) [13] which uses a word-document matrix and a truncated SVD technique to adjust word weights in document retrieval. However, there is a fundamental difference in that they focus on word weights based on counting word occurrence frequencies in a text collection, so only the words that appeared in queries and documents (terms in our context) have an effect on the retrieval. On the other hand, we focus on the weights of word-to-word connections between two languages, not weights of words; our computation is based on the information of human-assigned matches, the word co-occurrence and the contexts in the text/term pairs, not simply word occurrence frequencies. Our approach has an advantage in capturing synonyms or terms semantically related at various degrees and this makes a significant difference. As we discussed above, only 20% of query words were covered by the target words. So even if the statistical methods could find optimal weights for these words, the majority of the information was still ignored, and as a result, the top choice recall and precision rate of SMART did not exceed 20% by much. Our tests with the LSI were mentioned in a separate paper [14]; the results were not better than SMART or the string matching method discussed above.</Paragraph> <Paragraph position="1"> In short, besides the surface characteristics such as using a matrix, cosine-theta and SVD, the LLSF mapping uses different information and solves the problem on a different scale.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Potential applications </SectionTitle> <Paragraph position="0"> We have demonstrated the success of the LLSF mapping in medical classification, but our method is not limited to this application. An attractive and practical application is automatic indexing of text databases and a retrieval using these indexing terms.
As most existing text databases use human-assigned keywords for indexing documents, large numbers of document/term pairs can be easily collected and used as training sets. The obtained LLSF mapping functions then can be used for automatic document indexing with or without human monitoring and refinement. Queries for retrieval can be mapped to the indexing terms using the same mapping functions and the rest of the task is simply a keyword-based search.</Paragraph> <Paragraph position="1"> Another interesting potential is machine translation.</Paragraph> <Paragraph position="2"> Brown [15] proposed a statistical approach for machine translation which used word-to-word translation probabilities between two languages. They had about three million pairs of English-French sentences but the difficult problem was to break the sentence-to-sentence association down to word-to-word. While they had a sophisticated algorithm to determine an alignment of word connections with maximum probability, it required estimation and re-estimation of possible alignments. Our LLSF mapping appears to have a great opportunity to discover the optimal word-to-word translation probabilities, according to the English-French sentence pairs but without requiring any subjective estimations.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Other aspects </SectionTitle> <Paragraph position="0"> Several questions deserve a short discussion: is the word a good choice for the basis of the LLSF vector space? Is the LLSF the only choice or the best choice for a numerical mapping? The word is not the only choice as the basis. We use it as a suitable starting point and for computational efficiency. We also treat some special phrases such as Acquired Immunodeficiency Syndrome as a single word, by putting hyphens between the words in a pre-formatting step.</Paragraph> <Paragraph position="1"> An alternative choice to using words is to use noun phrases for invoking more syntactic constraints. While it may improve the precision of the mapping (how much is unclear), a combinatorial increase of the problem size is the trade-off.</Paragraph> <Paragraph position="2"> Linear fit is a theoretical limitation of the LLSF mapping method. More powerful mapping functions are used in some neural networks [16]. However, the fact that the LLSF mapping is simple, fast to compute, and has well known mathematical properties makes it preferable at this stage of research. There are other numerical methods possible, e.g. using polynomial fit instead of linear fit, or using interpolation (going through points) instead of least squares fit, etc. The LLSF model demonstrated the power of numerical extraction of the knowledge from human-assigned mapping results, and finding the optimal solution among different fitting methods is a matter of implementation and experimentation.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> Acknowledgement </SectionTitle> <Paragraph position="0"> We would like to thank Tony Plate and Kent Bailey for fruitful discussions and Geoffrey Atkin for programming.</Paragraph> </Section> </Paper>