<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0207"> <Title>Pasteur's Quadrant, Computational Linguistics, LSA, Education</Title> <Section position="4" start_page="0" end_page="20" type="metho"> <SectionTitle> 3 The History and Nature of LSA </SectionTitle> <Paragraph position="0"> In the early eighties the management of Bell Telephone Laboratories, where I was working, asked me to form a group to find out why secretaries in the legal department were having trouble using UNIX, an obvious godsend, and fix them. This led to trying to find out why customers sometimes couldn't find what they wanted in the Yellow Pages, why service representatives didn't always give correct charges even though they were plainly stated in well-indexed manuals, and why the new online databases for parts and circuits required so much training and yielded only small gains in speed and accuracy, if any.</Paragraph> <Paragraph position="1"> We undertook a series of lab experiments whose details are skippable. What we discovered was this. In every case the words that people wanted to use, to give orders to computers, or to look things up, rarely matched the words the computer understood or the manuals were indexed by. Roughly, but almost always, the data could be summarized as: ask 100 people by what one word something should be called and you will get 30 different answers. The distribution is such that it takes five words to cover half the answers. We called this the problem of &quot;verbal disagreement&quot; (Furnas et al., 1987).</Paragraph> <Paragraph position="2"> Our first solution was brute force: find all the words people would use for what we called an &quot;information object&quot; and index by all of them, which we called &quot;unlimited aliasing&quot; (what do you think the chances are that anyone else would have named them that way?). Later, largely led by George Furnas (1985), we invented some ways to semi-automate that process by what he called &quot;adaptive indexing&quot;, having the computer ask people if the words they had used unsuccessfully should be added as pointers to things they eventually found. Of course, we also worried about the problem of ambiguity, now known as &quot;the Google problem&quot;, that almost every word has several very different meanings that will lead you astray. At least under some circumstances that was fixable by giving more context in the response, one version of which is Furnas' &quot;fisheye view&quot;, to guide navigation. (Adaptive indexing also greatly reduces the ambiguity problem because the pointers are one way--from what people said to the one thing they actually told the systems they meant.) So what had we done here? We'd used the practical problem to lead to empirical exploration of how people actually used words in daily life (although computers were not as much a part of daily life then as now, and some of their persisting problems may be due to our failure to get our solutions widely adopted. Here I am, still trying.) The surprising extent and universality of verbal disagreement could be viewed as a baby step in language science, at least as we construed language science.</Paragraph> <Paragraph position="3"> But just pinning down the nature of the problem in the statistics of actual pragmatic word usage (we called the new field &quot;statistical semantics&quot;, which didn't catch on) was only a start. Clearly the problems that computers were having understanding what people meant are special to computers. People understand each other much better. 
(People also have trouble, although less, with queries of one or two isolated words, but they are very good at using baseline statistics of what people mean by a word--which is, of course, Google's stock in trade, using an indirect version of adaptive indexing--and they appear to use context, when available, in a much more efficient manner, although this still needs research in the style of statistical semantics.) What was needed was a way to mimic what people do so well--understand all the meanings of all the words they know, and know just how much, and in what way, any word is related to any other. It is perfectly obvious that people learn the meanings of the words in their language, and only slightly less so that they must do so primarily from experiencing the words in context and from how they are used in combination to produce emergent meanings. With these facts and clues in mind, the next step was to find computational techniques to do something similar, and see if it improved a computer's understanding.</Paragraph> <Paragraph position="4"> (An apology is in order for idiosyncratic use of the words &quot;meaning&quot;, &quot;understanding&quot;, and &quot;semantics&quot;. They are used here in special senses that differ from myriad usages in linguistics and philosophy, and may offend some readers. Because detailed definitions and circumlocutions would be burdensome and of little value, let us leave it to context.) The best method we hit upon was what is now called Latent Semantic Analysis, LSA (or, in information retrieval, Latent Semantic Indexing, LSI). Because there have been some misinterpretations in the literature it may be useful to give a conceptual explanation of how LSA works. It assumes that the meaning of a passage (in practice typically a paragraph) can be approximated by the sum of the meanings of its words. That makes a large print corpus a huge system of simultaneous linear equations. To solve such systems we used the matrix algebraic technique of Singular Value Decomposition (SVD), the general method behind factor analysis and principal components analysis. Applied to a corpus of text, the result is a vector standing for every word in the corpus, with any passage represented by the vector sum of its word vectors. (At first we could only do that with rather small corpora, but with improved algorithms and hardware, size is no longer a barrier.) The first applications of LSA were to information retrieval, which we conceived of as a problem in the psychology of meaning: how to measure the similarity in meaning, to a human, of a query and a document, given pervasive verbal disagreement. The method was to compute the similarity of corresponding vectors, typically by the cosine of their angle in a very high-dimensional &quot;semantic space&quot;. The result was that, everything else equal (e.g. tokenizing, term-weighting, etc.), LSI gave about 20% better precision-for-recall results, largely because it could rightly judge meaning similarity despite differences in literal word use. It also works in any language, and handles cross-language retrieval handily, because its numerical vectors don't care whether the &quot;words&quot; are Chinese characters or Arabic script. 
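To make the computation just described concrete, here is a minimal sketch in Python (a sketch only, under simplifying assumptions: plain numpy, a toy word-by-passage count matrix, no term weighting, and illustrative helper names such as lsa_vectors and passage_vector that are not from the original work):

# Minimal LSA sketch: SVD of a word-by-passage count matrix gives a vector
# for every word; a passage is the vector sum of its word vectors; similarity
# is the cosine between vectors.  Toy sizes throughout.
import numpy as np

def lsa_vectors(passages, k=2):
    # passages: list of token lists.  Returns (vocab, index, word_vecs),
    # where word_vecs has one k-dimensional row per vocabulary word.
    # In practice k is around 300-500 and a log/entropy term weighting
    # is applied to the counts first.
    vocab = sorted({w for p in passages for w in p})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(passages)))
    for j, p in enumerate(passages):
        for w in p:
            counts[index[w], j] += 1
    u, s, _vt = np.linalg.svd(counts, full_matrices=False)
    word_vecs = u[:, :k] * s[:k]
    return vocab, index, word_vecs

def passage_vector(tokens, index, word_vecs):
    # The additivity assumption: a passage is the sum of its word vectors.
    rows = [index[w] for w in tokens if w in index]
    return word_vecs[rows].sum(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

A query and a document are then compared by taking the cosine of their passage vectors and ranking documents by that similarity; the number of retained dimensions k plays the role of the three-hundred-to-five-hundred-dimensional spaces discussed below.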
If the training corpus contains a moderate number of known good translations, and is processed correctly, such cross-language retrieval does pretty well with no other help.</Paragraph> <Paragraph position="5"> Along the way we discovered that choosing the right number of dimensions--the number of (independent) elements composing each vector--was critical, three hundred to five hundred being strongly optimal. One way of describing the value of reducing the number of dimensions well below the number of word types or passages is that it forces the system to induce relations between every word and every other rather than keeping track of the full pattern of empirical occurrences of each, as standard vector retrieval methods do.</Paragraph> <Paragraph position="6"> Because we like to think we are trying to model human minds as well as solve practical problems, we have also tested LSA on a variety of human tasks. For word meaning an early test was to give it a standardized multiple-choice vocabulary test (it chooses the word with the most similar meaning by computing which has the highest cosine).</Paragraph> <Paragraph position="7"> Trained on text of similar volume and context to what an American high school senior has read, it does well on the Test of English as a Foreign Language (TOEFL), equaling successful non-native applicants to U.S. colleges. It also mimics the astounding ten-word-per-day vocabulary growth of middle school children as measured by multiple-choice tests. To evaluate its representations of passage meaning, perhaps the most interesting and quantitative tests have been through its use in scoring the conceptual content of expository essays. In actual essay scoring systems we use a suite of analytic tools that includes other measures besides LSA. However, for the present purpose we need to consider how well LSA does when used alone. In doing this, LSA is used to predict the score a human grader would give a new essay on the basis of its similarity to other essays on the same topic that have previously been scored by humans. The LSA-based score predicts the human grade very nearly as well as does the score of a second, independent human reader. Several other pieces of evidence of passage-to-passage success will be described later.</Paragraph> <Paragraph position="8"> The astute reader will be puzzled by how this could happen, given the very strong simplification of LSA's additivity assumption, by which word order within passages is completely ignored. We will return to this matter, and to more on essay grading, later.</Paragraph> <Paragraph position="9"> Before going on, a few more common misinterpretations of LSA need dealing with. First, LSA is not a measure of co-occurrence, at least as co-occurrence is usually conceived. For LSA a passage meaning is the combination of its word meanings.</Paragraph> <Paragraph position="10"> This does not imply that the words in a passage have the same meaning; indeed that would not be very useful. 
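As a small illustration of that distinction, the raw joint count and the LSA cosine are computed from quite different quantities; the following toy sketch reuses the illustrative lsa_vectors and cosine helpers from the earlier sketch (the example words are hypothetical):

# Raw co-occurrence vs. LSA similarity for a pair of words.  The count sees
# only passages containing both words; the cosine compares the words'
# positions in the reduced space, which is shaped by all the contexts in
# which each word ever appears.
def joint_passage_count(w1, w2, passages):
    return sum(1 for p in passages if w1 in p and w2 in p)

def word_cosine(w1, w2, index, word_vecs):
    return cosine(word_vecs[index[w1]], word_vecs[index[w2]])

# Two words that never share a passage (joint count 0) can still receive a
# high cosine if they are used in similar contexts, e.g. near-synonyms that
# each appear with the same neighboring vocabulary in different passages.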
Empirically, over a typical large corpus, the correlation between the cosine of a random pair of words and the number of passages in which they both occurred is +.35, while the correlation with how often they occur separately, which by the usual interpretation should make them dissimilar, is +.30.</Paragraph> <Paragraph position="11"> By the same token--unlike n-gram language models--LSA estimates the probability that one word will follow another only indirectly and very weakly.</Paragraph> <Paragraph position="12"> (Although, surprisingly, LSA similarities have recently been shown to account for much of what goes on in recalling word lists in order, but not by conditional probability effects (Howard and Kahana, 2001).) More correct interpretations are that LSA reflects the degree to which two words could substitute for one another in similar contexts, that they tend to appear in similar (but not necessarily identical) contexts, and, most precisely, that they have the same effects on passage meanings.</Paragraph> <Paragraph position="13"> Now what about the fact that LSA ignores word order and thus all syntactically conveyed grammatical effects on sentential meaning? First, it needs emphasis that LSA is very good at measuring the similarity of two words or two passages, sometimes good on sentence-to-sentence similarity and sometimes not, and least good on word-to-sentence or word-to-passage meanings. A good and bad feature of its word-to-word function is that it merges all contextual effects (different senses) of a word into a frequency-weighted average. LSA, as a theory of psychological meaning, proposes that a word is represented as a single central meaning that is modified by context (see Kintsch (2002) for how this could play out in predication and metaphor). The reason it does well on passage-to-passage is that passages are redundant and complex, and that local syntactic effects tend to average out. (This is true for humans too--e.g., they ignore misplaced &quot;not&quot;s.) LSA should be used with all of this in mind.</Paragraph> <Paragraph position="14"> However, still, you might say, LSA's lack of understanding of predication, attachment, binding, and constituent structure, and thus of representation of logical propositions--all traditional foci of linguistic semantics and computational linguistics--must surely weaken if not cripple it. Weaken surely, but by how much? Here is one &quot;ballpark&quot; estimate. A typical college-educated adult understands around 100,000 word forms, and an average sentence contains around 20 tokens. There are thus log2(100,000^20) = 332 bits of information in word choice alone. There are 20! = 2.4 x 10^18 possible orders of 20 words, for an additional maximum of 61 bits from syntactic effects. Of the possible information in a sentence, then, the part that bag-of-words LSA can use is 332/(61 + 332) = 84%.</Paragraph> <Paragraph position="15"> A substantial amount of human meaning is missing from LSA, but a much larger component is apparently captured. It turns out that, judiciously applied, this component can be quite useful.</Paragraph> <Paragraph position="16"> Moreover, applying it can help pin down the roles of what's missing and not and thus advance our understanding of the nature of language as used. 
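For readers who want to check the arithmetic of the ballpark estimate above, here is a small sketch using only the round numbers assumed in the text (a 100,000-word vocabulary and 20-token sentences):

# Ballpark split of sentence information into word choice vs. word order.
import math

vocab_size, sentence_len = 100_000, 20
choice_bits = sentence_len * math.log2(vocab_size)     # 20 * log2(100,000) ~ 332
order_bits = math.log2(math.factorial(sentence_len))   # log2(20!) ~ 61
share = choice_bits / (choice_bits + order_bits)        # ~ 0.84
print(round(choice_bits), round(order_bits), round(100 * share))  # 332 61 84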
Some more and some less successful applications to education are described next, along with some implications and some radical conjectures.</Paragraph> </Section> <Section position="5" start_page="20" end_page="20" type="metho"> <SectionTitle> 4 Applications of LSA in Education </SectionTitle> <Paragraph position="0"> First, a few more words on the use of LSA in information retrieval (IR), which is relevant to some educational applications described later, and in essay scoring. What LSA captures in IR is the degree to which two documents are about the same thing, independent of what equivalent wording may be used. Thus it is useful for finding documents that talk about something, even though it misses details--sometimes important ones--about what was said about the matter. What kind of computation might achieve a representation of the rest? To achieve a high degree of validity in representing word meaning, LSA uses only information on how words are used; it does not need to assume or identify more primitive semantic features. A possible hint from its success may be that the meaning of groups of words in their order may also rely entirely on how they relate to other groups of words in their orders. (Unpublished work of the psychologist Simon Dennis is pushing in this direction with very interesting results.) Could it be possible that word strings themselves actually are the deepest, most fundamental representation of verbal meaning, not some more abstract underlying primitive entities or structures? In essay grading, LSA information turns out to be almost, but not quite, enough. In practice we add a number of primarily statistical measures, for example n-gram model estimates of how well the words have been ordered relative to standard English statistics. The remarkable thing is that even without any explicit extraction or representation of the logic or propositions in the essays, the methods usually produce slightly more reliable scores than do humans. Is it possible that merely the joint choosing of a set of words and a normative order for arranging them (including nonlinear interactions) suffices to convey all that's needed, without needing any deeper level of representation? Clearly, this is very doubtful, but perhaps worth thinking about. LSA's text analysis and matching capability, originally devised for IR, has found several fairly direct applications in education. One automatically measures the overlap between the content of courses by the text in their exams--agreeing well with teacher judgments on samples. This is used to help rationalize curricula. Another relates the content of job tasks, training materials, and work histories, all by placing their verbal descriptions in the same semantic space, and uses the results to assign people to jobs and just-in-time compensatory training. A new application automatically matches test items and learning materials to state achievement standards, with high agreement with human experts. Another automatically finds best-sentence summaries and categories as an aid for meta-data tagging of learning objects. A kind of inversion of the LSA representation automatically generates candidate keywords.</Paragraph> <Paragraph position="1"> The closest relative to essay grading is LSA's role in the Summary Street program. 
In this application students read 4-10 page educational documents, then write 100-200 word summaries.</Paragraph> <Paragraph position="2"> Using LSA, the system tells the student how well the summary covers each section of the document and how coherent it is--by measuring the similarity of successive sentences--and marks redundant and irrelevant sentences. (Interestingly, experiments have shown that students learn more from text that is coherent, but not excessively so, and LSA can be used to determine the right degree, although no working application has yet been built around the capability.) Another version of the Summary Street and essay analysis technology is a web-based tool that scores short essays written to summarize or discuss the content of chapters of college textbooks, providing feedback on what sections to re-read to improve coverage.</Paragraph> <Paragraph position="3"> A somewhat different manner of extending LSA's text analytic properties lies behind another group of applications. Suppose that a student reads a document about the human heart, then wants to choose another to read that will best advance her knowledge. Experiments have shown that the greatest learning will occur if the next reading introduces neither too little nor too much new knowledge. We call this the Goldilocks principle. By LSA analysis of how all of a set of materials on a topic are related to one another it is possible to accurately place them on a continuum of conceptual sophistication and automatically choose optimum steps. For a large electronic maintenance model currently under development, the technique is being generalized to provide optimum paths to knowledge, in which users choose a starting place and a target procedure they want to know, and the system picks a sequence of sections to read that is intended to introduce the needed information in an effective and efficient order for understanding. Combined with fisheye views, adaptive indexing, meaning-based LSA search, embedded LSA-based constructed-response assessments, and other guidance features, the system is a sort of midway, automatically constructed, intelligent tutor.</Paragraph> <Paragraph position="4"> Still another application combines aspects of the search and essay evaluation techniques to act as a kind of automated mentor for a collaborative learning environment. Its most interesting capabilities are monitoring and continuously assessing the content of the individual and the total group contributions, connecting individuals with others who have made comments about similar things, posting alerts when the discussion wanders, reaching out to repositories for materials relevant to a discussion--both on request and autonomously--and measuring the formation of consensus. In one small experiment, the system's automatic evaluation of individual content contributions over a semester had a correlation of .9 with independent ratings by participating instructors. Still more applications are just entering the research stage. One set is stimulated by the widely perceived inadequacy of multiple-choice testing; students need to be able to think of answers, not just choose someone else's. The goal is to replace, for example, missing-word multiple-choice vocabulary tests with ones in which the student supplies the word and the system evaluates how well it fits.</Paragraph> <Paragraph position="5"> That's enough for successes. What about failures and limitations, what they teach, and where they point research? 
First, it is true that many laboratory tasks can reveal shortcomings and errors in LSA. Incorrect measures of similarity occur especially for sentence-to-sentence comparisons in which syntax has strong effects, where broader contextual information or pragmatic intent is involved, and where word meanings have strong relations to perceptual sources to which LSA training has had no access. In some of these cases, it is reasonable to suppose that the basic theoretical foundation is sound but the training data is not sufficient. In other cases it is fairly obvious that more fundamental limitations are at fault, such as the lack of a purely computational process by which to contextually disambiguate the phenomena traditionally described as multiple word senses.</Paragraph> <Paragraph position="6"> But what about the lessons from trying to solve educational problems promised earlier? There are two glaring examples. One is scoring answers to math problems, or mathematical answers to problems in physics and chemistry (never mind questions requiring drawings or diagrams), something we are frequently asked to do. Syntactic expressions with abstract symbols, where order is critical to logic and bindings are arbitrary, are simply beyond the powers of LSA. How to get them into a fully computational model--one that does not use human help in the form, for example, of manually constructed rules that natural humans could not know, and preferably one in which the system learns the capability from the same interaction with the world that humans do--is the challenge to computational cognitive psychology and linguistics that forcefully presents itself, and whose solution could not help but require important new scientific knowledge about language.</Paragraph> <Paragraph position="7"> A second educational soft spot for LSA is its weakness on sentences. It would almost certainly be better to be able to treat the meaning of an essay as the combination of the meaning of its sentences and the propositional information that order, both within and between sentences, helps to convey.</Paragraph> <Paragraph position="8"> Moreover, simply scoring short answers, another frequent request, is problematic. The usual LSA-based methods are not useless, but they fall significantly short of human reliabilities. There seem to be two issues involved. One is again the necessity of accounting for syntax, especially for negation, quantification, and binding. &quot;The price of cloth will go up and the cost of plastics down&quot; is not handled by LSA. The other is that short-answer questions often require very specific responses in which some words must be literal entities and others admit of synonyms, circumlocutions and ambiguity. No one has found a way to match humans without adding what we consider ad hoc methods, rules and triggers devised and coded by people who know the answer. What we want is a fully computational method that might be a possible model of how natural human minds represent knowledge and turn it into an answer of a few words or sentences that can be reliably evaluated by a human who has also learned the needed knowledge in a computationally realistic way. Finding one is another strong challenge whose successful attack would almost have to reveal new scientific truth.</Paragraph> <Paragraph position="9"> Finally, it is worth noting that LSA has until very recently relied exclusively on SVD for its central engine. 
There are certainly other possibilities for doing the same job, and perhaps for doing it better, and for doing more. For example, several new matrix decomposition methods (that's what LSA is) have recently been devised that have interesting new properties, such as more interpretable representations. Other new approaches use entirely different computations; for example, the model of Simon Dennis mentioned earlier relies on string-edit theory, computing what operations it takes to change one sentence into another. There is no room here, and as yet too few results, to warrant a review of these, but it is clear that the exploration of innovative computational models of language, ones that, like LSA, are quite different in spirit from linguistic tradition, is being pushed by a desire to solve practical problems, featuring especially ones in education, and that the effort has not nearly reached its limits.</Paragraph> </Section> </Paper>