File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/96/p96-1003_metho.xml
Size: 27,426 bytes
Last Modified: 2025-10-06 14:14:18
<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1003"> <Title>Noun-Phrase Analysis in Unrestricted Text for Information Retrieval</Title> <Section position="4" start_page="17" end_page="17" type="metho"> <SectionTitle> 3. Need for shallow understanding </SectionTitle> <Paragraph position="0"> While the large amount of unrestricted text makes NLP for IR difficult, the fact that a deep and complete understanding of the text may not be necessary for IR makes NLP for IR easier than other NLP tasks such as machine translation. The goal of an IR system is essentially to classify documents (as relevant or irrelevant) vis-a-vis a query. Thus, a shallow and partial representation of the content of documents may suffice.</Paragraph> <Paragraph position="1"> Information retrieval thus poses the genuine challenge of processing large volumes of unrestricted natural-language text, but not necessarily at a deep level.</Paragraph> <Section position="1" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 1.3 Our Work </SectionTitle> <Paragraph position="0"> This paper reports on our evaluation of simple, yet robust and efficient, noun-phrase analysis techniques for enhancing phrase-based IR. In particular, we explored an extension of phrase-based indexing in the CLARIT system, using a hybrid approach to the extraction of meaningful (continuous or discontinuous) subcompounds from complex noun phrases that exploits both corpus statistics and linguistic heuristics. Using such subcompounds rather than whole noun phrases as indexing terms helps a phrase-based IR system solve the phrase normalization problem, that is, the problem of matching syntactically different, but semantically similar, phrases. The results of our experiments show that both recall and precision are improved by using extracted subcompounds for indexing.</Paragraph> </Section> </Section> <Section position="5" start_page="17" end_page="18" type="metho"> <SectionTitle> 2 Phrase-Based Indexing </SectionTitle> <Paragraph position="0"> The selection of appropriate indexing terms is critical to the improvement of both precision and recall in an IR task. The ideal indexing terms would directly represent the concepts in a document. Since 'concepts' are difficult to represent and extract (as well as to define), concept-based indexing is an elusive goal. Virtually all commercial IR systems (with the exception of the CLARIT system) index only on "words", since the identification of words in texts is typically easier and more efficient than the identification of more complex structures. However, single words are rarely specific enough to support accurate discrimination, and their groupings are often accidental. An often-cited example is the contrast between "junior college" and "college junior". Word-based indexing cannot distinguish the two phrases, though their meanings are quite different.</Paragraph> <Paragraph position="1"> Phrase-based indexing, on the other hand, as a step toward the ideal of concept-based indexing, can address such a case directly.</Paragraph> <Paragraph position="2"> Indeed, it is interesting to note that the use of phrases as index terms has increased dramatically among the systems that participate in the TREC evaluations. Even relatively traditional word-based systems are exploring the use of multi-word terms by supplementing words with statistical phrases--selected high-frequency adjacent word pairs (bigrams).
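<Paragraph> As a concrete illustration of such "statistical phrases", the sketch below counts adjacent word pairs and keeps the high-frequency ones as candidate multi-word index terms. The tokenizer and the frequency cutoff are illustrative assumptions, not parameters of any particular TREC system. </Paragraph>

```python
from collections import Counter

def statistical_phrases(texts, min_freq=25):
    """Select high-frequency adjacent word pairs (bigrams) as index terms."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()          # naive tokenization (assumption)
        counts.update(zip(words, words[1:]))  # adjacent word pairs
    return {pair for pair, n in counts.items() if n >= min_freq}

# With a large corpus, frequent pairs such as ("junior", "college")
# survive the cutoff and can be indexed as single multi-word terms.
```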
And a few systems, such as CLARIT--which uses simplex noun phrases, attested subphrases, and contained words as index terms--and New York University's TREC system--which uses "head-modifier pairs" derived from identified noun phrases--have demonstrated the practicality and effectiveness of thorough NLP in IR tasks.</Paragraph> <Paragraph position="3"> The experiences of the CLARIT system are instructive. By using selective NLP to identify simplex NPs, CLARIT generates phrases, subphrases, and individual words to use in indexing documents and queries. Such a first-order analysis of the linguistic structures in texts approximates concepts and affords us alternative methods for calculating the fit between documents and queries. In particular, we can choose to treat some phrasal structures as atomic units and others as additional information about (or representations of) content. There are immediate effects in improving precision:</Paragraph> </Section> <Section position="6" start_page="18" end_page="18" type="metho"> <SectionTitle> 2. Phrases can supplement word-level matches. </SectionTitle> <Paragraph position="0"> For example, if only the individual words "junior" and "college" are used for indexing, both "junior college" and "college junior" will match a query with the phrase "junior college" equally well. But if we also use the phrase "junior college" for indexing, then "junior college" will match better than "college junior", even though the latter will also receive some credit as a match at the word level.</Paragraph> <Paragraph position="1"> We can see, then, that it is desirable to distinguish--and, if possible, extract--two kinds of phrases: those that behave as lexical atoms and those that reflect more general linguistic relations.</Paragraph> <Paragraph position="2"> Lexical atoms help us by obviating the possibility of extraneous word matches that have nothing to do with true relevance. We do not want "hot" or "dog" to match on "hot dog". In essence, we want to eliminate the effect of the independence assumption at the word level by creating new words--the lexical atoms--in which the individual word dependencies are explicit (structural).</Paragraph> <Paragraph position="3"> More general phrases help us by adding detail. Indeed, all possible phrases (or paraphrases) of actual content in a document are potentially valuable in indexing. In practice, of course, the indexing term space has to be limited, so it is necessary to select a subset of phrases for indexing. Short phrases (often nominal compounds) are preferred over long complex phrases, because short phrases have better chances of matching short phrases in queries and will still match longer phrases owing to the short phrases they have in common. Using only short phrases also helps solve the phrase normalization problem of matching syntactically different long phrases when they share similar meaning (Smeaton, 1992).
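<Paragraph> The effect of supplementing word-level matches with phrase-level matches can be seen in a toy scoring function. The weights below are arbitrary assumptions, chosen only to show that a document containing "junior college" outranks one containing "college junior" against the query, while the latter still earns word-level credit. </Paragraph>

```python
def match_score(query_terms, doc_terms, phrase_weight=2.0, word_weight=1.0):
    """Score term overlap, crediting phrase matches more heavily than words."""
    score = 0.0
    for term in query_terms & doc_terms:
        score += phrase_weight if " " in term else word_weight
    return score

query = {"junior", "college", "junior college"}
doc_a = {"junior", "college", "junior college"}   # contains "junior college"
doc_b = {"junior", "college", "college junior"}   # contains "college junior"
assert match_score(query, doc_a) > match_score(query, doc_b) > 0
```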
Thus, lexical atoms and small nominal compounds should make good indexing phrases.</Paragraph> <Paragraph position="5"> While the CLARIT system does index at the level of phrases and subphrases, it does not currently index on lexical atoms or on the small compounds that can be derived from complex NPs, in particular those reflecting cross-simplex-NP dependency relations.</Paragraph> <Paragraph position="6"> Thus, for example, under normal CLARIT processing the phrase "the quality of surface of treated stainless steel strip" (an actual example from a U.S. patent document) would yield index terms such as "treated stainless steel strip", "treated stainless steel", "stainless steel strip", and "stainless steel" (as a phrase, not a lexical atom), along with all the relevant single-word terms in the phrase. But the process would not identify "stainless steel" as a potential lexical atom or find terms such as "surface quality", "strip surface", and "treated strip". To achieve more complete (and accurate) phrase-based indexing, we propose to use the following four kinds of phrases as indexing terms:</Paragraph> <Paragraph position="7">
1. Lexical atoms (e.g., "hot dog" or perhaps "stainless steel" in the example above)
2. Head-modifier pairs (e.g., "treated strip" and "steel strip" in the example above)
3. Subcompounds (e.g., "stainless steel strip" in the example above)
4. Cross-preposition modification pairs (e.g., "surface quality" in the example above)
</Paragraph> <Paragraph position="8"> In effect, we aim to augment CLARIT indexing with lexical atoms and with phrases capturing additional (discontinuous) modification relations beyond those that can be found within simplex NPs.</Paragraph> <Paragraph position="9"> It is clear that a certain level of robust and efficient noun-phrase analysis is needed to extract the above four kinds of small compounds from a large unrestricted corpus. In fact, the set of small compounds extracted from a noun phrase can be regarded as a weak representation of the meaning of the noun phrase, since each meaningful small compound captures part of that meaning. In this sense, extraction of such small compounds is a step toward a shallow interpretation of noun phrases. Such weak interpretation is useful for tasks like information retrieval, document classification, and thesaurus extraction, and indeed forms the basis in the CLARIT system for automated thesaurus discovery.</Paragraph> </Section> <Section position="8" start_page="18" end_page="21" type="metho"> <SectionTitle> 3 Methodology </SectionTitle> <Paragraph position="0"> Our task is to parse text into NPs, analyze the noun phrases, and extract the four kinds of small compounds given above. Our emphasis is on robust and efficient NLP techniques to support large-scale applications.</Paragraph> <Paragraph position="1"> For our purposes, we need to be able to identify all simplex and complex NPs in a text. Complex NPs are defined as sequences of simplex NPs that are associated with one another via prepositional phrases. We do not consider simplex NPs joined by relative clauses.</Paragraph>
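<Paragraph> As a rough sketch of this definition, a complex NP can be assembled by scanning the output of a simplex-NP chunker for simplex NPs linked by prepositions. The chunk representation below is an assumption made for illustration. </Paragraph>

```python
def complex_nps(chunks):
    """Group sequences of simplex NPs linked by prepositions.

    chunks: ('NP', phrase) or ('P', word) items, e.g. from a simplex-NP
    chunker.  Returns each complex NP as a list of its simplex NPs.
    """
    groups, current, linked = [], [], False
    for kind, value in chunks:
        if kind == "NP":
            if current and not linked:       # adjacent NP without a preposition
                if len(current) > 1:
                    groups.append(current)
                current = []
            current.append(value)
            linked = False
        elif kind == "P" and current:
            linked = True                    # preposition may link the next NP
        else:
            if len(current) > 1:
                groups.append(current)
            current, linked = [], False
    if len(current) > 1:
        groups.append(current)
    return groups

# "the quality of surface of treated stainless steel strip"
chunks = [("NP", "the quality"), ("P", "of"), ("NP", "surface"),
          ("P", "of"), ("NP", "treated stainless steel strip")]
print(complex_nps(chunks))
# [['the quality', 'surface', 'treated stainless steel strip']]
```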
<Paragraph position="2"> Our approach to NLP involves a hybrid use of corpus statistics supplemented by linguistic heuristics. We assume that there is no training data (making the approach more practically useful) and thus rely only on statistical information in the document database itself. This is different from many current statistical NLP techniques that require a training corpus. The volume of data we see in IR tasks also makes it impractical to use sophisticated statistical computations.</Paragraph> <Paragraph position="3"> The use of linguistic heuristics can assist statistical analysis in several ways. First, it can focus the use of statistics by helping to eliminate irrelevant structures from consideration. For example, syntactic category analysis can filter out impossible word modification pairs, such as [adjective, adjective] and [noun, adjective]. Second, it may improve the reliability of statistical decisions. For example, counting bigrams that occur only within noun phrases is more reliable for lexical atom discovery than counting all possible bigrams that occur in the corpus. Third, syntactic category analysis is helpful in adjusting cutoff parameters for statistics. For example, one useful heuristic is that we should use a higher threshold of reliability (evidence) for accepting the pair [adjective, noun] as a lexical atom than for the pair [noun, noun]: a noun-noun pair is much more likely to be a lexical atom than an adjective-noun one.</Paragraph> <Paragraph position="4"> The general process of phrase generation is illustrated in Figure 1. We used the CLARIT NLP module as a preprocessor to produce NPs with syntactic categories attached to words. We did not attempt to utilize CLARIT complex-NP generation or subphrase analysis, since we wanted to focus on the specific techniques for subphrase discovery that we describe in this paper.</Paragraph> <Paragraph position="5"> After preprocessing, the system works in two stages--parsing and generation. In the parsing stage, each simplex noun phrase in the corpus is parsed. In the generation stage, the structured noun phrase is used to generate candidates for all four kinds of small compounds, which are further tested for occurrence (validity) in the corpus.</Paragraph> <Paragraph position="6"> Parsing of simplex noun phrases is done in multiple phases. At each phase, noun phrases are partially parsed; the partially parsed structures are then used as input to start another phase of partial parsing. Each phase of partial parsing is completed by concatenating the most reliable modification pairs to form single units. The reliability of a modification pair is determined by a score based on frequency statistics and category analysis, and is further tested via local-optimum phrase analysis (described below). Lexical atoms are discovered at the same time, during simplex noun-phrase parsing.</Paragraph>
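<Paragraph> The multi-phase control structure just described might be sketched as follows. Here score_pairs stands in for the frequency- and category-based scoring (and local-optimum analysis) described in Section 3.2; the threshold and phase limit are illustrative assumptions. </Paragraph>

```python
def parse_corpus(nps, score_pairs, threshold=0.7, max_phases=10):
    """Iteratively concatenate the most reliable adjacent pairs in each NP.

    nps: list of NPs, each a list of units (words, or groups formed in
    earlier phases).  score_pairs(nps) -> {(u1, u2): reliability score}.
    """
    for _ in range(max_phases):
        scores = score_pairs(nps)              # corpus-wide statistics per phase
        merged_any = False
        for np in nps:
            i = 0
            while i < len(np) - 1:
                if scores.get((np[i], np[i + 1]), 0.0) >= threshold:
                    np[i:i + 2] = [np[i] + " " + np[i + 1]]  # merge into one unit
                    merged_any = True
                else:
                    i += 1
        if not merged_any:                     # fixed point: no more groupings
            break
    return nps
```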
<Paragraph position="7"> Phrase generation is quite simple. Once the structure of a noun phrase (with marked lexical atoms) is known, the four kinds of small compounds can easily be produced. Lexical atoms are already available. Head-modifier pairs can be extracted based on the modification relations implied by the structure. Subcompounds are just the substructures of the NP. Cross-preposition pairs are generated by enumerating all possible pairs of the heads of the simplex NPs within a complex NP, in backward order; Schwarz (1990) reports a similar strategy.</Paragraph> <Paragraph position="8"> To validate discontinuous compounds such as non-sequential head-modifier pairs and cross-preposition pairs, we use a standard technique of CLARIT processing, viz., we test any nominated compounds against the corpus itself. If we find independently attested (whole) simplex NPs that match the candidate compounds, we accept the candidates as index terms. Thus, for the NP "the quality of surface of treated stainless steel strip", the head-modifier pairs "treated strip", "stainless steel", "stainless strip", and "steel strip", and the cross-preposition pairs "strip surface", "surface quality", and "strip quality" would be generated as index terms only if we found independent evidence of such phrases in the corpus in the form of free-standing simplex NPs.</Paragraph>
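<Paragraph> A minimal sketch of the generation-plus-validation step for cross-preposition pairs follows; it assumes each simplex NP's head is its last word and that the corpus's free-standing simplex NPs have been collected into a set. </Paragraph>

```python
def validated_cross_preposition_pairs(simplex_nps, attested_nps):
    """Enumerate head pairs of a complex NP in backward order, then keep
    only candidates attested as free-standing simplex NPs in the corpus.

    simplex_nps: the simplex NPs of one complex NP, in surface order.
    attested_nps: set of simplex NPs observed on their own in the corpus.
    """
    heads = [np.split()[-1] for np in simplex_nps]   # head = last word (assumption)
    candidates = [heads[i] + " " + heads[j]
                  for i in range(len(heads) - 1, 0, -1)
                  for j in range(i - 1, -1, -1)]     # backward order
    return [c for c in candidates if c in attested_nps]

attested = {"strip surface", "surface quality", "strip quality"}
print(validated_cross_preposition_pairs(
    ["the quality", "surface", "treated stainless steel strip"], attested))
# ['strip surface', 'strip quality', 'surface quality']
```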
<Section position="1" start_page="19" end_page="21" type="sub_section"> <SectionTitle> 3.1 Lexical Atom Discovery </SectionTitle> <Paragraph position="0"> A lexical atom is a semantically coherent phrase unit. Lexical atoms may be found among proper names, idioms, and many noun-noun compounds. Usually they are two-word phrases, but sometimes they can consist of three or even more words, as in the case of proper names and technical terms. Examples of lexical atoms (in general English) are "hot dog", "tear gas", "part of speech", and "von Neumann".</Paragraph> <Paragraph position="1"> However, recognition of lexical atoms in free text is difficult. In particular, the relevant lexical atoms for a corpus of text will reflect the various discourse domains encompassed by the text. In a collection of medical documents, for example, "Wilson's disease" (an actual rheumatological disorder) may be used as a lexical atom, whereas in a collection of general news stories, "Wilson's disease" (a reference to the disease that Wilson has) may not be a lexical atom. Note that in the case of the medical usage, we would commonly find "Wilson's disease" as a bigram and we would not find, for example, "Wilson's severe disease" as a phrase, though the latter might well occur in the general news corpus.</Paragraph> <Paragraph position="2"> This example illustrates the essential observation that motivates our heuristics for identifying lexical atoms in a corpus: (1) the words in a lexical atom have strong association and thus tend to co-occur as a phrase, and (2) when the words in a lexical atom co-occur in a noun phrase, they are never or rarely separated.</Paragraph> <Paragraph position="3"> The detection of lexical atoms, like the parsing of simplex noun phrases, is done in multiple phases. At each phase, only two adjacent units are considered, so initially only two-word lexical atoms can be detected. But once a pair is determined to be a lexical atom, it behaves exactly like a single word in subsequent processing, so in later phases atoms with more than two words can be detected.</Paragraph> <Paragraph position="4"> Suppose the pair to test is [W1, W2]. The first heuristic is implemented by requiring the frequency of the pair to be higher than the frequency of any other pair that is formed by either word with other words in common contexts (within a simplex noun phrase). The intuition behind the test is that (1) in general, the high frequency of a bigram within a simplex noun phrase indicates strong association, and (2) we want to avoid the case where [W1, W2] has a high frequency but [W1, W2, W] (or [W, W1, W2]) has an even higher frequency, which implies that W2 (or W1) has a stronger association with W than with W1 (or W2, respectively). More precisely, we require that for every word W occurring in a common context with the pair,

F(W1, W2) > F(W2, W) and F(W1, W2) > F(W, W1),

where F(X, Y) and DF(X, Y) denote the contiguous and discontiguous frequencies of [X, Y], respectively, within a simplex noun phrase, i.e., the frequency of the patterns [...X, Y...] and [...X, ..., Y...], respectively.</Paragraph> <Paragraph position="5"> The second heuristic requires that we record all cases where the two words occur in simplex NPs and compare the number of times they occur as a strictly adjacent pair with the number of times they are separated. It is implemented simply by requiring that F(W1, W2) be much higher than DF(W1, W2) (where 'much higher' is determined by some threshold).</Paragraph> <Paragraph position="6"> Syntactic category analysis also helps filter out impossible lexical atoms and establish the threshold for passing the second test. Only the following category combinations are allowed for lexical atoms: [noun, noun], [noun, lexatom], [lexatom, noun], [adjective, noun], and [adjective, lexatom], where "lexatom" is the category for a detected lexical atom. For combinations other than [noun, noun], the threshold for passing the second test is high.</Paragraph> <Paragraph position="7"> In practice, the process effectively nominates phrases that are true atomic concepts (in a particular domain of discourse) or are used so consistently as unit concepts that they can safely be taken to be lexical atoms. For example, the lexical atoms extracted by this process from the CACM corpus (about 1 MB) include "operating system", "data structure", "decision table", "data base", "real time", "natural language", "on line", "least squares", "numerical integration", and "finite state automaton", among others.</Paragraph>
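<Paragraph> A sketch of the two tests, assuming the contiguous and discontiguous pair frequencies F and DF have been collected from the simplex NPs of the corpus; the separation thresholds are illustrative assumptions. </Paragraph>

```python
def is_lexical_atom(w1, w2, F, DF, cat, loose=0.1, strict=0.02):
    """Apply the two lexical-atom heuristics to the pair [w1, w2].

    F[(x, y)]  : contiguous frequency of [x, y] within simplex NPs
    DF[(x, y)] : discontiguous frequency, i.e. patterns [... x ... y ...]
    cat[u]     : syntactic category of a unit ('noun', 'adjective', 'lexatom')
    """
    allowed = {("noun", "noun"), ("noun", "lexatom"), ("lexatom", "noun"),
               ("adjective", "noun"), ("adjective", "lexatom")}
    cats = (cat[w1], cat[w2])
    if cats not in allowed:
        return False
    f = F.get((w1, w2), 0)
    if f == 0:
        return False
    # Heuristic 1: the pair must be more frequent than any rival pair
    # formed by either word with another word (simplified here to all
    # rival pairs rather than only those sharing a context).
    rivals = [n for (x, y), n in F.items()
              if (x, y) != (w1, w2) and (y == w1 or x == w2)]
    if rivals and f <= max(rivals):
        return False
    # Heuristic 2: the words must rarely occur separated; non-[noun, noun]
    # combinations face the stricter threshold.
    limit = loose if cats == ("noun", "noun") else strict
    return DF.get((w1, w2), 0) <= limit * f
```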
</Section> <Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.2 Bottom-Up Association-Based Parsing </SectionTitle> <Paragraph position="0"> Extended simplex noun-phrase parsing as developed in the CLARIT system, which we exploit in our process, works in multiple phases. At each phase, the corpus is parsed using the most specific (i.e., most recently created) lexicon of lexical atoms. New lexical atoms are added to the lexicon and reused as input to start another phase of parsing, until a complete parse is obtained for all the noun phrases.</Paragraph> <Paragraph position="1"> The idea of association-based parsing is that by repeatedly grouping words together based on association, we will eventually discover the most restrictive (and informative) structure of a noun phrase. For example, if we have evidence from the corpus that "high performance" is a more reliable association and "general purpose" a less reliable one, then the noun phrase "general purpose high performance computer" (an actual example from the CACM corpus) would undergo the following grouping process:

general purpose high performance computer
=> general purpose [high performance] computer
=> [general purpose] [high performance] computer
=> [general purpose] [[high performance] computer]
=> [[general purpose] [[high performance] computer]]
</Paragraph> <Paragraph position="2"> Word pairs are given an association score (S) according to the following rules. Scores provide evidence for groupings in our parsing process. Note that a smaller score means a stronger association.
1. Lexical atoms are given score 0. This gives the highest priority to lexical atoms.
2. The combination of an adverb with an adjective, past participle, or progressive verb is given score 0.
3. Syntactically impossible pairs are given score 100. This assigns the lowest priority to those pairs filtered out by syntactic category analysis. The 'impossible' combinations include pairs such as [noun, adjective], [noun, adverb], [adjective, adjective], [past-participle, adjective], [past-participle, adverb], and [past-participle, past-participle], among others.
4. Other pairs are scored according to the formulas given in Figure 2.
</Paragraph> <Paragraph position="3"> The association score (based principally on frequency) can sometimes be unreliable. For example, if the phrase "computer aided design" occurs frequently in a corpus, "aided design" may be judged a good association pair, even though "computer aided" might be a better pair. A problem may then arise when processing a phrase such as "program aided design": if "program aided" does not occur frequently in the corpus and we use frequency as the principal statistic, we may (incorrectly) be led to parse the phrase as "[program (aided design)]". One solution to such a problem is to recompute the bigram occurrence statistics after making each round of preferred associations. Thus, using the example above, if we first make the association "computer aided" everywhere it occurs, many instances of "aided design" will be removed from the corpus. Upon recalculation of the (free) bigram statistics, "aided design" will be demoted in value, and the false evidence for "aided design" as a preferred association in some contexts will be eliminated.</Paragraph> <Paragraph position="4"> The actual implementation of such a scheme requires multiple passes over the corpus to generate phrases. The first phrases chosen must always be the most reliable. To aid us in making such decisions, we have developed a metric for scoring preferred associations in their local NP contexts.</Paragraph> <Paragraph position="5"> To establish a preference metric, we use two statistics: (1) the frequency of the pair in the corpus, F(W1, W2), and (2) the number of times that the pair is locally dominant in any NP in which it occurs. A pair is locally dominant in an NP iff it has a stronger association (i.e., a smaller score) than either of the pairs that can be formed from contiguous other words in the NP. For example, in an NP with the sequence [X, Y, Z], we compare S(X, Y) with S(Y, Z); whichever pair has the stronger association is locally dominant.
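<Paragraph> Local dominance can be counted directly from the scored pairs. The sketch below records, for every adjacent pair over all NPs, its frequency and the number of NPs in which it is locally dominant; these are exactly the two statistics from which the preference score defined next is built. </Paragraph>

```python
from collections import Counter

def dominance_statistics(nps, S):
    """Collect pair frequency F and local dominance count LDC.

    nps: list of NPs, each a list of units.  S(x, y) -> association score,
    where a smaller score means a stronger association (Section 3.2).
    """
    ldc, freq = Counter(), Counter()
    for np in nps:
        for i in range(len(np) - 1):
            pair = (np[i], np[i + 1])
            freq[pair] += 1
            left = S(np[i - 1], np[i]) if i > 0 else float("inf")
            right = S(np[i + 1], np[i + 2]) if i + 2 < len(np) else float("inf")
            if S(*pair) <= min(left, right):   # stronger than both neighbours
                ldc[pair] += 1                 # (two-word NPs always qualify)
    return ldc, freq
```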
The preference score (PS) for a pair is determined by the ratio of its local dominance count (LDC)--the total number of cases in which the pair is locally dominant--to its frequency:

PS(W1, W2) = LDC(W1, W2) / F(W1, W2)

By definition, all two-word NPs score their pairs as locally dominant.</Paragraph> <Paragraph position="6"> In general, in each processing phase we make only those associations in the corpus where a pair's PS is above a specified threshold. If more than one association is possible (above threshold) in a particular NP, we make all possible associations, but in order of PS: the first grouping goes to the pair with the highest PS, and so on. In practice, we have used 0.7 as the threshold for most processing phases. (When the phrase data become sparse, e.g., after six or seven iterations of processing, it is desirable to reduce the threshold.)</Paragraph> </Section> </Section> <Section position="9" start_page="21" end_page="22" type="metho"> <SectionTitle> 4 Experiment </SectionTitle> <Paragraph position="0"> We tested the phrase extraction system (PES) by using it to index documents in an actual retrieval task. In particular, we substituted the PES for the default NLP module in the CLARIT system and then indexed a large corpus using the terms nominated by the PES, essentially the extracted small compounds and single words (but not words within a lexical atom). All other normal CLARIT processing--weighting of terms, division of documents into subdocuments (passages), vector-space modeling, etc.--was used in its default mode. As a baseline for comparison, we used standard CLARIT processing of the same corpus, with the NLP module set to return full NPs and their contained words (and no further subphrase analysis).</Paragraph> <Paragraph position="1"> The corpus used is a 240-megabyte collection of Associated Press newswire stories from 1989 (AP89), taken from the set of TREC corpora. There are about 3 million simplex NPs in the corpus and about 1.5 million complex NPs. For evaluation, we used TREC queries 51-100, each of which is a relatively long description of an information need. Queries were processed by the PES and the normal CLARIT NLP module, respectively, to generate query terms, which were then used for CLARIT retrieval.</Paragraph> <Paragraph position="2"> To quantify the effects of PES processing, we used the standard IR evaluation measures of recall and precision. Recall measures how many of the relevant documents have actually been retrieved; precision measures how many of the retrieved documents are indeed relevant. For example, if the total number of relevant documents is N and the system returns M documents of which K are relevant, then

recall = K / N and precision = K / M.

We used the judged-relevant documents from the TREC evaluations as the gold standard in scoring the performance of the two processes.</Paragraph> <Paragraph position="3"> The observed improvement in recall and precision suggests that the PES could be used to support other IR enhancements, such as automatic feedback of the top-returned documents to expand the initial query for a second retrieval step.</Paragraph> </Section> </Paper>