<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0302">
  <Title>ROBUST TEXT PROCESSING IN AUTOMATED INFORMATION RETRIEVAL</Title>
  <Section position="4" start_page="9" end_page="9" type="metho">
    <SectionTitle>
OVERALL DESIGN
</SectionTitle>
    <Paragraph position="0"> Our information retrieval system consists of a traditional statistical backbone (NIST's PRISE system; Harman and Candela, 1989) augmented with various natural language processing components that assist the system in database processing (stemming, indexing, word and phrase clustering, selectional restrictions), and translate a user's information request into an effective query. This design is a careful compromise between purely statistical non-linguistic approaches and those requiring rather accomplished (and expensive) semantic analysis of data~ often referred to as 'conceptual retrieval'.</Paragraph>
    <Paragraph position="1"> In our system the database text is first processed with a fast syntactic parser. Subsequently certain types of phrases are extracted from the parse  trees and used as compound indexing terms in addition to single-word terms. The extracted phrases are statistically analyzed as syntactic contexts in order to discover a variety of similarity links between smaller subphrases and words occurring in them. A further filtering process maps these similarity links onto semantic relations (generalization, specialization, synonymy, etc.) after which they are used to transform user's request into a search query.</Paragraph>
    <Paragraph position="2"> The user's natural language request is also parsed, and all indexing terms occurring in them are identified. Certain highly ambiguous, usually single-word terms may be dropped, provided that they also occur as elements in some compound terms. At the same time, other terms may be added, namely those which are linked to some query term through admissible similarity relations. For example, &amp;quot;unlawful activity&amp;quot; is added to a query containing the compound term &amp;quot;illegal activity&amp;quot; via a synonymy link between &amp;quot;illegal&amp;quot; and &amp;quot;unlawful&amp;quot;. After the final query is constructed, the database search follows, and a ranked list of documents is returned.</Paragraph>
    <Paragraph position="3"> The purpose of this elaborate linguistic processing is to create a better representation of documents and to generate best possible queries out of user's initial requests. Despite limitations of termand-weight type representation (or boolean versions thereof), very good queries can be produced by human experts. In order to imitate an expert, the system must be able to learn about its database, in particular about various correlations among index terms.</Paragraph>
  </Section>
  <Section position="5" start_page="9" end_page="11" type="metho">
    <SectionTitle>
FAST PARSING WITH TTP PARSER
</SectionTitle>
    <Paragraph position="0"> &amp;quot;I'I'P (Tagged Text Parser) is based on the Linguistic String Grammar developed by Sager (1981). The parser currently encompasses some 400 grammar productions, but it is by no means complete.</Paragraph>
    <Paragraph position="1"> The parser's output is a regularized parse tree representation of each sentence, that is, a representation that reflects the sentence's logical predicate-argument structure. For example, logical subject and logical object are identified in both passive and active sentences, and noun phrases are organized around their head elements. The significance of this representation will be discussed below. The parser is equipped with a powerful skip-and-fit recovery mechanism that allows it to operate effectively in the faze of ill-formed input or under a severe time pressure. In the runs with approximately 83 million words of TREC's Wall Street Journal texts~ the parser's 4 Approximately 0.5 GBytes of text. over 4 million senteilcC/~. null speed averaged between 0.3 and 0.5 seconds per sentence, or up to 4200 words per minute, on a Sun's SparcStation-2.</Paragraph>
    <Paragraph position="2"> 'I'I'P is a full grammar parser, and initially, it attempts to generate a complete analysis for each sentence. However, unlike an ordinary parser, it has a built-in timer which regulates the amount of time allowed for parsing any one sentence. If a parse is not returned before the allotted time elapses, the parser enters the skip-and-fit mode in which it will try to &amp;quot;fit&amp;quot; the parse. While in the skip-and-fit mode. the parser will attempt to forcibly reduce incomplete constituents, possibly skipping portions of input in order to restart processing at a next unattempted constituent. In other words, the parser will favor reduction to backtracking while in the skip-and-fit mode. The result of this strategy is an approximate parse, partially fitted using top-down predictions. The fragments skipped in the first pass are not thrown out, instead they are analyzed by a simple phrasal parser that looks for noun phrases and relative clauses and then attaches the recovered material to the main parse structure. As an illustration, consider the following sentence taken from the CACM-3204 corpus: The method is illustrated by the automatic construction of both reeursive and iterative programs operating on natural numbers, lists, and trees, in order to construct a program satisfying certain specifications a theorem induced by those specifications is proved, and the destred program is extracted from the proof.</Paragraph>
    <Paragraph position="3"> The italicized fragment is likely to cause additional complications in parsing this lengthy string, and the parser may be better off ignoring this fragment altogether. To do so successfully, the parser must close the currently open constituent (i.e., reduce a program satisfying certain specifications to NP), and possibly a few of its parent constituents, removing corresponding productions from further consideration, until an appropriate production is reactivated. In this case, TIP may force the following reductions: SI -~ to V NP, SA --~ SI; S -.~ NP V NP SA, until the production S ~ S and S is reached. Next, the parser skips input to find and, and resumes normal processing. null As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the iexical level ambiguity must be removed from the input text. prior to parsing. We achieve this using a stochastic parts of speech tagger to preprocess the text. Full details of the parser can be found in (Strzalkowski, 1992).</Paragraph>
  </Section>
  <Section position="6" start_page="11" end_page="11" type="metho">
    <SectionTitle>
PART OF SPEECH TAGGER
</SectionTitle>
    <Paragraph position="0"> One way of dealing with lexical ambiguity is to use a tagger to preprocess the input marking each word with a tag that indicates its syntactic categorization: a part of speech with selected morphological features such as number, tense, mode, case and degree. The following are tagged sentences from the CACM-32(M collection: 5 The/dt paper/nn presents/vbz aldt proposal/nn for~in structured/vbn representation/nn of/in muhiprogramming/vbg in~in a/dt high/jj level/nn language/nn ./per The/dt notation/nn used/vbn explicitly/rb associates/vbz a/dt data/nns structure/nn shared/vbn by~in concurrent/jj processes/nns with~in operations/nns defined/vbn on~in it/pp ./per The tags are understood as follows: dt - determiner, nn - singular noun, nns - plural noun, in - preposition, jj - adjective, vbz - verb in present tense third person singular, to - particle &amp;quot;to&amp;quot;, vbg - present participle, vbn - past participle, vbd - past tense verb, vb infinitive verb, cc - coordinate conjunction.</Paragraph>
    <Paragraph position="1"> Tagging of the input text substantially reduces the search space of a top-down parser since it resolves most of the lexical level ambiguities. In the examples above, tagging of presents as &amp;quot;vbz&amp;quot; in the first sentence cuts off a potentially long and costly &amp;quot;garden path&amp;quot; with presents as a plural noun followed by a headless relative clause starting with (that) a proposal .... In the second sentence, tagging resolves ambiguity of used (vbn vs. vbd), and associates (vbz vs. nns). Perhaps more importantly, elimination of word-level lexical ambiguity allows the parser to make projection about the input which is yet to be parsed, using a simple lookahead; in particular, phrase boundaries can be determined with a degree of confidence (Church, 1988). This latter property is critical for implementing skip-and-fit recovery technique outlined in the previous section.</Paragraph>
    <Paragraph position="2"> Tagging of input also helps to reduce the number of parse structures that can be assigned to a sentence, decreases the demand for consulting of the dictionary, and simplifies dealing with unknown words. Since every item in the sentence is assigned a tag, so are the words for which we have no entry in the lexicon. Many of these words will be tagged as &amp;quot;rip&amp;quot; (proper noun), however, the surrounding tags may force other selections. In the following example, chinese, which does not appear in the dictionary, s Tagged using the 35-tag Penn Treebank Tagset created at the Univemty of Penn~Ivtnnt is tagged as -jj,,:6 this~dr paper/nn dates/vbz back/rb the~dr  genesis/nn of~in binary/jj conception/nn circa~in 5000/cd years/nns ago/rb ,~corn as/rb derived/vbn by~in the~dr chinese/jj ancients/nns ./per</Paragraph>
  </Section>
  <Section position="7" start_page="11" end_page="11" type="metho">
    <SectionTitle>
WORD SUFFIX TRIMMER
</SectionTitle>
    <Paragraph position="0"> Word stemming has been an effective way of improving document recall since it reduces words to their common morphological root, thus allowing more successful matches. On the other hand, stemming tends to decrease retrieval precision, if care is not taken to prevent situations where otherwise unrelated words are reduced to the same stem. In our system we replaced a traditional morphological stemmer with a conservative dictionary-assisted suffix trim- null tasks: (1) it reduces inflected word forms to their root forms as specified in the dictionary, and (2) it converts nominalized verb forms (e.g., &amp;quot;implementation&amp;quot;, &amp;quot;storage&amp;quot;) to the root forms of corresponding verbs (i.e., &amp;quot;implement&amp;quot;, &amp;quot;store&amp;quot;). This is accomplished by removing a standard suffix, e.g..</Paragraph>
    <Paragraph position="1"> &amp;quot;stor+age&amp;quot;, replacing it with a standard root ending C+e&amp;quot;), and checking the newly created word against the dictionary, i.e., we check whether the new root (&amp;quot;store&amp;quot;) is indeed a legal word, and whether the original root (&amp;quot;storage&amp;quot;) is defined using the new root (&amp;quot;store&amp;quot;) or one of its standard inflectional forms (e.g., &amp;quot;storing&amp;quot;). For example, the following definitions are excerpted from the Oxford Advanced Learner's Dictionary (OALD): storage n \[13\] (space used for, money paid for) the storing of goods ...</Paragraph>
    <Paragraph position="2"> diversion n \[U\] diverting ...</Paragraph>
    <Paragraph position="3"> procession n It\] number of persons, vehicles, etc moving forward and following each other in an orderly way.</Paragraph>
    <Paragraph position="4"> Therefore, we can reduce &amp;quot;diversion&amp;quot; to &amp;quot;divert&amp;quot; by removing the suffix &amp;quot;+sion&amp;quot; and adding root form suffix &amp;quot;+t&amp;quot;. On the other hand, &amp;quot;process+ion&amp;quot; is not reduced to &amp;quot;process&amp;quot;.</Paragraph>
    <Paragraph position="5"> Earlier experiments with CACM-3204 collection showed an improvement in retrieval precision by  6% to 8% over the base system equipped with a standard morphological stemmer (the SMART stemmer).</Paragraph>
    <Paragraph position="6"> 6 We use the machine ~_d_~ie version of the Oxford Advanced Learner's Dictionary (OALD).</Paragraph>
    <Paragraph position="7"> 7 Dealing with prefixes is a more complicated matter, since they may have quite strong effect upon the meaning of the resulting tenn. e.g., un- usually introduces explicit negation.</Paragraph>
  </Section>
  <Section position="8" start_page="11" end_page="11" type="metho">
    <SectionTitle>
HEAD-MODIFIER STRUCTURES
</SectionTitle>
    <Paragraph position="0"> Syntactic phrases extracted from TIP parse trees are head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head. In the TREC experiments reported here we extracted head-modifier word and fixed-phrase pairs only. While TREC WSJ database is large enough to warrant generation of larger compounds, we were in no position to verify their effectiveness in indexing. This was largely because of the tight schedule, but also because of rapidly escalating complexity of the indexing process: even with 2word phrases, compound terms accounted for nearly 96% of all index entries, in other words, including 2word phrases has increased the index size 25 times! Let us consider a specific example from WSJ database: The former Soviet president has been a local hero ever since a Russian tank invaded WisconSilt. null The tagged sentence is given below, followed by the regularized parse structure generated by 'FI'P, given  It should be noted that the parser's output is a predicate-argument structure centered around main elements of various phrases. In Figure 1, BE is the main predicate (modified by HAVE) with 2 arguments (subject, object) and 2 adjuncts (adv, sub_oral). INVADE is the predicate in the subordinate clause with 2 arguments (subject. object). The subject of BE is a noun phrase with PRESIDENT as the head element, two modifiers (FORMER, SOVIET) and a determiner (THE). From this structure, we extract head-modifier pairs that become candidates for compound terms. The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system; retrieval of information from databases;, and information that can be retrieved by a user-controlled interactive search process. In the example at hand, the following head-modifier pairs are  such as BE and FORMER, or names, such as WISCONSIN, will be later discarded):  We may note that the three-word phrase former Soviet president has been broken into two pairs former president and Soviet president, both of which denote things that are potentially quite different from what the original phrase refers to, and this fact may have potentially negative effect on retrieval precision. This is one place where a longer phrase appears more appropriate. The representation of this sentence may therefore contain the following terms: PRESIDENT. SOVIET, PRESIDENT+SOVIET.</Paragraph>
  </Section>
  <Section position="9" start_page="11" end_page="13" type="metho">
    <SectionTitle>
PRESIDENT+FORMEIL HERO, HERO+LOCAL,
</SectionTitle>
    <Paragraph position="0"> INVADE. TANK. TANK+INVADE. TANK+RUSSIAN.</Paragraph>
    <Paragraph position="1"> RUSSIAN. INVADE+WISCONSIN. WISCONSIN.</Paragraph>
    <Paragraph position="2"> The particular way of interpreting syntactic contexts was dictated, to some degree at least, by statistical considerations. Our original experiments  were performed on a relatively small collection (CACM-3204), and therefore we combined pairs obtained from different syntactic relations (e.g., verb-object, subject-verb, noun-adjunct, etc.) in order to increase frequencies of some associations. This became largely unnecessary in a large collection such as TIPSTER, but we had no means to test alternative options, and thus decided to stay with the original. It should not be difficult to see that this was a compromise solution, since many important distinctions were potentially lost, and strong associations could be produced where there weren't any. A way to improve things is to consider different syntactic relations independently, perhaps as independent sources of evidence that could lend support (or not) to certain term similarity predictions. We have already started testing this option.</Paragraph>
    <Paragraph position="3"> One difficulty in obtaining head-modifier pairs of highest accuracy is the notorious ambiguity of nominal compounds. For example, the phrase natural language processing should generate language+natural and processing+language, while dynamic information processing is expected to yield processing+dynamic and processing+information. A still another case is executive vice president where the association president+executive may be stretching things a bit too far. Since our parser has no knowledge about the text domain, and uses no semantic preferences, it does not attempt to guess any internal associations within such phrases. Instead, this task is passed to the pair extractor module which processes ambiguous parse smactures in two phases.</Paragraph>
    <Paragraph position="4"> In phase one, all and only unambiguous head-modifier pairs are extracted, and the frequencies of their occurrences are recorded. In phase two, frequency information about pairs generated in the first pass is used to form associations from ambiguous structures. For example, if language+natural has occurred unambiguously a number times in contexts such as parser for natural language, while processing+natural has occurred significantly fewer times or perhaps none at all, then we will prefer the former association as valid.</Paragraph>
  </Section>
  <Section position="10" start_page="13" end_page="14" type="metho">
    <SectionTitle>
TERM CORRELATIONS FROM TEXT
</SectionTitle>
    <Paragraph position="0"> Head-modifier pairs form compound terms used in database indexing. They also serve as occurrence contexts for smaller terms, including single-word terms. If two terms tend to be modified with a number of common modifiers and otherwise appear in few distinct contexts, we assign them a similarity coefficient, a real number between 0 and 1. The similarity is determined by comparing distribution characteristics for both terms within the corpus: how much information contents do they carry, do their information contribution over contexts vat3' greatly, are the common contexts in which these terms occur specific enough? In general we will credit high-contents terms appearing in identical contexts, especially if these contexts are not too commonplace, s The relative similarity between two words xi and x2 can be obtained using the following formula (ct is a large constant): 9</Paragraph>
    <Paragraph position="2"> and IC is the Information Contribution measure indicating the strength of word pairings, and defined as</Paragraph>
    <Paragraph position="4"> where f~,y is the absolute frequency of pair Ix,y\] in the corpus, nx is the frequency of term x at the head position, and dx is a dispersion parameter understood as the number of distinct syntactic contexts in which term x is found. The similarity function is further normalized with respect to SIM (x i ,x I ). Example similarities are listed in Table 1.</Paragraph>
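    <Paragraph> Under the reconstruction of the formulas above, the computation can be sketched as follows. The corpus counts are invented, and the paper's exact normalization may differ from this reading.

from collections import defaultdict
from math import log

pair_freq = {("president", "former"): 10, ("president", "soviet"): 4,
             ("hero", "local"): 3, ("president", "local"): 2}

heads = defaultdict(int)     # n_x: frequency of x at the head position
contexts = defaultdict(set)  # d_x = len(contexts[x]): distinct contexts
for (x, y), f in pair_freq.items():
    heads[x] += f
    contexts[x].add(y)

def IC(x, y):
    return pair_freq.get((x, y), 0) / (heads[x] + len(contexts[x]) - 1)

def SIM(x1, x2, alpha=1000.0):
    shared = contexts[x1].intersection(contexts[x2])
    total = sum(min(IC(x1, y), IC(x2, y)) for y in shared)
    return log(alpha * total) if total > 0 else 0.0

print(round(SIM("president", "hero"), 2))  # shared context: 'local'
    </Paragraph>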
    <Paragraph position="5"> We also considered a term clustering option which, unlike the similarity formula above, produces clusters of related words and phrases, but will not generate a uniform term similarity ranking across clusters. We used a variant of weighted Tanimoto's measure described in (Grefenstette, 1992):
$SIM(w_1,w_2) = \frac{\sum_{att} \min \left( W(w_1,att), \, W(w_2,att) \right)}{\sum_{att} \max \left( W(w_1,att), \, W(w_2,att) \right)}$
where W assigns a weight to a term's occurrence with a given context attribute.</Paragraph>
    <Paragraph position="7"> Sample clusters obtained from an approx. 100 MByte (17 million words) sample of WSJ are given in Table 2.</Paragraph>
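    <Paragraph> A sketch of the clustering similarity; here the weight W of a context attribute is simply its raw pair count, whereas Grefenstette (1992) derives globally weighted attribute scores.

def tanimoto(w1, w2, weights):
    a1, a2 = weights.get(w1, {}), weights.get(w2, {})
    atts = set(a1) | set(a2)
    num = sum(min(a1.get(a, 0.0), a2.get(a, 0.0)) for a in atts)
    den = sum(max(a1.get(a, 0.0), a2.get(a, 0.0)) for a in atts)
    return num / den if den else 0.0

weights = {"stock": {"buy": 5.0, "issue": 3.0},
           "share": {"buy": 4.0, "issue": 2.0, "sell": 1.0}}
print(tanimoto("stock", "share", weights))  # 6/9, approx. 0.67
    </Paragraph>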
    <Paragraph position="8"> In order to generate better similarities and clusters, we require that words x1 and x2 appear in at least M distinct common contexts, where a common context is a couple of pairs [x1,y] and [x2,y], or [y,x1] and [y,x2], such that they each occurred at least twice. Thus, banana and Baltic will not be considered for a similarity relation on the basis of their occurrences in the common context of republic, no matter how frequent, unless there is another such common context comparably frequent (there wasn't any in the TREC WSJ database). For smaller or narrow-domain databases M=2 is usually sufficient. For large databases covering rather diverse subject matter, like TIPSTER or even WSJ, we used M &gt;= 3. 10 It may be worth pointing out that the similarities are calculated using term co-occurrences in syntactic rather than in document-size contexts, the latter being the usual practice in non-linguistic clustering (e.g., Sparck Jones and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990). Although the two methods of term clustering may be considered mutually complementary in certain situations, we believe that more and stronger associations can be obtained through syntactic-context clustering, given a sufficient amount of data and a reasonably accurate syntactic parser. 11</Paragraph>
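    <Paragraph> The common-context requirement can be sketched as a simple pre-filter; pair_freq maps (head, modifier) pairs to corpus counts, and the counts below are invented.

def enough_common_contexts(x1, x2, pair_freq, M=3):
    ys = {y for (x, y) in pair_freq if x == x1}
    common = sum(1 for y in ys
                 if pair_freq.get((x1, y), 0) >= 2
                 and pair_freq.get((x2, y), 0) >= 2)
    return common >= M

pair_freq = {("banana", "republic"): 4, ("baltic", "republic"): 9}
print(enough_common_contexts("banana", "baltic", pair_freq, M=2))  # False
    </Paragraph>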
  </Section>
  <Section position="11" start_page="14" end_page="14" type="metho">
    <SectionTitle>
QUERY EXPANSION
</SectionTitle>
    <Paragraph position="0"> Similarity relations are used to expand user queries with new terms, in an attempt to make the final search query more comprehensive (adding synonyms) and/or more pointed (adding specializations)) 2 It follows that not all similarity relations will be equally useful in query expansion, for instance, complementary and antonymous relations like the 1o For example banana and Dominican were found to have two common contexts: republic and plant, although this second ocin appare, nfly different senses in Dominican plant and bana. na p/ant.</Paragraph>
    <Paragraph position="1"> &amp;quot; Nun-syntactic contexts cross sentence boundaries with no fuss. which is helpful with short, succinct documents (such as CACM absuacts), but less so with longer texts; see also (Gnsimaan et al,, 1986).</Paragraph>
    <Paragraph position="2"> :2 Query expansion (in the sense considered here, though not quite in the same way) has been used in information retrieval rescacch before (e.g., Sparc~ Jones and Tait. 1984; Hamum, 1988). usually with nuxcd ~csults. An ahemanve is to use term clusters to create new teans, &amp;quot;meta~nns&amp;quot;. and use them to index the database instead {e.g.. Crouch. 1988; lewis and Croft, 1990). We found that the query expansion approach gives the system more flexibility, for instance, by making room for hypenext-style topic exploration via user feedback.</Paragraph>
    <Paragraph position="3"> one between Australian and Canadian, or accept and reject may actually harm system's performance, since we may end up retrieving many irrelevant documents. Similarly, the effectiveness of a query containing vitamin is likely to diminish if we add a similar but far more general term such as acid. On the other hand, database search is likely to miss relevant documents if we overlook the fact that fortran is a programming language, or that infant is a baby and baby is a child. We noted that an average set of similarities generated from a text corpus contains about as many &amp;quot;good&amp;quot; relations (synonymy, specialization) as &amp;quot;bad&amp;quot; relations (antonymy. complementation, generalization), as seen from the query expansion viewpoint. Therefore any attempt to separate these two classes and to increase the proportion of &amp;quot;good&amp;quot; relations should result in improved retrieval. This has indeed been confirmed in our experiments where a relatively crude filter has visibly increased retrieval precision.</Paragraph>
    <Paragraph position="4"> In order to create an appropriate filter, we devised a global term specificity measure (GTS) which is calculated for each term across all contexts in which it occurs. The general philosophy here is that a more specific word/phrase would have a more limited use, i.e., a more specific term would appear in fewer distinct contexts. In this respect, GTS is similar to the standard inverted document frequency (idjO measure except that term frequency is measured over syntactic units rather than document size units. 13 Terms with higher GTS values are generally considered more specific, but the specificity comparison is only meaningful for terms which are already known to be similar. The new function is calculated according to the following formula:</Paragraph>
    <Paragraph position="6"> GTS(w) = IC_L(w) * IC_R(w) if both exist; IC_R(w) if only IC_R(w) exists; and IC_L(w) otherwise; where IC_L(w) and IC_R(w) denote the IC measure of w computed over the contexts in which w occurs as the left (head) element and as the right (modifier) element of a pair, respectively.</Paragraph>
    <Paragraph position="7"> For any two terms w1 and w2, and a constant δ &gt; 1, if GTS(w2) &gt; δ * GTS(w1) then w2 is considered more specific than w1. In addition, if SIM_norm(w1,w2) = σ &gt; θ, where θ is an empirically established threshold, then w2 can be added to the query containing term w1 with weight σ. 14 For example, the following were obtained from the TREC WSJ training database:
13 We believe that measuring term specificity over document-size contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In particular, syntax-based contexts allow for processing texts without any internal document structure.</Paragraph>
    <Paragraph position="10"> Therefore both baby and infant can be used to specialize child. With this filter, the relationship between baby and infant had to be discarded, as we are unable to tell synonymous or near-synonymous relationships from those which are primarily complementary, e.g., man and woman.</Paragraph>
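    <Paragraph> The resulting filter can be sketched as follows. The GTS and SIM_norm values are invented for the example (the actual figures from the WSJ training database were lost in this extraction); delta and theta behave as described above.

GTS = {"child": 1.0e-6, "baby": 5.0e-5, "infant": 9.0e-5}
SIM_NORM = {("child", "baby"): 0.3, ("child", "infant"): 0.2,
            ("baby", "infant"): 0.4}

def expansions(term, delta=10.0, theta=0.1):
    out = []
    for (w1, w2), sigma in SIM_NORM.items():
        if w1 == term and sigma > theta and GTS[w2] > delta * GTS[w1]:
            out.append((w2, sigma))  # add w2 to the query with weight sigma
    return out

print(expansions("child"))  # both baby and infant specialize child
print(expansions("baby"))   # infant is similar but not enough more specific: []
    </Paragraph>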
  </Section>
  <Section position="12" start_page="14" end_page="17" type="metho">
    <SectionTitle>
SUMMARY OF RESULTS
</SectionTitle>
    <Paragraph position="0"> We have processed the total of 500 MBytes of articles from Wall Street Journal section of TREC database. Retrieval experiments involved 50 user information requests (topics) (TREC topics 51-100) consisting of several fields that included both text and user supplied keywords. A typical topic is shown  below: &lt;:top&gt; &lt;head&gt; Tipster Topic Description &lt;hum&gt; Number:. 059 &lt;dora&gt; Domain: Environment &lt;title&gt; Topic: Weather Related Fatalities &lt;desc&gt; Description: Document will report a type of weather event which has directly caused at least one fatality in some location. .~narr&gt; Narrative:  A relevant document will include the number of people killed and injured by the weather eveat, as well as reporting the type of we.~er event and the location of the event.</Paragraph>
    <Paragraph position="1"> &lt;con&gt; Cmc~(s): For CAC'M-3204 colle~ion the filter was most effective at o = 0..5&amp;quot;7. For TREC-I we changed the similarity formula slightly in order to obtain ~ nonnahza~vns m all cases. This however lowered smailanty coefficients in general and a new threshold had to be selected. We used o = 0.1 m TREC-I rims, although it tamed om tobcapoor choice. In all C/au~Svaried between 10and I00. I. lightning, avalanche, tornado, typhoon, humcane. heat. heat wave. flood, snow. rain. downpour. blizzard, storm, freezing temperatures  2. dead. killed, fatal, death, fatality, victim 3. NOT man-made disasters, NOT war-induced famine 4. NOT earthquakes, NOT volcanic ernptions &lt;/top&gt;  Note that this topic actually consists of two different statements of the same query: the natural language specification consisting of &lt;desc&gt; and &lt;nan-&gt; fields. and an expert-selected list of key terms which are often far more informative than the narrative part. Results obtained for queries using text fields only and those involving both text and keyword fields are reported separately. Further experiments have suggested that natural language processing impact is significant but may be severely limited by the expressiveness of the term-based representation. Since the &lt;con&gt; field is considered the expert-user's rendering of the 'optimal&amp;quot; search query, our system is able to discover much of it from a less complete specification in the text section of the request via query expansion. In fact, we noted that the recall/precision gap between automatically generated queries and those supplied by the user was largely closed when NLP was used. Moreover, even with the keyword field included in the query along with other fields, NLP's impact on the system's performance is still noticeable.</Paragraph>
    <Paragraph position="2"> Other results on the impact of different fields in TREC topics on the final recall/precision results were reported by Broglio and Croft (1993) at the ARPA HLT workshop, although text-only runs were not included. One of the most striking observations they have made is that the narrative field is entirely disposable, and moreover that its inclusion in the query actually hurts the system's performance. It has to be pointed out, however, that they do little language processing. 15 Summary statistics for these runs are shown in Table 4. These results are fairly tentative and should be regarded with some caution. For one, the column named txt reports performance of &lt;dcsc&gt; and &lt;narr&gt; fields which have been processed with our suffix~rimmer. This means some NIP has been done already (tagging + lexicon), and therefore what we see there is not the performance of 'pure' statistical system. The same applies to con column. (For u Brace Cmfl (personal communication. 1992) has suggested that excluding Ill expert-made fields (i.e.. &lt;ctm&gt; and &lt;:lac&gt;) would make the queries quite ineffective. Broglio (personal commumeanvc, 1993) co.rims Ibis showing thaz text-only retrieval (i.e.. with &lt;desc&gt; and ~narr'&gt;) shows an average prnc:sion at morn than 30% below that of &lt;con&gt;-based retrieval.</Paragraph>
    <Paragraph position="3">  the more specific term).</Paragraph>
    <Paragraph position="4"> word cluster takeover merge, buy-out acquisition, bid stock share, issue, bond, price staff personnel, employee, force share stock, issue,fund sensitive crucial, difficult, critical rumor speculate president director, executive chairman, manage outlook forecast, prospect trend, picture law rule, legislate bill, regulate earnings revenue, income por(olio asset, invest, loan property, hold inflate growth, earnings, rise industry business, company, market help additional, support, involve growth increase, rise, gain decline, earnings, profit firm bank, concern, group, unit environ climate, condition situation, trend debt loan, secure, bond custom( er ) client, investor buyer, consume(r) counsel attorney compute machine, software competitor rival, partner, buyer company business, firm, bank market, industry, concern big large, major, huge base facile, source reserve, support asset property, loan,fund, invest share, stock, money Table 2. Selected clusters obtained from approx. 107 words of text with weighted Tanimoto formula.  comparison, see Table 3 where runs with CACM-3204 collection included 'pure' statistics run (base), and note the impact our suffix trimmer is having.) Nonetheless, one may notice that automated NLP can be very effective at discovering the right query from an imprecise narrative specification: as much as 82% of the effectiveness of the expert-generated query can be attained.</Paragraph>
  </Section>
class="xml-element"></Paper>