<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1040">
  <Title>APPENDIX: SAMPLE DATA DOCUMENT TEXT: *RECORD*</Title>
  <Section position="3" start_page="0" end_page="206" type="metho">
    <SectionTitle>
1. OVERALL DESIGN
</SectionTitle>
    <Paragraph position="0"> Our information retrieval system consists of a traditional statistical backbone (Harman and Candela, 1989) augmented with various natural language processing components that assist the system in database processing (stemming, indexing, word and phrase clustering, selectional restrictions), and translate a user's information request into an effective query. This design is a careful compromise between purely statistical non-linguistic approaches and those requiring rather accomplished (and expensive) semantic analysis of data, often referred to as 'conceptual retrieval'. The conceptual retrieval systems, though quite effective, are not yet mature enough to be considered in serious information retrieval applications, the major problems being their extreme inefficiency and the need for manual encoding of domain knowledge (Mauldin, 1991).</Paragraph>
    <Paragraph position="1"> In our system the database text is first processed with a fast syntactic parser. Subsequently, certain types of phrases are extracted from the parse trees and used as compound indexing terms in addition to single-word terms. The extracted phrases are statistically analyzed as syntactic contexts in order to discover a variety of similarity links between smaller subphrases and words occurring in them. A further filtering process maps these similarity links onto semantic relations (generalization, specialization, synonymy, etc.), after which they are used to transform the user's request into a search query.</Paragraph>
    <Paragraph position="2"> The user's natural language request is also parsed, and all indexing terms occurring in it are identified. Next, certain highly ambiguous (usually single-word) terms are dropped, provided that they also occur as elements in some compound terms. For example, "natural" is deleted from a query already containing "natural language" because "natural" occurs in many unrelated contexts: "natural number", "natural logarithm", "natural approach", etc. At the same time, other terms may be added, namely those which are linked to some query term through admissible similarity relations. For example, "fortran" is added to a query containing the compound term "program language" via a specification link. After the final query is constructed, the database search follows, and a ranked list of documents is returned.</Paragraph>
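    <Paragraph> The query construction step described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name build_query and the similarity table are hypothetical, and the sketch drops every single-word term that also occurs inside a compound term, whereas the system applies this only to certain highly ambiguous terms.

```python
from typing import Dict, List


def build_query(terms: List[str], similar: Dict[str, List[str]]) -> List[str]:
    """Drop single words covered by a compound term, then add linked terms.

    `terms` are indexing terms found in the parsed request; `similar` maps a
    term to terms linked to it by an admissible similarity relation
    (hypothetical data supplied by the caller).
    """
    compounds = [t for t in terms if " " in t]
    kept = []
    for term in terms:
        is_single_word = " " not in term
        inside_compound = any(term in c.split() for c in compounds)
        if not (is_single_word and inside_compound):
            kept.append(term)
    expanded = list(kept)
    for term in kept:
        for new_term in similar.get(term, []):
            if new_term not in expanded:
                expanded.append(new_term)
    return expanded


request_terms = ["natural", "natural language", "program language"]
links = {"program language": ["fortran"]}
print(build_query(request_terms, links))
```

    With the toy input above, "natural" is removed because it occurs in "natural language", and "fortran" is appended through its link to "program language".</Paragraph>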
    <Paragraph position="3"> It should be noted that all the processing steps, both those performed by the backbone system and those performed by the natural language processing components, are fully automated, and no human intervention or manual encoding is required.</Paragraph>
  </Section>
  <Section position="4" start_page="206" end_page="206" type="metho">
    <SectionTitle>
2. FAST PARSING WITH TTP
</SectionTitle>
    <Paragraph position="0"> TTP (Tagged Text Parser) is based on the Linguistic String Grammar developed by Sager (1981). Written in Quintus Prolog, the parser currently encompasses more than 400 grammar productions. It produces regularized parse tree representations for each sentence that reflect the sentence's logical structure. The parser is equipped with a powerful skip-and-fit recovery mechanism that allows it to operate effectively in the face of ill-formed input or under severe time pressure. In recent experiments with approximately 6 million words of English text [1], the parser's speed averaged between 0.45 and 0.5 seconds per sentence, or up to 2600 words per minute, on a 21 MIPS SparcStation ELC.</Paragraph>
    <Paragraph position="1"> Some details of the parser are discussed below. [2] TTP is a full grammar parser, and initially it attempts to generate a complete analysis for each sentence. However, unlike an ordinary parser, it has a built-in timer which regulates the amount of time allowed for parsing any one sentence. If a parse is not returned before the allotted time elapses, the parser enters the skip-and-fit mode in which it will try to "fit" the parse.</Paragraph>
    <Paragraph position="2"> [Footnote 1: These include CACM-3204, MUC-3, and a selection of nearly 6,000 technical articles extracted from the Computer Library database (a Ziff Communications Inc. CD-ROM).]</Paragraph>
  </Section>
  <Section position="5" start_page="206" end_page="206" type="metho">
    <SectionTitle>
[Footnote 2: A complete description can be found in (Strzalkowski, 1991).]
</SectionTitle>
    <Paragraph position="0"> While in the skip-and-fit mode, the parser will attempt to forcibly reduce incomplete constituents, possibly skipping portions of input in order to restart processing at the next unattempted constituent. In other words, the parser will favor reduction over backtracking while in the skip-and-fit mode. The result of this strategy is an approximate parse, partially fitted using top-down predictions. The fragments skipped in the first pass are not thrown out; instead, they are analyzed by a simple phrasal parser that looks for noun phrases and relative clauses and then attaches the recovered material to the main parse structure.</Paragraph>
    <Paragraph position="1"> As an illustration, consider the following sentence taken from the CACM-3204 corpus: The method is illustrated by the automatic construction of both recursive and iterative programs operating on natural numbers, lists, and trees, in order to construct a program satisfying certain specifications a theorem induced by those specifications is proved, and the desired program is extracted from the proof.</Paragraph>
    <Paragraph position="2"> The italicized fragment is likely to cause additional complications in parsing this lengthy string, and the parser may be better off ignoring this fragment altogether. To do so successfully, the parser must close the currently open constituent (i.e., reduce "a program satisfying certain specifications" to NP), and possibly a few of its parent constituents, removing corresponding productions from further consideration, until an appropriate production is reactivated. In this case, TTP may force the following reductions: SI --&gt; to V NP; SA --&gt; SI; S --&gt; NP V NP SA, until the production S --&gt; S and S is reached. Next, the parser skips input to find "and", and resumes normal processing.</Paragraph>
    <Paragraph position="3"> As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the lexical-level ambiguity must be removed from the input text prior to parsing. We achieve this by using a stochastic part-of-speech tagger [3] to preprocess the text.</Paragraph>
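    <Paragraph> The timer-controlled switch from full parsing to skip-and-fit mode can be pictured with the sketch below. It only illustrates the control flow of a time-budgeted parse with a fallback; it is not TTP itself (which is written in Prolog), and the names parse_with_budget, attempts, and fallback are hypothetical. The full parser is modeled as an iterator that yields None after each unsuccessful step, and the injectable clock is an assumption made for testability.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def parse_with_budget(
    attempts: Iterable[Optional[str]],
    fallback: Callable[[], str],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, str]:
    """Return (parse, mode), where mode is "full" or "skip-and-fit"."""
    deadline = clock() + budget_seconds
    for candidate in attempts:
        if candidate is not None:
            # A complete analysis arrived within the allotted time.
            return candidate, "full"
        if clock() >= deadline:
            # Time is up: stop the full parse and fit an approximate one.
            break
    return fallback(), "skip-and-fit"


# Two failed steps followed by success, with a generous time budget.
steps = iter([None, None, "S(NP V NP)"])
print(parse_with_budget(steps, lambda: "approximate parse", 10.0))
```

    In this sketch the fallback is also used when the full parser runs out of steps without producing an analysis, which is a design choice of the illustration rather than a documented property of TTP.</Paragraph>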
  </Section>
  <Section position="6" start_page="206" end_page="206" type="metho">
    <SectionTitle>
3. WORD SUFFIX TRIMMER
</SectionTitle>
    <Paragraph position="0"> Word stemming has been an effective way of improving document recall since it reduces words to their common morphological root, thus allowing more successful matches.</Paragraph>
    <Paragraph position="1"> On the other hand, stemming tends to decrease retrieval precision if care is not taken to prevent situations where otherwise unrelated words are reduced to the same stem. In our system we replaced a traditional morphological stemmer with a conservative dictionary-assisted suffix trimmer. [4] The suffix trimmer performs essentially two tasks: (1) it reduces inflected word forms to their root forms as specified in the dictionary, and (2) it converts nominalized verb forms (e.g., "implementation", "storage") to the root forms of the corresponding verbs (i.e., "implement", "store").</Paragraph>
  </Section>
  <Section position="7" start_page="206" end_page="206" type="metho">
    <SectionTitle>
[Footnote 3: Courtesy of Bolt Beranek and Newman.]
[Footnote 4: We use the Oxford Advanced Learner's Dictionary (OALD) MRD.]
</SectionTitle>
    <Paragraph position="0"> This is accomplished by removing a standard suffix (e.g., "stor+age"), replacing it with a standard root ending ("+e"), and checking the newly created word against the dictionary; i.e., we check whether the original root ("storage") is defined using the new root ("store"). This allows reducing "diversion" to "diverse" while preventing "version" from being replaced by "verse". Experiments with the CACM-3204 collection show an improvement in retrieval precision of 6% to 8% over the base system equipped with a standard morphological stemmer (the SMART stemmer).</Paragraph>
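    <Paragraph> The dictionary check for nominalized forms can be sketched as follows. This is a minimal sketch under stated assumptions: the suffix rules, the toy definitions, and the root list are invented for illustration and are not taken from the OALD, and the function name trim_suffix is hypothetical.

```python
# Hypothetical (suffix, replacement root ending) rules, tried in order.
SUFFIX_RULES = [("age", "e"), ("ation", ""), ("ion", "e")]

# Toy stand-in for a machine-readable dictionary: word to definition text.
TOY_DEFINITIONS = {
    "storage": "the act of storing; space used to store goods",
    "implementation": "the act of implementing; to implement a plan",
    "version": "a particular form of something",
}

# Toy list of root forms known to the dictionary.
TOY_ROOTS = {"store", "implement", "verse"}


def trim_suffix(word: str) -> str:
    """Return the verb root if the dictionary supports it, else the word."""
    definition_words = TOY_DEFINITIONS.get(word, "").split()
    for suffix, ending in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + ending
            # Accept only if the candidate is a known root AND the original
            # word is defined using that root.
            if candidate in TOY_ROOTS and candidate in definition_words:
                return candidate
    return word


for w in ["storage", "implementation", "version"]:
    print(w, trim_suffix(w))
```

    The second condition is what keeps "version" from becoming "verse": the candidate exists as a root, but the toy definition of "version" does not use it.</Paragraph>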
  </Section>
  <Section position="8" start_page="206" end_page="206" type="metho">
    <SectionTitle>
4. HEAD-MODIFIER STRUCTURES
</SectionTitle>
    <Paragraph position="0"> Syntactic phrases extracted from TTP parse trees are head-modifier pairs, ranging from simple word pairs to complex nested structures. The head in such a pair is a central element of a phrase (verb, main noun, etc.) while the modifier is one of the adjunct arguments of the head. [5] For example, the phrase "fast algorithm for parsing context-free languages" yields the following pairs: algorithm+fast, algorithm+parse, parse+language, language+context_free. The following types of pairs were considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content.</Paragraph>
    <Paragraph position="1"> For example, the pair [retrieve, information] is extracted from any of the following fragments: "information retrieval system"; "retrieval of information from databases"; and "information that can be retrieved by a user-controlled interactive search process". [6] An example is shown in the appendix. [7]</Paragraph>
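    <Paragraph> The pair extraction described in this section can be illustrated on a hand-built structure. This is a sketch only: the Phrase class and head_modifier_pairs function are hypothetical, the tree is written by hand rather than produced by TTP, and the sketch does not distinguish the four pair types listed above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Phrase:
    """A head word together with the phrases that modify it."""
    head: str
    modifiers: List["Phrase"] = field(default_factory=list)


def head_modifier_pairs(phrase: Phrase) -> List[Tuple[str, str]]:
    """Collect (head, modifier head) pairs, descending into nested phrases."""
    pairs = []
    for modifier in phrase.modifiers:
        pairs.append((phrase.head, modifier.head))
        pairs.extend(head_modifier_pairs(modifier))
    return pairs


# Hand-built analysis of "fast algorithm for parsing context-free languages".
example = Phrase("algorithm", [
    Phrase("fast"),
    Phrase("parse", [Phrase("language", [Phrase("context_free")])]),
])

for head, modifier in head_modifier_pairs(example):
    print(head + "+" + modifier)
```

    On this input the sketch prints the four pairs given in the text: algorithm+fast, algorithm+parse, parse+language, and language+context_free.</Paragraph>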
  </Section>
  <Section position="9" start_page="206" end_page="207" type="metho">
    <SectionTitle>
5. TERM CORRELATIONS FROM TEXT
</SectionTitle>
    <Paragraph position="0"> Head-modifier pairs form compound terms used in database indexing. They also serve as occurrence contexts for smaller terms, including single-word terms. In order to determine whether such pairs signify any important association between terms, we calculate the value of the Informational Contribution (IC) function for each element in a pair. Higher values indicate stronger association, and the element having the largest value is considered semantically dominant. The IC function is a derivative of Fano's mutual information formula recently used by Church and Hanks (1990) to compute word co-occurrence patterns in a 44 million word corpus of Associated Press news stories. They noted that while generally satisfactory, the mutual information formula often produces counterintuitive results for low-frequency data. This is particularly worrisome for relatively smaller IR collections since many important indexing terms would be eliminated from consideration. Therefore, following suggestions in Wilks et al. (1990), we adopted a revised formula that displays a more stable behavior even on very low counts. This new formula $IC(x,[x,y])$ is based on (an estimate of) the conditional probability of seeing a word y to the right of the word x, modified with a dispersion parameter for x.</Paragraph>
    <Paragraph position="1"> $$IC(x,[x,y]) = \frac{f_{x,y}}{n_x + d_x - 1}$$ where $f_{x,y}$ is the frequency of $[x,y]$ in the corpus, $n_x$ is the number of pairs in which x occurs at the same position as in $[x,y]$, and $d_x$ is the dispersion parameter, understood as the number of distinct words with which x is paired. When $IC(x,[x,y]) = 0$, x and y never occur together (i.e., $f_{x,y} = 0$); when $IC(x,[x,y]) = 1$, x occurs only with y (i.e., $f_{x,y} = n_x$ and $d_x = 1$). Selected examples generated from the CACM-3204 corpus are given in Table 2 at the end of the paper.</Paragraph>
    <Paragraph position="2"> IC values for terms become the basis for calculating term-to-term similarity coefficients. If two terms tend to be modified with a number of common modifiers and otherwise appear in few distinct contexts, we assign them a similarity coefficient, a real number between 0 and 1. The similarity is determined by comparing distribution characteristics for both terms within the corpus: how much information content do they carry, does their information contribution over contexts vary greatly, and are the common contexts in which these terms occur specific enough? In general we will credit high-content terms appearing in identical contexts, especially if these contexts are not too commonplace. [8] The relative similarity between two words $x_1$ and $x_2$ is obtained using the following formula ($\alpha$ is a large constant): $$SIM(x_1,x_2) = \log \Big( \alpha \sum_{y} sim_y(x_1,x_2) \Big)$$ where $$sim_y(x_1,x_2) = MIN\big(IC(x_1,[x_1,y]),\, IC(x_2,[x_2,y])\big)$$</Paragraph>
    <Paragraph position="3"> The similarity function is further normalized with respect to $SIM(x_1,x_1)$. It may be worth pointing out that the similarities are calculated using term co-occurrences in syntactic rather than in document-size contexts, the latter being the usual practice in non-linguistic clustering (e.g., Sparck Jones and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990).</Paragraph>
    <Paragraph position="4"> [Footnote 5: In the experiments reported here we extracted head-modifier word pairs only. The CACM collection is too small to warrant generation of larger compounds, because of their low frequencies.]</Paragraph>
    <Paragraph position="5"> [Footnote 6: To deal with nominal compounds we use frequency information about the pairs generated from the entire corpus to form preferences in ambiguous situations, such as "natural language processing" vs. "dynamic information processing".]</Paragraph>
    <Paragraph position="6"> [Footnote 7: Note that working with the parsed text ensures a high degree of precision in capturing the meaningful phrases, which is especially evident when compared with the results usually obtained from either unprocessed or only partially processed text (Lewis and Croft, 1990).]</Paragraph>
    <Paragraph position="7"> [Footnote 8: It would not be appropriate to predict similarity between "language" and "logarithm" on the basis of their co-occurrence with "natural".]</Paragraph>
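    <Paragraph> As a worked illustration of the IC function, the sketch below reads the formula as the pair frequency divided by (n_x + d_x - 1) and computes it for the first element of each pair. The function name and the pair counts are invented for illustration; they are not CACM-3204 data, and the second-position case is omitted.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def informational_contribution(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Compute IC(x, [x, y]) for the first element x of every pair [x, y]."""
    pair_freq = Counter(pairs)       # f_xy: frequency of [x, y]
    n = Counter()                    # n_x: pairs with x in first position
    partners = defaultdict(set)      # distinct words paired with x
    for (x, y), freq in pair_freq.items():
        n[x] += freq
        partners[x].add(y)
    return {
        (x, y): freq / (n[x] + len(partners[x]) - 1)   # d_x = len(partners[x])
        for (x, y), freq in pair_freq.items()
    }


toy_pairs = (
    [("natural", "language")] * 3
    + [("natural", "number")]
    + [("algol", "language")] * 2
)
ic = informational_contribution(toy_pairs)
print(ic[("natural", "language")])   # 3 / (4 + 2 - 1)
print(ic[("algol", "language")])     # 2 / (2 + 1 - 1)
```

    In the toy data "algol" occurs only with "language", so its IC value is 1, matching the boundary case described in the text, while "natural" is spread over two contexts and scores lower.</Paragraph>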
    <Paragraph position="8"> Although the two methods of term clustering may be considered mutually complementary in certain situations, we believe that more and stronger associations can be obtained through syntactic-context clustering, given a sufficient amount of data and a reasonably accurate syntactic parser. [9]</Paragraph>
  </Section>
  <Section position="10" start_page="207" end_page="208" type="metho">
    <SectionTitle>
6. QUERY EXPANSION
</SectionTitle>
    <Paragraph position="0"> Similarity relations are used to expand user queries with new terms, in an attempt to make the final search query more comprehensive (adding synonyms) and/or more pointed (adding specializations). [10] It follows that not all similarity relations will be equally useful in query expansion; for instance, complementary relations like the one between algol and fortran may actually harm the system's performance, since we may end up retrieving many irrelevant documents. Similarly, the effectiveness of a query containing fortran is likely to diminish if we add a similar but far more general term such as language. On the other hand, database search is likely to miss relevant documents if we overlook the fact that fortran is a programming language, or that interpolate is a specification of approximate. We noted that an average set of similarities generated from a text corpus contains about as many "good" relations (synonymy, specialization) as "bad" relations (antonymy, complementation, generalization), as seen from the query expansion viewpoint. Therefore any attempt to separate these two classes and to increase the proportion of "good" relations should result in improved retrieval. This has indeed been confirmed in our experiments, where a relatively crude filter has visibly increased retrieval precision. In order to create an appropriate filter, we expanded the IC function into a global specificity measure called the cumulative informational contribution function (ICW). ICW is calculated for each term across all contexts in which it occurs. The general philosophy here is that a more specific word/phrase would have a more limited use, i.e., would appear in fewer distinct contexts.</Paragraph>
    <Paragraph position="1"> [Footnote 9: Non-syntactic contexts cross sentence boundaries with no fuss, which is helpful with short, succinct documents (such as CACM abstracts), but less so with longer texts.] [Footnote 10: Query expansion (in the sense considered here, though not quite in the same way) has been used in information retrieval research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed results. An alternative is to use term clusters to create new terms, "metaterms", and use them to index the database instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that the query expansion approach gives the system more flexibility, for instance, by making room for hypertext-style topic exploration via user feedback.]</Paragraph>
    <Paragraph position="2"> ICW is similar to the standard inverted document frequency (idf) measure except that term frequency is measured over syntactic units rather than document-size units. [11] Terms with higher ICW values are generally considered more specific, but the specificity comparison is only meaningful for terms which are already known to be similar. The new function is calculated according to the following formula: [12]</Paragraph>
    <Paragraph position="3"> and analogously for $IC_R(w)$.</Paragraph>
    <Paragraph position="4"> For any two terms $w_1$ and $w_2$, and a constant $\delta &gt; 1$, if $ICW(w_2) \geq \delta \cdot ICW(w_1)$ then $w_2$ is considered more specific than $w_1$. In addition, if $SIM_{norm}(w_1,w_2) = \sigma &gt; \theta$, where $\theta$ is an empirically established threshold, then $w_2$ can be added to the query containing term $w_1$ with weight $\sigma$. [13] In the CACM-3204 collection:</Paragraph>
    <Paragraph position="6"> Therefore interpolate can be used to specialize approximate, while language cannot be used to expand algol. Note that if $\delta$ is well chosen (we used $\delta = 10$), then the above filter will also help to reject antonymous and complementary relations, such as $SIM_{norm}(pl\_i, cobol) = 0.685$ with $ICW(pl\_i) = 0.0175$ and $ICW(cobol) = 0.0289$. We continue working to develop more effective filters. Examples of filtered similarity relations obtained from the CACM-3204 corpus are given in Table 3.</Paragraph>
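    <Paragraph> The expansion filter can be sketched as a simple predicate. This is an illustration under stated assumptions: the ICW values are taken as given inputs (the ICW formula itself is not reproduced here), delta = 10 follows the text, and the threshold value theta = 0.5 and the function name can_expand are hypothetical.

```python
def can_expand(
    icw_w1: float,
    icw_w2: float,
    sim_norm: float,
    delta: float = 10.0,   # specificity ratio used in the text
    theta: float = 0.5,    # hypothetical similarity threshold
) -> bool:
    """Decide whether w2 may be added to a query that contains w1.

    w2 must be sufficiently more specific than w1 (ICW ratio of at least
    delta) and sufficiently similar to it (normalized similarity above theta).
    """
    more_specific = icw_w2 >= delta * icw_w1
    similar_enough = sim_norm > theta
    return more_specific and similar_enough


# Values quoted in the text for the complementary pair pl_i and cobol.
print(can_expand(icw_w1=0.0175, icw_w2=0.0289, sim_norm=0.685))
# The same pair tried in the opposite direction.
print(can_expand(icw_w1=0.0289, icw_w2=0.0175, sim_norm=0.685))
```

    Both calls return False: although the two terms are highly similar, neither ICW value is ten times the other, so the pair is rejected in both directions, which is the behavior the text describes for complementary relations.</Paragraph>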
  </Section>
</Paper>