File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/h94-1072_metho.xml
Size: 15,050 bytes
Last Modified: 2025-10-06 14:13:48
<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1072"> <Title>DOCUMENT REPRESENTATION IN NATURAL LANGUAGE TEXT RETRIEVAL</Title> <Section position="4" start_page="364" end_page="364" type="metho"> <SectionTitle> 2. OVERALL DESIGN </SectionTitle> <Paragraph position="0"> We have established the general architecture of an NLP-IR system, depicted schematically below, in which an advanced NLP module is inserted between the textual input (new documents, user queries) and the database search engine (in our case, NIST's PRISE system \[4\]). This design has already shown some promise in producing a better performance than the base statistical system \[5,6,7\]. [Figure: schematic of the architecture; text input feeds the NLP module, whose representation feeds the database search engine.] In our system the database text is first processed with a sequence of programs that include a part-of-speech tagger, a lexicon-based morphological stemmer, and a fast syntactic parser (TTP). 3 Subsequently certain types of phrases are extracted from the parse trees and used as compound indexing terms in addition to single-word terms. The extracted phrases are statistically analyzed as syntactic contexts in order to discover a variety of similarity links between smaller subphrases and words occurring in them. A further filtering process maps these similarity links onto semantic relations (generalization, specialization, synonymy, etc.), after which they are used to transform a user's request into a search query.</Paragraph> <Paragraph position="1"> The user's natural language request is also parsed, and all indexing terms occurring in it are identified. Certain highly ambiguous, usually single-word terms may be dropped, provided that they also occur as elements in some compound terms. For example, &quot;natural&quot; may be deleted from a query already containing &quot;natural language&quot; because &quot;natural&quot; occurs in many unrelated contexts: &quot;natural number&quot;, &quot;natural logarithm&quot;, &quot;natural approach&quot;, etc. At the same time, other terms may be added, namely those which are linked to some query term through admissible similarity relations.</Paragraph> <Paragraph position="2"> For example, &quot;unlawful activity&quot; is added to a query (TREC topic 055) containing the compound term &quot;illegal activity&quot; via a synonymy link between &quot;illegal&quot; and &quot;unlawful&quot;.</Paragraph> <Paragraph position="3"> One of the observations made during the course of TREC-2 was that removing low-quality terms from the queries is at least as important as (and often more important than) adding synonyms and specializations. In some instances (e.g., routing runs) low-quality terms had to be removed (or inhibited) before similar terms could be added to the query, or else the effect of query expansion was all but drowned out by the increased noise.</Paragraph> <Paragraph position="4"> 3 For a description of the TTP parser, refer to \[8,9\].</Paragraph> <Paragraph position="5"> After the final query is constructed, the database search follows, and a ranked list of documents is returned. It should be noted that all processing steps, both those performed by the backbone system and those performed by the natural language processing components, are fully automated; no human intervention or manual encoding is required.</Paragraph> </Section>
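As an illustration of the query-construction step just described, here is a minimal Python sketch of dropping subsumed ambiguous terms and expanding through similarity links. The AMBIGUOUS set, the SIMILAR table, and build_query are illustrative stand-ins for the corpus-derived resources, not the actual system's components.

```python
# A minimal sketch of the query-construction step: drop highly
# ambiguous single-word terms that already occur inside compound
# terms, then expand the query along admissible similarity links.
# AMBIGUOUS and SIMILAR are toy stand-ins for the corpus-derived
# resources; names and data are illustrative, not the system's own.

AMBIGUOUS = {"natural"}  # words occurring in many unrelated contexts

# admissible similarity links, e.g., the synonymy link between
# "illegal" and "unlawful" that maps one compound onto another
SIMILAR = {"activity+illegal": ["activity+unlawful"]}

def build_query(single_terms, compound_terms):
    query = set(compound_terms)
    for term in single_terms:
        # keep the word unless it is ambiguous and subsumed by a compound
        subsumed = any(term in comp.split("+") for comp in compound_terms)
        if not (term in AMBIGUOUS and subsumed):
            query.add(term)
    # add terms linked to query terms through similarity relations
    for comp in compound_terms:
        query.update(SIMILAR.get(comp, []))
    return sorted(query)

print(build_query(["natural", "language"], ["language+natural"]))
# ['language', 'language+natural']
print(build_query(["activity", "illegal"], ["activity+illegal"]))
# ['activity', 'activity+illegal', 'activity+unlawful', 'illegal']
```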
<Section position="5" start_page="364" end_page="365" type="metho"> <SectionTitle> 3. SELECTING PHRASAL TERMS </SectionTitle> <Paragraph position="0"> Syntactic phrases extracted from the parse structures are represented as head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjuncts or arguments of the head. In the TREC experiments reported here we extracted head-modifier word pairs only, i.e., nested pairs were not used even though this was warranted by the size of the database. 4 Figure 1 shows all stages of the initial linguistic analysis of a sample sentence from the WSJ database. The reader may note that the parser's output is a predicate-argument structure centered around the main elements of various phrases. For example, BE is the main predicate (modified by HAVE) with 2 arguments (subject, object) and 2 adjuncts (adv, sub_ord). INVADE is the predicate in the subordinate clause with 2 arguments (subject, object). The subject of BE is a noun phrase with PRESIDENT as the head element, two modifiers (FORMER, SOVIET) and a determiner (THE). From this structure, we extract head-modifier pairs that become candidates for compound terms. In general, the following types of pairs are considered: (1) a head noun of a noun phrase and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb; a sketch of this extraction step follows the next paragraph.</Paragraph> <Paragraph position="1"> These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system; retrieval of information from databases; and information that can be retrieved by a user-controlled interactive search process. 5 We also attempted to identify and remove any terms which were explicitly negated in order to prevent matches against their positive counterparts, either in the database or in the queries.</Paragraph>
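To make the four pair types concrete, the following minimal Python sketch extracts head-modifier pairs from a toy clause representation; the dictionary format stands in for TTP's predicate-argument output and is purely illustrative.

```python
# A minimal sketch of head-modifier pair extraction over a toy
# predicate-argument structure (nested dicts standing in for the
# parser's output; this is not TTP's actual format). The four pair
# types from the text are covered: noun + left adjunct, noun + head
# of right adjunct, verb + object head, and subject head + verb.

def extract_pairs(clause):
    """clause: dict with 'verb', 'subject', 'object' entries;
    a phrase is a dict with 'head' and an optional 'adjuncts' list."""
    pairs = []
    verb = clause["verb"]
    subj, obj = clause.get("subject"), clause.get("object")
    for phrase in (subj, obj):
        if phrase:
            for adj in phrase.get("adjuncts", []):
                pairs.append((phrase["head"], adj))   # types (1) and (2)
    if obj:
        pairs.append((verb, obj["head"]))             # type (3)
    if subj:
        pairs.append((subj["head"], verb))            # type (4)
    return pairs

clause = {
    "verb": "invade",
    "subject": {"head": "tank", "adjuncts": ["russian"]},
    "object": {"head": "wisconsin"},
}
print(extract_pairs(clause))
# [('tank', 'russian'), ('invade', 'wisconsin'), ('tank', 'invade')]
```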
<Paragraph position="2"> One difficulty in obtaining head-modifier pairs of highest accuracy is the notorious ambiguity of nominal compounds. The pair extractor looks at the distribution statistics of the compound terms to decide whether the association between any two words (nouns and adjectives) in a noun phrase is both syntactically valid and semantically significant. For example, we may accept language+natural and processing+language from natural language processing as correct; however, case+trading would make a mediocre term when extracted from insider trading case. On the other hand, it is important to extract trading+insider to be able to match documents containing the phrases insider trading sanctions act or insider trading activity. In addition, phrases with a significant number of occurrences across different documents, including those for which no clear disambiguation into pairs can be obtained, are included as a third level of index (beside single-word terms and pairs). 6</Paragraph> <Paragraph position="3"> 4 Even with 2-word phrases, compound terms accounted for nearly 88% of all index entries; in other words, including 2-word phrases increased the index size approximately 8 times.</Paragraph> <Paragraph position="4"> 5 Longer phrases or nested pairs may be more appropriate in some cases, e.g., when former Soviet president is broken into former president and Soviet president, we get something potentially quite different from what the original phrase refers to, and this may have a negative effect on retrieval precision.</Paragraph> <Paragraph position="5"> [Figure 1: All stages of the initial linguistic analysis of a sample WSJ sentence; the parse-structure fragment is garbled in this extraction.] EXTRACTED TERMS & WEIGHTS: president 2.623519, soviet 5.416102, president+former 14.594883, hero 7.896426, invade 8.435012, tank 6.848128, tank+russian 16.030809, russian 7.383342, president+soviet 11.556747, hero+local 14.314775, tank+invade 17.402237, wisconsin 7.785689.</Paragraph> </Section> <Section position="6" start_page="365" end_page="366" type="metho"> <SectionTitle> 4. TERM WEIGHTING ISSUES </SectionTitle> <Paragraph position="0"> Finding a proper term weighting scheme is critical in term-based retrieval since the rank of a document is determined by the weights of the terms it shares with the query. One popular term weighting scheme, known as tf.idf, weights terms proportionately to their inverted document frequency scores and to their in-document frequencies (tf). The in-document frequency factor is usually normalized by the document length; that is, it is more significant for a term to occur in a short 100-word abstract than in a 5000-word article, although this is less clearly appropriate when the term's occurrences are concentrated in a single section or a paragraph rather than spread around the article. See the following section for more discussion.</Paragraph> <Paragraph position="1"> In our official TREC runs we used the normalized tf.idf weights for all terms alike: single 'ordinary-word' terms, proper names, as well as phrasal terms consisting of 2 or more words. 8 Whenever phrases were included in the term set of a document, the length of this document was increased accordingly. This had the effect of decreasing tf factors for 'regular' single-word terms.</Paragraph> <Paragraph position="2"> A standard tf.idf weighting scheme may be inappropriate for mixed term sets, consisting of ordinary concepts, proper names, and phrases, because: (1) It favors terms that occur fairly frequently in a document, which supports only general-type queries (e.g., &quot;all you know about 'star wars'&quot;). Such queries were not typical in TREC. (2) It attaches low weights to infrequent, highly specific terms, such as names and phrases, whose only occurrences in a document are often decisive for relevance. Note that such terms cannot be reliably distinguished using their distribution in the database as the sole factor, and therefore syntactic and lexical information is required. (3) It does not address the problem of inter-term dependencies arising when phrasal terms and their component single-word terms are all included in a document representation, i.e., launch+satellite and satellite are not independent, and it is unclear whether they should be counted as two terms.</Paragraph> <Paragraph position="3"> In our post-TREC-2 experiments we considered (1) and (2) only. We noted that linguistic phrases, that is, phrases derived from text through primarily linguistic means, display a markedly different statistical behaviour than 'statistical phrases', i.e., those obtained using frequency-based or probabilistic formulas such as Mutual Information \[11\]. For example, while statistical phrases with few occurrences in the corpus could be dismissed as insignificant or 'noise', infrequent linguistic phrases may in fact turn out to be quite important if only we could count all their implicit occurrences, e.g., as anaphors.</Paragraph> <Paragraph position="4"> Rather than trying to resolve anaphoric references, we changed the weighting scheme so that the phrases (but not the names, which we did not distinguish in TREC-2) were more heavily weighted by their idf scores, while the in-document frequency scores were replaced by logarithms multiplied by sufficiently large constants. In addition, the top N highest-idf matching terms (simple or compound) were counted more toward the document score than the remaining terms.</Paragraph> <Paragraph position="5"> Schematically, these new weights for phrasal and highly specific terms are obtained using the following formula, while weights for most of the single-word terms remain unchanged: $weight(T_i) = (C_1 \cdot \log(tf) + C_2 \cdot \alpha(N,i)) \cdot idf$. In the above, $\alpha(N,i)$ is 1 for $i < N$ and is 0 otherwise. 9</Paragraph> <Paragraph position="6"> 8 Specifically, the system used the lnc-ntc combination of weights, which is already one of the most effective options of tf.idf; see \[10\] for details.</Paragraph> <Paragraph position="7"> 9 The selection of a weighting formula was partly constrained by the fact that document-length-normalized tf weights were precomputed at the indexing stage and could not be altered without re-indexing the entire database. The intuitive interpretation of the $\alpha(N,i)$ factor is given in the following section.</Paragraph>
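The following minimal Python sketch applies the formula above to the matching terms of one document; the constants C1, C2, and N, and the sample term weights, are illustrative values, not those used in the reported experiments.

```python
import math

# A minimal sketch of the modified weight for phrasal and highly
# specific terms, weight(T_i) = (C1*log(tf) + C2*alpha(N,i)) * idf,
# where alpha(N,i) is 1 for the N highest-idf matching terms and 0
# otherwise. C1, C2, and N are illustrative values only.

C1, C2, N = 10.0, 20.0, 5

def phrase_weights(matching_terms):
    """matching_terms: list of (term, tf, idf) for one document.
    Returns {term: weight}, boosting the top-N idf terms by C2."""
    # rank matching terms by idf so that alpha(N, i) can be applied
    ranked = sorted(matching_terms, key=lambda t: t[2], reverse=True)
    weights = {}
    for i, (term, tf, idf) in enumerate(ranked):
        alpha = 1.0 if i < N else 0.0
        weights[term] = (C1 * math.log(tf) + C2 * alpha) * idf
    return weights

terms = [("tank+invade", 1, 17.40), ("tank", 3, 6.85), ("invade", 2, 8.44)]
print(phrase_weights(terms))
```

Note that a single-occurrence phrase (tf = 1) gets its entire weight from the C2 boost, which is the intended behaviour for infrequent but highly specific terms.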
<Paragraph position="8"> Table 1 illustrates the effect of differential weighting of phrasal terms using topic 101 and a relevant document (WSJ870226-0091) as an example. Note that while most of the affected terms have their weights increased, sometimes substantially, for some (e.g., space+base) the weight actually decreases. Table 2 shows how the ranks of the relevant documents change when phrasal terms are used with the new weighting scheme. Changing the weighting scheme for compound terms has led to an overall increase in precision of more than 20% over our official TREC-2 ad-hoc results. Table 3 summarizes statistics of the runs for queries 101-150 against the WSJ database, both with the new weighting scheme and with the standard tf.idf weighting.</Paragraph> </Section> <Section position="7" start_page="366" end_page="366" type="metho"> <SectionTitle> 5. 'HOT SPOT' RETRIEVAL </SectionTitle> <Paragraph position="0"> A long document may be relevant to a query on the strength of only a few short relevant passages. If the bulk of the document is not directly relevant to the query, then there is a strong possibility that the document will score low in the final ranking, despite some strongly relevant material in it. This problem can be dealt with by subdividing long documents at paragraph breaks, or into approximately equal-length fragments, and indexing the database with respect to these (e.g., \[12\]). While such approaches are effective, they also tend to be costly because of increased index size and more complicated access methods.</Paragraph> <Paragraph position="1"> Efficiency considerations have led us to investigate an alternative approach to 'hot spot' retrieval which would not require re-indexing of the existing database or any changes in document access. In our approach, the maximum number of terms on which a query is permitted to match a document is limited to the N highest-weight terms, where N can be the same for all queries or may vary from one query to another. Note that this is not the same as simply taking the N top terms from each query. Rather, for each document that matches the query on M terms, only min(M,N) of them, namely those which have the highest weights, will be considered when computing the document score. Moreover, only the global importance weights for terms are considered (such as idf), while the local in-document frequency (e.g., tf) is suppressed by either taking a log or replacing it with a constant. The effect of this 'hot spot' retrieval is shown in Table 4 in the ranking of relevant documents within the top 30 retrieved documents for topic 72.</Paragraph>
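A minimal Python sketch of this 'hot spot' scoring follows; the combination with the regular tf.idf ranking, described in the next paragraph, is included as a helper. The query weights, the sample document, and N are illustrative data, not taken from the experiments.

```python
# A minimal sketch of 'hot spot' document scoring: of the M query
# terms matching a document, only the min(M, N) with the highest
# global weights (idf) contribute, and the local tf is suppressed
# (here each match simply counts once). Data and N are illustrative.

N = 3  # per-query cap on matching terms; may vary from query to query

def hot_spot_score(query_idf, doc_terms):
    """query_idf: {term: idf}; doc_terms: set of terms in the document.
    Scores only the min(M, N) highest-idf matching terms."""
    matching = sorted(
        (idf for term, idf in query_idf.items() if term in doc_terms),
        reverse=True,
    )
    return sum(matching[:N])  # tf suppressed: each match counts once

def combined_score(tfidf_score, hotspot_score):
    # final ranking adds the regular tf.idf score and the hot-spot score
    return tfidf_score + hotspot_score

query = {"insider+trading": 16.0, "sanction": 9.0, "act": 2.0, "trading": 7.0}
doc = {"insider+trading", "trading", "act", "market"}
print(hot_spot_score(query, doc))  # 16.0 + 7.0 + 2.0 = 25.0
```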
<Paragraph position="2"> The final ranking is obtained by adding the scores of documents in the 'regular' tf.idf ranking and in the hot-spot ranking. While some recall may be sacrificed ('hot spot' retrieval often has lower recall than full-query retrieval, and this becomes the lower bound on recall for the combined ranking), the combined ranking precision has been consistently better than in either of the original rankings: the average improvement is 10-12% above the tf.idf run precision (which is often the stronger of the two). The 'hot spot' weighting is represented by the $\alpha(N,i)$ factor in the term weighting formula given in the previous section.</Paragraph> </Section> </Paper>