<?xml version="1.0" standalone="yes"?>
<Paper uid="X98-1020">
  <Title>Arab Hijackers' Demands Similar To Those of Hostage-Takers in Lebanon SUMMARIZER TOPIC: Evidence of Iranian support for Lebanese hostage takers</Title>
  <Section position="1" start_page="0" end_page="141" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Natural language processing techniques may hold a tremendous potential for overcoming the inadequacies of purely quantitative methods of text information retrieval. Under the Tipster contracts in phases I through III, GE group has set out to explore this potential through development and evaluation of new text processing techniques. This work resulted in some significant advances and in a better understanding on how NLP may benefit IR. Tipster research has laid a critical groundwork for future work.</Paragraph>
    <Paragraph position="1"> In this paper we summarize GE work on document detection in Tipster Phase III. Our summarization research is described in a separate paper appearing in this volume.</Paragraph>
    <Paragraph position="2"> Background The main thrust of this project has been to demonstrate that robust if relatively shallow NLP can help to derive better representation of text documents for indexing and search purposes than any simple word and string-based methods commonly used in statistical full-text retrieval. This was based on the premise that linguistic processing can uncover certain critical semantic aspects of document content, something that simple word counting cannot do, thus leading to more accurate representation. The project's progress has been rigorously evaluated in a series of five Text Retrieval Conferences (TREC's) organized by the U.S. Government under the guidance of NIST and DARPA. Since 1995, the project scope widened substantially to include several parallel efforts at GE, Rutgers, Lockheed Martin Corporation, New York University, University of Helsinki, and Swedish Institute for Computer Science (SICS).</Paragraph>
    <Paragraph position="3"> We have also collaborated with SRI International during TREC-6. At TREC we demonstrated that NLP can be done efficiently on a very large scale, and that it can have a significant impact on IR. At the same time, it became clear that exploiting the  full potential of linguistic processing is harder than originally anticipated.</Paragraph>
    <Paragraph position="4"> Not surprisingly, we have noticed that the amount of improvement in recall and precision which we could attribute to NLP, appeared to be related to the quality of the initial search request, which in turn seemed unmistakably related to its length (cf. Table 1). Long and descriptive queries responded well to NLP, while terse one-sentence search directives showed hardly any improvement. This was not particularly surprising or even new, considering that the shorter queries tended to contain highly discriminating words in them, and that was just enough to achieve the optimal performance. On the other hand, comparing various evaluation categories at TREC, it was also quite clear that the longer queries just did better than the short ones, no matter what their level of processing. Furthermore, while the short queries needed no better indexing than with simple words, their performance remained inadequate, and one definitely could use better queries. Therefore, we started looking into ways to build full-bodied search queries, either automatically or interactively, out of users' initial search statements.</Paragraph>
    <Paragraph position="5"> TREC-5 (1996), therefore, marks a shift in our approach away from text representation issues and towards query development problems. While our TREC-5 system still performs extensive text processing in order to extract phrasal and other indexing terms, our main focus moved on to query construction using words, sentences, and entire pas- null sages to expand initial search specifications in an attempt to cover their various angles, aspects and contexts. Based on the observations that NLP is more effective with highly descriptive queries, we designed an expansion method in which entire passages from related, though not necessarily relevant documents were quite liberally imported into the user queries.</Paragraph>
    <Paragraph position="6"> This method appeared to have produced a dramatic improvement in the performance of several different statistical search engines that we tested boosting the average precision by anywhere from 40% to as much as 130%. Therefore, topic expansion appears to lead to a genuine, sustainable advance in IR effectiveness. Moreover, we show in TREC-6 and TREC-7 that this process can be automated while maintaining the performance gains.</Paragraph>
    <Paragraph position="7"> The other notable new feature of our TREC-5 system is the stream architecture. It is a system of parallel indexes built for a given collection, with each index reflecting a different text representation strategy. These indexes are called streams because they represents different streams of data derived from the underlying text archive. A retrieval process searches all or some of the streams, and the final ranking is obtained by merging individual stream search results. This allows for an effective combination of alternative document representation and retrieval strategies, in particular various NLP and non-NLP methods. The resulting meta-search system can be optimized by maximizing the contribution of each stream. It is also a convenient vehicle for an objective evaluation of streams against one another.</Paragraph>
    <Paragraph position="8"> NLP-Based Indexing in Information Retrieval In information retrieval (IR), a typical task is to fetch relevant documents from a large archive in response to a user's query, and rank these documents according to relevance. This has been usually accomplished using statistical methods (often coupled with manual encoding) that (a) select terms (words, phrases, and other units) from documents that are deemed to best represent their content, and (b) create an inverted index file (or files) that provide an easy access to documents containing these terms. A subsequent search process will attempt to match preprocessed user queries against term-based representations of documents in each case determining a degree of relevance between the two which depends upon the number and types of matching terms. Although many sophisticated search and matching methods are available, the fundamental problem remains to be an adequate representation of content for both the documents and the queries.</Paragraph>
    <Paragraph position="9"> In term-based representation, a document (as well as a query) is transformed into a collection of weighted terms (or surrogates representing combinations of terms), derived directly from the document text or indirectly through thesauri or domain maps.</Paragraph>
    <Paragraph position="10"> The representation is anchored on these terms, and thus their careful selection is critical. Since each unique term can be thought to add a new dimensionality to the representation, it is equally critical to weigh them properly against one another so that the document is placed at the correct position in the N-dimensional term space3 Our goal is to have the documents on the same topic placed close together, while those on different topics placed sufficiently apart. The above should hold for any topics, a daunting task indeed, which is additionally complicated by the fact that we often do not know how to compute terms weights. The statistical weighting formulas, based on terms distribution within the database, such as tf*idf, are far from optimal, and the assumptions of term independence which are routinely made are false in most cases. This situation is even worse when single-word terms are intermixed with phrasal terms and the term independence becomes harder to justify.</Paragraph>
    <Paragraph position="11"> There are a number of ways to obtain &amp;quot;phrases&amp;quot; from text. These include generating simple collocations, statistically validated N-grams, part-of-speech tagged sequences, syntactic structures, and even semantic concepts. Some of these techniques are aimed primarily at identifying multi-word terms that have come to function like ordinary words, for example &amp;quot;white collar&amp;quot; or &amp;quot;electric car&amp;quot;, and capturing other co-occurrence idiosyncrasies associated with certain types of texts. This simple approach has proven quite effective for some systems, for example the Cornell group reported (Buckley et al., 1995) that adding simple collocations to the list of available terms can increase retrieval precision by as much as 10%.</Paragraph>
    <Paragraph position="12"> Other more advanced techniques of phrase extraction, including extended N-grams and syntactic parsing, attempt to uncover &amp;quot;concepts&amp;quot;, which would capture underlying semantic uniformity across various surface forms of expression. Syntactic phrases, for example, appear reasonable indicators of content, arguably better than proximity-based phrases, since they can adequately deal with word order changes and other structural variations (e.g., &amp;quot;college junior&amp;quot; vs. &amp;quot;junior in college&amp;quot; vs. &amp;quot;junior college&amp;quot;). A subsequent regularization process, 1in a vector-space model term weights are represented as coordinate values; in a probabilistic model estimates of prior probabilities are used.</Paragraph>
    <Paragraph position="13">  where alternative structures are reduced to a &amp;quot;normal form&amp;quot;, helps to achieve the desired uniformity, for example, &amp;quot;college+junior&amp;quot; will represent a college for juniors, while &amp;quot;junior+college&amp;quot; will represent a junior in a college. A more radical normalization would have also &amp;quot;verb object&amp;quot;, &amp;quot;noun rel-clause&amp;quot;, etc. converted into collections of such ordered pairs.</Paragraph>
    <Paragraph position="14"> This head+modifier normalization has been used in our system, and is further described in this paper.</Paragraph>
    <Paragraph position="15"> In order to obtain the head+modifier pairs of respectable quality, we used a full-scale robust syntactic parsing (TTP) In 1998, in collaboration with the University of Helsinki, we used their Functional Dependency Grammar system to perform all linguistic analysis of TREC data and to derive multiple dependency-based indexing streams.</Paragraph>
    <Section position="1" start_page="140" end_page="140" type="sub_section">
      <SectionTitle>
Stream-based Information Retrieval
Model
</SectionTitle>
      <Paragraph position="0"> The stream model was conceived to facilitate a thorough evaluation and optimization of various text content representation methods, including simple quantitative techniques as well as those requiring complex linguistic processing. Our system encompasses a number of statistical and natural language processing techniques that capture different aspects of document content: combining these into a coherent whole was in itself a major challenge. Therefore, we designed a distributed representation model in which alternative methods of document indexing (which we call &amp;quot;streams&amp;quot;) are strung together to perform in parallel. Streams are built using a mixture of different indexing approaches, term extracting and weighting strategies, even different search engines.</Paragraph>
      <Paragraph position="1"> The final results are produced by merging ranked lists of documents obtained from searching all streams with appropriately preprocessed queries, i.e., phrases for phrase stream, names for names stream, etc. The merging process weights contributions from each stream using a combination that was found the most effective in training runs. This allows for an easy combination of alternative retrieval and routing methods, creating a meta-search strategy which maximizes the contribution of each stream.</Paragraph>
      <Paragraph position="2"> Among the advantages of the stream architecture we may include the following: * stream organization makes it easier to compare the contributions of different indexing features or representations. For example, it is easier to design experiments which allow us to decide if a certain representation adds information which is not contributed by other streams.</Paragraph>
      <Paragraph position="3"> * it provides a convenient testbed to experiment with algorithms designed to merge the results obtained using different IR engines and/or techniques. null * it becomes easier to fine-tune the system in order to obtain optimum performance * it allows us to use any combination of IR engines without having to adapt them in any way.</Paragraph>
    </Section>
    <Section position="2" start_page="140" end_page="141" type="sub_section">
      <SectionTitle>
Advanced Linguistic Streams
Head+Modifier Pairs Stream
</SectionTitle>
      <Paragraph position="0"> Our linguistically most advanced stream is the head+modifier pairs stream. In this stream, documents are reduced to collections of word pairs derived via syntactic analysis of text followed by a normalization process intended to capture semantic uniformity across a variety of surface forms, e.g., &amp;quot;information retrieval&amp;quot;, &amp;quot;retrieval of information&amp;quot;, &amp;quot;retrieve more information&amp;quot;, &amp;quot;information that is retrieved&amp;quot;, etc. are all reduced to &amp;quot;retrieve+information&amp;quot; pair, where &amp;quot;retrieve&amp;quot; is a head or operator, and &amp;quot;information&amp;quot; is a modifier or argument. It has to be noted that while the head-modifier relation may suggest semantic dependence, what we obtain here is strictly syntactic, even though the semantic relation is what we are really after. This means in particular that the inferences of the kind where a head+modifier is taken as a specialized instance of head, are inherently risky, because the head is not necessarily a semantic head, and the modifier is not necessarily a semantic modifier, and in fact the opposite may be the case. In the experiments that we describe here, we have generally refrained from semantic interpretation of head-modifier relationship, treating it primarily as an ordered relation between otherwise equal elements.</Paragraph>
      <Paragraph position="1"> Nonetheless, even this simplified relationship has already allowed us to cut through a variety of surface forms, and achieve what we thought was a non-trivial level of normalization. The apparent lack of success of linguistically-motivated indexing in information retrieval may suggest that we haven't still gone far enough.</Paragraph>
      <Paragraph position="2"> In our system, the head+modifier pairs stream is  derived through a sequence of processing steps that include: 1. Part-of-speech tagging 2. Lexicon-based word normalization (extended &amp;quot;stemming&amp;quot; ) 3. Syntactic analysis with TTP parser 4. Extraction of head+modifier pairs</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>