<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1102"> <Title>ROBUST TEXT PROCESSING AND INFORMATION RETRIEVAL</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> ROBUST TEXT PROCESSING AND INFORMATION RETRIEVAL </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> PROJECT GOALS </SectionTitle> <Paragraph position="0"> The general objective of this research has been the enhancement of traditional keyword-based statistical methods of document retrieval with advanced natural language processing techniques. In the work to date the focus has been on obtaining a better representation of document contents by extracting representative phrases from syntactically preprocessed text.</Paragraph> <Paragraph position="1"> In addition, statistical clustering methods have been developed that generate domain-specific term correlations which can be used to obtain better search queries via expansion.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> RECENT RESULTS </SectionTitle> <Paragraph position="0"> A prototype text retrieval system has been developed in which a robust natural language processing module is integrated with a traditional statistical engine (NIST's PRISE). Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process users' natural language requests into effective search queries. The statistical engine builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. 
The feasibility of this approach has been demonstrated in various experiments with 'standard' IR collections such as CACM-3204 and Cranfield, as well as in the large-scale evaluation with the TIPSTER database.</Paragraph> <Paragraph position="1"> The centerpiece of the natural language processing module is the TTP parser, a fast and robust syntactic analyzer which produces 'regularized' parse structures out of running text.</Paragraph> <Paragraph position="2"> The parser, presently the fastest of this type, is designed to produce full analyses, but is capable of generating approximate 'best-fit' structures when under time pressure or when faced with unexpected input.</Paragraph> <Paragraph position="3"> We participated in the first Text Retrieval Conference (TREC-1), during which a total of 500 MBytes of Wall Street Journal articles was parsed. An enhanced version of the TTP parser has been developed for this purpose, with the average speed ranging from 0.3 to 0.5 seconds per sentence. We also developed and improved the morphological word stemmer and the syntactic dependencies extractor, and tested several clustering formulas. A close co-operation with BBN has produced a better part-of-speech tagger, which is an essential pre-processor before parsing.</Paragraph> <Paragraph position="4"> We also took part in the continuing parser/grammar evaluation workshop. In informal test runs with a 100-sentence sample of WSJ material, TTP has come out surprisingly strong among 'regular' parsers, which are hundreds of times slower and far less robust. During the latest meeting the focus of the evaluation effort has shifted toward 'deeper' representations, including operator-argument structures, which are the standard form of output from TTP. 
During the last year, TTP licenses have been issued to several sites for research purposes.</Paragraph> <Paragraph position="5"> In another effort, in co-operation with the Canadian Institute of Robotics and Intelligent Systems (IRIS), a number of qualitative methods for predicting semantic correctness of word associations are being tested. When finished, these results will be used to further improve the accuracy of document representation with compound terms.</Paragraph> <Paragraph position="6"> Research on reversible grammars continued last year with some more important results, including a formal evaluation system for generation algorithms and a generalized notion of guides for controlling the order of evaluation.</Paragraph> </Section> <Section position="4" start_page="0" end_page="408" type="metho"> <SectionTitle> PLANS FOR THE COMING YEAR </SectionTitle> <Paragraph position="0"> The major effort in the coming months is the participation in the TREC-2 evaluation. For this purpose we acquired a new version of the PRISE system, which is currently being adapted to work with the language processing module. New methods of document ranking are also being considered, including local scores for the most relevant fragments within a document. New clustering methods are being tested for generating term similarities, as well as more effective filters to subcategorize similarities into semantic classes.</Paragraph> </Section> </Paper>