<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1095">
  <Title>New York University Proteus Project: ROBUST AND PORTABLE TEXT PROCESSING</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ROBUST AND PORTABLE TEXT PROCESSING
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
PROJECT GOALS
</SectionTitle>
    <Paragraph position="0"> Our primary goal is the development of robust and portable systems for processing natural language text, particularly for the purposes of extracting information or retrieving passages or documents from a text collection. A major focus has been on automatically or semi-automatically acquiring the syntactic and semantic characteristics of new domains from samples of text.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
RECENT WORK
</SectionTitle>
    <Paragraph position="0"> Over the last few months we have begun to adapt our Proteus information extraction system to process reports of terrorist activity. This is being done as part of MUC-3 (the third Message Understanding Conference), a comparative evaluation of information extraction systems organized by the Naval Ocean Systems Center. These reports are substantially more complex, both syntactically and semantically, than the Navy operational reports we previously processed. Considerable effort has therefore gone into extending our grammatical coverage and improving the efficiency and accuracy of our parser; we have experimented with several techniques, including closest attachment heuristics, merging of alternative analyses of constituents, statistically trained stochastic grammars, and stochastic part-of-speech tagging prior to parsing (the last done in collaboration with BBN Systems and Technologies Corp.).</Paragraph>
    <Paragraph position="1"> We have also begun to investigate the benefits of natural language processing for document retrieval. We have developed a fast and robust Tagged Text Parser, which uses text stochastically tagged by part-of-speech (again with the assistance of BBN), and generates full syntactic analyses (possibly skipping some unanalyzable constituents). These parses are then used to identify co-occurrence patterns and compute similarity coefficients between words. This work, by Strzalkowski and Vauthey, is reported in a separate paper in these proceedings.</Paragraph>
    <Paragraph position="2"> Finally, as part of our research on sublanguage-based machine translation, we are continuing development of our Japanese grammar.</Paragraph>
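    <Paragraph position="3"> As an illustration only, and not the Proteus implementation, the co-occurrence and similarity step described above for document retrieval could be sketched as follows. The input format of (head, modifier) pairs, the function names, the sample words, and the choice of cosine similarity are all assumptions made for this example; the actual method is reported in the separate paper by Strzalkowski and Vauthey.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(pairs):
    """Map each word to a Counter of the words it co-occurs with."""
    vectors = defaultdict(Counter)
    for head, modifier in pairs:
        # Record the relation in both directions so that heads and
        # modifiers each accumulate a context profile.
        vectors[head][modifier] += 1
        vectors[modifier][head] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (an assumed measure)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical (head, modifier) pairs, standing in for parser output.
pairs = [
    ("attack", "guerrilla"),
    ("attack", "terrorist"),
    ("bombing", "guerrilla"),
    ("bombing", "terrorist"),
    ("report", "annual"),
]
vectors = context_vectors(pairs)
sim_related = cosine(vectors["attack"], vectors["bombing"])
sim_unrelated = cosine(vectors["attack"], vectors["report"])
```

In this toy data, "attack" and "bombing" share all of their modifiers and so receive a high coefficient, while "attack" and "report" share none and receive zero.</Paragraph>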
  </Section>
  <Section position="4" start_page="0" end_page="421" type="metho">
    <SectionTitle>
PLANS FOR THE COMING YEAR
</SectionTitle>
    <Paragraph position="0"> Our primary task for the remainder of MUC-3 (through May 1991) will be to incorporate additional semantic information about the terrorist domain. We intend to do this, as much as possible, through semi-automatic techniques driven by the parsed corpus of 1300 reports. Following completion of the formal evaluation in May, we expect to spend considerable time evaluating the contributions of various system features to our overall performance. For document retrieval, we intend to utilize a clustering procedure which, based on the similarity coefficients we have generated, will form domain-specific word classes. We will then investigate the benefits of using these word classes as a thesaurus for keyword-based document retrieval, using a standard document test collection.</Paragraph>
    <Paragraph position="1"> Finally, we expect over the coming year to bring together our pilot study of sublanguage-based machine translation, our work on reversible grammars, and our Japanese and English grammars to create a reversible Japanese-English translation system, albeit initially operating on a very limited domain and vocabulary.</Paragraph>
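    <Paragraph position="2"> The document retrieval plan above, forming word classes from similarity coefficients and using them as a thesaurus, might look roughly like the sketch below. This is a hypothetical example, not the planned procedure: the single-link grouping, the threshold value, the function names, and the sample coefficients are all assumptions chosen to keep the example small.

```python
def word_classes(similarities, threshold):
    """Group words whose pairwise similarity meets the threshold (single link)."""
    parent = {}

    def find(word):
        # Union-find lookup; unseen words start as their own class.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for (a, b), score in similarities.items():
        root_a, root_b = find(a), find(b)  # also registers both words
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for word in parent:
        classes.setdefault(find(word), set()).add(word)
    return list(classes.values())


def expand_query(keywords, classes):
    """Add every classmate of each keyword, treating the classes as a thesaurus."""
    expanded = set(keywords)
    for word_class in classes:
        if not word_class.isdisjoint(keywords):
            expanded |= word_class
    return expanded


# Hypothetical similarity coefficients between word pairs.
similarities = {
    ("attack", "bombing"): 0.9,
    ("bombing", "explosion"): 0.7,
    ("attack", "report"): 0.1,
}
classes = word_classes(similarities, threshold=0.5)
expanded = expand_query(["bombing"], classes)
```

Single-link grouping was picked here only because it is short to write; it tends to chain loosely related words together, so a real system would likely want a stricter clustering criterion.</Paragraph>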
  </Section>
</Paper>