<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1098"> <Title>ROBUSTNESS, PORTABILITY, AND SCALABILITY OF NATURAL LANGUAGE SYSTEMS</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> ROBUSTNESS, PORTABILITY, AND SCALABILITY OF NATURAL LANGUAGE SYSTEMS </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> OBJECTIVE </SectionTitle> <Paragraph position="0"> In the DoD, every unit, from the smallest to the largest, communicates through messages. Messages are fundamental in command and control, in intelligence analysis, and in planning and replanning. Our objective is to create algorithms that will a) robustly process open source text, identifying relevant messages and updating a database from them; b) reduce the effort required to port natural language (NL) message processing software to a new domain from months to weeks; and c) scale to broad domains with vocabularies of tens of thousands of words.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> APPROACH </SectionTitle> <Paragraph position="0"> Our approach is to apply probabilistic language models and training over large corpora in all phases of natural language processing. 
This new approach will enable systems to adapt to both new task domains and previously unseen linguistic expressions by semi-automatically acquiring 1) a domain model, 2) facts required for semantic processing, 3) grammar rules, 4) information about new words, 5) probability models on frequency of occurrence, and 6) rules for mapping from semantic representation to application structure.</Paragraph> <Paragraph position="1"> For instance, a statistical model of categories of words will enable systems to predict the most likely category of a word the system has never encountered and to focus on its most likely interpretation in context, rather than skipping the word or considering all possible interpretations. Markov modelling techniques will be used for this problem.</Paragraph> <Paragraph position="2"> In an analogous way, statistical models of language will be developed and applied at the level of syntax (form), at the level of semantics (content), and at the contextual level (meaning and impact).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> RECENT RESULTS </SectionTitle> <Paragraph position="0"> Achieved performance in MUC-3 of identifying over 40% of the data present (&quot;recall&quot;) with an accuracy above 50% (&quot;precision&quot;). (Only one quarter of the systems in MUC-3 achieved comparable performance; we achieved this with half a person-year of effort to move to this domain, much less than the labor invested by the other top-performing groups.) 
Distributed POST, our software for statistically labelling words in text, to several other DARPA contractors (New York University, Syracuse University, and the University of Chicago).</Paragraph> <Paragraph position="1"> Ported our PLUM message processing system to a class of long-range air messages in only seven person-weeks.</Paragraph> </Section> <Section position="5" start_page="0" end_page="465" type="metho"> <SectionTitle> PLANS FOR THE COMING YEAR </SectionTitle> <Paragraph position="0"> Create automated procedures for the syntactic training of NL systems, both to improve system performance and to reduce human effort in porting the NL system to a new domain.</Paragraph> <Paragraph position="1"> Create automated procedures for semantic training. Develop strategies for automatically inferring a domain model from a corpus, a task that is highly labor-intensive with today's technology.</Paragraph> <Paragraph position="2"> Create a probabilistic model for predicting the most likely (partial) interpretations of a novel form or errorful input, both of which are significant challenges to current technology.</Paragraph> </Section> </Paper>