<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1029">
  <Title>Bracketing Phrase Normalized Sentences Preprocessor HEADER Words/Sentences SYNTACTIC STRUCTURES Conceptual Conceptual CONCEPTUAL Templates Analysis FRAMES Discourse Analyst: Matcher instantiated Pattern Unification, Formatting &amp; editing</Title>
  <Section position="3" start_page="0" end_page="192" type="metho">
    <SectionTitle>
APPROACH
</SectionTitle>
    <Paragraph position="0"> PAKTUS may be viewed from two perspectives. In one view it is a generic environment for building NLP systems, incorporating modules for lexical acquisition, grammar building, and conceptual template specification. The other perspective focuses on the grammar, lexicon, concept templates, and parser already embedded within it, and views it as an NLP system itself. The early emphasis in developing PAKTUS was on those components supporting the former view. The grammar and lexicon that form the common core of English, as well as the stock of generic conceptual templates, entered PAKTUS primarily as a side effect of testing extensions to the NLP system development environment. More recent work has focused on extending the linguistic knowledge within the overall architecture, such as prepositional phrase attachment, compound nominals, temporal analysis, and metaphorical usage, and on adapting the core to particular domains, such as RAINFORM messages or news reports.</Paragraph>
    <Paragraph position="1"> The first step in this project was an evaluation of existing techniques for NLP, as of 1984. This evaluation included implementing rapid prototypes using techniques as in [1], [2], [3], and [4]. Judging that no one technique was adequate for a full treatment of the NLP problem, we adopted a hybrid approach, breaking the text understanding process into specialized modules for text stream preprocessing, lexical analysis (including morphology), syntactic analysis of clauses, conceptual analysis, domain-specific pattern matching based on an entire discourse (e.g., a news report), and final output-record generation.</Paragraph>
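    <Paragraph> The modular decomposition listed above can be pictured as a chain of stages, each consuming the output of the previous one. The following is an illustrative Python sketch only, not PAKTUS code: the stage names, the make_stage helper, and run_pipeline are hypothetical, and each stage is a stub that merely records that it ran.

```python
from typing import Callable, List

# A stage takes the current processing state and returns an updated one.
Stage = Callable[[dict], dict]


def make_stage(name: str) -> Stage:
    """Build a stub stage that only appends its name to the trace."""
    def stage(state: dict) -> dict:
        return {**state, "trace": state["trace"] + [name]}
    return stage


# Hypothetical stage names mirroring the modules named in the text.
PIPELINE: List[Stage] = [
    make_stage("preprocess"),
    make_stage("lexical_analysis"),
    make_stage("syntactic_analysis"),
    make_stage("conceptual_analysis"),
    make_stage("discourse_pattern_matching"),
    make_stage("output_record_generation"),
]


def run_pipeline(text: str) -> dict:
    """Run every stage in order over an initial state holding the text."""
    state = {"text": text, "trace": []}
    for stage in PIPELINE:
        state = stage(state)
    return state
```

    The sketch is strictly one-directional; it does not model the feedback from conceptual analysis to the syntactic parser.</Paragraph>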
    <Paragraph position="2"> Knowledge about word morphology was drawn from [5] and is represented as a semantic network, as is lexical and semantic knowledge in general. The grammar specification has been based on our analysis of message text, and draws from [5], [6], and [7]. It was first implemented as an augmented transition network (ATN), using a linguistic notation similar to that in [4]. This implementation relies on an interactive graphic interface to specify and debug grammar rules. More recent investigations focus on alternative formalisms.</Paragraph>
    <Paragraph position="3"> Our conceptual analysis combines aspects of conceptual dependency [1], [8], case grammar [9], semantic preferences [10], and psychology [11]. We provided a feedback loop from the conceptual analyzer to the syntactic parser for faster, more accurate analysis. We have found empirically that the current parser usually runs in linear time (on a Sun 3/260, about 0.1 second per word, regardless of sentence length). This is a result of the feedback together with "look ahead" tests at critical decision points. For the infrequent sentences that require more time, the parser terminates the analysis and returns the longest parsed substring or a run-on analysis. The resultant loss of recall is more than compensated for by increased system throughput. MUC-3 corpus sentences that PAKTUS parses completely (about 47% of the total corpus) average about two seconds of parse time, compared to about ten seconds each for partial and run-on parses (about 51% of the total corpus).</Paragraph>
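    <Paragraph> The cutoff behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the PAKTUS parser: a deterministic step budget stands in for parse time, the hypothetical step_cost function stands in for the real work of parsing a word, and the fallback returns the parsed prefix rather than a true longest parsed substring.

```python
from typing import Callable, List, Tuple


def parse_with_budget(words: List[str],
                      step_cost: Callable[[str], int],
                      budget: int) -> Tuple[str, List[str]]:
    """Consume words left to right until the step budget is exhausted.

    Returns ("complete", all words) if the whole sentence fits in the
    budget, otherwise ("partial", the words parsed before the cutoff).
    """
    spent = 0
    parsed: List[str] = []
    for word in words:
        spent += step_cost(word)
        if spent > budget:
            # Budget exceeded: stop and return what was parsed so far.
            return ("partial", parsed)
        parsed.append(word)
    return ("complete", parsed)
```

    Trading a complete analysis for a bounded cost per sentence is the design choice the text describes: some recall is lost, but throughput stays predictable.</Paragraph>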
    <Paragraph position="4"> Our first version of the domain-specific discourse pattern matcher was based on [2], but a more versatile version, based on specification-by-examples, was added during MUCK 2 development. This uses clause, sentence, and noun phrase semantic and syntactic patterns, and was used again in MUC-3. We have begun implementing discourse-level pattern matching (somewhat like scripts), but this was not sufficiently developed for use in MUC-3.</Paragraph>
    <Paragraph position="5"> In addition to these functional NLP components, PAKTUS has a broad set of development tools, including grammar construction tools and a debugger, a lexical acquisition interface, conceptual specification tools, domain pattern builders, and some automatic learning capabilities. These greatly facilitated adaptation of the system to the MUC-3 task.</Paragraph>
    <Paragraph position="6"> Processing begins with the arrival of an electronic stream of text, such as the MUC-3 corpus. The first function performed is the decomposition of the stream of characters into individual messages, message segments, sentences, and strings of words, based on document format specifications contained in a MUC-3 document specification template. The words identified by the preprocessor are mapped into entries in the lexicon, which contain information about their syntax and semantics, as illustrated in Figure 2 for the word "knows". The lexicon, stored in five databases, contains separate information for root words ("words" in Figure 2), concepts, and surface forms or "tokens". The latter are mapped into data structures based on the roots. These mappings are contained in the "parses" database of Figure 2. Compound words and idioms are first mapped into synthetic tokens, and then processed like other surface forms. All this information is organized in memory in two networks: a lexical net and a concept net. These two networks are linked by conceptual associations, as illustrated in Figure 2. The words, concepts, and associations are brought into memory only as needed in processing text. PAKTUS includes an object-oriented DBMS that performs these lexicon operations [12].</Paragraph>
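    <Paragraph> The token-to-root-to-concept lookup with on-demand loading can be sketched as below. This is a minimal illustration under stated assumptions, not the PAKTUS object-oriented DBMS: plain dictionaries stand in for three of the lexicon databases, and the entry contents (the category "verb", the concept names KNOW and MENTAL-STATE) are invented for the example.

```python
from typing import Dict, List

# Hypothetical stand-ins for the "parses", "words", and concept databases.
PARSES_DB: Dict[str, List[str]] = {"knows": ["know"]}  # token to roots
WORDS_DB: Dict[str, Dict[str, str]] = {
    "know": {"category": "verb", "concept": "KNOW"},
}
CONCEPTS_DB: Dict[str, Dict[str, str]] = {"KNOW": {"parent": "MENTAL-STATE"}}


class Lexicon:
    """Keeps a lexical net and a concept net, filled only on demand."""

    def __init__(self) -> None:
        self.lexical_net: Dict[str, Dict[str, str]] = {}
        self.concept_net: Dict[str, Dict[str, str]] = {}

    def lookup(self, token: str) -> List[dict]:
        """Map a surface token to its root entries, loading them lazily."""
        entries = []
        for root in PARSES_DB.get(token.lower(), []):
            if root not in self.lexical_net:
                # First use of this root: bring the word entry into memory.
                self.lexical_net[root] = WORDS_DB[root]
            word = self.lexical_net[root]
            concept = word["concept"]
            if concept not in self.concept_net:
                # Follow the conceptual association and load the concept too.
                self.concept_net[concept] = CONCEPTS_DB[concept]
            entries.append({"root": root, **word})
        return entries
```

    A token with no entry in the hypothetical parses table yields an empty list, which is the point at which morphological analysis of an unknown word would take over.</Paragraph>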
    <Paragraph position="7"> When PAKTUS encounters words that it has never seen previously, it tries to analyze them morphologically. The morphology module has information about approximately 250 affixes, one of which, -en, is illustrated in Figure 3. It analyzes words in terms of known roots and the affixes, although some words can be adequately analyzed without any knowledge of the root (e.g., any word not in the lexicon that ends in "ology" denotes an "information domain"). It derives both syntactic information and semantic information, producing an internal PAKTUS representation. In addition, for MUC-3, we added morphological heuristics for guessing syntactic and semantic information in many cases, even when the root is unknown (e.g., an unrecognized word ending in "ation" might be an abstract noun). If all of this fails to identify the word, the system deduces as much as it can from the syntactic and semantic context during parsing.</Paragraph>
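    <Paragraph> The root-independent suffix heuristics mentioned above can be sketched as a small lookup. This is an illustration only: the two suffixes and their semantic guesses come from the examples in the text, while the syntactic category "noun" attached to "ology", the table layout, and the guess_unknown function are assumptions, and the actual module covers roughly 250 affixes.

```python
from typing import List, Optional, Tuple

# (suffix, (guessed syntactic category, guessed semantic class)).
SUFFIX_GUESSES: List[Tuple[str, Tuple[str, str]]] = [
    ("ology", ("noun", "information domain")),
    ("ation", ("noun", "abstract noun")),
]


def guess_unknown(word: str) -> Optional[Tuple[str, str]]:
    """Guess category and semantics of an unknown word from its suffix."""
    lowered = word.lower()
    for suffix, guess in SUFFIX_GUESSES:
        # Require some material before the suffix so that the bare
        # suffix itself is not treated as a derived word.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return guess
    # No heuristic applies: leave the word to contextual analysis.
    return None
```

    Returning None corresponds to the last resort described in the text, where the system falls back on syntactic and semantic context during parsing.</Paragraph>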
  </Section>
  <Section position="4" start_page="192" end_page="194" type="metho">
    <SectionTitle>
LEXICAL DATABASES
</SectionTitle>
    <Paragraph position="0"> [Figure 2 step labels: 0. Word "Knows" Encountered. Fetch Word Parse Of "Knows". 2. Fetch Frames Of Roots "Know".] The next step in processing the text is to parse the sentences syntactically, according to a grammar specification, to generate syntactic configurations. The conceptual analyzer then maps the syntactic configurations into conceptual frames (concept structures with roles filled by phrase constituents), usually resolving much ambiguity in the process. If the syntax cannot be mapped into any conceptual frame, it is rejected and the syntactic parser tries alternatives. The first two levels of the conceptual network are shown in Figure 4. The conceptual roles used in PAKTUS are shown in Figure 5.</Paragraph>
    <Paragraph position="1"> The discourse analyzer collects all the conceptual frames for an entire narrative (i.e., an entire news report for the MUC-3 application) and produces application-specific structures that represent the information that is to be extracted from the document, based on the discourse template specifications. These structures are then reformatted into simulated MUC-3 database updates (i.e., filled templates). An example is given below.</Paragraph>
    <Paragraph position="2"> There are important feedback points in this process, as shown in Figure 1. For example, the conceptual analyzer may notify the syntactic parser that a proposed parse is semantically unacceptable, signalling that an alternative parse should be attempted. This semantic testing is always invoked at the clause level, and sometimes sooner. In addition, when confronted with two computationally expensive paths, "look ahead" procedures that quickly scan the sentence are invoked to decide which to try first. For example, a past participle following a noun may or may not signal the beginning of a relative clause in which the noun is the direct object. In this case, a partial conceptual analysis quickly determines whether the noun can be mapped into any concept associated with the verb. If it cannot, the relative clause path is not pursued.</Paragraph>
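    <Paragraph> The look-ahead test for a past participle following a noun can be sketched as a cheap semantic compatibility check made before the expensive parse path is tried. This is a toy illustration, not the PAKTUS procedure: the two tables, their contents, and the function name are hypothetical, and a flat semantic class stands in for the concept network.

```python
from typing import Dict, Set

# Hypothetical: semantic classes each verb root accepts as direct object.
OBJECT_CLASSES: Dict[str, Set[str]] = {"bomb": {"building", "vehicle"}}

# Hypothetical: semantic class of each known noun root.
NOUN_CLASS: Dict[str, str] = {"embassy": "building", "idea": "abstraction"}


def try_relative_clause(noun: str, participle_root: str) -> bool:
    """Decide whether the reduced relative clause path is worth pursuing.

    The path is pursued only if the noun can fill the direct object
    role of the verb underlying the past participle.
    """
    allowed = OBJECT_CLASSES.get(participle_root, set())
    return NOUN_CLASS.get(noun) in allowed
```

    With these toy tables, "embassy" followed by a participle of "bomb" keeps the relative clause path open, while "idea" rules it out without any full parse being attempted.</Paragraph>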
  </Section>
  <Section position="5" start_page="194" end_page="196" type="metho">
    <SectionTitle>
APPLICATION TO MUC-3
</SectionTitle>
    <Paragraph position="0"> To apply PAKTUS to MUC-3, five tasks were performed. Due to the modular design of PAKTUS, these are clearly delineated and were performed by different people. Some tasks must be initiated in sequence, but may be cascaded as the corpus of text is processed, so that, except for a brief period at the beginning of the task, work proceeded in parallel. The number of changes to various knowledge bases is summarized in Figure 6. The five tasks were: * Build a template specifying the formats of the input streams (dev-muc3, tst1 and tst2). This was easy and required about a day to perform. * Read in the documents and update the lexicon using the PAKTUS interactive graphic interface. This was relatively easy for those words (typically nouns) that only require categorization, but not conceptual mapping specifications (as verbs do). The latter has often been done successfully by relying on PAKTUS default values.</Paragraph>
    <Paragraph position="1"> * Adapt the grammar to the sublanguage of the application. Actually changing the grammar is easy with the PAKTUS interactive graphic tools for this, but determining what the grammar of the sublanguage is may be quite difficult, requiring much linguistic knowledge and study of the corpus. Changes for MUC-3 were minor.</Paragraph>
    <Paragraph position="2"> * Define the application-specific discourse templates. This is the least developed component of PAKTUS, and the one that will receive the most attention in continuing work, such as for MUC-4. For MUC-3, phrase and sentence-level patterns were defined. A function unified these and mapped them into the 18-slot MUC-3 templates.</Paragraph>
    <Paragraph position="3"> * Specify and implement the interface to the application system (the MUC-3 template fills). This was tedious, but easy compared to the other tasks. It is strictly conventional software engineering.</Paragraph>
    <Paragraph position="4"> PAKTUS's operation for MUC-3. PAKTUS processes text sequentially, first stripping off the document header, then identifying sentences, which are processed syntactico-semantically one at a time, after which all the results are passed to the discourse component.</Paragraph>
    <Paragraph position="5"> word boundaries have been identified. Note that "Soviet Union" is treated as a single word, since it names an entity represented in the lexicon.</Paragraph>
    <Paragraph position="6"> The lexical analysis of this sentence is shown in Figure 8. Each word has one or more senses, represented as a root symbol, which is generally the concatenation of the English token, the "^" character, and the PAKTUS lexical category (e.g., "Report^Monotrans"), or as a simple structure involving a root, lexical category, inflectional mark, and sometimes a conceptual derivation (e.g., the structure "(Report^Monotrans L^Effect-mark Base C^It-got)" represents the adjective sense of "reported"). For each word, all senses in the PAKTUS lexicon are fetched or derived at this time; disambiguation is generally delayed until the syntactic and semantic phases. An exception in this example is the word "tonight", which has been replaced by the date from the dateline of this MUC-3 news report. The syntactic and conceptual analyses of this sentence are shown in Figures 9 and 10, respectively. Note that conceptual structures are produced for some nouns (e.g., "embassies"), not just for verbs.</Paragraph>
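    <Paragraph> The root symbol notation can be illustrated with a pair of small helpers. This is a sketch under stated assumptions, not PAKTUS code: the separator is taken to be the caret that appears in the structure (Report^Monotrans L^Effect-mark Base C^It-got), the capitalization of the token follows that one example, and both function names are hypothetical.

```python
from typing import Tuple


def root_symbol(token: str, category: str) -> str:
    """Join an English token and a lexical category with a caret."""
    return f"{token.capitalize()}^{category}"


def split_root_symbol(symbol: str) -> Tuple[str, str]:
    """Recover the token and category parts of a root symbol."""
    token, _, category = symbol.partition("^")
    return token, category
```

    Keeping the category inside the symbol means that two senses of the same token in different lexical categories get distinct symbols, which fits the practice of fetching all senses first and disambiguating later.</Paragraph>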
    <Paragraph position="7"> Figure 14 illustrates the ability of PAKTUS to deal with unknown words, which is essential in any application that continually processes new text. This shows the syntactic analysis of a sentence from one of the "test2" reports. It contains three words that cannot be derived from the PAKTUS lexicon: "Estevez", "MPTL", and "supposed". PAKTUS made assumptions about each word, based on morphology and syntactico-semantic context. It was able to produce a reasonably accurate parse, by guessing that "Estevez" is a Spanish name, and recognizing that "MPTL" is a noun in apposition with the preceding noun phrase, and that "supposed" must in this case be a passive voice monotransitive verb.</Paragraph>
  </Section>
  <Section position="6" start_page="196" end_page="197" type="metho">
    <SectionTitle>
POLICE HAVE REPORTED THAT TERRORISTS TONIGHT BOMBED THE
EMBASSIES OF THE PRC AND THE SOVIET UNION.
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
</Paper>