<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0909"> <Title>NLP and Industry: Transfer and Reuse of Technologies*</Title> <Section position="4" start_page="57" end_page="57" type="metho"> <SectionTitle> COAT OF BMS10-11 TYPE 1 PRIMER PER BAC5736 (F18.01) REATTACH IDENTIFICATION TAG </SectionTitle> <Paragraph position="0"> These examples exhibit misspellings, irregular punctuation and nomenclature, and direct, indirect, and mixed reference, all of which indicate the prospective usefulness of an NLP approach.</Paragraph> <Section position="1" start_page="57" end_page="57" type="sub_section"> <SectionTitle> 2.3 The NLP Solution: a Process View </SectionTitle> <Paragraph position="0"> Because these are production databases that are constantly undergoing change, freezing them entails temporarily removing them from production use, which can be a very expensive undertaking. Hence, the automated mass-change process must be able to run reliably in a very small window of time. By distributing the processing of the texts across many Unix workstations, the time required for a typical run (ranging from 6500 to 130,000 texts) has been reduced to approximately 1.5 hours, thus minimizing downtime cost.</Paragraph> <Paragraph position="1"> Figure 1 schematically represents the mass-change process. Initially, a subset of the on-line database's records is extracted and downloaded (1). The records are divided into key and text portions, made unique, and normalized (2). The plan set is then partitioned (3) according to the type of operation and/or finish material, and these partitioned sets are distributed for subsequent processing across available workstations.</Paragraph> <Paragraph position="2"> Then, for each partition, the plans undergo spelling correction (4), driven by a mutual information model \[1\] constructed by prior exposure to and generalization over large amounts of text corpora. 
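The partitioning step (3) can be sketched minimally as follows; the function and helper names here are illustrative, not the paper's actual implementation, and each resulting partition would then be dispatched to an available workstation:

```python
from collections import defaultdict

def partition_plans(plans, key_of):
    """Group plan texts into partitions by operation or finish-material type.

    `plans` is a list of (record_key, text) pairs; `key_of` is a
    hypothetical function mapping a text to its partition label.
    """
    partitions = defaultdict(list)
    for record_key, text in plans:
        partitions[key_of(text)].append((record_key, text))
    return dict(partitions)

# Toy example: partition by the leading operation word.
plans = [("k1", "APPLY PRIMER"), ("k2", "REATTACH TAG"), ("k3", "APPLY COAT")]
parts = partition_plans(plans, key_of=lambda t: t.split()[0])
```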
This process, discussed in more detail in the next section, feeds the NLP system proper. The NLP system spans the continuum from lexical tokenization (5), including the use of the two-level morphology tool PC-KIMMO \[2, 9, 8\], which allows for a finite-state structured lexicon, through phrase structure parsing using a hybrid syntactic-semantic grammar (6), to semantic and discourse interpretation (8), and finally to the new plan generation stage (9). The tokenization and grammatical subprocesses are implemented in the C programming language. Text strings are tokenized by employing a subsystem built around lex, a Unix lexical analysis tool \[1\]. The grammatical processing is performed by a yacc-like LR(1) parser \[1, 16\] extended to include backtracking, inheritance, token-stream manipulation, and the use of semantic hierarchies, described in the next section. The semantic hierarchies (7) are also used by the later interpretation and generation modules. Most of the interpretation and generation modules are implemented in Prolog because robust inference is required. The semantic representations of those texts which fit the requirements of the change rules then undergo generation: working from the input semantic representation of an individual text and the generation rule set, a new plan is generated for each appropriate operation text. Once all texts in every partition have been fully processed, resulting in multiple sets of plans, the texts are reattached to their original keys (10) and formatted (11) to various specifications (a report to be inspected by analysts, etc.), including a database record format. 
The set of new database records is then uploaded to the mainframe database, and the database is again placed into production.</Paragraph> </Section> </Section> <Section position="5" start_page="57" end_page="75" type="metho"> <SectionTitle> 3 Components </SectionTitle> <Paragraph position="0"> This section describes in more detail key components of the NLP tool set. These include spelling correction, parsing, and semantic interpretation. The discussion of these three modules will similarly center on the mass-change application of the previous section, with additional comments on the interpretation component provided with respect to another application, that of a query interface to a project and program scheduling system. The mass-change plan generation process is also described.</Paragraph> <Section position="1" start_page="58" end_page="58" type="sub_section"> <SectionTitle> 3.1 Spelling Correction </SectionTitle> <Paragraph position="0"> The spelling correction process represented by node (4) in Figure 1 utilizes a statistical mutual information model \[5\] to detect and correct spelling errors, based on the observation that spelling errors are statistically abnormal patterns. The intent of spelling correction is therefore to modify the word sequence minimally to make it statistically normal. The approach we have pursued is to use a bigram mutual information model, created by pre-processing a huge domain-specific textual corpus (obtained, perhaps as in our case, by downloading an entire textual database), to guide spelling correction over new text within that domain (Figure 2). A new model is created each time the domain changes; this is especially important if the domains are narrowly circumscribed and company-specific. In the mass-change procedure, spelling correction is applied to the new corpus en masse at node (4). 
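A minimal sketch of the bigram mutual information scoring that underlies this detection step, assuming the standard definition MI(x, y) = log2 P(x, y) / (P(x) P(y)); the corpus and function names are illustrative, not the paper's implementation:

```python
import math
from collections import Counter

def build_mi_model(corpus_tokens):
    """Return a scorer mi(x, y) = log2 P(x,y) / (P(x) P(y)) over a corpus."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def mi(x, y):
        if bigrams[(x, y)] == 0:
            return float("-inf")  # unseen pair: maximally abnormal
        p_xy = bigrams[(x, y)] / n_bi
        return math.log2(p_xy / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))

    return mi

# Toy domain corpus; a real model would be built from megabytes of plan texts.
corpus = "apply one coat of primer per spec apply one coat of sealant".split()
mi = build_mi_model(corpus)
```

A word sequence whose bigrams score abnormally low (or are unseen) is flagged, and candidate corrections are ranked by how much they restore statistically normal scores.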
Statistically unlikely words are corrected to statistically likely candidates.</Paragraph> <Paragraph position="1"> In general, there are problems inherent to the detection of spelling errors. For example, not all unknown words encountered are errors; they may simply not have been seen before. Furthermore, not all known words are correct; such errors are epitomized by typographic variations and incongruous word sequences. The mass-change corpus exhibited the following occurrences (with intended word bracketed to the right): Other anomalies which a spelling correction routine must contend with are split words (with one or more spaces intervening) and run-on words (where no space separates two words). In addition, there is the possibility that the error-to-correction mapping is non-invariant.</Paragraph> <Paragraph position="2"> A statistical approach to spelling correction has some advantages and some disadvantages. Among the advantages are: it corrects the majority of errors, those classified as nonwords, misspelled words, word-splits, and run-ons; the automated acquisition of domain-specific data is easily maintainable; and the use of a statistical model enforces consistent lexical usage. A disadvantage is that correlated recall and precision may not be high, i.e., some errors may be missed and some may be corrected incorrectly. However, reasonably good recall (>75%) coupled with very high accuracy (>95%) can be expected. Other disadvantages are that there is no clear strategy for multi-error detection and correction, and the fact that such a large corpus (20 megabytes in our mass-change corpus) is required to create a good statistical model.</Paragraph> </Section> <Section position="2" start_page="58" end_page="75" type="sub_section"> <SectionTitle> 3.2 Parsing </SectionTitle> <Paragraph position="0"> For parsing, we use a generalized LR(1) shift/reduce parser \[16, 1, 10\]. 
Like yacc (which, given a grammar, generates a parser for that grammar), our parser precompiles the CFG into a state-transition table. The parser exercises CFG rules annotated with syntactic and semantic action routines, thus allowing for synthesized and inherited attributes. In addition to the rules, other knowledge stores integrated into the parser's processing are a thematic role hierarchy and a semantic domain network, both of which are also used by lexical entries in a morphologically partitioned lexicon. The parser uses a linked list of structured tokens (displayed in (4) below), and returns only one parse. To facilitate robust parsing, the parser also allows the developer to activate grammar-directed token dropping, token hypothesizing, and token type coercion.</Paragraph> <Paragraph position="1"> (4) Token Structure <id: numerical identifier for the token; surfform: surface form (i.e. actual string) for the token; rootform: root form of the token; value: value (semrep) associated with id; assertions: \[I; scat: subcategorization requirement for the token, where the scat format is (ext_arg int_arg1 int_arg2 ...), and where each argument must be a grammar symbol (exception: int_arg1 may be a string enclosed within #, e.g.</Paragraph> <Paragraph position="2"> #into#contact~with#); ext_arg may be NULL/nil; feature: feature associated with token; next: ptr to next polysemous token> The parser permits arbitrary backtracking, including over polysemous or composed tokens (idioms), over grammar rules, and over object hierarchies (entity, property, and predicate types in the hybrid domain model), though in practice time and node limits are set. The backtracking facility also includes the developer-specified cut, an operator to force the termination of a grammar rule. An example of backtracking over polysemous tokens is displayed in the following abbreviated trace from the mass-change process. 
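The token structure in (4) can be rendered as a small sketch; the field names follow the paper, but the class itself and the example readings are illustrative (the actual implementation is a C struct):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """One reading of a lexical token; `next` links polysemous readings."""
    id: int                          # numerical identifier for the token
    surfform: str                    # surface form (the actual string)
    rootform: str                    # root form of the token
    value: Optional[str] = None      # semantic representation (semrep)
    scat: tuple = ()                 # subcategorization: (ext_arg, int_arg1, ...)
    feature: Optional[str] = None    # feature associated with the token
    next: Optional["Token"] = None   # ptr to next polysemous reading

# Two hypothetical readings of the same surface form, chained so the
# parser can backtrack from one reading to the next.
coat_noun = Token(id=2, surfform="COAT", rootform="coat", feature="N")
coat_verb = Token(id=1, surfform="COAT", rootform="coat", feature="V",
                  next=coat_noun)
```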
As noted, we employ a hybrid syntactic-semantic grammar, primarily because such a hybrid permits generality (at higher nonterminals) and specificity (at terminals and lower nonterminals).</Paragraph> <Paragraph position="3"> Difficulty in parsing: no transition for token NUMBER\[168\] from state 83 current stack (in reverse): statestack\[1\]: \[FINISH\[124\] \]</Paragraph> <Paragraph position="5"> Difficulty in parsing: no transition for token COMPOSITION\[69\] from state current stack (in reverse): statestack\[2\]: \[COATING\[62\] nphrase\[4109\] top : discourse EOS When enabled, the token-dropping option allows a grammar rule to be matched by dropping a token (from a pre-specified set of droppable tokens), and is applied only when a sentence will not parse without dropping the token. In addition, the parser will hypothesize a token when the input sentence will not parse strictly by using the grammar rules. Similarly, the parser will coerce the unexpected type of a token to a type which is acceptable, should the parse otherwise fail.</Paragraph> </Section> <Section position="3" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 3.3 Semantic Interpretation </SectionTitle> <Paragraph position="0"> The mass-change procedure does not require the complex referential semantics that NLIs require. The semantics and discourse components can be simpler because the application requirements are simpler. In all our NLP applications, however, both domain-dependent and domain-independent information constitute the semantic model, which is used jointly by the grammatical module (written in C) and the interpretation/generation module (written in Prolog). Each token has a semantic marker which acts as an index into the semantic domain model.</Paragraph> <Paragraph position="1"> The morphologically generative lexicon is the primary knowledge store, associating the input (surface) text form, its tokenization, and the semantic marker. 
The grammatical module uses the lexicon to drive its work, but also uses the semantic model directly to enable type inheritance and, in some cases, the type coercion of semantic markers.</Paragraph> <Paragraph position="2"> The semantic domain model consists of a set of assertions of the form object(Child, \[Relation, Parent\]), where Relation is either 'isa' or 'ispart', and the three possible roots of the hierarchies are 'entity', 'predicate', and 'property'. These are defined by a developer and entered into the GraphEd tool \[14\], a graph editor which outputs an ASCII representation of a network. The ASCII form can be transformed and used by both the parser and the backend Prolog interpretation processes. The output of the grammatical module is a combined syntactic-semantic representation of the input plan text in the form of a list of binary predicates capturing the tree structure. Each predicate is of the form predicate(skolem-constant, value), with skolem constants representing the nodes of the tree. The semantic entity markers are those items which are the values of 'instance' predicates; for example, instance(n9, person) asserts that 'n9' is an 'instance' of semantic class 'person'. 
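The hierarchy lookup that such assertions support can be sketched as follows; the assertions are mirrored here as Python tuples rather than Prolog facts, and the particular classes are illustrative:

```python
# Assertions of the form object(Child, [Relation, Parent]), mirrored
# as (child, (relation, parent)) tuples; class names are illustrative.
assertions = [
    ("person",   ("isa", "entity")),
    ("mechanic", ("isa", "person")),
    ("primer",   ("isa", "entity")),
]

def inherits(child, ancestor, facts):
    """True if `child` reaches `ancestor` by following 'isa' links upward."""
    parents = {c: p for c, (rel, p) in facts if rel == "isa"}
    while child in parents:
        child = parents[child]
        if child == ancestor:
            return True
    return False
```

This kind of walk is what lets the parser treat a token marked 'mechanic' as an acceptable filler wherever the grammar expects an 'entity'.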
The syntactic-semantic representation is then asserted as the primary knowledge store in the final interpretation and generation module.</Paragraph> <Paragraph position="3"> Additional knowledge sources used in the Prolog interpretation and generation module are: a database of finish codes and their associated information, including the number of coats of application required, color number, color name, and material type of each relevant finish code; a set of material-specific databases which include the materials and the associated generation requirements rules; and a task-driven tree-walker that traverses the semantic representation of a plan to extract information requested by the generator.</Paragraph> </Section> <Section position="4" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 3.4 Plan Generation </SectionTitle> <Paragraph position="0"> The text plan generator directly executes rules representing the output requirements of the new plans.</Paragraph> <Paragraph position="1"> Prior to executing these rules, however, the generator determines whether the original input plan is well-formed, valid, and consistent. Then, using the domain model, the finish code and material databases, the requirements rules, and the semantic tree-walker, the generator creates new plans.</Paragraph> <Paragraph position="2"> In other cases, the generator detects that a meta-constraint such as &quot;Only one operation should exist per plan text&quot; is violated. It flags the text as anomalous, indicating the constraint violation, but still tries to generate a reasonable output text. A post-generation process diverts constraint violations to a separate stream which eventually results in the creation of a special report. Texts which violate constraints are not changed and uploaded; instead, they are evaluated by a human domain expert, who adjudicates the suggested changes individually. 
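The meta-constraint check just described can be sketched minimally; the constraint text is from the paper, while the function name and the operation-detecting predicate are hypothetical:

```python
def check_single_operation(sentences, is_operation):
    """Flag a plan whose text contains more than one operation sentence.

    `sentences` is the parsed sentence list for one plan text;
    `is_operation` is a hypothetical predicate over a sentence's semantic
    representation. Returns (ok, violations) so that violating texts can
    be diverted to a report for a human domain expert instead of uploaded.
    """
    operations = [s for s in sentences if is_operation(s)]
    if len(operations) > 1:
        return False, ["Only one operation should exist per plan text"]
    return True, []

# Toy plan text with two operation sentences, so the constraint fires.
ok, msgs = check_single_operation(
    ["APPLY PRIMER", "REATTACH TAG", "APPLY SEALANT"],
    is_operation=lambda s: s.startswith("APPLY"))
```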
For example, in (6) the original plan consists of multiple run-on sentences with no punctuation. The NLP system determines that there are actually three sentences, two of which refer to the application of finishes. With this information, the generator determines that one of its meta-constraints has been violated, generates its best guess at an output text, and then annotates that text with the constraint violation message.</Paragraph> </Section> </Section> <Section position="6" start_page="75" end_page="75" type="metho"> <SectionTitle> TOUCH-UP FINISH REWORK AREA ONLY AS/IF REQUIRED PER ENG. DWG. PRIME PER F-18.01. REATTACH IDENTIFICATION TAG. 3.5 Interpretation and Other Applications </SectionTitle> <Paragraph position="0"> The mass-change application is fairly simple. More complicated NLP applications require ellipsis and pronominal resolution, and richer referential semantics. An NLI to a relational database, for example, requires an explicit recursive semantic composition process. This is why our deeper semantics in Prolog closely parallels that which a categorial analysis would furnish, i.e., using function application and composition over lambda forms, per treatments such as \[11, 12\], and using a semantic theory such as DRT \[7\]. Such an approach allows one to compose a semantics in a principled manner and to interpret with respect to the domain model. Nevertheless, to this point, in an interface to a project and program scheduling system, we have attempted only to render semantics for scope-underspecified quantifiers, negation, and numerical and temporal constraints. Tense and aspect (e.g., \[15\]), distinctions among plural readings of noun phrases, and a deeper lexical semantics have so far not been elaborated, but are planned. In \[3\], e.g., a lexical semantics based on \[6\] will be developed. 
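Composition by function application over lambda forms can be illustrated with a toy quantifier; this is a sketch of the general categorial style, not the paper's Prolog implementation, and the domain and predicate are invented:

```python
# A generalized-quantifier-style denotation for "every": it takes a noun
# denotation (a set of individuals) and returns a function over verb
# phrase denotations (predicates), composed by function application.
every = lambda noun: lambda vp: all(vp(x) for x in noun)

# Toy model: "every task is completed".
tasks = ["prime", "seal"]
completed = {"prime", "seal"}
sem = every(tasks)(lambda t: t in completed)
```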
Finally, a tool like \[4\]'s Prolog-to-SQL compiler can prove useful for mapping the final referential semantics to a specific database or domain model.</Paragraph> </Section> class="xml-element"></Paper>