<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1033"> <Title>Pattern Matching in a Linguistically-Motivated Text Understanding System</Title> <Section position="3" start_page="0" end_page="183" type="metho"> <SectionTitle> 2. A Shift in the Community </SectionTitle> <Paragraph position="0"> Text processing systems participating in the MUC evaluations (most recently, MUC-5) perform linguistic processing at various levels. Some systems attempt a deep level of understanding whenever possible \[5\], whereas others use more shallow &quot;skimming&quot; techniques \[6\], focusing only on information of interest and ignoring all other text. Similarly, systems span the spectrum in their use of finite-state pattern-matching instead of the more traditional, general, syntactic and semantic processing.</Paragraph> <Paragraph position="1"> There are several reasons for the recent shift to increased use of finite-state (FS) approximations. Work was published on deriving finite-state approximations from more general grammars \[7\]. Then in MUC-3 it became evident that, in certain data-extraction tasks, a system which ignored much of the input text but focused attention on the items of interest could perform as well as other systems which emphasized deeper understanding of all the text. Once the problem of data extraction was perceived to require the understanding of only small fractions of the input text, some systems evolved to do more shallow processing and the use of finite-state approximations increased.</Paragraph> <Paragraph position="2"> It should be noted that incorporating finite-state elements can result in advantages that are important for achieving operational data-extraction systems. The simplicity of the finite-state formalism makes FS rules more easily understandable (and thus, modifiable) by non-experts. 
Since parsing finite-state grammars can be done very efficiently, another advantage is fast processing, which is desirable in many real applications.</Paragraph> <Paragraph position="3"> In some systems, notably SRI's and GE's, there was a dramatic shift between MUC-3 and MUC-5 towards the use of finite-state pattern-matching in all the critical linguistic components, relying heavily on domain-specific patterns. Development of GE's MUC-5 system, SHOGUN \[8\], resulted in the complete replacement of their general syntactic parser by a complex FS grammar. This new grammar encodes domain-specific information which was formerly distributed in other components. SRI's new FASTUS \[9\] relies on a cascade of finite-state transducers; the first stages find simple linguistic structures, and the final and most important stage consists of multiple levels of domain-specific finite-state patterns. Information not matched is ignored.</Paragraph> <Paragraph position="5"> Although both of these systems (along with BBN's PLUM) were top performers in MUC-5, they now lack a large-coverage domain-independent syntactic and semantic model. Rather, they rely on intensive analysis of domain corpora in order to encode patterns in each new domain.</Paragraph> <Paragraph position="6"> 3. Role in PLUM's Architecture BBN's PLUM has a traditional and general-purpose processing core, where morphological analysis, syntactic parsing, semantic interpretation, and discourse analysis take place. Purely syntactic parse structures and general semantic interpretations are created during processing. When porting to a new domain, we can use our automatic procedures for learning lexical-semantic case-frame information from annotated data \[10\] to quickly obtain domain-specific understanding without using finite-state approximations. 
This then becomes the initial system on which more detailed development is based.</Paragraph> <Paragraph position="7"> During the development for TIPSTER, we added to the core PLUM system two new optional processing modules which do use finite-state patterns for the following specific purposes: 1. detect domain-specific simple constructions that can be identified on the basis of shallow lexical information, and 2. as a backup strategy, identify patterns that are likely to have been fragmented during regular processing. Figure 1 shows PLUM's architecture. Parallel possible paths are indicated where the optional pattern-matching modules appear. The two new modules, the Lexical Pattern Matcher and the Sentence-Level Pattern Matcher, use the same core finite-state pattern-matching processor, SPURT, which is described in the next section.</Paragraph> <Paragraph position="8"> The Lexical Pattern Matcher operates before parsing but after part-of-speech tagging to recognize constructions which can be detected based on component words, their parts-of-speech, and simple properties of their lexical semantics. This is used primarily for structures that could be part of the grammar, but can be more efficiently recognized by a finite-state machine. Examples are corporation/organization names, person names, and measurements.</Paragraph> <Paragraph position="9"> The Sentence-Level Pattern Matcher replaced our former fragment-combining component, which sought to attach contiguous fragments based on syntactic and semantic properties. The new pattern-matching component applies FS patterns to the fragments of the parsed and semantically interpreted input; the matched patterns' associated actions may modify or add new semantic information at the level of the sentence. That is, semantic relationships may be added between objects in potentially widely separated fragments of the sentence, thereby handling the example of the discontiguous constituent presented earlier.</Paragraph> <Paragraph position="10"> 4. 
SPURT: A Finite-State Pattern-Action</Paragraph> <Section position="1" start_page="183" end_page="183" type="sub_section"> <SectionTitle> Language </SectionTitle> <Paragraph position="0"> We defined our first version of the FS pattern-matcher and FS grammar syntax for a gisting application \[11\]. The problem there was to extract key information (e.g., plane-id, command) from the output of a speech recognizer whose input was (real) air-traffic controller and pilot interactions. This initial version of the pattern-matcher was also used to detect company names in the PLUM configuration used for the initial TIPSTER evaluations.</Paragraph> <Paragraph position="1"> Before MUC-5 we made the FS grammar syntax more powerful (though still finite-state) to give the rule-writer more flexibility. We also introduced optimizations to the parser and added an action component to the rules. The resulting utility is named SPURT. We first used SPURT for applying sentence-level patterns, and later replaced the simple company name recognizer by SPURT to perform general lexically-based pattern matching of various types of constructions.</Paragraph> <Paragraph position="2"> SPURT rules are finite-state patterns which can be used to search for complex patterns of information in a sentence and build semantic structures from that information. A SPURT rule has a :PATTERN component which is the expansion (the &quot;right-hand side&quot;) of a finite-state grammar rule. It optionally has an :UNDERSTANDING component which states actions to take if the pattern is matched. Examples of SPURT rules are included in subsequent sections.</Paragraph> <Paragraph position="3"> Rules are either top-level rules or sub-level (supporting-level) rules. 
Top-level rules indicate multiple entry points into the grammar defined by the patterns, and may invoke sub-level rules, as in a context-free grammar where the right-hand side of a non-terminal may be in terms of other non-terminals. Top-level patterns are iterated over for each sentence, and the actions corresponding to matched rules are executed.</Paragraph> <Paragraph position="4"> Rules are assigned a phase. Rules belonging to phase n+1 operate on the input after it is mutated by phase n. So far we have only seen the need for up to 2 phases in our rules.</Paragraph> <Paragraph position="5"> When the SPURT rules are read in at system load time, they are compiled into a network of nodes and arcs. Arcs coming out of a node indicate multiple possible next states. Nodes contain tests, so that if the test at the end-node of an arc is successful when applied to the input at the pointer, that arc is traversed. The parser simply matches an input against the network, performing a depth-first search, and selecting a path that matches the maximal amount of input. At each decision point, arcs are tried in an order which favors paths that consume a maximal amount of input in a meaningful way (e.g., the parser only follows &quot;don't-care&quot; arcs when other possibilities are exhausted). Once a successful parse of the whole input is found, the search is terminated. (In all-paths mode, the parser can be used to find arc probabilities based on training data; this was used in the gisting application, but has not yet been used in PLUM.) The resulting path is then traversed to execute the corresponding actions.</Paragraph> </Section> </Section> <Section position="4" start_page="183" end_page="184" type="metho"> <SectionTitle> 4.1. Lexically-based SPURT </SectionTitle> <Paragraph position="0"> The Lexical Pattern Matcher applies SPURT patterns after morphological analysis but prior to parsing. The input consists of word tokens with part-of-speech information.</Paragraph> <Paragraph position="1"> A pattern can test on a token's word component, its part-of-speech, its semantic type, or a top-level predicate in its lexical semantics. When a pattern is matched, the action component identifies substrings of the matched sequence to add to the temporary lexicon. These temporary definitions are active for the duration of the message.</Paragraph> <Paragraph position="2"> For example, a pattern for recognizing company names could match a sequence such as (&quot;Bridgestone&quot; NP) (&quot;Sports&quot; NPS) (&quot;Co.&quot; NP), where NP is the tag for proper nouns, and NPS for plural proper nouns; the pattern's action results in this sequence being replaced by the single token (&quot;Bridgestone Sports Co.&quot; NP), which is, as a side effect, defined as a lexical entry having semantic type CORPORATION.</Paragraph> <Paragraph position="3"> Figure 2 shows the rules used to match the example above. The first sub-rule, NP-PLUS, finds sequences of tokens that have been tagged as proper nouns. The XXX-CO rule finds sequences of the type {'the'} \[proper-noun\]+ {\[proper-noun-plural\]} \[corpdesignator\]. The :TERM-PRED operator appearing in this rule allows for other simple tests on the tokens. In this case, the corpdesignator? test tries to match one of a list of possible company designators, e.g., &quot;Corp.&quot;. The CO-INSTANCE rule determines the existence of a company name if one of many company patterns matches. If there is a match, the pattern assigns the tag tag-string to the sequence, and the action component creates a lexical entry for it. The lexical entry is assigned type CORPORATION and assigned the predicate NAME-OF relating the entry to a string created out of the words in the matched sequence. 
Finally, the top-level rule CO finds multiple instances of companies in the input.</Paragraph> <Section position="1" start_page="184" end_page="184" type="sub_section"> <SectionTitle> 4.2. Sentence-Level SPURT </SectionTitle> <Paragraph position="0"> The input to Sentence-Level SPURT is a sentence object which has already been processed through the fragment semantic interpreter.</Paragraph> <Paragraph position="1"> Its fragments' parse nodes have already been annotated with a semantic interpretation. SPURT's parser actually operates on the leaf elements (the nodes corresponding to the terminals, or words) of the fragment parses. The &quot;pointer&quot; can move along the input either at the word level, or at the level of higher structures, achieved by matching nodes that are ancestors of the leaf nodes. Thus patterns can test on words or phrases. When a word is matched, the parse pointer moves to the next word's leaf node; if a phrase is matched, the pointer moves to the next possible word not spanned by the matched phrase's tree.</Paragraph> <Paragraph position="2"> A pattern can test both syntactic and semantic information associated with the parse nodes. When a pattern is matched, the action component specifies new semantic information to be added to particular parse nodes (and thus to the fragment in which each node is contained). The new information is allowed to include predicates connecting semantic structures across different fragments; this is something the fragment semantic interpreter is unable to do, as it is a compositional operation on the individual, independent parse fragments.</Paragraph> <Paragraph position="3"> Below is an example of a sentence-level rule which will match the example given in the introduction. 
This pattern matches sequences of the type \[anyword\]* \[joint-word\] \[anyword\]* \[activity-corporation-or-venture-np\] \[anyword\]*, where \[joint-word\] (or *JOINT-WORDS* as specified below) is one of a list of words such as &quot;jointly&quot; and &quot;together&quot;. The operator :AND-ENV introduces tests on phrases in a parse tree: :CAT indicates the phrase category; because some phrase types are recursive, :LOW (or other values) is used to indicate which level of the recursive structure is the one to be looked at; and :CONCEPT indicates the semantic type that is desired of that phrase. The simple action component of this rule adds the semantic type JOINT-VENTURE to the parse node where the joint-word occurred. In effect, this indicates that there is a joint venture in the sentence. Note that this pattern makes no decisions regarding the role, if any, that \[activity-corporation-or-venture-np\] plays in the joint venture.</Paragraph> </Section> </Section> <Section position="5" start_page="184" end_page="185" type="metho"> <SectionTitle> 5. Experiments </SectionTitle> <Paragraph position="0"> In order to measure the impact of the new FS components, we ran our MUC-5 English configurations (for English joint ventures, EJV, and English microelectronics, EME) on two test sets. The first is test data used for the TIPSTER 18-month evaluation; the second is data that was released for training, but was kept blind. For each pair of domain and test set, we ran experiments in each of 4 modes. The default configurations in the two domains share the same processing modules, the same general domain-independent grammar and semantic rules, and the same company-name recognizer. Each configuration contains its own set of domain-specific lexical-semantic definitions. A lexical-semantic definition contains the word's semantic type and (optionally) case-frames identifying semantic tests on possible arguments to the word. 
The semantic interpreter uses these rules in compositionally assigning semantics to parse-trees. For EJV, the initial version of the lexical semantics was automatically generated from training data \[10\]; it was then modified manually as needed.</Paragraph> <Paragraph position="1"> Although we tested both domains, we consider the test on EME to be more representative of the effects of the new modules. Most of the EJV development preceded the existence of the modules. In fact, for EJV we added no new rules to the lexical FS component.</Paragraph> <Paragraph position="2"> EME development, however, was able to take advantage of the new utilities almost from the start. It made heavier use of the front-end rules for some of the tricky technical constructions in that domain; it should be noted that even then, the impact of the lexical FS was minimal in that domain.</Paragraph> <Paragraph position="3"> Table 1 shows the difference in ERR for the various modes. ERR was the primary error measure used in MUC-5; to show improved performance, the goal is to minimize this measure. The new FS components, as evidenced by the Base results, improved ERR by It should be noted that the Japanese domains, JJV and JME, made heavy use of the sentence-level patterns. FS patterns for JJV gave us a quick gain in performance, but the price paid was having little carryover to the JME domain once that development began. We did not test those domains without the FS components. Based on our experience, if multiple Japanese domains are expected, we would undoubtedly build a robust domain-independent core of semantic rules, which in the long run maximizes re-usability and minimizes effort for each new domain. We utilized FS patterns because our Japanese expert wanted to explore the capabilities, and limits, of pattern-matching.</Paragraph> </Section> </Paper>