<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1029"> <Title>THE ROLE OF SYNTAX IN INFORMATION EXTRACTION</Title> <Section position="3" start_page="139" end_page="139" type="metho"> <SectionTitle> THE SYSTEM </SectionTitle> <Paragraph position="0"> The goal of information extraction is to analyze a text (an article or a message) and to fill a template with information about a specified type of event. In the case of MUC-6, the task (the &quot;scenario&quot;) was to identify instances of executives being hired or fired from corporations. Most of the stages of processing are performed one sentence at a time. First, each word in a sentence is looked up in a large English dictionary, Comlex Syntax, which provides syntactic information about each word. The system then performs several stages of pattern matching. The first stages deal primarily with name recognition -- people's names, organization names, geographic names, and names of executive positions (&quot;Executive Vice President for Recall and Precision&quot;). The next stages deal with noun and verb groups. Basically, a noun group consists of a noun and its left modifiers: &quot;the first five paragraphs&quot;, &quot;the yellow brick road&quot;; such groupings can generally be identified from syntactic information alone. A verb group consists of a verb and its related auxiliaries: &quot;sleeps&quot;, &quot;is sleeping&quot;, &quot;has been sleeping&quot;, etc. All of these stages are basically scenario-independent (except for the recognition of executive positions, which was added for this scenario).</Paragraph> <Paragraph position="1"> Next come the scenario-specific patterns. These include, in particular, patterns to recognize the scenario events, such as &quot;Smith became president of General Motors&quot;, &quot;Smith retired as president of General Motors&quot;, and &quot;Smith succeeded Jones as president of General Motors&quot;.
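The relationship between such scenario patterns and the event structures they produce can be sketched roughly as follows. This is an illustrative Python sketch only: the regexes, event-type names, and field names are invented, and the actual system matched over analyzed constituents (names, noun groups, verb groups), not raw strings.

```python
import re

# Illustrative scenario patterns: each pairs a surface pattern with an
# invented event type. The real system's patterns operated over the
# constituent structures built by the earlier stages, not over text.
SCENARIO_PATTERNS = [
    (re.compile(r"(?P<person>\w+) became (?P<post>[\w ]+) of (?P<org>[\w ]+)"),
     "start-job"),
    (re.compile(r"(?P<person>\w+) retired as (?P<post>[\w ]+) of (?P<org>[\w ]+)"),
     "leave-job"),
]

def extract_events(sentence):
    """Return an event structure for each scenario pattern match."""
    events = []
    for pattern, event_type in SCENARIO_PATTERNS:
        for m in pattern.finditer(sentence):
            events.append({"type": event_type,
                           "person": m.group("person"),
                           "position": m.group("post"),
                           "organization": m.group("org")})
    return events
```

Each match yields one event structure recording the event type and its participants, in the spirit of the processing described above.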
When such a pattern is matched, a corresponding event structure is generated, recording the type of event (for this scenario, hiring or firing) and the people and companies involved.</Paragraph> <Paragraph position="2"> The next stage of processing is reference resolution.</Paragraph> <Paragraph position="3"> At this stage, pronouns and definite noun phrases which refer back to previously mentioned people or organizations are linked to these antecedents.</Paragraph> <Paragraph position="4"> When all the sentences of an article have been analyzed, a final stage of processing assembles the information and generates a template in the format required for MUC.</Paragraph> <Paragraph position="5"> The resulting system did quite well. With a limited development time (four weeks for this MUC) we were able to develop a system which obtained a recall of 47% and a precision of 70% (with a combined F measure of 56.4) on the test corpus. This was the best F score on the scenario template task, although several other systems (mostly with similar architectures) got scores that were not significantly lower.</Paragraph> </Section> <Section position="4" start_page="139" end_page="140" type="metho"> <SectionTitle> THE ROLE OF SYNTAX </SectionTitle> <Paragraph position="0"> Although our system, and systems like it, are characterized as &quot;pattern matching&quot; systems, they really are doing a form of parsing: they analyze the sentence into a nested constituent structure. 
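For instance, the partial analysis of a short sentence might be represented as a shallow nested structure, with noun and verb groups as leaves and a clause node above them. The node labels below are assumptions for illustration, not the system's actual representation.

```python
# An illustrative nested constituent structure for "Fred runs IBM",
# as a pattern-matching analyzer might build it: noun groups and a
# verb group as leaves, grouped under a single clause node.
parse = ("clause",
         [("np", ["Fred"]),
          ("vg", ["runs"]),
          ("np", ["IBM"])])
```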
They differ from more conventional parsing systems (such as our earlier system) in:
* not seeking a full-sentence analysis: they only build as much structure as is needed for the information extraction task, including selected clauses relevant to the scenario
* parsing conservatively and deterministically: they only build structures which have a high chance of being correct, either because of syntactic clues (for noun groups) or semantic clues (for clause structures); as a result, they are much faster than traditional parsers
* using semantic patterns for the final stage(s) of analysis
Overall, we profited from the use of the pattern-matching approach; our analyzer was considerably faster, and we avoided some of the parsing errors which result from trying to obtain complete sentence analyses with a syntactic grammar. On the other hand, we also experienced first-hand some of the shortcomings of the semantic pattern approach. Syntax analysis provides two main benefits: it provides generalizations of linguistic structure across different semantic relations (for example, that the structure of a main clause is basically the same whether the verb is &quot;to succeed&quot; or &quot;to fire&quot;), and it captures paraphrastic relations between different syntactic structures (for example, between &quot;X succeeds Y&quot;, &quot;Y was succeeded by X&quot;, and &quot;X, who succeeded Y&quot;). These benefits are lost when we encode individual semantic structures.</Paragraph> <Paragraph position="1"> In particular, in our system, we had to separately encode the active, passive, relative, reduced relative, etc. patterns for each semantic structure. These issues are hardly new; they have been well known at least since the syntactic grammar vs. semantic grammar controversies of the 1970s.</Paragraph> <Paragraph position="2"> How, then, to gain the benefits of clause-level syntax within the context of a partial parsing system?
The approach we have adopted has been to introduce clause-level patterns which are expanded by metarules. (These metarules have some kinship to the metarules of GPSG, which expand a small set of productions into a larger set involving the different clause-level structures.) As a simple example, consider a clause-level pattern which specifies a clause with a subject of class C-person, a verb of class C-run (which includes &quot;run&quot; and &quot;head&quot;), and an object of class C-company. The pattern also specifies that the attributes of these three constituents are to be bound to the variables person-at, verb-at, and company-at, and that the procedure when-run is to be invoked when the pattern is matched. This pattern is expanded into patterns for the active clause (&quot;Fred runs IBM&quot;), the passive clause (&quot;IBM is run by Fred&quot;), relative clauses (&quot;Fred, who runs IBM, ...&quot; and &quot;IBM, which is headed by Fred, ...&quot;), reduced relative clauses (&quot;IBM, headed by Fred, ...&quot;), and conjoined verb phrases (&quot;... and runs IBM&quot;, &quot;... and is run by Fred&quot;). The expanded patterns also include pattern elements for sentence modifiers, so that they can analyze sentences such as &quot;Fred, who last year ran IBM, ...&quot;.</Paragraph> <Paragraph position="3"> Using defclausepattern reduced the number of patterns required and, at the same time, slightly improved coverage, because -- when we had been expanding patterns by hand -- we had not included all expansions in all cases.</Paragraph> <Paragraph position="4"> The defclausepattern procedure performs a rudimentary syntactic analysis of the input. In our example, it determines that np-sem(C-person) is the subject, vg(C-run) is the verb, and np-sem(C-company) is the object. This is a prerequisite for generating the various restructurings, such as the passive.
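A rough Python sketch of this kind of metarule expansion follows. The template notation and variant names are invented for illustration; the actual defclausepattern operated over the system's own pattern language, not over strings.

```python
# Illustrative metarule expansion: one declarative clause pattern
# (subject class, active verb form, passive verb form, object class)
# is expanded into its main syntactic variants. The notation here is
# an assumption made for this sketch.
def expand_clause_pattern(subj, verb_active, verb_passive, obj):
    return {
        "active":            f"{subj} {verb_active} {obj}",
        "passive":           f"{obj} is {verb_passive} by {subj}",
        "relative":          f"{subj} , who {verb_active} {obj}",
        "passive-relative":  f"{obj} , which is {verb_passive} by {subj}",
        "reduced-relative":  f"{obj} , {verb_passive} by {subj}",
        "conjoined-vp":      f"and {verb_active} {obj}",
        "conjoined-passive": f"and is {verb_passive} by {subj}",
    }

variants = expand_clause_pattern(
    "np-sem(C-person)", "vg(C-run)", "vg-pastpart(C-run)", "np-sem(C-company)")
```

Writing the single declarative pattern once and generating all seven variants mechanically is what eliminated the hand-expansion gaps mentioned above.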
We intend in the near future to expand defclausepattern to handle (parse) a richer set of patterns, including both sentence modifiers and a wider range of complements. In this way the power of clause-level syntax is provided to the pattern writer, without requiring the pattern writer to keep these details explicitly in mind. The use of clause-level syntax to generate syntactic variants of a semantic pattern is even more important if we look ahead to the time when such patterns will be entered by users rather than computational linguists.</Paragraph> <Paragraph position="7"> We can expect a computational linguist to consider all syntactic variants, although it may be a small burden; we cannot expect the same of a typical user.</Paragraph> <Paragraph position="8"> We expect that users would enter patterns by example, and would answer queries to create variants of the initial pattern. We have just begun to create such an interface, which allows a user to begin the process of pattern creation by entering an example and the corresponding event structure to be generated.</Paragraph> <Paragraph position="9"> The example is analyzed using the low-level patterns (such as the name and noun group patterns) and then translated into a clause-level pattern. The user can then manipulate the pattern, generalizing pattern elements and dropping some pattern elements. Using defclausepattern, the resulting pattern is then analyzed and its clause-level syntactic variants are generated. Though our initial tests are promising, a great deal of work will still be required on this interface to provide the full flexibility needed for creating a wide range of patterns.</Paragraph> <Paragraph position="10"> Our work has indicated the ways in which we can continue to obtain the benefits of syntax analysis along with the performance benefits of the pattern matching approach.
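The example-to-pattern translation described above might be sketched as follows. The entity lists, class names, and token-level generalization are illustrative simplifications; the actual interface first applies the system's name and noun-group patterns to the example.

```python
# Illustrative pattern acquisition by example: tokens recognized as
# person names, company names, or scenario verbs are generalized to
# class-labeled pattern elements. The entity lists and class names
# below are assumptions for this sketch.
PEOPLE = {"Fred", "Smith"}
COMPANIES = {"IBM"}
RUN_VERBS = {"runs", "heads"}

def generalize_example(example):
    """Translate an example sentence into a clause-level pattern."""
    pattern = []
    for tok in example.split():
        if tok in PEOPLE:
            pattern.append("np-sem(C-person)")
        elif tok in COMPANIES:
            pattern.append("np-sem(C-company)")
        elif tok in RUN_VERBS:
            pattern.append("vg(C-run)")
        else:
            # kept literally; the user may later drop or generalize it
            pattern.append(tok)
    return " ".join(pattern)
```

The resulting clause-level pattern could then be handed to the metarule expansion step to produce its syntactic variants.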
While we no longer have a monolithic grammar, we are still able to take advantage of the syntactic regularities of both noun phrases and clauses. Noun group syntax remains explicit, as one phase of pattern matching. Clause syntax is now utilized in the metarules for defining patterns and in the rules which analyze example sentences to produce patterns.</Paragraph> </Section> </Paper>