<?xml version="1.0" standalone="yes"?> <Paper uid="W93-0306"> <Title>NPtool, a detector of English noun phrases*</Title> <Section position="5" start_page="49" end_page="53" type="metho"> <SectionTitle> 3 Previous work </SectionTitle> <Paragraph position="0"> This section consists of two subsections. First, a performance-oriented survey of some related systems is presented. Then follows a more detailed presentation of ENGCG, a predecessor of the NPtool parser in an information retrieval system.</Paragraph> <Section position="1" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 3.1 Related systems </SectionTitle> <Paragraph position="0"> So far, I have found relatively little documentation on systems whose success in recognising or parsing noun phrases has been reported. I am aware of three systems with some relevant evaluations.</Paragraph> <Paragraph position="1"> Church's Parts of speech [Church, 1988] performs not only part-of-speech analysis but also identifies the simplest kinds of noun phrases - mostly sequences of determiners, premodifiers and nominal heads - by inserting brackets around them, e.g.</Paragraph> <Paragraph position="2"> 5 Consider for instance the attachment of prepositional phrases in general and of of-phrases in particular.</Paragraph> <Paragraph position="4"> The appendix in [Church, 1988] lists the analysis of a small text. The performance of the system on the text is quite interesting: of 243 noun phrase brackets, only five are omitted. The performance of Parts of speech on the text was also very good in part-of-speech analysis: 99.5% of all words got the appropriate tag. The mechanism for noun phrase identification relies on the part-of-speech analysis; since the part-of-speech tagger was more successful on this text than on average, the average performance of the system in noun phrase identification may not be quite as good as the figures in the appendix of the paper suggest.</Paragraph> <Paragraph position="5"> Bourigault's LECTER [Bourigault, 1992] is a surface-syntactic analyser that extracts 'maximal-length noun phrases' - mainly sequences of determiners, premodifiers, nominal heads, and certain kinds of postmodifying prepositional phrases and adjectives - from French texts for terminology applications. The system is reported to recognise 95% of all maximal-length noun phrases (43,500 out of 46,000 noun phrases in the test corpus), but no figures are given on how much 'garbage' the system suggests as noun phrases. It is indicated, however, that manual validation is necessary.</Paragraph> <Paragraph position="6"> Rausch, Norrback and Svensson [1992] have designed a noun phrase extractor that takes as its input part-of-speech analysed Swedish text and inserts brackets around noun phrases. In the recognition of 'Nuclear Noun Phrases' - sequences of determiners, premodifiers and nominal heads - the system was able to identify 85.9% of all nuclear noun phrases in a text collection some 6,000 words long in all, whereas some 15.7% of all the suggested noun phrases were false hits, i.e. the precision 6 of the system was 84.3%. The performance of a real application would probably be lower, because potential misanalyses due to previous stages of analysis (morphological analysis and part-of-speech disambiguation, for instance) are not accounted for by these figures.</Paragraph>
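<Paragraph> To make these figures concrete, the two measures can be written out. This is the standard formulation, consistent with the numbers quoted above (the terms themselves are defined in Section 6), with TP the number of correctly identified noun phrases, FN the number of missed noun phrases, and FP the number of false hits:

    recall = TP / (TP + FN)
    precision = TP / (TP + FP)

On Rausch, Norrback and Svensson's figures, the 15.7% rate of false hits among the suggested noun phrases corresponds directly to the stated precision of 100% - 15.7% = 84.3%.</Paragraph>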
</Section> <Section position="2" start_page="49" end_page="50" type="sub_section"> <SectionTitle> 3.2 ENGCG and the SIMPR project </SectionTitle> <Paragraph position="0"> SIMPR, Structured Information Management: Processing and Retrieval, was a 64-person-year ESPRIT II project (No. 2083, 1989-1992) whose objective was to develop new methods for the management and retrieval of large amounts of electronic text. A central function of such a system is to recognise those words in the stored texts that represent their content in a concise fashion - in short, index terms.</Paragraph> <Paragraph position="1"> Term indices created with traditional methods 7 are based on isolated, perhaps truncated, words.</Paragraph> <Paragraph position="2"> 6 For definitions of the terms recall and precision, see Section 6.</Paragraph> <Paragraph position="3"> These largely string-based statistical methods are somewhat unsatisfactory because many content identifiers consist of word sequences - compounds, head-modifier constructions, even simple verb - noun phrase sequences. One of the SIMPR objectives was therefore to employ more complex constructions, the recognition of which requires a shallow grammatical analysis. The Research Unit for Computational Linguistics at the University of Helsinki participated in this project, and ENGTWOL, a TWOL-style morphological analyser, as well as ENGCG, a Constraint Grammar of English, were written in 1989-1992 by Voutilainen, Heikkilä and Anttila [forthcoming]. The resultant SIMPR system is an improvement over previous systems [Smart (Ed.), forthcoming]: not only is it reasonably accurate, but it also operates on more complex constructions, e.g. postmodifying constructions and simple verb-object constructions.</Paragraph> <Paragraph position="5"> There were also some persistent problems. The original plan was to use the output of the whole ENGCG parser for the indexing module. However, the last of the three sequential modules in the ENGCG grammar, namely Constraint Syntax proper, was not used in the more mature versions of the indexing module - only lexical analysis and morphological disambiguation were applied. The omission of the syntactic analysis was mainly due to the somewhat high error rate (3-4% of all words lost the proper syntactic tag) and the high rate of remaining ambiguities (15-25% of all words remained syntactically ambiguous).</Paragraph> <Paragraph position="6"> Here, we will not go into a detailed analysis of the problems 8; suffice it to say that the syntactic grammar scheme was unnecessarily ambitious for the relatively simple needs of the indexing application. One of the improvements in NPtool is a syntactic grammar scheme better suited to those needs, as will be seen in Section 5.1.

4 NPtool in outline

In this section, the architecture of NPtool is presented in outline. Here is a flow chart of the system: [flow chart not reproduced; the only surviving label is 'Intersection of noun phrase sets']

8 See e.g. [Voutilainen, Heikkilä and Anttila, 1992] for details.</Paragraph> <Paragraph position="7"> In the rest of this section, we will observe the analysis of the following sample sentence, taken from a car maintenance manual: The inlet and exhaust manifolds are mounted on opposite sides of the cylinder head, the exhaust manifold channelling the gases to a single exhaust pipe and silencer system.</Paragraph> </Section> <Section position="3" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 4.1 Preprocessing and morphological analysis </SectionTitle> <Paragraph position="0"> The input ASCII text, preferably SGML-annotated, is subjected to a preprocessor that e.g. determines sentence boundaries, recognises fixed syntagms (e.g. multiword prepositions and compounds), normalises certain typographical conventions, and verticalises the text.</Paragraph> <Paragraph position="1"> This preprocessed text is then submitted to morphological analysis. ENGTWOL, a morphological analyser of English, is a Koskenniemi-style morphological description that recognises all inflections and central derivative forms of English. The present lexicon contains some 56,000 word stems, and altogether the analyser recognises several hundred thousand different word-forms. The analyser also employs a detailed parsing-oriented morphosyntactic description; the feature system is largely derived from [Quirk, Greenbaum, Leech and Svartvik, 1985].</Paragraph> <Paragraph position="2"> Here is a small sample:</Paragraph>
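<Paragraph> (The sample itself did not survive extraction; the following is a hypothetical illustration of the output format described below, with illustrative tag names rather than the paper's actual example.)

    "<the>"
            "the" DET CENTRAL ART SG/PL
    "<sides>"
            "side" N NOM PL
            "side" V PRES SG3 VFIN</Paragraph>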
<Paragraph position="4"> All running-text word-forms are given on the left-hand margin, while all analyses are on indented lines of their own. Multiple analysis lines for a word-form indicate morphological ambiguity.</Paragraph> <Paragraph position="5"> For words not represented in the ENGTWOL lexicon, there is a 99.5% reliable utility that assigns ENGTWOL-style descriptions. These predictions are based on the form of the word, but some heuristics are also involved.</Paragraph> </Section> <Section position="4" start_page="50" end_page="52" type="sub_section"> <SectionTitle> 4.2 Constraint Grammar parsing </SectionTitle> <Paragraph position="0"> The next main stage in NPtool analysis is Constraint Grammar parsing. Parsing consists of two main phases: morphological disambiguation and Constraint syntax.</Paragraph> <Paragraph position="1"> * Morphological disambiguation. The task of the morphological disambiguator is to discard all contextually illegitimate morphological readings in ambiguous cohorts. Consider, for instance, a case where an unambiguous determiner is directly followed by the three-ways ambiguous word single, two of the analyses being verb readings and one an adjective reading. A determiner is never followed by a verb 10; one of the 1,100-odd constraints in the disambiguation grammar [Voutilainen, forthcoming a] expresses this fact about English grammar, so the verb readings of single are discarded here.</Paragraph> <Paragraph position="2"> The morphological disambiguator seldom discards an appropriate morphological reading: after morphological disambiguation, 99.7-100% of all words retain the appropriate analysis. On the other hand, some 3-6% of all words remain ambiguous, e.g. head in this sentence. There is also an additional set of some 200 constraints; after the application of both constraint sets, 97-98% of all words become fully disambiguated, with an overall error rate of up to 0.4% [Voutilainen, forthcoming b]. The present disambiguator compares quite favourably with other known, typically probabilistic, disambiguators, whose error rate can be as high as 5%, i.e. some 17 times that of the ENGCG disambiguator.</Paragraph>
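<Paragraph> As a minimal sketch of how such a constraint might operate - not the actual Constraint Grammar formalism or its rule syntax; the cohort representation and tag names here are assumptions - consider:

    # Sketch of one disambiguation constraint: "a determiner is never
    # followed by a verb". A cohort is a list of readings; a reading is
    # (base_form, tag_set). Tag names are illustrative, not ENGCG's own.

    def discard_verb_after_determiner(cohorts):
        for prev, cur in zip(cohorts, cohorts[1:]):
            prev_unambiguous_det = len(prev) == 1 and "DET" in prev[0][1]
            if prev_unambiguous_det:
                # Remove verb readings, but never the last remaining reading.
                non_verbs = [r for r in cur if "V" not in r[1]]
                if non_verbs:
                    cur[:] = non_verbs
        return cohorts

    # "the single ...": after the unambiguous determiner 'the', the two
    # verb readings of 'single' are discarded; the adjective reading survives.
    sentence = [
        [("the", {"DET"})],
        [("single", {"V", "PRES"}), ("single", {"V", "IMP"}), ("single", {"A"})],
    ]
    print(discard_verb_after_determiner(sentence))</Paragraph>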
<Paragraph position="3"> * Constraint syntax. After morphological disambiguation, the syntactic constraints are applied. In the NPtool syntactic description, all syntactic ambiguities are introduced directly in the lexicon, so no extra lookup module is needed. Like disambiguation constraints, syntactic constraints seek to discard all contextually illegitimate syntactic function tags. Here is the syntactic analysis of our sample sentence, as produced by the current parser. To save space, most of the morphological codes are omitted.</Paragraph> <Paragraph position="6"> All syntactic-function tags are flanked with '@'. For instance, the tag '@>N' indicates that the word is a determiner or a premodifier of a nominal in the right-hand context (e.g. the). The second word, inlet, remains syntactically ambiguous due to a premodifier reading and a nominal head (@NH) reading - note that the ambiguity is structurally genuine, a coordination ambiguity. The tag @V is reserved for verbs and auxiliaries, cf. are as well as mounted. The syntactic description will be outlined below.</Paragraph> <Paragraph position="7"> Pasi Tapanainen (Research Unit for Computational Linguistics, University of Helsinki) has recently made a new implementation of the Constraint Grammar parser that performs morphological disambiguation and syntactic analysis at a speed of more than 1,000 words per second on a Sun SparcStation 10, Model 30.</Paragraph> </Section> <Section position="5" start_page="52" end_page="52" type="sub_section"> <SectionTitle> 4.3 Treatment of remaining ambiguities </SectionTitle> <Paragraph position="0"> The Constraint Grammar parser recognises only word-level ambiguities; therefore some of the traversals through an ambiguous sentence representation may be blatantly ill-formed.</Paragraph> <Paragraph position="1"> NPtool eliminates locally unacceptable analyses by using a finite-state parser [Tapanainen, 1991] as a kind of 'post-processing module' that distinguishes between competing sentence readings (for other work in this approach, see also [Koskenniemi, 1990; Koskenniemi, Tapanainen and Voutilainen, 1992; Voutilainen and Tapanainen, 1993]). The parser employs a small finite-state grammar that I have written. The speed of the finite-state parser is comparable to that of the Constraint Grammar parser.</Paragraph> <Paragraph position="2"> The finite-state parser produces all sentence readings that are in agreement with the grammar. Consider the following two adapted readings from the beginning of our sample sentence:</Paragraph> <Paragraph position="4"> The only difference is in the analysis of cylinder head: the first analysis reports cylinder as a noun phrase head followed by the verb head, while the second analysis considers cylinder head as a noun phrase. The last remaining problem is how to deal with ambiguous analyses like these: should cylinder be reported as a noun phrase, or is cylinder head the unit to be extracted? The present system provides all proposed noun phrase candidates in the output, but each with an indication of whether or not the candidate noun phrase is unambiguously analysed as such. In this solution, I do not use all of the multiple analyses proposed by the finite-state parser. For each sentence, no more than two competing analyses are selected for further processing: the one with the highest number of words belonging to noun phrases (a maximally long noun phrase analysis), and the one with the lowest (a maximally short noun phrase analysis).</Paragraph> <Paragraph position="5"> This 'weighing' can be done during finite-state parsing: the formalism employs a mechanism for imposing penalties on regular expressions, e.g. on tags. A penalised reading is not discarded as ungrammatical; rather, the parser returns all accepted analyses in an order where the least penalised analyses are produced first and the 'worst' ones last.</Paragraph> <Paragraph position="6"> Thus there is an 'NP-hostile' finite-state parser that penalises noun phrase readings; this would prefer the sentence reading with cylinder/@NH head/@V. The 'NP-friendly' parser, on the other hand, penalises all readings which are not part of a noun phrase reading, so it would prefer the analysis with cylinder/@>N head/@NH. Of all analyses, the two selected parses are maximally dissimilar with regard to NP-hood. The motivation for selecting maximally conflicting analyses in this respect is that a candidate noun phrase on which the two finite-state parsers agree just as it is - neither longer nor shorter - is likely to be an unambiguously identified noun phrase. The comparison of the outputs of the two competing finite-state parsers is carried out during the extraction phase.</Paragraph>
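<Paragraph> A minimal sketch of this selection step, under an assumed representation in which a sentence analysis is a list of word/tag pairs and the real penalty mechanism is reduced to counting words inside noun phrases:

    # The two maximally dissimilar analyses with regard to NP-hood.
    # Tags beginning @>N, @NH or @N< mark words inside a noun phrase reading.

    NP_TAGS = ("@>N", "@NH", "@N<")

    def np_word_count(analysis):
        # Number of words that belong to a noun phrase in this analysis.
        return sum(tag in NP_TAGS for _, tag in analysis)

    def select_extremes(analyses):
        # Stand-ins for the two penalised parsers: 'NP-friendly' keeps the
        # analysis with most words in NPs, 'NP-hostile' the one with fewest.
        friendly = max(analyses, key=np_word_count)
        hostile = min(analyses, key=np_word_count)
        return friendly, hostile

    # The 'cylinder head' ambiguity from the text:
    analyses = [
        [("cylinder", "@NH"), ("head", "@V")],   # preferred by the NP-hostile parser
        [("cylinder", "@>N"), ("head", "@NH")],  # preferred by the NP-friendly parser
    ]
    print(select_extremes(analyses))</Paragraph>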
</Section> <Section position="6" start_page="52" end_page="53" type="sub_section"> <SectionTitle> 4.4 Extraction of noun phrases </SectionTitle> <Paragraph position="0"> An unambiguous sentence reading is a linear sequence of symbols, and extracting noun phrases from this kind of data is a simple pattern-matching task.</Paragraph> <Paragraph position="1"> In the present version of the system, I have used the gawk program, which allows the use of regular expressions. With gawk's gsub function, the boundaries of the longest non-overlapping expressions that satisfy the search key can be marked. If we formulate our search query as something like the following - where '+' stands for one or more occurrences of its argument, '*' stands for zero or more occurrences of its argument, and further symbols stand for premodifiers, for determiners and premodifiers, for nominal heads except pronouns, and for prepositions starting a postmodifying prepositional phrase [the query expression itself is not preserved] - and do some additional formatting and 'cleaning', the above two finite-state analyses will look like the following 13:

the
    np: inlet and exhaust manifold
are mounted on
    np: opposite side of the cylinder head
, the
    np: exhaust manifold
channelling the
    np: gas
to a
    np: single exhaust pipe
and
    np: silencer system

the
    np: inlet and exhaust manifold
are mounted on
    np: opposite side of the cylinder
head , the
    np: exhaust manifold
channelling the
    np: gas
to a
    np: single exhaust pipe
and
    np: silencer system</Paragraph> <Paragraph position="2"> 13 Note that the noun phrase heads are here given in the base form, hence the absence of the plural form of e.g. 'manifold'.</Paragraph>
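<Paragraph> Since the original query expression was not preserved, here is a rough Python illustration of the same gsub-style marking. The tag names (DN, NH, P<) and the token encoding are assumptions for the sketch, not NPtool's actual notation:

    import re

    # A noun phrase is taken to be: optional determiners/premodifiers,
    # one or more nominal heads, optionally followed by a postmodifying
    # prepositional phrase. Tokens are space-terminated word/TAG pairs:
    # DN = determiner or premodifier, NH = nominal head (not a pronoun),
    # P< = preposition starting a postmodifying prepositional phrase.
    NP = re.compile(
        r"(?:\S+/DN )*(?:\S+/NH )+(?:\S+/P< (?:\S+/DN )*(?:\S+/NH )+)*"
    )

    def mark_nps(tagged):
        # Like gawk's gsub: mark the longest non-overlapping matches.
        return NP.sub(lambda m: "[np: " + m.group(0).rstrip() + "] ", tagged)

    line = "the/DN inlet/NH and/CC exhaust/DN manifold/NH are/V mounted/V "
    print(mark_nps(line))
    # -> [np: the/DN inlet/NH] and/CC [np: exhaust/DN manifold/NH] are/V mounted/V</Paragraph>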
<Paragraph position="3"> The proposed noun phrases are given on indented lines, each marked with the symbol 'np:'. The candidate noun phrases are then subjected to further routines: all candidate noun phrases with at least one occurrence in the output of both the NP-hostile and the NP-friendly parser are labelled with the symbol 'ok:', and the remaining candidates are labelled as uncertain, with the symbol '?:'. From the outputs given above, the following list can be produced:

ok: inlet and exhaust manifold
ok: exhaust manifold
ok: gas
ok: single exhaust pipe
ok: silencer system
?: opposite side of the cylinder
?: opposite side of the cylinder head</Paragraph>
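<Paragraph> A minimal sketch of this labelling step, under an assumed list-of-strings representation (not the actual extraction code):

    # 'ok:' if the candidate occurs in the output of both the NP-hostile
    # and the NP-friendly parser, '?:' otherwise.

    def label_candidates(friendly_nps, hostile_nps):
        labelled = []
        for np in dict.fromkeys(friendly_nps + hostile_nps):  # dedupe, keep order
            tag = "ok:" if np in friendly_nps and np in hostile_nps else "?:"
            labelled.append(f"{tag} {np}")
        return labelled

    friendly = ["inlet and exhaust manifold", "opposite side of the cylinder head",
                "exhaust manifold", "gas", "single exhaust pipe", "silencer system"]
    hostile = ["inlet and exhaust manifold", "opposite side of the cylinder",
               "exhaust manifold", "gas", "single exhaust pipe", "silencer system"]
    for line in label_candidates(friendly, hostile):
        print(line)</Paragraph>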
<Paragraph position="4"> The linguistic analysis is relatively neutral as to what is to be extracted from it. Here we have concentrated on noun phrase extraction, but from this kind of input many other types of construction could also be extracted, e.g. simple verb-argument structures.</Paragraph> </Section> </Section> <Section position="6" start_page="53" end_page="54" type="metho"> <SectionTitle> 5 The syntactic description </SectionTitle> <Paragraph position="0"> This section outlines the syntactic description that I have written for NPtool purposes. The ENGTWOL lexicon and the disambiguation constraints will not be described further in this paper; they have been documented extensively elsewhere (see the relevant articles in Karlsson & al. [forthcoming]).</Paragraph> <Paragraph position="1"> According to the SIMPR experiences, the vast majority of index terms represent relatively few constructions. By far the most common construction is a nominal head with optional, potentially coordinated premodifiers and postmodifying prepositional phrases, typically of-phrases. The remainder, less than 10%, consists almost entirely of relatively simple verb-NP patterns.</Paragraph> <Paragraph position="2"> The syntactic description used in SIMPR employed some 30 dependency-oriented syntactic function tags, which differentiate (to some extent) between various kinds of verbal constructions, syntactic functions of nominal heads, and so on. Some of the ambiguity that survives ENGCG parsing is in part due to these distinctions [Anttila, forthcoming]. The relatively simple needs of an index term extraction utility on the one hand, and the relative abundance of distinctions in the ENGCG syntactic description on the other, suggest that a less fine-grained syntactic description might be better suited to the present purposes: a shallower description would entail less remaining ambiguity without unduly compromising its usefulness, e.g. for an indexing application.</Paragraph> <Section position="1" start_page="53" end_page="54" type="sub_section"> <SectionTitle> 5.1 Syntactic tags </SectionTitle> <Paragraph position="0"> I have designed a new syntactic grammar scheme that employs seven function tags. These tags capitalise on the opposition between noun phrases and other constructions on the one hand, and between heads and modifiers on the other. We will not go into details here; a gloss with a simple illustration will suffice.</Paragraph> <Paragraph position="1"> * @V represents auxiliary and main verbs as well as the infinitive marker to in both finite and non-finite constructions. For instance: She should/@V know/@V what to/@V do/@V.</Paragraph> <Paragraph position="2"> * @NH represents nominal heads, especially nouns, pronouns, numerals, abbreviations and -ing-forms. Note that of adjectival categories, only those with the morphological feature <Nominal>, e.g. English, are granted the @NH status: all other adjectives (and -ed-forms) are regarded as too unconventional nominal heads to be granted this status in the present description. An example: The English/@NH may like the conventional.</Paragraph> <Paragraph position="3"> * @>N represents determiners and premodifiers of nominals (the angle bracket '>' indicates the direction in which the head is to be found). The head is the following nominal with the tag @NH, or a premodifier in between. An example: the/@>N fat/@>N butcher's/@>N wife.</Paragraph> <Paragraph position="4"> * @N< represents prepositional phrases that unambiguously postmodify a preceding nominal head. Such unambiguously postmodifying constructions are typically of two types: (i) postnominal of-phrases, in the absence of certain verbs like 'accuse', and (ii) preverbal NP-PP sequences, e.g. The man in/@N< the moon had a glass of/@N< ale. Currently the description does not account for other types of postmodifier, e.g. postmodifying adjectives, numerals, other nominals, or clausal constructions.</Paragraph> <Paragraph position="5"> * @CC and @CS represent coordinating and subordinating conjunctions, respectively: Either/@CC you or/@CC I will go if/@CS necessary.</Paragraph> <Paragraph position="6"> * @AH represents the 'residual': adjectival heads, adverbials of various kinds, adverbs (also intensifiers), and also those prepositional phrases that cannot be dependably analysed as postmodifiers. An example is in order: There/@AH have always/@AH been very/@AH many people in/@AH this area.</Paragraph> </Section> <Section position="2" start_page="54" end_page="54" type="sub_section"> <SectionTitle> 5.2 Syntactic constraints </SectionTitle> <Paragraph position="0"> The syntactic grammar contains some 120 syntactic constraints. Like the morphological disambiguation constraints, these constraints are essentially negative partial linear-precedence definitions of the syntactic categories.</Paragraph> <Paragraph position="1"> The present grammar is a partial expression of four general grammar statements:
1. Part of speech determines the order of determiners and modifiers.
2. Only likes coordinate.
3. A determiner or a modifier has a head.
4. An auxiliary is followed by a main verb.</Paragraph> <Paragraph position="2"> We will give only one illustration of how these general statements can be expressed in Constraint Grammar. Let us give a partial paraphrase of the statement Part of speech determines the order of determiners and modifiers: 'A premodifying noun occurs closest to its head'. In other words, premodifiers from other parts of speech do not immediately follow a premodifying noun. Therefore, a noun in the nominative immediately followed by an adjective is not a premodifier. Thus a constraint in the grammar would discard the @>N tag of Harry in a sample sentence where Harry is directly followed by an unambiguous adjective. We require that the noun in question be a nominative because premodifying nouns in the genitive can also occur before adjectival premodifiers; witness Harry's in Harry's foolish self.</Paragraph>
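<Paragraph> A minimal sketch of this constraint, in the same assumed cohort representation as the earlier sketches (the real grammar states this in the Constraint Grammar formalism, not in Python):

    # "A premodifying noun occurs closest to its head": discard the @>N
    # tag of a nominative noun immediately followed by an unambiguous
    # adjective. A reading is (morph_tag_set, function_tag); tag names
    # are illustrative.

    def discard_noun_premodifier_before_adjective(words):
        for cur, nxt in zip(words, words[1:]):
            nxt_unambiguous_adj = len(nxt) == 1 and "A" in nxt[0][0]
            if not nxt_unambiguous_adj:
                continue
            kept = [r for r in cur
                    if not ({"N", "NOM"} <= r[0] and r[1] == "@>N")]
            if kept:  # never remove the last remaining reading
                cur[:] = kept
        return words

    # 'Harry foolish ...': nominative Harry loses its @>N reading; a
    # genitive (GEN, not NOM) would keep it, cf. "Harry's foolish self".
    words = [
        [({"N", "NOM"}, "@>N"), ({"N", "NOM"}, "@NH")],
        [({"A"}, "@AH")],
    ]
    print(discard_noun_premodifier_before_adjective(words))</Paragraph>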
</Section> <Section position="3" start_page="54" end_page="54" type="sub_section"> <SectionTitle> 5.3 Evaluation </SectionTitle> <Paragraph position="0"> The present syntax has been applied to large amounts of journalistic and technical text (newspapers, abstracts on electrical engineering, manuals on car maintenance, etc.), and the analysis of some 20,000-30,000 words has been proofread to get an estimate of the accuracy of the parser.</Paragraph> <Paragraph position="1"> After the application of the NPtool syntax, some 93-96% of all words become syntactically unambiguous, with an error rate of less than 1% 14.</Paragraph> <Paragraph position="2"> To find out how much ambiguity remains at the sentence level, I also applied an 'NP-neutral' version 15 of the finite-state parser to a 25,500-word text from The Grolier Electronic Encyclopaedia. The results are given in Figure 1.</Paragraph> <Paragraph position="3"> [Figure 1, not reproduced here: number of analyses per sentence in a text of 1,495 sentences (25,500 words). R indicates the number of analyses per sentence, and F indicates the frequency of such sentences.]</Paragraph> <Paragraph position="4"> Some 64% (960) of the 1,495 sentences became syntactically unambiguous, while only some 2% of all sentence analyses contain more than ten readings, the worst case being a sentence with 72 analyses. This compares favourably with the ENGCG performance: after ENGCG parsing, 23.5% of all sentences remained ambiguous with a number of sentence readings greater than the worst case in NPtool syntax.</Paragraph> </Section> </Section> </Paper>