A "not-so-shallow" parser for collocational analysis

2. A "not-so-shallow" parsing technique

Our syntactic analyzer (hereafter SSA) extracts partial syntactic structures from corpora. The analyzer, based on discontinuous grammar (Dahl, 1989), is able to detect binary and ternary syntactic relations among words, which we call elementary syntactic links (esl). The framework of discontinuous grammars has several advantages: it allows a simple notation and is portable across different logic programming styles. The presence of skip rules makes it possible to detect long-distance dependencies between co-occurring words. This is particularly important in many texts, owing to the presence of long coordinate constructions, nested clauses, lists, and parenthesised clauses.

The partial parsing strategy described hereafter requires as input little more than a morphologic lexicon (section 2.1). Post-morphologic processing, described in section 2.2, is not strictly required, though it obviously increases the reliability of the detected word relations. The lexicon used is purely morphologic, unlike that of the Fidditch parser, nor does it require training, as n-gram based models do. This means that the shallow analyzer is portable with minimal changes across different domains, which is not the case for the deterministic partial parsing used in similar works. Furthermore, the grammar rules are easy to tune to different linguistic subdomains. The analyzer detects different types of syntactic links among words: noun-verb, verb-noun, noun-preposition-noun, etc. This information is richer than plain SVO triples, in that phrase structures are partitioned into more granular units.

The parsing method has been implemented for different corpora, which exhibit very different linguistic styles: a corpus of commercial activities (CD), in telegraphic style, a legal domain (LD) on taxation norms and laws, and remote sensing (RSD) abstracts. The latter is in English, while the former two are in Italian.

The English application is rather less developed (a smaller morphologic lexicon, no post-morphology, etc.); however, it is useful here to demonstrate that the approach is language independent. In this paper we draw many examples from the RSD.

2.1 Morphology

The morphologic analyzer (Marziali, 1992) derives from work on a generative approach to Italian morphology (Russo, 1987), first used in DANTE, an NLP system for the analysis of short narrative texts in the financial domain (Antonacci et al., 1989). The analyzer includes over 7,000 elementary lemmata (stems without affixes, e.g. flex is the elementary lemma for de-flex, in-flex, re-flex) and has so far been tested on economic, financial, commercial and legal domains. Elementary lemmata cover far more than 7,000 words, since many words carry an affix.

An entry in the lexicon has the form lexicon(lemma, stem, ending_class, syntactic_feature), where lemma is the elementary lemma (e.g. ancora for ancor-aggio (anchor-age)), stem is the lemma without its ending (ancor), and ending_class is one of about 60 types of inflection. For example, ancora belongs to the class ec_cosa, since it inflects like the word cosa (thing).
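The lexicon entry format lends itself to a direct encoding. The sketch below is purely illustrative: the actual analyzer is a generative morphological grammar with about 60 ending classes, over 7,000 elementary lemmata and affix handling, none of which is reproduced here, and the entries, ending-class tables and analyze function are invented toy stand-ins for the stem + ending mechanism just described.

```python
# Illustrative sketch of lexicon(lemma, stem, ending_class, syntactic_feature)
# entries (Section 2.1).  All data below are invented toy examples.

ENDING_CLASSES = {
    "ec_cosa": [("a", "singular"), ("e", "plural")],   # inflects like cosa / cose
    "ec_e":    [("e", "singular"), ("i", "plural")],   # nouns like agente / agenti
    "ec_ire":  [("ire", "infinitive"), ("ente", "present participle")],
}

LEXICON = [
    # (lemma, stem, ending_class, syntactic_feature)
    ("ancora", "ancor", "ec_cosa", "noun"),
    ("agente", "agent", "ec_e",    "noun"),
    ("agire",  "ag",    "ec_ire",  "verb"),
]

def analyze(word):
    """Return every (lemma, syntactic_feature, inflection) reading of a word form
    obtained as stem + ending; all licensed readings are produced, which is why
    the analyzer tends to overgenerate (see the discussion that follows)."""
    readings = []
    for lemma, stem, ending_class, feature in LEXICON:
        if not word.startswith(stem):
            continue
        ending = word[len(stem):]
        for suffix, inflection in ENDING_CLASSES[ending_class]:
            if ending == suffix:
                readings.append((lemma, feature, inflection))
    return readings

print(analyze("ancora"))   # [('ancora', 'noun', 'singular')]
print(analyze("agente"))   # a noun reading and a present-participle reading of agire
```

Even on this toy lexicon the form agente already receives two readings, which is exactly the overgeneration phenomenon discussed next.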
The Italian morphologic lexicon and grammars are fully general. This means that the analyzer has a tendency to overgenerate. For example, the word agente (agent, in the sense of dealer) is interpreted both as a noun and as the present participle of the verb agire (to act), though this inflected form is never found in either Italian domain. The problem is less evident in English, which is less inflected.

Overgeneration is a common problem for grammar-based approaches to morphology, as opposed to part-of-speech (pos) taggers. On the other hand, pos taggers need manual corpus-training work every time a new domain is to be analyzed.

To evaluate the phenomenon of overgeneration quantitatively, we considered a test set of 25 sentences from the LD, comprising about 800 words. Of these 800 words, 546 were nouns, adjectives and verbs (i.e. potentially ambiguous words). The analyzer provided 631 interpretations of the 546 words, and 76 of the words were ambiguous. The overall estimated ambiguity is 76/546 = 0.139, while the overgeneration ratio is better evaluated by:

O = (631 - (546 - 76)) / 76 = 161/76 ≈ 2.1

2.2. Post morphological processing

The purpose of this module is to analyse compound expressions and numbers, such as compound verbs, dates, numeric expressions, and superlatives. Ad hoc context-free grammars have been defined for this purpose. Post-morphological processing also includes simple (but generally valid) heuristic rules to reduce certain types of ambiguity. There are two groups of such rules:

(i) rules to disambiguate ambiguous noun-adjective (N/Agg) interpretations (e.g. acid);
(ii) rules to disambiguate ambiguous verb-noun (V/N) interpretations (e.g. study).

One example of a heuristic for N/Agg is: if the N/Agg word is neither preceded nor followed by a noun or another N/Agg word before a verb is reached, then it is a noun.

Ex: ... and sulphuric acid was detected
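Read procedurally, the N/Agg heuristic amounts to a short scan to the left and to the right of the ambiguous word. The sketch below is only an illustration of that reading; the token representation, tag names and function are invented for the example and are not the system's actual rules or data structures.

```python
# Illustrative sketch of the N/Agg post-morphology heuristic described above.
# Tokens are (word, set-of-candidate-tags) pairs with a simplified tag inventory.

def disambiguate_n_agg(tokens, i):
    """If the N/Agg token at position i is neither preceded nor followed by a noun
    (or by another N/Agg candidate) before a verb is reached, read it as a noun."""
    assert tokens[i][1] == {"noun", "adj"}          # an ambiguous N/Agg reading

    def nominal_before_verb(indices):
        for j in indices:
            tags = tokens[j][1]
            if "verb" in tags:        # stop the scan as soon as a verb is reached
                return False
            if "noun" in tags:        # a noun, or another N/Agg candidate
                return True
        return False

    left = nominal_before_verb(range(i - 1, -1, -1))        # scan to the left
    right = nominal_before_verb(range(i + 1, len(tokens)))  # scan to the right
    return {"noun"} if not (left or right) else tokens[i][1]

# "... and sulphuric acid was detected"
sent = [("and", {"conj"}), ("sulphuric", {"adj"}), ("acid", {"noun", "adj"}),
        ("was", {"verb"}), ("detected", {"verb"})]
print(disambiguate_n_agg(sent, 2))   # -> {'noun'}, as in the example above
```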
Though the examples are in English, post-morphology had not yet been developed for English at the time of writing. After post-morphologic analysis, the 546 nouns, verbs and adjectives produced only 562 interpretations. The new overgeneration ratio is then O' = (562 - (546 - 76)) / 76 = 92/76 ≈ 1.2. The estimated efficacy of the post-morphology is therefore 161/92 ≈ 1.75, about a 50% reduction of the initial ambiguity.

2.3. The parser

The SSA syntactic analysis is a rewriting procedure that maps a single sentence into a set of elementary syntactic links (esl). The SSA is based on a discontinuous grammar, described more formally in (Basili et al., 1992a). In this section we give a qualitative description of the rules by which esl's are generated.

Examples of esl's generated by the parser are: N_V (the subject-verb relation), V_N (the verb-direct object relation), N_P_N (noun-preposition-noun), V_P_N (verb-preposition-noun), N_Adj (adjective-noun), N_N (compound), etc. Overall, we identify over 20 different esl types, and there is a discontinuous grammar rule for each of them. A description of the rule used to derive N_P_N links is given in Figure 1. This description applies, with straightforward modifications, to any other esl type (though some esl rules include a concordance test).

As remarked at the beginning of this section, skip rules are the key to extracting long-distance syntactic relations and to approximating the behaviour of a full parser. The first LOOK_RIGHT predicate of Figure 1 skips over the string X until it finds a preposition (prep(w2)); the second LOOK_RIGHT skips over Y until it finds a noun (noun(w3)). Given an initial string NL_segment, BACKTRACK forces the system to explore all the solutions of the LOOK_RIGHT predicates (i.e. one-step right skips) in order to derive all the N_P_N groups headed by the first noun (i.e. w1). For example, given the string:

low concentrations of acetone and ethyl alchool in acqueous solutions

the following N_P_N links are generated: concentration of acetone, concentration of alchool, concentration in solution, acetone in solution, alchool in solution, all of which are syntactically correct.

An uncontrolled application of skip rules would, however, produce unacceptable noise. The TEST_ON() predicates are ad hoc heuristic rules that prevent uncontrolled skips. For example, TEST_ON(X) in Figure 1 verifies that the string X does not include a verb. Hence, in the sentence:

... the atmospheric code compared favourably with results ...

the link N_P_N(code, with, results) is not generated. In general, there are one or two heuristic rules for each esl rule. Heuristic rules are designed to take efficient decisions by exploiting purely syntactic constraints. Such constraints are simple and require minimal computational effort (essentially, unification among simple structures). In some cases a lower recall is tolerated in order to avoid overgeneration. For example, the second TEST_ON(Y) rule of Figure 1 verifies that no more than two prepositions are skipped in the string Y. This rule stems from the observation that words located more than three prepositions apart are rarely semantically related, even though a full syntactic parser would eventually detect a relation. Hence, in the NL segment:

1% accuracy on the night side of the Earth with stars down to visual magnitude three

the triple (accuracy, to, three) is not generated, though it is syntactically correct.

Skip rules thus enable the derivation of esl's between non-adjacent words. However, interesting information can still be lost in the presence of more complex phenomena such as nested relative clauses or coordination of phrase structures. To cope with these phenomena, a post-syntactic processor has been developed to extract links stemming from coordination among previously detected links. This processing significantly increases the set of collected esl's and the quality of the derived lexical information. The contribution of this post-syntactic processing device depends heavily on the structure of the incoming sentences. In this phase, simple unification mechanisms are used rather than heuristics.
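Figure 1 itself is not reproduced in this text, so the following is only a loose illustration of the skip-and-test mechanism just described, not the authors' discontinuous-grammar rule (which is stated in a logic-programming notation). The tag set, the token representation and the exact arrangement of the TEST_ON-style constraints are simplifications chosen for the example.

```python
# Loose illustration of the N_P_N skip rule: scan right from a head noun, skipping
# material until a preposition and then a noun are found, under simplified
# constraints (no verb may be skipped, at most three prepositions may separate the
# two nouns, and the attached noun is taken from the phrase opened by w2).

def n_p_n_links(tokens):
    """Return all (w1, w2, w3) noun-preposition-noun links found by skipping right."""
    links = []
    for i, (w1, pos1) in enumerate(tokens):
        if pos1 != "noun":
            continue
        preps_crossed = 0
        for j in range(i + 1, len(tokens)):          # LOOK_RIGHT for a preposition
            w2, pos2 = tokens[j]
            if pos2 == "verb":                       # never skip over a verb
                break
            if pos2 != "prep":
                continue
            preps_crossed += 1
            if preps_crossed > 3:                    # too many prepositions apart
                break
            for k in range(j + 1, len(tokens)):      # LOOK_RIGHT for a noun
                w3, pos3 = tokens[k]
                if pos3 in ("prep", "verb"):         # stay inside the PP opened by w2
                    break
                if pos3 == "noun":
                    links.append((w1, w2, w3))
    return links

# "low concentrations of acetone and ethyl alchool in acqueous solutions"
sentence = [("low", "adj"), ("concentrations", "noun"), ("of", "prep"),
            ("acetone", "noun"), ("and", "conj"), ("ethyl", "adj"),
            ("alchool", "noun"), ("in", "prep"), ("acqueous", "adj"),
            ("solutions", "noun")]
for link in n_p_n_links(sentence):
    print(link)
```

On the example string this sketch yields exactly the five links listed above, while its no-verb test blocks N_P_N(code, with, results) in the "compared favourably" sentence.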
3. Performance evaluation

Recall and Precision

Many algorithms evaluate their recall and precision against a human reference performer. This poses many problems, such as finding "fair" test material, using a large number of judges to make the evaluation less subjective and, finally, interpreting the results. One example of the latter problem is the following: in (Smadja, 1993) the nature of the syntactic link between two associated words is detected a posteriori. The performance of the system, called XTRACT, was evaluated by letting human judges compare their choices against those of the system. The reported performance is about 80% precision and 90% recall. Such an evaluation experiment is, in our view, questionable, since both the human judges and XTRACT make their decisions outside the context of a sentence. The interpretation of the results therefore does not take into account how well XTRACT succeeds in identifying syntactic relations as they actually occur in the test suite. Another problem is that a human judge may deem a syntactic association incorrect on the ground of semantic knowledge [1]. Instead, the performance of a syntactic parser should be evaluated on syntactic grounds only.

[1] It is unclear whether Smadja considered this problem in his evaluation experiment.

We define the linguistic performance of SSA as its ability to approximate the generation of the full set of elementary syntactic links derivable by a complete grammar of the domain. Given the set Ω of all syntactically valid esl's and the set ω of esl's derived by applying SSA, the precision of the system can be defined as

cardinality(ω ∩ Ω) / cardinality(ω)

while its recall can be expressed by

cardinality(ω ∩ Ω) / cardinality(Ω).

Global evaluations of precision and recall are estimated by the mean values over the whole corpora.

For testing purposes we designed a full attribute grammar of the Italian legal language and selected 150 sentences for which the full grammar was proved correct. For each parsed sentence, a program automatically computes the esl's globally identified (without repetitions) by the parse trees of that sentence and compares them with those generated by SSA for the same sentence. The following table gives a measure of the resulting performance.

To fully appreciate these results, we must consider, first, that the evaluation is on purely syntactic grounds (many collocations detected by the full grammar and not detected by SSA are in fact semantically wrong) and, second, that the domain is particularly complex: there is an average of 23 parse trees per sentence in the test set. In particular, the low performance on N_V groups (i.e. the subject relation) is influenced by the very frequent (almost 80%) presence of nested relatives (e.g. "The income that was perceived during 1988 ... is included ...") and inversions (e.g. "si considerano esenti da tasse i redditi ..." = "the income ... is considered tax-free"). No partial parser could cope with such entangled structures.
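The automatic comparison just described reduces to simple set arithmetic over esl's. A minimal sketch follows, with invented toy esl sets standing in for the full-grammar and SSA outputs of one sentence.

```python
# Minimal sketch of the precision/recall measure defined above: omega_full stands
# for Ω (esl's licensed by the full attribute grammar), omega_ssa for ω (esl's
# produced by SSA).  The esl sets below are invented toy data.

def precision_recall(omega_full, omega_ssa):
    common = omega_full & omega_ssa
    precision = len(common) / len(omega_ssa) if omega_ssa else 0.0
    recall = len(common) / len(omega_full) if omega_full else 0.0
    return precision, recall

omega_full = {("N_P_N", "concentration", "of", "acetone"),
              ("N_V", "code", "compare"),
              ("N_Adj", "solution", "acqueous")}
omega_ssa = {("N_P_N", "concentration", "of", "acetone"),
             ("N_Adj", "solution", "acqueous")}

print(precision_recall(omega_full, omega_ssa))   # (1.0, 0.666...) on the toy data
```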
One interesting aspect is that these results appear to be very stable for the domain: incrementally adding new groups of sentences does not change the performance values significantly.

For completeness, we also evaluated the English grammar. In this case the evaluation was carried out entirely by hand, since no full grammar of English was available from which to derive the complete set of esl's automatically. A test set of 10 remote sensing abstracts (about 1,400 words, 67 sentences) was selected at random. The results are the following. Here the recall is rather high, since the sentences have a much simpler structure. However, there are many valid long-distance PP attachments that, for example, most existing partial parsers would not detect. The precision is lower because the English parser does not yet have a post-morphology. One major source of error in detecting N_V pairs is, as expected, compounds.

The most important factors that influence the time complexity are the number N of sentences (words) in the corpus and the number k of different discontinuous rules (about 20, as noted above). The global rewriting procedure of SSA depends on the length n of the incoming text segment according to the expression

Σ_{i=1}^{n} e(i)

where e(x) is the cost of applying a grammar rule, such as that of Figure 1, to a segment of length x. The cost e(x) is easily seen to depend on:

1. predicates that test the syntactic category of a word (e.g. noun(w1)), whose cost equals that of a simple unification procedure, i.e. τ;
2. TEST_ON predicates, whose cost is not greater than τ·n, where n is the substring length.

We can thus say that the cost of an SSA syntactic rule satisfies the inequality e(n) ≤ 3τ + 2τn = O(n). Hence, the global cost of rewriting a segment of length n with all k rules is bounded by k · Σ_{i=1}^{n} e(i) = O(k·n²).
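The quadratic behaviour follows directly from the linear bound on e(n); expanding the sum is a routine step:

```latex
\sum_{i=1}^{n} e(i) \;\le\; \sum_{i=1}^{n} \bigl(3\tau + 2\tau i\bigr)
  \;=\; 3\tau n + \tau\, n(n+1) \;=\; O(n^{2}),
\qquad\text{hence}\qquad
k \sum_{i=1}^{n} e(i) \;=\; O(k\, n^{2}).
```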
It is also significant that parsing the test set of 150 sentences with the full grammar takes 6 hours on a Sun Sparc station, while SSA takes only 10 minutes.

Portability and scalability

These two aspects are obviously related. The question is: how much time and how many resources are needed to switch to a different domain, or to update a given domain? Since we developed three entirely different applications, we can provide a reasonably reliable estimate of these parameters. The estimate is of course strongly dependent on the specific system we implemented; however, we frame our evaluation in a way that applies broadly to any system using similar techniques.

Morphology: Our experience when switching from the commercial to the legal domain was that, when running the analyzer over the new corpus, about 30,000 words could not be analyzed. This required the insertion of about 1,500 new elementary lemmata. Accounting for a new word requires entering the stem without affixes, the elementary lemma of the word, and the ending class (see section 2.1). Entering a new word takes about 5-10 minutes when the linguist is provided with some on-line help, for example a list of ending classes and browsing and testing facilities. With these facilities, updating the lexicon is a relatively easy job that does not require a specialized linguist.

Clearly, when implementing several applications, the global updating effort tends to zero. This is not the case for statistically based part-of-speech taggers, which always require a fixed effort to train on a new corpus. In the long run, it seems that grammar-based approaches to morphology have an advantage over pos taggers in terms of portability.

Our experience is that adding a new rule takes about one to two man-days. First, one must identify the linguistic pattern that is not accounted for by the grammar and verify whether it can reasonably be accounted for, given the intrinsic limitations of the parsing mechanism adopted. If the linguist decides that adding a new rule is indeed necessary and feasible, he/she implements the rule and tests its effects. Grammar modifications are required to:

* select the esl types of interest;
* define the heuristic rules (TEST_ON), as discussed in Section 2.3.

One positive aspect of SSA is that its complexity is O(k) with respect to the number k of grammar rules; hence adding new rules does not affect the complexity class of the method.

In summary, portability is an essential feature of SSA. While other parsers need a non-trivial effort to be tuned to different linguistic domains, we need only minimal adjustments to ensure the required coverage of the morphologic lexicon. However, the activity of lexical extension is needed with every approach. Portability is also guaranteed by the modularity of the approach.