<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1018"> <Title>Ordering Among Premodifiers</Title> <Section position="6" start_page="138" end_page="141" type="evalu"> <SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> We applied the three ordering algorithms proposed in this paper to the two corpora, separately for adjectives and for adjectives plus nouns.</Paragraph>
<Paragraph position="1"> For our first technique of directly using evidence from a separate training corpus, we filled the Count matrix (see Section 3.1) with the frequencies of each ordering for each pair of premodifiers using the training corpora. Then, we calculated which of those pairs correspond to a true underlying order relation, i.e., pass the statistical test of Section 3.1 with the probability given by equation (2) less than or equal to 50%.</Paragraph>
<Paragraph position="2"> We then examined each instance of ordered premodifiers in the corresponding test corpus, and counted how many of those the direct evidence method could predict correctly. Note that if A and B occur sometimes as A ≺ B and sometimes as B ≺ A, no prediction method can get all those instances correct. We elected to follow this evaluation approach, which lowers the apparent scores of our method, rather than forcing each pair in the test corpus to one unambiguous category (A ≺ B, B ≺ A, or arbitrary).</Paragraph>
[Table 1 caption (beginning truncated in the source): "... corpora. In each case, overall accuracy is listed first in bold, and then, in parentheses, the percentage of the test pairs that the method has an opinion for (rather than randomly assigning a decision because of lack of evidence) and the accuracy of the method within that subset of test cases."]
<Paragraph position="3"> Under this evaluation method, stage one of our system achieves 98.47% correct decisions on adjectives in the medical domain, over the pairs for which a determination of order could be made.</Paragraph>
<Paragraph position="4"> Since 11.80% of the total pairs in the test corpus involve previously unseen combinations of adjectives and/or new adjectives, the overall accuracy is 92.67%. The corresponding figures (accuracy on data for which we can make a prediction, followed by overall accuracy) are 98.35% and 88.79% for adjectives plus nouns in the medical domain, 98.37% and 75.41% for adjectives in the WSJ data, and 95.27% and 65.93% for adjectives plus nouns in the WSJ data. Note that the WSJ corpus is considerably more sparse, with 64.24% unseen combinations of adjective and noun premodifiers in the test part. Using lower thresholds in equation (2) results in a lower percentage of cases for which the system has an opinion but a higher accuracy for those decisions. For example, a threshold of 25% results in the ability to predict 83.72% of the test adjective pairs in the medical corpus, with 99.01% accuracy for these cases.</Paragraph>
<Paragraph position="5"> We subsequently applied the transitivity stage, testing the three semiring models discussed in Section 3.2. Early experimentation indicated that the or-and model performed poorly, which we attribute to its extensive propagation of decisions (once a decision in favor of the existence of an ordering relationship is made, it cannot be revised even in the presence of conflicting evidence). Therefore we report results below for the other two semiring models.</Paragraph>
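The following minimal Python sketch (not from the paper) illustrates how the first two stages fit together. It assumes that equation (2) is a one-sided binomial tail probability under the null hypothesis that the order of a pair is arbitrary, and it realizes the transitivity stage as a Floyd-Warshall-style closure over the min-plus semiring with the test probability as the edge cost; the paper's exact formula, edge weights, and function names may differ.

    from math import comb

    def order_probability(n_ab: int, n_ba: int) -> float:
        """Stand-in for equation (2): probability of a split at least this
        uneven if the order of A and B were arbitrary (p = 0.5)."""
        n, k = n_ab + n_ba, max(n_ab, n_ba)
        return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    def direct_evidence(counts, threshold=0.50):
        """Stage one: from the Count matrix, keep A < B whenever A precedes
        B more often than not and the test probability is at most the
        threshold."""
        prob = {}
        for (a, b), n_ab in counts.items():
            n_ba = counts.get((b, a), 0)
            if n_ab > n_ba:
                p = order_probability(n_ab, n_ba)
                if p <= threshold:
                    prob[(a, b)] = p
        return prob

    def min_plus_closure(vocab, prob):
        """Stage two: transitive closure under the min-plus semiring.
        A path's cost is the sum of its edge costs; the cheapest path
        wins (computed Floyd-Warshall style)."""
        INF = float("inf")
        cost = {(a, b): prob.get((a, b), INF) for a in vocab for b in vocab}
        for k in vocab:
            for a in vocab:
                for b in vocab:
                    if cost[a, k] + cost[k, b] < cost[a, b]:
                        cost[a, b] = cost[a, k] + cost[k, b]
        return cost  # cost[a, b] < cost[b, a] suggests A < B

    # Hypothetical counts: "large" precedes "red" five times, follows it once.
    counts = {("large", "red"): 5, ("red", "large"): 1}
    print(direct_evidence(counts))  # {('large', 'red'): 0.109375}

Lowering the threshold in such a scheme trades coverage for precision, matching the behavior reported above for the 25% threshold.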
<Paragraph position="6"> Of those, the min-plus semiring achieved higher performance. That model offers additional predictions for 9.00% of adjective pairs and 11.52% of adjective-plus-noun pairs in the medical corpus, raising the overall accuracy of our predictions to 94.93% and 90.67% respectively. Overall accuracy in the WSJ test data was 80.77% for adjectives and 71.04% for adjectives plus nouns.</Paragraph>
<Paragraph position="7"> Table 1 summarizes the results of these two stages.</Paragraph>
<Paragraph position="8"> Finally, we applied our third, clustering approach to each data stratum. Due to data sparseness and computational complexity issues, we clustered the most frequent words in each set of premodifiers (adjectives or adjectives plus nouns), selecting those that occurred at least 50 times in the training part of the corpus being analyzed. We report results for the adjectives selected in this manner (472 frequent adjectives from the medical corpus and 307 adjectives from the WSJ corpus). For these words, the information collected by the first two stages of the system covers most pairs. Out of the 111,156 (= 472 × 471 / 2) possible pairs in the medical data, the direct evidence and transitivity stages make predictions for 105,335 (94.76%); the corresponding number for the WSJ data is 40,476 out of 46,971 possible pairs (86.17%).</Paragraph>
<Paragraph position="9"> The clustering technique makes ordering predictions for a part of the remaining pairs: on average, depending on how many clusters are created, this method produces answers for 80% of the ordering cases that remained unanswered after the first two stages in the medical corpus, and for 54% of the unanswered cases in the WSJ corpus. Its accuracy on these predictions is 56% on the medical corpus, and slightly worse than the baseline 50% on the WSJ corpus; this latter, aberrant result is due to a single, very frequent pair, chief executive, in which executive is consistently mistagged as an adjective by the part-of-speech tagger.</Paragraph>
<Paragraph position="10"> Qualitative analysis of the third stage's output indicates that it identifies many interesting relationships between premodifiers; for example, the pair of most similar premodifiers on the basis of positional information is left and right, which clearly fall in a class similar to the semantic classes manually constructed by linguists. Other sets of adjectives with strongly similar members include {mild, severe, significant} and {cardiac, pulmonary, respiratory}.</Paragraph>
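Continuing the hypothetical Python sketch above, the clustering stage can be illustrated as follows. The similarity measure here (agreement of ordering behavior relative to shared third words) and the greedy single-link agglomeration are stand-ins for the measure and clustering procedure the paper actually uses; prec is the set of A < B decisions produced by the first two stages.

    def similarity(w1, w2, prec, vocab):
        """Positional similarity: over third words z whose order relative
        to both w1 and w2 is known, the fraction on which w1 and w2 fall
        on the same side of z."""
        agree = total = 0
        for z in vocab:
            if z in (w1, w2):
                continue
            s1 = 1 if (w1, z) in prec else -1 if (z, w1) in prec else 0
            s2 = 1 if (w2, z) in prec else -1 if (z, w2) in prec else 0
            if s1 and s2:
                total += 1
                agree += s1 == s2
        return agree / total if total else 0.0

    def cluster(vocab, prec, threshold=0.8):
        """Greedy single-link agglomeration: repeatedly merge the two most
        similar clusters while the best link is above the threshold."""
        clusters = [{w} for w in vocab]
        while len(clusters) > 1:
            best, pair = threshold, None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = max(similarity(a, b, prec, vocab)
                            for a in clusters[i] for b in clusters[j])
                    if s > best:
                        best, pair = s, (i, j)
            if pair is None:
                break
            i, j = pair
            clusters[i] |= clusters.pop(j)  # j > i, so index i is unaffected
        return clusters

    def predict(a, b, clusters, prec):
        """Order an unseen pair by majority vote of the known precedences
        between the clusters containing a and b."""
        ca = next(c for c in clusters if a in c)
        cb = next(c for c in clusters if b in c)
        votes = sum((x, y) in prec for x in ca for y in cb) \
              - sum((y, x) in prec for x in ca for y in cb)
        return votes > 0  # True: predict a < b

Under such a measure, a pair like left and right would indeed come out as highly similar, since they tend to fall on the same side of most other premodifiers.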
<Paragraph position="11"> We conclude our empirical analysis by testing whether a separate model is needed for predicting adjective order in each different domain. We trained the first two stages of our system on the medical corpus and tested them on the WSJ corpus, obtaining an overall prediction accuracy of 54% for adjectives and 52% for adjectives plus nouns. Similar results were obtained when we trained on the financial domain and tested on medical data (58% and 56%). These results are not much better than what would have been obtained by chance, and are clearly inferior to those reported in Table 1. Although the two corpora share a large number of adjectives (1,438 out of 5,703 total adjectives in the medical corpus and 8,240 in the WSJ corpus), they share only 2 to 5% of the adjective pairs. This empirical evidence indicates that adjectives are used differently in the two domains, and hence domain-specific probabilities must be estimated, which increases the value of an automated procedure for the prediction task.</Paragraph>
<Paragraph position="12"> [Figure 1 residue; only part of the caption survives: "(b) Output of the generator with our ordering module."]</Paragraph>
<Paragraph position="13"> The ordering relations our method produces form a self-contained module which can be easily incorporated into the overall generation architecture. We have integrated the function compute_order(A, B) into our multimedia presentation system MAGIC [Dalal et al. 1996] in the medical domain and resolved numerous premodifier ordering tasks correctly.</Paragraph>
<Paragraph position="14"> Example cases where the statistical prediction module was helpful in producing a more fluent description in MAGIC include placing age information before ethnicity information and the latter before gender information, as well as specific ordering preferences, such as "thick" before "yellow" and "acute" before "severe". MAGIC's output is being evaluated by medical doctors, who provide us with feedback on different components of the system, including the fluency of the generated text and its similarity to human-produced reports.</Paragraph>
<Paragraph position="15"> Lexicalization is inherently domain dependent, so traditional lexica cannot be ported across domains without major modifications.</Paragraph>
<Paragraph position="16"> Our approach, in contrast, is based on words extracted from a domain corpus rather than on concepts, so it can be easily applied to new domains. In our MAGIC system, aggregation operators, such as conjunction, ellipsis, and transformations of clauses into adjectival phrases and relative clauses, are applied to combine related clauses and increase conciseness [Shaw 1998a; Shaw 1998b]. We wrote a function, reorder_premod(...), which is called after the aggregation operators; it takes the whole lexicalized semantic representation and reorders the premodifiers right before the linguistic realizer is invoked. Figure 1 shows the difference in the output produced by our generator with and without the ordering component.</Paragraph> </Section> </Paper>
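As an illustration of the integration described in the last paragraphs, here is a hypothetical Python sketch; the text does not give the actual signatures of compute_order and reorder_premod or MAGIC's internal representation, so the Phrase structure and the sample data below are assumptions.

    from dataclasses import dataclass, field
    from functools import cmp_to_key

    @dataclass
    class Phrase:
        head: str                                     # e.g. "infarction"
        premods: list = field(default_factory=list)   # e.g. ["acute", "severe"]

    def compute_order(a, b, prec):
        """-1 if a should precede b, 1 if b should precede a, 0 if unknown;
        prec is the set of learned A < B decisions."""
        if (a, b) in prec:
            return -1
        if (b, a) in prec:
            return 1
        return 0  # unknown pairs keep an arbitrary relative order

    def reorder_premod(phrase, prec):
        """Called after the aggregation operators, right before the realizer:
        sort each phrase's premodifiers by the learned precedences."""
        phrase.premods.sort(key=cmp_to_key(lambda a, b: compute_order(a, b, prec)))
        return phrase

    # Hypothetical decisions learned from the medical corpus:
    prec = {("acute", "severe"), ("thick", "yellow")}
    print(reorder_premod(Phrase("infarction", ["severe", "acute"]), prec).premods)
    # ['acute', 'severe']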