<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2031">
<Title>Assigning Function Tags to Parsed Text*</Title>
<Section position="6" start_page="237" end_page="239" type="evalu">
<SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="237" end_page="237" type="sub_section">
<SectionTitle> 5.1 Baselines </SectionTitle>
<Paragraph position="0"> There are, it seems, two reasonable baselines for this and future work. First, most constituents in the corpus have no tags at all, so one obvious baseline is simply to guess no tag for any constituent. Even for the most common type of function tag (grammatical), this method performs with 87% accuracy. Thus the with-null accuracy of a function tagger needs to be very high to be significant here.</Paragraph>
<Paragraph position="1"> The second baseline might be useful in examining the no-null accuracy values (particularly the recall): always guess the most common tag in each category. This means that every constituent gets labelled with '-SBJ-TMP-TPC-CLR' (meaning that it is a topicalised temporal subject that is 'closely related' to its verb). This combination of tags is in fact entirely illegal under the treebank guidelines, but it performs adequately as a baseline. The precision is, of course, abysmal, for the same reasons the first baseline did so well; but the recall is (as one might expect) substantial. The performance of the two baseline measures is given in table 1.</Paragraph>
</Section>
<Section position="2" start_page="237" end_page="237" type="sub_section">
<SectionTitle> 5.2 Performance in individual categories </SectionTitle>
<Paragraph position="0"> In table 2, we give the results for each category.</Paragraph>
<Paragraph position="1"> The first column is the with-null accuracy, and the precision and recall values given are the no-null accuracy, as noted in section 4.</Paragraph>
<Paragraph position="2"> Grammatical tagging performs the best of the four categories. Even under the more difficult no-null accuracy measure, it reaches 96% accuracy. This seems to reflect the fact that grammatical relations can often be guessed from constituent labels, parts of speech, and high-frequency lexical items, largely avoiding sparse-data problems. Topicalisation can similarly be guessed largely from high-frequency information, and performs almost as well (93%).</Paragraph>
<Paragraph position="3"> On the other hand, we have the form/function tags and the 'miscellaneous' tags. These are characterised by much more semantic information, and the relationships between lexical items are very important, making sparse data a real problem. All the same, it should be noted that the performance is still far better than the baselines.</Paragraph>
</Section>
<Section position="3" start_page="237" end_page="238" type="sub_section">
<SectionTitle> 5.3 Performance with other feature trees </SectionTitle>
<Paragraph position="0"> The feature tree given in figure 4 is by no means the only feature tree we could have used. Indeed, we tried a number of different trees on the development corpus; this tree gave some of the best overall results, with no category performing too badly. However, there is no reason to use only one feature tree for all four categories; the best results can be obtained by using a separate tree for each one. One can thus achieve slight (one- to three-point) gains in each category.</Paragraph>
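<Paragraph> The sketch below illustrates the per-category selection just described; it is not the paper's code, and the feature-ordering representation and the train_fn/eval_fn helpers are our own assumptions standing in for the training and development-set evaluation procedure.

# Hypothetical sketch: choose a separate feature tree (represented here
# simply as a feature ordering) for each tag category by development-set
# accuracy.  train_fn and eval_fn are caller-supplied stand-ins.

CATEGORIES = ("grammatical", "form/function", "topicalisation", "miscellaneous")

def select_feature_trees(candidate_orderings, train_fn, eval_fn):
    """Return, per category, the candidate ordering with the best dev accuracy.

    train_fn(category, ordering) is assumed to return a trained tagger;
    eval_fn(category, tagger) is assumed to return its dev-set accuracy.
    """
    best = {}
    for category in CATEGORIES:
        scored = [(eval_fn(category, train_fn(category, ordering)), ordering)
                  for ordering in candidate_orderings]
        best[category] = max(scored)[1]  # keep the highest-accuracy ordering
    return best
</Paragraph>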
</Section>
<Section position="4" start_page="238" end_page="239" type="sub_section">
<SectionTitle> 5.4 Overall performance </SectionTitle>
<Paragraph position="0"> The overall performance, given in table 3, appears promising. With a tagging accuracy of about 87%, various information retrieval and knowledge base applications can reasonably expect to extract useful information.</Paragraph>
<Paragraph position="1"> The performance given in the first row is (like all previously given performance values) the function tagger's performance on the correctly labelled constituents output by our parser. For comparison, we also give its performance when run directly on the original treebank parse; since the parser's accuracy is about 89%, working directly with the treebank means our statistics cover roughly 12% more constituents. This second version does slightly better.</Paragraph>
<Paragraph position="2"> The main reason that tagging does worse on the parsed version is that although the constituent itself may be correctly bracketed and labelled, its exterior conditioning information can still be incorrect. An example of this that actually occurred in the development corpus (section 24 of the treebank) is the 'that' clause in the phrase 'can swallow the premise that the rewards for such ineptitude are six-figure salaries', correctly diagrammed in figure 5. The function tagger gave this SBAR an ADV tag, indicating an unspecified adverbial function. This seems extremely odd, given that its conditioning information (the nodes circled in the figure) clearly shows that it is part of an NP, and hence probably modifies the preceding NN. Indeed, the statistics give the probability of an ADV tag in this conditioning environment as vanishingly small.</Paragraph>
<Paragraph position="3"> However, this was not the conditioning information that the tagger received. The parser had instead decided on the (incorrect) parse in figure 6. As such, the tagger's decision makes much more sense, since an SBAR under two VPs whose heads are VB and MD is rather likely to be an ADV. (For instance, the 'although' clause of the sentence 'he can help, although he doesn't want to.' has exactly the conditioning environment given in figure 6, except that its predecessor is a comma; and this SBAR would be correctly tagged ADV.) The SBAR itself is correctly bracketed and labelled, so it still gets counted in the statistics. Happily, this sort of case seems to be relatively rare.</Paragraph>
<Paragraph position="4"> Another thing that lowers the overall performance somewhat is the existence of error and inconsistency in the treebank tagging. Some tags seem to have been relatively easy for the human treebank taggers, and have few errors. Other tags have explicit caveats that, however well-justified, proved difficult for the taggers to remember; for instance, there are 37 instances of a PP being tagged with LGS (logical subject) in spite of the guidelines specifically saying, '[LGS] attaches to the NP object of by and not to the PP node itself.' (Bies et al., 1995) Each mistagging in the test corpus can cause up to two spurious errors, one in precision and one in recall.</Paragraph>
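<Paragraph> To make the 'up to two spurious errors' point concrete, the following sketch is our own illustration, assuming that the no-null precision and recall of section 4 are computed over individual (constituent, tag) decisions against the gold standard; it shows how a single mistagged constituent in the test corpus is charged once against recall, and once more against precision if the tagger guesses a different tag there.

# Illustrative sketch (not the paper's scorer), assuming no-null
# precision/recall count individual tags against the gold standard.

def no_null_precision_recall(gold_tags, guessed_tags):
    """gold_tags, guessed_tags: lists of tag sets, one set per constituent."""
    correct = guessed = in_gold = 0
    for gold, guess in zip(gold_tags, guessed_tags):
        correct += len(gold.intersection(guess))
        guessed += len(guess)
        in_gold += len(gold)
    return correct / guessed, correct / in_gold

# Suppose the treebank entry is mistagged (LGS placed on the PP) while the
# tagger, following the guidelines, leaves that constituent untagged.  The
# tagger's correct guess elsewhere is unaffected, but the mistagged
# constituent costs one recall error; had the tagger guessed some other tag
# there, it would also cost one precision error.
gold = [{"SBJ"}, {"LGS"}]   # second entry is the annotation error
guess = [{"SBJ"}, set()]    # tagger follows the guidelines
print(no_null_precision_recall(gold, guess))  # (1.0, 0.5)
</Paragraph>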
<Paragraph position="5"> Still another source of difficulty comes when the guidelines are vague or silent on a specific issue. To return to logical subjects, it is clear that 'the loss' is a logical subject in 'The company was hurt by the loss', but what about in 'The company was unperturbed by the loss'? In addition, a number of the function tags are authorised for 'metaphorical use', but what exactly constitutes such a use is somewhat inconsistently marked.</Paragraph>
<Paragraph position="6"> It is as yet unclear just to what degree these tagging errors in the corpus are affecting our results.</Paragraph>
</Section>
</Section>
</Paper>