<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1612"> <Title>Learning Information Status of Discourse Entities</Title> <Section position="9" start_page="99" end_page="100" type="relat"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> To our knowledge, there are no other studies on the automatic assignment of information status in English. Recently, Postolache et al. (2005) reported experiments on learning information structure in the Prague Treebank. The Czech treebank is annotated following the Topic-Focus articulation theory (Hajičová et al., 1998). The theoretical definitions underlying the Prague Treebank and the corpus we are using are different, with the former giving a more global picture of information structure, and the latter a more entity-specific one. For this reason, and because Postolache et al.'s experiments are on Czech (which has a freer word order than English), comparing results is not straightforward.</Paragraph> <Paragraph position="1"> Their best system (a C4.5 decision tree) achieves an accuracy of 90.69% on the topic/focus identification task. This result is comparable with the result we obtain when training and testing on the corpus where mediated and new entities are not distinguished (93.1%). Postolache and colleagues also observe that the learning curve flattens slowly after a very small amount of data (even 1%, in their case). They therefore predict that an increase in performance will mainly come from better features rather than more training data. This is likely to be true in our case as well, especially since our feature set is currently small and should benefit from incorporating additional features. Postolache et al. use a larger feature set, which also includes coreference information. The corpus we use has manually annotated coreference links. 
However, because we see anaphoricity determination as a task that could benefit from automatic information status assignment, we decided not to exploit this information in the current experiments. Moreover, we did not want our model to rely too heavily on a feature that is not easy to obtain automatically.</Paragraph> </Section> </Paper>