<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3210">
  <Title>Automatic Paragraph Identification: A Study across Languages and Domains</Title>
  <Section position="6" start_page="0" end_page="0" type="concl">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> In this paper, we investigated whether paragraph boundaries can be predicted automatically using a supervised approach that exploits textual, syntactic and discourse cues. We achieved accuracies between 71.83% and 83.92%. In all but one case, these were significantly higher than the best baseline.</Paragraph>
    <Paragraph position="1"> We conducted our study in three different domains and languages and found that the best features for the news and parliamentary proceedings domains are based on word co-occurrence, whereas features that exploit punctuation are better predictors for the fiction domain. Models that incorporate syntactic and discourse cue features do not lead to significant improvements over models that do not. This means that paragraph boundaries can be predicted by relying on low-level, language-independent features. The task is therefore feasible even for languages for which parsers or cue word lists are not readily available.</Paragraph>
    <Paragraph position="2"> We also experimented with training sets of different sizes and found that more training data does not necessarily lead to significantly better results and that it is possible to beat the best baseline comfortably even with a relatively small training set.</Paragraph>
    <Paragraph position="3"> Finally, we examined how well humans do on this task. Our results indicate that humans achieve average accuracies between 77.45% and 88.58%, with some domains appearing to be easier than others.</Paragraph>
    <Paragraph position="4"> Our models achieved accuracies within 6% of human performance.</Paragraph>
    <Paragraph position="5"> In the future, we plan to apply our model to new domains (e.g., broadcast news or scientific papers), to non-Indo-European languages such as Arabic and Chinese, and to machine-generated texts.</Paragraph>
  </Section>
</Paper>