<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3210"> <Title>Automatic Paragraph Identification: A Study across Languages and Domains</Title> <Section position="5" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> BoosTexter is parametrised with respect to the number of training iterations. In all our experiments, this parameter was optimised on the development set; BoosTexter was initially trained for 500 iterations, and then re-trained with the number of iterations that led to the lowest error rate on the development set. Throughout this paper all results are reported on the unseen test set and were obtained using models optimised on the development set. We report the models' accuracy at predicting the right label (i.e., paragraph starting or not) for each sentence. [Table 2 layout: one row per feature; columns for English, German and Greek, each subdivided into the fiction, news and parliamentary (parl.) domains.]</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 The Influence of Non-syntactic Features </SectionTitle> <Paragraph position="0"> In the first set of experiments, we ran BoosTexter on all 9 corpora using non-syntactic and language modelling features. To evaluate the contribution of individual features to the classification task, we built one-feature classifiers in addition to a classifier that combined all features. Table 2 shows the test set classification accuracy of the individual features and their combination (all_ns+lcm). The length of the language and character models was optimised on the development set. The test set accuracy of the optimised models is shown as BestLM_p and BestLM_pwe (language models) and BestCM_p (character models). (Which language and character models perform best varies slightly across corpora, but no clear trends emerge.) The results for the three best-performing one-feature classifiers and the combined classifier are shown in boldface.</Paragraph> <Paragraph position="1"> BoosTexter's classification accuracy was further compared against two baselines. A distance-based baseline (B_d) was obtained by hypothesising a paragraph break after every d sentences. We estimated d in the training data by counting the average number of sentences between two paragraphs. Our second baseline, B_m, defaults to the majority class, i.e., assumes that the text does not have paragraph breaks.</Paragraph> <Paragraph position="2"> For all languages and domains, the combined models perform better than the best baseline. In order to determine whether this difference is significant, we applied χ2 tests.</Paragraph> <Paragraph position="3"> The diacritics in Table 2 indicate whether a given model is (or is not) significantly different from the best baseline. Significant results are achieved across the board with the exception of German fiction. We believe the reason for this lies in the corpus itself, as it is very heterogeneous, containing texts whose publication dates range from 1766 to 1999 and which exhibit a wide variation in style and orthography. This makes it difficult for any given model to reliably identify paragraph boundaries in all texts.</Paragraph> <Paragraph position="4"> In general, the best-performing features vary across domains but not across languages. Word features (W1-W3, W_all) yield the best classification accuracies for the news and parliamentary domains, whereas for fiction, quotes and punctuation seem more useful.
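(As an aside on the two baselines introduced above: the following minimal Python sketch shows one way B_d and B_m could be computed. The data representation, function names and the accuracy helper are our own assumptions, not the authors' code.)

def estimate_d(train_paragraphs):
    # d = average number of sentences between two paragraph breaks,
    # estimated from the training data (one list of sentences per paragraph).
    total_sentences = sum(len(par) for par in train_paragraphs)
    return max(1, round(total_sentences / len(train_paragraphs)))

def distance_baseline(test_sentences, d):
    # B_d: hypothesise a paragraph break after every d sentences
    # (True = the sentence starts a paragraph).
    return [i % d == 0 for i in range(len(test_sentences))]

def majority_baseline(test_sentences):
    # B_m: majority class, i.e. assume the text has no paragraph breaks.
    return [False] * len(test_sentences)

def accuracy(predicted, gold):
    # Per-sentence accuracy at predicting paragraph-initial vs. not.
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)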
The only exception is the German fiction corpus, which consists mainly of 19th-century texts.</Paragraph> <Paragraph position="5"> These contain less direct speech than the two fiction corpora for English and Greek (which contain contemporary texts). Furthermore, while examples of direct speech in the English corpus often involve short dialogues, where a paragraph boundary is introduced after each speaker turn, the German corpus contains virtually no dialogues, and examples of direct speech are usually embedded in a longer narrative and not surrounded by paragraph breaks.</Paragraph> <Paragraph position="6"> Note that the distance in words from the previous paragraph boundary (Dist_w) is a good indicator for a paragraph break in the English news domain.</Paragraph> <Paragraph position="7"> However, this feature is less useful for the other two languages. An explanation might be that the English news corpus is very homogeneous (i.e., it contains articles that not only have similar content but are also structurally alike). The Greek news corpus is relatively homogeneous; it mainly contains financial news articles but also some interviews, so there is greater variation in paragraph length, which means that the distance feature is overtaken by the word-based features. Finally, the German news corpus is highly heterogeneous, containing not only news stories but also weather forecasts, sports results and cinema listings. This leads to a large variation in paragraph length, which in turn means that the distance feature performs worse than the best baseline.</Paragraph> <Paragraph position="8"> The heterogeneity of the German news corpus may also explain another difference: while the final punctuation of the previous sentence (FinPun) is among the less useful features for English and Greek (albeit still outperforming the baseline), it is the best-performing feature for German. The German news corpus contains many &quot;sentences&quot; that end in atypical end-of-sentence markers such as semi-colons (which are often found in cinema listings). Atypical markers will often not occur before paragraph breaks, whereas typical markers will. This renders final punctuation a better predictor of paragraph breaks in the German corpus than in the other two corpora.</Paragraph> <Paragraph position="9"> The language models behave similarly across domains and languages. With the exception of the news domain, they do not seem able to outperform the majority baseline by more than 1%.</Paragraph> <Paragraph position="10"> The word entropy rate yields the worst performance, whereas character-based models perform as well as word-based models. In general, our results show that language modelling features are not particularly useful for this task.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 The Influence of Syntactic Features </SectionTitle> <Paragraph position="0"> Our second set of experiments concentrated solely on the English data and investigated the usefulness of the syntactic features (see Table 3). Again, we created one-feature classifiers and a classifier that combined all features, i.e., language and character models, non-syntactic, and syntactic features (all_ns+lcm+syn). Table 3 also repeats the performance of the two baselines (B_d and B_m) and the combined non-syntactic models (all_ns+lcm).
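(For concreteness, the χ2 comparisons reported in these experiments, model against best baseline above and the combined models against each other below, could be implemented roughly as follows. This is a generic Pearson chi-square on a 2x2 table of correct/incorrect decisions; the tabulation and the example counts are our own assumptions, not the authors' exact procedure.)

from math import erfc, sqrt

def chi_square_2x2(correct_a, wrong_a, correct_b, wrong_b):
    # Pearson chi-square statistic (1 degree of freedom) comparing the
    # correct/incorrect decisions of two classifiers A and B.
    table = [[correct_a, wrong_a], [correct_b, wrong_b]]
    n = correct_a + wrong_a + correct_b + wrong_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2.0))  # survival function of chi2 with 1 dof
    return chi2, p_value

# Hypothetical counts on a 1,000-sentence test set:
chi2, p = chi_square_2x2(910, 90, 870, 130)
significant = p < 0.05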
The accuracies of the three best-performing one-feature models and the combined model are again shown in boldface.</Paragraph> <Paragraph position="1"> As can be seen, syntactic features do not contribute very much to the overall performance. They only increase the accuracy by about 1%. A χ2 test revealed that the difference between all_ns+lcm and all_ns+lcm+syn is not statistically significant (indicated in Table 3) for any of the three domains.</Paragraph> <Paragraph position="2"> The syntactic features seem to be less domain-dependent than the non-syntactic ones. In general, the part-of-speech signature features (Sign, Sign_p) are good predictors, followed by the syntactic labels of the children of the top nodes (Childrs, Childrs1).</Paragraph> <Paragraph position="3"> The number of NPs (Num_np) and their branching factor (Branch_np) are also good indicators for some domains, particularly the news domain. This is plausible since paragraph-initial sentences in the Wall Street Journal often contain named entities, such as company names, which are parsed as flat NPs, i.e., have a relatively high branching factor.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 The Effect of Training Size </SectionTitle> <Paragraph position="0"> Finally, we examined the effect of the size of the training data on the learner's classification accuracy.</Paragraph> <Paragraph position="1"> We conducted our experiments solely on the English data; however, we expect the results to generalise to German and Greek. From each English training set we created ten progressively smaller data sets, the first being identical to the original set, the second containing 9/10 of the sentences in the original training set, the third containing 8/10, and so on. The training instances in each data set were selected randomly. BoosTexter was trained on each of these sets (using all features), as described previously, and tested on the test set.</Paragraph> <Paragraph position="2"> Figure 1 shows the learning curves obtained this way. The curves are more or less flat, i.e., increasing the amount of training data does not have a large effect on the performance of the model. Furthermore, even the smallest of our training sets is big enough to outperform the best baseline. Hence, it is possible to do well on this task even with less training data. This is important, given that for spoken texts, paragraph boundaries may have to be obtained by manual annotation. The learning curves indicate that a relatively modest annotation effort would be required to obtain training data were it not freely available.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Human Evaluation </SectionTitle> <Paragraph position="0"> We established an upper bound against which our automatic methods could be compared by conducting an experiment that assessed how well humans agree on identifying paragraph boundaries. Five participants were given three English texts (one from each domain), selected randomly from the test corpus. Each text consisted of approximately a tenth of the original test set (i.e., 200-400 sentences).</Paragraph> <Paragraph position="1"> The participants were asked to insert paragraph breaks wherever it seemed appropriate to them.
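(The agreement figures reported in the next paragraphs, mean pairwise percentage agreement and a pairwise kappa statistic, could be computed along the following lines. This sketch uses the two-rater kappa formulation and assumes one list of binary paragraph-break labels per judge; these are our own simplifications rather than the authors' exact procedure following Siegel and Castellan.)

from itertools import combinations

def pairwise_kappa(labels_a, labels_b):
    # Two-rater kappa on binary labels (1 = sentence starts a paragraph).
    n = len(labels_a)
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a, p_b = sum(labels_a) / n, sum(labels_b) / n
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_obs - p_exp) / (1 - p_exp)

def percentage_agreement(labels_a, labels_b):
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def mean_pairwise(judgements, metric):
    # Average a pairwise metric over all pairs of judges.
    pairs = list(combinations(judgements, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)

# judgements = [labels_judge1, labels_judge2, ...], each a list of 0/1 values
# kappa = mean_pairwise(judgements, pairwise_kappa)
# agreement = mean_pairwise(judgements, percentage_agreement)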
No other instructions were given, as we wanted to see whether the participants could perform the task independently, without any specific knowledge regarding the domains and their paragraphing conventions.</Paragraph> <Paragraph position="2"> We measured the agreement of the judges using the Kappa coefficient (Siegel and Castellan, 1988) but also report percentage agreement to facilitate comparison with our models. In all cases, we compute pairwise agreements and report the mean. Our results are shown in Table 4.</Paragraph> <Paragraph position="3"> As can be seen, participants tend to agree with each other on the task. The lowest agreement is observed for the news domain. This is somewhat expected, as the Wall Street Journal texts are rather difficult for non-experts to process. Recall also that our subjects were given no instructions or training. In all cases our models yield an accuracy lower than the human agreement. For the fiction domain the best model is 5.67% below the upper bound, for the news domain it is 5.62%, and for the parliament domain it is 5.42% (see Tables 3 and 4).</Paragraph> </Section> </Section> </Paper>