<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3210"> <Title>Automatic Paragraph Identification: A Study across Languages and Domains</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Our Approach </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Corpora </SectionTitle> <Paragraph position="0"> Our study focused on three languages: English, German, and Greek. These languages differ in terms of word order (fixed in English, semi-free in German, fairly flexible in Greek). Greek and German also have richer morphology than English. Additionally, Greek has a non-Latin writing system.</Paragraph> <Paragraph position="1"> For each language we created corpora representative of three domains: fiction, news, and parliamentary proceedings. Previous work on the role of paragraph markings (Stark, 1988) has focused exclusively on fiction texts, and has shown that humans can identify paragraph boundaries in this domain reliably. It therefore seemed natural to test our automatic method on a domain for which the task has been shown to be feasible. We selected news texts since most summarisation methods today focus on this domain and we can therefore explore the relevance of our approach for this application. Finally, parliamentary proceedings are transcripts of speech, and we can examine whether a method that relies solely on textual cues is also useful for spoken texts.</Paragraph> <Paragraph position="2"> For English, we used the whole Hansard section of the BNC, as our corpus of parliamentary proceedings. We then created a fiction corpus of similar size by randomly selecting prose files from the fiction part of the BNC. In the same way a news corpus was created from the Penn Treebank.</Paragraph> <Paragraph position="3"> For German, we used the prose part of Project Gutenberg's e-book collection3 as our fiction corpus and the complete Frankfurter Rundschau part of the ECI corpus4 as our news corpus. The corpus of parliamentary proceedings was obtained by randomly selecting a subset of the German section from the Europarl corpus (Koehn, 2002).</Paragraph> <Paragraph position="4"> For Greek, a fiction corpus was compiled from the ECI corpus by selecting all prose files that contained paragraph markings. Our news corpus was downloaded from the WWW site of the Modern Greek newspaper Eleftherotypia and consists of financial news from the period of 2001-2002. A corpus of parliamentary proceedings was again created by randomly selecting a subset of the Greek section of the Europarl corpus (Koehn, 2002).</Paragraph> <Paragraph position="5"> Parts of the data were further pre-processed to insert sentence boundaries. We trained a publicly available sentence splitter (Reynar and Ratnaparkhi, 1997) on a small manually annotated sample (1,000 sentences per domain per language) and applied it to our corpora. Table 1 shows the corpus sizes. All corpora were split into training (72%), development (24%) and test set (4%).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Machine Learning </SectionTitle> <Paragraph position="0"> We used BoosTexter (Schapire and Singer, 2000) as our machine learning system. BoosTexter was originally developed for text categorisation and combines a boosting algorithm with simple decision rules. For all domains and languages our training examples were sentences. 
<Paragraph position="1"> The features we used fall broadly into three different areas: non-syntactic features, language modelling features and syntactic features. The latter were only applied to English as we did not have suitable parsers for German and Greek.</Paragraph>
<Paragraph position="2"> The values of our features are numeric, boolean or &quot;text&quot;. BoosTexter applies unigram models when forming classification hypotheses for features with &quot;text&quot; values. These can be simply words or annotations such as part-of-speech tags.</Paragraph>
<Paragraph position="3"> We deliberately did not include anaphora-based features. While anaphors can help determine paragraph boundaries (paragraph initial sentences tend to contain few or no anaphors), anaphora structure is dependent on paragraph structure rather than the other way round. Hence, in applications which manipulate texts and thereby potentially &quot;mess up&quot; the anaphora structure (e.g., multi-document summarisation), anaphors are not a reliable cue for paragraph identification.</Paragraph>
<Paragraph position="4"> Distance (Ds, Dw): These features encode the distance of the current sentence from the previous paragraph break. We measured distance in terms of the number of intervening sentences (Ds) as well as in terms of the number of intervening words (Dw). If paragraph breaks were driven purely by aesthetics, one would expect this feature to be among the most successful ones.</Paragraph>
<Paragraph position="5"> Sentence Length (Length): This feature encodes the number of words in the current sentence. Average sentence length is known to vary with text position (Genzel and Charniak, 2003) and it is possible that it also varies with paragraph position.</Paragraph>
<Paragraph position="6"> Relative Position (Pos): The relative position of a sentence in the text is calculated by dividing the current sentence number by the number of sentences in the text. The motivation for this feature is that paragraph length may vary with text position. For example, it is possible that paragraphs at the beginning and end of a text are shorter than paragraphs in the middle and hence a paragraph break is more likely at the two former text positions.</Paragraph>
<Paragraph position="7"> Quotes (Quotep, Quotec, Quotei): These features encode whether the previous or current sentence contains a quotation (Quotep and Quotec, respectively) and whether the current sentence continues a quotation that started in a preceding sentence (Quotei). The presence of quotations can provide cues for speaker turns, which are often signalled by paragraph breaks.</Paragraph>
<Paragraph position="8"> Final Punctuation (FinPun): This feature keeps track of the final punctuation mark of the previous sentence. Some punctuation marks may provide hints as to whether a break should be introduced. For example, in the news domain, where there is hardly any dialogue, if the previous sentence ended in a question mark, it is likely that the current sentence supplies an answer to this question, thus making a paragraph break improbable.</Paragraph>
<Paragraph position="9"> Words (W1, W2, W3, Wall): These text-valued features encode the words in the sentence. Wall takes the complete sentence as its value. W1, W2 and W3 encode the first word, the first two words and the first three words, respectively.</Paragraph>
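As an illustration of how these non-syntactic features could be computed, here is a minimal sketch; the function and dictionary keys are our own naming, quotation handling is deliberately crude, and the input is assumed to be a list of already tokenised sentences.

    # Illustrative sketch of the non-syntactic features (Ds, Dw, Length, Pos,
    # Quotes, FinPun, W1/W2/W3/Wall). All names and the simplistic quote
    # detection are assumptions made for this example, not the paper's code.

    def nonsyntactic_features(sentences, i, last_break):
        """sentences: tokenised sentences (lists of word strings) of one text;
        i: index of the current sentence;
        last_break: index of the most recent paragraph-initial sentence."""
        sent = sentences[i]
        prev = sentences[i - 1] if i > 0 else []
        quote_tokens = {'"', "``", "''"}
        has_quote = lambda toks: any(t in quote_tokens for t in toks)
        # Crude test for Quotei: an odd number of quote marks seen so far means
        # a quotation opened in an earlier sentence is still running.
        open_quote = sum(t in quote_tokens for s in sentences[:i] for t in s) % 2 == 1
        return {
            "Ds": i - last_break,                                # intervening sentences
            "Dw": sum(len(s) for s in sentences[last_break:i]),  # intervening words
            "Length": len(sent),
            "Pos": (i + 1) / len(sentences),                     # relative position in text
            "Quotep": has_quote(prev),
            "Quotec": has_quote(sent),
            "Quotei": open_quote,
            "FinPun": prev[-1] if prev else "NONE",              # final token of previous sentence
            "W1": sent[0] if sent else "",
            "W2": " ".join(sent[:2]),
            "W3": " ".join(sent[:3]),
            "Wall": " ".join(sent),
        }

Here the word features and FinPun are strings, corresponding to the text-valued features mentioned above, while the remaining entries are numeric or boolean.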
<Paragraph position="10"> The words of previous sentences could also be encoded as a feature (as in part-of-speech tagging); however, we leave this to future research.</Paragraph>
<Paragraph position="11"> Our motivation for including language modelling features stems from Genzel and Charniak's (2003) work, where they show that the word entropy rate is lower for paragraph initial sentences than for non-paragraph initial ones. We therefore decided to examine whether word entropy rate is a useful feature for the paragraph prediction task. Using the training set for each language and domain, we created language models with the CMU language modelling toolkit (Clarkson and Rosenfeld, 1997). We experimented with language models of variable length (i.e., 1-5) and estimated two features: the probability of a given sentence according to the language model (LMp) and the per-word entropy rate (LMpwe). The latter was estimated by dividing the sentence probability as assigned by the language model by the number of sentence words (see Genzel and Charniak (2003)).</Paragraph>
<Paragraph position="12"> We additionally experimented with character level n-gram models. Such models are defined over a relatively small vocabulary and can be easily constructed for any language without pre-processing. Character level n-gram models have been applied to the problem of authorship attribution and obtained state-of-the-art results (Peng et al., 2003). If some characters are more often attested in paragraph starting sentences (e.g., &quot;A&quot; or &quot;T&quot;), then we expect these sentences to have a higher probability compared to non-paragraph starting ones. Again, we used the CMU toolkit for building the character level n-gram models. We experimented with models whose length varied from 2 to 8 and estimated the probability assigned to a sentence according to the character level model (CMp).</Paragraph>
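The sketch below illustrates how the two language-model features can be derived. It stands in for the CMU toolkit with a tiny self-contained word-bigram model (add-one smoothing), and it uses the standard definition of per-word entropy as the negative log-probability normalised by sentence length; the model orders, smoothing, and names are our own simplifications.

    import math
    from collections import Counter

    # Toy bigram language model standing in for the CMU toolkit models used in
    # the paper; add-one smoothing, log2 probabilities. Purely illustrative.
    class BigramLM:
        def __init__(self, sentences):
            self.unigrams, self.bigrams = Counter(), Counter()
            for sent in sentences:
                toks = ["<s>"] + sent + ["</s>"]
                self.unigrams.update(toks)
                self.bigrams.update(zip(toks, toks[1:]))
            self.vocab_size = len(self.unigrams)

        def logprob(self, sent):
            """Log2 probability of a tokenised sentence under the model."""
            toks = ["<s>"] + sent + ["</s>"]
            lp = 0.0
            for prev, cur in zip(toks, toks[1:]):
                # add-one smoothed estimate of P(cur | prev)
                p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + self.vocab_size)
                lp += math.log2(p)
            return lp

    def lm_features(model, sent):
        lp = model.logprob(sent)
        return {
            "LMp": lp,                         # sentence (log-)probability
            "LMpwe": -lp / max(len(sent), 1),  # per-word entropy rate
        }

A character-level model (CMp) would apply the same computation to the sentence's character sequence rather than to its words.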
<Paragraph position="13"> For the English data we also used several features encoding syntactic complexity. Genzel and Charniak (2003) suggested that the syntactic complexity of sentences varies with their position in a paragraph. Roughly speaking, paragraph initial sentences are less complex. Hence, complexity measures may be a good indicator of paragraph boundaries. To estimate complexity, we parsed the texts with Charniak's (2001) parser and implemented the following features:</Paragraph>
<Paragraph position="14"> Parsed: This feature states whether the current sentence could be parsed. While this is not a real measure of syntactic complexity, it is probably correlated with it.</Paragraph>
<Paragraph position="15"> Number of phrases (nums, numvp, numnp, numpp): These features measure syntactic complexity in terms of the number of S, VP, NP, and PP constituents in the parse tree.</Paragraph>
<Paragraph position="16"> Signature (Sign, Signp): These text-valued features encode the sequence of part-of-speech tags in the current sentence. Sign only encodes word tags, while Signp also includes punctuation tags.</Paragraph>
<Paragraph position="17"> Children of Top-Level Nodes (Childrs1, Childrs): These text-valued features encode the top-level complexity of a parse tree: Childrs1 takes as its value the sequence of syntactic labels of the children of the S1-node (i.e., the root of the parse tree), while Childrs encodes the syntactic labels of the children of the highest S-node(s). For example, Childrs1 may encode that the sentence consists of one clause and Childrs may encode that this clause consists of an NP, a VP, and a PP.</Paragraph>
<Paragraph position="18"> Branching Factor (Branchs, Branchvp, Branchnp, Branchpp): These features express the average number of children of a given non-terminal constituent (cf. Genzel and Charniak (2003)). We compute the branching factor for S, VP, NP, and PP constituents.</Paragraph>
<Paragraph position="19"> Tree Depth: We define tree depth as the average length of a path (from root node to leaf node).</Paragraph>
<Paragraph position="20"> Cue Words (Cues, Cuem, Cuee): These features are not strictly syntactic but rather discourse-based. They encode discourse cues (such as because) at the start (Cues), in the middle (Cuem) and at the end (Cuee) of the sentence, where &quot;start&quot; is the first word, &quot;end&quot; the last one, and everything else is &quot;middle&quot;. We keep track of all cue word occurrences, without attempting to distinguish between their syntactic and discourse usages.</Paragraph>
<Paragraph position="21"> For English, there are extensive lists of discourse cues (we used Knott (1996)), but such lists are not widely available for German and Greek. Hence, we only used these features on the English data.</Paragraph> </Section> </Section> </Paper>
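Finally, as an illustration of the parse-based complexity measures described in Section 3.2, the sketch below computes constituent counts, branching factors, average tree depth, and the Childrs1 labels from a bracketed parse using NLTK's Tree class. The feature names mirror the paper's, but the implementation details and the example parse are our own assumptions.

    from nltk.tree import Tree

    # Illustrative computation of the syntactic complexity features from a
    # bracketed (Charniak-style) parse with an S1 root node, as in the paper.

    def count_constituents(tree, label):
        """Number of constituents with the given label, e.g. numnp for 'NP'."""
        return sum(1 for sub in tree.subtrees() if sub.label() == label)

    def branching_factor(tree, label):
        """Average number of children of constituents with the given label."""
        nodes = [sub for sub in tree.subtrees() if sub.label() == label]
        return sum(len(sub) for sub in nodes) / len(nodes) if nodes else 0.0

    def tree_depth(tree):
        """Average length of a root-to-leaf path."""
        leaf_positions = tree.treepositions("leaves")
        return sum(len(pos) for pos in leaf_positions) / len(leaf_positions)

    if __name__ == "__main__":
        parse = Tree.fromstring(
            "(S1 (S (NP (DT The) (NN room)) (VP (VBD was) (ADJP (JJ dark))) (. .)))")
        print("numnp:", count_constituents(parse, "NP"))
        print("Branchvp:", branching_factor(parse, "VP"))
        print("Tree depth:", tree_depth(parse))
        # Childrs1: syntactic labels of the children of the root (S1) node
        print("Childrs1:", " ".join(
            child.label() if isinstance(child, Tree) else child for child in parse))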