<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1004"> <Title>A Maximum Entropy Approach to Identifying Sentence Boundaries</Title>
<Section position="4" start_page="16" end_page="16" type="metho"> <SectionTitle> 3 Our Approach </SectionTitle>
<Paragraph position="0"> We present two systems for identifying sentence boundaries. One is targeted at high performance and uses some knowledge about the structure of English financial newspaper text which may not be applicable to text from other genres or in other languages. The other system uses no domain-specific knowledge and is aimed at being portable across English text genres and Roman alphabet languages.</Paragraph>
<Paragraph position="1"> Potential sentence boundaries are identified by scanning the text for sequences of characters separated by whitespace (tokens) containing one of the symbols !, . or ?. We use information about the token containing the potential sentence boundary, as well as contextual information about the tokens immediately to the left and to the right. We also conducted tests using wider contexts, but performance did not improve.</Paragraph>
<Paragraph position="2"> We call the token containing the symbol which marks a putative sentence boundary the Candidate.</Paragraph>
<Paragraph position="3"> The portion of the Candidate preceding the potential sentence boundary is called the Prefix and the portion following it is called the Suffix. The system that focused on maximizing performance used the following hints, or contextual &quot;templates&quot;: [template list not preserved in the extracted text]. The templates specify only the form of the information. The actual information used by the maximum entropy model for the potential sentence boundary marked by . in Corp. in Example 1 would be: PreviousWordIsCapitalized, Prefix=Corp, Suffix=NULL, PrefixFeature=CorporateDesignator.</Paragraph>
<Paragraph position="4"> (1) ANLP Corp. chairman Dr. Smith resigned.</Paragraph>
<Paragraph position="5"> The highly portable system uses only the identity of the Candidate and its neighboring words, and a list of abbreviations induced from the training data. Specifically, the &quot;templates&quot; used are: [template list not preserved in the extracted text]. The information this model would use for Example 1 would be: PreviousWord=ANLP, FollowingWord=chairman, Prefix=Corp, Suffix=NULL, PrefixFeature=InducedAbbreviation. [Footnote 2: A token is considered an abbreviation if it is preceded and followed by whitespace, and it contains a . that is not a sentence boundary.]</Paragraph>
<Paragraph position="6"> The abbreviation list is automatically produced from the training data, and the contextual questions are also automatically generated by scanning the training data with question templates. As a result, no hand-crafted rules or lists are required by the highly portable system and it can be easily re-trained for other languages or text genres.</Paragraph>
</Section>
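To make the feature templates above concrete, here is a minimal Python sketch (ours, not the authors' code) that scans whitespace-delimited tokens for candidate boundaries and extracts the Prefix, Suffix, and neighboring-word information just described. The function name and the toy abbreviation set are illustrative assumptions, not part of the paper.

```python
# Sketch of candidate identification and feature extraction (Section 3).
# Names such as candidate_features are our own; the paper only
# describes the feature templates, not an implementation.

CANDIDATE_CHARS = {".", "!", "?"}

def candidate_features(tokens, abbreviations):
    """Yield (index, features) for every whitespace-delimited token
    containing one of the symbols !, . or ?."""
    for i, tok in enumerate(tokens):
        for k, ch in enumerate(tok):
            if ch in CANDIDATE_CHARS:
                prefix, suffix = tok[:k], tok[k + 1:]
                feats = {
                    "PreviousWord": tokens[i - 1] if i > 0 else "NULL",
                    "FollowingWord": tokens[i + 1] if i + 1 < len(tokens) else "NULL",
                    "Prefix": prefix if prefix else "NULL",
                    "Suffix": suffix if suffix else "NULL",
                }
                if prefix in abbreviations:
                    feats["PrefixFeature"] = "InducedAbbreviation"
                yield i, feats
                break  # one candidate per token suffices for this sketch

# Example 1 from the text: "ANLP Corp. chairman Dr. Smith resigned."
tokens = "ANLP Corp. chairman Dr. Smith resigned.".split()
for i, feats in candidate_features(tokens, abbreviations={"Corp", "Dr"}):
    print(tokens[i], feats)
```

For the Corp. token this reproduces the portable system's information from the text: PreviousWord=ANLP, FollowingWord=chairman, Prefix=Corp, Suffix=NULL, PrefixFeature=InducedAbbreviation.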
<Section position="5" start_page="16" end_page="17" type="metho"> <SectionTitle> 4 Maximum Entropy </SectionTitle>
<Paragraph position="0"> The model used here for sentence-boundary detection is based on the maximum entropy model used for POS tagging in (Ratnaparkhi, 1996). For each potential sentence boundary token (., ?, and !), we estimate a joint probability distribution p of the token and its surrounding context, both of which are denoted by c, occurring as an actual sentence boundary. The distribution is given by:

p(b, c) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(b, c)}, \qquad b \in \{no, yes\}

where the \alpha_j's are the unknown parameters of the model, and where each \alpha_j corresponds to a f_j, or a feature.</Paragraph>
<Paragraph position="1"> Thus the probability of seeing an actual sentence boundary in the context c is given by p(yes, c). The contextual information deemed useful for sentence-boundary detection, which we described earlier, must be encoded using features. For example, a useful feature might be:

f_j(b, c) = \begin{cases} 1 &amp; \text{if } \mathrm{Prefix}(c) = Mr \text{ and } b = no \\ 0 &amp; \text{otherwise} \end{cases}

This feature will allow the model to discover that the period at the end of the word Mr. seldom occurs as a sentence boundary. Therefore the parameter corresponding to this feature will hopefully boost the probability p(no, c) if the Prefix is Mr. The parameters are chosen to maximize the likelihood of the training data using the Generalized Iterative Scaling (Darroch and Ratcliff, 1972) algorithm.</Paragraph>
<Paragraph position="2"> The model can also be viewed under the Maximum Entropy framework, in which we choose a distribution p that maximizes the entropy H(p):

H(p) = -\sum_{b, c} p(b, c) \log p(b, c)

subject to the constraint that each feature's expectation under p match its observed expectation:

\sum_{b, c} p(b, c) f_j(b, c) = \sum_{b, c} \tilde{p}(b, c) f_j(b, c), \qquad 1 \le j \le k

where \tilde{p}(b, c) is the observed distribution of sentence boundaries and contexts in the training data. As a result, the model in practice tends not to commit towards a particular outcome (yes or no) unless it has seen sufficient evidence for that outcome; it is maximally uncertain beyond meeting the evidence.</Paragraph>
<Paragraph position="3"> All experiments use a simple decision rule to classify each potential sentence boundary: a potential sentence boundary is an actual sentence boundary if and only if p(yes|c) > 0.5, where

p(yes \mid c) = \frac{p(yes, c)}{p(yes, c) + p(no, c)}

and where c is the context including the potential sentence boundary.</Paragraph>
</Section>
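As an illustration of the model form and decision rule above, the following sketch (our own, with made-up parameter values standing in for GIS-estimated weights) evaluates p(b, c) = \pi \prod_j \alpha_j^{f_j(b,c)} over binary features and classifies a candidate as a boundary iff p(yes|c) > 0.5.

```python
# Sketch of maximum entropy scoring and the decision rule (Section 4).
# The alpha values below are hypothetical; in the paper they are
# estimated from training data with Generalized Iterative Scaling.

def p_joint(outcome, active_features, alphas, pi=1.0):
    """p(b, c) = pi * product of alpha_j over features active in (b, c).
    With binary features, f_j(b, c) = 1 contributes a factor of alpha_j;
    inactive features contribute a factor of 1."""
    p = pi
    for feat in active_features:
        p *= alphas.get((outcome, feat), 1.0)
    return p

def is_boundary(active_features, alphas):
    """Decision rule: boundary iff p(yes|c) > 0.5. The normalizer pi
    cancels in the ratio, so it can be ignored here."""
    p_yes = p_joint("yes", active_features, alphas)
    p_no = p_joint("no", active_features, alphas)
    return p_yes / (p_yes + p_no) > 0.5

# Hypothetical parameters: "Prefix=Mr" strongly favors b = no,
# mirroring the Mr. example in the text.
alphas = {("no", "Prefix=Mr"): 9.0, ("yes", "Suffix=NULL"): 1.2}
print(is_boundary({"Prefix=Mr", "Suffix=NULL"}, alphas))        # False
print(is_boundary({"Prefix=resigned", "Suffix=NULL"}, alphas))  # True
```

Because \pi appears in both p(yes, c) and p(no, c), only the \alpha_j factors for active features affect the classification, which is why the sketch can omit normalization entirely.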
<Section position="6" start_page="17" end_page="18" type="metho"> <SectionTitle> 5 System Performance </SectionTitle>
<Paragraph position="0"> We trained our system on 39,441 sentences (898,737 words) of Wall Street Journal text from sections 00 through 24 of the second release of the Penn Treebank (Marcus, Santorini, and Marcinkiewicz, 1993). [Footnote 3: We did not train on files which overlapped with Palmer and Hearst's test data, namely sections 03, 04, 05 and 06.] We corrected punctuation mistakes and erroneous sentence boundaries in the training data. Performance figures for our best performing system, which used a hand-crafted list of honorifics and corporate designators, are shown in Table 1. The first test set, WSJ, is Palmer and Hearst's initial test data and the second is the entire Brown corpus. We present the Brown corpus performance to show the importance of training on the genre of text on which testing will be performed. Table 1 also shows the number of sentences in each corpus, the number of candidate punctuation marks, the accuracy over potential sentence boundaries, the number of false positives and the number of false negatives. Performance on the WSJ corpus was, as we expected, higher than performance on the Brown corpus since we trained the model on financial newspaper text.</Paragraph>
<Paragraph position="1"> Possibly more significant than the system's performance is its portability to new domains and languages. A trimmed down system which used no information except that derived from the training corpus performs nearly as well, and requires no resources other than a training corpus. Its performance on the same two corpora is shown in Table 2. [Table 2: Performance of the highly portable system.]</Paragraph>
<Paragraph position="2"> Since 39,441 training sentences is considerably more than might exist in a new domain or a language other than English, we experimented with the quantity of training data required to maintain performance. Table 3 shows performance on the WSJ corpus as a function of training set size using the best performing system and the more portable system. As can be seen from the table, performance degrades as the quantity of training data decreases, but even with only 500 example sentences performance is better than the baselines of 64.0% if a sentence boundary is guessed at every potential site and 78.4% if only token-final instances of sentence-ending punctuation are assumed to be boundaries.</Paragraph>
</Section> </Paper>