<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1013"> <Title>Adaptive Sentence Boundary Disambiguation</Title> <Section position="3" start_page="0" end_page="79" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Labeling of sentence boundaries is a necessary prerequisite for many natural language processing (NLP) tasks, including part-of-speech tagging (Church, 1988), (Cutting et al., 1991), and sentence alignment (Gale and Church, 1993), (Kay and Röscheisen, 1993). End-of-sentence punctuation marks are ambiguous; for example, a period can denote an abbreviation, the end of a sentence, or both, as shown in the examples below: (1) The group included Dr. J.M. Freeman and T.</Paragraph> <Paragraph position="1"> Boone Pickens Jr.</Paragraph> <Paragraph position="2"> (2) &quot;This issue crosses party lines and crosses philosophical lines!&quot; said Rep. John Rowland (R., Conn.). Riley (1989) determined that in the Tagged Brown corpus (Francis and Kučera, 1982) about 90% of periods occur at the end of sentences, 10% at the end of abbreviations, and about 0.5% as both abbreviations and sentence delimiters. Note from example (2) that exclamation points and question marks are also ambiguous, since they too can appear at locations other than sentence boundaries.</Paragraph> <Paragraph position="3"> Most robust NLP systems, e.g., Cutting et al.</Paragraph> <Paragraph position="4"> (1991), find sentence delimiters by tokenizing the text stream and applying a regular expression grammar with some amount of look-ahead, an abbreviation list, and perhaps a list of exception rules. These approaches are usually hand-tailored to the particular text and rely on brittle cues such as capitalization and the number of spaces following a sentence delimiter. Typically these approaches use only the tokens immediately preceding and following the punctuation mark to be disambiguated. 
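The tokenizer-plus-rules strategy described above can be sketched as follows. This is a hypothetical minimal illustration, not any of the cited systems; the abbreviation list and the one-token look-ahead heuristic are invented for the example.

```python
import re

# Hypothetical, tiny abbreviation list; real systems hand-tailor much larger ones.
ABBREVIATIONS = {"dr.", "mr.", "rep.", "jr.", "p.m.", "etc."}

def naive_boundaries(text):
    """Mark sentence-final punctuation using only the token carrying it
    and one token of look-ahead -- the brittle strategy described above."""
    tokens = text.split()
    boundaries = []
    for i, tok in enumerate(tokens):
        if not re.search(r"[.!?]$", tok):
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # A known abbreviation counts as a boundary only when the next
        # token is capitalized -- which misfires on "p.m. Saturday".
        if tok.lower() in ABBREVIATIONS and not nxt[:1].isupper():
            continue
        boundaries.append(i)
    return boundaries

# On example (3a), "p.m." is wrongly kept as a boundary because
# "Saturday" is capitalized: the look-ahead cue is not enough.
print(naive_boundaries("It was due Friday by 5 p.m. Saturday would be too late."))
# -> [6, 11]  (the token indices of "p.m." and "late.")
```

The misfire on example (3a) is exactly the failure mode the text attributes to approaches that consult only the tokens adjacent to the punctuation mark.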
However, more context can be necessary, such as when an abbreviation appears at the end of a sentence, as seen in (3a-b): (3a) It was due Friday by 5 p.m. Saturday would be too late.</Paragraph> <Paragraph position="5"> (3b) She has an appointment at 5 p.m. Saturday to get her car fixed.</Paragraph> <Paragraph position="6"> or when punctuation occurs in a subsentence within quotation marks or parentheses, as seen in Example (2). Some systems have achieved accurate boundary determination by applying very large manual effort. For example, at Mead Data Central, Mark Wasson and colleagues, over a period of 9 staff months, developed a system that recognizes special tokens (e.g., non-dictionary terms such as proper names, legal statute citations, etc.) as well as sentence boundaries. From this, Wasson built a stand-alone boundary recognizer in the form of a grammar converted into finite automata with 1419 states and 18002 transitions (excluding the lexicon). The resulting system, when tested on 20 megabytes of news and case law text, achieved an accuracy of 99.7% at speeds of 80,000 characters per CPU second on a mainframe computer. When tested against upper-case legal text the algorithm still performed very well, achieving accuracies of 99.71% and 98.24% on test data of 5305 and 9396 periods, respectively. It is not likely, however, that the results would be this strong on lower-case data. 1 Humphrey and Zhou (1989) report using a feed-forward neural network to disambiguate periods, although they use a regular grammar to tokenize the text before training the neural nets, and achieve an accuracy averaging 93%. 2 Riley (1989) describes an approach that uses regression trees (Breiman et al., 1984) to classify sentence boundaries according to features of the words surrounding the punctuation mark. The method uses information about one word of context on either side of the punctuation mark and thus must record, for every word in the lexicon, the probability that it occurs next to a sentence boundary. 
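The lexicon-wide bookkeeping that Riley's method requires can be sketched as follows. This is a simplified, hypothetical illustration of the per-word boundary probabilities only; the regression-tree classification itself is omitted, and the `(word, is_boundary)` data format and toy examples are invented here.

```python
from collections import Counter

def boundary_probabilities(labeled_periods):
    """For each word type, estimate the probability that a period
    following it marks a true sentence boundary, from pre-labeled data."""
    seen = Counter()      # times the word occurs immediately before a period
    boundary = Counter()  # times that period was a true sentence boundary
    for word, is_boundary in labeled_periods:
        seen[word] += 1
        if is_boundary:
            boundary[word] += 1
    return {w: boundary[w] / seen[w] for w in seen}

# Toy pre-labeled data standing in for the 25 million words of AP newswire.
probs = boundary_probabilities(
    [("Friday", True), ("p.m.", False), ("p.m.", True), ("Friday", True)]
)
# probs["Friday"] -> 1.0, probs["p.m."] -> 0.5
```

The storage cost the text alludes to is visible here: one probability must be recorded per word type in the lexicon.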
Probabilities were compiled from 25 million words of pre-labeled training data from a corpus of AP newswire. The results were tested on the Brown corpus, achieving an accuracy of 99.8%. 3 Müller (1980) provides an exhaustive analysis of sentence boundary disambiguation as it relates to lexical endings and the identification of words surrounding a punctuation mark, focusing on text written in English. This approach makes multiple passes through the data and uses large word lists to determine the positions of full stops. Accuracy rates of 95-98% are reported for this method tested on over 75,000 scientific abstracts. (In contrast to Riley's Brown corpus statistics, Müller reports sentence-ending to abbreviation ratios ranging from 92.8%/7.2% to 54.7%/45.3%. This implies a need for an approach that can adapt flexibly to the characteristics of different text collections.) Each of these approaches has disadvantages to overcome. We propose that a sentence-boundary disambiguation algorithm have the following characteristics: 1All information about Mead's system is courtesy of a personal communication with Mark Wasson.</Paragraph> <Paragraph position="7"> 2Accuracy results were obtained courtesy of a personal communication with Joe Zhou.</Paragraph> <Paragraph position="8"> 3Time for training was not reported, nor was the amount of the Brown corpus against which testing was performed; we assume the entire Brown corpus was used. * The approach should be robust, and should not require a hand-built grammar or specialized rules that depend on capitalization, multiple spaces between sentences, etc. 
Thus, the approach should adapt easily to new text genres and new languages.</Paragraph> <Paragraph position="9"> * The approach should train quickly on a small training set and should not require excessive storage overhead.</Paragraph> <Paragraph position="10"> * The approach should be very accurate and efficient enough that it does not noticeably slow down text preprocessing.</Paragraph> <Paragraph position="11"> * The approach should be able to specify &quot;no opinion&quot; on cases that are too difficult to disambiguate, rather than making underinformed guesses.</Paragraph> <Paragraph position="12"> In the following sections we present an approach that meets each of these criteria, achieving performance close to solutions that require manually designed rules, and behaving more robustly. Section 2 describes the algorithm, Section 3 describes some experiments that evaluate the algorithm, and Section 4 summarizes the paper and describes future directions.</Paragraph> </Section> </Paper>