<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-2002">
  <Title>Adaptive Multilingual Sentence Boundary Disambiguation</Title>
  <Section position="2" start_page="0" end_page="242" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Recent years have seen a dramatic increase in the availability of on-line text collections, which are useful in many areas of computational linguistics research. One active area of research is the development of algorithms for aligning sentences in parallel corpora.</Paragraph>
    <Paragraph position="1"> The success of most natural language processing (NLP) algorithms, including multilingual sentence alignment algorithms (Kay and Röscheisen 1993; Gale and Church 1993), part-of-speech taggers (Cutting et al. 1991), and parsers, depends on prior knowledge of the location of sentence boundaries.</Paragraph>
    <Paragraph position="2"> Segmenting a text into sentences is a nontrivial task, however, since in English and many other languages the end-of-sentence punctuation marks are ambiguous.[2] A period, for example, can denote a decimal point, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence. Exclamation points and question marks can occur within quotation marks or parentheses as well as at the end of a sentence. Ellipsis, a series of periods (...), can occur both within sentences and at sentence boundaries. [* 202 Burlington Road, Bedford, MA 01730. E-mail: palmer@mitre.org. Some of the work reported here was done while the author was at the University of California, Berkeley. The views and opinions in this paper are those of the authors and do not reflect the MITRE Corporation's current work position.] [2 We consider only the period, the exclamation point, and the question mark to be possible &quot;end-of-sentence punctuation marks,&quot; and all references to &quot;punctuation marks&quot; will refer to these three. Although the colon, the semicolon, and conceivably the comma can also delimit grammatical sentences, their usage is beyond the scope of this work.] [© 1997 Association for Computational Linguistics. Computational Linguistics, Volume 23, Number 2.]</Paragraph>
    <Paragraph position="3"> The ambiguity of these punctuation marks is illustrated in the following difficult cases:  (1) The group included Dr. J. M. Freeman and T. Boone Pickens Jr.</Paragraph>
    <Paragraph position="4"> (2) &quot;This issue crosses party lines and crosses philosophical lines!&quot; said Rep. John Rowland (R., Conn.).</Paragraph>
    <Paragraph position="5"> (3) Somit entsprach ein ECU am 17. 9. 1984 0.73016 US$ (vgl. Tab. 1).</Paragraph>
    <Paragraph position="6"> (4) Créé au début des années 60 ...... par un gouvernement conservateur : ... cet Office s'était vu accorder six ans ...</Paragraph>
    <Paragraph position="7"> The existence of punctuation in grammatical subsentences suggests the possibility of a further decomposition of the sentence boundary problem into types of sentence boundaries, one of which would be &quot;embedded sentence boundary.&quot; Such a distinction might be useful for certain applications that analyze the grammatical structure of the sentence. However, in this work we will only address the simpler problem of determining boundaries between sentences, finding that which Nunberg (1990) calls the &quot;text-sentence.&quot; In examples (1-4), the word immediately preceding and the word immediately following a punctuation mark provide important information about its role in the sentence. However, more context may be necessary, such as when punctuation occurs in a subsentence within quotation marks or parentheses, as seen in example (2), or when an abbreviation appears at the end of a sentence, as seen in (5a): (5)a. It was due Friday by 5 p.m. Saturday would be too late.</Paragraph>
    <Paragraph position="8"> (5)b. She has an appointment at 5 p.m. Saturday to get her car fixed.</Paragraph>
    <Paragraph position="9"> Examples (5a-b) also show some problems inherent in relying on brittle features, such as capitalization, when determining sentence boundaries. The initial capital in Saturday does not necessarily indicate that Saturday is the first word in the sentence. As a more dramatic example, some important kinds of text consist only of upper-case letters, thus thwarting any system that relies on capitalization rules. Another obstacle to systems that rely on brittle features is that many texts are not well-formed. One such class of texts is the output of optical character recognition (OCR), which typically contains many extraneous or incorrect characters.</Paragraph>
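As an illustration only (it is not part of the Satz system), the following Python sketch implements the kind of brittle capitalization rule discussed above: a period followed by whitespace and a capitalized word is treated as a sentence boundary. The function name naive_split and the regular expression are assumptions made for this example.

```python
import re


def naive_split(text):
    """Split after every period that is followed by whitespace and a capital letter.

    This is a deliberately brittle baseline, not the method of the paper.
    """
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        end = match.start() + 1  # keep the period with the preceding text
        sentences.append(text[start:end])
        start = match.end()      # skip the whitespace after the period
    sentences.append(text[start:])
    return sentences


# Examples (1), (5a) and (5b) from the text above.
example_1 = "The group included Dr. J. M. Freeman and T. Boone Pickens Jr."
example_5a = "It was due Friday by 5 p.m. Saturday would be too late."
example_5b = "She has an appointment at 5 p.m. Saturday to get her car fixed."

for example in (example_1, example_5a, example_5b):
    print(naive_split(example))
```

The rule happens to split (5a) correctly, but it also splits (5b) after p.m., and it breaks example (1) into five fragments, because a capitalized word after a period is not reliable evidence of a boundary.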
    <Paragraph position="10"> This article presents an efficient, trainable system for sentence boundary disambiguation that circumvents these obstacles. The system, called Satz, makes simple estimates of the parts of speech of the tokens immediately preceding and following each punctuation mark, and uses these estimates as input to a machine learning algorithm that determines whether the punctuation mark is a sentence boundary or serves another purpose in the sentence. Satz is very fast in both training and sentence analysis; training is accomplished in less than one minute on a workstation, and it can process 10,000 sentences per minute. The system's combined robustness and accuracy surpass those of existing techniques, with an error rate consistently below 1.5% on a range of corpora and languages. It requires only a small lexicon (which can be less than 5,000 words) and a training corpus of 300-500 sentences.</Paragraph>
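The feature-construction idea described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Satz implementation: the toy lexicon, the category list, and the names describe and candidate_features are all hypothetical, and the learning algorithm that would consume the vectors is left out.

```python
# Hypothetical coarse categories and toy lexicon (assumptions for illustration;
# they are not the categories or the lexicon used by Satz).
CATEGORIES = ("noun", "verb", "preposition", "abbreviation",
              "number", "capitalized", "unknown")

TOY_LEXICON = {
    "friday": {"noun"},
    "saturday": {"noun"},
    "appointment": {"noun"},
    "p.m": {"abbreviation"},
    "dr": {"abbreviation"},
    "would": {"verb"},
    "to": {"preposition"},
    "by": {"preposition"},
    "at": {"preposition"},
}


def describe(token):
    """Return a binary vector marking which coarse categories the token may have."""
    key = token.lower().rstrip(".")
    tags = set(TOY_LEXICON.get(key, ()))
    if token[:1].isdigit():
        tags.add("number")
    if token[:1].isupper():
        tags.add("capitalized")
    if not tags:
        tags.add("unknown")
    return [int(category in tags) for category in CATEGORIES]


def candidate_features(text):
    """For each token ending in '.', '!' or '?', pair it with the next token
    and concatenate the two descriptors into one feature vector."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        if token[-1] in ".!?":
            following = tokens[i + 1] if i + 1 != len(tokens) else ""
            rows.append((token, following, describe(token) + describe(following)))
    return rows


example_5a = "It was due Friday by 5 p.m. Saturday would be too late."
example_5b = "She has an appointment at 5 p.m. Saturday to get her car fixed."

rows_a = candidate_features(example_5a)
rows_b = candidate_features(example_5b)
for token, following, vector in rows_a + rows_b:
    print(repr(token), repr(following), vector)
```

In this sketch each vector would be passed, together with a boundary or non-boundary label, to a trainable classifier. The candidate after p.m. receives the same vector in (5a) and (5b), which illustrates why the two neighboring tokens alone can be insufficient and more context may be needed.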
    <Paragraph position="11"> The following sections discuss related work and the criteria used to evaluate such work, describe our system in detail, and present the results of applying the system to a variety of texts. The transferability of the system from English to other languages is also demonstrated on French and German text. Finally, the learning-based system is shown to be able to improve the results of a more conventional system on especially difficult cases. [Running head: Palmer and Hearst, Multilingual Sentence Boundary]</Paragraph>
  </Section>
</Paper>