<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-2002">
  <Title>Adaptive Multilingual Sentence Boundary Disambiguation</Title>
  <Section position="3" start_page="242" end_page="244" type="relat">
    <SectionTitle>
2. Related Work
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="242" end_page="243" type="sub_section">
      <SectionTitle>
2.1 Evaluation of Related Work
</SectionTitle>
      <Paragraph position="0"> An important consideration when discussing related work is the mode of evaluation.</Paragraph>
      <Paragraph position="1"> To aid our evaluation, we define a lower bound, an objective score which any reasonable algorithm should be able to match or better. In our test collections, the ambiguous punctuation mark is used much more often as a sentence boundary marker than for any other purpose. Therefore, a very simple, successful algorithm is one in which every potential boundary marker is labeled as the end-of-sentence. Thus, for the task of sentence boundary disambiguation, we define the lower bound of a text collection as the percentage of possible sentence-ending punctuation marks in the text that indeed denote sentence boundaries.</Paragraph>
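The lower bound defined above can be computed directly from a labeled corpus. The following is a minimal sketch (our illustration, not code from the paper); the token/label format is hypothetical.

```python
# Lower bound: fraction of candidate end-of-sentence punctuation marks
# that actually mark a sentence boundary in a labeled corpus.
CANDIDATES = {".", "!", "?"}

def lower_bound(tokens):
    """tokens: list of (token, is_sentence_end) pairs from a labeled corpus."""
    labels = [is_end for tok, is_end in tokens if tok in CANDIDATES]
    return sum(labels) / len(labels) if labels else 0.0

# Toy stream for "Dr. Smith arrived.": the first period belongs to an
# abbreviation, the second ends the sentence.
toks = [("Dr", False), (".", False), ("Smith", False),
        ("arrived", False), (".", True)]
print(lower_bound(toks))  # 0.5
```

A baseline that labels every candidate as a boundary is correct exactly this fraction of the time.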
      <Paragraph position="2"> Since the use of abbreviations depends on the particular text and text genre, the number of ambiguous punctuation marks, and the corresponding lower bound, will vary dramatically across genres. For example, Liberman and Church (1992) report on a Wall Street Journal corpus containing 14,153 periods per million tokens, whereas in the Tagged Brown corpus (Francis and Kucera 1982), the figure is only 10,910 periods per million tokens. Liberman and Church also report that 47% of the periods in the WSJ corpus denote abbreviations (thus a lower bound of 53%), compared to only 10% in the Brown corpus (lower bound 90%) (Riley 1989). In contrast, Müller, Amerl, and Natalis (1980) report lower bound statistics ranging from 54.7% to 92.8% within a corpus of scientific abstracts. Such a range of lower bound figures suggests the need for a robust approach that can adapt rapidly to different text requirements.</Paragraph>
      <Paragraph position="3"> Another useful evaluation technique is the comparison of a new algorithm against a strong baseline algorithm. The baseline algorithm should perform better than the lower bound and should represent a strong effort or a standard method for solving the problem at hand.</Paragraph>
      <Paragraph position="4"> Although sentence boundary disambiguation is an essential preprocessing step of many natural language processing systems, it is a topic rarely addressed in the literature and there are few public-domain programs for performing the segmentation task. For our studies we compared our system against the results of the UNIX STYLE program (Cherry and Vesterman 1991). 3 The STYLE program, which attempts to provide a stylistic profile of writing at the word and sentence level, reports the length and structure for all sentences in a document, thereby indicating the sentence boundaries. STYLE defines a sentence as a string of words ending in one of: period, exclamation point, question mark, or backslash-period (the last of which can be used by an author to mark an imperative sentence ending). The program handles numbers with embedded decimal points and commas and makes use of an abbreviation list with 48 entries. It also uses the following heuristic: initials cause a sentence break only if the next word begins with a capital letter and is found in a dictionary of function words.</Paragraph>
      <Paragraph position="5"> In an evaluation on a sample of 20 documents, the developers of the program found it to incorrectly classify sentence boundaries 204 times out of 3287 possible (an error rate of 6.3%).</Paragraph>
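The STYLE heuristics described above can be sketched roughly as follows. This is a hedged reconstruction, not the actual STYLE source; the abbreviation and function-word lists here are tiny placeholders (STYLE's abbreviation list has 48 entries).

```python
import re

SENT_END = {".", "!", "?"}                      # STYLE also accepts backslash-period
ABBREVIATIONS = {"dr.", "mr.", "etc.", "e.g."}  # placeholder; STYLE uses 48 entries
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "he", "she"}

def is_initial(word):
    # A single capital letter followed by a period, e.g. "J."
    return re.fullmatch(r"[A-Z]\.", word) is not None

def breaks_sentence(word, next_word):
    """Decide whether `word` ends a sentence, given the word that follows."""
    if is_initial(word):
        # Initials break a sentence only if the next word begins with a
        # capital letter and is found in the function-word dictionary.
        return next_word[:1].isupper() and next_word.lower() in FUNCTION_WORDS
    if word.lower() in ABBREVIATIONS:
        return False
    return word[-1] in SENT_END

print(breaks_sentence("J.", "Smith"))    # False: "smith" is not a function word
print(breaks_sentence("ended.", "The"))  # True
```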
    </Section>
    <Section position="2" start_page="243" end_page="244" type="sub_section">
      <SectionTitle>
2.2 Regular Expressions and Heuristic Rules
</SectionTitle>
      <Paragraph position="0"> The most widely used method for determining sentence boundaries is currently a regular grammar, usually with limited lookahead. In the simplest implementation of this method, the grammar rules attempt to find patterns of characters, such as &amp;quot;period-space-capital letter,&amp;quot; which usually occur at the end of a sentence. More elaborate implementations, such as the STYLE program discussed above, consider the entire word preceding and following the punctuation mark and include extensive word lists and exception lists in an attempt to recognize abbreviations and proper nouns. There are a few rule-based and heuristic systems for which performance numbers are available; these are discussed in the remainder of this subsection.</Paragraph>
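The simplest pattern, &amp;quot;period-space-capital letter,&amp;quot; can be expressed as a one-line regular expression. The sketch below (our illustration, on a made-up input sentence) also shows why the naive pattern over-segments on abbreviations.

```python
import re

# Split wherever a sentence-ending character is followed by whitespace
# and a capital letter -- the naive "period-space-capital letter" rule.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

text = "The meeting is at 3 p.m. Dr. Smith will attend. Bring the report."
print(BOUNDARY.split(text))
# ['The meeting is at 3 p.m.', 'Dr.', 'Smith will attend.', 'Bring the report.']
# Note the spurious break after "Dr." -- exactly the kind of failure that
# word lists and exception lists are meant to catch.
```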
      <Paragraph position="1"> The Alembic information extraction system (Aberdeen et al. 1995) contains a very extensive regular-expression-based sentence boundary disambiguation module, created using the lexical scanner generator Flex (Nicol 1993). The boundary disambiguation module is part of a comprehensive preprocessing pipeline that utilizes a list of 75 abbreviations and a series of over 100 hand-crafted rules to identify sentence boundaries, as well as titles, date and time expressions, and abbreviations. The sentence boundary module was developed over the course of more than six staff months. On the Wall Street Journal corpus described in Section 4, Alembic achieved an error rate of 0.9%.</Paragraph>
      <Paragraph position="2"> Christiane Hoffmann (1994) used a regular expression approach to classify punctuation marks in a corpus of the German newspaper die tageszeitung with a lower bound (as defined above) of 92%. She used the UNIX tool LEX (Lesk and Schmidt 1975) and a large abbreviation list to classify occurrences of periods. Her method incorrectly classified less than 2% of the sentence boundaries when tested on 2,827 periods from the corpus. The method was developed specifically for the tageszeitung corpus, and Hoffmann reports that success in applying her method to other corpora would be dependent on the quality of the available abbreviation lists. Her work would therefore probably not be easily transportable to other corpora or languages.</Paragraph>
      <Paragraph position="3"> Mark Wasson and colleagues invested nine staff months developing a system that recognizes special tokens (e.g., nondictionary terms such as proper names, legal statute citations, etc.) as well as sentence boundaries. From this, Wasson built a stand-alone boundary recognizer in the form of a grammar converted into finite automata with 1,419 states and 18,002 transitions (excluding the lexicon). The resulting system, when tested on 20 megabytes of news and case law text, achieved an error rate of 0.3% at speeds of 80,000 characters per CPU second on a mainframe computer. When tested against upper-case legal text the system still performed very well, achieving error rates of 0.3% and 1.8% on test data of 5,305 and 9,396 punctuation marks, respectively.</Paragraph>
      <Paragraph position="4"> According to Wasson, it is not likely, however, that the results would be this strong on lower-case-only data. 4 Although the regular grammar approach can be successful, it requires a large manual effort to compile the individual rules used to recognize the sentence boundaries. Such efforts are usually developed specifically for a text corpus (Liberman and Church 1992; Hoffmann 1994) and would probably not be portable to other text genres. Because of their reliance on special language-specific word lists, they are also not portable to other natural languages without repeating the effort of compiling extensive lists and rewriting rules. In addition, heuristic approaches depend on having a well-behaved corpus with regular punctuation and few extraneous characters, and they would probably not be very successful with texts obtained via optical character recognition (OCR).</Paragraph>
      <Paragraph position="5"> 4 This work has not been published. All information about this system is courtesy of a personal communication with Mark Wasson. Wasson's reported processing time cannot be compared directly to the other systems since it was obtained from a mainframe computer and was estimated in terms of characters rather than sentences.</Paragraph>
      <Paragraph position="6"> Müller, Amerl, and Natalis (1980) provide an exhaustive analysis of sentence boundary disambiguation as it relates to lexical endings and the identification of abbreviations and words surrounding a punctuation mark, focusing on text written in English. This approach makes multiple passes through the data to find recognizable suffixes and thereby filters out words that are not likely to be abbreviations. The morphological analysis makes it possible to identify words not otherwise present in the extensive word lists used to identify abbreviations. Error rates of 2-5% are reported for this method tested on over 75,000 scientific abstracts, with lower bounds ranging from 54.7% to 92.8%.</Paragraph>
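The suffix-filtering idea can be sketched as below. This is a rough illustration under assumed word lists, not the Müller et al. implementation: a word bearing a recognizable suffix is unlikely to be an abbreviation, so the morphological pass filters it out before the abbreviation list is consulted.

```python
# Pass 1 filters by suffix; pass 2 falls back to the abbreviation list.
KNOWN_SUFFIXES = ("ing", "tion", "ment", "ness", "ly")  # placeholder set
ABBREVIATION_LIST = {"etc", "fig", "vol"}               # placeholder set

def likely_abbreviation(word):
    w = word.rstrip(".").lower()
    # Pass 1: morphological filter -- a recognizable suffix rules out
    # the word as an abbreviation candidate.
    if any(w.endswith(s) for s in KNOWN_SUFFIXES):
        return False
    # Pass 2: consult the abbreviation word list.
    return w in ABBREVIATION_LIST

print(likely_abbreviation("measurement."))  # False: carries the suffix "-ment"
print(likely_abbreviation("fig."))          # True
```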
    </Section>
    <Section position="3" start_page="244" end_page="244" type="sub_section">
      <SectionTitle>
2.3 Approaches Using Machine Learning
</SectionTitle>
      <Paragraph position="0"> There have been two other published attempts to apply machine-learning techniques to the sentence boundary disambiguation task. Both make use of the words in the context surrounding the punctuation mark.</Paragraph>
      <Paragraph position="1">  2.3.1 Regression Trees. Riley (1989) describes an approach that uses regression trees (Breiman et al. 1984) to classify periods according to the following features:
* Probability\[word preceding &amp;quot;.&amp;quot; occurs at end of sentence\]
* Probability\[word following &amp;quot;.&amp;quot; occurs at beginning of sentence\]
* Length of word preceding &amp;quot;.&amp;quot;
* Length of word after &amp;quot;.&amp;quot;
* Case of word preceding &amp;quot;.&amp;quot;: Upper, Lower, Cap, Numbers
* Case of word following &amp;quot;.&amp;quot;: Upper, Lower, Cap, Numbers
* Punctuation after &amp;quot;.&amp;quot; (if any)
* Abbreviation class of word with &amp;quot;.&amp;quot;</Paragraph>
      <Paragraph position="2"> The method uses information about one word of context on either side of the punctuation mark and thus must record, for every word in the lexicon, the probability that it occurs next to a sentence boundary. Probabilities were compiled from 25 million words of prelabeled training data from a corpus of AP newswire. The probabilities were actually estimated for the beginning and end of paragraphs rather than for all sentences, since paragraph boundaries were explicitly marked in the AP corpus, while the sentence boundaries were not. The resulting classification tree was used to identify whether a word ending in a period is at the end of a declarative sentence in the Brown corpus, and achieved an error rate of 0.2%. 5 Although this is an impressive error rate, the amount of training data (25 million words) required is prohibitive for a problem that acts as a preprocessing step to other natural language processing tasks; it would be impractical to expect this amount of data to be available for every corpus and language to be tagged.</Paragraph>
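The features listed above can be gathered by a simple extractor. The sketch below is hypothetical, with tiny stand-in probability tables; in Riley's setup these were estimated from 25 million words of prelabeled AP newswire.

```python
# Toy probability tables standing in for corpus-estimated values.
P_END = {"attend": 0.9, "Dr": 0.05}    # P(word occurs at end of sentence)
P_BEGIN = {"The": 0.8, "Smith": 0.1}   # P(word occurs at start of sentence)

def case_class(word):
    """Map a word to one of the case classes: Upper, Lower, Cap, Numbers."""
    if word.isupper():
        return "Upper"
    if word.isdigit():
        return "Numbers"
    if word[:1].isupper():
        return "Cap"
    return "Lower"

def features(prev_word, next_word, punct_after=""):
    """Feature vector for one period, given its neighboring words."""
    return {
        "p_prev_at_end": P_END.get(prev_word, 0.0),
        "p_next_at_begin": P_BEGIN.get(next_word, 0.0),
        "len_prev": len(prev_word),
        "len_next": len(next_word),
        "case_prev": case_class(prev_word),
        "case_next": case_class(next_word),
        "punct_after": punct_after,
    }

print(features("Dr", "Smith"))
```

A regression tree (or any classifier) would then be trained on such vectors paired with boundary/non-boundary labels.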
    </Section>
  </Section>
</Paper>