<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-2002">
  <Title>Adaptive Multilingual Sentence Boundary Disambiguation</Title>
  <Section position="4" start_page="244" end_page="245" type="metho">
    <SectionTitle>
5 Time for training was not reported, nor was the amount of the Brown corpus against which testing was performed; we assume the entire Brown corpus was used.
</SectionTitle>
    <Paragraph position="0"> Furthermore, no estimates of scalability were given, so we are unable to report results with a smaller set.</Paragraph>
    <Paragraph position="1">  Computational Linguistics Volume 23, Number 2  forward neural network to disambiguate periods, and achieve an error rate averaging 7%. They use a regular grammar to tokenize the text before training the neural nets, but no further details of their approach are available. 6</Paragraph>
    <Section position="1" start_page="245" end_page="245" type="sub_section">
      <SectionTitle>
2.4 Our Approach
</SectionTitle>
      <Paragraph position="0"> Each of the approaches described above has disadvantages to overcome. In the following sections we present an approach that avoids the problems of previous approaches, yielding a very low error rate and behaving more robustly than solutions that require manually designed rules. We present results of testing our system on several corpora in three languages: English, German, and French.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="245" end_page="245" type="metho">
    <SectionTitle>
3. The Satz System
</SectionTitle>
    <Paragraph position="0"> This section describes the structure of our adaptive sentence boundary disambiguation system, known as Satz. 7 The Satz system represents the context surrounding a punctuation mark as a sequence of vectors, where the vector constructed for each context word represents an estimate of the part-of-speech distribution for the word, obtained from a lexicon containing part-of-speech frequency data. This use of part-of-speech estimates of the context words, rather than the words themselves, is a unique aspect of the Satz system, and is responsible in large part for its efficiency and effectiveness.</Paragraph>
    <Paragraph position="1"> The context vectors, which we call descriptor arrays, are input to a machine learning algorithm trained to disambiguate sentence boundaries. The output of the learning algorithm is then used to determine the role of the punctuation mark in the sentence.</Paragraph>
    <Paragraph position="2"> The architecture of the system is shown in Figure 1. The Satz system works in two modes--learning mode and disambiguation mode. In learning mode, the input text is a training text with all sentence boundaries manually labeled, and the parameters in the learning algorithm are dynamically adjusted during training. Once learning mode is completed, the parameters in the learning algorithm remain fixed. Training of the learning algorithm is therefore necessary only once for each language, although training can be repeated for a new corpus or genre within a language, if desired. In disambiguation mode, the input is the text whose sentence boundaries have not been marked up yet and need to be disambiguated.</Paragraph>
    <Paragraph position="3"> The essence of the Satz system lies in how machine learning is used, rather than in which particular method is used. In this article we report results using two different learning methods: neural networks and decision trees. The two methods are almost equally effective for this task, and both train and run quickly using small resources.</Paragraph>
    <Paragraph position="4"> For some applications, one may be more appropriate than another, (e.g., the scores produced by a neural net may be useful for another processing step in a natural language program), so we do not consider either learning algorithm to be the &amp;quot;correct&amp;quot; one to use. Therefore, when we refer to the Satz system, we refer to the use of machine learning with a small training corpus, representing the word context surrounding each punctuation mark in terms of estimates of the parts of speech of those words, where these estimates are derived from a very small lexicon.</Paragraph>
  </Section>
  <Section position="6" start_page="245" end_page="252" type="metho">
    <SectionTitle>
6 Results were obtained courtesy of a personal communication with Joe Zhou.
7 "Satz" is the German word for "sentence."
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="246" end_page="246" type="sub_section">
      <SectionTitle>
3.1 Tokenization
</SectionTitle>
      <Paragraph position="0"> The first stage of the process is lexical analysis, which breaks the input text (a stream of characters) into tokens. The Satz tokenizer is implemented using the UNIX tool LEX (Lesk and Schmidt 1975) and is modeled on the tokenizer used by the PARTS part-of-speech tagger (Church 1988). The tokens returned by the LEX program can be a sequence of alphabetic characters, a sequence of digits, 8 or a sequence of one or more non-alphanumeric characters such as periods or quotation marks.</Paragraph>
    </Section>
    <Section position="2" start_page="246" end_page="248" type="sub_section">
      <SectionTitle>
3.2 Part-of-Speech Lookup
</SectionTitle>
      <Paragraph position="0"> The individual tokens are next assigned a series of possible parts of speech, based on a lexicon and simple heuristics described below.</Paragraph>
      <Paragraph position="1">  resented in various ways. The most straightforward is to use the individual words preceding and following the punctuation mark, as in this example: at the plant. He had thought Using this approach, a representation of an individual word's position in a context must be made for every word in the language. Compiling these representations for each word is undesirable due to the large amount of training data, training time, and storage overhead required, especially since it is unlikely that such information will be useful to later stages of processing.</Paragraph>
      <Paragraph position="2"> As an alternative, the context could be approximated by using a single part of speech for each word. The above context would then be represented by the following part-of-speech sequence: preposition article noun pronoun verb verb However, requiring a single part-of-speech assignment for each word introduces a processing circularity: because most part-of-speech taggers require predetermined sentence boundaries, the boundary disambiguation must be done before tagging. But if 8 Numbers containing periods acting as decimal points are considered a single token. This eliminates one possible ambiguity of the period at the lexical analysis stage.</Paragraph>
      <Paragraph position="3">  Computational Linguistics Volume 23, Number 2 the disambiguation is done before tagging, no part-of-speech assignments are available for the boundary-determination system. To avoid this circularity, we approximate each word's part of speech in one of two ways: (1) by the prior probabilities of all parts of speech for that word, or (2) by a binary value for each possible part of speech for that word.</Paragraph>
      <Paragraph position="4"> In the case of prior probabilities, each word in the context is represented by the probability that the word occurs as each part of speech, with all part-of-speech probabilities in the vector summing to 1.0. Continuing the example, the context becomes (and for simplicity, suppressing the parts of speech with value 0.0):</Paragraph>
      <Paragraph position="6"> This denotes that at and the have a probability of 1.0 of occurring as a preposition and article respectively, plant has a probability of 0.8 of occurring as a noun and a probability of 0.2 of occurring as a verb, and so on. These probabilities, which are more accurately &amp;quot;scaled frequencies,&amp;quot; are based on occurrences of the words in a pretagged corpus, and are therefore corpus dependent. 9 In the case of binary part-of-speech assignment, for each possible part of speech, the vector is assigned the value 1 if the word can ever occur as that part of speech (according to the lexicon), and the value 0 if it cannot. In this case the sum of all items in the vector is not predefined, as it is with probabilities. Continuing the example with binary POS vectors (and, for simplicity, suppressing the parts of speech with value 0), the context becomes:</Paragraph>
      <Paragraph position="8"> The part-of-speech data necessary to construct probabilistic and binary vectors is often present in the lexicon of a part-of-speech tagger or other existing NLP tool, or it can easily be obtained from word lists; the data would thus be readily available and would not require excessive storage overhead. It is also possible to estimate part-of-speech data for new or unknown words. For these reasons, we chose to approximate the context in our system by using the prior part-of-speech information. In Section 4.7 we give the results of a comparative study of system performance with both probabilistic and binary part-of-speech vectors.</Paragraph>
      <Paragraph position="9">  ing part-of-speech frequency data from which the descriptor arrays are constructed. Words in the lexicon are followed by a series of part-of-speech tags and associated frequencies, representing the possible parts of speech for that word and the frequency with which the word occurs as each part of speech. The frequency information can be obtained in various ways, as discussed in the previous section. The lexical lookup stage of the Satz system finds a word in the lexicon (if it is present) and returns the possible parts of speech. For the English word well, for example, the lookup module might return the tags JJ/15 NN/18 QL/68 RB/634 UH/22 VB/5 9 The frequencies can be obtained from an existing corpus tagged manually or automatically; the corpus does not need to be tagged specifically for this task.</Paragraph>
      <Paragraph position="10">  Palmer and Hearst Multilingual Sentence Boundary indicating that, in the corpus on which the lexicon is based, the word well occurred 15 times as an adjective, 18 as a singular noun, 68 as a qualifier, 634 as an adverb, 22 as an interjection, and 5 as a singular verb) deg 3.2.3 Heuristics for Unknown Words. If a word is not present in the lexicon, the Satz system contains a set of heuristics that attempt to assign the most reasonable parts of speech to the word. A summary of these heuristics is listed below.</Paragraph>
      <Paragraph position="11"> Unknown tokens containing a digit (0-9) are assumed to be numbers.</Paragraph>
      <Paragraph position="12"> Any token beginning with a period, exclamation point, or question mark is assigned a &amp;quot;possible end-of-sentence punctuation&amp;quot; tag. This catches common sequences like &amp;quot;?!&amp;quot; and &amp;quot;... &amp;quot;.</Paragraph>
      <Paragraph position="13"> Common morphological endings are recognized and the appropriate part(s)-of-speech is assigned to the entire word.</Paragraph>
      <Paragraph position="14"> Words containing a hyphen are assigned a series of tags and frequencies equally distributed between adjective, common noun, and proper noun.</Paragraph>
      <Paragraph position="15"> Words containing an internal period are assumed to be abbreviations.</Paragraph>
      <Paragraph position="16"> A capitalized word is not always a proper noun, even when it appears somewhere other than in a sentence's initial position (e.g., the word American is often used as an adjective). Those words not present in the lexicon are assigned a certain language-dependent probability (0.9 for English) of being a proper noun, and the remainder is distributed uniformly among adjective, common noun, verb, and abbreviation, the most likely tags for unknown words, n Capitalized words appearing in the lexicon but not registered as proper nouns can nevertheless still be proper nouns. In addition to the part-of-speech frequencies present in the lexicon, these words are assigned a certain probability of being a proper noun (0.5 for English) with the probabilities already assigned to that word redistributed proportionally in the remaining 0.5. The proportion of words falling into this category varies greatly depending on the style of the text and the uniformity of capitalization.</Paragraph>
      <Paragraph position="17"> As a last resort, the word is assigned the tags for common noun, verb, adjective, and abbreviation with a uniform frequency distribution.</Paragraph>
      <Paragraph position="18"> These heuristics can be easily modified and adapted to the specific needs of a new language, 12 although we obtained low error rates without changing the heuristics.</Paragraph>
    </Section>
    <Section position="3" start_page="248" end_page="249" type="sub_section">
      <SectionTitle>
3.3 Descriptor Array Construction
</SectionTitle>
      <Paragraph position="0"> A vector, or descriptor array, is constructed for each token in the input text. The lexicon may contain as many as several hundred very specific tags, which we first need to map into more general categories. For example, the Brown corpus tags of present tense verb,  Elements of the descriptor array assigned to each incoming token.</Paragraph>
      <Paragraph position="1"> past participle, and modal verb are all mapped into the more general &amp;quot;verb&amp;quot; category. The parts of speech returned by the lookup module are thus mapped into the 18 general categories given in Figure 2, and the frequencies for each category are summed. In the case of a probabilistic vector described in Section 3.2.1, the 18 category frequencies for the word are then converted to probabilities by dividing the frequencies for each by the total frequency for the word. For a binary vector, all categories with a nonzero frequency count are assigned a value of 1, and all others are assigned a value of 0. In addition to the 18 category frequencies, the descriptor array also contains two additional flags that indicate if the word begins with a capital letter and if it follows a punctuation mark, for a total of 20 items in each descriptor array. These last two flags allow the system to include capitalization information when it is available without having to require that this information be present.</Paragraph>
    </Section>
    <Section position="4" start_page="249" end_page="252" type="sub_section">
      <SectionTitle>
3.4 Classification by a Learning Algorithm
</SectionTitle>
      <Paragraph position="0"> The descriptor arrays representing the tokens in the context are used as the input to a machine learning algorithm. To disambiguate a punctuation mark given a context of k surrounding words (referred to in this article as k-context), a window of k + 1 tokens and their descriptor arrays is maintained as the input text is read. The first k/2 and final k/2 tokens of this sequence represent the context in which the middle token appears. If the middle token is a potential end-of-sentence punctuation mark, the descriptor arrays for the context tokens are input to the learning algorithm and the output result indicates whether the punctuation mark serves as a sentence boundary or not. In learning mode, the descriptor arrays are used to train the parameters of the learning algorithm. We investigated the effectiveness of two separate algorithms: (1) back-propagation training of neural networks, and (2) decision tree induction. The learning algorithms are described in the next two sections, and the results obtained with the algorithms are presented in Section 4.</Paragraph>
      <Paragraph position="1"> 3.4.1 Neural Network. Artificial neural networks have been successfully applied for many years in speech recognition applications (Bourland and Morgan 1994; Lippmann 1989), and more recently in NLP tasks such as word category prediction (Nakamura et al. 1990) and part-of-speech tagging (Schmid 1994). Neural networks in the context of machine learning provide a well-tested training algorithm (back-propagation) that has achieved high success rates in pattern-recognition problems similar to the problem posed by sentence boundary disambiguation (Hertz, Krogh, and Palmer 1991).</Paragraph>
      <Paragraph position="2"> For Satz, we used a fully-connected feed-forward neural network, as shown in  input layer is fully connected to a hidden layer consisting of j hidden units; the hidden units in turn feed into one output unit that indicates the results of the function. In a traditional back-propagation network, the input to a node is the sum of the outputs of the nodes in the previous layer multiplied by the weights between the layers. This sum is then passed through a &amp;quot;squashing&amp;quot; function to produce a node output between 0 and 1. A commonly-used squashing function--due to its mathematical properties, which assist in network training--is the sigrnoidal function, given byf(hi) = ~'1 where hi is the node input and T is a constant to adjust the slope of the sigmoid.</Paragraph>
      <Paragraph position="3"> In the Satz system we use a sigrnoidal squashing function on all hidden nodes and the single output node of the neural network. The output of the network is thus a single value between 0 and 1, and represents the strength of the evidence that a punctuation mark occurring in its context is indeed the end of a sentence. Two adjustable sensitivity thresholds, to and tl, are used to classify the results of the disambiguation. If the output is less than to, the punctuation mark is not a sentence boundary; if the output is greater than or equal to tl, it is a sentence boundary. Outputs which fall between the thresholds cannot be disambiguated by the network (which may indicate that the mark is inherently ambiguous) and are marked accordingly, so they can be treated specially in later processing. 13 For example, the sentence alignment algorithm in Gale and Church (1993) allows a distinction between hard and soft boundaries, where soft boundaries are movable by the alignment program. In our case, punctuation marks remaining ambiguous after processing by Satz can be treated as soft boundaries while unambiguous punctuation marks (as well as paragraph boundaries) can be treated as hard boundaries, thus allowing the alignment program greater flexibility.</Paragraph>
      <Paragraph position="4"> A neural network is trained by presenting it with input data paired with the desired output. For Satz, the input is the context surrounding the punctuation mark to be disambiguated, and the output is a score indicating how much evidence there is that the punctuation mark is acting as an end-of-sentence boundary. The nodes are connected via links that have weights assigned to them, and if the network produces an incorrect score, the weights are adjusted using an algorithm called back-propagation (Hertz, Krogh, and Palmer 1991) so that the next time the same input is presented to the network, the output should more closely match the desired score. This training procedure is often iterated many times in order to allow the weights to adjust 13 When to ---- tl, no punctuation mark is left ambiguous.</Paragraph>
      <Paragraph position="5">  Computational Linguistics Volume 23, Number 2 appropriately, and the same input data is presented multiple times. Each round of presenting the same input data is called an epoch; of course, it is desirable to require as few training epochs and as little training data as possible. If one trains the network too often on the same data, overfitting can occur, meaning that the weights become too closely aligned with the particular training data that has been presented to the network, and so may not correspond well to new examples that will come later. For this reason, training should be accompanied by cross-validation (Bourland and Morgan 1994), a check against a held-out set of data to be sure that the weights are not too closely tailored to the training text. This will be described in more detail below.</Paragraph>
      <Paragraph position="6"> Training data for the neural network consist of two sets of text in which all sentence boundaries have been manually disambiguated. The first text, the training text, contains 300-600 test cases, where a test case is an ambiguous punctuation mark. The weights of the neural network are trained on the training text using the standard back-propagation algorithm (Hertz, Krogh, and Palmer 1991). The second set of texts used in training is the cross-validation set, whose contents are separate from the training text and which contains roughly half as many test cases as the training text. Training of the weights is not performed on this text; the cross-validation text is instead used to increase the generalization of the training, such that when the total training error over the cross-validation text reaches a minimum, training is halted. TM Testing is then performed on texts independent of the training and cross-validation texts. We measure the speed of training by the number of training epochs required to complete training, where an epoch is a single pass through all the training data. Training times for all experiments reported in this article were less than one minute and were obtained on a DEC Alpha 3000 workstation, unless otherwise noted.</Paragraph>
      <Paragraph position="7"> In Sections 4.1-4.9 we present results of testing the Satz system with a neural network, including investigations of the effects of varying network parameters such as hidden layer size, threshold values, and amount of training data.</Paragraph>
      <Paragraph position="8">  al. 1989) have been successfully applied to NLP problems such as parsing (Resnik 1993; Magerman 1995) and discourse analysis (Siegel and McKeown 1994; Soderland and Lehnert 1994). We tested the Satz system using the c4.5 (Quinlan 1993) decision tree induction program as the learning algorithm and compared the results to those obtained previously with the neural network. These results are discussed in Section 4.10.</Paragraph>
      <Paragraph position="9"> The induction algorithm proceeds by e;caluating the information content of a series of binary attributes and iteratively building a tree from the attribute values, with the leaves of the decision tree being the values of the goal attributes. At each step in the learning procedure, the evolving tree is branched on the attribute that divides the data items with the highest gain in information. Branches are added to the tree until the decision tree can classify all items in the training set. Overfitting is also possible in decision tree induction, resulting in a tree that can very accurately classify the training data but may not be able to accurately classify new examples. To reduce the effects of overfitting, the c4.5 learning algorithm prunes the tree after the entire decision tree has been constructed. It recursively examines each subtree to determine whether replacing it with a leaf or a branch would reduce the number of errors. This pruning produces a decision tree better able to classify data different from the training data.</Paragraph>
      <Paragraph position="10"> 14 The training error is the least mean squares error, one-half the sum of the squares of all the errors, where the error of a particular item is the difference between the desired output and the actual output of the neural net.</Paragraph>
      <Paragraph position="11">  Integrating the decision tree induction algorithm into the Satz system was simply a matter of defining the input attributes as the k descriptor arrays in the context, with a single goal attribute representing whether the punctuation mark is a sentence boundary or not. Training data for the induction of the decision tree were identical to the training set used to train the neural network.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="252" end_page="262" type="metho">
    <SectionTitle>
4. Experiments with English Texts
</SectionTitle>
    <Paragraph position="0"> We first tested the Satz system using English texts from the Wall Street Journal portion of the ACL/DCI collection (Church and Liberman 1991). We constructed a training text of 573 test cases and a cross-validation text of 258 test cases. 15 We then constructed a separate test text consisting of 27,294 test cases, with a lower bound of 75.0%. The baseline system (UNIX STYLE) achieved an error rate of 8.3% on the sentence boundaries in the test set. The lexicon and thus the frequency counts used to calculate the descriptor arrays were derived from the Brown corpus (Francis and Kucera 1982). In initial experiments we used the extensive lexicon from the PARTS part-of-speech tagger (Church 1988), which contains 30,000 words. We later experimented with a much smaller lexicon, and these results are discussed in Section 4.4. In Sections 4.1-4.9 we describe the results of our experiments with the Satz system using the neural network as the learning algorithm. Section 4.10 describes results using decision tree induction.</Paragraph>
    <Section position="1" start_page="252" end_page="252" type="sub_section">
      <SectionTitle>
4.1 Context Size
</SectionTitle>
      <Paragraph position="0"> In order to determine how much context is necessary to accurately disambiguate sentence boundaries in a text, we varied the size of the context from which the neural network inputs were constructed and obtained the results in Table 1. The number in the Training Epochs column is the number of passes through, the training data required to learn the training set; the number in the Testing Errors cohnnn is the number of errors on the 27,294 item test set the system made after training with the corresponding context size. From these data we concluded that a 6-token context, 3 preceding the punctuation mark and 3 following, produces the best results.</Paragraph>
    </Section>
    <Section position="2" start_page="252" end_page="253" type="sub_section">
      <SectionTitle>
4.2 Hidden Units
</SectionTitle>
      <Paragraph position="0"> The number of hidden units in a neural network can affect its performance. To determine the size of the hidden layer in the neural network that produced the lowest output error rate, we experimented with various hidden layer sizes and obtained the results in Table 2. From these data we concluded that the lowest error rate in this case is possible using a neural network with two nodes in its hidden layer.</Paragraph>
      <Paragraph position="1"> 15 Note that &amp;quot;constructing&amp;quot; a training, cross-validation, or test text simply involves manually disambiguating the sentence boundaries by inserting a unique character sequence at the end of each sentence.</Paragraph>
    </Section>
    <Section position="3" start_page="253" end_page="254" type="sub_section">
      <SectionTitle>
4.3 Sources of Errors
</SectionTitle>
      <Paragraph position="0"> As described in Sections 4.1 and 4.2, the best results were obtained with a context size of 6 tokens and a hidden layer with 2 units. This configuration produced a total of 409 errors out of 27,294 test cases, for an error rate of 1.5%. These errors fall into two major categories: (i) false positive, i.e., a punctuation mark the method erroneously labeled as a sentence boundary, and (ii) false negative, i.e., an actual sentence boundary that the method did not label as such. Table 3 contains a summary of these errors.</Paragraph>
      <Paragraph position="1"> These errors can be decomposed into the following groups: (37.6%) false positive at an abbreviation within a title or name, usually because the word following the period exists in the lexicon with other parts of speech (Mr. Gray, Col. North, Mr. Major, Dr. Carpenter, Mr. Sharp).</Paragraph>
      <Paragraph position="2">  (22.5%) false negative due to an abbreviation at the end of a sentence, most frequently Inc., Co., Corp., or U.S., which all occur within sentences as well.</Paragraph>
      <Paragraph position="3"> (11.0%) false positive or negative due to a sequence of characters including a period and quotation marks, as this sequence can occur both within and at the end of sentences.</Paragraph>
      <Paragraph position="4"> (9.2%) false negative resulting from an abbreviation followed by quotation marks; related to the previous two types.</Paragraph>
      <Paragraph position="5"> (9.8%) false positive or false negative resulting from presence of ellipsis (...), which can occur at the end of or within a sentence.</Paragraph>
      <Paragraph position="6"> (9.9%) miscellaneous errors, including extraneous characters (dashes,  asterisks, etc.), ungrammatical sentences, misspellings, and parenthetical sentences.</Paragraph>
      <Paragraph position="7"> The first two items indicate that the system is having difficulty recognizing the function of abbreviations. We attempted to counter this by dividing the abbreviations in the lexicon into two distinct categories, title abbreviations such as Mr. and Dr., which almost never occur at the end of a sentence, and all other abbreviations. This new classification, however, significantly increased the training time and eliminated only 12 of the 409 errors (2.9%).</Paragraph>
      <Paragraph position="8"> The third and fourth items demonstrate the difficulty of distinguishing subsentences within a sentence. This problem may be addressed by creating a new classification for punctuation marks, the &amp;quot;embedded end-of-sentence,&amp;quot; as suggested in Section 1. The fifth class of error may similarly be addressed by creating a new classification for ellipses, and then attempting to determine the role of the ellipses independent of the sentence boundaries.</Paragraph>
    </Section>
    <Section position="4" start_page="254" end_page="254" type="sub_section">
      <SectionTitle>
4.4 Lexicon Size
</SectionTitle>
      <Paragraph position="0"> The results in Sections 4.1-4.3 depended on a very large lexicon with more than 30,000 words. It is not always possible to obtain or build a large lexicon, so it is important to understand the impact of a smaller lexicon on the training time and error rate of the system. We altered the size of the English lexicon used in training and testing by removing large sections of the original lexicon and obtained the results in Table 4.</Paragraph>
      <Paragraph position="1"> These data demonstrate that a larger lexicon provides faster training and a lower error rate, although the performance with the smaller lexica was still almost as accurate. In the experiments describecl in Sections 4.5--4.10, we used a 5,000 word lexicon.</Paragraph>
      <Paragraph position="2"> It is important to note, however, that in reducing the size of the lexicon as a whole, the number of abbreviations remained constant (at 206). Recognizing abbreviations gives important evidence as to the location of sentence boundaries, and reducing the number of abbreviations in the lexicon naturally reduces the accuracy of the system.</Paragraph>
      <Paragraph position="3"> Most existing boundary disambiguation systems, such as the STYLE program, depend heavily on abbreviation lists and would be relatively ineffective without information about abbreviations. However, the robustness of the Satz system allows it to still produce a relatively high accuracy without relying on extensive abbreviation lists.</Paragraph>
      <Paragraph position="4"> To demonstrate this robustness, we removed all abbreviations from the lexicon after reducing it in size to 5,000 words. The resulting Satz error rate was 4.9%, which was still significantly better than the STYLE baseline error rate of 8.3%, which was obtained with a 48 entry abbreviation list.</Paragraph>
    </Section>
    <Section position="5" start_page="254" end_page="255" type="sub_section">
      <SectionTitle>
4.5 Single-Case Results
</SectionTitle>
      <Paragraph position="0"> A major advantage of the Satz approach to sentence boundary recognition is its robustness. In contrast to many existing systems, which depend on brittle parameters such as capitalization and spacing, Satz is able to adapt to texts that are not well-formed, such as single-case texts. The two descriptor array flags for capitalization, discussed in Section 3.3, allow the system to include capitalization information when it is available.</Paragraph>
      <Paragraph position="1"> When this information is not available, the system is nevertheless able to adapt and produce a low error rate. To demonstrate this robustness, we converted the training, cross-validation, and test texts used in previous testing to a lower-case-only format,</Paragraph>
    </Section>
    <Section position="6" start_page="255" end_page="255" type="sub_section">
      <SectionTitle>
4.6 Results on OCR Texts
</SectionTitle>
      <Paragraph position="0"> A large and ever-increasing source of on-line texts is texts obtained via optical character recognition (OCR). These texts require very robust processing methods, as they contain a large number of extraneous and incorrect characters. The robustness results of the Satz system in the absence of an abbreviation list and capitalization suggest that it would be well suited for processing OCR texts as well. To test this, we prepared a small corpus of raw OCR data containing 1,157 punctuation marks. The STYLE program produced an error rate of 11.7% over the OCR texts; the Satz system, using a neural network trained on mixed-case WSJ texts, produced an error rate of 4.2%.</Paragraph>
      <Paragraph position="1"> In analyzing the sources of the errors produced by Satz over the raw OCR data, it was clear that many errors came from areas of high noise in the texts, such as the line in example (6), which contains an extraneous question mark and three periods. These areas probably represented charts or tables in the source text and would most likely need to be eliminated anyway, as it is doubtful any text-processing program would be able to productively process them. We therefore applied a simple filter to the raw  OCR data to locate areas of high noise and remove them from the text. In the resulting text of 1,115 punctuation marks, the STYLE program had an error rate of 9.6% while the Satz system improved to 1.9%.</Paragraph>
      <Paragraph position="2"> (6) e:)i. i)'e;y',?;.i#i TCE grades' are' (7) newsprint. Furthermore, shoe presses have Using rock for granite roll  Two years ago we reported on advances in While the low error rate on OCR texts is encouraging, it should not be viewed as an absolute figure. One problem with OCR texts is that periods in the original text may be scanned as commas or dropped from the text completely. Our system is unable to detect these cases. Similarly, the definition of a sentence boundary is not necessarily absolute, as large parts of texts may be incorrectly or incompletely scanned by the OCR program. The resulting &amp;quot;sentences&amp;quot; may not correspond to those in the original text, as can be seen in example (7). Such problems cause a low error rate to have less significance in OCR texts than in more well-formed texts such as the WSJ corpus.</Paragraph>
    </Section>
    <Section position="7" start_page="255" end_page="256" type="sub_section">
      <SectionTitle>
4.7 Probabilistic vs. Binary Inputs
</SectionTitle>
      <Paragraph position="0"> In the discussion of methods of representing context in Section 3.2.1, we suggested two ways of approximating the part-of-speech distribution of a word, using prior probabilities and binary features. The results reported in the previous sections were all obtained using the prior probabilities in the descriptor arrays for all tokens. Our experiments in comparing probabilisfic inputs to binary feature inputs, given in Table 5, indicate that using binary feature inputs significantly improves the performance of the system on both mixed-case and single-case texts, as well as decreasing the training 16 The difference in results with upper-case-only and lower-case-only formats can probably be attributed to the capitalization flags in the descriptor arrays, as these flags would always be on in one case and off in the other.</Paragraph>
      <Paragraph position="1">  times. The lower error rate and faster training time suggest that the simpler approach of using binary feature inputs to the neural network is better than the frequency-based inputs previously used.</Paragraph>
    </Section>
    <Section position="8" start_page="256" end_page="256" type="sub_section">
      <SectionTitle>
4.8 Thresholds
</SectionTitle>
      <Paragraph position="0"> As described in Section 3.4.1, the output of the neural network (after passing through the sigmoidal squashing function) is used to determine the function of a punctuation mark based on its value relative to two sensitivity thresholds, with outputs that fall between the thresholds denoting that the function of the punctuation mark is still ambiguous. These are shown in the Not Labeled column of Table 6, which gives the results of a systematic experiment with the sensitivity thresholds. As the thresholds were moved from the initial values of 0.5 and 0.5, certain items that had been classified as False Pos or False Neg fell between the thresholds and became Not Labeled.</Paragraph>
      <Paragraph position="1"> At the same time, however, items that had been correctly labeled also fell between the thresholds, and these are shown in the Were Correct column. 17 There is thus a tradeoff: decreasing the error percentage by adjusting the thresholds also decreases the percentage of cases correctly labeled and increases the percentage of items left ambiguous.</Paragraph>
    </Section>
    <Section position="9" start_page="256" end_page="259" type="sub_section">
      <SectionTitle>
4.9 Amount of Training Data
</SectionTitle>
      <Paragraph position="0"> To obtain the results in Sections 4.1-4.8, we used very small training and cross-validation sets of 573 and 258 items, respectively. The training and cross-validation sets could thus be constructed in a few minutes, and the resulting system error rate was very low. To determine the system improvement with more training data, we 17 Note that the number of items in the Were Correct column is a subset of those in the Not Labeled column.</Paragraph>
      <Paragraph position="1">  removed a portion of the test data and incrementally added it to the training and cross-validation sets. We found that, after an initial increase in the error rate, which can probably be accounted for by the fact that the new training data came from a different part of the corpus, increasing the size of the training and cross-validation sets to 2,514/1,266 reduced the error percentage to 1.1%, as can be seen in Table 7. The trade-off for this decreased error rate is a longer training time (often more than 10 minutes) as well as the extra time required to construct the larger sets.</Paragraph>
      <Paragraph position="2">  We next compared the Satz system error rate obtained using the neural network with results using a decision tree. We were able to use some of the previous results, specifically the optimality of a 6-context and the effectiveness of a smaller lexicon and binary feature vectors, to obtain a direct comparison with the neural net results. We used the c4.5 decision tree induction program (Quinlan 1993) and a 5,000 word lexicon to pro- null duce all decision tree results.</Paragraph>
      <Paragraph position="3"> 4.10.1 Size of Training Set. As we showed in Section 4.9, the size of the training set used for the neural network affected the overall system error rate. The same was true with the decision tree induction algorithm, as seen in Figure 4. The lowest error rate (1.0%) was obtained with a training set of 6,373 items.</Paragraph>
      <Paragraph position="4"> 4.10.2 Mixed-Case Results. One advantage of decision tree induction is that the al- null gorithm clearly indicates which of the input attributes are most important. While the 6-context descriptor arrays present 120 input attributes to the algorithm, c4.5 induced a decision tree utilizing only 10 of the attributes, when trained on the same mixed-case WSJ text used to train the neural network. The 10 attributes for mixed-case English texts, as seen in the induced decision tree in Figure 5, are (where t - 1 is the token preceding the punctuation mark, t + 1 is the token following, and so on):</Paragraph>
      <Paragraph position="6"> The decision tree created from the small training set of 622 items resulted in an error rate of 1.6%. This result was slightly higher than the lowest error rate (1.5%) obtained with the neural network trained with a similar training set and a 5,000 word lexicon. The lowest error rate obtained using a larger training set to induce the decision tree (1.0%), however, is better than the lowest error rate (1.1%) for the neural network trained on a larger set.</Paragraph>
      <Paragraph position="7"> 4.10.3 Single-Case Results. Running the induction algorithm on upper-case-only and lower-case-only texts both produced the same decision tree, shown in Figure 6. An interesting feature of this tree is that it reduced the 120 input attributes to just 4 important ones. Note the similarity of this tree to the algorithm used by the STYLE program as discussed in Section 2.</Paragraph>
      <Paragraph position="8"> Trained on 622 items, this tree produced 527 errors over the 27,294 item test set, an error rate of 1.9%, for both upper-case-only and lower-case-only texts. This error rate is lower than the best result for the neural network (3.3%) on single-case texts, despite the small size of the training set used.</Paragraph>
      <Paragraph position="9"> 5. Adaptation to Other Languages Since the disambiguation component of the sentence boundary recognition system, the learning algorithm, is language independent, the Satz system can be easily adapted to natural languages with punctuation systems similar to English. Adaptation to other  Decision tree induced for mixed-case English texts. Leaf nodes labeled with 1 indicate that the punctuation mark is determined to be a sentence boundary.</Paragraph>
      <Paragraph position="10"> languages involves obtaining (or building) a small lexicon containing the necessary part-of-speech data and constructing small training and cross-validation texts. We have successfully adapted the Satz system to German and French, and the results are described below.</Paragraph>
    </Section>
    <Section position="10" start_page="259" end_page="261" type="sub_section">
      <SectionTitle>
5.1 German
</SectionTitle>
      <Paragraph position="0"> The German lexicon was built from a series of public-domain word lists obtained from the Consortium for Lexical Research. In the resulting lexicon of 17,000 German adjectives, verbs, prepositions, articles, and 156 abbreviations, each word was assigned only the parts of speech for the lists from which it came, with a frequency of I for each part of speech. As the lack of actual frequency data in the lexicon made construction of a probabilistic descriptor array impossible, we performed all German experiments using binary vectors. The part-of-speech tags used were identical to those from the English lexicon, and the descriptor array mapping remained unchanged. This lexicon was used in testing with two separate corpora. The total development time required to  adapt Satz to German, including building the lexicon and constructing training texts, was less than one day. We tested the system with two separate German corpora.</Paragraph>
      <Paragraph position="1">  eral megabytes of on-line texts from the German newspaper. TM We constructed a training text of 520 items from the Sfiddeutsche Zeitung corpus, and a cross-validation text of 268 items. Training was performed in less than five minutes on a Next workstation. 19 Testing on a sample of 3,184 separate items from the same corpus resulted in error rates less than 1.3%, as summarized in Table 8. A direct comparison to the UNIX STYLE program is not possible for German texts, as the STYLE program is only effective for English texts. The SZ corpus did have a lower bound of 79.1%, which was similar to the 75.0% lower bound of the WSJ corpus.</Paragraph>
      <Paragraph position="2">  of public-domain German articles distributed internationally by the University of Ulm. We constructed a training text of 268 potential sentence boundaries from the corpus, as well as a cross-validation text of 150 potential sentence boundaries, and the training time was less than one minute in all cases. A separate portion of the corpus was used for testing the system and contained over 5,037 potential sentence boundaries from the months July-October 1994, with a &amp;quot;baseline system performance of 96.7%. Results of testing on the German News corpus are given in Table 8 and show a very low error rate for both mixed-case and single-case texts. Repeating the testing with a smaller lexicon containing less than 2,000 words still produced an error rate lower than 1% with a slightly higher training time.</Paragraph>
      <Paragraph position="3">  decision trees from a 788 item German training set (from the SZ corpus) resulted in a tree utilizing 11 of the 120 attributes. The error rates over the SZ test set were 2.3% for mixed-case texts and 1.9% for single-case texts, both noticeably higher than the best error rate (1.3%) achieved with the neural network on the SZ corpus. The decision tree  induced for the German News corpus utilized only three attributes (t - 1 can be an abbreviation, t - 1 can be a number, t + 1 can be a noun) and produced a 0.7% error rate in all cases.</Paragraph>
    </Section>
    <Section position="11" start_page="261" end_page="262" type="sub_section">
      <SectionTitle>
5.2 French
</SectionTitle>
      <Paragraph position="0"> The French lexicon was compiled from the part-of-speech data obtained by running the Xerox PARC part-of-speech tagger (Cutting et al. 1991) on a portion of the French half of the Canadian Hansards corpus. The lexicon consisted of less than 1,000 words assigned parts of speech by the tagger, including 20 French abbreviations appended to the 206 English abbreviations available from the lexicon used in obtaining the results described in Section 4. The part-of-speech tags in the lexicon were different from those used in the English implementation, so the descriptor array mapping had to be adjusted accordingly. The development time required to adapt Satz to French was two days. A training text of 361 potential sentence boundaries was constructed from the Hansards corpus, and a cross-validation text of 137 potential sentence boundaries, and the training time was less than one minute in all cases. The results of testing the system on a separate set of 3,766 potential sentence boundaries (also from the Hansards corpus) with a baseline algorithm performance of 80.1% are given in Table 9, including a comparison of results with both probabilistic and binary feature inputs.</Paragraph>
      <Paragraph position="1"> These data show a very low system error rate on both mixed and single-case texts. In addition, the decision tree induced from 498 French training items by the c4.5 program produced a lower error rate (0.4%) in all cases.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="262" end_page="262" type="metho">
    <SectionTitle>
6. Improving Performance on the Difficult Cases
</SectionTitle>
    <Paragraph position="0"> To demonstrate the performance of our system within a large-scale NLP application, we integrated the Satz system into an existing information extraction system, the Alembic system (Aberdeen et al. 1995) described in Section 2.2. On the same WSJ corpus used to test Satz in Section 4, the Alembic system alone achieved an error rate of only 0.9% (the best error rate achieved by Satz was 1.0%). A large percentage of the errors made by the Alembic module fell into the second category described in Section 4.3, where one of the five abbreviations Co., Corp., Ltd., Inc., or U.S. occurred at the end of a sentence. We decided to see if Satz could be applied in such a way that it improved the results on the hard cases on which the hand-written rules were unable to perform as well as desired.</Paragraph>
    <Paragraph position="1"> We trained Satz on 320 of the problematic examples described above, taken from the WSJ training corpus. The remaining 19,034 items were used as test data. The Alembic module was used to disambiguate the sentence boundaries in all cases except when one of the five problematic abbreviations was encountered; in these cases, Satz (in neural network mode) was used to determine the role of the period following the abbreviation. The hybrid disambiguation system reduced the total number of sentence boundary errors by 46% and the error rate for the whole corpus fell from 0.9% to 0.5%.</Paragraph>
    <Paragraph position="2"> We trained Satz again, this time using the decision tree learning method, in order to see what types of rules were acquired for the problematic sentences. The decision tree shown in Figure 7 is the result; the performance of Satz with this decision tree was nearly identical to the performance with the neural network. From this tree it can be seen that context farther away from the punctuation mark is important, and extensive use is made of part-of-speech information.</Paragraph>
  </Section>
</Paper>