<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1013">
  <Title>Adaptive Sentence Boundary Disambiguation</Title>
  <Section position="4" start_page="79" end_page="80" type="metho">
    <SectionTitle>
2 Our Solution
</SectionTitle>
    <Paragraph position="0"> We have developed an efficient and accurate automatic sentence boundary labeling algorithm which overcomes the limitations of previous solutions. The method is easily trainable and adapts to new text types without requiring rewriting of recognition rules. The core of the algorithm can be stated concisely as follows: the part-of-speech probabilities of the tokens surrounding a punctuation mark are used as input to a feed-forward neural network, and the network's output activation value determines what label to assign to the punctuation mark.</Paragraph>
    <Paragraph position="1"> The straightforward approach to using contextual information is to record for each word the likelihood that it appears before or after a sentence boundary. However, it is expensive to obtain probabilities for likelihood of occurrence of all individual tokens in the positions surrounding the punctuation mark, and most likely such information would not be useful to any subsequent processing steps in an NLP system. Instead, we use probabilities for the part-of-speech categories of the surrounding tokens, thus making training faster and storage costs negligible for a system that must in any case record these probabilities for use in its part-of-speech tagger.</Paragraph>
    <Paragraph position="2"> This approach appears to incur a cycle: because most part-of-speech taggers require pre-determined sentence boundaries, sentence labeling must be done before tagging. But if sentence labeling is done before tagging, no part-of-speech assignments are available for the boundary-determination algorithm. Instead of assigning a single part-of-speech to each word, our algorithm uses ~he prior probabilities of all parts-of-speech for that word. This is in contrast to Riley's method (Riley, 1989) which requires probabilities to be found for every lexical item (since it records the number of times every token has been seen before and after a period). Instead, we suggest making use of the unchanging prior probabilities for each word already stored in the system's lexicon.</Paragraph>
    <Paragraph position="3"> The rest of this section describes the algorithm in more detail.</Paragraph>
    <Section position="1" start_page="79" end_page="79" type="sub_section">
      <SectionTitle>
2.1 Assignment of Descriptors
</SectionTitle>
      <Paragraph position="0"> The first stage of the process is lexical analysis, which breaks the input text (a stream of characters) into tokens. Our implementation uses a slightlymodified version of the tokenizer from the PARTS part-of-speech tagger (Church, 1988) for this task.</Paragraph>
      <Paragraph position="1"> A token can be a sequence of alphabetic characters, a sequence of digits (numbers containing periods acting as decimal points are considered a single token), or a single non-alphanumeric character. A lookup module then uses a lexicon with part-of-speech tags for each token. This lexicon includes information about the frequency with which each word occurs as each possible part-of-speech. The lexicon and the frequency counts were also taken from the PARTS tagger, which derived the counts from the Brown corpus (Francis and Kucera, 1982). For the word adult, for example, the lookup module would return the tags &amp;quot;JJ/2 NN/24,&amp;quot; signifying that the word occurred 26 times in the Brown corpus - twice as an adjective and 24 times as a singular noun.</Paragraph>
      <Paragraph position="2"> The lexicon contains 77 part-of-speech tags, which we map into 18 more general categories (see Figure 1). For example, the tags for present tense verb, past participle, and modal verb all map into the more general &amp;quot;verb&amp;quot; category. For a given word and category, the frequency of the category is the sum of the frequencies of all the tags that are mapped to the category for that word. The 18 category frequencies for the word are then converted to probabilities by dividing the frequencies for each category by the total number of occurrences of the word.</Paragraph>
      <Paragraph position="3"> For each token that appears in the input stream, a descriptor array is created consisting of the 18 probabilities as well as two additional flags that indicate if the word begins with a capital letter and if it follows a punctuation mark.</Paragraph>
    </Section>
    <Section position="2" start_page="79" end_page="80" type="sub_section">
      <SectionTitle>
2.2 The Role of the Neural Network
</SectionTitle>
      <Paragraph position="0"> We accomplish the disambiguation of punctuation marks using a feed-forward neural network trained with the back propagation algorithm (Hertz et al., 1991). The network accepts as input k * 20 input units, where k is the number of words of context surrounding an instance of an end-of-sentence punctuation mark (referred to in this paper as &amp;quot;k-context&amp;quot;), and 20 is the number of elements in the descriptor array described in the previous subsection. The  input layer is fully connected to a hidden layer consisting of j hidden units with a sigmoidal squashing activation function. The hidden units in turn feed into one output unit which indicates the results of the function. 4 The output of the network, a single value between 0 and 1, represents the strength of the evidence that a punctuation mark occurring in its context is indeed the end of the sentence. We define two adjustable sensitivity thresholds to and tl, which are used to classify the results of the disambiguation.</Paragraph>
      <Paragraph position="1"> If the output is less than to, the punctuation mark is not a sentence boundary; if the output is greater than or equal to Q, it is a sentence boundary. Outputs which fall between the thresholds cannot be disambiguated by the network and are marked accordingly, so they can be treated specially in later processing. When to : tl, every punctuation mark is labeled as either a boundary or a non-boundary.</Paragraph>
      <Paragraph position="2"> To disambiguate a punctuation mark in a kcontext, a window of k+l tokens and their descriptor arrays is maintained as the input text is read. The first k/2 and final k/2 tokens of this sequence represent the context in which the middle token appears. If the middle token is a potential end-of-sentence punctuation mark, the descriptor arrays for the context tokens are input to the network and the output result indicates the appropriate label, subject to the thresholds to and t 1.</Paragraph>
      <Paragraph position="3"> Section 3 describes experiments which vary the size of k and the number of hidden units.</Paragraph>
    </Section>
    <Section position="3" start_page="80" end_page="80" type="sub_section">
      <SectionTitle>
2.3 Heuristics
</SectionTitle>
      <Paragraph position="0"> A connectionist network can discover patterns in the input data without using explicit rules, but the input must be structured to allow the net to recognize these patterns. Important factors in the effectiveness of these arrays include the mapping of part-of-speech tags into categories, and assignment of parts-of-speech to words not explicitly contained in the lexicon.</Paragraph>
      <Paragraph position="1"> As previously described, we map the part-of-speech tags in the lexicon to more general categories. This mapping is, to an extent, dependent on the range of tags and on the language being analyzed.</Paragraph>
      <Paragraph position="2"> In our experiments, when all verb forms in English are placed in a single category, the results are strong (although we did not try alternative mappings). We speculate, however, that for languages like German, 4The context of a punctuation mark can be thought of as the sequence of tokens preceding and following it. Thus this network can be thought of roughly as a Time-Delay Neural Network (TDNN) (Hertz et al., 1991), since it accepts a sequence of inputs and is sensitive to positional information wRhin the sequence. However, since the input information is not really shifted with each time step, but rather only presented to the neural net when a punctuation mark is in the center of the input stream, this is not technically a TDNN.</Paragraph>
      <Paragraph position="3"> noun verb article modifier conjunction pronoun preposition proper noun number comma or semicolon left parentheses right parentheses non-punctuation character possessive colon or dash abbreviation  the verb forms will need to be separated from each other, as certain forms occur much more frequently at the end of a sentence than others do. Similar issuse may arise in other languages.</Paragraph>
      <Paragraph position="4"> Another important consideration is classification of words not present in the lexicon, since most texts contain infrequent words. Particularly important is the ability to recognize tokens that are likely to be abbreviations or proper nouns. M/iller (1980) gives an argument for the futility of trying to compile an exhaustive list of abbreviations in a language, thus implying the need to recognize unfamiliar abbreviations. We implement several techniques to accomplish this. For example, we attempt to identify initials by assigning an &amp;quot;abbreviation&amp;quot; tag to all sequences of letters containing internal periods and no spaces. This finds abbreviations like &amp;quot;J.R.&amp;quot; and &amp;quot;Ph.D.&amp;quot; Note that the final period is a punctuation mark which needs to be disambiguated, and is therefore not considered part of the word.</Paragraph>
      <Paragraph position="5"> A capitalized word is not necessarily a proper noun, even when it appears somewhere other than in a sentence's initial position (e.g., the word &amp;quot;American&amp;quot; is often used as an adjective). We require a way to assign probabilities to capitalized words that appear in the lexicon but are not registered as proper nouns. We use a simple heuristic: we split the word's probabilities, assigning a 0.5 probability that the word is a proper noun, and dividing the remaining 0.5 according to the proportions of the probabilities of the parts of speech indicated in the lexicon for that word.</Paragraph>
      <Paragraph position="6"> Capitalized words that do not appear in the lexicon at all are generally very likely to be proper nouns; therefore, they are assigned a proper noun probability of 0.9, with the remaining 0.1 probability distributed equally among all the other parts-ofspeech. These simple assignment rules are effective for English, but would need to be slightly modified for other languages with different capitalization rules (e.g., in German all nouns are capitalized).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>