<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2035">
  <Title>Tagging Sentence Boundaries</Title>
  <Section position="4" start_page="265" end_page="265" type="metho">
    <SectionTitle>
2 New Handling of Periods
</SectionTitle>
    <Paragraph position="0"> In the traditional Treebank schema, abbreviations are tokenized together with their trailing periods and, thus, stand-alone periods unambiguously signal end-of-sentence. For handling the SBD task we suggest tokenizing periods separately from their abbreviations and treating a period as an ambiguous token which can be marked as a fullstop ('. ' ), partof-abbreviation (' A') or both (' *'). An example of such markup is displayed on Figure 1. Such markup allows us to treat the period similarly to all other words in the text: a word can potentially take on one of a several POS tags and the job of a tagger is to resolve this ambiguity.</Paragraph>
    <Paragraph position="1"> In our experiments we used the Brown Corpus and the Wall Street Journal corpus both taken from the Penn Treebank (Marcus, Marcinkiewicz, and Santorini, 1993). We converted both these corpora from the original format to our XML format (as displayed on Figure 1), split the final periods from the abbreviations and assigned them with C= ' A ' and C= ' * ' tags according to whether or not the abbreviation was the last token in a sentence. There were also quite a few infelicities in the original tokenization and tagging of the Brown Corpus which we corrected by hand.</Paragraph>
    <Paragraph position="2"> Using such markup it is straightforward to train a POS tagger which also disambiguates sentence boundaries. There is, however, one difference in the implementation of such tagger. Normally, a POS tagger operates on a text-span which forms a sentence and this requires performing the SBD before tagging. However, we see no good reason why such a text-span should necessarily be a sentence, because almost all the taggers do not attempt to parse a sentence and operate only in the local window of two to three tokens.</Paragraph>
    <Paragraph position="3"> The only reason why the taggers traditionally operate on the sentence level is because there exists a technical issue of handling long text spans. Sentence length of 30-40 tokens seems to be a reasonable limit and, thus, having sentences pre-chunked before tagging simplifies life. This issue, however, can be also addressed by breaking the text into short text-spans at positions where the previous tagging history does not affect current decisions. For instance, a bigram tagger operates within a window of two tokens, and thus a sequence of word-tokens can be terminated at an unambiguous word because this unambiguous word token will be the only history used in tagging of the next token. A trigram tagger operates within a window of three tokens, and thus a sequence of word-tokens can be terminated when two unambiguous words follow each other.</Paragraph>
  </Section>
  <Section position="5" start_page="265" end_page="267" type="metho">
    <SectionTitle>
3 Tagging Experiment
</SectionTitle>
    <Paragraph position="0"> Using the modified treebank we trained a tri-gram POS tagger (Mikheev, 1997) based on a combination of Hidden Markov Models (HMM) and Maximum Entropy (ME) technologies. Words were clustered into ambiguity classes (Kupiec, 1992) according to sets of POS tags they can take on. This is a standard technique that was also adopted by the SATZ system 1. The tagger predictions were based on the ambiguity class of the current word together with  the POS trigrams: hypothesized current POS tag and partially disambiguated POS tags of two previous word-tokens. We also collected a list of abbreviations as explained later in this paper and used the information about whether a word is an abbreviation, ordinary word or potential abbreviation (i.e. a word which could not be robustly classified in the first two categories). This tagger employed Maximum Entropy models for tag transition and emission estimates and Viterbi algorithm (Viterbi, 1967) for the optimal path search.</Paragraph>
    <Paragraph position="1"> Using the forward-backward algorithm (Baum, 1972) we trained our tagger in the unsupervised mode i.e. without using the annotation available in the Brown Corpus and the WSJ. For evaluation purposes we trained our tagger on the Brown Corpus and applied it to the WSJ corpus and vice versa. We preferred this method to ten-fold cross-validation because this allowed us to produce only two tagging models instead of twenty and also this allowed us to test the tagger in harsher conditions when it is applied to texts which are very distant from the ones it was trained on.</Paragraph>
    <Paragraph position="2"> In this research we concentrated on measuring the performance only on two categories of word-tokens: on periods and other sentence-ending punctuation and on word-tokens in mandatory positions. Mandatory positions are positions which might require a word to be capitalized e.g. after a period, quotes, brackets, in all-capitalized titles, etc. At the evaluation we considered proper nouns (NNP), plural proper nouns (NNPS) and proper adjectives 2 (JJP) to signal a proper name, all all other categories were considered to signal a common word or punctuation. We also did not consider as an error the mismatch between &amp;quot;.&amp;quot; and &amp;quot;*&amp;quot; categories because both of them signal that a period denotes the end of sentence and the difference between them is only whether this period follows an abbreviation or a regular word.</Paragraph>
    <Paragraph position="3"> In all our experiments we treated embedded sentence boundaries in the same way as normal sentence boundaries. The embedded sentence boundary occurs when there is a sentence inside a sentence. This 2These are adjectives like &amp;quot;American&amp;quot; which are always written capitalized. We identified and marked them in the WSJ and Brown Corpus, can be a quoted direct speech sub-sentence inside a sentence, this can be a sub-sentence embedded in brackets, etc. We considered closing punctuation of such sentences equal to closing punctuation of ordinary sentences.</Paragraph>
    <Paragraph position="4"> There are two types of error the tagger can make when disambiguating sentence boundaries. The first one comes from errors made by the tagger in identifying proper names and abbreviations. The second one comes from the limitation of the POS tagging approach to the SBD task. This is when an abbreviation is followed by a proper name: POS information normally is not sufficient to disambiguate such cases and the tagger opted to resolve all such cases as &amp;quot;not sentence boundary&amp;quot;. There are about 5-7% of such cases in the Brown Corpus and the WSJ and the majority of them, indeed, do not signal a sentence boundary.</Paragraph>
    <Paragraph position="5"> We can estimate the upper bound for our approach by pretending that the tagger was able to identify all abbreviations and proper names with perfect accuracy. We can sinmlate this by using the information available in the treebank. It turned out that the tagger marked all the cases when an abbreviation is followed by a proper name, punctuation, non-capitalized word or a number as &amp;quot;not sentence boundary&amp;quot;. All other periods were marked as sentence-terminal. This produced 0.01% error rate on the Brown Corpus and 0.13% error rate on the WSJ as displayed in the first row of Table 1.</Paragraph>
    <Paragraph position="6"> In practice, however, we cannot expect the tagger to be 100% correct and the second row of Table 1 displays the actual results of applying our POS tagger to the Brown Corpus and tile WSJ. General tagging performance on both our corpora was a bit better than a 4% error rate which is in line with the standard performance of POS taggers reported on these two corpora. On the capitalized words in mandatory positions the tagger achieved a 3.1-4.7% error rate which is an improvement over the lexical lookup approach by 2-3 times. On the sentence breaking punctuation the tagger performed extremely well an error rate of 0.39% on the WSJ and 0.25% on the Brown Corpus. If we compare these results with the upper bound we see that the errors made by the tagger on the capitalized words and abbreviations  instigated about a 0.25% error rate on the sentence boundaries.</Paragraph>
    <Paragraph position="7"> We also applied our tagger to single-case texts.</Paragraph>
    <Paragraph position="8"> We converted the WSJ and the Brown Corpus to upper-case only. In contrast to the mixed case texts where capitalization together with the syntactic information provided very reliable evidence, syntactic information without capitalization is not sufficient to disambiguate sentence boundaries. For the majority of POS tags there is no clear preference as to whether they are used as sentence starting or sentence internal. To minimize the error rate on single case texts, our tagger adopted a strategy to mark all periods which follow al)breviations as &amp;quot;non-sentence boundaries&amp;quot;. This gave a 1.98% error rate on the WSJ and a 0.51% error rate on the Brown Corpus.</Paragraph>
    <Paragraph position="9"> These results are in line with the results reported for the SATZ system on single case texts.</Paragraph>
  </Section>
  <Section position="6" start_page="267" end_page="267" type="metho">
    <SectionTitle>
4 Enhanced Feature Set
</SectionTitle>
    <Paragraph position="0"> (Mikheev, 1999) described a new approach to the disambiguation of capitalized words in mandatory positions. Unlike POS tagging, this approach does not use local syntactic context, but rather it applies the so-called document-centered approach.</Paragraph>
    <Paragraph position="1"> The essence of the document-centered approach is to scan the entire document for the contexts where the words in question are used unambiguously. Such contexts give the grounds for resolving ambiguous contexts.</Paragraph>
    <Paragraph position="2"> For instance, for the disambiguation of capitalized words in mandatory positions the above reasoning can be crudely summarized as follows: if we detect that a word has been used capitalized in an unambiguous context (not in a mandatory position), this increases the chances for this word to act as a proper name in mandatory positions in the same document. And, conversely, if a word is seen only lowercased, this increases the chances to downcase it in mandatory positions of the same document. By collecting sequences and unigrams of unambiguously capitalized and lowercased words in the document and imposing special ordering of their applications (Mikheev, 1999) reports that the document-centered approach achieved a 0.4-0.7% error rate with coverage of about 90% on the disambiguation of capitalized words in mandatory positions.</Paragraph>
    <Paragraph position="3"> We decided to combine this approach with our POS tagging system in the hope of achieving better accuracy on capitalized words after the periods and therefore improving the accuracy of sentence splitting. Although the document-centered approach to capitalized words proved to be more accurate than POS tagging, the two approaches are complimentary to each other since they use different types of information. Thus, the hybrid system can bring at least two advantages. First, unassigned by the document-centered approach 10% of the ambiguously capitalized words can be assigned using a standard POS tagging method based on the local syntactic context. Second, the local context can correct some of the errors made by the document-centered approach.</Paragraph>
    <Paragraph position="4"> To implement this hybrid approach we incorporated the assignments made by the document-centered approach to the words in mandatory positions to our POS tagging model by simple linear interpolation.</Paragraph>
    <Paragraph position="5"> The third row of Table 1 displays the results of the application of the extended tagging model. We see an improvement on proper name recognition by about 1.5%: overall error rate of 1.87% on the Brown Corpus and overall error rate 3.22% on the WSJ.</Paragraph>
    <Paragraph position="6"> This in its turn allowed for better tagging of sentence boundaries : a 0.20% error rate on the Brown Corpus and a 0.31% error rate on the WSJ, which corresponds to about 20% cut in the error rate in comparision to the standard POS tagging.</Paragraph>
  </Section>
  <Section position="7" start_page="267" end_page="269" type="metho">
    <SectionTitle>
5 Handling of Abbreviations
</SectionTitle>
    <Paragraph position="0"> Information about whether a word is an abbreviation or not is absolutely crucial for sentence splitting. Unfortunately, abbreviations do not form a closed set, i.e., one cannot list all possible abbreviations.</Paragraph>
    <Paragraph position="1"> It gets even worse - abbreviations can coincide with ordinary words, i.e., &amp;quot;in&amp;quot; can denote an abbreviation for &amp;quot;inches&amp;quot;, &amp;quot;no&amp;quot; can denote an abbreviation for &amp;quot;number&amp;quot;, &amp;quot;bus&amp;quot; can denote an abbreviation for &amp;quot;business&amp;quot;, etc.</Paragraph>
    <Paragraph position="2"> Obviously, a practical sentence splitter which in our case is a POS tagger, requires a module that can guess unknown abbreviations. First, such a module can apply a well-known heuristic that single-word abbreviations are short and normally do not include vowels (Mr., Dr., kg.). Thus a word without vowels can be guessed to be an abbreviation unless it is written in all capital letters which can be an acronym (e.g. RPC). A span of single letters, separated by periods forms an abbreviation too (e.g.Y.M.C.A.).</Paragraph>
    <Paragraph position="3"> Other words shorter than four characters and unknown words shorter than five characters should be treated as potential abbreviations. Although these heuristics are accurate they manage to identify only about 60% of all abbreviations in the text which translates at 40% error rate as shown in the first row of Table 2.</Paragraph>
    <Paragraph position="4"> These surface-guessing heuristics can be supplemented with the document-centered approach (DCA) to abbreviation guessing, which we call Positional Guessing Strategy (PGS). Although a short word which is followed by a period can potentially be an abbreviation, the same word when occurring in the same document in a different context can be unambiguously classified as an ordinary word if it is used without a trailing period, or it can be unambiguously classified as an abbreviation if it is used with a</Paragraph>
    <Section position="1" start_page="268" end_page="269" type="sub_section">
      <SectionTitle>
Corpus
</SectionTitle>
      <Paragraph position="0"> surface guess surface guess and DCA surface guess and DCA and abbr. list trailing period and is followed by a lowercased word or a comma. This allows us to assign such words accordingly even in ambiguous contexts of the same document, i.e., when they are followed by a period. For instance, the word &amp;quot;Kong&amp;quot; followed by a period and then by a capitalized word cannot be safely classified as a regular word (non-abbreviation) and therefore it is a potential abbreviation. But if in the same document we detect a context &amp;quot;lived in Hong Kong in 1993&amp;quot; this indicates that &amp;quot;Kong&amp;quot; is normally written without a trailing period and hence is not an abbreviation. Having established that, we can apply this findings to the non-evident contexts and classify &amp;quot;Kong&amp;quot; as a regular word (nonabbreviation) throughout the document. However, if we detect a context such as &amp;quot;Kong., said&amp;quot; this indicates that in this document &amp;quot;'Kong&amp;quot; is normally written with a trailing period and hence is an abbreviation. This gives us grounds to classify &amp;quot;Kong&amp;quot; as an abbreviation in all its occurrences within the same document.</Paragraph>
      <Paragraph position="1"> The positional guessing strategy relies on the assumption that there is a consistency of writing within the same document. Different authors can write &amp;quot;Mr&amp;quot; or &amp;quot;Dr&amp;quot; with or without trailing period but we assume that the same author (the author of a document) will write consistently. However, there can occur a situation when a potential abbreviation is used as a regular word and as an abbreviation within the same document. This is usually the case when an abbreviation coincides with a regular word e.g. &amp;quot;Sun.&amp;quot; (meaning Sunday) and &amp;quot;Sun&amp;quot; (the name of a newspaper). To tackle this problem, our strategy is to collect not only unigrams of potential abbreviations in unambiguous contexts as explained earlier but also their bigrams with the preceding word. Now the positional guessing strategy can assign ambiguous instances on the basis of the bigrams it collected from the document.</Paragraph>
      <Paragraph position="2"> For instance, if in a document the system found a context &amp;quot;vitamin C is&amp;quot; it stores the bigram &amp;quot;vitamin C&amp;quot; and the unigrarn &amp;quot;C&amp;quot; with the information that it is a regular word. If in the same document the system also detects a context &amp;quot;John C. later said&amp;quot; it stores the bigram &amp;quot;John C.&amp;quot; and the unigram &amp;quot;C&amp;quot; with the information that it is an abbreviation. Here we have conflicting information for the word &amp;quot;C&amp;quot; it was detected as acting as a regular word and as an abbreviation within the same document - so there is not enough information to resolve ambiguous cases purely using the unigram. However, some cases can be resolved on the basis of the bigrams e.g. the system will assign &amp;quot;C&amp;quot; as an abbreviation in an ambiguous context &amp;quot;... John C. Research ...&amp;quot; and it will assign &amp;quot;C&amp;quot; as a regular word (non-abbreviation) in an ambiguous context &amp;quot;... vitamin C. Research ...&amp;quot; When neither unigrams nor bigrams can help to resolve an ambiguous context for a potential abbreviation, the system decides in favor of the more frequent category deduced from the current document for this potential abbreviation. Thus if the word &amp;quot;In&amp;quot; was detected as acting as a non-abbreviation (preposition) five times in the current document and two times as abbreviation (for the state Indiana), in a context where neither of the bigrams collected from the document can be applied, &amp;quot;In&amp;quot; is assigned as a regular word (non-abbreviation). The last resort strategy is to assign all non-resolved cases as non-abbreviations.</Paragraph>
      <Paragraph position="3"> Apart from the ability of finding abbreviations beyond the scope of the surface guessing heuristics, the document-centered approach also allows for the classification of some potential abbreviations as ordinary words, thus reducing the ambiguity for the sentence splitting module. The second row of Table 2 shows the results when we supplemented the surface guessing heuristics with the document-centered approach.</Paragraph>
      <Paragraph position="4"> This alone gave a huge improvement over the surface guessing heuristics.</Paragraph>
      <Paragraph position="5"> Using our abbreviation guessing module and an unlabeled corpus from New York Times 1996 of 300,000 words, we compiled a list of 270 abbreviations which we then used in our tagging experiments together with the guessing module. In this list we included abbreviations which were identified by our guesser and which had a frequency of five or greater.</Paragraph>
      <Paragraph position="6"> When we combined the guessing module together with the induced abbreviation list and applied it to the Brown Corpus and the WSJ we measured about 1% error rate on the identification of abbreviation as can be seen in the third row of Table 2.</Paragraph>
      <Paragraph position="7"> We also tested our POS tagger and the extended tagging model in conjunction with the abbreviation guesser only, when the system was not equipped with the list of abbreviations. The error rate on capitalized words went just a bit higher while the error  rate on the sentence boundaries increased by twothree times but still stayed reasonable. In terms of absolute numbers, the tagger achieved a 0.98% error rate on the Brown Corpus and a 1.95% error rate on the WSJ when disarnbiguating sentence boundaries. The extended system without the abbreviation list was about 30% more accurate and achieved a 0.65% error rate on sentence splitting on the Brown Corpus and 1.39% on the WSJ corpus as shown in the last row of Table 1. The larger impact on the WSJ corpus can be explained by the fact that it has a higher proportion of abbreviations than the Brown Corpus. In the Brown Corpus, 8% of potential sentence boundaries come after abbreviations. Tile WSJ is richer in abbreviations and 17% of potential sentence boundaries come after abbreviations. Thus, unidentified abbreviations had a higher impact on the error rate in the WSJ.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>