<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2035"> <Title>Tagging Sentence Boundaries</Title> <Section position="8" start_page="269" end_page="269" type="concl"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> In this paper we presented an approach which treats the sentence boundary disambiguation (SBD) problem as part of POS tagging. In its &quot;vanilla&quot; version the system outperformed the results recently reported in the literature for the SBD task. When we combined the &quot;vanilla&quot; model with the document-centered approach to proper name handling, we measured a further improvement of about 20% in sentence splitting performance and an improvement of about 40% in capitalized word assignment.</Paragraph> <Paragraph position="1"> The POS tagging approach to sentence splitting produces models which are highly portable across different corpora: POS categories are much more frequent than individual words and less affected by unseen words. This differentiates our approach from word-based sentence splitters. In contrast to Palmer and Hearst (1997), who also used POS categories as predictive features, we relied on proper POS tagging technology rather than a shortcut to POS tag estimation. This ensured the higher accuracy of the POS tagging method, which cut the error rate of the SATZ system by 69%. 
On the other hand, because of its simplicity, the SATZ approach is probably easier to implement and faster to train than a POS tagger.</Paragraph> <Paragraph position="2"> On single-case texts the syntactic approach did not show a considerable advantage over the word-based methods: all periods which followed abbreviations were tagged as &quot;sentence internal&quot;, and the results achieved by our system on the single-case texts were in line with those of the other systems.</Paragraph> <Paragraph position="3"> The abbreviation guessing module, which combines the surface guessing heuristics with the document-centered approach, makes our system very robust to new domains. The system demonstrated strong performance even without being equipped with a list of known abbreviations, which, to our knowledge, none of the previously described SBD systems could achieve.</Paragraph> <Paragraph position="4"> Another important advantage of our approach is that it potentially requires a smaller amount of training data, and this training data does not need to be labeled in any way. In training a conventional sentence splitter, one usually collects periods with their surrounding context, and these samples have to be manually labeled. In our case a POS tagging model is trained on all available words, so syntactic dependencies between words that can appear in the local context of a period can be established from other parts of the text. Our system does not require annotated data for training and can be trained in an unsupervised manner on raw texts of approximately 300,000 words or more.</Paragraph> <Paragraph position="5"> The performance of our system could be improved further by combining it with a word-based system which encodes specific behavior for individual words. This is similar to how the SATZ system was combined with the Alembic system. 
Such a combination would address a limitation of our syntactic approach, which always treats an abbreviation followed by a proper name as &quot;non sentence boundary&quot;. In fact, we encoded one simple rule: an abbreviation which stands for an American state (e.g. Ala. or Kan.) is always sentence terminal if followed by a proper name. This reduced the error rate on the WSJ from 0.31% to 0.25%. Another avenue for further development is to extend the system to other languages.</Paragraph> </Section> </Paper>