<?xml version="1.0" standalone="yes"?>
<Paper uid="A94-1013">
  <Title>Adaptive Sentence Boundary Disambiguation</Title>
  <Section position="6" start_page="81" end_page="82" type="concl">
    <SectionTitle>
4 Discussion and Future Work
</SectionTitle>
    <Paragraph position="0"> We have presented an automatic sentence boundary labeler which uses probabilistic part-of-speech information and a simple neural network to correctly disambiguate over 98.5% of sentence-boundary punctuation marks. A novel aspect of the approach is its use of prior part-of-speech probabilities, rather than word tokens, to represent the context surrounding the punctuation mark to be disambiguated. This leads to savings in parameter estimation and thus training time. The stochastic nature of the input, combined with the inherent robustness of the connectionist network, produces robust results. The algorithm is to be used in conjunction with a part-of-speech tagger, and so assumes the availability of a lexicon containing prior probabilities of parts-of-speech. The network is rapidly trainable and thus should be easily adaptable to new text genres, and is very efficient when used in its labeling capacity. Although the systems of Wasson and Riley (1989) report slightly better error rates, our approach has the advantage of flexibility for application to new text genres, small training sets (and hence fast training times), (relatively) small storage requirements, and little manual effort. Furthermore, additional experimentation may lower the error rate.</Paragraph>
    <Paragraph position="1"> Although our results were obtained using an English lexicon and text, we designed the boundary labeler to be equally applicable to other languages, assuming the accessibility of lexical part-of-speech frequency data (which can be obtained by running a part-of-speech tagger over a large corpus of text, if it is not available in the tagger itself) and an abbreviation list. The input to the neural network is a language-independent set of descriptor arrays, so training and labeling would not require recoding for a new language. The heuristics described in Section 2 may need to be adjusted for other languages in order to maximize the efficacy of these descriptor arrays.</Paragraph>
    <Paragraph position="2"> Many variations remain to be tested. We plan to: (i) test the approach on French and perhaps German, (ii) perform systematic studies on the effects of asymmetric context sizes, different part-of-speech categorizations, different thresholds, and larger descriptor arrays, (iii) apply the approach to texts with unusual or very loosely constrained markup formats, and perhaps even to other markup recognition problems, and (iv) compare the use of the neural net with more conventional tools such as decision trees and Hidden Markov Models.</Paragraph>
    <Paragraph position="4"> Acknowledgements The authors would like to acknowledge valuable advice, assistance, and encouragement provided by Manuel Fähndrich, Haym Hirsh, Dan Jurafsky, Terry Regier, and Jeanette Figueroa. We would also like to thank Ken Church for making the PARTS data available, and Ido Dagan, Christiane Hoffmann, Mark Liberman, Jan Pedersen, Martin Röscheisen, Mark Wasson, and Joe Zhou for assistance in finding references and determining the status of related work. Special thanks to Prof. Franz Guenthner for introducing us to the problem.</Paragraph>
    <Paragraph position="5"> The first author was sponsored by a GAANN fellowship; the second author was sponsored in part by the Advanced Research Projects Agency under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI) and in part by the Xerox Palo Alto Research Center (PARC).</Paragraph>
  </Section>
</Paper>