File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/99/w99-0702_intro.xml
Size: 1,971 bytes
Last Modified: 2025-10-06 14:07:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0702"> <Title>Experiments in Unsupervised Entropy-Based Corpus Segmentation</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper presents an entropy-based approach to segmenting a corpus into words. We assume that the corpus is not annotated with additional information, that we have no information whatsoever about the corpus or the language, and that no linguistic resources such as a lexicon or grammar are available. Such a situation may occur, e.g., if there is a (sufficiently large) corpus of an unknown or unidentified language and alphabet. (Such a corpus can be electronically encoded with arbitrarily defined symbol codes.) Based on entropy, we search for separators without knowing a priori which symbols or sequences of symbols constitute them.</Paragraph> <Paragraph position="1"> Over the last decades, entropy has frequently been used to segment corpora [Wolff, 1977, Alder, 1988, Hutchens and Alder, 1998, among many others], and it is commonly used with compression techniques.</Paragraph> <Paragraph position="2"> Harris [1955] proposed an approach for segmenting words into morphemes that, although it did not use entropy, was based on an intuitively similar concept: every symbol of a word is annotated with the count of all possible successor symbols given the substring that ends with the current symbol, and with the count of all possible predecessor symbols given the tail of the word that starts with the current symbol.</Paragraph> <Paragraph position="3"> Maxima in these counts are used to segment the word into morphemes.</Paragraph> <Paragraph position="4"> All steps of the present approach will be described using the example of a German corpus. In addition, we will give results obtained on modified versions of this corpus and on an English corpus.</Paragraph> </Section> </Paper>