<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1239">
<Title>Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora Hervé Déjean GREYC - CNRS - UPRESA 6072 Université de Caen - Basse Normandie</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> The method presented in this paper is inspired by the distributional approach developed by American structuralists between 1940 and 1950 (Harris, 1951).</Paragraph>
<Paragraph position="1"> This approach is characterized by two facts: (a) the use of corpora and (b) the use of the notion of distribution instead of the meaning of elements. The distribution of an element is the set of environments in which the element occurs. Other works describe systems that induce structures from corpora, but they use tagged corpora (Brill, 1993) or grammatical information (Brent, 1993), or work with artificial samples (Elman, 1990). The originality of our approach lies in the fact that we use only untagged, non-artificial corpora, with no specific knowledge of the language under study. We try to discover the structures of a natural language from raw texts in that language (about 100,000 words). We show that this kind of discovery is possible given some expectations about the structure of natural languages and the use of some formal properties.</Paragraph>
<Paragraph position="2"> The method relies on concepts from structural linguistics: the morpheme, the chunk, and the linearity of language, i.e. the corpus is a one-dimensional sequence of elements. We first give an overview of the concepts and general principles, from morphemes to the discovery of syntactic structures.</Paragraph>
<Paragraph position="3"> Then we explain in detail how the segmentation is carried out.</Paragraph>
</Section>
</Paper>