<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2208"> <Title>The Automatic Extraction of Open Compounds from Text Corpora</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> This paper discusses a method for the automatic extraction of candidates for open compound registration. An open compound refers to an uninterrupted sequence of words, generally functioning as a single constituent (Smadja and McKeown, 1990). We propose a new method of extraction for languages which have no specific use of punctuation to signify word boundaries. Our method is applied to n-gram text data using statistical observation of the change in frequency of occurrence when the window size of string observation is extended (character) cluster-wise. We generate both rightward and leftward sorted n-gram data, then determine the left and right boundaries of a string using the methods of competitive selection and unified selection. In this paper, we examine the result of applying our method to Thai text corpora and also introduce conventional Thai spelling rules to avoid extracting invalid strings.</Paragraph> <Paragraph position="1"> Previous work (Nagao et al., 1994) has shown an effective way of constructing a sorted file for the efficient calculation of n-gram data. However, a surprisingly large number of invalid strings were also extracted. Subsequent work (Ikehara et al., 1995) has extended the sorted file to avoid repeating the counting of substrings contained in strings already counted. This meant the extraction of only the longest strings in the order of frequency of occurrence. The extraction result improved, but the determination of longest strings is always made consecutively from left to right. If an erroneous string is extracted, its error directly propagates through the rest of the input. 
It is possible that a string with an invalid starting pattern will be extracted because an overly long string has been extracted previously.</Paragraph> <Paragraph position="2"> In the following sections, we first describe the necessity of making this statistical observation for extracting open compounds from Thai text corpora. Then, the methodology of data preparation and open compound extraction is explained. Finally, we discuss the result of an experiment on both large and small test corpora to investigate the effectiveness of our method.</Paragraph> </Section> </Paper>