<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1097"> <Title>A Statistical Method for Extracting Uninterrupted and Interrupted Collocations from Very Large Corpora</Title> <Section position="6" start_page="577" end_page="577" type="concl"> <SectionTitle> 6. Conclusion </SectionTitle> <Paragraph position="0"> Methods for automatically identifying and extracting uninterrupted and interrupted collocations from very large corpora have been proposed.</Paragraph> <Paragraph position="1"> First, from the viewpoint of collocational expression extraction, the problems of Nagao and Mori's algorithm for calculating N-grams of arbitrary length have been pointed out. Then, under the condition that the extraction of fractional substrings is suppressed, a new method of automatically extracting and tabulating all of the uninterrupted collocational substrings has been proposed. Next, using these results, a method for automatically extracting interrupted collocational substrings has been proposed. In this method, combinations of uninterrupted collocational substrings that collocate at different positions within a sentence are extracted and counted.</Paragraph> <Paragraph position="2"> The method was applied to newspaper articles totaling some 8.92 million characters. The results for uninterrupted collocations were compared with those of N-gram statistics. For the extraction of substrings of 2 or more characters, the conventional method yielded 4.4 million types of substrings with a total frequency of 31.2 million. 
In contrast, the method proposed in this paper extracted 0.97 million types of substrings, with the total frequency reduced to 2.6 million.</Paragraph> <Paragraph position="3"> In the case of interrupted collocational substring extraction, combining the substrings with a frequency of 10 or more extracted by the first method yielded 6.5 thousand types of substring pairs with a total frequency of 21.8 thousand.</Paragraph> <Paragraph position="4"> These results suggest that, from the standpoint of extracting collocational expressions (as units of syntactic and semantic expression), the substrings obtained by conventional methods include a large number of fractional substrings. In contrast, the method proposed in this paper eliminates many of these fractional substrings and condenses the output into a group of substrings that can be regarded as units of expression. As a result, it has become possible to easily calculate interrupted collocations, together with phrase templates and other basic data regarding sentence structure.</Paragraph> <Paragraph position="5"> This paper used Japanese character chains to examine the algorithm, but the algorithm can be applied to arbitrary symbol chains. Various applications are possible, such as word chains, syntactic element chains obtained from the results of morphological analysis, or semantic attribute chains in which each word is converted to its semantic attributes. As shown in this paper, applications to Japanese character chains still output some fractional strings. When the method is applied to word chains or syntactic element strings, however, further restriction of unnecessary elements is anticipated.</Paragraph> </Section> </Paper>