<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1097"> <Title>A Statistical Method for Extracting Uninterrupted and Interrupted Collocations from Very Large Corpora</Title> <Section position="1" start_page="0" end_page="575" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In order to extract rigid expressions with a high frequency of use, a new algorithm is proposed that can efficiently extract both uninterrupted and interrupted collocations from very large corpora.</Paragraph> <Paragraph position="1"> The statistical method recently proposed for calculating N-gram statistics for arbitrary N can be applied to the extraction of uninterrupted collocations. However, this method extracts such large volumes of fractional and unnecessary expressions that it was impossible to extract interrupted collocations by combining the results. To solve this problem, this paper proposes a new algorithm that restrains the extraction of unnecessary substrings, followed by a method that makes it possible to extract interrupted collocations.</Paragraph> <Paragraph position="2"> The new methods were applied to Japanese newspaper articles comprising 8.92 million characters. For uninterrupted collocations with a string length of 2 or more characters and a frequency of appearance of 2 or more, the N-gram method extracted 4.4 million types of expressions (total frequency of 31.2 million). In contrast, the new method reduced this to 0.97 million types (total frequency of 2.6 million), a substantial reduction in fractional and unnecessary expressions. For interrupted collocational substrings, combining the substrings with a frequency of 10 or more extracted by the first method yielded 6.5 thousand types of substring pairs with a total frequency of 21.8 thousand.</Paragraph> <Paragraph position="3"> 1. 
Introduction In natural language processing, the importance of large-volume corpora has been pointed out, together with the need for technology to analyze such linguistic data. For example, in machine translation, there are many expressions that are difficult to translate literally. Phrase translations or pattern translations based on phrase or pattern dictionaries are considered very useful for translating these expressions.</Paragraph> <Paragraph position="4"> In order to realize such translation, it is necessary to identify high-frequency phrases and patterns of expressions in the corpora. Many methods have been proposed to extract rigid expressions from corpora, such as methods focusing on the binding strength of two words (Church and Hanks 1990), the distance between words (Smadja and McKeown 1990), and the number of combined words and their frequency of appearance (Kita 1993, 1994). However, it was not easy to identify and extract expressions of arbitrary length and high frequency of appearance from very large corpora.</Paragraph> <Paragraph position="5"> Thus, conventional methods had to introduce restrictions, such as limiting the kind or the length of the chains to be extracted (Smadja 1993, Shinnou and Isahara 1995).</Paragraph> <Paragraph position="6"> Recently, a new method that can calculate n-gram statistics for arbitrary n over very large corpora has been proposed (Nagao and Mori 1994). This method has made it possible to automatically and quickly extract and tabulate substrings of any length used in source texts. Unfortunately, this method extracts so many grammatically and semantically inconsistent fractional substrings that it was difficult to extract combinations of expressions collocated at separate locations (i.e., interrupted collocations), which requires searching the source text by combining the strings thus extracted. 
Thus, the analyses had to be limited to small texts (Colier 1994).</Paragraph> <Paragraph position="7"> To overcome these problems, this paper first proposes a method that can automatically extract and tabulate uninterrupted collocational substrings from the corpora without omission, in the order of substring length and frequency, under the condition that fractional substrings are excluded. Second, using the results of the first method, it also proposes a method that can automatically extract and tabulate interrupted collocational substrings.</Paragraph> <Paragraph position="8"> 2. N-gram Method and the Problem Involved (1) Conditions for Collocational Substring Extraction In order to extract uninterrupted collocations without omission and to minimize the extraction of fractional substrings, we introduce the following three conditions. 1st Condition: Substrings can be extracted in the order of the number of matching characters (string length).</Paragraph> <Paragraph position="9"> 2nd Condition: Substrings can be extracted in the order of frequency of use.</Paragraph> <Paragraph position="10"> 3rd Condition: Substrings should be extracted according to the principle of the longest match.</Paragraph> <Paragraph position="11"> Fig. 1 Substrings to be Extracted Here, the 3rd condition means that when a string (for instance, α in Fig. 1) is extracted from a certain location within the source text, any substring (β, γ) that is included within the string (α) is not subject to extraction. 
But should such a substring (δ) be located in a separate or overlapping position, it is to be extracted.</Paragraph> <Paragraph position="12"> (2) Conventional Algorithm for N-gram Statistics Before discussing the algorithm that satisfies the previous conditions for uninterrupted collocational substrings, let us consider the algorithm proposed by Nagao and Mori for N-gram statistics.</Paragraph> <Paragraph position="13"> [Statistical Method for N-gram] Assume that the total number of characters in a source text (corpus) is N.</Paragraph> <Paragraph position="14"> Prepare PT-0 (Pointer Table-0) of N records of SP (Source Pointer), with the values 0, 1, 2, ..., i, ..., N-1. Here, the value i represents String-word i, which is the substring from position i to the last character (address N-1) in the source text.</Paragraph> <Paragraph position="15"> The records of PT-0 are sorted in the order of the corresponding String-words to obtain SPT-0 (Sorted Pointer Table-0).</Paragraph> <Paragraph position="16"> In the sorted table, the characters of the String-word of record i are compared from the beginning with those of the String-word of the next record i+1. The number of matched characters is registered in the NMC (Number of Matching Characters) field of record i.</Paragraph> <Paragraph position="17"> Comparing the NMC value of record i with that of record i+1 of the SPT from i=1 to i=N-1, substrings are extracted and their frequencies are determined *1.</Paragraph> <Paragraph position="18"> (3) Problems of N-gram Statistics Nagao and Mori's method obviously fulfills the requirements of Conditions 1 and 2, but not Condition 3. It might be expected that the accurate frequency of any substring α could be obtained by subtracting from its frequency the frequency of the other substring β which is included in substring α *2. Unfortunately, this does not satisfy Condition 3. 
By the time the extracted substring list has been compiled, information about the mutual relationships between the extracted substrings within the original text has been lost, rendering such calculations impossible.</Paragraph> <Paragraph position="19"> 3. Extraction of Uninterrupted Collocation 3.1 Invalidation of Extracted Substrings (1) Correlations between Extracted Substrings In order to satisfy the requirement of Condition 3, consider the extraction of an n-gram substring after extracting an m-gram substring. The problem arises when there is a certain overlap between them, as shown in Fig. 1.</Paragraph> <Paragraph position="20"> The Case of Absorbed Relation (Case 1) can be classified into three sub-cases as shown, but in every sub-case the m-gram substring is absorbed in the n-gram substring, and therefore there is no need to extract such an m-gram substring. Thus, when extracting n-gram strings, the related records of the SPT need to be invalidated so that the m-gram strings do not become involved in the processes that follow.</Paragraph> <Paragraph position="21"> The Case of Partially Joint Relation (Case 2) can be further classified into two sub-cases. In either situation, the m-gram string and the n-gram string merely overlap, and therefore they need to be extracted separately.</Paragraph> <Paragraph position="22"> (2) Necessity of Validity Check for String-words When one substring is extracted, the related records of the SPT must be checked for validity before the next substring is extracted, so that an absorbed string is not extracted from the same part of the source text where a substring was already extracted (Case 1).</Paragraph> <Paragraph position="23"> For example, after the substring of 6 characters in String-word 3 shown in Fig. 
3 was extracted, the substrings of String-words 3, 4, 5, ..., 8 need to be set as invalid for lengths equal to or less than 6, 5, 4, ..., 1 characters from the beginning, respectively.</Paragraph> <Section position="1" start_page="574" end_page="575" type="sub_section"> <SectionTitle> 3.2 Extracting Algorithm </SectionTitle> <Paragraph position="0"> Here, we propose an algorithm that satisfies the above conditions. The lengths of the substrings to be extracted are decided from the NMC and written in the NSC field of SPT-0. According to the method shown in 3.1(2), check the validity of the substrings pointed to by the records of PT-1 in the order of the record number and write the results in the VF field.</Paragraph> <Paragraph position="1"> Re-sort PT-1 in the order of the values of the SP fields to obtain SPT-1.</Paragraph> <Paragraph position="2"> By referring to SPT-1, the strings to be extracted are determined and their frequencies are calculated. An example of the algorithm is shown in Fig. 4. In this example, the conventional algorithm extracted 24 types of substrings with a total frequency of 72. In contrast, with the method proposed in this paper, these numbers are reduced to 5 and 10, respectively.</Paragraph> </Section> </Section> </Paper>
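<!--
The conventional N-gram statistics procedure described in Section 2 (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names sorted_pointer_table, matching_counts and ngram_counts are hypothetical, positions are 0-based, and the sort copies every String-word, so the sketch is only suitable for small inputs.

```python
def sorted_pointer_table(text):
    """SPT-0: start positions ordered by the String-word (suffix) they point to."""
    # Assumption: plain lexicographic order; slicing copies each suffix.
    return sorted(range(len(text)), key=lambda i: text[i:])


def matching_counts(text, spt):
    """NMC per record: characters shared with the String-word of the next record."""
    nmc = []
    for a, b in zip(spt, spt[1:]):
        n = 0
        for x, y in zip(text[a:], text[b:]):
            if x != y:
                break
            n += 1
        nmc.append(n)
    nmc.append(0)  # the last record has no successor
    return nmc


def ngram_counts(text, n):
    """Frequency of every n-character substring that occurs 2 or more times."""
    spt = sorted_pointer_table(text)
    nmc = matching_counts(text, spt)
    counts = {}
    run = 1  # records seen so far that share the same first n characters
    for k, sp in enumerate(spt):
        if nmc[k] >= n:
            run += 1  # the next String-word starts with the same n characters
        else:
            if run >= 2:
                counts[text[sp:sp + n]] = run
            run = 1
    return counts


if __name__ == "__main__":
    print(ngram_counts("abcabcab", 2))
    print(ngram_counts("abcabcab", 3))
```

For the text "abcabcab", ngram_counts with n = 2 gives ab: 3, bc: 2, ca: 2, and with n = 3 gives abc, bca and cab with a frequency of 2 each. Every fractional substring is counted here, which is the behaviour that Condition 3 is meant to restrain.
-->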