<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1097">
  <Title>A Statistical Method for Extracting Uninterrupted and Interrupted Collocations from Very Large Corpora</Title>
  <Section position="2" start_page="575" end_page="575" type="metho">
    <SectionTitle>
4. Extraction of Interrupted Collocation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="575" end_page="575" type="sub_section">
      <SectionTitle>
4.1 Conditions for Extraction
</SectionTitle>
      <Paragraph position="0"> Here we consider combinations of two or more uninterrupted collocational substrings that occur at different locations within a single sentence, together with a method for determining their frequency. In this setting, the boundary conditions of sentences and the mutual relationships between the extracted substrings need to be considered.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="575" end_page="575" type="metho">
    <SectionTitle>
(1) Boundary Conditions of Sentences
</SectionTitle>
    <Paragraph position="0"> When considering the collocation of substrings within a sentence, combinations of expressions that span sentence boundaries need to be excluded. However, when a single sentence includes other sentences, extracting combinations in units of sentences becomes complicated.</Paragraph>
    <Paragraph position="1"> To simplify matters, we first assume that substrings containing any kind of punctuation mark are not extracted in the uninterrupted collocation extraction procedure. This is easily achieved by stopping the comparison procedure in Procedure 3 when a punctuation mark is found. Second, we assume that when a left quote character is found within a sentence, all characters are ignored up to the right quote character that pairs with it.</Paragraph>
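    <Paragraph> The two simplifying assumptions above can be sketched in Python as follows. This is only an illustration, not the authors' Procedure 3: the punctuation set, the quote pairs, and the function names strip_quoted and comparable_segments are assumptions introduced here. The sketch also treats the text on either side of a removed quotation as contiguous, which the paper does not specify.

```python
import re

# Hypothetical punctuation and quote sets; the paper does not list them.
PUNCTUATION = "、。，．,.!?！？;:；："
QUOTE_PAIRS = {"「": "」", "『": "』", "“": "”"}


def strip_quoted(sentence: str) -> str:
    """Drop everything from a left quote through its matching right quote."""
    out = []
    closing = None  # right quote we are currently waiting for, if any
    for ch in sentence:
        if closing is not None:
            if ch == closing:
                closing = None
            continue  # characters inside the quotation are ignored
        if ch in QUOTE_PAIRS:
            closing = QUOTE_PAIRS[ch]
            continue
        out.append(ch)
    return "".join(out)


def comparable_segments(sentence: str) -> list[str]:
    """Split at punctuation so no candidate substring can contain a mark."""
    pattern = "[" + re.escape(PUNCTUATION) + "]"
    return [seg for seg in re.split(pattern, strip_quoted(sentence)) if seg]
```

 Under these assumptions, substring comparison would run only inside each returned segment, so no extracted substring can contain a punctuation mark or quoted material. An unmatched left quote causes the rest of the sentence to be ignored.</Paragraph>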
    <Paragraph position="2"> (2) Relationships between Extracted Substrings. In the extraction of interrupted collocations, substrings that are linked to or partially overlap one another are excluded from the scope of extraction. Consider substrings α and β that have been extracted from the same sentence.</Paragraph>
    <Paragraph position="3"> Their relative positions fall into one of the three cases shown in Fig. 3. Case (c), in which substrings α and β are separate from one another, is a case for extracting interrupted collocations; Cases (a) and (b) are not*3.</Paragraph>
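    <Paragraph> The positional test can be sketched as a small classifier over character spans. This is an illustrative sketch: the half-open (start, end) span representation, the function name relation, and the reading of "linked" as "directly adjacent" are assumptions, and only the "separate" outcome is tied to a specific case of Fig. 3, namely Case (c).

```python
def relation(a: tuple[int, int], b: tuple[int, int]) -> str:
    """Classify two half-open character spans (start, end) in one sentence."""
    first, second = sorted([a, b])  # order the spans by start position
    if second[0] > first[1]:
        return "separate"      # at least one character lies between them
    if second[0] == first[1]:
        return "linked"        # the second span starts where the first ends
    return "overlapping"       # the spans share characters (or one contains the other)
```

 Only pairs classified as "separate" would be candidates for an interrupted collocation under this reading.</Paragraph>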
    <Paragraph position="4">  (3) Order of Substring Appearance. When extracting interrupted collocations, the order in which the substrings appear should be considered. Hence, collocational substrings are extracted and counted with attention to the order of appearance of each substring.</Paragraph>
    <Paragraph position="5">  extracted in Chapter 3 in the order of extraction. These numbers are registered in the NES (Number of Extracted Substrings) field of the respective record in SPT-1.</Paragraph>
    <Paragraph position="6">  SPT-1 is sorted into the original order of the values of the ST' fields.</Paragraph>
    <Paragraph position="7"> Procedure 10: Numbering of the sentences. An SN (Sentence Number) field is added to hold the number of the original sentence to which each record belongs.</Paragraph>
    <Paragraph position="8"> Procedure 11: Table condensation. The table obtained is condensed by the following steps to obtain SPT-2.</Paragraph>
    <Paragraph position="9">  (1) All fields other than the four fields Sentence Number, ESN, NSC and RN are deleted.</Paragraph>
    <Paragraph position="10"> (2) All records with no value in the NES field are deleted.  Here, k is the number of substrings that compose an interrupted collocational expression. All combinations of k NESs in each sentence are then written to a file and sorted, and the occurrences of each identical combination of NESs are counted.</Paragraph>
    <Paragraph position="11"> Thus, the list of substrings forming interrupted collocations is obtained. If the sentence number is attached to every NES combination, the sentences corresponding to each extracted interrupted collocation can easily be identified.</Paragraph>
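    <Paragraph> The counting step can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: records are modeled as (sentence_number, nes) pairs in text order, the function name count_interrupted is hypothetical, and an in-memory Counter replaces the write-to-file, sort, and count procedure described above. The exclusion of linked or overlapping substrings is not modeled here.

```python
from collections import Counter
from itertools import combinations


def count_interrupted(records, k=2):
    """records: iterable of (sentence_number, nes) in text order; nes may be None."""
    by_sentence = {}
    for sn, nes in records:
        if nes is not None:  # condensation: drop records with no NES value
            by_sentence.setdefault(sn, []).append(nes)
    counts = Counter()  # frequency of each ordered NES combination
    where = {}          # sentence numbers in which each combination occurs
    for sn, nes_list in by_sentence.items():
        # combinations() keeps input order, so each tuple follows the
        # order of appearance of the substrings within the sentence.
        for combo in combinations(nes_list, k):
            counts[combo] += 1
            where.setdefault(combo, []).append(sn)
    return counts, where
```

 For example, with records = [(1, 3), (1, None), (1, 5), (2, 3), (2, 5), (3, 5), (3, 3)], the combination (3, 5) is counted twice (sentences 1 and 2) and (5, 3) once (sentence 3), so order of appearance is respected and each combination can be traced back to its sentences.</Paragraph>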
    <Paragraph position="12"> The lower part of Fig. 4 shows the application of this method for k=2. In this case, there are 25 possible combinations of the 5 types of uninterrupted collocational substrings obtained in Chapter 3. Of these, 7 combinations were extracted as combinations that collocate twice or more within the same sentence, and their total frequency amounts to 14.</Paragraph>
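    <Paragraph> One reading consistent with the figure of 25 is that ordered pairs are counted, including a substring paired with another occurrence of itself, which gives 5 × 5 possibilities. This interpretation is an assumption, and the labels below are hypothetical placeholders for the 5 substring types.

```python
from itertools import product

types = ["s1", "s2", "s3", "s4", "s5"]  # hypothetical labels for the 5 substring types
ordered_pairs = list(product(types, repeat=2))  # (first, second) in order of appearance
print(len(ordered_pairs))  # 25
```
</Paragraph>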
  </Section>
  <Section position="4" start_page="575" end_page="577" type="metho">
    <SectionTitle>
5. Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="575" end_page="577" type="sub_section">
      <SectionTitle>
5.1 Uninterrupted Collocational Substrings
</SectionTitle>
      <Paragraph position="0"> Applying the proposed method to three months of newspaper articles from the Nikkei Industrial News (8.92 million characters), uninterrupted and interrupted collocational substrings were extracted. In this experiment, a XEROX ARGOSS 5270 (OS4.1.3) with a memory capacity of 48 MB was used.</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"> *3 In case (a), there may be a combination of substrings that is regarded as an interrupted collocation. However, the frequency of such a pair is limited to 1, so it need not be considered.</Paragraph>
      <Paragraph position="5">  (1) Characteristics of Extracted Substrings. From the viewpoint of length and frequency, the numbers of extracted substrings are compared with those of the N-gram method and summarized in Table 1 and  7.63% 1.76% 3.32% 0.78% 1.35% 0.36% 0.45% 0.17% 0.25% 0.12% 0.14% 0.07%  From these results, the following observations can be made.</Paragraph>
      <Paragraph position="6"> ① Compared with the N-gram method, most fractional substrings have been deleted, and both the number of types and the total number of extracted substrings are greatly reduced. For example, in the extraction of substrings with a length of 2 or more and a frequency of 2 or more, the number of substring types was reduced to 22.2% and their total frequency to 8.38%. This effect grows as the substring length increases. For substrings of 20 or more characters, these numbers were reduced to 1%. ② Most of the substrings extracted by the proposed method form expressions that are syntactic or semantic units, and there are few fractional substrings. [Table 3 Examples of Extracted Substrings: the Japanese entries are not recoverable; surviving glosses are (it seems to do ~), (82 Japan shop), (wonder if ~ do ~), (Washington 19); surviving frequencies are (16) and (14); 21,155 types, total 47,336 times.]</Paragraph>
      <Paragraph position="7"> (2) Processing Time. It took about 40 hours to build SPT-0*4, but the subsequent processes ran very quickly (within one hour).</Paragraph>
    </Section>
    <Section position="2" start_page="577" end_page="577" type="sub_section">
      <SectionTitle>
5.2 Interrupted Collocational Substrings
(1) Characteristics of Extracted Substrings
</SectionTitle>
      <Paragraph position="0"> Interrupted collocational substrings were extracted for every pair of substrings that appeared 10 or more times in the source text*5. The results are shown in Table 5.</Paragraph>
      <Paragraph position="1"> Examples of substrings with high frequency and with a large total number of characters are shown in Table 6.</Paragraph>
      <Paragraph position="2"> Table 5 Number of Extracted Pairs of Substrings</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="577" end_page="577" type="metho">
    <SectionTitle>
Results
</SectionTitle>
    <Paragraph position="0"> These results also show that expressions typical of newspapers have been extracted. Thus, using the output, interrupted collocational expressions can be obtained as easily as uninterrupted ones.</Paragraph>
    <Paragraph position="1">  [Table of example substrings: the Japanese entries are not recoverable. Surviving frequencies, first group: 586, 512, 436, 325, 324, 315; glosses in order: (to say that), (said that), (set as), (again), (is that), (photography). Second group: 302, 283, 281, 278, 277, 274, 269; glosses in order: (but), (said that), (Tokyo), (Price), (EC), (however), (Point).]  In interrupted collocational substring extraction, the processing time depends heavily on the number of component substrings. In this experiment, the turnaround time was 1 or 2 hours when the components of the collocations to be extracted were limited to substrings with a frequency of 10 or more.</Paragraph>
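    <Paragraph> A rough way to see why the number of component substrings matters, assuming every pair of component substrings within a sentence is enumerated as described in Section 4: the number of pairs per sentence grows quadratically with the number of components. The function name and the sample counts below are hypothetical and are not taken from the experiment.

```python
from math import comb


def pairs_per_sentence(n: int) -> int:
    """Number of 2-combinations among n component substrings in one sentence."""
    return comb(n, 2)


for n in (5, 10, 20):
    print(n, pairs_per_sentence(n))  # prints 5 10, then 10 45, then 20 190
```

 Raising the frequency threshold reduces the number of components per sentence, which under this assumption reduces the number of pairs to be written out and sorted.</Paragraph>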
  </Section>
</Paper>