<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1061">
  <Title>Retrieving Collocations by Co-occurrences and Word Order Constraints</Title>
  <Section position="3" start_page="0" end_page="477" type="intro">
    <SectionTitle>
2 Algorithm
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="476" type="sub_section">
      <SectionTitle>
2.1 Extracting units of collocation
</SectionTitle>
      <Paragraph position="0"> Nagao and Mori (1994) developed a method to calculate the frequencies of strings composed of n characters (n-grams). Since this method generates all n-character strings that appear in a text, the output contains many fragments and useless expressions.</Paragraph>
      <Paragraph position="1"> For example, even if &quot;local&quot;, &quot;area&quot;, and &quot;network&quot; always appear as substrings of &quot;a local area network&quot; in a corpus, this method generates redundant strings such as &quot;a local&quot;, &quot;a local area&quot; and &quot;area network&quot;.</Paragraph>
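As a rough illustration of why exhaustive n-gram counting yields fragments, the following Python sketch counts every word n-gram up to a given length. It is a naive stand-in and not the method of Nagao and Mori (1994); the function name ngram_counts, the parameter max_n, and the toy sentences are assumptions made for illustration only.

```python
from collections import Counter


def ngram_counts(sentences, max_n):
    """Count every word n-gram of length 1..max_n in tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    return counts


# Toy corpus (hypothetical, not taken from the paper).
corpus = [
    "we use a local area network".split(),
    "install a local area network".split(),
]
counts = ngram_counts(corpus, 4)

# The full unit and its fragments receive the same count,
# so frequency alone cannot tell them apart.
for gram in [("a", "local", "area", "network"), ("a", "local"), ("area", "network")]:
    print(" ".join(gram), counts[gram])
```

Because the fragments are exactly as frequent as the unit that contains them, some further criterion is needed to separate the two.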
      <Paragraph position="2"> To filter out the fragments, we measure the distribution of the adjacent words preceding and following the strings, using an entropy threshold. This is based on the idea that adjacent words will be widely distributed if the string is meaningful, and localized if the string is a substring of a meaningful string. Taking the example above, the word that follows &quot;a local area&quot; is almost always &quot;network&quot;, because &quot;a local area&quot; is a substring of &quot;a local area network&quot; in the corpus. In contrast, the words that follow &quot;a local area network&quot; can hardly be predicted, because &quot;a local area network&quot; is a unit of expression and innumerable words can follow it. The distribution of adjacent words is therefore an effective criterion for judging whether a string is an appropriate unit.</Paragraph>
      <Paragraph position="3"> Footnote 1: A word is recognized as the minimum unit in a language such as English, where whitespace is used to delimit words, while a character is recognized as the minimum unit in languages such as Japanese and Chinese, which have no word delimiters. Although the method described in this paper is applicable to either kind of language, we have taken English as an example.</Paragraph>
      <Paragraph position="4"> We introduce the entropy value, which is a measure of disorder. Let the string be str, the adjacent words w_1 ... w_n, and the frequency of str freq(str). The probability of each possible adjacent word p(w_i) is then: $p(w_i) = \frac{\mathit{freq}(w_i)}{\mathit{freq}(\mathit{str})}$ (1) The entropy of str, H(str), is then defined as: $H(\mathit{str}) = -\sum_{i=1}^{n} p(w_i) \log p(w_i)$ (2)</Paragraph>
      <Paragraph position="6"> H(str) takes the highest value if n = freq(str) and p(w_i) = 1/n for all w_i, and it takes the lowest value 0 if n = 1 and p(w_i) = 1. Calculating the entropy on both sides of the string, we adopt the lower one as the entropy of the string. The string str is accepted only if the following inequality is satisfied:</Paragraph>
      <Paragraph position="8"> Fragmentary strings such as &quot;a local&quot; and &quot;area network&quot; are filtered out by this procedure because their entropy values are expected to be small.</Paragraph>
      <Paragraph position="9"> Most of the strings extracted in this stage are meaningful units such as compound words, prepositional phrases, and idiomatic expressions. These strings are uninterrupted collocations in themselves, and they are also used in the next stage to construct collocations. This method is useful for languages without word delimiters, and for other languages as well.</Paragraph>
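The entropy filter of this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the base-2 logarithm, the BOUNDARY marker used when a string touches a sentence edge, the threshold value T_ENTROPY, the strict comparison against it, and the toy corpus are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

BOUNDARY = "#"    # hypothetical marker for a sentence edge (an assumption)
T_ENTROPY = 1.0   # hypothetical threshold value (an assumption)


def entropy(neighbors):
    """Entropy of the adjacent-word distribution, one entry per occurrence."""
    total = len(neighbors)
    counts = Counter(neighbors)
    # p * log2(1/p) summed over the distinct adjacent words
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def string_entropy(sentences, words):
    """Lower of the left-side and right-side entropies of a word string."""
    words = list(words)
    n = len(words)
    left, right = [], []
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            if sent[i:i + n] == words:
                left.append(sent[i - 1] if i > 0 else BOUNDARY)
                right.append(sent[i + n] if len(sent) > i + n else BOUNDARY)
    if not left:  # the string never occurs
        return 0.0
    return min(entropy(left), entropy(right))


# Toy corpus (hypothetical, not taken from the paper).
corpus = [s.split() for s in [
    "we use a local area network here",
    "install a local area network today",
    "a local area network is fast",
    "they built a local area network quickly",
]]
candidates = ["a local", "a local area", "area network", "a local area network"]
accepted = [c for c in candidates if string_entropy(corpus, c.split()) > T_ENTROPY]
print(accepted)
```

Taking the minimum of the two sides means that a string is rejected as soon as either its left or its right context is nearly fixed, which is what marks it as a fragment of a longer unit.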
    </Section>
    <Section position="2" start_page="476" end_page="477" type="sub_section">
      <SectionTitle>
2.2 Extracting collocations
</SectionTitle>
      <Paragraph position="0"> Using each string derived in the previous stage as a starting point, this stage extracts the strings that frequently co-occur with it and assembles them into a collocation. This is based on the idea that a collocation is induced by a particular string, which we hereafter call the &quot;key string&quot;. The procedure for retrieving a collocation is as follows:
 1. Take a key string str_k from the strings str_i (i = 1...n), and retrieve the sentences containing str_k from the corpus.</Paragraph>
      <Paragraph position="1"> 2. Examine how often each possible combination of str_k and str_i co-occurs, and extract str_i if the frequency exceeds a given threshold T_freq.</Paragraph>
      <Paragraph position="2"> 3. Examine every two strings str_i and str_j and refine them by applying the following steps alternately:
 * Combine str_i and str_j when they overlap or adjoin each other and the following inequality is satisfied: $\frac{\mathit{freq}(\mathit{str}_i, \mathit{str}_j)}{\mathit{freq}(\mathit{str}_i)} &gt; T_{ratio}$ (4)
 * Filter out str_i if str_j subsumes str_i and the following inequality is satisfied: $\frac{\mathit{freq}(\mathit{str}_j)}{\mathit{freq}(\mathit{str}_i)} &gt; T_{ratio}$ (5)
 4. Construct a collocation by arranging the strings str_i in accordance with the word order in the corpus.</Paragraph>
      <Paragraph position="3"> The second and third steps narrow the strings down to the units of the collocation. Through these steps, only the strings that co-occur significantly with the key string str_k are extracted.</Paragraph>
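The first two steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions: co-occurrence is counted once per sentence, candidate strings are represented as tuples of words, and the names contains and cooccurring, the lower-cased toy sentences, and the candidate list are hypothetical and not taken from the paper.

```python
def contains(sent, words):
    """True if the word sequence occurs contiguously in the tokenized sentence."""
    words = list(words)
    n = len(words)
    return any(sent[i:i + n] == words for i in range(len(sent) - n + 1))


def cooccurring(sentences, key, candidates, t_freq):
    """Steps 1 and 2: keep candidates whose co-occurrence count exceeds t_freq."""
    hits = [s for s in sentences if contains(s, key)]  # step 1: retrieve sentences
    result = {}
    for cand in candidates:
        if cand == key:
            continue
        freq = sum(1 for s in hits if contains(s, cand))
        if freq > t_freq:  # step 2: frequency must exceed the threshold
            result[cand] = freq
    return result


# Toy sentences (hypothetical, lower-cased for simplicity).
sentences = [s.split() for s in [
    "refer to the appropriate manual for instructions on setup",
    "refer to the manual for specific instructions",
    "refer to the installation manual for specific instructions for repair",
    "refer to the manual for specific instructions on safety",
    "see the manual",
]]
candidates = [("the",), ("manual",), ("for", "specific", "instructions"),
              ("appropriate",), ("on",)]
kept = cooccurring(sentences, ("refer", "to"), candidates, t_freq=2)
print(kept)
```

Counting per sentence rather than per occurrence is one possible reading of the co-occurrence frequency; a per-occurrence count would need only a change to the inner sum.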
      <Paragraph position="4"> The second step eliminates the strings that are not frequent enough. Consider the example in Figure 1, which lists the retrieved sentences containing the key string &quot;Refer to&quot;; each underlined string corresponds to a string str_i. With the frequency threshold T_freq set to 2, the strings that co-occur with str_k more than twice are extracted in the second step. Table 1 shows the result of this step. Although it is a very simple technique, almost all the useless strings are excluded by this step.
 [Table 1: str_i, freq(str_k, str_i)]
 The third step reorganizes the strings into the optimum units for the specific context. This is based on the idea that a longer string is more significant as a unit of collocation if it is frequent enough. Assuming that the threshold T_ratio is 0.75, the string &quot;manual for specific instructions&quot; is first produced because inequality (4) is satisfied. Next, &quot;manual&quot; and &quot;for specific instructions&quot; are deleted because inequality (5) is satisfied. This process is repeated until no string satisfies the inequalities. Table 2 shows the result of this step.</Paragraph>
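The two ratio tests of the third step share one form, which the following Python sketch makes explicit. It is only an illustration: the function names are hypothetical, the frequency counts in the usage lines are invented, and the guard against a zero denominator is an added assumption.

```python
def exceeds_ratio(numerator_freq, freq_i, t_ratio):
    """True when numerator_freq / freq_i is strictly greater than t_ratio."""
    if freq_i == 0:  # guard added for the sketch; an assumption
        return False
    return numerator_freq / freq_i > t_ratio


def should_combine(freq_joint, freq_i, t_ratio):
    """Inequality (4): freq(str_i, str_j) / freq(str_i) exceeds T_ratio."""
    return exceeds_ratio(freq_joint, freq_i, t_ratio)


def should_filter(freq_j, freq_i, t_ratio):
    """Inequality (5): freq(str_j) / freq(str_i) exceeds T_ratio."""
    return exceeds_ratio(freq_j, freq_i, t_ratio)


# Invented counts: two strings adjoin in 4 of the 5 occurrences of str_i.
print(should_combine(freq_joint=4, freq_i=5, t_ratio=0.75))
# Invented counts: the longer string covers 4 of the 5 occurrences of the shorter one.
print(should_filter(freq_j=4, freq_i=5, t_ratio=0.75))
```

Both tests are strict, so a ratio exactly equal to the threshold does not trigger a merge or a deletion.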
      <Paragraph position="5"> The fourth step constructs a collocation by arranging the strings in accordance with the word order in the sentences retrieved in the first step. Taking str_i in order of frequency, this step determines where str_i is placed in the collocation.
 [Figure 1: Refer to the appropriate manual for instructions on ... / Refer to the manual for specific instructions. / Refer to the installation manual for specific instructions for ... / Refer to the manual for specific in...]
 In this example, the position of &quot;the&quot; is examined first. According to the sentences shown in Figure 1, &quot;the&quot; is always placed next to &quot;Refer to&quot;, so its position is determined to follow &quot;Refer to&quot;. Next, the position of &quot;manual for specific instructions&quot; is examined, and it is determined to follow a gap placed after &quot;Refer to the&quot;. Finally, the following collocation is produced: &quot;Refer to the ... manual for specific instructions on ...&quot; The broken lines in the collocation indicate the gaps that can be filled with any substitutable words or phrases. In the example, &quot;appropriate&quot; or &quot;installation&quot; fills the first gap. Thus, we retrieve an interrupted or uninterrupted collocation of arbitrary length induced by the key string. This procedure is performed for each string obtained in the previous stage. By changing the thresholds, various levels of collocations are retrieved.</Paragraph>
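The arrangement step can be approximated by the Python sketch below. It is a simplification and not the procedure described above: instead of placing strings one at a time in order of frequency, it orders the units by their average start position and inserts a gap wherever two neighbouring units are not adjacent in every sentence. The names find and arrange, the use of the first occurrence of each unit, and the lower-cased toy sentences are assumptions.

```python
def find(sent, words):
    """Start index of the first contiguous occurrence of words, or -1."""
    words = list(words)
    n = len(words)
    for i in range(len(sent) - n + 1):
        if sent[i:i + n] == words:
            return i
    return -1


def arrange(sentences, units):
    """Order units by word position and mark non-adjacent neighbours with a gap."""
    if not units:
        return ""
    rows = []
    for sent in sentences:
        pos = [find(sent, u) for u in units]
        if min(pos) >= 0:  # use only sentences that contain every unit
            rows.append(pos)
    if not rows:
        return ""
    # Sort unit indices by their average start position across sentences.
    order = sorted(range(len(units)),
                   key=lambda j: sum(r[j] for r in rows) / len(rows))
    parts = [" ".join(units[order[0]])]
    for a, b in zip(order, order[1:]):
        adjacent = all(r[a] + len(units[a]) == r[b] for r in rows)
        if not adjacent:
            parts.append("...")  # a gap for substitutable words
        parts.append(" ".join(units[b]))
    return " ".join(parts)


# Toy sentences (hypothetical, lower-cased for simplicity).
sentences = [s.split() for s in [
    "refer to the appropriate manual for specific instructions",
    "refer to the installation manual for specific instructions",
    "refer to the manual for specific instructions",
]]
units = [("refer", "to"), ("manual", "for", "specific", "instructions"), ("the",)]
print(arrange(sentences, units))
```

A gap is emitted even when the units are adjacent in some sentences, which matches the idea that the gap may be filled by zero or more substitutable words.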
    </Section>
  </Section>
</Paper>