<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1061">
  <Title>Retrieving Collocations by Co-occurrences and Word Order Constraints</Title>
  <Section position="4" start_page="477" end_page="478" type="metho">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> We performed an experiment to evaluate the algorithm. The corpus used in the experiment is a computer manual written in English, comprising 1,311,522 words in 120,240 sentences.</Paragraph>
    <Paragraph position="1"> In the first stage of the method, 167,387 strings are produced. Among them, 650, 1,950, and 6,774 strings are extracted at the entropy thresholds 2, 1.5, and 1, respectively. Of the 650 strings whose entropy is greater than 2, 162 (24.9%) are complete sentences, 297 (45.7%) are regarded as grammatically appropriate units, and 114 (17.5%) are regarded as meaningful units even though they are not grammatical. The precision of the first stage is therefore 88.1%.</Paragraph>
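As an illustration of the first stage's entropy criterion, the following sketch computes the entropy of the distribution of words that immediately follow a candidate string. The helper name and extraction details are our own; the paper's exact first-stage procedure is not shown in this excerpt.

```python
import math
from collections import Counter

def adjacent_entropy(corpus_tokens, string_tokens):
    """Entropy of the distribution of words that immediately follow
    string_tokens in the corpus (illustrative sketch, not the paper's
    exact procedure)."""
    n = len(string_tokens)
    followers = Counter(
        corpus_tokens[i + n]
        for i in range(len(corpus_tokens) - n)
        if corpus_tokens[i:i + n] == string_tokens
    )
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in followers.values())
```

A string followed by many different words (high entropy) behaves as a cohesive unit, which is the intuition behind the thresholds of 2, 1.5, and 1 used above.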
    <Paragraph position="2"> Table 3 shows the top 20 strings in order of entropy value. They are quite representative of the given domain. Most of them are technical terms related to computers and typical expressions used in manual descriptions, although they vary in their constructions. It is interesting to note that strings which are not grammatical units also take high entropy values. Some of them contain punctuation, and some of them terminate in articles. Punctuation marks and function words in the strings are useful for recognizing how the strings are used in a corpus. Table 4 illustrates how the entropy changes with string length. The third column in the table shows the number of kinds of adjacent words that follow each string. The table shows that ungrammatical strings such as &amp;quot;For more information on&amp;quot; and &amp;quot;For more information, refer to&amp;quot; act more cohesively in the corpus than the grammatical string &amp;quot;For more information&amp;quot;. In fact, the former strings are more useful for constructing collocations in the second stage.</Paragraph>
    <Paragraph position="3"> In the second stage, we extracted collocations from the 411 key strings retrieved in the first stage (297 grammatical units and 114 meaningful units). The necessary thresholds are given by the following set of equations:</Paragraph>
    <Paragraph position="5"> As a result, 269 combinations of units are retrieved as collocations. Note that collocations are not generated from all the key strings, because some of them are uninterrupted collocations in themselves, like No. 2 in Table 3.</Paragraph>
    <Paragraph position="6"> Evaluation is done by human judgment, and 180 collocations are regarded as meaningful. The precision is 43.8% (180/411) when the number of meaningful collocations is divided by the number of key strings, and 66.9% (180/269) when it is divided by the number of collocations retrieved in the second stage.</Paragraph>
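The second-stage pairing of a key string with co-occurring words, keeping their order relative to the key, can be sketched as follows. This is an assumed form: the paper's actual thresholds are determined by its own equations, and `min_count` here is a stand-in illustrative cutoff.

```python
from collections import Counter

def find_sub(seq, sub):
    """Index of the first occurrence of token sub-sequence sub in seq, or -1."""
    n = len(sub)
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1

def cooccurrences(sentences, key, min_count=2):
    """Count words co-occurring with a key string in the same sentence,
    tagged with their position before or after the key (simplified sketch;
    the paper's thresholds come from equations omitted from this excerpt)."""
    counts = Counter()
    for sent in sentences:
        i = find_sub(sent, key)
        if i >= 0:
            for j, w in enumerate(sent):
                if j < i:
                    counts[(w, "before")] += 1
                elif j >= i + len(key):
                    counts[(w, "after")] += 1
    return Counter({k: c for k, c in counts.items() if c >= min_count})
```

Candidates that clear the frequency threshold on one side of the key are the raw material for the retrieved combinations of units.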
    <Paragraph position="7"> Table 5 shows the collocations extracted with the underlined key strings. The table indicates that collocations of arbitrary length, frequently used in computer manuals, are retrieved by the method. As the method focuses on the co-occurrence of strings, most of the collocations are specific to the given domain. Common collocations tend to be ignored because they are not used repeatedly in a single text. This is not a serious problem, however, because common collocations are limited in number and we can efficiently obtain them from dictionaries or by human reflection.</Paragraph>
    <Paragraph position="8"> Nos. 7 and 8 in Table 5 are examples of invalid collocations. They contain unnecessary strings such as &amp;quot;to a&amp;quot; and &amp;quot;, the&amp;quot;. The majority of invalid collocations are of this type. One possible solution is to eliminate unnecessary strings at the second stage. Most of the unnecessary strings consist only of punctuation marks and function words; therefore, filtering out such strings should reduce the number of invalid collocations the method produces.</Paragraph>
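The proposed filter can be sketched as a predicate over tokenized strings. The function-word list below is our own illustrative subset; the paper does not enumerate its inventory.

```python
import string

# Hypothetical function-word list (illustrative subset, not from the paper).
FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "and", "or"}

def is_noise(tokens):
    """True when every token is punctuation or a function word, so the
    string carries no content and should be filtered out before strings
    are combined into collocations in the second stage."""
    return all(
        t.lower() in FUNCTION_WORDS
        or all(ch in string.punctuation for ch in t)
        for t in tokens
    )
```

Strings such as "to a" and ", the" are rejected by this predicate, while content-bearing strings like "refer to" pass through.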
    <Paragraph position="9"> Figure 2 summarizes the result of the evaluation.</Paragraph>
    <Paragraph position="10"> In the experiment, 573 strings are retrieved as appropriate units of collocations and 180 combinations of units are retrieved as appropriate collocations. Precision is 88.1% in the first stage, and 66.9% in the second stage.</Paragraph>
    <Paragraph position="11">  Although evaluation of retrieval systems is usually performed with precision and recall, we could not measure recall in this experiment. It is difficult to determine how many collocations a corpus contains, because the count depends heavily on the domain and the application considered. As an alternative way to evaluate the algorithm, we plan to apply the retrieved collocations to a machine translation system and evaluate how they contribute to the quality of translation.</Paragraph>
  </Section>
  <Section position="5" start_page="478" end_page="478" type="metho">
    <SectionTitle>
4 Related work
</SectionTitle>
    <Paragraph position="0"> Algorithms for retrieving collocations have been described by (Smadja, 1993) and (Haruno et al., 1996).</Paragraph>
    <Paragraph position="1"> (Smadja, 1993) proposed a method to retrieve collocations by combining bigrams whose co-occurrence is greater than a given threshold. In that approach, bigrams are valid only when there are fewer than five words between them. This is based on the assumption that &amp;quot;most of the lexical relations involving a word w can be retrieved by examining the neighborhood of w wherever it occurs, within a span of five (-5 and +5 around w) words.&amp;quot; While the assumption is reasonable for some languages such as English, it cannot be applied to all languages, especially those without word delimiters.</Paragraph>
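The neighborhood assumption quoted above can be sketched as a windowed pair count. This is a simplified illustration of the windowing idea only, not Smadja's full Xtract system.

```python
from collections import Counter

def windowed_pairs(tokens, span=5):
    """Count co-occurrences of ordered word pairs within a window of
    +/-span words, illustrating the five-word neighborhood assumption
    (simplified sketch, not Smadja's full system)."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts
```

Pairs separated by more than `span` words are never counted, which is precisely the limitation discussed above for languages without word delimiters.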
    <Paragraph position="2"> (Haruno et al., 1996) constructed collocations by iteratively combining pairs of strings with high mutual information. However, mutual information is underestimated when the cohesiveness of the two strings differs greatly. Take &amp;quot;in spite (of)&amp;quot;, for example. Despite the fact that &amp;quot;spite&amp;quot; is frequently used with &amp;quot;in&amp;quot;, the mutual information between &amp;quot;in&amp;quot; and &amp;quot;spite&amp;quot; is small because &amp;quot;in&amp;quot; is used in various ways. Thus the method may miss significant collocations even when one of the strings has strong cohesiveness.</Paragraph>
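The asymmetry can be made concrete with pointwise mutual information. The probabilities below are invented purely for illustration: "spite" essentially always follows "in" (its joint probability equals its marginal), yet because "in" is very common, the pair's PMI falls far below that of a rare but equally cohesive pair.

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log2( p(x,y) / (p(x) * p(y)) )."""
    return math.log2(p_xy / (p_x * p_y))

# Invented illustrative probabilities, not corpus estimates.
pmi_in_spite = pmi(1e-5, 0.02, 1e-5)   # "in" common: PMI = log2(50), about 5.6
pmi_rare_pair = pmi(1e-4, 1e-4, 1e-4)  # both rare:  PMI = log2(1e4), about 13.3
```

Even though "spite" is perfectly predictive of "in" here, its PMI is less than half that of the rare pair, which is the underestimation the text describes.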
    <Paragraph position="3"> In contrast to these methods, our method focuses on the distribution of adjacent words (or characters) when retrieving units of collocation, and on the co-occurrence frequencies and word order between a key string and other strings when retrieving collocations. Through the method, various kinds of collocations induced by key strings are retrieved regardless of the number of units or the distance between units in a collocation. Another distinction is that our method does not require any lexical knowledge or language-dependent information such as part of speech. Owing to this, the method has good applicability to many languages.</Paragraph>
  </Section>
</Paper>