<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0309">
  <Title>Extraction of V-N-Collocations from Text Corpora: A Feasibility Study for German</Title>
  <Section position="5" start_page="77" end_page="81" type="evalu">
    <SectionTitle>
4 Evaluation of the Results
</SectionTitle>
    <Paragraph position="0"> Below, the top bigrams with kommen (come) are shown, together with some of the nonsignificant ones (t &lt; 1.65), to illustrate MI and t-scores. Bigrams with the infinitive form give the best results compared to other inflection forms, possibly because this form covers the 1st/3rd pers. pl. present tense, the infinitive, and the nonfinite main verb of complex tenses (modals, conditional, future) at the same time. Also, the latter two always occur in verb-final position.</Paragraph>
    <Paragraph position="1"> N + kommen | Translation</Paragraph>
    <Paragraph position="2"> (zur) Geltung k. | show to advantage</Paragraph>
    <Paragraph position="3"> (in) Betracht k. | to be considered</Paragraph>
    <Paragraph position="4"> (in) Berührung k. | come into contact</Paragraph>
    <Paragraph position="5"> (zur) Anwendung k. | to be used</Paragraph>
    <Paragraph position="6"> (zu) Tränen k. | come to tears</Paragraph>
    <Paragraph position="7"> (zur) Ruhe k. | get some peace</Paragraph>
    <Paragraph position="8"> (auf den) Gedanken k. | get the idea</Paragraph>
    <Paragraph position="9"> (in den) Himmel k. | go to heaven</Paragraph>
    <Paragraph position="10"> (zu) Hilfe k. | come to aid</Paragraph>
    <Paragraph position="11"> (zu) Wort k. | get a chance to speak</Paragraph>
    <Paragraph position="12"> Vernunft | reason</Paragraph>
    <Paragraph position="13"> (in) Frage k. | to be possible</Paragraph>
    <Paragraph position="14"> (zur) Welt k. | to be born</Paragraph>
    <Section position="1" start_page="77" end_page="78" type="sub_section">
      <SectionTitle>
4.1 Precision and Recall
</SectionTitle>
      <Paragraph position="0"> The question of how much can be extracted fully automatically can be answered by evaluating the precision and 'recall' of the described method, as is done for memory tests.</Paragraph>
      <Paragraph position="1"> Following Smadja (1991a), we define precision as the number of correctly found collocations divided by the total number of V-N combinations found. Recall is the ratio of the number of correctly found collocations to the maximal number of collocations that could possibly have been found. The latter is difficult to determine, because in principle it requires knowing the total number of collocations occurring in the whole corpus. Another possibility, taking all collocations mentioned in a dictionary as the maximal number of valid collocations, had to be discarded: a comparison with Agricola (1970) or Drosdowski (1970) is not really possible because the collocations found in the corpus are not a subset of those mentioned in the dictionaries. Only 22 of the 43 collocations found with the lemma bring- in the MK1 (BI6) belong to the 135 combinations mentioned in the lexical entry for bringen in Agricola (1970).</Paragraph>
      <Paragraph position="2"> Of the remaining 21 in the MK1, 9 can be found in the corresponding noun entries, and 12 do not appear at all though they are 'significant' collocations, e.g. Klarheit bringen (clarify), zur Entfaltung br. (develop), zur Wirkung br. (bring the effect), in Schwierigkeiten br. (create difficulties), ins Gespräch br. (bring into discussion). Thus, we decided to use instead the number of collocations with the infinitive as determined by the standard method (BI6) as the basis for recall comparisons, i.e. 100% recall is set to this number.</Paragraph>
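      <Paragraph position="3"> The two measures defined above can be sketched as follows. This is only an illustration, not the authors' implementation: the function names are hypothetical, and the example figures for infinitive bringen (31 collocations among 46 extracted combinations) are the ones reported in section 4.5.

```python
def precision(correct, extracted):
    """Share of extracted V-N combinations that are true collocations."""
    return correct / extracted


def recall(correct, baseline_correct):
    """Collocations found, relative to the BI6 infinitive baseline.

    The baseline is defined as 100% recall, so a method that finds more
    collocations than the baseline yields a value above 1.0.
    """
    return correct / baseline_correct


# Infinitive bringen, 6-word window (BI6), MK1 corpus only
print(f"precision = {precision(31, 46):.1%}")
print(f"recall    = {recall(31, 31):.1%}")
```

Because recall is measured against a baseline run rather than against the unknown total number of collocations in the corpus, it is a relative figure and can exceed 100%.</Paragraph>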
    </Section>
    <Section position="2" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
4.2 Results of the Standard Method
</SectionTitle>
      <Paragraph position="0"> Frequencies for the infinitives of the 16 verbs range from 832 (kommen) to 117 (gelangen). The number of V-N combinations varies from 46 (bringen) to 6 (erfahren, gelangen, geraten, treten), precision from 100% (geraten, ziehen) to 33% (erfahren). Average figures are presented in table 1 below, labeled BI6 Inf. If non-significant combinations are omitted with a t-test (BI6/t Inf), the average number of collocations among the extracted V-N combinations is only 95.8% of those found without a significance boundary, but precision rises slightly. With a threshold of MI &gt; 6, precision would go up to 82.1% with a still acceptable loss in recall of approximately 10%.</Paragraph>
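      <Paragraph position="1"> The effect of the two thresholds can be illustrated with a small sketch. The formulas below are the commonly used corpus-linguistic approximations of mutual information and the t-score; this section does not spell out the exact formulas, so they are an assumption, as are the function names, the noun frequencies, and the corpus size. Only the verb frequency 832 (kommen) and the thresholds (t of at least 1.65, MI above 6) come from the text.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """MI in bits: log2 of observed over expected co-occurrence frequency."""
    return math.log2(f_xy * n / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """Approximate t-score: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def is_kept(f_xy, f_x, f_y, n, t_min=1.65, mi_min=None):
    """Keep a bigram if it is significant and, optionally, above an MI threshold."""
    if t_min > t_score(f_xy, f_x, f_y, n):
        return False
    if mi_min is not None and mi_min >= mutual_information(f_xy, f_x, f_y, n):
        return False
    return True


# Hypothetical counts: a noun seen 40 times, 16 of them next to 'kommen'
# (verb frequency 832), in a corpus assumed to hold 1,000,000 tokens.
print(is_kept(16, 832, 40, 1_000_000, mi_min=6))
# A frequent noun that co-occurs only twice is rejected by the t-test.
print(is_kept(2, 832, 3000, 1_000_000))
```

Raising either threshold removes more chance combinations and therefore trades recall for precision.</Paragraph>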
    </Section>
    <Section position="3" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
4.3 Experiment 1: Variation of Window-Size
</SectionTitle>
      <Paragraph position="0"> To see whether the collocational nouns could be better located directly to the left of the verb rather than within a couple of words, we reduced window-size to 3 words including the verb (this allows one word in between, e.g. 'zu' (to) in infinitival constructions).</Paragraph>
      <Paragraph position="1"> As shown in table 1 for BI3 Inf, precision rises by about 10%, but recall drops to 72.1%, because those collocations where other arguments or post modifiers occur between N and V are no longer captured. Taking again only significant combinations (BI3/t Inf), precision rises again slightly. This leads to the conclusion that for German, unless syntactic relations can be determined, a smaller window is preferable in order to improve the detection of preceding object arguments and to exclude unrelated nouns.</Paragraph>
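      <Paragraph position="2"> The window restriction can be sketched as follows, under the assumption that the window counts the verb itself, so that a 3-word window inspects the two tokens to the left of the verb and a 6-word window the five tokens to its left. The function name, the toy sentences, and the noun list are hypothetical; the original programs are not described at this level of detail.

```python
from collections import Counter


def count_left_window(sentences, verb_form, window, nouns):
    """Count nouns occurring within `window` words (verb included) left of the verb."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != verb_form:
                continue
            start = max(0, i - (window - 1))
            for candidate in tokens[start:i]:
                if candidate in nouns:
                    counts[(candidate, verb_form)] += 1
    return counts


# Toy data: tokenized sentences and a hand-made noun list
SENTENCES = [
    ["das", "kann", "nicht", "in", "Frage", "kommen"],
    ["weil", "Hilfe", "von", "aussen", "nicht", "kommen", "wird"],
]
NOUNS = {"Frage", "Hilfe"}

print(count_left_window(SENTENCES, "kommen", 3, NOUNS))
print(count_left_window(SENTENCES, "kommen", 6, NOUNS))
```

In the second sentence the noun is separated from the verb by intervening material, so only the larger window captures it; this is the recall loss described above.</Paragraph>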
    </Section>
    <Section position="4" start_page="78" end_page="79" type="sub_section">
      <SectionTitle>
4.4 Experiment 2: Simulating Lemmatizing
</SectionTitle>
      <Paragraph position="0"> Because no lemmatizing program was available, we used an additional program on top of the bigram calculations for the inflected forms. In order to keep the number of V-N combinations within a magnitude that could still be checked manually for correctness, we restricted the search to a 3-word window to the left. V-N combinations that occurred fewer than two times with a single inflection form of the verb were sorted out. The inflection forms for the infinitive (also 1st/3rd pers. pl.), 3rd pers. sg. present and past tense, 1st/3rd pers. pl. past and past participle were added up; 1st pers. sg. and 2nd pers. sg./pl. were so rare that they could be ignored. The average results are again presented in table 1 (BI3 Lemma); the number of extracted collocations is maximal, but precision is the lowest of all. Precision ranges from 33.3% (gehen) to 88.2% (setzen), recall from 50% (erfahren) to 166.7% (setzen). Recall figures are above 100% because the absolute number of collocations found is higher than for BI6 Inf, the basis for the recall calculations. Regarding lemmatization, our study shows that one gets more collocations, but at the expense of more uninteresting combinations as well.</Paragraph>
      <Paragraph position="1"> One explanation for this is that 3rd pers. sg. present/past and 1st/3rd pers. pl. past only occur to the right of their noun argument in subordinate clauses, whereas 1st/3rd pers. pl. present are identical with the nonfinite form, which additionally occurs in verb-final position in main clauses with a finite auxiliary or modal verb and in infinitive clauses.</Paragraph>
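      <Paragraph position="2"> The simulated lemmatization can be sketched as a merge of per-form bigram counts. The sketch assumes one reading of the filtering rule, namely that a noun's count under a given inflection form is dropped when it is below two, before the remaining counts are summed for the lexeme. The function name and all counts are hypothetical.

```python
from collections import Counter


def merge_forms(per_form_counts, min_per_form=2):
    """Sum noun counts over the inflection forms of one verb.

    Counts below `min_per_form` for a single inflection form are
    discarded before summing (an assumed reading of the filter).
    """
    merged = Counter()
    for form, counts in per_form_counts.items():
        for noun, freq in counts.items():
            if freq >= min_per_form:
                merged[noun] += freq
    return merged


# Hypothetical per-form counts for the lexeme bring-
PER_FORM = {
    "bringen": {"Klarheit": 3, "Jahr": 1},
    "bringt": {"Klarheit": 2},
    "gebracht": {"Entfaltung": 2, "Jahr": 1},
}

print(merge_forms(PER_FORM))
```

Summing over forms raises the counts of genuine collocations, but it also admits nouns that merely stand near one of the finite forms, which is consistent with the lower precision reported for BI3 Lemma.</Paragraph>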
    </Section>
    <Section position="5" start_page="79" end_page="81" type="sub_section">
      <SectionTitle>
4.5 Experiment 3: Varying Corpus Size
</SectionTitle>
      <Paragraph position="0"> For infinitive bringen and lexeme bring-, V-N combinations were also calculated with BI6 for a larger corpus consisting of the MK1 and BZK together. For MK1 alone, 31 of 46 combinations are collocations, a precision of 67.4% (recall is set to 100%). With the larger corpus, the number of V-N collocations found is more than twice as large, with only a slightly lower precision (see footnote 2). Thus, larger corpora would improve results considerably.</Paragraph>
      <Paragraph position="1"> Results for the lexeme, which has the highest overall number of collocations (73), are along the same lines; however, almost every second V-N combination is not a V-N collocation in the sense defined in section 2, i.e. results are much better overall for the infinitive separately. The complete data for bringen are listed below.</Paragraph>
      <Paragraph position="2">  possibly be improved by determining syntactic relations as done by Smadja (1991a,b) for English, we conducted another test with bringen, where we manually excluded those uninteresting extracted combinations in which the nouns were in fact used in subject position of the verb. The results for the two window-sizes, infinitive and lexeme, are shown in table 3. Precision would rise to 100%, with a still good recall of 87.1%, if one could consider syntactic relations for the extraction of V-N collocations. The best recall of 43 collocations within 5 words to the left of the lexeme would then still correspond to 78.2% precision, as compared to 58% if subjects cannot be detected. These results point in the same direction as Smadja's, who reports an improvement from 40 to 80% precision if syntactic relations are considered, with a 94% recall of all collocations that had been found regardless of syntactic relations. However, this cannot as easily be achieved on a large scale for German due to the complicated parsing techniques necessary for the varying word order. (Footnote 2: The latest runs with the combined corpus showed that for the infinitives precision even rises slightly on average (82.1%) while recall is almost doubled (134.9%), compared to BI3 Inf in table 1.)</Paragraph>
      <Paragraph position="3">  The graphics in figure 1 visualize the results of the experiments for the verb bringen; the left y-axis shows recall and precision in per cent, the right one the number of counted V-N collocations. The left graph compares the results for the infinitive, the right one those for the lexeme. From left to right are shown: 3-word window with t-threshold (3tI), 3-word window without t (3I), 6-word window with t (6tI) and without (6I), 6-word window for the enlarged corpus (6I+). 3L stands for '3-word window, lexeme', 3L(oS) means the exclusion of subject nouns; 6L and 6L+ are analogous to the infinitive version.</Paragraph>
      <Paragraph position="4"> The result for '6I+' implies that larger corpora will improve recall without a serious decline in precision compared to the same method used with the smaller corpus (6I; see also footnote 2). Whether recall should be pushed even higher, at the cost of poor precision, by calculating MI for lexemes (6L vs. 6L+) can be decided in view of the application the data are extracted for. Once the number of V-N collocations is generally large enough, higher significance and MI thresholds can be used in order to improve precision again. MI sorts the extracted combinations in such a way that the higher the MI-score, the better the collocations (with a few exceptions, which often reflect highly significant but linguistically uninteresting word combinations from one of the texts; this could hopefully be avoided with a more balanced corpus).</Paragraph>
      <Paragraph position="5"> In general, a trade-off has to be found between the number of extracted collocations (recall) and the number of uninteresting items in between (precision), depending on the application. The described approach seems to be a good method for corpora with texts from restricted domains, where a special terminology is used which will thus show up strongly against 'normal' combinations.</Paragraph>
      <Paragraph position="6"> Very high precision rates, which are an indispensable requirement for lexical acquisition, can only realistically be envisaged for German with parsed corpora (3L(oS) has the best recall-precision ratio in figure 1); otherwise the main advantage lies in better lexicographical support, which should not be underestimated both for manually built NLP lexica and for printed dictionaries. Lemmatizing does not always seem to be useful, as a comparison of 6I+ and 3L shows. Possibly the data are blurred because, as mentioned on p. 6, the various inflection forms are distributed differently in verb-final and verb-second clauses, at least in the investigated corpus. Restricted lemmatizing with infinitive (1st/3rd pers. pl.) and past participle for a search to the left, and with 3rd pers. sg. pres./past and 1st/3rd pers. pl. past for a search to the right (which is problematic, though), promises to give more precise results, as long as search strategies cannot take into account the syntactic structure of a sentence.</Paragraph>
      <Paragraph position="7"> Work is currently in progress to calculate trigrams to check for prepositions in SVCs or for specific (or no) determiners for phrasemes. This will give indications to distinguish SVCs and lexicalized, phraseological SVCs from other collocations. In addition, we plan to consider the variation in span position of the noun within the searched window in order to distinguish fixed phrasemes from flexible ones.</Paragraph>
    </Section>
  </Section>
</Paper>