<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2210"> <Title>Idiomatic object usage and support verbs</Title> <Section position="2" start_page="0" end_page="1289" type="metho"> <SectionTitle> 2 Semantic asymmetry </SectionTitle> <Paragraph position="0"> The linguistic hypothesis that syntactic relations, such as subject-verb and object-verb relations, are semantically asymmetric in a systematic way (Keenan, 1979) is well-known. McGlashan (1993, p. 213) discusses Keenan's principles concerning the directionality of agreement relations and concludes that the semantic interpretation of functor categories varies with argument categories, but not vice versa. He cites Keenan, who argues that the meaning of a transitive verb depends on the object; for example, the meaning of the verb cut seems to vary with the direct object: * in cut finger &quot;to make an incision on the surface of&quot;, * in cut cake &quot;to divide into portions&quot;, * in cut lawn &quot;to trim&quot; and * in cut heroin &quot;to diminish the potency of&quot;. This phenomenon is also called semantic tailoring (Allerton, 1982, p. 27).</Paragraph> <Paragraph position="1"> There are two different types of asymmetric expressions, even if they probably form a continuum: those in which the sense of the functor is modified or selected by a dependent element, and those in which the functor is semantically empty. The former type is represented by the verb cut above: a distinct sense is selected according to the (type of) object. The latter type contains an object that forms a fixed collocation with a semantically empty verb. These pairings are usually language-specific and semantically unpredictable.</Paragraph> <Paragraph position="2"> Obviously, the amount of tailoring varies considerably. At one end of the continuum is idiomatic usage. It is conceivable that even a highly idiomatic expression like taking toll can be used non-idiomatically. 
There may be texts where the word toll is used non-idiomatically, as it may also occur from time to time in any text, for instance, in The Times corpus: The IRA could be profiting by charging a toll for crossborder smuggling. But when it appears in a sentence like Barcelona's fierce summer is taking its toll, it is clearly part of an idiomatic expression.</Paragraph> </Section> <Section position="3" start_page="1289" end_page="1289" type="metho"> <SectionTitle> 3 Distributed frequency of an object </SectionTitle> <Paragraph position="0"> As the discussion in the preceding section shows, we assume that when a verb-object collocation can be used idiomatically, it is the object that is the more interesting element. The objects in idiomatic usages tend to have a distinctive distribution. If an object appears with only one verb (or few verbs) in a large corpus, we expect it to have an idiomatic nature. The previous example of take toll is illustrative: if the word toll appears only with the verb take, but nothing else is done with tolls, we may assume that it is not the toll in the literal sense that the text is about.</Paragraph> <Paragraph position="1"> The task is thus to collect verb-object collocations in which the object appears in a corpus with few verbs, and then study the top collocations in decreasing order of frequency.</Paragraph> <Paragraph position="2"> The restriction that the object is always attached to the same verb is too strict. When we applied it to ten million words of newspaper text, we found that even the most frequent of such expressions, make amends and take precedence, appeared fewer than twenty times, and the expressions have temerity, go berserk and go ex-dividend were even less frequent. It was hard to obtain more collocations because their frequencies fell very low. 
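The strict restriction discussed above can be sketched as follows (a minimal illustration; the function name and data layout are ours, not the paper's):

```python
from collections import defaultdict

def single_verb_objects(pairs):
    """Objects that occur with exactly one verb in the corpus,
    mapped to their total frequency.

    pairs: iterable of (verb, object) collocations from a parser.
    """
    verbs_seen = defaultdict(set)   # object -> verbs it occurs with
    freq = defaultdict(int)         # object -> total occurrences
    for verb, obj in pairs:
        verbs_seen[obj].add(verb)
        freq[obj] += 1
    return {obj: freq[obj]
            for obj, verbs in verbs_seen.items() if len(verbs) == 1}
```

Sorting the result by frequency yields candidates like make amends and take precedence, but, as noted above, even these stay rare in a ten-million-word corpus.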
Moreover, expressions like have appendix were equally exposed with expressions like run errand.</Paragraph> <Paragraph position="3"> Therefore, instead of taking the objects that occur with only one verb, we take all objects and distribute them over their verbs. This means that we treat all occurrences of an object as a block, and give the block a score that is the frequency of the object divided by the number of different verbs that appear with the object.</Paragraph> <Paragraph position="4"> The formula is now as follows. Let o be an object and let (F_1, V_1, o), ..., (F_n, V_n, o) be triples where F_j > 0 is the frequency or the relative frequency of the collocation of o as an object of the verb V_j in a corpus. Then the score for the object o is the sum over j of F_j/n, that is, (F_1 + ... + F_n)/n.</Paragraph> <Paragraph position="5"> The frequency of a given object is divided by the number of different verbs taking this object. If the number of occurrences of a given object grows, the score increases. If the object appears with many different verbs, the score decreases. Thus the formula favours common objects that are used in a specific sense in a given corpus.</Paragraph> <Paragraph position="6"> This scheme still needs some parameters.</Paragraph> <Paragraph position="7"> First, the distribution of the verbs is not taken into account. The score is the same in the case where an object occurs with three different verbs with the frequencies, say, 100, 100 and 100, and in the case where the frequencies of the three heads are 280, 10 and 10. Here we want to favour the latter object, because its verb-object relation seems to be more stable, with a small number of exceptions. One way to do this is to sum up the squares of the frequencies instead of the frequencies themselves. Second, it is not clear what the optimal penalty is for multiple verbs with a given object. This may be parametrised by scaling the denominator of the formula. 
Third, we introduce a threshold frequency for collocations, so that only the collocations that occur frequently enough are used in the calculations. This last modification is crucial when an automatic parsing system is applied, because it eliminates infrequent parsing errors.</Paragraph> <Paragraph position="8"> The final formula for the distributed frequency DF(o) of the object o in a corpus of n triples (F_j, V_j, o) with F_j > C is the sum DF(o) = (F_1^a + ... + F_n^a)/n^b, where a, b and C are constants that may depend on the corpus and the parser.</Paragraph> </Section> <Section position="4" start_page="1289" end_page="1290" type="metho"> <SectionTitle> 4 The corpora and parsing </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1289" end_page="1290" type="sub_section"> <SectionTitle> 4.1 The syntactic parser </SectionTitle> <Paragraph position="0"> We used the Conexor Functional Dependency Grammar (FDG) by Tapanainen and Järvinen (1997) for finding the syntactic relations. The new version of the syntactic parser can be tested at http://www.conexor.fi.</Paragraph> </Section> <Section position="2" start_page="1290" end_page="1290" type="sub_section"> <SectionTitle> 4.2 Processing the corpora </SectionTitle> <Paragraph position="0"> We analysed the corpora with the syntactic parser and collected the verb-object collocations from the output. The verb may be in the infinitive, participle or finite form. A noun phrase in the object function is represented by its head. For instance, the sentence I saw a big black cat generates the pair (see, cat). A verb may also have an infinitive clause as its object. In such a case, the object is represented by the infinitive, with the infinitive marker if present. Naturally, transitive nonfinite verbs can have objects of their own. Therefore, for instance, the sentence I want to visit Paris generates two verb-object pairs: (want, to visit) and (visit, Paris). 
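Returning to the score defined in Section 3, the final DF formula can be sketched as follows (the parameter defaults are illustrative assumptions; the paper does not fix values for a, b and C):

```python
from collections import defaultdict

def distributed_frequency(triples, a=2, b=1, C=3):
    """DF(o) = (F_1^a + ... + F_n^a) / n^b, summed over the n verbs
    whose collocation frequency F_j with the object o satisfies F_j > C.

    triples: iterable of (verb, object, frequency).
    """
    freqs = defaultdict(list)
    for verb, obj, f in triples:
        if f > C:                  # threshold drops rare, often misparsed, pairs
            freqs[obj].append(f)
    return {obj: sum(f ** a for f in fs) / len(fs) ** b
            for obj, fs in freqs.items()}
```

With a = 2, the skewed distribution 280/10/10 scores higher than the even 100/100/100, as the discussion of the first parameter requires.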
The parser also recognises clauses, e.g. that-clauses, as objects.</Paragraph> <Paragraph position="2"> We collect the verbs and head words of nominal objects from the parser's output. Other syntactic arguments are ignored. The output is normalised to base forms so that, for instance, the clause He made only three real mistakes produces the normalised pair (make, mistake). The tokenisation in the lexical analysis produces some &quot;compound nouns&quot; like vice/president, which are glued together. We regard these compounds as single tokens.</Paragraph> <Paragraph position="3"> The intricate borderline between an object, an object adverbial and a mere adverbial nominal is of little importance here, because the latter tend to be idiomatic anyway. More importantly, due to the use of a syntactic parser, the presence of other arguments, e.g. a subject, a predicative complement or an indirect object, does not affect the result.</Paragraph> </Section> </Section> <Section position="5" start_page="1290" end_page="1290" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> In our experiment, we used some ten million words from The Times newspaper corpus, taken from the Bank of English corpora (Järvinen, 1994). The overall quality of the resulting collocations is good. The verb-object collocations with the highest distributed object frequencies seem to be very idiomatic (Table 1).</Paragraph> <Paragraph position="1"> The collocations seem to have a different status in different corpora. Some collocations appear in every corpus in a relatively high position. For example, collocations like take toll, give birth and make mistake are common English expressions.</Paragraph> <Paragraph position="2"> Some other collocations are corpus-specific.</Paragraph> <Section position="1" start_page="1290" end_page="1290" type="sub_section"> <SectionTitle> Times </SectionTitle> <Paragraph position="0"> An experiment with the Wall Street Journal corpus yielded collocations like name vice-president and file lawsuit that are rare in the British corpora. These expressions could be categorised as cultural or area-specific. They are likely to appear again in other issues of the WSJ or in other American newspapers.</Paragraph> </Section> </Section> <Section position="6" start_page="1290" end_page="1290" type="metho"> <SectionTitle> 6 Mutual information </SectionTitle> <Paragraph position="0"> Mutual information between a verb and its object was also computed for comparison with our method. The collocations from The Times with the highest mutual information and a high t-value are listed in Table 2. See Church et al. (1994) for further information. We selected the t-value so that it does not filter out the collocations of Table 1. Mutual information is computed from a list of verb-object collocations.</Paragraph> <Paragraph position="1"> The first impression, when comparing Tables 1 and 2, is that the collocations in the latter are somewhat more marginal, though clearly semantically motivated. The second observation is that the top collocations contain mostly rare words and parsing errors made by the underlying syntactic parser; three out of the top five pairs are parsing errors.</Paragraph> <Paragraph position="2"> We tested how the top ten pairs of Table 1 are rated by mutual information. The result is in Table 3, where the position denotes the position when sorted according to mutual information and filtered by the t-value. The t-value is selected so that it does not filter out the top pairs in Table 1. Without filtering, the positions are in the range between 32,640 and 158,091. The result shows clearly how different the nature of mutual information is. 
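The comparison measure can be sketched as pointwise mutual information over the same collocation list, I(v,o) = log2(N·f(v,o)/(f(v)·f(o))); this is the standard estimate, and the paper's exact estimator and t-score filter may differ:

```python
import math
from collections import Counter

def pointwise_mi(pairs):
    """I(v, o) = log2(N * f(v,o) / (f(v) * f(o))) for each
    verb-object pair in a list of collocations."""
    pair_f = Counter(pairs)
    verb_f = Counter(v for v, _ in pairs)
    obj_f = Counter(o for _, o in pairs)
    N = len(pairs)
    return {(v, o): math.log2(N * f / (verb_f[v] * obj_f[o]))
            for (v, o), f in pair_f.items()}
```

A pair of two rare words gets a high score regardless of how often the pair recurs, which is consistent with the observation that the top MI collocations are dominated by rare words and parsing errors.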
Here it seems to favour pairs that we would like to rule out, and vice versa.</Paragraph> </Section> <Section position="7" start_page="1290" end_page="1290" type="metho"> <SectionTitle> 7 Frequency </SectionTitle> <Paragraph position="0"> In a related piece of work, Hindle (1994) used a parser to study what can be done with a given noun or what kind of objects a given verb may take. If we collect the most frequent objects of the verb have, we are answering the question: &quot;What do we usually have?&quot; (see Table 4). The distributed frequency of the object gives a different flavour to the task: if we collect the collocations in the order of the distributed frequency of the object, we are answering the question: &quot;What do we only have?&quot; (see Table 5).</Paragraph> </Section> </Paper>