<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1043"> <Title>Automatically Constructing a Lexicon of Verb Phrase Idiomatic Combinations</Title> <Section position="4" start_page="338" end_page="341" type="metho"> <SectionTitle> 3 Automatic Recognition of VNICs </SectionTitle> <Paragraph position="0"> Here we describe our measures for idiomaticity, which quantify the degree of lexical, syntactic, and overall fixedness of a given verb+noun combination, represented as a verb-noun pair. (Note that our measures quantify fixedness, not flexibility.)</Paragraph> <Section position="1" start_page="338" end_page="338" type="sub_section"> <SectionTitle> 3.1 Measuring Lexical Fixedness </SectionTitle> <Paragraph position="0"> A VNIC is lexically fixed if the replacement of any of its constituents by a semantically (and syntactically) similar word generally does not result in another VNIC, but in an invalid or a literal expression. One way of measuring lexical fixedness of a given verb+noun combination is thus to examine the idiomaticity of its variants, i.e., expressions generated by replacing one of the constituents by a similar word. This approach has two main challenges: (i) it requires prior knowledge about the idiomaticity of expressions (which is what we are developing our measure to determine); (ii) it needs information on &quot;similarity&quot; among words.</Paragraph> <Paragraph position="1"> Inspired by Lin (1999), we examine the strength of association between the verb and noun constituents of the target combination and its variants, as an indirect cue to their idiomaticity. We use the automatically-built thesaurus of Lin (1998) to find similar words to the noun of the target expression, in order to automatically generate variants. 
Only the noun constituent is varied, since replacing the verb constituent of a VNIC with a semantically related verb is more likely to yield another VNIC, as in keep/lose one's cool (Nunberg et al., 1994).</Paragraph> <Paragraph position="2"> Let $S_{sim}(n) = \{ n_i \mid 1 \leq i \leq K \}$ be the set of the $K$ most similar nouns to the noun $n$ of the target pair $\langle v, n \rangle$. We calculate the association strength for the target pair, and for each of its variants, $\langle v, n_i \rangle$, using pointwise mutual information (PMI) (Church et al., 1991): $$\mathit{PMI}(v, n_j) = \log \frac{P(v, n_j)}{P(v)\, P(n_j)} = \log \frac{|V \times N|\, f(v, n_j)}{f(v, *)\, f(*, n_j)} \qquad (1)$$</Paragraph> <Paragraph position="4"> where $0 \leq j \leq K$ and $n_0$ is the target noun; $V$ is the set of all transitive verbs in the corpus; $N$ is the set of all nouns appearing as the direct object of some verb; $f(v, n_j)$ is the frequency of $v$ and $n_j$ occurring as a verb-object pair; $f(v, *)$ is the total frequency of the target verb with any noun in $N$; $f(*, n_j)$ is the total frequency of the noun $n_j$ in the direct object position of any verb in $V$.</Paragraph> <Paragraph position="5"> Lin (1999) assumes that a target expression is non-compositional if and only if its $\mathit{PMI}$ value is significantly different from that of any of the variants. Instead, we propose a novel technique that brings together the association strengths ($\mathit{PMI}$ values) of the target and the variant expressions into a single measure reflecting the degree of lexical fixedness for the target pair. We assume that the target pair is lexically fixed to the extent that its $\mathit{PMI}$ deviates from the average $\mathit{PMI}$ of its variants. Our measure calculates this deviation, normalized using the sample's standard deviation: $$\mathit{Fixedness}_{lex}(v, n) = \frac{\mathit{PMI}(v, n) - \overline{\mathit{PMI}}}{s} \qquad (2)$$ where $\overline{\mathit{PMI}}$ is the mean and $s$ the standard deviation of the sample.</Paragraph> </Section> <Section position="2" start_page="338" end_page="339" type="sub_section"> <SectionTitle> 3.2 Measuring Syntactic Fixedness </SectionTitle> <Paragraph position="0"> Compared to compositional verb+noun combinations, VNICs are expected to appear in more restricted syntactic forms. 
To quantify the syntactic fixedness of a target verb-noun pair, we thus need to: (i) identify relevant syntactic patterns, i.e., those that help distinguish VNICs from literal verb+noun combinations; (ii) translate the frequency distribution of the target pair in the identified patterns into a measure of syntactic fixedness. Determining a unique set of syntactic patterns appropriate for the recognition of all idiomatic combinations is difficult indeed: exactly which forms an idiomatic combination can occur in is not entirely predictable (Sag et al., 2002). Nonetheless, there are hypotheses about the difference in behaviour of VNICs and literal verb+noun combinations with respect to particular syntactic variations (Nunberg et al., 1994). Linguists note that semantic analyzability is related to the referential status of the noun constituent, which is in turn related to participation in certain morphosyntactic forms. In what follows, we describe three types of variation that are tolerated by literal combinations, but are prohibited by many VNICs.</Paragraph> <Paragraph position="1"> Passivization. There is much evidence in the linguistic literature that VNICs often do not undergo passivization (see footnote 1). Linguists mainly attribute this to the fact that only a referential noun can appear as the surface subject of a passive construction.</Paragraph> <Paragraph position="2"> Footnote 1: There are idiomatic combinations that are used only in a passivized form; we do not consider such cases in our study. 
Determiner Type. A strong correlation exists between the flexibility of the determiner preceding the noun in a verb+noun combination and the overall flexibility of the phrase (Fellbaum, 1993).</Paragraph> <Paragraph position="3"> It is, however, important to note that the nature of the determiner is also affected by other factors, such as the semantic properties of the noun.</Paragraph> <Paragraph position="4"> Pluralization. While the verb constituent of a VNIC is morphologically flexible, the morphological flexibility of the noun relates to its referential status. A non-referential noun constituent is expected to mainly appear in just one of the singular or plural forms. The pluralization of the noun is of course also affected by its semantic properties.</Paragraph> <Paragraph position="5"> Merging the three variation types results in a pattern set, $\mathcal{P}$, of 11 distinct syntactic patterns, given in Table 1 (see footnote 2).</Paragraph> </Section> <Section position="3" start_page="339" end_page="339" type="sub_section"> <SectionTitle> 3.2.2 Devising a Statistical Measure </SectionTitle> <Paragraph position="0"> The second step is to devise a statistical measure that quantifies the degree of syntactic fixedness of a verb-noun pair, with respect to the selected set of patterns, $\mathcal{P}$. We propose a measure that compares the &quot;syntactic behaviour&quot; of the target pair with that of a &quot;typical&quot; verb-noun pair. Syntactic behaviour of a typical pair is defined as the prior probability distribution over the patterns in $\mathcal{P}$. The syntactic behaviour of the target verb-noun pair $\langle v, n \rangle$ is defined as the posterior probability distribution over the patterns, given the particular pair. 
The posterior probability of an individual pattern $pt$ is estimated as: $$P(pt \mid v, n) = \frac{f(v, n, pt)}{\sum_{pt_k \in \mathcal{P}} f(v, n, pt_k)} = \frac{f(v, n, pt)}{f(v, n, *)}$$</Paragraph> <Paragraph position="2"> The degree of syntactic fixedness of the target verb-noun pair is estimated as the divergence of its syntactic behaviour (the posterior distribution over the patterns), from the typical syntactic behaviour (the prior distribution). The divergence of the two probability distributions is calculated using a standard information-theoretic measure, the Kullback-Leibler (KL) divergence: $$\mathit{Fixedness}_{syn}(v, n) = D(P(pt \mid v, n) \,\|\, P(pt)) = \sum_{pt_k \in \mathcal{P}} P(pt_k \mid v, n) \log \frac{P(pt_k \mid v, n)}{P(pt_k)} \qquad (3)$$ Footnote 2: We collapse some patterns since with a larger pattern set the measure may require larger corpora to perform reliably.</Paragraph> <Paragraph position="4"> KL-divergence is always non-negative and is zero if and only if the two distributions are exactly the same. Thus, $\mathit{Fixedness}_{syn}(v, n) \in [0, +\infty]$. KL-divergence is argued to be problematic because it is not a symmetric measure. Nonetheless, it has proven useful in many NLP applications (Resnik, 1999; Dagan et al., 1994). Moreover, the asymmetry is not an issue here since we are concerned with the relative distance of several posterior distributions from the same prior.</Paragraph> </Section> <Section position="4" start_page="339" end_page="340" type="sub_section"> <SectionTitle> 3.3 A Hybrid Measure of Fixedness </SectionTitle> <Paragraph position="0"> VNICs are hypothesized to be, in most cases, both lexically and syntactically more fixed than literal verb+noun combinations (see Section 2). 
We thus propose a new measure of idiomaticity that measures the overall fixedness of a given pair.</Paragraph> <Paragraph position="1"> We define $\mathit{Fixedness}_{overall}(v, n)$ as: $$\mathit{Fixedness}_{overall}(v, n) = \alpha\, \mathit{Fixedness}_{syn}(v, n) + (1 - \alpha)\, \mathit{Fixedness}_{lex}(v, n) \qquad (4)$$</Paragraph> <Paragraph position="3"> where $\alpha$ weights the relative contribution of the measures in predicting idiomaticity.</Paragraph> <Paragraph position="4"> 4 Evaluation of the Fixedness Measures. To evaluate our proposed fixedness measures, we determine their appropriateness as indicators of idiomaticity. We pose a classification task in which idiomatic verb-noun pairs are distinguished from literal ones. We use each measure to assign scores to the experimental pairs (see Section 4.2 below). We then classify the pairs by setting a threshold, here the median score, where all expressions with scores higher than the threshold are labeled as idiomatic and the rest as literal.</Paragraph> <Paragraph position="5"> We assess the overall goodness of a measure by looking at its accuracy (Acc) and the relative reduction in error rate (RER) on the classification task described above. The RER of a measure reflects the improvement in its accuracy relative to another measure (often a baseline).</Paragraph> <Paragraph position="6"> We consider two baselines: (i) a random baseline, $\mathit{Rand}$, that randomly assigns a label (literal or idiomatic) to each verb-noun pair; (ii) a more informed baseline, $\mathit{PMI}$, an information-theoretic measure widely used for extracting statistically significant collocations (see footnote 3).</Paragraph> </Section> <Section position="5" start_page="340" end_page="340" type="sub_section"> <SectionTitle> 4.1 Corpus and Data Extraction </SectionTitle> <Paragraph position="0"> We use the British National Corpus (BNC; &quot;http://www.natcorp.ox.ac.uk/&quot;) to extract verb-noun pairs, along with information on the syntactic patterns they appear in. 
We automatically parse the corpus using the Collins parser (Collins, 1999), and further process it using TGrep2 (Rohde, 2004). For each instance of a transitive verb, we use heuristics to extract the noun phrase (NP) in either the direct object position (if the sentence is active), or the subject position (if the sentence is passive). We then use NP-head extraction software (see footnote 4) to get the head noun of the extracted NP, its number (singular or plural), and the determiner introducing it.</Paragraph> </Section> <Section position="6" start_page="340" end_page="340" type="sub_section"> <SectionTitle> 4.2 Experimental Expressions </SectionTitle> <Paragraph position="0"> We select our development and test expressions from verb-noun pairs that involve a member of a predefined list of (transitive) &quot;basic&quot; verbs. Basic verbs, in their literal use, refer to states or acts that are central to human experience. They are thus frequent, highly polysemous, and tend to combine with other words to form idiomatic combinations (Nunberg et al., 1994). An initial list of such verbs was selected from several linguistic and psycholinguistic studies on basic vocabulary (e.g., Pauwels 2000; Newman and Rice 2004). We further augmented this initial list with verbs that are semantically related to another verb already in the list; e.g., lose is added in analogy with find. Footnote 3: As in Eqn. (1), our calculation of PMI here restricts the verb-noun pair to the direct object relation.</Paragraph> <Paragraph position="1"> Footnote 4: We use a modified version of the software provided by Eric Joanis, based on heuristics from Collins (1999).</Paragraph> <Paragraph position="2"> The final list of 28 verbs is: blow, bring, catch, cut, find, get, give, have, hear, hit, hold, keep, kick, lay, lose, make, move, place, pull, push, put, see, set, shoot, smell, take, throw, touch. From the corpus, we extract all verb-noun pairs with minimum frequency of 10 that contain a basic verb. 
From these, we semi-randomly select an idiomatic and a literal subset (see footnote 5). A pair is considered idiomatic if it appears in a credible idiom dictionary, such as the Oxford Dictionary of Current Idiomatic English (ODCIE) (Cowie et al., 1983), or the Collins COBUILD Idioms Dictionary (CCID) (Seaton and Macaulay, 2002). Otherwise, the pair is considered literal. We then randomly pull out a1a4a3 a67 development and 200 test pairs (half idiomatic and half literal), ensuring both low and high frequency items are included. Sample idioms corresponding to the extracted pairs are: kick the habit, move mountains, lose face, and keep one's word.</Paragraph> </Section> <Section position="7" start_page="340" end_page="340" type="sub_section"> <SectionTitle> 4.3 Experimental Setup </SectionTitle> <Paragraph position="0"> Development expressions are used in devising the fixedness measures, as well as in determining the values of the parameters $K$ in Eqn. (2) and $\alpha$ in Eqn. (4). $K$ determines the maximum number of nouns similar to the target noun to be considered in measuring the lexical fixedness of a given pair.</Paragraph> <Paragraph position="1"> The value of this parameter is determined by performing experiments over the development data, in which $K$ ranges from 10 to 100 by steps of 10; $K$ is set to 50 based on the results. We also experimented with different values of $\alpha$ ranging from 0 to 1 by steps of 0.1. Based on the development results, the best value for $\alpha$ is a7a9a8 (giving more weight to the syntactic fixedness measure).</Paragraph> <Paragraph position="2"> Test expressions are saved as unseen data for the final evaluation. 
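As a rough illustration of the parameter sweep over alpha described above, the following Python sketch combines two fixedness scores with a weight, labels every pair scoring above the median as idiomatic, and tries alpha from 0 to 1 in steps of 0.1. It is only a sketch under stated assumptions: alpha is taken to multiply the syntactic score and (1 - alpha) the lexical score, both scores are assumed to be on comparable scales, and all function names and inputs are hypothetical rather than taken from the paper.

```python
from statistics import median


def fixedness_overall(syn, lex, alpha):
    """Weighted sum; assumes alpha weights the syntactic score."""
    return alpha * syn + (1.0 - alpha) * lex


def median_threshold_accuracy(scores, gold):
    """Label pairs scoring above the median as idiomatic; return accuracy."""
    threshold = median(scores)
    predicted = [score > threshold for score in scores]
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def sweep_alpha(syn_scores, lex_scores, gold):
    """Try alpha = 0.0, 0.1, ..., 1.0; return (best_alpha, best_accuracy).

    Ties are resolved in favour of the smallest alpha.
    """
    best = (0.0, -1.0)
    for step in range(11):
        alpha = step / 10
        scores = [fixedness_overall(s, l, alpha)
                  for s, l in zip(syn_scores, lex_scores)]
        accuracy = median_threshold_accuracy(scores, gold)
        if accuracy > best[1]:
            best = (alpha, accuracy)
    return best
```

Here gold is a list of booleans (True for idiomatic) aligned with the two score lists; calling sweep_alpha on development pairs returns the weight that would then be held fixed for the unseen test pairs.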
We further divide the set of all test expressions, $\mathit{TEST}_{all}$, into two sets corresponding to two frequency bands: $\mathit{TEST}_{f_{low}}$ contains 50 idiomatic and 50 literal pairs, each with total frequency between 10 and 40 ($10 \leq \mathit{freq}(v, n, *) &lt; 40$); $\mathit{TEST}_{f_{high}}$ consists of 50 idiomatic and 50 literal pairs, each with total frequency of 40 or greater ($\mathit{freq}(v, n, *) \geq 40$). All frequency counts are over the entire BNC.</Paragraph> </Section> <Section position="8" start_page="340" end_page="341" type="sub_section"> <SectionTitle> 4.4 Results </SectionTitle> <Paragraph position="0"> We first examine the performance of the individual fixedness measures, $\mathit{Fixedness}_{lex}$ and $\mathit{Fixedness}_{syn}$, as well as that of the two baselines, $\mathit{Rand}$ and $\mathit{PMI}$; see Table 2. Footnote 5: In selecting literal pairs, we choose those that involve a physical act corresponding to the basic semantics of the verb.</Paragraph> <Paragraph position="1"> (Results for the overall measure are presented later in this section.) As can be seen, the informed baseline, $\mathit{PMI}$, shows a large improvement over the random baseline (a5 a8 a33 error reduction). This shows that one can get relatively good performance by treating verb+noun idiomatic combinations as collocations.</Paragraph> <Paragraph position="2"> $\mathit{Fixedness}_{lex}$ performs as well as the informed baseline (30% error reduction). This result shows that, as hypothesized, lexical fixedness is a reasonably good predictor of idiomaticity. 
Nonetheless, the performance signifies a need for improvement.</Paragraph> <Paragraph position="3"> Possibly the most beneficial enhancement would be a change in the way we acquire the similar nouns for a target noun.</Paragraph> <Paragraph position="4"> The best performance (shown in boldface) belongs to $\mathit{Fixedness}_{syn}$, with 40% error reduction over the random baseline, and 20% error reduction over the informed baseline. These results demonstrate that syntactic fixedness is a good indicator of idiomaticity, better than a simple measure of collocation ($\mathit{PMI}$), or a measure of lexical fixedness. These results further suggest that looking into deep linguistic properties of VNICs is both necessary and beneficial for the appropriate treatment of these expressions.</Paragraph> <Paragraph position="5"> $\mathit{PMI}$ is known to perform poorly on low frequency data. To examine the effect of frequency on the measures, we analyze their performance on the two divisions of the test data, corresponding to the two frequency bands, $\mathit{TEST}_{f_{low}}$ and $\mathit{TEST}_{f_{high}}$. Results are given in Table 3, with the best performance shown in boldface.</Paragraph> <Paragraph position="6"> As expected, the performance of $\mathit{PMI}$ drops substantially for low frequency items. Interestingly, although it is a PMI-based measure, $\mathit{Fixedness}_{lex}$ performs slightly better when the data is separated based on frequency. The performance of $\mathit{Fixedness}_{syn}$ improves quite a bit when it is applied to high frequency items, while it improves only slightly on the low frequency items.</Paragraph> <Paragraph position="7"> These results show that both fixedness measures perform better on homogeneous data, while retaining comparably good performance on heterogeneous data. 
These results indicate that our fixedness measures are not as sensitive to frequency as $\mathit{PMI}$. Hence they can be used with a higher degree of confidence, especially when applied to data that is heterogeneous with regard to frequency. This is important because while some VNICs are very common, others have very low frequency.</Paragraph> <Paragraph position="8"> Table 4 presents the performance of the hybrid measure, $\mathit{Fixedness}_{overall}$, repeating that of $\mathit{Fixedness}_{lex}$ and $\mathit{Fixedness}_{syn}$ for comparison. $\mathit{Fixedness}_{overall}$ outperforms both lexical and syntactic fixedness measures, with a substantial improvement over $\mathit{Fixedness}_{lex}$, and a small, but notable, improvement over $\mathit{Fixedness}_{syn}$. Each of the lexical and syntactic fixedness measures is a good indicator of idiomaticity on its own, with syntactic fixedness being a better predictor. Here we demonstrate that combining them into a single measure of fixedness, while giving more weight to the better measure, results in a more effective predictor of idiomaticity.</Paragraph> </Section> </Section> <Section position="5" start_page="341" end_page="342" type="metho"> <SectionTitle> 5 Determining the Canonical Forms </SectionTitle> <Paragraph position="0"> Our evaluation of the fixedness measures demonstrates their usefulness for the automatic recognition of idiomatic verb-noun pairs. To represent such pairs in a lexicon, however, we must determine their canonical form(s) (Cforms henceforth). 
For example, the lexical representation of $\langle \text{shoot}, \text{breeze} \rangle$ should include shoot the breeze as a Cform.</Paragraph> <Paragraph position="1"> Since VNICs are syntactically fixed, they are mostly expected to have a single Cform. Nonetheless, there are idioms with two or more acceptable forms. For example, hold fire and hold one's fire are both listed in CCID as variations of the same idiom. Our approach should thus be capable of predicting all allowable forms for a given idiomatic verb-noun pair.</Paragraph> <Paragraph position="2"> We expect a VNIC to occur in its Cform(s) more frequently than it occurs in any other syntactic patterns. To discover the Cform(s) for a given idiomatic verb-noun pair, we thus examine its frequency of occurrence in each syntactic pattern in $\mathcal{P}$. For each pattern $pt_k \in \mathcal{P}$, we calculate a $z$-score, $z_k(v, n) = (f(v, n, pt_k) - \bar{f}) / s$, where $\bar{f}$ is the mean and $s$ the standard deviation over the sample $\{ f(v, n, pt_k) \mid pt_k \in \mathcal{P} \}$. The statistic $z_k(v, n)$ indicates how far and in which direction the frequency of occurrence of the pair $\langle v, n \rangle$ in pattern $pt_k$ deviates from the sample's mean, expressed in units of the sample's standard deviation. To decide whether $pt_k$ is a canonical pattern for the target pair, we check whether</Paragraph> <Paragraph position="4"> $z_k(v, n) &gt; T_z$, where $T_z$ is a threshold. For evaluation, we set $T_z$ to 1, based on the distribution of $z$ and through examining the development data.</Paragraph> <Paragraph position="5"> We evaluate the appropriateness of this approach in determining the Cform(s) of idiomatic pairs by verifying its predicted forms against ODCIE and CCID. Specifically, for each of the 100 idiomatic pairs in $\mathit{TEST}_{all}$, we calculate the precision and recall of its predicted Cforms (those whose $z$-scores are above $T_z$), compared to the Cforms listed in the two dictionaries. 
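The selection and scoring procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: a pattern is kept when its frequency lies more than a threshold number of standard deviations above the mean over all patterns, the sample (n - 1) standard deviation is assumed, and the function names and any pattern labels passed in are hypothetical placeholders rather than the entries of Table 1.

```python
from statistics import mean, stdev


def canonical_patterns(pattern_counts, threshold=1.0):
    """Return the patterns whose z-score exceeds the threshold.

    pattern_counts maps a pattern label to the frequency of one
    verb-noun pair in that pattern (at least two patterns required).
    """
    counts = list(pattern_counts.values())
    mu = mean(counts)
    sd = stdev(counts)  # assumption: sample (n - 1) standard deviation
    if sd == 0:
        return []  # every pattern is equally frequent, so none stands out
    return [label for label, count in pattern_counts.items()
            if (count - mu) / sd > threshold]


def precision_recall(predicted, gold):
    """Precision and recall of predicted forms against dictionary forms."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted.intersection(gold))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

A pair whose occurrences are concentrated in one pattern yields a single predicted form, while a pair split between two dominant patterns can yield both, which is why a threshold on the standardized frequency is used here instead of simply taking the most frequent pattern.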
The average precision across the 100 test pairs is 81.7%, and the average recall is 88.0% (with 69 of the pairs having 100% precision and 100% recall). Moreover, we find that for the overwhelming majority of the pairs, a8 a3 a33 , the predicted Cform with the highest $z$-score appears in the dictionary entry of the pair. Thus, our method of detecting Cforms performs quite well.</Paragraph> </Section> </Paper>