<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1120">
  <Title>Accurate Collocation Extraction Using a Multilingual Parser</Title>
  <Section position="4" start_page="0" end_page="954" type="metho">
    <SectionTitle>
2 Hybrid Collocation Extraction
</SectionTitle>
    <Paragraph position="0"> We consider that syntactic analysis of source corpora is an inescapable precondition for collocation extraction, and that the syntactic structure of source text has to be taken into account in order to ensure the quality and interpretability of results.</Paragraph>
    <Paragraph position="1"> 1To put it simply, collocations are non-idiomatical, but restricted, conventional lexical combinations.</Paragraph>
    <Paragraph position="2">  Asamatteroffact, someoftheexistingcollocation extraction systems already employ (but only to a limited extent) linguistic tools in order to support the collocation identification in text corpora. Forinstance, lemmatizersareoftenusedforrecognizing all the inflected forms of a lexical item, and POS taggers are used for ruling out certain categories of words, e.g., in (Justeson and Katz, 1995).</Paragraph>
    <Paragraph position="3"> Syntactic analysis has long since been recognized as a prerequisite for collocation extraction (for instance, by Smadja2), but the traditional systems simply ignored it because of the lack, at that time, of efficient and robust parsers required for processing large corpora. Oddly enough, this situation is nowadays perpetuated, in spite of the dramatic advances in parsing technology. Only a few exceptions exists, e.g., (Lin, 1998; Krenn and Evert, 2001).</Paragraph>
    <Paragraph position="4"> One possible reason for this might be the way that collocations are generally understood, as a purely statistical phenomenon. Some of the best-known definitions are the following: &amp;quot;Collocations of a given word are statements of the habitual and customary places of that word&amp;quot; (Firth, 1957, 181); &amp;quot;arbitrary and recurrent word combination&amp;quot; (Benson, 1990); or &amp;quot;sequences of lexical items that habitually co-occur&amp;quot; (Cruse, 1986, 40).</Paragraph>
    <Paragraph position="5"> Mostoftheauthorsmakenoclaimswithrespectto the grammatical status of the collocation, although this can indirectly inferred from the examples they provide.</Paragraph>
    <Paragraph position="6"> On the contrary, other definitions state explicitly that a collocation is an expression of language: &amp;quot;co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern&amp;quot; (Cowie, 1978); &amp;quot;a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit&amp;quot; (Choueka, 1988). Our approach is committed to these later definitions, hence the importance we lend to using appropriate extraction methodologies, based on syntactic analysis.</Paragraph>
    <Paragraph position="7"> The hybrid method we developed relies on the parser Fips (Wehrli, 2004), that implements the Government and Binding formalism and supports several languages (besides the ones mentioned in 2&amp;quot;Ideally, in order to identify lexical relations in a corpus one would need to first parse it to verify that the words are used in a single phrase structure. However, in practice, free-style texts contain a great deal of nonstandard features over which automatic parsers would fail. This fact is being seriously challenged by current research (...), and might not be true in the near future&amp;quot; (Smadja, 1993, 151).</Paragraph>
    <Paragraph position="8"> the abstract, a few other are also partly dealt with).</Paragraph>
    <Paragraph position="9"> We will not present details about the parser here; what is relevant for this paper is the type of syntactic structures it uses. Each constituent is represented by a simplified X-bar structure (without intermediate level), in which to the lexical head is attached a list of left constituents (its specifiers) and right constituents (its complements), and each of these are in turn represented by the same type of structure, recursively.</Paragraph>
    <Paragraph position="10"> Generallyspeaking, acollocationextractioncan be seen as a two-stage process: I. in stage one, collocation candidates are identified from the text corpora, based on criteria which are specific to each system; II. in stage two, the candidates are scored and ranked using specific association measures (a review can be found in (Manning and Sch&amp;quot;utze, 1999; Evert, 2004; Pecina, 2005)).</Paragraph>
    <Paragraph position="11"> According to this description, in our approach the parser is used in the first stage of extraction, for identifying the collocation candidates. A pair of lexical items is selected as a candidate only if there is a syntactic relation holding between the two items (one being the head of the current parse structure, and the other the lexical head of its specifier/complement). Therefore,thecriterionweemploy for candidate selection is the syntactic proximity, as opposed to the linear proximity used by traditional, window-based methods.</Paragraph>
    <Paragraph position="12"> As the parsing goes on, the syntactic word pairs are extracted from the parse structures created, from each head-specifier or head-complement relation. The pairs obtained are then partitioned according to their syntactic configuration (e.g., noun + adjectival or nominal specifier, noun + argument, noun + adjective in predications, verb + adverbial specifier, verb + argument (subject, object), verb + adjunt, etc). Finally, the log-likelihood ratios test (henceforth LLR) (Dunning, 1993) is applied on each set of pairs. We call this method hybrid, since it combines syntactic and statistical information (about word and co-occurrence frequency).</Paragraph>
    <Paragraph position="13"> The following examples -- which, like all the examples in this paper, are actual extraction results -- demonstrate the potential of our system to detect collocation candidates, even if subject to complex syntactic transformations.</Paragraph>
    <Paragraph position="14">  1.a) raise question: The question of political leadership has been raised several times by previous speakers.</Paragraph>
    <Paragraph position="15"> 1.b) play role: What role can Canada's immigration program play in helping developing nations... ? 1.c) make mistake: We could look back and probably see a lot of mistakes that all parties including Canada perhaps may have made.</Paragraph>
  </Section>
  <Section position="5" start_page="954" end_page="955" type="metho">
    <SectionTitle>
3 Multilingual Extraction Results
</SectionTitle>
    <Paragraph position="0"> In this section, we present several extraction results obtained with the system presented in section 2. The experiments were performed on data in the four languages, and involved the following corpora: for English and French, a subpart or the HansardCorpusofproceedingsfromtheCanadian Parliament; for Italian, documents from the Swiss Parliament; and for Spanish, a news corpus distributed by the Linguistic Data Consortium.</Paragraph>
    <Paragraph position="1"> Some statistics on these corpora, some processing details and quantitative results are provided in  tokens); the next three rows show some parsing statistics3, and the last rows display the number of collocation candidates extracted and of candidates for which the LLR score could be computed4.</Paragraph>
    <Paragraph position="2">  In Table 2 we list the top collocations (of length two) extracted for each language. We do not specificallydiscussheremultilingual issuesincol- null and Italian are due to the relatively reduced coverage of the parsers of these two languages (under development). However, even if a sentence is not assigned a complete parse tree, some syntactic pairs can still be collected from the partial parses.</Paragraph>
    <Paragraph position="3">  The collocation pairs obtained were further processed with a procedure of long collocations extraction described elsewhere (Seretan et al., 2003). Some examples of collocations of length 3, 4 and 5 obtained are: minister of Canadian heritage, house proceed to statement by, secretary to leader of gouvernment in house of common (En), question adresser `a ministre, programme de aide `a r'enovation r'esidentielle, agent employer force susceptible causer (Fr), bolsa de comercio local, peso en cuota de fondo de inversi'on, permitir uso de papel de deuda esterno (Sp), consiglio federale disporre, creazione di nuovo posto di lavoro, costituire fattore penalizzante per regione (It)5. 5Note that the output of the procedure contains lemmas rather than inflected forms.</Paragraph>
  </Section>
  <Section position="6" start_page="955" end_page="956" type="metho">
    <SectionTitle>
4 Comparative Evaluation Hypotheses
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="955" end_page="955" type="sub_section">
      <SectionTitle>
4.1 Does Parsing Really Help?
</SectionTitle>
      <Paragraph position="0"> Extractingcollocationsfromrawtext, withoutpreprocessing the source corpora, offers some clear advantages over linguistically-informed methods such as ours, which is based on the syntactic analysis: speed (in contrast, parsing large corpora of texts is expected to be much more time consuming), robustness (symbolic parsers are often not robust enough for processing large quantities of data), portability (no need to a priori define syntactic configurations for collocations candidates).</Paragraph>
      <Paragraph position="1"> On the other hand, these basic systems suffer from the combinatorial explosion if the candidate pairs are chosen from a large search space. To cope with this problem, a candidate pair is usually chosen so that both words are inside a context ('collocational') window of a small length. A 5word window is the norm, while longer windows prove impractical (Dias, 2003).</Paragraph>
      <Paragraph position="2"> It has been argued that a window size of 5 is actually sufficient for capturing most of the collocational relations from texts in English. But there is no evidence sustaining that the same holds for other languages, like German or the Romance ones that exhibit freer word order. Therefore, as window-based systems miss the 'long-distance' pairs, their recall is presumably lower than that of parse-based systems. However, the parser could also miss relevant pairs due to inherent analysis errors.</Paragraph>
      <Paragraph position="3"> As for precision, the window systems are susceptible to return more noise, produced by the grammatically unrelated pairs inside the collocational window. By dividing the number of grammatical pairs by the total number of candidates considered, we obtain the overall precision with respecttogrammaticality; thisresultisexpectedto be considerably worse in the case of basic method than for the parse-based methods, just by virtue of the parsing task. As for the overall precision with respect to collocability, we expect the proportional figures to be preserved. This is because the parser-based methods return less, but better pairs (i.e., only the pairs identified as grammatical), and because collocations are a subset of the grammatical pairs.</Paragraph>
      <Paragraph position="4"> Summing up, the evaluation hypothesis that can be stated here is the following: parse-based methods outperform basic methods thanks to a drastic reduction of noise. While unquestionable under the assumption of perfect parsing, this hypothesis has to be empirically validated in an actual setting.</Paragraph>
    </Section>
    <Section position="2" start_page="955" end_page="956" type="sub_section">
      <SectionTitle>
4.2 Is More Data Better Than Better Data?
</SectionTitle>
      <Paragraph position="0"> The hypothesis above refers to the overall precision and recall, that is, relative to the entire list of selected candidates. One might argue that these numbers are less relevant for practice than they arefromatheoretical(evaluation)perspective, and that the exact composition of the list of candidates identified is unimportant if only the top results (i.e., those pairs situated above a threshold) are looked at by a lexicographer or an application.</Paragraph>
      <Paragraph position="1"> Considering a threshold for the n-best candidates works very much in the favor of basic methods. As the amount of data increases, there is a reduction of the noise among the best-scored pairs, which tend to be more grammatical because the likelihood of encountering many similar noisy pairs is lower. However, as the following example shows, noisy pairs may still appear in top, if they occur often in a longer collocation: 2.a) les essais du missile de croisi`ere 2.b) essai - croisi`ere The pair essai - croisi`ere is marked by the basic systems as a collocation because of the recurrent association of the two words in text as part or the longer collocation essai du missile de croisi`ere. It is an grammatically unrelated pair, while the correct pairs reflecting the right syntactic attachment are essai missile and missile (de) croisi`ere.</Paragraph>
      <Paragraph position="2"> We mentioned that parsing helps detecting the 'long-distance' pairs that are outside the limits of the collocational window. Retrieving all such complex instances (including all the extraposition cases) certainly augment the recall of extraction systems, but this goal might seem unjustified, because the risk of not having a collocation represented at all diminishes as more and more data is processed. One might think that systematically missing long-distance pairs might be very simply compensated by supplying the system with more data, and thus that larger data is a valid alternative to performing complex processing.</Paragraph>
      <Paragraph position="3"> While we agree that the inclusion of more data compensates for the 'difficult' cases, we do consider this truly helpful in deriving collocational information, for the following reasons: (1) more data means more noise for the basic methods; (2) some collocations might systematically appear in  a complex grammatical environment (such as passive constructions or with additional material inserted between the two items); (3) more importantly, the complex cases not taken into account alter the frequency profile of the pairs concerned.</Paragraph>
      <Paragraph position="4"> These observations entitle us to believe that, evenwhenmoredataisadded, the n-bestprecision might remain lower for the basic methods with respect to the parse-based ones.</Paragraph>
    </Section>
    <Section position="3" start_page="956" end_page="956" type="sub_section">
      <SectionTitle>
4.3 How Real the Counts Are?
</SectionTitle>
      <Paragraph position="0"> Syntactic analysis (including shallower levels of linguistic analysis traditionally used in collocation extraction, suchaslemmatization, POStagging, or chunking) has two main functions.</Paragraph>
      <Paragraph position="1"> On the one hand, it guides the extraction system in the candidate selection process, in order to better pinpoint the pairs that might form collocations and to exclude the ones considered as inappropriate (e.g., the pairs combining function words, such as a preposition followed by a determiner).</Paragraph>
      <Paragraph position="2"> On the other, parsing supports the association measures that will be applied on the selected candidates, by providing more exact frequency information on words -- the inflected forms count as instances of the same lexical item -- and on their co-occurrence frequency -- certain pairs might count as instance of the same pair, others do not.</Paragraph>
      <Paragraph position="3"> In the following example, the pair loi modifier is an instance of a subject-verb collocation in 3.a), and of a verb-object collocation type in 3.b). Basic methods are unable to distinguish between the two types, and therefore count them as equivalent.</Paragraph>
      <Paragraph position="4">  3.a) Loi modifiant la Loi sur la responsabilit'e civile 3.b) la loi devrait ^etre modifi'ee  Parsing helps to create a more realistic frequency profile for the candidate pairs, not only because of the grammaticality constraint it applies on the pairs (wrong pairs are excluded), but also because it can detect the long-distance pairs that are outside the collocational window.</Paragraph>
      <Paragraph position="5"> Given that the association measures rely heavily on the frequency information, the erroneous counts have a direct influence on the ranking of candidates and, consequently, on the top candidates returned. We believe that in order to achieve a good performance, extraction systems should be as close as possible to the real frequency counts and, of course, to the real syntactic interpretation provided in the source texts6.</Paragraph>
      <Paragraph position="6"> Since parser-based methods rely on more accurate frequency information for words and their co-occurrence than window methods, it follows that the n-best list obtained with the first methods will probably show an increase in quality over the second. null To conclude this section, we enumerate the hypotheses that have been formulated so far: (1) Parse methods provide a noise-freer list of collocation candidates, in comparison with the window methods; (2) Local precision (of best-scored results) with respect to grammaticality is higher for parse methods, since in basic methods some noise still persists, even if more data is included; (3) Local precision with respect to collocability is higher for parse methods, because they use a more realistic image of word co-occurrence frequency.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="956" end_page="958" type="metho">
    <SectionTitle>
5 Comparative Evaluation
</SectionTitle>
    <Paragraph position="0"> We compare our hybrid method (based on syntactic processing of texts) against the window method classically used in collocation extraction, from the point of view of their precision with respect to grammaticality and collocability.</Paragraph>
    <Section position="1" start_page="956" end_page="957" type="sub_section">
      <SectionTitle>
5.1 The Method
</SectionTitle>
      <Paragraph position="0"> The n-best extraction results, for a given n (in our experiment, n varies from 50 to 500 at intervals of 50) are checked in each case for grammatical well-formedness and for lexicalization. By lexicalization we mean the quality of a pair to constitute (part of) a multi-word expression -- be it compound, collocation, idiom or another type of syntagmatic lexical combination. We avoid giving collocability judgments since the classification of multi-word expressions cannot be made precisely and with objective criteria (McKeown and Radev, 2000). We rather distinguish between lexicalizable and trivial combinations (completely regular productions, such as big house, buy bread, that do not deserve a place in the lexicon). As in (Choueka, 1988) and (Evert, 2004), we consider that a dominant feature of collocations is that they are unpredictable for speakers and therefore have to be stored into a lexicon.</Paragraph>
      <Paragraph position="1"> 6To exemplify this point: the pair d'eveloppement humain (which has been detected as a collocation by the basic method) looks like a valid expression, but the source text consistently offers a different interpretation: d'eveloppement des ressources humaines.</Paragraph>
      <Paragraph position="2">  Each collocation from the n-best list at the different levels considered is therefore annotated with one of the three flags: 1. ungrammatical; 2. trivial combination; 3. multi-word expression (MWE).</Paragraph>
      <Paragraph position="3"> On the one side, we evaluate the results of our hybrid, parse-based method; on the other, we simulate a window method, by performing the following steps: POS-tag the source texts; filter the lexical items and retain only the open-class POS; consider all their combinations within a collocational window of length 5; and, finally, apply the log-likelihood ratios test on the pairs of each configuration type.</Paragraph>
      <Paragraph position="4"> In accordance with (Evert and Kermes, 2003), we consider that the comparative evaluation of collocation extraction systems should not be done at the end of the extraction process, but separately for each stage: after the candidate selection stage, for evaluating the quality (in terms of grammaticality) of candidates proposed; and after the application of collocability measures, for evaluating the measures applied. In each of these cases, different evaluation methodologies and resources are required. In our case, since we used the same measure for the second stage (the log-likelihood ratios test), we could still compare the final output of basic and parse-based methods, as given by the combination of the first stage with the same collocability measure.</Paragraph>
      <Paragraph position="5"> Again, similarly to Krenn and Evert (2001), we believe that the homogeneity of data is important for the collocability measures. We therefore applied the LLR test on our data after first partitioning it into separate sets, according to the syntactical relation holding in each candidate pair. As the data used in the basic method contains no syntacticinformation, thepartitioningwasdonebasedon POS-combination type.</Paragraph>
    </Section>
    <Section position="2" start_page="957" end_page="957" type="sub_section">
      <SectionTitle>
5.2 The Data
</SectionTitle>
      <Paragraph position="0"> The evaluation experiment was performed on the whole French corpus used in the extraction experiment (section 2), that is, a subpart of the Hansard corpus of Canadian Parliament proceedings. It contains 112 text files totalling 8.43 MB, with an average of 628.1 sentences/file and 23.46 tokens/sentence (as detected by the parser). The total number of tokens is 1, 649, 914.</Paragraph>
      <Paragraph position="1"> On the one hand, the texts were parsed and 370, 932 candidate pairs were extracted using the hybrid method we presented. Among the pairs extracted, 11.86% (44, 002 pairs) were multi-word expressions identified at parse-time, since present in the parser's lexicon. The log-likelihood ratios test was applied on the rest of pairs. A score could be associated to 308, 410 of these pairs (corresponding to 131, 384 types); for the others, the score was undefined.</Paragraph>
      <Paragraph position="2"> On the other hand, the texts were POS-tagged using the same parser as in the first case. If in the first case the candidate pairs were extracted during the parsing, in the second they were generated after the open-class filtering. From 673, 789 POSfiltered tokens, a number of 1, 024, 888 combinations (560, 073 types) were created using the 5length window criterion, while taking care not to cross a punctuation mark. A score could be associated to 1, 018, 773 token pairs (554, 202 types), which means that the candidate list is considerably larger than in the first case. The processing time was more than twice longer than in the first case, because of the large amount of data to handle.</Paragraph>
    </Section>
    <Section position="3" start_page="957" end_page="958" type="sub_section">
      <SectionTitle>
5.3 Results
</SectionTitle>
      <Paragraph position="0"> The 500 best-scored collocations retrieved with the two methods were manually checked by three human judges and annotated, as explained in 5.1, as either ungrammatical, trivial or MWE. The agreement statistics on the annotations for each method are shown in Table 3.</Paragraph>
      <Paragraph position="1">  For reporting n-best precision results, we used as reference set the annotated pairs on which at least two of the three annotators agreed. That is, from the 500 initial pairs retrieved with each method, 497 pairs were retained in the first case (parse method), and 483 pairs in the second (window method).</Paragraph>
      <Paragraph position="2"> Table 4 shows the comparative evaluation results for precision at different levels in the list of best-scored pairs, both with respect to grammaticality and to collocability (or, more exactly, the potential of a pair to constitute a MWE). The numbers show that a drastic reduction of noise is achieved by parsing the texts. The error rate with  respect to grammaticality is, on average, 15.9% for the window method; with parsing, it drops to 1.5% (i.e., 10.6 times smaller).</Paragraph>
      <Paragraph position="3"> This result confirms our hypothesis regarding the local precision which was stated in section 4.2. Despite the inherent parsing errors, the noise reduction is substantial. It is also worth noting that we compared our method against a rather high baseline, as we made a series of choices susceptible to alleviate the candidates identification with the window-based method: we filtered out function words, we used a parser for POS-tagging (that eliminated POS-ambiguity), and we filtered out cross-punctuation pairs.</Paragraph>
      <Paragraph position="4"> As for the MWE precision, the window method performs better for the first 100 pairs7); on the remaining part, the parsing-based method is on average 3.7% better. The precision curve for the window method shows a more rapid degradation than it does for the other. Therefore we can conclude that parsing is especially advantageous if one investigates more that the first hundred results (as it seems reasonable for large extraction experiments). null In spite of the rough classification we used in annotation, we believe that the comparison performed is nonetheless meaningful since results should be first checked for grammaticality and 'triviality' before defining more difficult tasks such as collocability.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>