File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1092_metho.xml
Size: 11,154 bytes
Last Modified: 2025-10-06 14:08:42
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1092"> <Title>Automatic extraction of paraphrastic phrases from medium size corpora</Title> <Section position="4" start_page="1" end_page="1" type="metho"> <SectionTitle> 3 Overview of the approach </SectionTitle> <Paragraph position="0"> Argument structure acquisition is a complex task since the argument structure is rarely complete. To overcome this problem, we propose an acquisition process in which all the arguments are acquired separately.</Paragraph> <Paragraph position="1"> Figure 1 presents an outline of the overall paraphrase acquisition strategy. The process is made of automatic steps and manual validation stages. It is weakly supervised, since the analyst only has to provide one example to the system. However, we observed that the quality of the acquisition process highly depends on this seed example, so that several experiments have to be carried out for the acquisition of an argument structure in order to ensure accurate coverage of a domain.</Paragraph> <Paragraph position="2"> From the seed pattern, a set of paraphrases is automatically acquired, using similarity measures between words and a shallow syntactic analysis of the candidate patterns, in order to ensure that they describe a predicative sequence. All these stages are described below, after the description of the similarity measures used to calculate the semantic proximity between words.</Paragraph> <Paragraph position="3"> Several studies have recently proposed measures to calculate the semantic proximity between words. The different measures that have been proposed are not easy to evaluate (see (Lin and Pantel, 2002) for proposals). The methods proposed so far are automatic or manual and generally imply the evaluation of word clusters in different contexts (a word cluster is close to another one if the words it contains are interchangeable in some linguistic contexts).</Paragraph> <Paragraph position="4"> Budanitsky and Hirst (2001) present an evaluation of five similarity measures based on the structure of Wordnet. All the algorithms they examine rely on the hypernym-hyponym relation which structures the classification of clusters (the synsets) inside Wordnet. They sometimes reach unclear conclusions about the reasons for the performance of the different algorithms (for example, comparing Jiang and Conrath's measure (1997) with Lin's (1998): "It remains unclear, however, just why it performed so much better than Lin's measure, which is but a different arithmetic combination of the same terms"). However, the authors emphasize that the hyponymy relation alone is insufficient to capture the complexity of meaning: "Nonetheless, it remains a strong intuition that hyponymy is only one part of semantic relatedness; meronymy, such as wheel-car, is most definitely an indicator of semantic relatedness, and, a fortiori, semantic relatedness can arise from little more than common or stereotypical associations or statistical co-occurrence in real life (for example, penguin-Antarctica; birthday-candle; sleep-pajamas)".</Paragraph>
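To make the family of Wordnet-based measures discussed above concrete, the following sketch computes the path-based, Lin (1998) and Jiang-Conrath (1997) similarities with NLTK. This is purely illustrative: NLTK and the chosen word pairs are stand-ins and are not the resources used in this paper.

# Illustrative sketch of hypernym-based Wordnet similarity measures
# (the kind evaluated by Budanitsky and Hirst, 2001); requires the
# "wordnet" and "wordnet_ic" NLTK data packages to be downloaded.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information-content counts from the Brown corpus

car = wn.synset('car.n.01')
wheel = wn.synset('wheel.n.01')

# Path-based similarity: derived from the hypernym-path length between synsets.
print('path:', car.path_similarity(wheel))

# Lin and Jiang-Conrath: different arithmetic combinations of the information
# content of the two synsets and of their lowest common subsumer.
print('lin:', car.lin_similarity(wheel, brown_ic))
print('jcn:', car.jcn_similarity(wheel, brown_ic))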
<Paragraph position="5"> In this paper, we propose to use the semantic distance described in (Dutoit et al., 2002), which is based on a knowledge-rich semantic net encoding a large variety of semantic relationships between sets of words, including meronymy and stereotypical associations.</Paragraph> <Paragraph position="6"> The semantic distance between two words A and B is based on the notion of nearest common ancestors (NCA) of A and B.</Paragraph> <Paragraph position="7"> The NCA is defined as the set of nodes that are daughters of c(A) ∩ c(B) and that are not ancestors in c(A) ∩ c(B). The activation measure d is equal to the mean of the weights of the NCA, calculated from A and B. The reader is referred to (Dutoit and Poibeau, 2002) for more details and examples. This measure is sensitive enough to give valuable results for a wide variety of applications, including text filtering and information extraction (Poibeau et al., 2002).</Paragraph> </Section> <Section position="5" start_page="1" end_page="1" type="metho"> <SectionTitle> 5 The acquisition process </SectionTitle> <Paragraph position="0"> The process begins when the end-user provides a predicative linguistic structure to the system, along with a representative corpus. The system tries to discover relevant parts of text in the corpus based on the presence of plain words closely related to those of the seed pattern.</Paragraph> <Paragraph position="1"> A syntactic analysis of the sentence is then performed to verify that these plain words correspond to a paraphrastic structure. The method is close to that of Morin and Jacquemin (1999), who first try to locate pairs of relevant terms and then apply relevant patterns to analyse the nature of their relationship. However, Morin and Jacquemin only focus on term variations, whereas we are interested in predicative structures, either verbal or nominal. The syntactic variations we have to deal with are therefore different and, in part, more complex than the ones examined by Morin and Jacquemin.</Paragraph> <Paragraph position="2"> The detailed algorithm is described below: 1. The head noun of the example pattern is compared with the head noun of the candidate pattern using the proximity measure from (Dutoit et al., 2002). The result of the measure must be below a threshold fixed by the end-user.</Paragraph> <Paragraph position="3"> 2. The same condition must be fulfilled by the "expansion" element (possessive phrase or verb complement in the candidate pattern).</Paragraph> <Paragraph position="4"> 3. The structure must be predicative (either a nominal or a verbal predicate; the algorithm does not make any difference at this level).</Paragraph> <Paragraph position="5"> The following schema (Figure 2) summarizes this acquisition step, and the whole process is finally formalized in Algorithm 1. Note that the predicative form is acquired together with its arguments, as in a co-training process.</Paragraph>
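As a rough illustration of the three-step filtering just described, the following Python sketch checks a candidate pattern against the seed pattern. The Pattern representation, the proximity() stub and the threshold value are assumptions made for the example; the actual system relies on the semantic net of (Dutoit et al., 2002) and on a shallow syntactic analysis of the corpus.

# Hypothetical sketch of the candidate filtering step; the data structures
# and the proximity() function are illustrative, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class Pattern:
    head: str          # head noun or verb of the predicative structure
    expansion: str     # "expansion" element: possessive phrase or verb complement
    predicative: bool  # set by a shallow syntactic analysis of the candidate

def proximity(word_a: str, word_b: str) -> float:
    """Stand-in for the semantic distance of (Dutoit et al., 2002):
    lower values mean semantically closer words."""
    raise NotImplementedError("requires the semantic net")

def is_paraphrase(seed: Pattern, candidate: Pattern, threshold: float) -> bool:
    # 1. The head of the candidate must be close enough to the head of the seed.
    if proximity(seed.head, candidate.head) > threshold:
        return False
    # 2. The same condition must hold for the expansion element.
    if proximity(seed.expansion, candidate.expansion) > threshold:
        return False
    # 3. The candidate must be a predicative structure (nominal or verbal).
    return candidate.predicative

# Usage (threshold fixed by the end-user):
# seed = Pattern(head="cession", expansion="societe", predicative=True)
# kept = [c for c in candidates if is_paraphrase(seed, c, threshold=0.4)]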
<Paragraph position="6"> The result of this analysis is a table of predicative structures which are semantically equivalent to the initial example pattern. The process uses the corpus and the semantic net as two complementary knowledge sources: the semantic net provides information about lexical semantics and relations between words, while the corpus attests possible expressions and filters out irrelevant ones.</Paragraph> <Paragraph position="7"> We performed an evaluation on different French corpora, given that the semantic net is especially rich for this language. We took the expression cession de societe (company transfer) as an initial pattern. The system then discovered the following expressions, each of them being a semantic paraphrase of the initial seed pattern: reprise des activites, rachat d'activite, acquerir des magasins, racheter *c-company*, cession de *c-company*...</Paragraph> <Paragraph position="8"> The results must be manually validated: some structures are found even though they are irrelevant, due to the activation of irrelevant links. This is the case for the expression renoncer a se porter acquereur (to give up buying something); here, there was a spurious link between to give up and company in the semantic net.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 5.1 Dealing with syntactic variations </SectionTitle> <Paragraph position="0"> The previous step extracts semantically related predicative structures from a corpus.</Paragraph> <Paragraph position="1"> These structures are found in the corpus in various linguistic forms, but we want the system to be able to find the same information even when it appears in other kinds of linguistic sequences. That is the reason why we associate meta-graphs with the linguistic structures, so that different transformations can be recognized. This strategy is based on Harris' theory of sublanguages (1991). The transformations concern the syntactic level, either on the head (H) or on the expansion part (E), which can be introduced by the French preposition a or de or realized as a possessive phrase.</Paragraph> <Paragraph position="2"> These meta-graphs encode the major part of the linguistic structures we are concerned with in the process of information extraction.</Paragraph> <Paragraph position="3"> The graph in Figure 4 recognizes the following sequences (in brackets we indicate the couple of words previously extracted from the corpus): Reprise des activites charter... (H: reprise, E: activite); Reprendre les activites charter... (H: reprendre, E: activite); Reprise de l'ensemble des magasins suisse... (H: reprise, E: magasin); Reprendre l'ensemble des magasins suisse... (H: reprendre, E: magasin); Racheter les differentes activites... (H: racheter, E: activite); Rachat des differentes activites... (H: rachat, E: activite).</Paragraph> <Paragraph position="4"> This kind of graph is not easy to read: it includes both linguistic tags and applicability constraints. For example, the first box contains a reference to the @A column in the table of identified structures. This column contains a set of binary constraints, expressed by + or - signs. The sign + means that the identified pattern is of type verb-direct object: the graph can then be applied to deal with passive structures. In other words, the graph can only be applied if a + sign appears in the @A column of the constraint table. The constraints are removed from the instantiated graph. Even if the resulting graph is normally not visible (the compilation process directly produces a graph in binary format), we give an image of a part of it in Figure 4.</Paragraph>
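The constraint-table mechanism can be pictured with the following sketch. The table layout, the couples and the generated surface templates are assumptions made for illustration only; the real system compiles INTEX meta-graphs into finite-state transducers rather than producing string templates.

# Illustrative sketch of constraint-driven instantiation of a variation pattern.
# One row per couple (head, expansion) extracted at the previous step; "@A" = "+"
# means the couple is of type verb-direct object, so a passive variant may apply.
constraint_table = [
    {"head": "racheter", "expansion": "activite", "@A": "+"},  # verb-direct object
    {"head": "reprise",  "expansion": "activite", "@A": "-"},  # nominal predicate
]

def instantiate(row):
    """Replace the abstract variables of a (hypothetical) meta-graph by the
    couple's words; the passive template is only produced when @A is '+'."""
    templates = [f"{row['head']} ... {row['expansion']}"]              # base pattern
    if row["@A"] == "+":
        templates.append(f"{row['expansion']} ... {row['head']} (passive)")
    return templates

for row in constraint_table:
    print(row["head"], row["expansion"], "->", instantiate(row))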
<Paragraph position="7"> This mechanism using constraint tables and meta-graphs has been implemented in the finite-state toolbox INTEX (Silberztein, 1993). 26 meta-graphs have been defined, modeling linguistic variation for the 4 predicative structures defined above. The phenomena mainly concern the insertion of modifiers (on the noun or the verb), verbal transformations (passive) and phrasal structures (relative clauses like ...Vivendi, qui a rachete Universal... / ...Vivendi, that bought Universal). The meta-graphs are relatively abstract, but the end-user is not intended to manipulate them directly. Their compilation, together with the constraint tables, generates instantiated graphs, that is to say graphs in which the abstract variables have been replaced by linguistic information as modeled in the constraint tables. This method associates a couple of elements with a set of transformations that covers more examples than the ones found in the training corpus. This generalization process is close to the one imagined by Morin and Jacquemin (1999) for terminology analysis but, as we already said, we cover sequences that are not only nominal ones.</Paragraph> </Section> </Section> </Paper>