File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/w03-1806_intro.xml
Size: 4,430 bytes
Last Modified: 2025-10-06 14:02:05
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1806"> <Title>Multiword Unit Hybrid Extraction</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Multiword units (MWUs) cover a wide range of linguistic phenomena, such as compound nouns (e.g. interior designer), phrasal verbs (e.g. run through), adverbial locutions (e.g. on purpose), compound determinants (e.g. an amount of), prepositional locutions (e.g. in front of) and institutionalized phrases (e.g. con carne). MWUs are frequently used in everyday language, usually to express precisely ideas and concepts that cannot be compressed into a single word. As a consequence, their identification is a crucial issue for applications that require some degree of semantic processing (e.g. machine translation, summarization, information retrieval).</Paragraph> <Paragraph position="1"> In recent years, there has been a growing awareness in the Natural Language Processing (NLP) community of the problems that MWUs pose and of the need to handle them robustly. For that purpose, syntactical (Didier Bourigault, 1993), statistical (Frank Smadja, 1993; Ted Dunning, 1993; Gael Dias, 2002) and hybrid syntactico-statistical methodologies (Beatrice Daille, 1996; Jean-Philippe Goldman et al., 2001) have been proposed.</Paragraph> <Paragraph position="2"> In this paper, we propose an original hybrid system called HELAS that extracts MWU candidates from part-of-speech tagged corpora. Unlike classical hybrid systems that manually pre-define local part-of-speech patterns of interest (Beatrice Daille, 1996; Jean-Philippe Goldman et al., 2001), our solution automatically identifies relevant syntactical patterns from the corpus. Word statistics are then combined with the endogenously acquired linguistic information in order to extract the most relevant sequences of words, i.e.,
MWU candidates.</Paragraph> <Paragraph position="3"> Technically, we combine the Mutual Expectation (ME) association measure with the acquisition process called GenLocalMaxs (Gael Dias, 2002) in a five-step process.</Paragraph> <Paragraph position="4"> First, the part-of-speech tagged corpus is divided into two sub-corpora: one containing words and one containing part-of-speech tags. Second, each sub-corpus is segmented into a set of positional ngrams, i.e., ordered vectors of textual units. Third, the ME independently evaluates the degree of cohesiveness of each positional ngram, i.e., of any positional ngram of words and of any positional ngram of part-of-speech tags. Fourth, a combination of both MEs is used to evaluate the global degree of cohesiveness of any sequence of words associated with its respective part-of-speech tag sequence. Finally, the GenLocalMaxs algorithm retrieves all the MWU candidates by identifying local maxima of the association measure values, thus avoiding the definition of global thresholds. The overall architecture can be seen in Figure 1.</Paragraph> <Paragraph position="5"> Compared to existing hybrid systems, HELAS (Hybrid Extraction of Lexical ASsociations) offers several benefits. By avoiding human intervention in the definition of syntactical patterns, it provides total flexibility of use. Indeed, the system can be used for any language without any specific tuning.
HELAS also identifies various types of MWUs, such as phrasal verbs, adverbial locutions, compound determinants, prepositional locutions and institutionalized phrases.</Paragraph> <Paragraph position="6"> Finally, it responds to some extent to the claim of Benoit Habert and Christian Jacquemin (1993) that &quot;existing hybrid systems do not sufficiently tackle the problem of the interdependency between the filtering stage [the definition of syntactical patterns] and the acquisition process [the scoring and the election of relevant sequences of words] as they propose that these two steps should be independent&quot;.</Paragraph> <Paragraph position="7"> The article is divided into five main sections: (1) we introduce the related work; (2) we present the text corpus segmentation into positional ngrams; (3) we define the Mutual Expectation and a new combined association measure; (4) we propose the GenLocalMaxs algorithm as the acquisition process; and (5) we present some results over the Brown Corpus.</Paragraph> </Section> </Paper>