<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1806"> <Title>Multiword Unit Hybrid Extraction</Title>
<Section position="4" start_page="1" end_page="11" type="metho"> <SectionTitle> 3 Text Segmentation </SectionTitle>
<Paragraph position="0"> Positional ngrams are nothing more than ordered vectors of textual units, whose principles are introduced in the next subsection.</Paragraph>
<Section position="1" start_page="1" end_page="11" type="sub_section"> <SectionTitle> 3.1 Positional Ngrams </SectionTitle>
<Paragraph position="0"> The original idea of the positional ngram model (Gael Dias, 2002) comes from the lexicographic evidence that most lexical relations associate words separated by at most five other words (John Sinclair, 1974). As a consequence, lexical relations such as MWUs can be continuous or discontinuous sequences of words in a context of at most eleven words (i.e. 5 words to the left of a pivot word, 5 words to the right of the same pivot word and the pivot word itself). In general terms, a MWU can be defined as a specific continuous or discontinuous sequence of words in a (2.F+1)-word size window context (i.e. F words to the left of a pivot word, F words to the right of the same pivot word and the pivot word itself). This situation is illustrated in Figure 2 for the multiword unit Ngram Statistics, which fits in the window context of size 2.3+1=7.</Paragraph>
<Paragraph position="1"> Thus, any substring (continuous or discontinuous) that fits inside the window context and contains the pivot word is called a positional word ngram. For instance, [Ngram Statistics] is a positional word ngram, as is the discontinuous sequence [Ngram ___ from], where the gap represented by the underline stands for any word occurring between Ngram and from (in this case, Statistics). More examples are given in Table 1.</Paragraph>
<Paragraph position="2"> Generically, any positional word ngram may be defined as a vector of words [p11 u1 p12 u2 ... p1n un], where ui stands for any word in the positional ngram and p1i represents the distance that separates the words u1 and ui. Thus, the positional word ngram [Ngram Statistics] would be rewritten as [0 Ngram +1 Statistics]. More examples are given in Table 2 (positional word ngrams and their algebraic notation).</Paragraph>
<Paragraph position="3"> However, in a part-of-speech tagged corpus, each word is associated with a unique part-of-speech tag. As a consequence, each positional word ngram is linked to a corresponding positional tag ngram. A positional tag ngram is nothing more than an ordered vector of part-of-speech tags, exactly in the same way a positional word ngram is an ordered vector of words. Let's exemplify this situation with the following portion of a part-of-speech tagged sentence following the Brown tag set: Virtual /JJ Approach /NN to /IN Deriving /VBG Ngram /NN Statistics /NN from /IN Large /JJ Scale /NN Corpus /NN. It is clear that the corresponding positional tag ngram of the positional word ngram [0 Ngram +1 Statistics] is the vector [0 /NN +1 /NN]. More examples are in Table 3.</Paragraph>
<Paragraph position="4"> Generically, any positional tag ngram may be defined as a vector of part-of-speech tags [p11 t1 p12 t2 ... p1n tn], where ti stands for any part-of-speech tag in the positional tag ngram and p1i represents the distance that separates the part-of-speech tags t1 and ti.</Paragraph>
<Paragraph position="5"> So, any sequence of words, in a part-of-speech tagged corpus, is associated with a positional word ngram and a corresponding positional tag ngram.</Paragraph>
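<Paragraph> To make the segmentation step concrete, the Python sketch below enumerates the positional word ngrams and their corresponding positional tag ngrams around a pivot word, encoding each ngram as a list of (offset, unit) pairs whose first offset is 0, as in the algebraic notation above. The function name, the pair encoding and the window F=3 are our own illustrative choices rather than anything prescribed by the paper.</Paragraph>
    from itertools import combinations

    def positional_ngrams(tagged_tokens, pivot, F=3):
        """Enumerate positional word ngrams and their tag ngrams: every
        continuous or discontinuous sub-sequence of the (2*F+1)-word window
        that contains the pivot word, with offsets rebased on the first
        element of the ngram (so the first offset is always 0)."""
        lo, hi = max(0, pivot - F), min(len(tagged_tokens), pivot + F + 1)
        others = [i for i in range(lo, hi) if i != pivot]
        ngrams = []
        for n in range(1, len(others) + 1):
            for combo in combinations(others, n):
                idx = sorted(combo + (pivot,))
                base = idx[0]
                words = [(i - base, tagged_tokens[i][0]) for i in idx]
                tags = [(i - base, tagged_tokens[i][1]) for i in idx]
                ngrams.append((words, tags))
        return ngrams

    # Toy usage on the running example, with "Ngram" (index 4) as the pivot.
    sentence = [("Virtual", "/JJ"), ("Approach", "/NN"), ("to", "/IN"),
                ("Deriving", "/VBG"), ("Ngram", "/NN"), ("Statistics", "/NN"),
                ("from", "/IN"), ("Large", "/JJ"), ("Scale", "/NN"), ("Corpus", "/NN")]
    for words, tags in positional_ngrams(sentence, pivot=4)[:3]:
        print(words, tags)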
<Paragraph position="6"> In order to introduce the part-of-speech tag factor in any sequence of words of a part-of-speech tagged corpus, we present an alternative notation of positional ngrams called positional word-tag ngrams.</Paragraph>
<Paragraph position="7"> In order to represent a sequence of words with its associated part-of-speech tags, a positional ngram may be represented by the following vector of words and part-of-speech tags [p11 u1 t1 p12 u2 t2 ... p1n un tn], where ui stands for any word in the positional ngram, ti stands for the part-of-speech tag of the word ui, and p1i represents the distance that separates the words u1 and ui (by statement, any pii is equal to zero). Thus, the positional ngram [Ngram Statistics] can be represented by the vector [0 Ngram /NN +1 Statistics /NN] given the text corpus in section (3.1). More examples are given in Table 4.</Paragraph>
<Paragraph position="8"> [Figure 2: the 7-word window context (F=3 words on each side of the pivot) over the sequence "Virtual Approach to Deriving Ngram Statistics from Large Scale".]</Paragraph>
<Paragraph position="9"> This alternative notation will allow us to define, with elegance, our combined association measure, introduced in the next section.</Paragraph>
</Section>
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.2 Data Preparation </SectionTitle>
<Paragraph position="0"> So, the first step of our architecture deals with segmenting the input text corpus into positional ngrams. First, the part-of-speech tagged corpus is divided into two sub-corpora: one sub-corpus of words and one sub-corpus of part-of-speech tags. The word sub-corpus is then segmented into its set of positional word ngrams, exactly in the same way the tagged sub-corpus is segmented into its set of positional tag ngrams.</Paragraph>
<Paragraph position="1"> In parallel, each positional word ngram is associated with its corresponding positional tag ngram in order to further evaluate the global degree of cohesiveness of any sequence of words in a part-of-speech tagged corpus.</Paragraph>
<Paragraph position="2"> Our basic idea is to evaluate the degree of cohesiveness of each positional ngram independently (i.e. the positional word ngrams on one side and the positional tag ngrams on the other side) in order to calculate the global degree of cohesiveness of any sequence in the part-of-speech tagged corpus by combining its respective degrees of cohesiveness, i.e. the degree of cohesiveness of its sequence of words and the degree of cohesiveness of its sequence of part-of-speech tags.</Paragraph>
<Paragraph position="3"> In order to evaluate the degree of cohesiveness of any sequence of textual units, we use the association measure called Mutual Expectation.</Paragraph>
</Section> </Section>
<Section position="5" start_page="11" end_page="11" type="metho"> <SectionTitle> 4 Cohesiveness Evaluation </SectionTitle>
<Paragraph position="0"> The Mutual Expectation (ME) has been introduced by Gael Dias (2002) and evaluates the degree of cohesiveness that links together all the textual units contained in a positional ngram (∀n, n ≥ 2), based on the concept of Normalized Expectation and relative frequency.</Paragraph>
<Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.1 Normalized Expectation </SectionTitle>
<Paragraph position="0"> The basic idea of the Normalized Expectation (NE) is to evaluate the cost, in terms of cohesiveness, of the loss of one element in a positional ngram. Thus, the NE is defined in Equation 1, where the function k(.)
returns the frequency of any positional ngram.</Paragraph>
<Paragraph position="1"> However, evaluating the average cost of the loss of an element is not enough to characterize the degree of cohesiveness of a sequence of textual units. The Mutual Expectation is introduced to solve this insufficiency.</Paragraph>
</Section>
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.2 Mutual Expectation </SectionTitle>
<Paragraph position="0"> Many applied works in Natural Language Processing have shown that frequency is one of the most relevant statistics for identifying relevant textual associations. For instance, in the context of multiword unit extraction, John Justeson and Slava Katz (1995) and Beatrice Daille (1996) assess that the comprehension of a multiword unit is an iterative process, it being necessary that a unit be pronounced more than once for its comprehension to be possible. Gael Dias (2002) believes that this phenomenon can be extended to part-of-speech tags.</Paragraph>
<Paragraph position="1"> From this assumption, he poses that between two positional ngrams with the same NE, the most frequent positional ngram is more likely to be a relevant sequence.</Paragraph>
<Paragraph position="2"> So, the Mutual Expectation of any positional ngram is defined in Equation 3, based on its NE and its relative frequency embodied by the function p(.).</Paragraph>
<Paragraph position="3"> The &quot;^&quot; corresponds to a convention used in algebra that consists in writing a &quot;^&quot; on top of the omitted term of a given succession indexed from 1 to n.</Paragraph>
<Paragraph position="4"> We will note that the ME shows interesting properties. One of them is the fact that it does not under-evaluate interdependencies when frequent individual textual units are present. In particular, this allows us to avoid the use of lists of stop words. Thus, when calculating all the positional ngrams, all the words and part-of-speech tags are used. This fundamentally contributes to the flexibility of use of our system.</Paragraph>
<Paragraph position="5"> As we said earlier, the ME is going to be used to calculate the degree of cohesiveness of any positional word ngram and any positional tag ngram. The way we calculate the global degree of cohesiveness of any sequence of words associated with its part-of-speech tag sequence, based on its two MEs, is discussed in the next subsection.</Paragraph>
</Section>
<Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.3 Combined Association Measure </SectionTitle>
<Paragraph position="0"> The drawbacks shown by the statistical methodologies evidence the lack of linguistic information. Indeed, these methodologies can only identify textual associations in the context of their usage. As a consequence, many relevant structures cannot be introduced directly into lexical databases as they do not guarantee adequate linguistic structures for that purpose.</Paragraph>
<Paragraph position="1"> In this paper, we propose a first attempt to solve this problem without pre-defining syntactical patterns of interest that would bias the extraction process. Our idea is simply to combine the strength existing between the words of a sequence and the evidenced interdependencies between its part-of-speech tags.</Paragraph>
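<Paragraph> Before turning to the combination, the Normalized Expectation and Mutual Expectation just described can be sketched in Python as follows, assuming the standard formulation in which NE divides the frequency of the full ngram by the average frequency of the (n-1)-grams obtained by losing one element, and ME multiplies NE by relative frequency. The names freq (a hypothetical mapping from positional ngrams, encoded as tuples of (offset, unit) pairs, to raw counts) and total (the number of positional ngrams of the same size) are our own.</Paragraph>
    def omit(ngram, i):
        """Drop the i-th (offset, unit) pair and rebase the offsets on the
        new first element, so the remaining (n-1)-gram keeps a 0 first offset."""
        rest = [pair for j, pair in enumerate(ngram) if j != i]
        base = rest[0][0]
        return tuple((off - base, unit) for off, unit in rest)

    def normalized_expectation(ngram, freq):
        """NE as described in Section 4.1: the frequency k(.) of the full
        positional ngram divided by the average frequency of its (n-1)-grams,
        i.e. the average cost of the loss of one element (our reading of Equation 1)."""
        n = len(ngram)
        k_full = freq.get(ngram, 0)
        avg_loss = sum(freq.get(omit(ngram, i), 0) for i in range(n)) / n
        return k_full / avg_loss if avg_loss else 0.0

    def mutual_expectation(ngram, freq, total):
        """ME as described in Section 4.2: the relative frequency p(.) of the
        ngram times its NE (our reading of Equation 3)."""
        return (freq.get(ngram, 0) / total) * normalized_expectation(ngram, freq)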
<Paragraph position="2"> We could summarize this idea as follows: the more cohesive the words of a sequence and the more cohesive its part-of-speech tags, the more likely the sequence is to embody a multiword unit.</Paragraph>
<Paragraph position="3"> This idea can only be supported due to two assumptions. On the one hand, a great deal of studies in lexicography and terminology assess that most MWUs evidence well-known morpho-syntactic structures (Gaston Gross, 1996). On the other hand, MWUs are recurrent combinations of words capable of representing a fifth of the overall surface of a text (Benoit Habert and Christian Jacquemin, 1993). Consequently, it is reasonable to think that the syntactical patterns embodied by the MWUs may be identified endogenously by using statistical scores over texts of part-of-speech tags, exactly in the same manner as word dependencies are identified in corpora of words. So, the global degree of cohesiveness of any sequence of words may be evaluated by a combination of its own ME and the ME of its associated part-of-speech tag sequence. The degree of cohesiveness of any positional ngram based on a part-of-speech tagged corpus can then be evaluated by the combined association measure (CAM) defined in Equation 4, where α stands for a parameter that tunes the focus either on words or on part-of-speech tags.</Paragraph>
<Paragraph position="4"> We will see in the final section of this paper that different values of α lead to fundamentally different sets of multiword unit candidates. Indeed, α can go from a total focus on part-of-speech tags (i.e. the relevance of a word sequence is based only on the relevance of its part-of-speech sequence) to a total focus on words (i.e. the relevance of a word sequence is defined only by its word dependencies). Before going to experimentation, we need to introduce the acquisition process whose objective is to extract the MWU candidates.</Paragraph>
</Section> </Section>
<Section position="6" start_page="11" end_page="11" type="metho"> <SectionTitle> 5 The Acquisition Process </SectionTitle>
<Paragraph position="0"> The GenLocalMaxs (Gael Dias, 2002) proposes a flexible and fine-tuned approach for the selection process as it concentrates on the identification of local maxima of association measure values. Specifically, the GenLocalMaxs elects MWUs from the set of all the valued positional ngrams based on two assumptions. First, the association measures show that the more cohesive a group of words is, the higher its score will be. Second, MWUs are localized associated groups of words. So, we may deduce that a positional word-tag ngram is a MWU if its combined association measure value is higher than or equal to the combined association measure values of all its sub-groups of (n-1) words and if it is strictly higher than the combined association measure values of all its super-groups of (n+1) words. Let cam be the combined association measure, W a positional word-tag ngram, Ωn-1 the set of all the positional word-tag (n-1)-grams contained in W, Ωn+1 the set of all the positional word-tag (n+1)-grams containing W, and sizeof(.) a function that returns the number of words of a positional word-tag ngram. The GenLocalMaxs is then defined as: ∀x ∈ Ωn-1, ∀y ∈ Ωn+1, W is a MWU if (sizeof(W) = 2 ∧ cam(W) > cam(y)) ∨ (sizeof(W) ≠ 2 ∧ cam(W) ≥ cam(x) ∧ cam(W) > cam(y)). Among others, the GenLocalMaxs shows one important property: it does not depend on global thresholds.</Paragraph>
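<Paragraph> The acquisition step can be sketched as follows. The geometric weighting of the two MEs is our assumption for Equation 4 (the exact formula is not reproduced here), while the election test follows the GenLocalMaxs condition as stated above; cam_of, subgrams, supergrams and sizeof are hypothetical helpers supplied by the caller.</Paragraph>
    def cam(me_words, me_tags, alpha):
        """Combined association measure: one plausible reading of Equation 4,
        assumed here to weight the word ME and the tag ME geometrically
        (alpha = 1 focuses on words only, alpha = 0 on part-of-speech tags only)."""
        return (me_words ** alpha) * (me_tags ** (1 - alpha))

    def is_mwu(W, cam_of, subgrams, supergrams, sizeof):
        """GenLocalMaxs as described above: W is elected as a MWU if its CAM
        value is >= the CAM of every (n-1)-gram it contains and > the CAM of
        every (n+1)-gram containing it; for 2-grams only the super-gram
        condition applies (the role of sizeof(.) in the definition above)."""
        c = cam_of(W)
        sub_ok = sizeof(W) == 2 or all(c >= cam_of(x) for x in subgrams(W))
        super_ok = all(c > cam_of(y) for y in supergrams(W))
        return sub_ok and super_ok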
<Paragraph position="1"> A direct implication of this independence from thresholds is the fact that, as no tuning needs to be made in order to acquire the set of all the MWU candidates, the use of the system remains as flexible as possible. Finally, we show the results obtained by applying HELAS over the Brown Corpus.</Paragraph>
</Section>
<Section position="7" start_page="11" end_page="11" type="metho"> <SectionTitle> 6 The Experiments </SectionTitle>
<Paragraph position="0"> In order to test our architecture, we have conducted a number of experiments with 11 different values of α for a portion of the Brown Corpus containing 249 578 words, i.e. 249 578 words plus their 249 578 part-of-speech tags. The limited size of our corpus is mainly due to the space complexity of our system. Indeed, the number of computed positional ngrams is huge even for a small corpus. For instance, 21 463 192 positional ngrams are computed for this particular corpus for a 7-word size window context. As a consequence, computation is hard. (We are already working on an efficient implementation of HELAS using suffix-arrays and the concept of masks.) For this experiment, HELAS has been tested on a personal computer with 128 MB of RAM, a 20 GB hard disk and an AMD 1.4 GHz processor under Linux Mandrake 7.2. On average, each experiment (i.e. for a given α) took 4 hours and 20 minutes. Knowing that the processing time of our system increases proportionally with the size of the corpus, it was unmanageable, for this particular experiment, to test our architecture over a bigger corpus. Even so, the whole processing stage lasted almost 48 hours.</Paragraph>
<Paragraph position="1"> We will divide our experiment into two main parts. First, we will carry out a quantitative analysis and then a qualitative analysis. All results will only tackle contiguous multiword units, although non-contiguous sequences may be extracted. This decision is due to the lack of space.</Paragraph>
<Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 6.1 Quantitative Analysis </SectionTitle>
<Paragraph position="0"> In order to understand, as deeply as possible, the interaction between word cohesiveness and part-of-speech tag cohesiveness, we chose eleven different values for α, i.e. α ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}, going from total focus on words (α = 1) to total focus on part-of-speech tags (α = 0).</Paragraph>
<Paragraph position="1"> First, we show the number of extracted contiguous MWU candidates by α in Table 5. The total results are not surprising. Indeed, with α = 0, the focus is exclusively on part-of-speech tags. It means that any word sequence with an identified relevant part-of-speech sequence is extracted independently of the words it contains. For instance, all the word sequences with the pattern [/JJ /NN] (i.e. Adjective + Noun) may be extracted independently of their word dependencies! This obviously leads to a large number of extracted sequences. The inclusion of the word factor, by increasing the value of α, progressively leads to a decreasing number of extracted positional ngrams. In fact, the word sequences with relevant syntactical structures are being filtered out depending on their word statistics. Finally, with α = 1, the focus is exclusively on words.
The impact of the syntactical structure is null and the positional ngrams are extracted based on their word associations alone.</Paragraph>
<Paragraph position="2"> In this case, the word sequences do not form classes of morpho-syntactic structures, which is the reason why fewer positional ngrams are extracted.</Paragraph>
<Paragraph position="3"> A deeper analysis of Table 5 reveals interesting results. The smaller the value of α, the more positional 2grams are extracted. This situation is illustrated in Figure 3.</Paragraph>
<Paragraph position="4"> [Figure 3: number of extracted ngrams by α.]</Paragraph>
<Paragraph position="5"> Once again these results are not surprising. The Mutual Expectation tends to give more importance to frequent sequences of textual units. While it performs reasonably well on word sequences, it tends to over-evaluate the part-of-speech tag sequences. Indeed, sequences of two part-of-speech tags are much more frequent than other types of sequences and, as a consequence, tend to be over-evaluated in terms of cohesiveness. As small values of α focus on syntactical structures, it is clear that in this case small sequences of words are preferred over longer sequences.</Paragraph>
<Paragraph position="6"> By looking at Figure 3 and Table 5, we might think that a great number of extracted sequences are common to each experiment. However, this is not true. In order to assess this claim, we propose, in Table 6, a summary of the identical ratio.</Paragraph>
<Paragraph position="7"> The identical ratio calculates, for two values of α, the quotient between the number of identical extracted sequences and the number of different extracted sequences. Thus, the first value of the first row of Table 6 represents the identical ratio for α=0 and α=0.1, and means that there are 14.64 times more identical extracted sequences than different sequences between both experiments.</Paragraph>
<Paragraph position="8"> Taking α=0 and α=1, it is interesting to see that there are many more different sequences than identical sequences between both experiments (identical ratio = 0.47). In fact, this phenomenon progressively increases as the word factor is introduced into the combined association measure, up to α=1. This was somewhat unexpected. Nevertheless, this situation can be partly deciphered from Figure 3. Indeed, Figure 3 shows that longer sequences are preferred as α increases. In fact, what happens is that short syntactically well-founded sequences are being replaced by longer word sequences that may lack linguistic information. For instance, the sequence [Blue Mosque] was extracted with α=0, although the longer sequence [the Blue Mosque] was preferred with α=1, as whenever [Blue Mosque] appears in the text, the determiner [the] precedes it.</Paragraph>
<Paragraph position="9"> Finally, a last important result concerns the frequency of the extracted sequences. Table 7 gives an overview of the situation. The figures are clear. Most of the extracted sequences occur only twice in the input text corpus. This result is rather encouraging, as most known extractors need high frequencies in order to decide whether a sequence is a MWU or not. This situation is mainly due to the GenLocalMaxs algorithm.</Paragraph>
</Section>
<Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 6.2 Qualitative Analysis </SectionTitle>
<Paragraph position="0"> As many authors assess (Frank Smadja, 1993; John Justeson and Slava Katz, 1995), deciding whether a sequence of words is a multiword unit or not is a tricky problem.
For that purpose, different definitions of the multiword unit have been proposed. One of the most successful attempts can be attributed to Gaston Gross (1996), who classifies multiword units into six groups and provides techniques to determine which group a unit belongs to.</Paragraph>
<Paragraph position="1"> As a consequence, we consider as a multiword unit any compound noun (e.g. interior designer), compound determiner (e.g. an amount of), verbal locution (e.g. run through), adverbial locution (e.g. on purpose), adjectival locution (e.g. dark blue) or prepositional locution (e.g. in front of).</Paragraph>
<Paragraph position="2"> The analysis of the results has been done in-house, although we are aware that an external, independent cross-validation would have been more suitable. However, it was not logistically possible to do so, and by using Gaston Gross's classification and methodology, we narrow the human evaluation error as much as possible. Technically, we have randomly extracted and analysed 200 positional 2grams, 200 positional 3grams and 100 positional 4grams for each value of α. For the specific case of positional 5grams and 6grams, all the sequences have been analysed.</Paragraph>
<Paragraph position="3"> Precision results of this analysis are given in Table 8 and show that word dependencies and part-of-speech tag dependencies may both play an important role in the identification of relevant sequences. Indeed, values of α between 0.4 and 0.5 seem to lead to optimum results.</Paragraph>
<Paragraph position="4"> Knowing that most extracted sequences are positional 2grams or positional 3grams, the global precision results approximate the results given by 2grams and 3grams. In these conditions, the best results are obtained for α=0.5, reaching an average precision of 62%. This would mean that word dependencies and part-of-speech tags contribute equally to multiword unit identification.</Paragraph>
<Paragraph position="5"> A deeper look at the results evidences interesting regularities, as shown in Figure 4. Indeed, the curves for 4grams, 5grams and 6grams are reasonably steady along the X axis, evidencing low results. This means, to some extent, that our system does not seem to be able to tackle successfully multiword units with more than three words. In fact, neither a total focus on words nor on part-of-speech tags seems to change the extraction results. However, the importance of these results must be tempered as they represent a small proportion of the extracted structures.</Paragraph>
<Paragraph position="6"> [Figure 4: precision by α and ngram size.]</Paragraph>
<Paragraph position="7"> On the other hand, the curves for 2grams and 3grams show different behaviours. For the 3gram case, it seems that the syntactical structure plays an important role in the identification process. Indeed, precision drops drastically when the focus passes to word dependencies. This is mainly due to the extraction of recurrent sequences of words that do not embody multiword unit syntactical structures, like [been able to] or [can still be]. As far as 2grams are concerned, the situation is different. In fact, it seems that too much focus on either words or part-of-speech tags leads to unsatisfactory results. Indeed, optimum results are obtained for a balance between both criteria. This result can be explained by the fact that there exist many recurrent sequences of two words in a corpus. However, most of them are not multiword units, like [of the] or [can be].
For that reason, only a balanced weight on part-of-speech tag and word dependencies may identify relevant two-word sequences.</Paragraph>
<Paragraph position="8"> However, the not-so-high precision results show that two-word sequences still remain a tricky problem for our extractor, as it is difficult to filter out very frequent patterns that embody meaningless syntactical structures.</Paragraph>
</Section> </Section> </Paper>