<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1124"> <Title>Detecting Multiword Verbs in the English Sublanguage of MEDLINE Abstracts</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 MWV Extraction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Analysis of MWVs in the Corpus </SectionTitle> <Paragraph position="0"> The following experiment is carried out on a test corpus consisting of 1800 abstracts from the GENIA Corpus V3.0p, with 14955 sentences and 40.84K tokens (abstract titles are not included).</Paragraph> <Paragraph position="1"> In general, methodologies for the extraction of multiword expressions (MWEs, including MWVs) can be classified into syntactic, statistical, and hybrid syntactic-statistical approaches (Dias, 2003). Purely syntactic processing of MWEs requires specific linguistic knowledge across the different domains of a language, such as a semantic ontology (Piao et al., 2003). Purely statistical processing overgenerates MWE candidates (Dias, 2002) and is not sensitive enough to MWE candidates with low frequencies (Piao et al., 2003). In some cases it is practical for a hybrid syntactic-statistical system to pre-define a set of MWE pattern rules and then use statistical techniques to filter proper candidates. But this lacks the flexibility to obtain comprehensive coverage of possible MWE candidates, especially when a MWV is non-contiguous, as in our case. In addition, it also suffers from overgeneration if a pre-defined syntactic pattern occurs rarely in the corpus. Sag et al. (2002) indicated that it is necessary to strike a balance between the two methods in hybrid systems. 
This point of view is taken into account in our approach.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Extraction of Contiguous MWV Candidates </SectionTitle> <Paragraph position="0"> A number of works on MWV extraction from corpora are based on the output of a POS tagger and a chunker (Baldwin and Villavicencio, 2002; Bannard et al., 2003), or on the output of a parser (McCarthy et al., 2003). These works mainly extracted verb+particle structures. Similar to those works, the MWV extraction in our experiment is also based on the chunking output. But, since MWVs have various POS tag patterns, it is not practical to assign each pattern a corresponding syntactic rule. Therefore a variant of a finite-state automaton is adopted in our approach for the extraction of MWVs. Let G denote this automaton.</Paragraph> <Paragraph position="2"> tag sequence and lexical sequence;</Paragraph> <Paragraph position="4"> Controlling functions in F define operations for the output. Controlling functions in G define state transitions of the automaton with respect to the features from both POS tags and lexical entries of an input chunk. An example is given as follows.</Paragraph> <Paragraph position="5"> * Example sentence: The 3'NF-E2/AP1 motif is able to exert both positive and negative regulatory effects on the zeta 2-globin promoter activity in K562 cells.</Paragraph> <Paragraph position="6"> where ADJP is an adjective phrase; ENP is a singular English noun phrase; ENPS is a plural English noun phrase; EVP is a singular English verb phrase; IN is a preposition; IVP is an infinitive verb phrase; and SEPR is a sentence separator.</Paragraph> <Paragraph position="7"> * Extraction of Contiguous MWV Candidates: In the following table, the Input items are the combination of both lexical sequences and the corresponding chunk tags, but only chunk tags are presented in the table. 
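The chunk-driven extraction that the table illustrates can be sketched as a small state machine over (chunk tag, text) pairs. This is a simplified sketch, not the paper's full automaton G: the tag sets and transition rules below are illustrative assumptions.

```python
# Simplified sketch of finite-state extraction of contiguous MWV
# candidates: a candidate starts at a verb-phrase chunk (EVP/IVP) and
# extends over adjective phrases and prepositions. The transition table
# is an illustrative assumption, not the paper's full automaton G.

START, IN_MWV = 0, 1

VERB_TAGS = {"EVP", "IVP"}   # chunk tags assumed to start a candidate
CONT_TAGS = {"ADJP", "IN"}   # chunk tags assumed to continue one

def extract_contiguous(chunks):
    """chunks: list of (chunk tag, text) pairs; returns candidate strings."""
    state, buf, out = START, [], []
    for tag, text in chunks:
        if state == IN_MWV and tag in CONT_TAGS:
            buf.append(text)
            continue
        if len(buf) > 1:              # a bare verb alone is not a MWV
            out.append(" ".join(buf))
        state, buf = START, []
        if tag in VERB_TAGS:          # a verb chunk may start a new candidate
            state, buf = IN_MWV, [text]
    if len(buf) > 1:
        out.append(" ".join(buf))
    return out

# Chunking of the example sentence (abbreviated)
chunks = [("ENP", "The 3'NF-E2/AP1 motif"), ("EVP", "is"), ("ADJP", "able"),
          ("IN", "to"), ("IVP", "exert"), ("ENPS", "regulatory effects"),
          ("SEPR", ".")]
print(extract_contiguous(chunks))     # → ['is able to']
```

Lemmatizing the verb chunk (is → be) then yields the base-form candidate be able to; the real automaton additionally consults lexical features, which this sketch omits.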
The output operation φ means no operation. For this example, it returns be able to as a MWV candidate.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Extraction of Non-contiguous MWV Candidates </SectionTitle> <Paragraph position="0"> When a set of new controlling functions is given, the finite automaton mentioned above also extracts non-contiguous MWV candidates. We primarily focus on non-contiguous MWVs in the form of verb + particle. As the particles in verb + particle MWVs are often intransitive (Baldwin and Villavicencio, 2002; McCarthy et al., 2003), unlike the transitive prepositions followed by a noun chunk, we use this feature and a nearest assumption to extract non-contiguous MWV candidates. In general, we assume that a non-contiguous MWV occurs in a limited context window.5 Because of the specific test corpus in our experiment, the non-contiguous MWV candidates extracted are a relatively small subset of all the candidates.6 Most of them are not proper candidates. We suppose that the genre of scientific abstracts is an important reason for this: there are many more specific nominal terms as well as specific verbs (not MWVs) in scientific abstracts than in everyday language.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Solutions to Overgeneration of MWV Candidates </SectionTitle> <Paragraph position="0"> It is not surprising that the finite automaton is also sensitive to low-frequency MWVs such as &quot;take place&quot; (7 times in the test corpus) and &quot;shed light on&quot; (4 times). But several problems of overgeneration7 are still found, which include: Case 1. Example: [take place]1.1, [take place at]1.2, [take place in]1.3. 
In general, we assume that short structures are more reliable, especially when the occurrences of the short structures are much more frequent than those of the long structures.</Paragraph> <Paragraph position="1"> But in this example, all three phrases occur with the same frequency in the test corpus; we still choose the more reliable short structure and add up all occurrences of these structures.</Paragraph> <Paragraph position="2"> Case 2. Example: [be able to]2.1, [be important for]2.2. The structure [2.1] is a MWV, but the structure [2.2], which has the same POS tag sequence as [2.1], is actually not a MWV accepted by a lexicon. In this case, the verb head is usually one of the most frequent verbs, such as be, take, and go. In a previous experiment, we computed the log-likelihood ratio of the two mutual hypotheses8 for the contiguous MWV candidates extracted from the test corpus, in order to assess the reliability of such collocations. But we got some unexpected results: for example, be important (in be important for) was rated a more reliable structure than shed light (in shed light on). This indicates that the score is still not sensitive enough to extremely sparse samples. In addition, this method is not suitable for non-contiguous MWV candidates.</Paragraph> <Paragraph position="4"> (Footnote 8: The log-likelihood ratio -2 log λ = -2 log(L(H1)/L(H2)) is more appropriate than the χ2 test, since some MWVs are quite sparse in our test corpus (Manning and Schütze, 2002).)</Paragraph> <Paragraph position="6"> In our experiment, we suppose that it is neither the verb head nor the preposition that determines the reliability of such MWV structures. Therefore we only focus on the distribution of the remaining words (able or important) in the corpus. Such words, together with the verb head in a MWV pattern like verb + particle, are given the name MWV head in the remainder of this paper. 
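The MWV-head distribution criterion can be sketched as a simple ratio: what share of a word's corpus occurrences falls inside a given candidate. The counts below are invented toy figures, not the paper's corpus statistics.

```python
# Sketch of the MWV-head distribution test: a word is a good MWV head
# if most of its corpus occurrences appear inside the candidate pattern.
# All counts are invented toy numbers for illustration.

def head_share(candidate_count, head_count):
    """Fraction of the head word's occurrences inside the candidate."""
    return candidate_count / head_count if head_count else 0.0

# toy frequencies: f("be able to") = 83 of f("able") = 100 -> share 0.83,
# versus f("be important for") = 42 of f("important") = 500 -> share 0.084
share_able = head_share(83, 100)
share_important = head_share(42, 500)

# rank candidates by head share instead of raw phrase frequency
ranked = sorted([("be able to", share_able),
                 ("be important for", share_important)],
                key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])   # → be able to
```

Ranking by head share rather than raw frequency is what lets a sparse candidate outrank a frequent but non-idiomatic one.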
For instance, we find that 83% of the occurrences of able are in the MWV candidate structure be able to, but only 8.4% of the occurrences of important are in be important for. Hence the structure [2.1] is a much better MWV candidate than [2.2]. By this means, the low-frequency candidate shed light on can also get a better rank than the relatively high-frequency candidate be important for.</Paragraph> <Paragraph position="7"> Case 3. Example: [take place]3.1, [bind DNA]3.2.</Paragraph> <Paragraph position="8"> [3.1] is a MWV, but [3.2], which also has the same POS tag sequence as [3.1], is not a MWV. In our case, a set of domain-specific terms is available from the NE-annotated GENIA corpus V3.01.</Paragraph> <Paragraph position="9"> Since we suppose that MWVs contain only general words, a word like DNA can be found in the specific word list, and this structure can then be excluded from the list of MWV candidates.</Paragraph> <Paragraph position="10"> However, this method also induces problems. For example, give rise to is a MWV, but rise is also in the specific word list of this corpus. In this case, the specific word list could be filtered according to some criteria (e.g., frequencies in the list of specific terms), so that a much more comprehensive list of MWV candidates can be produced without losing generality.</Paragraph> <Paragraph position="11"> Case 4. Example: [be able to]4.1, [be unaffected]4.2. The POS tag pattern of [4.2] is a substring of [4.1], i.e., EVP + ADJP, but [4.2] is obviously not a proper MWV candidate. We assume that a proper MWV should have closed left and right boundaries; that is, the left boundary of a MWV candidate should be a verb, and the right boundary should be a preposition (including to) or a noun. Therefore patterns with open right boundaries like [4.2] are deleted from the candidate list.</Paragraph> <Paragraph position="12"> Case 5. 
Example: [be associated with]5.1 and [associate(d) with]5.2, [be used to]5.3 and [used to]5.4. The pair [5.1] and [5.2] show no semantic difference between past and present tense, and no semantic shift of the MWV itself between passive and active voice; that is, they can all be mapped to the MWV base form associate with. But there are semantic differences between the pair [5.3] and [5.4]. The past tense phrase used to is a fixed idiomatic verb phrase (e.g., He used to smoke a pipe.); like the phrase according to, it generally does not occur in other tenses. In some cases there is no semantic relationship between [5.3] and [5.4], although the base forms of both structures are the same, i.e., use to. In this experiment, we do not consider the latter case. All MWV candidates are mapped to their base form, but the information about passive and active voice is preserved, so that some candidates in passive form (e.g., be inhibited by) can be excluded.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.5 Evaluation of the Reliability of the MWV Candidates </SectionTitle> <Paragraph position="0"> After the above processing of the set of MWV candidates extracted by the finite automaton, the next task is to examine the reliability of the candidates, especially for those that share the same MWV head. To solve this problem, statistical measurement is necessary. First, the frequencies of the MWV candidates in the test corpus are taken into account. For instance, result in is the most frequent MWV candidate, with more than 320 occurrences; it is obviously a proper MWV candidate. 
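The tense/voice normalization described in Case 5 can be sketched as follows. The lemma table and be-form list are tiny hand-made stand-ins for a real morphological lexicon, and the passive test is a deliberately crude assumption.

```python
# Sketch of the Case-5 normalization: inflected candidate forms are
# mapped to a base form, while passive voice is flagged so that
# passive-only candidates (e.g. "be inhibited by") can be excluded later.
# LEMMA and BE_FORMS are tiny hand-made stand-ins for a real lexicon.

BE_FORMS = {"be", "is", "are", "was", "were"}
LEMMA = {"associated": "associate", "used": "use", "inhibited": "inhibit"}

def base_form(tokens):
    """Map candidate tokens to a base form; report a crude passive flag."""
    passive = (tokens[0] in BE_FORMS and len(tokens) > 1
               and tokens[1].endswith("ed"))
    lemmas = [LEMMA.get(t, t) for t in tokens if t not in BE_FORMS]
    return " ".join(lemmas), passive

form1, passive1 = base_form(["is", "associated", "with"])
form2, passive2 = base_form(["associated", "with"])
print(form1, passive1)   # both variants map to 'associate with'
```

Merging the two variants under one base form is what lets their occurrence counts be added up, while the preserved voice flag supports the later exclusion of passive-only candidates.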
From Figure 1, we can see that a large number of MWV candidates occur with relatively low frequencies, ranging from about 1 to 10.</Paragraph> <Paragraph position="1"> In order to avoid accidental errors during the process (mainly wrong assignments of POS tags), the MWV candidates with the lowest frequencies, from 1 to 4, are excluded from consideration. Second, the distribution of the MWV head in the MWV candidates is considered. We assume that a verb head of a certain MWV has the inertia (a high probability) to construct other MWVs rather than to stand in isolation. For instance, 89% of the occurrences of the verb head result are in result in, and only 8.5% belong to result from.</Paragraph> <Paragraph position="2"> Although result from is not a high-frequency MWV candidate, it is still a proper one. Third, contiguous and non-contiguous MWV candidates are treated as the same structure, so that such structures are not ignored by the statistical measurement. That is, if a MWV candidate occurs in both contiguous and non-contiguous forms, their occurrences are added up. According to our experimental results, non-contiguous MWV candidates occur much less often than contiguous candidates, which leads to a very small number of non-contiguous MWVs successfully extracted from our test corpus.</Paragraph> <Paragraph position="3"> To evaluate the reliability of a certain MWV candidate c in the candidate set C, the following definitions are given.</Paragraph> <Paragraph position="4"> * head(c), the MWV head of c, c ∈ C; * f(x), the frequency of x, where x can be c or head(c); * F(c), the sum of occurrences of all candidates in C that share the same MWV head with c; * E(c), the evaluation score of c,</Paragraph> <Paragraph position="6"> where c1, c2 and c3 (c1, c2, c3 ≥ 0) are coefficients; * t, the threshold of the score evaluation,</Paragraph> <Paragraph position="8"> t = a × min E(c) + b, (2) where c ∈ C, a ≥ 1, b ≥ 0.</Paragraph> <Paragraph position="9"> If E(c) ≥ t, c is a proper candidate. 
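Since the published form of equation 1 is not reproduced in this text, the sketch below assumes E(c) is a linear combination of the three factors just described (raw candidate frequency, its proportion of the head word, and the head's inertia to form MWVs). Both this functional form and all counts are illustrative assumptions.

```python
# Sketch of the reliability scoring. The linear form of E(c) is an
# ASSUMPTION consistent with the three factors described in the text:
#   E(c) = c1*f(c) + c2*f(c)/f(head(c)) + c3*F(c)/f(head(c))
# with threshold t = a * min E(c) + b (equation 2). Counts are toy values.

def score(f_c, f_head, F_c, c1=0.003, c2=0.5, c3=10.0):
    return c1 * f_c + c2 * f_c / f_head + c3 * F_c / f_head

# toy candidate table: (name, f(c), f(head(c)), F(c))
candidates = [("result in", 320, 360, 350),
              ("result from", 30, 360, 350),
              ("be important for", 42, 5000, 60)]
scores = {name: score(f, fh, F) for name, f, fh, F in candidates}

# baseline threshold: a = 1, b = 0 over an assumed proper subset CP
cp = ["result in", "result from"]
t = 1.0 * min(scores[name] for name in cp) + 0.0
proper = [name for name in scores if scores[name] >= t]
print(proper)   # → ['result in', 'result from']
```

With the inertia term dominating (c3 = 10), the sparse result from survives the threshold while the frequent but non-idiomatic be important for does not, mirroring the behavior described in the text.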
The flowchart of the process to filter proper MWV candidates is shown in Figure 2.</Paragraph> <Paragraph position="10"> In order to obtain satisfactory values of the coefficients and the threshold, a manual sample is created, so that the values of the coefficients can be tuned. It is not feasible to give all extracted MWV candidates a human evaluation; therefore we chose the most frequent 33 candidates (f(c) ≥ 60), 31 candidates with moderate frequencies (14 ≤ f(c) ≤ 19), and 95 candidates with low frequencies (6 ≤ f(c) ≤ 7) as a manual sample set (CM, |CM| = 159). As a baseline, candidates are ranked according to equation 1 with the threshold set to t = 1 × min E(c) + 0, c ∈ CP. The MWV candidates in the manual sample set are looked up in a dictionary.9 If there is such a MWV entry in the dictionary, we assign a proper flag to the candidate.</Paragraph> <Paragraph position="11"> From the manual test sample set CM, we annotated 42 items as proper MWVs (CP, |CP| = 42). In the experiment, we set the coefficient c1 to be the reciprocal of the largest occurrence count of the MWV candidates (c1 = 1/max f(c), c ∈ C), and t is set to be a linear function of the smallest score of the MWVs in CP under the reliability evaluation, e.g., t = min E(c), c ∈ CP. We use the scores of recall (R), precision (P), and F-measure (Fβ=1) to evaluate our result. In the following equations, X denotes the set of candidates in CP whose reliability evaluation scores are greater than or equal to t, i.e., X = {c | c ∈ CP and E(c) ≥ t}; Y denotes the set of candidates in CM whose reliability evaluation scores are greater than or equal to t, i.e., Y = {c | c ∈ CM and E(c) ≥ 
t}.</Paragraph> <Paragraph position="13"/> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Result, Discussion, and Future Work </SectionTitle> <Paragraph position="0"> The result in Table 2 indicates that it is neither the frequency of occurrences of a MWV candidate (c1 = 0.003) nor the proportion of a MWV candidate to its head word (c2 = 0.5), especially the verb head, but the inertia of a verb to construct MWVs that determines a proper MWV candidate (c3 = 10).</Paragraph> <Paragraph position="1"> The result strongly supports this assumption.</Paragraph> <Paragraph position="2"> We also found that the initialization of the value of t was very important. In Table 2, the minimum value of E(c) in CP was set as the baseline for all test data. (Footnote 9: Since WordNet is lacking in MWV entries, we used the Oxford Advanced Learner's Dictionary of Current English (Encyclopedic version, 1992), and additionally the online English-German dictionary LEO, available at http://dict.leo.org/.) But we found that if the value of t was properly increased (according to equation 2), although the precision was thereby reduced, the F-measure was improved remarkably. Figure 3 shows how changes of the value t affect the result, given the same values of the coefficients in equation 1: c1 = 0.003, c2 = 0.5, and c3 = 10. We got a much better F-measure when we set a = 2.3, b = 0.1 (or b = 0.2), yielding Fβ=1 = 0.753, compared to the data in Table 2, where a = 1, b = 0, and Fβ=1 = 0.627.</Paragraph> <Paragraph position="3"> The reason is that some MWV candidates in CP, like use to and carry out, have MWV heads that seem not to follow our assumption. Such verbs (use, carry, including be) are among the most frequent verbs in both specific and general English. Thus the syntactic and semantic combinations of such verbs with other words are quite rich, which led to relatively low scores of E(c) in our experiment. 
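The evaluation scores defined at the end of Section 3.5 can be computed as follows; the candidate scores and set memberships are invented toy values, not the paper's data.

```python
# Sketch of the evaluation: X = proper candidates (in CP) above threshold,
# Y = all manually sampled candidates (in CM) above threshold, and
#   P = |X| / |Y|,  R = |X| / |CP|,  F(beta=1) = 2PR / (P + R).
# Scores and sets below are invented toy values.

def evaluate(scores, cp, cm, t):
    X = {c for c in cp if scores[c] >= t}
    Y = {c for c in cm if scores[c] >= t}
    P = len(X) / len(Y) if Y else 0.0
    R = len(X) / len(cp) if cp else 0.0
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F

scores = {"result in": 11.1, "shed light on": 9.9,
          "be important for": 0.3, "bind DNA": 0.2}
cp = ["result in", "shed light on"]   # candidates annotated as proper MWVs
cm = list(scores)                     # the full manual sample
P, R, F = evaluate(scores, cp, cm, t=0.25)
print(P, R, F)
```

Raising t here would drop be important for out of Y and raise precision at the cost of recall, which is the trade-off the threshold tuning in this section explores.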
Compared to other recent related work, Baldwin and Villavicencio (2002) presented an F-measure of 0.896 by testing on the WSJ.</Paragraph> <Paragraph position="4"> But they focused on the single prepositional particle situation, whereas our approach has a special interest in multiple and non-preposition particle cases.</Paragraph> <Paragraph position="5"> Moreover, they used quite a number of syntactic techniques for more precise extraction of verb-particle constructions (not verb-preposition constructions), which is not the case in ours.</Paragraph> <Paragraph position="6"> Figure 3: Fβ=1, given c1 = 0.003, c2 = 0.5, and c3 = 10.</Paragraph> <Paragraph position="7"> In addition, several other aspects also had negative effects on the result. First, the sublanguage is in any case specific compared with the general language, so some MWV candidates were hard to evaluate. For instance, transfect into/with can be found in neither of the dictionaries we used in this experiment, so it is hard to give them a human evaluation. Second, POS tag errors during processing also had a negative effect. E.g., in the MWV candidate be related to, related was POS-tagged as an adjective, which led to a reduction of the value of E(relate to), since the MWV head of this inflectional structure was set to the adjective related rather than the root of the verb relate. Third, the language resources used in our experiment sometimes did not provide the information we needed.</Paragraph> <Paragraph position="8"> For instance, WordNet was lacking some specific lexical entries of verbs such as synergize and pretreat. 
Hence the distribution of their inflectional and derivational forms, such as synergizes and pretreated, could not be analyzed correctly.</Paragraph> <Paragraph position="9"> Our next step is to combine this work with the domain-specific single verbs determined in the corpus (Xiao and Rösner, 2004a), in order to obtain a comprehensive understanding of domain-specific verbs. We will also investigate whether more domain-specific resources (e.g., the UMLS10 SPECIALIST lexicon), as well as the adaptation of general language resources (e.g., WordNet) to this specific domain, can improve the evaluation in equation 1. Another piece of future work is to examine the distribution of the inflectional and derivational forms of MWVs for both MWV candidate evaluation and other IE tasks.</Paragraph> </Section> </Paper>