<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1124"> <Title>Detecting Multiword Verbs in the English Sublanguage of MEDLINE Abstracts</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> During the construction of an information extraction (IE) system in the biomedical domain, we found that not only named entity recognition (NER) but also the appropriate handling of verbs plays an important role. Determining the domain-specific verbs is very helpful when extracting useful information, because domain-specific verbs construct semantic relations between named entities (NEs). However, three problems in the handling of verbs in a specific domain are still open: The first problem is how to determine domain-specific verbs. This problem has not yet received much attention from researchers. Domain-specific verbs have been mentioned quite often in biomedical text processing (Thomas et al., 2000; Ono et al., 2001; Xiao and Rösner, 2004b), but these usually refer to sets of manually or empirically selected verbs. Spasić et al. (2003) briefly presented a method to find domain-specific verbs by filtering the verbs in a stoplist while using the co-occurrence of a verb with specific terms in the text. In our experiment, the domain-specific verbs are determined through a comparison between corpora from different domains, or through genre analysis of the sublanguage-dominated corpus.</Paragraph> <Paragraph position="1"> The second problem is how to determine multiword verbs (MWVs). Here we do not distinguish between the more detailed classes of multiword verbs, especially verb-particle constructions and verb-preposition constructions (Baldwin and Villavicencio, 2002).
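The corpus-comparison idea for identifying domain-specific verbs mentioned above can be illustrated with a minimal sketch. This is not the authors' actual procedure: the function name, the add-one smoothing, and the count and ratio thresholds are our own assumptions; the input is assumed to be lists of already lemmatized verbs from a domain corpus and a general reference corpus.

```python
# Minimal sketch: rank verbs whose relative frequency in a domain
# corpus greatly exceeds their relative frequency in a reference
# corpus. A likelihood-ratio test would be a more principled score;
# a plain frequency ratio keeps the idea visible.
from collections import Counter

def domain_specific_verbs(domain_verbs, general_verbs,
                          min_count=5, min_ratio=3.0):
    dom = Counter(domain_verbs)
    gen = Counter(general_verbs)
    n_dom = sum(dom.values())
    n_gen = sum(gen.values())
    scored = []
    for verb, count in dom.items():
        if count < min_count:
            continue  # ignore rare verbs: their ratios are unreliable
        p_dom = count / n_dom
        # Add-one smoothing so verbs unseen in the reference corpus
        # do not cause a division by zero.
        p_gen = (gen[verb] + 1) / (n_gen + len(dom))
        ratio = p_dom / p_gen
        if ratio >= min_ratio:
            scored.append((verb, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

On such a comparison, common verbs like be or have score low (similar relative frequency in both corpora), while verbs like inhibit, which are frequent in MEDLINE abstracts but rare in general text, rise to the top.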
As a subcategory of multiword expressions (Sag et al., 2002), MWVs raise the complexity of our processing. Since some MWVs share the same verb head but lead to different semantic interpretations, like result in and result from, considering only verb heads is obviously not sufficient.</Paragraph> <Paragraph position="2"> A good IE system should deal with such MWVs automatically and appropriately.</Paragraph> <Paragraph position="3"> The third problem is the need to investigate the inflectional and derivational forms of the verbs. An IE system may have to deal with a set of patterns in which the inflectional and derivational forms of the verbs should be taken into account. For example, in biomedical texts, the verb interact defines a binary relation between two substances, whereas its nominalization in a pattern such as the interaction of ... with ... also constructs such a relation. Note that such patterns often have a close relationship with a common verb lemma, which is often a MWV. For instance, the above pattern can be mapped to the MWV interact with.</Paragraph> <Paragraph position="4"> Table 1 shows the distribution of all inflectional and derivational forms of the verb inhibit in a corpus of 800 MEDLINE abstracts extracted from the GENIA corpus. This verb is a very important domain-specific verb in the biomedical domain. Dealing with these inflectional and derivational forms appropriately will improve the performance of the IE system.</Paragraph> <Paragraph position="5"> Table 1: Patterns containing the verb stem inhibit and their occurrences in a test corpus of 800 MEDLINE abstracts: inhibitor(s) 161 (a/the ... inhibitor ...); inhibition 167 (... inhibition of ...); inhibitory 61 (... inhibitory effect/factor ...); inhibiting 24 (... in inhibiting ...); inhibited 119 (... inhibited ...); be inhibited 73 (... be inhibited by ...); inhibit 63 (... inhibit ...); inhibits 57 (... inhibits ...).</Paragraph> <Paragraph position="13"> The following text focuses on the second problem above. Section 2 introduces the set of language processing tools used in the experiment. A detailed description of the approach for the extraction of proper MWVs is presented in section 3. The evaluation of the results and the aspects that influence them, as well as our future work, are discussed in section 4. Finally, in the appendix, we list a number of MWVs that have been extracted by our approach. 2 Tokeniser, POS Tagger and Chunker Our experiment in this paper is carried out mainly on chunk sequences; therefore, the following processing components are necessary: * Tokeniser: Following the whitespace-delimited tokenisation discipline, the tokeniser determines the segmentation of non-lexical entries such as tokens with non-alphabetic characters or abbreviations. After tokenisation, the sentence boundaries are determined as well.</Paragraph> <Paragraph position="14"> * POS tagger: The maximum entropy POS tagger developed by Ratnaparkhi (Ratnaparkhi, 1996) and the rule-based POS tagger developed by Brill (Brill, 1994) are trained on 1200 abstracts extracted from the GENIA corpus, and achieve accuracies of 97.97% and 98.06%, respectively, when tested on the remaining 800 abstracts of the GENIA corpus.
Since our test corpus is directly extracted from the POS-tagged GENIA Corpus V3.0p, we do not have to apply tokenisation and POS tagging ourselves.</Paragraph> <Paragraph position="15"> * Chunker: In this experiment, unlike the traditional statistical method for collocation extraction, where sentences are treated as word sequences (Manning and Schütze, 2002), a shallow chunking process is first carried out.</Paragraph> <Paragraph position="16"> Then, sentences in our test corpus are treated as chunk sequences.</Paragraph> <Paragraph position="17"> At present, the chunker consists of two parts, both of which utilize WordNet 1.7.1 (Fellbaum, 1999) as the lexical resource for lemmatization, i.e., as the verb and noun stemmer.</Paragraph> <Paragraph position="18"> - Verb chunker, which extracts the smallest verb chunks (not including the MWV structures) together with additional syntactic information such as number (3rd person singular present), voice (active/passive), and negation. Since most scientific abstracts are written in the present or past tense, temporal information is not extracted separately. The verb chunker returns the common verb lemma of a verb, together with the additional syntactic information mentioned above. For example, given the input verb chunk has not been established, it returns [establish, singular, passive, negation].</Paragraph> <Paragraph position="19"> - Noun chunker, which determines the noun chunk boundaries, negation, number (singular/plural), as well as some inner dependencies of noun chunks containing substructure(s). For example, a noun chunk like [ [the retinoic acid-synthesizing enzyme] [aldehyde dehydrogenase 1] ] is actually an apposition structure. In this experiment, the singular stem of a plural noun token is not returned, in order to avoid losing necessary information.
For example, although both take place and take places can be mapped to the same base structure take place, they must be treated separately.</Paragraph> </Section> </Paper>