<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1706"> <Title>XML-Based NLP Tools for Analysing and Annotating Medical Language</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In this paper we describe our use of XML for an analysis of medical language which involves a number of complex linguistic processing stages. The ultimate aim of the project is to to acquire lexical semantic information from MEDLINE through parsing, however, a fundamental tenet of our approach is that higher-level NLP activities benefit hugely from being based on a reliable and well-considerered initial stage of tokenisation. This is particularly true for language tasks in the biomedical and other technical domains since general purpose NLP technology may stumble at the first hurdle when confronted with character strings that represent specialised technical vocabulary. Once firm foundations are laid then one can achieve better performance from e.g. chunkers and parsers than might otherwise be the case.</Paragraph> <Paragraph position="1"> We show how well-founded tools, especially XML-based ones, can enable a variety of NLP components to be bundled together in different ways to achieve different types of analysis. Note that in fields such as information extraction (IE)itiscommontouse statistical text classification methods for data analysis. Our more linguistic approach may be of assistence in IE: see Craven and Kumlien (1999) for discussion of methods for IE from MEDLINE.</Paragraph> <Paragraph position="2"> Our processing paradigm is XML-based. As a mark-up language for NLP tasks, XML is expressive and flexible yet constrainable. Furthermore, there exist a wide range of XML-based tools for NLP applications which lend themselves to a modular, pipelined approach to processing whereby linguistic knowledge is computed and added as XML annotations in an incremental fashion. In processing MEDLINE abstracts we have built a number of such pipelines using as key components the programs distributed with the LT TTT and LT XML toolsets (Grover et al., 2000; Thompson et al., 1997). We have also successfully integrated non-XML public-domain tools into our pipelines and incorporated their output into the XML mark-up using the LT XML program xmlperl (McKelvie, 2000).</Paragraph> <Paragraph position="3"> In Section 2 we describe our use of XML-based tokenisation tools and techniques and in Sections 3 and 4 we describe two different approaches to analysing MEDLINE data which are built on top of the tokenisation. The first approach uses a hand-coded grammar to give complete syntactic and semantic analyses of sentences. The second approach performs a shallower statistically-based analysis which yields 'grammatical relations' rather than full logical forms. This information about grammatical relations is used in a statistically-trained model which disambiguates the semantic relations in noun compounds headed by deverbal nominalisations. For this second approach we compare two separate methods of shallow analysis which require the use of two different part-of-speech taggers.</Paragraph> <Paragraph position="4"> al., 1994) which contains 348,566 references taken from the years 1987-1991. Not every reference contains an abstract, thus the total number of abstracts in the corpus is 233,443. 
<Paragraph position="4"> The corpus we use is OHSUMED (Hersh et al., 1994), which contains 348,566 references taken from the years 1987-1991. Not every reference contains an abstract, so the total number of abstracts in the corpus is 233,443. The total number of words in those abstracts is 38,708,745 and the abstracts contain approximately 1,691,383 sentences with an average length of 22.89 words.</Paragraph> <Paragraph position="5"> By pre-parsing we mean the identification of word tokens and sentence boundaries, together with other lower-level processing tasks such as part-of-speech (POS) tagging and lemmatisation. These initial stages of processing form the foundation of our NLP work with MEDLINE abstracts, and our methods are flexible enough that the representation of pre-parsing can be easily tailored to suit the input needs of subsequent higher-level processors. We start by converting the OHSUMED corpus from its original format to an XML format (see Figure 1). From this point on we pass the data through pipelines which are composed of calls to a variety of XML-based tools from the LT TTT and LT XML toolsets. The core program in our pipelines is the LT TTT program fsgmatch, a general purpose transducer which processes an input stream and rewrites it using rules provided in a hand-written grammar file, where the rewrite usually takes the form of the addition of XML mark-up. Typically, fsgmatch rules specify patterns over sequences of XML elements and use a regular expression language to identify patterns inside the character strings (PCDATA) which are the content of elements. For example, a rule for decimals such as &quot;.25&quot; searches for a sequence of two S elements where the first contains the string &quot;.&quot; as its PCDATA content and the second has been identified as a cardinal number (C='CD', e.g. any sequence of digits). When these two S elements are found, they are wrapped in a W element with the attribute C='CD'. (Here S elements encode character sequences and W elements encode words; see below.) We use a two-stage process to identify word tokens within abstracts. First, sequences of characters are bundled into S (sequence) elements using fsgmatch. For each class of character a sequence of one or more instances is identified and the type is recorded as the value of the attribute C (UCA=upper case alphabetic, LCA=lower case alphabetic, WS=white space etc.).</Paragraph> <Paragraph position="6"> Figure 2 shows the string &quot;Arterial PaO2 as measured&quot; marked up for S elements (line breaks added for formatting purposes). Every character, including white space and newlines, is contained in an S element; these elements become the building blocks for the next call to fsgmatch, in which words are identified. An alternative approach would find words in a single step, but our two-step method provides a cleaner set of word-level rules which are more easily modified and tailored to different purposes: modifiability is critical since the definition of what is a word can differ from one subsequent processing step to another.</Paragraph>
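To make the two-stage tokenisation concrete, here is a minimal Python sketch of the first stage under simplified assumptions: runs of same-class characters are bundled into (class, text) pairs, the analogue of the S elements described above. The class inventory here is deliberately coarser than the one our fsgmatch grammar distinguishes.

    # Stage 1 of a two-stage tokeniser (schematic): bundle runs of
    # same-class characters into S-like (class, text) sequences.
    import re

    CLASSES = [
        ("UCA", r"[A-Z]+"),          # upper case alphabetic
        ("LCA", r"[a-z]+"),          # lower case alphabetic
        ("CD",  r"[0-9]+"),          # digits (cardinal numbers)
        ("WS",  r"\s+"),             # white space
        ("SYM", r"[^A-Za-z0-9\s]"),  # any other character, one at a time
    ]
    SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in CLASSES))

    def char_sequences(text):
        """Yield one (class, text) pair per run of same-class characters."""
        for m in SCANNER.finditer(text):
            yield m.lastgroup, m.group()

    # The example string from Figure 2:
    print(list(char_sequences("Arterial PaO2 as measured")))
    # [('UCA', 'A'), ('LCA', 'rterial'), ('WS', ' '), ('UCA', 'P'),
    #  ('LCA', 'a'), ('UCA', 'O'), ('CD', '2'), ('WS', ' '), ...]

A second stage then groups these sequences into words with rules like the decimal rule above; it is at that stage that the definition of 'word' can be tailored to a particular downstream processor.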
<Paragraph position="7"> A pipeline which first identifies words and then performs sentence boundary identification and POS tagging followed by lemmatisation is shown in Figure 3 (somewhat simplified, with numbering added for ease of exposition). The Perl program in step 1 wraps the input inside an XML header and footer as a first step towards conversion to XML. Step 2 calls fsgmatch with the grammar file ohsumed.gr to identify the fields of an OHSUMED entry and convert them into XML mark-up: each abstract is put inside a RECORD element which contains sub-structure reflecting e.g. author, title, MESH code and the abstract itself. From this point on, all processing is directed at the ABSTRACT elements through the query &quot;.*/ABSTRACT&quot;. Steps 3 and 4 make calls to fsgmatch to identify S and W (word) elements as described above, and after this point, in step 5, the S mark-up is discarded (using the LT TTT program sgdelmarkup) since it has now served its purpose.</Paragraph> <Paragraph position="8"> Step 6 contains a call to the other main LT TTT program, ltpos (Mikheev, 1997), which performs both sentence identification and POS tagging. The subquery (-qs) option picks out ABSTRACTs as the elements within RECORDs (-q option) that are to be processed; the -qw option indicates that the input has already been segmented into words marked up as W elements; the -sent option indicates that sentences should be wrapped as SENT elements; the -tag option is an instruction to output POS tags; and the -pos_attr option indicates that POS tags should be encoded as the value of the attribute P on W elements. The final argument, resource.xml, names the resource file that ltpos is to use. (The query language that the LT TTT and LT XML tools use is a specialised XML query language which pinpoints the part of the XML tree-structure that is to be processed at that point. It pre-dates XPath and in expressiveness constitutes a subset of XPath, except that it also allows regular expressions over text content. Future plans include modifying our tools to allow the use of XPath as a query language.) Note that the tagset used by ltpos is the Penn Treebank tagset (Marcus et al., 1994).</Paragraph>
1. ohs2xml.perl \
2. | fsgmatch -q &quot;.*/TEXT&quot; ohsumed.gr \
3. | fsgmatch -q &quot;.*/ABSTRACT&quot; pretok.gr \
4. | fsgmatch &quot;.*/ABSTRACT&quot; tok.gr \
5. | sgdelmarkup -q &quot;.*/S&quot; \
6. | ltpos -q &quot;.*/RECORD&quot; -qs &quot;.*/ABSTRACT&quot; \
       -qw &quot;.*/W&quot; -sent SENT \
       -tag -pos_attr P resource.xml \
7. | xmlperl lemma.rule
<Paragraph position="9"> Up to this point, each module in the pipeline has used one of the LT TTT or LT XML programs, which are sensitive to XML structure. There are, however, a large number of tools available from the NLP community which could profitably be used but which are not XML-aware. We have integrated some of these tools into our pipelines using the LT XML program xmlperl, a program which makes underlying use of an XML parser so that rules defined in a rule file can be directed at particular parts of the XML tree-structure. The actions in the rules are defined using the full capabilities of Perl. This gives the potential for a much wider range of transformations of the input than fsgmatch allows; in particular, we use Perl's stream-handling capabilities to pass the content of XML elements out to a non-XML program, receive the result back, and encode it in the XML mark-up. Step 7 of the pipeline in Figure 3 shows a call to xmlperl with the rule file lemma.rule. This rule file invokes Minnen et al.'s (2000) morpha lemmatiser: the PCDATA content of each verbal or nominal W element is passed to the lemmatiser, and the lemma that is returned is encoded as the value of the attribute LM. A sample of the output from the pipeline is shown in Figure 1.</Paragraph>
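The pattern of shipping PCDATA out to a non-XML tool and folding the result back into the mark-up can be sketched in a few lines of Python. This is an illustration of the idea only, not our xmlperl rule file: the lemmatise command is a hypothetical stand-in for a real lemmatiser such as morpha, and its one-word-per-line input and one-lemma-per-line output conventions are assumptions.

    # Sketch: hand the text of nominal/verbal W elements to an external,
    # non-XML-aware lemmatiser and record its answers as LM attributes.
    import subprocess
    import sys
    import xml.etree.ElementTree as ET

    tree = ET.parse(sys.stdin)
    # Select nominal and verbal words via the POS attribute P
    # (Penn Treebank noun tags start with N, verb tags with V).
    words = [w for w in tree.iter("W")
             if w.text and w.get("P", "").startswith(("N", "V"))]

    # 'lemmatise' is hypothetical: one word per input line is assumed
    # to yield one lemma per output line.
    proc = subprocess.run(["lemmatise"],
                          input="\n".join(w.text for w in words),
                          capture_output=True, text=True, check=True)
    for w, lemma in zip(words, proc.stdout.splitlines()):
        w.set("LM", lemma)        # fold the result back into the mark-up

    tree.write(sys.stdout, encoding="unicode")

</Section> </Paper>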