<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2038"> <Title>Syntax annotation for the GENIA corpus</Title> <Section position="2" start_page="3" end_page="220" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Research and development on information extraction from biomedical literature (bio-textmining) has been advancing rapidly, driven by information overload in the genome-related field. Natural language processing (NLP) techniques have been regarded as useful for this purpose. Now that the focus of information extraction is shifting from &quot;nominal&quot; information such as named entities to &quot;verbal&quot; information such as relations between entities, including events and functions, syntactic analysis has become an important issue for NLP applications in the biomedical domain. To extract a relation, the roles of the entities participating in it must be identified along with the verb that expresses the relation itself. In text analysis, this corresponds to identifying the subjects, objects, and other arguments of the verb.</Paragraph> <Paragraph position="1"> Although rule-based relation extraction systems using surface pattern matching and/or shallow parsing can achieve high precision (e.g., Koike et al., 2004) in a particular target domain, they tend to suffer from low recall because of the wide variation in the surface expressions that describe a relation between a verb and its arguments. In addition, the portability of such systems is low, because a system has to be re-equipped with a different set of rules whenever a different kind of relation is to be extracted. 
One solution to this problem is to use deep parsers, which can abstract away the syntactic variation of a relation between a verb and its arguments as expressed in the text, and to construct extraction rules on the resulting abstract predicate-argument structures.</Paragraph> <Paragraph position="2"> To do so, wide-coverage and high-precision parsers are required.</Paragraph> <Paragraph position="3"> While basic NLP techniques are relatively general and portable from domain to domain, customization and tuning are inevitable, especially when the techniques are to be applied effectively to highly specialized literature such as research papers and abstracts. As recent advances in NLP technology depend on machine-learning techniques, annotated corpora from which a system can acquire rules (including grammar rules, lexicon, etc.) are indispensable resources for customizing general-purpose NLP tools. In bio-textmining, for example, training on the part-of-speech (POS)-annotated GENIA corpus was reported to improve the accuracy of the JunK tagger (an English POS tagger) (Kazama et al., 2001) from 83.5% to 98.1% on MEDLINE abstracts (Tateisi and Tsujii, 2004), and the FraMed corpus (Wermter and Hahn, 2004) was used to train the TnT tagger (Brants, 2000) on German, improving its accuracy from 95.7% to 98% on clinical reports and other biomedical texts. A corpus annotated for syntactic structure is expected to play a similar role in tuning parsers to the biomedical domain; that is, a similar improvement in parser performance is expected from using a domain-specific treebank as a resource for learning. For this purpose, we construct the GENIA Treebank (GTB), a treebank of research abstracts in the biomedical domain.</Paragraph> </Section> </Paper>