<?xml version="1.0" standalone="yes"?> <Paper uid="I05-2038"> <Title>Syntax annotation for the GENIA corpus</Title> <Section position="3" start_page="220" end_page="221" type="metho"> <SectionTitle> 2 Outline of the Corpus </SectionTitle> <Paragraph position="0"> The base text of GTB is that of the GENIA corpus constructed at the University of Tokyo (Kim et al., 2003), a collection of research abstracts selected from MEDLINE search results for the keywords (MeSH terms) human, blood cells, and transcription factors. In the GENIA corpus, the abstracts are encoded in an XML scheme in which each abstract is identified by its MEDLINE UID and contains a title and an abstract. The text of the title and abstract is segmented into sentences, in which biological terms are annotated with their semantic classes.</Paragraph> <Paragraph position="1"> The GENIA corpus is also annotated for part-of-speech (POS) (Tateisi and Tsujii, 2004), and part of it is annotated for coreference by the MedCo project at the Institute for Infocomm Research, Singapore (Yang et al., 2004).</Paragraph> <Paragraph position="2"> GTB adds syntactic information to the GENIA corpus. By annotating several kinds of linguistic information on the same set of texts, the GENIA corpus will become a resource not only for individual purposes such as named entity extraction or parser training but also for integrated systems such as information extraction using deep linguistic analysis. 
A similar attempt to construct an integrated corpus is under way at the University of Pennsylvania, where a corpus of MEDLINE abstracts in the CYP450 and oncology domains has been annotated for named entities, POS, and the tree structure of sentences (Kulick et al., 2004).</Paragraph> <Section position="1" start_page="220" end_page="221" type="sub_section"> <SectionTitle> 2.1 Annotation Scheme </SectionTitle> <Paragraph position="0"> The annotation scheme basically follows the Penn Treebank II (PTB) scheme (Bies et al., 1995), encoded in XML. A non-null constituent is marked as an element whose tag is its syntactic category (which may be combined with function tags indicating grammatical roles, such as -SBJ, -PRD, and -ADV). A null constituent is marked as a childless element whose tag corresponds to its category. Other function tags are encoded as attributes. Figure 1 shows an example of a sentence annotated in XML and the corresponding PTB notation. The label &quot;S&quot; means &quot;sentence&quot;, &quot;NP&quot; noun phrase, &quot;PP&quot; prepositional phrase, and &quot;VP&quot; verb phrase.</Paragraph> <Paragraph position="1"> The label &quot;NP-SBJ&quot; means that the element is an NP that serves as the subject of the sentence.</Paragraph> <Paragraph position="2"> A null element, the trace of the object of &quot;studied&quot; moved by passivization, is denoted by &quot;&lt;NP NULL=&quot;NONE&quot; ref=&quot;i55&quot;/&gt;&quot; in XML and &quot;*-55&quot; in PTB notation. The number &quot;55&quot;, which refers to the identifier of the moved element, is given by the &quot;id&quot; and &quot;ref&quot; attributes in XML and as part of a label in PTB.</Paragraph> <Paragraph position="3"> In addition to changing the encoding, we made some modifications to the scheme. 
First, analysis within the noun phrase is simplified.</Paragraph> <Paragraph position="4"> Second, semantic subdivisions of adverbial phrases such as &quot;-TMP&quot; (time) and &quot;-MNR&quot; (manner) are not used: adverbial constituents other than &quot;ADVP&quot; (adverbial phrases) or &quot;PP&quot; used adverbially are marked with -ADV tags but not with semantic tags. Third, a coordination structure is explicitly marked with the attribute SYN=&quot;COOD&quot;, whereas in the original PTB scheme it is not marked as such.</Paragraph> <Paragraph position="5"> In our GTB scheme, &quot;NX&quot; (head of a complex noun phrase) and &quot;NAC&quot; (a certain kind of nominal modifier within a noun phrase) of the PTB scheme are not used. A noun phrase is generally left unstructured, mainly to simplify the annotation process. In biomedical abstracts, long noun phrases often involve multi-word technical terms whose syntactic structure is difficult to determine without deep domain knowledge. However, the structure of a noun phrase is usually independent of the structure outside the phrase, so it would be easier to have the phrases involving such terms analyzed independently (e.g. by biologists) and to merge the two analyses later. Thus we decided to leave noun phrases unstructured in GTB annotation unless their analysis is necessary for determining the structure outside the phrase. One exception is the case of coordination, where the coordinated constituents must be marked up explicitly.</Paragraph> <Paragraph position="6"> In addition, we have added the special attributes &quot;TXTERR&quot;, &quot;UNSURE&quot;, and &quot;COMMENT&quot; for later inspection. &quot;TXTERR&quot; is used when the annotator suspects a grammatical error in the original text; &quot;UNSURE&quot; is used when the annotator is not confident; and &quot;COMMENT&quot; is used for free comments (e.g. 
the reason for using &quot;UNSURE&quot;) by the annotator.</Paragraph> </Section> <Section position="2" start_page="221" end_page="221" type="sub_section"> <SectionTitle> 2.2 Annotation Process </SectionTitle> <Paragraph position="0"> The sentences in the titles and abstracts of the base text of the GENIA corpus are annotated manually using an XML editor used for the Global Document Annotation project (Hasida 2000). Although the sentence boundaries were adopted from the corpus, the tree structure annotation was done independently of the POS and term annotation already done on the GENIA corpus.</Paragraph> <Paragraph position="1"> The annotator was a Japanese non-biologist who had previously been involved in the POS annotation of the GENIA corpus and was accustomed to the style of research abstracts in English. Manually annotated abstracts are automatically converted to the PTB format and merged with the POS annotation of the GENIA corpus (version 3.02).</Paragraph> </Section> </Section> <Section position="4" start_page="221" end_page="221" type="metho"> <SectionTitle> 3 Annotation Results </SectionTitle> <Paragraph position="0"> So far, 500 abstracts have been annotated and converted to the merged PTB format. In the merging process, we found several annotation errors.</Paragraph> <Paragraph position="1"> The 500 abstracts, with these errors corrected, are publicly available as &quot;The GENIA Treebank Beta Version&quot; (GTB-beta).</Paragraph> <Paragraph position="2"> For further clean-up, we also tried parsing the corpus with the Enju parser (Miyao and Tsujii 2004) and identifying errors in the corpus by investigating the parse errors. Enju is an HPSG parser that can be trained with PTB-type corpora and is reported to have 87% accuracy on the Wall Street Journal portion of the Penn Treebank corpus. 
Currently the accuracy of the parser drops to 82% on GTB-beta. Although a proper quantitative analysis is yet to be done, we found that mismatches between the labels of the treebank and those of the GENIA POS corpus (e.g. an -ing form labeled as a noun in the POS corpus and as the head of a verb phrase in the tree corpus) are a major source of parse errors. The correction is complicated because several errors in the GENIA POS corpus were also found in this clean-up process. When the clean-up is done, we will make the corpus publicly available as the proper release.</Paragraph> <Paragraph position="3"> [Figure 1 (fragment of the XML example): &lt;S&gt;&lt;PP&gt;In &lt;NP&gt;the present paper &lt;/NP&gt;&lt;/PP&gt;, ...]</Paragraph> </Section> <Section position="5" start_page="221" end_page="223" type="metho"> <SectionTitle> 4 Inter-Annotator Agreement </SectionTitle> <Paragraph position="0"> We have also checked inter-annotator agreement.</Paragraph> <Paragraph position="1"> Although the PTB scheme is popular in the natural language processing community, its applicability to highly specialized text such as research abstracts is yet to be discussed. In particular, when the annotation is done by linguists, lack of domain knowledge might decrease the stability and accuracy of annotation.</Paragraph> <Paragraph position="2"> A small part of the base text set (10 abstracts) was annotated by another annotator. The 10 abstracts were chosen randomly and had 6 to 17 sentences each (108 sentences in total). The new annotator had a background similar to that of the first annotator: she is a Japanese non-biologist who has experience in translation of technical documents in English and in corpus annotation of English texts.</Paragraph> <Paragraph position="3"> [Figure 1 caption (partially recovered): &quot;... a [125I]-labeled aldosterone derivative to plasma membrane rich fractions of HML was studied&quot; annotated in XML and PTB formats.]</Paragraph> <Paragraph position="4"> The two results were examined manually, and there were 131 disagreements. 
Almost every sentence had at least one disagreement. We made a 'gold standard' from the two sets of abstracts by resolving the disagreements; the accuracy of the annotators against this gold standard was 96.7% for the first annotator and 97.4% for the second.</Paragraph> <Paragraph position="5"> Among the disagreements, the most prominent were the cases involving coordination, especially those with ellipsis. For example, one annotator annotated the phrase 'IL-1- and IL-18-mediated function' as in Figure 2a, and the other as in Figure 2b.</Paragraph> <Paragraph position="6"> This problem is addressed in the PTB guideline, and both formats are allowed as alternatives. As coordination with ellipsis occurs rather frequently in research abstracts, this kind of phenomenon lowers the agreement rate more than it does in the Penn Treebank. Of the 131 disagreements, 25 were on this type of coordination.</Paragraph> <Paragraph position="7"> Another source of disagreement is the attachment of modifiers such as prepositional phrases and pronominal adjectives. However, most are cases of 'benign ambiguity', where the difference in structure does not affect the interpretation, such as 'high expression of STAT in monocytes', where the prepositional phrase 'in monocytes' can attach to 'expression' or 'STAT' without much difference in meaning, and 'is augmented when the sensitizing tumor is a genetically modified variant', where the wh-clause can attach to 'is augmented' or 'augmented' without changing the meaning. The PTB guideline states that the modifier should be attached at the higher level in the former case and at the lower level in the latter. 
In the annotation results, one annotator consistently attached the modifiers in both cases at the higher level, and the other consistently at the lower level, indicating that the problem lies in understanding the scheme rather than in understanding the sentence.</Paragraph> <Paragraph position="8"> Only 15 cases were true ambiguities that needed knowledge of biology to resolve, of which 5 involved coordination (e.g., the scope of 'various' in 'various T cell lines and peripheral blood cells').</Paragraph> <Paragraph position="9"> Although the number was small, there were disagreements on how to annotate a mathematical formula such as 'n=2' embedded in a sentence, since mathematical formulae were outside the scope of the original PTB scheme. One annotator consistently annotated this kind of phrase as a phrase with '=' as an adjective, and the other as a phrase with '=' as a verb. There were 6 such cases.</Paragraph> <Paragraph position="10"> Another disagreement particular to abstracts is the treatment of labeled sentences. There were 8 sentences in two abstracts that begin with a label such as 'Background:'. One annotator included the colon (':') in the label, while the other did not. Yet another is that one annotator regarded the phrase 'Author et al' as coordination, and the other regarded 'et al' as a modifier.</Paragraph> <Paragraph position="11"> Other disagreements are of more general types, such as regarding the '-ed' form of a verb as an adjective or a participle; miscellaneous errors such as omission of a subtype of a label (such as '-PRD' or '-SBJ') or the position of &lt;PRN&gt; tags with regard to ',' for an inserted phrase; and errors that look simply 'careless'. 
Such disagreements and mistakes can be at least partially eliminated when reliable taggers and parsers are available for preprocessing.</Paragraph> </Section> <Section position="6" start_page="223" end_page="224" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The result of the inter-annotator agreement test indicates that the writing style, rather than the content, of research abstracts is the source of difficulty in tree annotation. Contrary to the expectation that lack of domain knowledge causes problems in annotating modifier attachment, the number of cases where annotating modifier attachment requires domain knowledge is small. This indicates that linguists can annotate most syntactic structure without expert-level domain knowledge.</Paragraph> <Paragraph position="1"> A major source of difficulty is coordination, especially coordination involving ellipsis. Coordination is reported to be a difficult phenomenon in annotation at different levels of the GENIA corpus (Tateisi and Tsujii, 2004; Kim et al., 2003). In addition to being the major source of inter-annotator disagreement, coordinated structures were often marked by the annotator as 'unsure'.</Paragraph> <Paragraph position="2"> The problem of coordination can be divided into two problems of different nature: one is that the annotation policy is still not well established for coordination involving ellipsis, and the other is the ambiguity that arises when the coordinated phrase has modifiers.</Paragraph> <Paragraph position="3"> Syntax annotation of coordination with ellipsis is difficult in general, but more so for abstracts than for general texts, because authors of abstracts tend to pack information into a limited number of words. The PTB guideline dedicates a long section to this phenomenon and allows alternatives in annotation, but there are still cases that are not well covered by the scheme. 
For example, in addition to the disagreement, the phrase illustrated in Figure 2a and Figure 2b shows another problem of the annotation scheme. Both annotators fail to indicate that it is 'mediated' that was elided after 'IL-1', because there is no mechanism for coindexing a null element with a part of a token.</Paragraph> <Paragraph position="4"> This problem of ellipsis occurs frequently in research abstracts, and it can be argued that the tokenization criteria must be changed for texts in the biomedical domain (Yamamoto and Satou, 2004) so that fragments such as 'IL-18' and 'mediated' in 'IL-18-mediated' are regarded as separate tokens. The Pennsylvania biology corpus (Kulick et al., 2004) partially solves this problem by splitting a token in which two or more subtokens are connected with hyphens, but where the shared part of a word is not separated by a hyphen (e.g. 'metric' in 'stereo- and isometric alleles'), the word including that part is left uncut. The current GTB follows the GENIA corpus in retaining the tokenization criteria of the original Penn Treebank, but this must be reconsidered in the future.</Paragraph> <Paragraph position="5"> For the analysis of coordination with ellipsis, if information on the full forms is available, one strategy would be to leave the internal structure of the coordination unannotated in the treebank corpus (so that, in text analysis, the structure is established not in the parsing phase but by a different mechanism) and later merge it with the coordination structure annotation. The GENIA term corpus annotates the full form of a technical term part of which is omitted on the surface, as an attribute of the '&lt;cons&gt;' element that marks a technical term (Kim et al., 2003). In the above-mentioned Pennsylvania corpus, a similar mechanism ('chaining') is used for recovering the full forms of named entities. 
However, in both corpora, no such information is available outside the terms/entities.</Paragraph> <Paragraph position="6"> Cases where the scope of modification in coordinated phrases is problematic are few, but they are more difficult in abstracts than in general texts because resolving the ambiguity requires domain knowledge. If term/entity annotation is already done, that information can help resolve this type of ambiguity, but again the problem is that no such information is available outside the terms/entities. It would be practical to leave the structure flat but specially marked when the tree annotators are unsure, and to have a domain expert resolve the ambiguity, as the sentences that need such intervention seem to be few. Some cases of ambiguity in modifier attachment (which do not involve coordination) can be resolved by a similar process.</Paragraph> <Paragraph position="7"> We believe that the other types of disagreement can be resolved by supplementing the criteria for linguistic phenomena not well covered by the scheme, and by annotator training. Automatic preprocessing by POS taggers and parsers can also help increase the consistency of annotation.</Paragraph> </Section> </Paper>