<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1903"> <Title>Budapest, and the Research Institute for Linguistics at the Hungarian Academy of Sciences</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> The Szeged Corpus is a manually annotated natural language corpus currently comprising 1.2 million word entries, 145 thousand different word forms, and an additional 225 thousand punctuation marks. This makes it the largest manually processed Hungarian textual database, serving as reference material for research in natural language processing as well as a learning database for machine learning algorithms and other software applications. Language processing of the corpus texts has so far included morpho-syntactic analysis, POS tagging and shallow syntactic parsing.</Paragraph> <Paragraph position="1"> Semantic information was also added to a pre-selected section of the corpus to support automated information extraction.</Paragraph> <Paragraph position="2"> The present state of the Szeged Corpus (Alexin et al., 2003) is the result of three national projects and the cooperation of the University of Szeged, Department of Informatics, MorphoLogic Ltd.</Paragraph> <Paragraph position="3"> Budapest, and the Research Institute for Linguistics at the Hungarian Academy of Sciences.</Paragraph> <Paragraph position="4"> Corpus texts have gone through different phases of natural language processing (NLP) and analysis.</Paragraph> <Paragraph position="5"> Extensive and accurate manual annotation of the texts, representing over 124 person-months of work, is a major asset of the corpus.</Paragraph> <Paragraph position="6"> 1 Texts of the Szeged Corpus When selecting texts for the Szeged Corpus, the main criterion was that they should be thematically representative of different text types. 
The first version of the corpus therefore contains texts from five genres, roughly 200 thousand words each. Owing to its relative variety, it serves as good reference material for natural language research applications, and has proved large enough to ensure the robustness of machine learning methods. Genres of Szeged Corpus 1.0 include: In later development, the first version of the corpus was extended with a 200-thousand-word sample of short business news. The newly added section served as an experimental database for learning semantic frame mapping, to be integrated later into an information extraction (IE) technology. Table 1 shows data for Szeged Corpus 2.0.</Paragraph> <Paragraph position="7"> 2 Annotation of the Szeged Corpus Morpho-syntactic analysis and POS tagging of the corpus texts involved two steps. First, words were morpho-syntactically analysed with the Humor automatic pre-processor. The program determined the possible morpho-syntactic labels of the lexicon entries, thereby creating the ambiguous version of the corpus. After pre-processing, the entire corpus was manually disambiguated (POS tagged) by linguists. For the tagging of the Szeged Corpus, the Hungarian version of the internationally acknowledged MSD (Morpho-Syntactic Description) scheme (Erjavec and Monachini, 1997) was selected. Because the MSD encoding scheme is extremely detailed (one label can store information on up to 17 positions), there are many ambiguous cases, i.e. a word is likely to have more than one possible label. Experience shows that under the MSD encoding scheme roughly every second word of the corpus is ambiguous. Disambiguation therefore required accurate and detailed work, amounting to 64 person-months of manual annotation. 
Currently, all possible labels as well as the selected ones are stored in the corpus.</Paragraph> <Paragraph position="8"> A unique feature of the corpus is that, in parallel with POS tagging, users' rules have been defined for each ambiguous word in a pre-selected (202 600-word-long) section of the corpus. The aim of the users' rules was to mark the relevant context (relevant set of words) that determines the selection of a particular POS tag. Users' rules apply before1, before2, ... after1, after2, ...</Paragraph> <Paragraph position="9"> predicates for marking the relevant context of a word. The manually defined rules can then be generalised to regular disambiguation rules applicable to unknown texts as well. Of the selected 202 600 words, 114 951 were ambiguous.</Paragraph> <Paragraph position="10"> Annotators defined users' rules for these cases, among which 26 912 different rules were found.</Paragraph> <Paragraph position="11"> The major advantage of the defined rules lies in their accuracy and specificity, which makes them an interesting and valuable source of additional linguistic information that can, for example, support more precise training of machine learning algorithms.</Paragraph> </Section> </Paper>