<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1903">
  <Title>Budapest, and the Research Institute for Linguistics at the Hungarian Academy of Sciences</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Humor morpho-syntactic analyser is a product
</SectionTitle>
    <Paragraph position="0"> of MorphoLogic Ltd., Budapest.</Paragraph>
    <Paragraph position="1"> 4 The predicate after2, e.g., denotes the second word to the right of the focus word.</Paragraph>
    <Paragraph position="2"> After the completion of POS tagging, a project5 was initiated to extend the work to shallow syntactic parsing of the Szeged Corpus. The linguistic information identified by shallow syntactic parsing is rich enough to support a number of large-scale NLP applications, including information extraction (IE), text summarisation, machine translation, phrase identification in information retrieval, named entity identification, and a variety of text-mining operations. To achieve this goal, researchers at the University of Szeged Department of Informatics, MorphoLogic Ltd.</Paragraph>
    <Paragraph position="3"> Budapest, and the Research Institute for Linguistics at the Hungarian Academy of Sciences conducted research on the syntax of Hungarian sentences, NP annotation schemes, and rules for the recognition of phrases. Results showed that in Hungarian, nominal structures typically carry the most significant meaning (semantic content) within a sentence; therefore, NP annotation seemed the most reasonable step forward.</Paragraph>
    <Paragraph position="4"> Shallow parsing was carried out on the entire Szeged Corpus 2.0 (1.2 million words). Automated pre-parsing was performed with the help of the CLaRK6 program, in which regular syntactic rules were defined by linguistic experts for the recognition of NPs. Because the CLaRK parser did not fully cover the occurring NP structures (its coverage was around 70%), manual validation and correction could not be avoided. In total, 250 thousand highest-level NPs were found, and the deepest NP structure contained 9 NPs embedded in each other. The majority of the hierarchical NP structures were between 1 and 3 NPs deep. Manual validation and correction took 60 person-months.</Paragraph>
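Regular syntactic rules for NP recognition of the kind described above can be illustrated with a toy chunker. The tag set and the grammar below are invented for illustration and are not CLaRK's actual rules:

```python
def chunk_nps(tags):
    """Greedy left-to-right NP chunker over a POS tag list.

    Toy rule (hypothetical, not the project's actual grammar):
        NP -> (DET)? (ADJ)* (NOUN)+
    Returns half-open (start, end) token spans of maximal NPs.
    """
    spans, i, n = [], 0, len(tags)
    while i < n:
        start = i
        if tags[i] == "DET":          # optional determiner
            i += 1
        while i < n and tags[i] == "ADJ":   # any number of adjectives
            i += 1
        if i < n and tags[i] == "NOUN":     # at least one noun head
            while i < n and tags[i] == "NOUN":
                i += 1
            spans.append((start, i))
        else:
            i = start + 1             # no noun head: not an NP, advance
    return spans
```

A real rule set would cover far more structures (hence the ~70% coverage reported above, and the need for manual correction of the remainder).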
    <Paragraph position="5"> As a continuation of shallow parsing, the clause structure (CPs) of the corpus sentences was also marked. Labelling clauses followed the same approach as the earlier NLP phases: automatic pre-annotation followed by manual correction and supplementation.</Paragraph>
    <Paragraph position="6"> 3 Use of the Szeged Corpus for training and testing machine learning algorithms Due to the accurate and exhaustive manual annotation, the resulting corpus (both the first and second versions) could serve as an adequate database for the training and testing of machine learning algorithms. 5 The National Research and Development Programmes (NKFP) 2/017/2001 project, funded by the Hungarian Ministry of Education, titled Information Extraction from Short Business News.</Paragraph>
    <Paragraph position="7"> 6 The CLaRK system was developed by Kiril Simov at the Bulgarian Academy of Sciences in the framework of the BulTreeBank project.</Paragraph>
    <Paragraph position="8"> The applicability of these algorithms in Hungarian NLP has been studied extensively over the past few years (Horvath et al., 1999; Hocza et al., 2003). Researchers at the University of Szeged experimented with different kinds of POS tagging methods and compared their accuracy. Brill's transformation-based learning method (Brill, 1995) achieved 96.52% per-word accuracy when trained and tested on the corpus. The HMM-based TnT tagger (Brants, 2000) reached 96.18%, while the rule-based RGLearn tagger (Hocza et al., 2003) produced 94.54% accuracy. Researchers also experimented with combining the different learning methods in order to increase accuracy.</Paragraph>
    <Paragraph position="9"> The best accuracy, 96.95%, was delivered by combining the above three methods. Overall results showed that despite the agglutinating nature of the Hungarian language and the structural differences between Hungarian and other Indo-European languages, all of the mentioned methods can be used effectively for learning POS tagging. The applicability of machine learning methods for learning NP recognition rules was also investigated. The C4.5 (Quinlan, 1993) and RGLearn rule-based algorithms were selected for the learning process. NP recognition rules were retrieved from the annotated corpus and combined with manually defined expert rules. The main task of the NP recognition parser is to provide the best possible coverage of NP structures. The mentioned algorithms - although still under development - already achieve between 80% and 90% accuracy (see Table 3). Their performance strongly depends on the type of text processed: phrase structures are recognised more accurately in news or technical texts than in students' compositions (where sentences are often grammatically inaccurate) or legal texts (where sentences are typically extremely long and fragmented).</Paragraph>
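A common way to combine several taggers, as described above, is per-token majority voting. The sketch below illustrates the idea only; the paper does not specify the actual combination scheme, so the tie-breaking rule here is an assumption:

```python
from collections import Counter

def combine_taggers(token_tags):
    """Majority-vote combination of several POS taggers' outputs.

    `token_tags` is a list of per-tagger tag sequences for the same
    sentence. Ties fall back to the first tagger's proposal (assumed
    here to be the most accurate tagger); this is an illustrative
    choice, not the combination scheme used in the experiments.
    """
    combined = []
    for tags in zip(*token_tags):        # tags proposed for one token
        counts = Counter(tags)
        best, freq = counts.most_common(1)[0]
        # All taggers disagree, or the top count is shared: fall back.
        if freq == 1 or list(counts.values()).count(freq) > 1:
            best = tags[0]
        combined.append(best)
    return combined
```

With three taggers at roughly 94-97% individual accuracy, such a combination can outperform each member, consistent with the 96.95% combined result reported above.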
    <Paragraph position="10"> As a continuation of the work, an automated method was developed to perform IE from short business news. The 200-thousand-word short business news section of the corpus was used as the training database for the IE tool. In the preparatory phase, the selected section of the corpus was enriched with semantic information.</Paragraph>
    <Paragraph position="11"> Possible semantic roles, such as SELLER, BUYER, PRODUCT, PRICE, DATE, etc., were associated with each word and stored in a semantic dictionary. The most typical events of business life were represented by so-called semantic frames, which describe the relations of the different semantic roles. Possible frames were defined manually by linguists and allowed mapping between the lexical representation and the semantic role of a word. Semantic mapping rules were acquired by machine learning algorithms that used the manually annotated semantic roles as their learning source. The recognition of semantic frames was also supported by the series of NLP methods described earlier (i.e., POS tagging and shallow parsing).</Paragraph>
    <Paragraph position="12"> In the developed information extraction process, the trained mapping tool takes a morpho-syntactically and syntactically annotated piece of text and performs two operations. First, it processes the morpho-syntactically disambiguated and shallow-parsed text and assigns semantic roles to the words. Second, it determines relationships between the roles, i.e., it maps semantic frames onto the existing structures. Semantic mapping is realised by simple pattern-matching methods using the frames previously defined by experts. Based on the results of these operations, the mapping tool builds a semantic representation of the input text that already contains the required information. Results produced by this method were tested against the manually annotated corpus: the tool identifies semantic roles with 94-99% accuracy and maps frames with up to 80% accuracy.</Paragraph>
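The two-step mapping described above can be sketched as follows. The dictionary entries, tokens, and frame definition are hypothetical examples; the real system learned its mapping rules from the annotated corpus:

```python
# Step 1 uses a semantic dictionary mapping words to roles; step 2
# matches expert-defined frames against the assigned roles. All the
# concrete entries below are invented for illustration.
ROLE_DICT = {"MOL": "SELLER", "Shell": "BUYER", "refinery": "PRODUCT"}

# A frame is treated here as a set of roles that must all be present.
ACQUISITION_FRAME = ("BUYER", "PRODUCT", "SELLER")

def assign_roles(tokens):
    """Step 1: look each word up in the semantic dictionary."""
    return [(tok, ROLE_DICT.get(tok)) for tok in tokens]

def match_frame(role_annotated, frame):
    """Step 2: a frame fires if every one of its roles occurs."""
    found = {role for _, role in role_annotated if role is not None}
    return found.issuperset(frame)
```

In the real system the lookup is conditioned on morpho-syntactic context and the frames constrain role order and relations, which is why simple pattern matching suffices once the roles are assigned.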
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Current and future work
</SectionTitle>
    <Paragraph position="0"> Current work aims at a more detailed syntactic analysis of the Szeged Corpus. With this, the developers intend to lay the foundation of a Hungarian treebank, which is planned to be enriched with detailed semantic information in the future. The development of a suitable technique for the recognition and annotation of named entities (e.g., multi-word proper nouns) and special tokens (e.g., time expressions, dates, measures, bank account numbers, web and e-mail addresses, etc.) is also planned for the near future. Further work aims at building first domain-specific and later general ontologies, and at developing automated methods that allow for extensive semantic analysis and processing of Hungarian sentences.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Related work
</SectionTitle>
    <Paragraph position="0"> Corpus-based methods play an important role in empirical linguistics as well as in the application of machine learning algorithms. Annotated reference corpora, such as the Brown Corpus (Kucera and Francis, 1967), the Penn Treebank (Marcus et al., 1993), and the BNC (Leech et al., 2001), have helped the development of both English computational linguistics tools and English corpus linguistics. Manual POS tagging and syntactic annotation are costly but allow one to build and improve sizable linguistic resources and also to train and evaluate automated analysers.</Paragraph>
    <Paragraph position="1"> The NEGRA corpus (Skut et al., 1997), a POS-tagged and syntactically annotated corpus of 355 thousand tokens, was the first such initiative in corpus linguistics for German. The more recent TIGER Treebank project (Brants et al., 2002) aims at building the largest and most extensively annotated treebank for German; currently, it comprises 700 thousand tokens of newspaper text that were automatically analysed and manually checked. Considerable results have also been achieved for Czech in the framework of the Prague Dependency Treebank project (Hajic, 1998) and for Bulgarian in the BulTreeBank project (Simov et al., 2003).</Paragraph>
    <Paragraph position="2"> The Szeged Corpus project is comparable both in size and in depth of analysis to the corpus and treebank initiatives mentioned above7. As the first such initiative for the Hungarian language, it is a valuable source for linguistic research and a suitable training and testing basis for machine learning applications and the automated induction of linguistic knowledge.</Paragraph>
  </Section>
</Paper>