<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2128"> <Title>Learning Constraint Grammar-style disambiguation rules using Inductive Logic Programming</Title> <Section position="3" start_page="0" end_page="775" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The success of the Constraint Grammar (CG) (Karlsson et al., 1995) approach to part of speech tagging and surface syntactic dependency parsing is due to the minutely hand-crafted grammar and two-level morphology lexicon, developed over several years.</Paragraph> <Paragraph position="1"> In the study reported here, the Progol machine-learning system was used to induce CG-style tag eliminating rules from a one million word part of speech tagged corpus of Swedish. Some 7 000 rules were induced. When tested on unseen data, 98% of the words retained the correct tag. There were still ambiguities left in the output, on an average 1.13 readings per word.</Paragraph> <Paragraph position="2"> In the following sections, the CG framework and the Progol machine learning system will be presented very briefly.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Constraint Grammar POS tagging </SectionTitle> <Paragraph position="0"> Constraint Grammar is a system for part of speech tagging and (shallow) syntactic dependency analysis of unrestricted text. In the following, only the part of speech tagging step will be discussed.</Paragraph> <Paragraph position="1"> The following as a typical 'reductionistic' example of a CG rule which discards a verbal reading of a word following a word unambiguously tagged as determiner (Tapanainen, 1996, page 12): REMOVE (V) IF (-iC DET) ; where V is the target tag to be discarded and -IC DET denotes the word immediately to the left (-I), unambiguously (C) tagged as determiner (DET). There are several types of rules, not only 'reductionistic' ones, making the CG formalism quite powerful. A full-scale CG has hundreds of rules. The developers of English CG report that 99.7% of the words retain their correct reading, and that 93-97% of the words are unambiguous after tagging (Karlsson et al., 1995, page 186). A parser applying the constraints is described in Tapanainen (1996).</Paragraph> </Section> <Section position="2" start_page="0" end_page="775" type="sub_section"> <SectionTitle> 1.2 Inductive Logic Programming </SectionTitle> <Paragraph position="0"> Inductive Logic Programming (ILP) is a combination of machine learning and logic programming, where the goal is to find a hypothesis, H, given examples, E, and background knowledge, B, such that the hypothesis along with the background knowledge logically implies the examples (Muggleton, 1995, page 2): BAH~E The examples are usually split into a positive, E +, and a negative, E-, subset.</Paragraph> <Paragraph position="1"> The ILP system used in this paper, CProgol Version 4.2, uses Horn clauses as the representational language. Progol creates, for each E +, a most specific clause -l-i and then searches through the lattice of hypotheses, from specific to more general, bounded by \[\] -< H -<-l-i to find the clause that maximally compresses the data where &quot;< (0-subsumption) is defined as Cl .-<C2 -' ~ ~O:cIOCC 2 and 12 is the empty clause. 
<Paragraph position="3"> where $c_1 \preceq c_2$ under the substitution $\theta = \{X/a, Y/b\}$, since every literal of $c_1\theta = p(a, b) \leftarrow q(a, b)$ also occurs in $c_2$.</Paragraph>
<Paragraph position="4"> When Progol has found the clause that compresses the data the most, it is added to the background knowledge, and all examples that are redundant with respect to this new background knowledge are removed.</Paragraph>
<Paragraph position="5"> More informally, Progol builds the most specific clause for each positive example. It then tries to find a more general version of the clause (with respect to the background knowledge and mode declarations, see below) that explains as many positive and as few negative examples as possible.</Paragraph>
<Paragraph position="6"> Mode declarations specifying the properties of the rules have to be given by the user. A modeh declaration specifies the head of the rules, while modeb declarations specify what the bodies of the rules to induce may contain. The user also declares the types of the arguments, whether they are input or output arguments, and whether an argument should be instantiated by Progol. Progol is freely available and is documented in Muggleton (1995) and Roberts (1997).</Paragraph>
</Section>
<Section position="3" start_page="775" end_page="775" type="sub_section">
<SectionTitle> 1.3 The Stockholm-Umeå Corpus </SectionTitle>
<Paragraph position="0"> The training material in the experiments reported here is sampled from a pre-release of the Stockholm-Umeå Corpus (SUC). SUC covers just over one million words of part of speech tagged Swedish text, sampled from different text genres (largely following the Brown corpus text categories). The first official release is now available on CD-ROM.</Paragraph>
<Paragraph position="1"> The SUC tagset has 146 different tags. Each tag consists of a part of speech tag, e.g. VB (verb), followed by a (possibly empty) set of morphological features, such as PRS (present tense) and AKT (active voice). There are 25 different part of speech tags; thus, many of the 146 tags represent different inflected forms.</Paragraph>
<Paragraph position="2"> Examples of the tags are found in Table 1. The SUC tagging scheme is presented in Ejerhed et al. (1992).</Paragraph>
</Section>
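<Paragraph position="3"> As promised in Section 1.1, the following is a minimal, schematic sketch of reductionist rule application. It is a Python sketch written for this summary under assumed tag names, rule shape, and example sentence; it is not the paper's system or the CG parser's actual implementation. It shows how a rule such as REMOVE (V) IF (-1C DET); eliminates readings, and how the average number of readings per word left in the output (the paper reports 1.13 on unseen data) would be computed:

# A minimal sketch of Constraint Grammar-style reductionist tagging:
# every word starts out with all candidate readings (tags) from the
# lexicon, and context rules remove readings. The tag names, the rule,
# and the example sentence below are assumptions made for illustration.

def unambiguous(readings):
    """A word is unambiguous when exactly one reading remains."""
    return len(readings) == 1

def apply_remove_rule(sentence, target, offset, context_tag):
    """REMOVE (target) IF (<offset>C context_tag):
    discard `target` from a word's readings when the word at relative
    position `offset` is unambiguously tagged `context_tag`. As in CG,
    the last remaining reading of a word is never removed."""
    for i, readings in enumerate(sentence):
        j = i + offset
        if (0 <= j < len(sentence)
                and unambiguous(sentence[j])
                and context_tag in sentence[j]
                and target in readings
                and len(readings) > 1):
            readings.discard(target)

# Each word is represented as a set of candidate readings.
sentence = [{"DET"}, {"V", "NN"}]   # e.g. a noun/verb-ambiguous word after a determiner
apply_remove_rule(sentence, target="V", offset=-1, context_tag="DET")
print(sentence)                     # [{'DET'}, {'NN'}]

# Average number of readings per word left in the output:
print(sum(len(r) for r in sentence) / len(sentence))  # 1.0

</Paragraph>
</Section>
</Paper>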