<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0909"> <Title>Unsupervised Lexical Learning with Categorial Grammars</Title> <Section position="3" start_page="59" end_page="60" type="metho"> <SectionTitle> 2 Categorial Grammar </SectionTitle> <Paragraph position="0"> Categorial Grammar (CG) (Wood, 1993; Steedman, 1993) provides a functional approach to lexicalised grammar, and so can be thought of as defining a syntactic calculus. Below we describe the basic (AB) CG, although in future it will be necessary to pursue a more flexible version of the formalism.</Paragraph> <Paragraph position="1"> There is a set of atomic categories in CG, which are usually nouns (n), noun phrases (np) and sentences (s). It is then possible to build up complex categories using the two slash operators &quot;/&quot; and &quot;\&quot;. If A and B are categories then A/B is a category and A\B is a category. With basic CG there are just two rules for combining categories: the forward (FA) and backward (BA) functional application rules. Following Steedman's notation (Steedman, 1993) these are:</Paragraph> <Paragraph position="2"> A/B B => A (FA) B A\B => A (BA)</Paragraph> <Paragraph position="3"> Therefore, for an intransitive verb like &quot;run&quot; the complex category is s\np and for a transitive verb like &quot;take&quot; it is (s\np)/np. Figure 1 shows the parse derivation for &quot;John ate the apple&quot;. The CG described above has been shown to be weakly equivalent to context-free phrase structure grammars (Bar-Hillel et al., 1964). 
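The two application rules can be made concrete with a short sketch. This is a minimal illustration under an assumed tuple encoding of categories (atomic categories as strings, complex categories as (slash, result, argument) triples); the encoding and function names are ours, not taken from the paper's system.

```python
# Hypothetical encoding:
#   s\np      -> ('\\', 's', 'np')
#   (s\np)/np -> ('/', ('\\', 's', 'np'), 'np')

def fa(left, right):
    """Forward application: A/B  B  =>  A."""
    if isinstance(left, tuple) and left[0] == '/' and left[2] == right:
        return left[1]
    return None  # the rule does not apply

def ba(left, right):
    """Backward application: B  A\\B  =>  A."""
    if isinstance(right, tuple) and right[0] == '\\' and right[2] == left:
        return right[1]
    return None

# "ate" = (s\np)/np consumes the object np by FA; the resulting
# s\np then consumes the subject np by BA, yielding s.
ate = ('/', ('\\', 's', 'np'), 'np')
vp = fa(ate, 'np')
assert vp == ('\\', 's', 'np')
assert ba('np', vp) == 's'
```

The nesting of the tuples mirrors the nesting of the slashes, so a full derivation is just repeated application of these two functions over adjacent categories.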
While such expressive power covers a large amount of natural language structure, it has been suggested that a more flexible and expressive formalism may capture natural language more accurately (Wood, 1993; Steedman, 1993).</Paragraph> <Paragraph position="4"> This has led to some distinct branches of research into usefully extending CG, which will be investigated in the future.</Paragraph> <Paragraph position="5"> CG has at least the following advantages for our task.</Paragraph> <Paragraph position="6"> * Learning the lexicon and the grammar is one task.</Paragraph> <Paragraph position="7"> * The syntax directly corresponds to the semantics. The first of these is vital for the work presented here. Because the syntactic structure is defined by the complex categories assigned to the words, it is not necessary to have separate learning procedures for the lexicon and for the grammar rules. Instead, there is just one procedure for learning the lexical assignments to words. Secondly, the syntactic structure in CG parallels the semantic structure, which allows an elegant interaction between the two. While this feature of CG is not used in the current system, it could be used in the future to add semantic background knowledge to aid the learner (e.g. 
Buszkowski's discovery procedures (Buszkowski, 1987)).</Paragraph> </Section> <Section position="4" start_page="60" end_page="62" type="metho"> <SectionTitle> 3 The Learner </SectionTitle> <Paragraph position="0"> The system we have developed for learning lexicons and assigning parses to unannotated sentences is shown diagrammatically in Figure 2.</Paragraph> <Paragraph position="1"> In the following sections we explain the learning setting and the learning procedure respectively.</Paragraph> <Section position="1" start_page="60" end_page="61" type="sub_section"> <SectionTitle> 3.1 The Learning Setting </SectionTitle> <Paragraph position="0"> The input to the learning setting has five parts: the corpus, the lexicon, the CG rules, the set of legal categories and a probabilistic parser, which are discussed below.</Paragraph> <Paragraph position="1"> The Corpus The corpus is a set of unannotated positive examples represented in Prolog as facts containing a list of words e.g.</Paragraph> <Paragraph position="2"> ex ( \[mary, loved, a, computer\] ).</Paragraph> <Paragraph position="3"> The Lexicon The lexicon is a set of Prolog facts of the form: lex(Word, Category, Frequency).</Paragraph> <Paragraph position="4"> Where Word is a word, Category is a Prolog representation of the CG category assigned to that word and Frequency is the number of times this category has been assigned to this word up to the current point in the learning process. The Rules The CG functional application rules (see Section 2) are supplied to the learner. Extra rules may be added in future for fuller grammatical coverage.</Paragraph> <Paragraph position="5"> The Categories The learner has a complete set of the categories that can be assigned to a word in the lexicon. The complete set is shown in Table 1.</Paragraph> <Paragraph position="6"> The Parser The system employs a probabilistic chart parser, which calculates the N most probable parses, where N is the beam set by the user. 
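The relative-frequency estimate with unseen-category smoothing that the parser uses can be sketched as follows. The dictionary representation and names are illustrative assumptions; the actual system stores Prolog lex(Word, Category, Frequency) facts, and words given fixed categories before execution are exempt from smoothing.

```python
# Assumed representation: lexicon maps word -> {category: frequency}.
# This category list is a stand-in for the fixed set in Table 1.
CATEGORIES = ['n', 'np', 's', 's\\np', '(s\\np)/np']

def category_prob(lexicon, word, category):
    counts = dict(lexicon.get(word, {}))
    # Smoothing: every category the word has not yet appeared as
    # gets a pseudo-frequency of one, so a completely new word's
    # category is decided entirely by its context.
    for c in CATEGORIES:
        counts.setdefault(c, 1)
    return counts[category] / sum(counts.values())

# A new word: all five categories are equally likely.
assert category_prob({}, 'telephone', 'np') == 1 / 5
# A word seen three times as a noun: P(n) = 3 / (3 + 4).
assert category_prob({'apple': {'n': 3}}, 'apple', 'n') == 3 / 7
```

Non-lexical edge probabilities then follow by multiplying the probabilities of the two combined edges, as described for the chart parser.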
The probability of a word being assigned a category is based on the relative frequency, which is calculated from the current lexicon. This probability is smoothed (for words that have not been given fixed categories prior to execution) to allow the possibility that the word may appear as other categories. For all categories for which the word has not appeared, it is given a frequency of one. This is particularly useful for new words, as it ensures the category of a word is determined by its context. Each non-lexical edge in the chart has a probability calculated by multiplying the probabilities of the two edges that are combined to form it. Edges between two vertices are not added if there are already N edges labelled with the same category and a higher probability between the same two vertices (if one has a lower probability it is replaced). Also, for efficiency, edges are not added between vertices if there is an edge already in place with a much higher probability. The chart in Figure 3 shows examples of edges that would not be added. The top half of the chart shows one parse and the bottom half another. If N were set to 1 then the dashed edge spanning all the vertices would not be added, as it has a lower probability than the other s edge covering the same vertices. Similarly, the dashed edge between the first and third vertices would not be added, as the probability of the n is so much lower than the probability of the np.</Paragraph> <Paragraph position="7"> It is important that the parser is efficient, as it is used on every example and each word in an example may be assigned any category. As will be seen, it is also used extensively in selecting the best parses. In future we hope to investigate the possibility of using more restricted parsing techniques, e.g. 
deterministic parsing technology such as that described by Marcus (Marcus, 1980), to increase efficiency and allow larger-scale experiments.</Paragraph> </Section> <Section position="2" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 3.2 The Learning Procedure </SectionTitle> <Paragraph position="0"> Having described the various components with which the learner is provided, we now describe how they are used in the learning procedure.</Paragraph> <Paragraph position="1"> Parsing the Examples Examples are taken from the corpus one at a time and parsed. Each example is stored with the group of parses generated for it, so they can be efficiently accessed in future. The parse that is selected (see below) as the current correct parse is maintained at the head of this group. The head parse contributes information to the lexicon and annotates the corpus. The parses are also used extensively for the efficiency of the parse selection module, as will be described below. When the parser fails to find an analysis of an example, either because it is ungrammatical, or because of the incompleteness of the coverage of the grammar, the system skips to the next example.</Paragraph> <Paragraph position="2"> The Parse Selector Once an example has been parsed, the N most probable parses are considered in turn to determine which can be used to make the most compressive lexicon (by a given measure), following the compression-as-learning approach of, for example, Wolff (Wolff, 1987). The current size measure for the lexicon is the number of atomic categories within it. However, it is not enough to look at what a parse would add to the lexicon. The effect of changing the lexicon on the parses of previous examples must be considered. Changes in the frequency of assignments can cause the probabilities of previous parses to change and thus correct mistakes made earlier when the evidence from the lexicon was too weak to assign the correct parse. 
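The size measure used by the parse selector, the number of atomic categories in the lexicon, can be sketched as a recursive count, assuming a nested tuple encoding of complex categories (atomic category as a string, complex category as a (slash, result, argument) triple). This is our illustration, not the system's actual code.

```python
def atomic_count(category):
    """Number of atomic categories occurring within one category."""
    if isinstance(category, tuple):  # (slash, result, argument)
        return atomic_count(category[1]) + atomic_count(category[2])
    return 1  # an atomic category such as 'np'

def lexicon_size(lexicon):
    """Size measure for a lexicon mapping word -> list of categories."""
    return sum(atomic_count(c) for cats in lexicon.values() for c in cats)

# np counts 1; (s\np)/np counts 3 (s, np, np); total 4.
lex = {'john': ['np'], 'ate': [('/', ('\\', 's', 'np'), 'np')]}
assert lexicon_size(lex) == 4
```

Under this measure, a candidate parse that reuses existing lexical assignments adds nothing to the count, while one that introduces a new complex category is penalised in proportion to that category's complexity.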
This correction is effected by reparsing previous examples that may be affected by the addition of the new parse to the lexicon. Not reparsing those examples that will not be affected saves a great deal of time. In this way a new lexicon is built from the reparsed examples for each hypothesised parse of the current example. The parse leading to the most compressive of these is chosen. The amount of reparsing is also reduced by using stored parse information.</Paragraph> <Paragraph position="3"> This may appear an expensive way of determining which parse to select, but it enables the system to calculate the most compressive lexicon and keep an up-to-date annotation for the corpus. Also, the chart parser works in polynomial time and it is possible to do significant pruning, as outlined, so few sentences need to be reparsed each time. However, in the future we will look at ways of determining which parse to select that do not require complete reparsing.</Paragraph> <Paragraph position="4"> Lexicon Modification The final stage takes the current lexicon and replaces it with the lexicon built with the selected parse. The whole process is repeated until all the examples have been parsed. The final lexicon is the one left after the final modification. The most probable annotation of the corpus is the set of top-most parses after the final parse selection.</Paragraph> </Section> </Section> <Section position="5" start_page="62" end_page="63" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> Experiments were performed on three different corpora, all containing only positive examples.</Paragraph> <Paragraph position="1"> Experiments were performed with and without a partial lexicon of closed-class words (words of categories with a finite number of members) with fixed categories and probabilities, e.g. determiners and prepositions. 
All experiments were carried out on an SGI Origin 2000.</Paragraph> <Paragraph position="2"> Experiments on Corpus 1 The first corpus was built from a context-free grammar (CFG), using a simple random generation algorithm.</Paragraph> <Paragraph position="3"> The CFG (shown in Figure 4) covers a range of simple declarative sentences with intransitive, transitive and ditransitive verbs and with adjectives. The lexicon of the CFG contained 39 words with an example of noun-verb ambiguity.</Paragraph> <Paragraph position="4"> The corpus consisted of 500 such sentences (Figure 5 shows examples). As the size of the lexicon was small and there was only a small amount of ambiguity, it was unnecessary to supply the partial lexicon, but the experiment was carried out for comparison. We also performed an experiment on 100 unseen examples to see how accurately they were parsed with the learned lexicon. The results were manually verified to determine how many sentences were parsed correctly.</Paragraph> <Paragraph position="6"> Experiments on Corpus 2 The second corpus was generated in the same way, but using extra rules (see Figure 6) to include prepositions, thus making the fragment of English more complicated.</Paragraph> <Paragraph position="7"> ex(\[mary, ran\]).</Paragraph> <Paragraph position="8"> ex(\[john, gave, john, a, boy\]).</Paragraph> <Paragraph position="9"> ex(\[a, dog, called, the, fish, a, small, ugly, desk\]). The lexicon used for generating the corpus was larger - 44 words in total. Again 500 examples were generated (see Figure 7 for examples) and experiments were carried out both with and without the partial lexicon. Again we performed an experiment on 100 unseen examples to see how accurately they were parsed.</Paragraph> <Paragraph position="10"> Experiments on Corpus 3 (The LLL Corpus) Finally, we performed experiments using the LLL corpus (Kazakov et al., 1998). 
This is a corpus of generated sentences for a substantial fragment of English. It is annotated with a certain amount of semantic information, which was ignored. The corpus contains 554 sentences; however, because of the restricted set of categories and CG rules, we limited the experiments to the 157 declarative sentences (895 words, with 152 unique words) in the corpus. Examples are shown in Figure 8. While our CG rules can handle a reasonable variety of declarative sentences, their coverage is by no means complete, not yet allowing any movement (e.g. topicalised sentences) or any adverbs. This was, unsurprisingly, something of a limitation. Also, this corpus is very small and sparse, making learning difficult. We decided to experiment to see how well the system performed under these conditions. Again we performed experiments with and without fixed closed-class words. Due to the lack of examples it was not possible to perform a test on unseen examples, something that needs to be pursued in the future.</Paragraph> <Paragraph position="11"> ex(\[no, manager, in, sandy, reads, every, machine\]).</Paragraph> <Paragraph position="12"> ex(\[the, manual, isnt, continuing\]).</Paragraph> <Paragraph position="13"> ex(\[no, telephone, sees, the, things\]).</Paragraph> <Paragraph position="14"> All experiments were performed with the minimum number of categories needed to cover the corpus, so, for example, in the experiments on Corpus 1 the categories for prepositions were not available to the parser. This will obviously affect the speed with which the learner performs. Also, the parser was restricted to two possible parses in each case.</Paragraph> </Section> </Paper>