<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2916"> <Title>Unsupervised Grammar Induction by Distribution and Attachment</Title> <Section position="4" start_page="0" end_page="118" type="metho"> <SectionTitle> 2 Previous approaches </SectionTitle> <Paragraph position="0"> Distributional methods analyze text by alignment, aiming to find equivalence classes covering substitutable units. We align common portions of texts, termed contexts, leaving distinct contiguous word-sequences, termed expressions. An expression and its context form an alignment pattern, which is defined as:</Paragraph> <Paragraph position="2"> From this alignment pattern, we can extract context-free grammar rules:</Paragraph> <Paragraph position="4"> While the definition of expression is straightforward, the definition of context is problematic. We would like as much context as possible, but word-sequence contexts become less probable as their length increases, making learning harder. Therefore, simple models of context are preferred, although the precise definition varies between systems.</Paragraph> <Paragraph position="5"> Distributional approaches to grammar induction fall into two categories, depending on their treatment of nested structure. The first category covers Expectation-Maximization (EM) systems. These systems propose constituents based on analysis of text, then select a non-contradictory combination of constituents for each sentence that maximizes a given metric, usually parsing probability. EM has the advantage that constituent probabilities are only compared when constituents compete; this removes the inherent bias towards shorter constituents, which tend to have higher probability. 
However, EM methods are more susceptible to data sparsity issues associated with raw text, because there is no generalization during constituent proposal.</Paragraph> <Paragraph position="6"> Examples of EM learning systems are Context Distribution Clustering (CDC) (Clark, 2001) and Constituent-Context Model (CCM) (Klein, 2005, Chapter 5), which avoid the aforementioned data-sparsity issues by using a part-of-speech (POS) tagged corpus, rather than raw text. Alignment Based Learning (ABL) (van Zaanen, 2000) is the only EM system applied directly to raw text. ABL uses minimal String-Edit Distance between sentences to propose constituents, from which the most probable combination is chosen. However, ABL is relatively inefficient and has only been applied to small corpora.</Paragraph> <Paragraph position="7"> The second category is that of incremental learning systems. An incremental system analyzes a corpus in a bottom-up fashion: each time a new constituent type is found, it is inserted into the corpus to provide data for later learning. This has the advantage of easing the data-sparsity issues described above because infrequent sequences are clustered into more frequent non-terminal symbols. However, in incremental systems, constituents are compared directly, which can lead to a bias towards shorter constituents.</Paragraph> <Paragraph position="8"> The EMILE system (Adriaans, 1999) learns shallow languages in an incremental manner, and has been applied to natural language under the assumption that such languages are shallow. Shallowness is the property whereby, for any constituent type in a language, there exist well-supported minimal units of that type. EMILE aligns complete sentences only, attempting to isolate minimal units, which are then used to process longer sequences. This method is efficient because alignment is non-recursive. 
However, as a consequence, EMILE offers only a limited treatment of nested and recursive structures.</Paragraph> <Paragraph position="9"> A more comprehensive approach to learning nested structure is found in the ADIOS system (Solan et al., 2003). ADIOS enumerates all patterns of a given length, under the condition that each sequence must have non-empty contexts and expressions. These patterns are ranked using an information gain metric, and the best pattern at each iteration is rewritten into the graph, before pattern scanning begins again. ADIOS learns context-sensitive equivalence classes, but does not induce grammars, and has not been formally evaluated against treebanks.</Paragraph> <Paragraph position="10"> Grammar induction systems are evaluated using standard metrics for parser evaluation, and in particular, the EVALB algorithm. The above systems have been evaluated with respect to the ATIS treebank. Compared with supervised parsers, these systems perform relatively poorly, with the strictly unsupervised EMILE and ABL systems recovering 16.8% and 35.6% of constituent structure respectively. The partially-supervised systems of CDC and CCM perform better, with the latter retrieving 47.6% of the constituent structure in ATIS. However, the strictly unsupervised systems of ABL, EMILE and ADIOS have not been evaluated on larger corpora, in part due to efficiency constraints.</Paragraph> </Section> <Section position="5" start_page="118" end_page="271" type="metho"> <SectionTitle> 3 Issues for distributional learning </SectionTitle> <Paragraph position="0"> There are many issues with distributional learning, especially when learning from raw text. First, previous systems hypothesize and select constituents according to the probability of their contexts: ABL, EMILE and CCM use the probability of proposed equivalence classes, or the equivalent context probability; ADIOS uses an information gain metric, again favouring probable contexts. 
However, when learning from raw text, this preference for hypotheses with more probable contexts means that open-class words will seldom be considered as contexts.</Paragraph> <Paragraph position="1"> In POS-based learners, it is possible to align open-class POS contexts. These contexts are demonstrably important despite low word probabilities, which suggests that selecting contexts on the basis of probability will be limited in success.</Paragraph> <Paragraph position="2"> The second problem relates to word-senses.</Paragraph> <Paragraph position="3"> Alignment proceeds by matching orthographic types, but these types can have numerous associated syntactic senses. For example, 'to' plays two distinct roles: infinitive marker or preposition. If we align using the orthographic type, we will often misalign words, as seen in the following alignment: I gave it to the man in the grey jacket John agreed to see me in 20 minutes Here, we are (mis)aligning a prepositional 'to', with an infinitive marker. The result would be a correctly identified noun-phrase, 'the man', and an incorrect structure, contradicting both the verb-group 'to see' and the noun-phrase 'me'. This problem does not affect POS-based learning systems, as POS tags are unambiguously assigned.</Paragraph> <Paragraph position="4"> Finally, grammar induction systems are typically inefficient, which prohibits training over large corpora. Distributional analysis is an expensive procedure, and must be performed for large numbers of word sequences. Previous approaches have tended to enumerate all alignment patterns, of which the best are selected using probabilistic metrics. 
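This preference can be made concrete with a small sketch (illustrative only; the toy corpus and the helper name rank_context_candidates are not from the paper): ranking word types by corpus frequency, as probability-based context selection effectively does, puts closed-class words at the top, so open-class words are rarely available as contexts.

```python
from collections import Counter

def rank_context_candidates(sentences):
    """Rank word types by corpus frequency (a proxy for the context
    probability that ABL, EMILE, CCM and ADIOS all favour)."""
    counts = Counter(w for s in sentences for w in s.split())
    return [w for w, _ in counts.most_common()]

# Toy corpus, not the paper's data.
corpus = [
    "the cat sat on the mat",
    "the dog slept on the rug",
    "a bird sang in the tree",
]

ranked = rank_context_candidates(corpus)
print(ranked[:2])  # the two most frequent types are function words
```

Even on three sentences, the determiner and preposition dominate the ranking, while every open-class word ties at the bottom.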
However, given the preference for probable alignments, there is considerable wasted computation here, and it is on this issue that we shall focus.</Paragraph> <Paragraph position="5"> 4 A heuristic approach to alignment Rather than enumerating all word sequences in a corpus, we propose a heuristic for guiding distributional systems towards more favourable alignment patterns, in a system called Directed Alignment. In this system, we define context as the ordered pair of left- and right-context for a given constituent, <Cleft [?]Cright> , where Cleft and Cright are singleunits. The atomic units of this system are words, but learned constituents may also act as context-units.</Paragraph> <Paragraph position="6"> The probability of a pattern depends primarily on its contexts, since they are common to all matching sequences. We can reduce the task of finding probable alignments to simply finding probable contextpairs. However, we can reduce this further: for a context-pair to be probable, its components must also be probable. Therefore, rather than enumerating all patterns in the corpus, we direct the alignment procedure towards patterns where Cleft and Cright are probable.</Paragraph> <Paragraph position="7"> The first stage of direction creates an index for the corpus, compiling a list of unit types, where units are initially words. From this list of types, the most probable 1% are selected as context-units. These context-units are the only types allowed to fill the roles Cleft and Cright in alignment patterns.</Paragraph> <Paragraph position="8"> Alignments are created directly from the context-unit index. For each context-unit token cu in the index, we locate cu in the corpus and create an alignment pattern, such that cu is the left context (Cleft). Next, we scan the sequence of words following cu, extending the alignment pattern until another context-unit cu' is found, or a fixed length threshold is exceeded. 
If cu' is found, it fills the role of right context (Cright), and the completed alignment pattern is cached; otherwise, the pattern is disregarded. Direction permits two forms of valid expressions in the context <cu[?]cu'>:
1. nc1 ... ncn, where each nci is a non-context
2. c1 ... cn, where each ci is a context-unit
The first of these forms allows us to examine non-nested alignments. The second allows us to analyze nested alignments only after inner constituents have been learned. These constraints reduce the number of constituents under consideration at any time to a manageable level. As a result, we can scan very large numbers of alignment patterns with relatively little overhead.</Paragraph> <Paragraph position="9"> As an example, consider the following sequence, with context units underlined:
put the whole egg , all the seasonings and vegetables into the bowl and process for 10 seconds until smoothly pureed .</Paragraph> <Paragraph position="10"> This would be broken into non-recursive expressions:
(put) the (whole egg) , all the (seasonings) and (vegetables) into the (bowl) and (process) for (10 seconds) until (smoothly pureed) .</Paragraph> <Paragraph position="11"> These expressions will be replaced by a non-terminal unit representing the class of expressions, such that each class contains all units across the corpus that occur in the same context:
NT0 the NT1 , all the NT2 and NT3 into the NT2 and NT4 for NT5 until NT6 .</Paragraph> <Paragraph position="12"> Following this generalization, nested structures can be discovered using the same process.</Paragraph> <Paragraph position="13"> This approach has some interesting parallels with chunking techniques, most notably that of function-word phrase identification (Smith and Witten, 1993). This similarity is enforced by disallowing nested structures. 
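The segmentation step in the example above can be sketched as follows (an illustrative reconstruction, not the authors' code: the function name segment, the threshold value, and treating sentence boundaries as delimiters are assumptions, and the context-unit set is read off the worked example rather than computed from corpus frequencies):

```python
def segment(tokens, context_units, max_len=3):
    """Split a token sequence into candidate expressions: maximal runs of
    non-context units delimited by context-units (sentence boundaries also
    delimit). Runs longer than max_len are discarded, mirroring the fixed
    length threshold on alignment patterns."""
    expressions, run = [], []
    for tok in tokens:
        if tok in context_units:
            if run and len(run) <= max_len:
                expressions.append(tuple(run))
            run = []
        else:
            run.append(tok)
    if run and len(run) <= max_len:  # run closed by the sentence boundary
        expressions.append(tuple(run))
    return expressions

# Context units taken from the worked example above.
context_units = {"the", ",", "all", "and", "into", "for", "until", "."}
tokens = ("put the whole egg , all the seasonings and vegetables into the "
          "bowl and process for 10 seconds until smoothly pureed .").split()
print(segment(tokens, context_units))
```

Applied to the example sentence, this yields the bracketed expressions shown above: (put), (whole egg), (seasonings), (vegetables), (bowl), (process), (10 seconds) and (smoothly pureed); rewriting each as a non-terminal then exposes nested patterns to later iterations.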
Unlike chunking systems, however, this work will also attempt to recover nested structures by means of incremental learning.</Paragraph> <Section position="1" start_page="119" end_page="119" type="sub_section"> <SectionTitle> 4.1 Selecting alignment patterns </SectionTitle> <Paragraph position="0"> The direction process extracts a set of candidate alignments, and from this set we select the best alignment to rewrite as an equivalence class. Previous approaches offer a number of metrics for ranking constituents, based around constituent or context probability (ABL and CCM), Mutual Information (CDC), and information gain (ADIOS). We have implemented several of these metrics, but our experiences suggest that context probability is the most successful.</Paragraph> <Paragraph position="1"> The probability of an alignment is effectively the sum of all path probabilities through the alignment:</Paragraph> <Paragraph position="3"> P(alignment) = SUM over all path_{left,right} of P(path_{left,right})   (2)
where each path_{left,right} is a unique word sequence starting with left and ending with right, under the constraints on expressions described above. (Footnote 2: For clarity, we have shown all alignments for the given sentence simultaneously. However, the learning process is incremental, so each alignment would be proposed during a distinct learning iteration.)</Paragraph> <Paragraph position="4"> There is an important practical issue here: probability sums such as that in Equation 2 do not decrease when expressions are replaced with equivalence classes. To alleviate this problem, we rewrite the units when updating the distribution, but discard paths that match the current alignment. This prevents looping while allowing the rewritten paths to contribute to nested structures.</Paragraph> </Section> <Section position="2" start_page="119" end_page="119" type="sub_section"> <SectionTitle> 4.2 Generalizing expression classes </SectionTitle> <Paragraph position="0"> The model outlined above is capable of learning strictly context-sensitive constituents. 
While this does allow for nested constituents, it is problematic for generalization. Consider the following equivalence classes, which are proposed relatively early in Directed Alignment:
the NT1 of
the NT2 in
Here, the non-terminals have been assigned on the basis of context-pairs: NT1 is defined by <the [?] of> and NT2 is defined by <the [?] in>. These types are distinct, although intuitively they account for simple noun-phrases. If we then propose an alignment pattern with NT1 as Cleft, it must be followed by 'of', which removes any possibility of generalizing 'of' and 'in'.</Paragraph> <Paragraph position="1"> We alleviate this problem by generalizing equivalence classes, using a simple clustering algorithm. For each new alignment, we compare the set of expressions with all existing expression classes, ranking the comparisons by the degree of overlap with the current alignment. If this degree of overlap exceeds a fixed threshold, the type of the existing class is assumed; otherwise, a new class is created.</Paragraph> </Section> <Section position="3" start_page="119" end_page="271" type="sub_section"> <SectionTitle> 4.3 Experiments, results and analysis </SectionTitle> <Paragraph position="0"> To evaluate our algorithm, we follow the standard approach of comparing the output of our system with that of a treebank. We use the EVALB algorithm, originally designed for evaluating supervised parsing systems, with identical configuration to that of (van Zaanen, 2000). However, we apply our algorithms to a different corpus: the written sub-corpus of the International Corpus of English, Great Britain Component (henceforth ICE-GB), with punctuation removed. This consists of 438342 words, in 22815 sentences. We also include a baseline instantiation of our algorithm, which chunks text into expressions between function words, which we refer to as Function-Word Bracketing (FWB).
[Table 1: Directed Alignment applied to ICE-GB, showing both context-sensitive (DA) and clustered (DAcluster) alignment. The columns represent Unlabeled Precision, Unlabeled Recall, Unlabeled F-Score and the proportion of sentences with crossing brackets respectively.]</Paragraph> <Paragraph position="1"> Table 1 summarizes the EVALB scores for two 500-iteration runs of Directed Alignment over ICE-GB: DA is the standard context-sensitive version of the algorithm; DAcluster is the version with context clustering. FWB precision is relatively low, with only 30% of proposed structures appearing in the treebank. Recall is even lower, with only 11% of structure retrieved. This is unsurprising, as no nested constructions are considered.</Paragraph> <Paragraph position="2"> In comparison, both versions of Directed Alignment perform significantly worse, with DAcluster being only fractionally better than standard DA. Experiments over more learning iterations suggest that the performance of DA converges on FWB, with few nested constituents discovered. Both variants of the system produce very poor performance, with very little nested structure recovered. While these results seem discouraging, it is worth investigating system performance further.</Paragraph> <Paragraph position="3"> Table 2 summarizes the success of the algorithm at discovering different types of constituent. Note that these results are unlabeled, so we are examining the proportion of each type of constituent in ICE-GB that has been identified. Here, Directed Alignment exhibits the most success at identifying non-clauses, of which the primary source of success is short sentence fragments. Around 10% of noun-phrases (NP), verb-phrases (VP) and subordinate-phrases (SUBP) were recovered; this limited success reflects the nature of the constituents: all three have relatively simple constructions, whereby a single word represents the constituent. 
[Table 2: top proposed classes matching the type noun (N) in ICE-GB, ranked by recall.]</Paragraph> <Paragraph position="4"> In contrast, constituent types that comprise multiple units, such as prepositional-phrases (PP), are seldom recovered.</Paragraph> <Paragraph position="5"> During learning in DAcluster, we induce generalized classes using the expression clustering algorithm. This generalization can be evaluated, comparing induced classes with those in the treebank using precision and recall. Table 2(a) shows the top five proposed classes matching the type noun (N) in ICE-GB during 500 iterations of context-sensitive Directed Alignment. There are 271 types matching noun, and as can be seen, the top five account for a very small proportion of all nouns, some 9.46% (recall).</Paragraph> <Paragraph position="6"> Table 2(b) shows the same analysis for Directed Alignment with class generalization. For noun matches, we can see that there are far fewer proposed classes (135), and that those classes are much more probable, the top five accounting for 17.47% of nouns in ICE-GB.
[Table 4: POS contexts for NP, VP and PP constituents.]</Paragraph> <Paragraph position="7"> The algorithm seems to be achieving some worthwhile generalization, which is reflected in a slight increase in EVALB scores for DAcluster. However, this increase is not a significant one, suggesting that this generalization is not sufficient to support distributional learning. We might expect this: attempting to cluster based on the low-frequency and polysemous words in expressions seems likely to produce unreliable clusters.
5 A closer look at distributional contexts
The results discussed so far seem discouraging for the approach. However, there are good reasons why these results are so poor, and why we can expect little improvement in the current formulation. 
We can show some of these reasons by examining actual constituent-context distributions.</Paragraph> <Paragraph position="8"> Table 4 shows an analysis of the constituent types NP, VP and PP in ICE-GB, against the five most frequent POS tags occurring as left-context, constituent-start, constituent-end, and right-context. We distinguish the following POS categories as being primarily functional, as they account for the majority of context-units considered by Directed Alignment: prepositions (PREP), articles (ART), auxiliaries (AUX), sentence-starts (#STA#), pronouns (PRON), conjunctions (CONJ), particles (PTCL) and punctuation (PUNC).</Paragraph> <Paragraph position="9"> (Footnote 3: The same trends can be shown for words, but a POS analysis is preferred for clarity and brevity.)</Paragraph> <Paragraph position="10"> From Table 4, we can see that noun-phrases and verb-phrases are relatively well-suited to our approach. First, both types have strong functional left- and right-contexts: 58% of NP left-contexts and 50% of NP right-contexts are members of our functional POS; similarly, 43% of VP left-contexts and 49% of VP right-contexts are functional. This means that a probability-based model of context, such as ours, will find relatively strong support for these types. Second, both NP and VP have minimal unit types: nouns and pronouns for NP; verbs for VP. As a consequence, these types tend to carry more probability mass, since shorter sequences tend to be more frequent. We should expect our system to perform reasonably on NP and VP as a result.</Paragraph> <Paragraph position="11"> In contrast, prepositional-phrases are much less amenable to distributional analysis. First, PP tend to be longer, since they contain NP, and this has obvious repercussions for alignment probabilities.</Paragraph> <Paragraph position="12"> More damagingly, PP contexts are dominated by open-class words: the top 74% of PP left-contexts are nouns, verbs and adverbs. 
Therefore, a purely probabilistic distributional approach cannot account for prepositional-phrases, since learning data is too sparse. Previous approaches have relied upon open-class generalization to reduce this problem, but these methods suffer from the same problems of data sparsity, and as such are not reliable enough to resolve the issue.</Paragraph> </Section> </Section> <Section position="6" start_page="271" end_page="271" type="metho"> <SectionTitle> 6 Attachment </SectionTitle> <Paragraph position="0"> We have seen that strictly probabilistic distributional analysis is not sufficient to learn constituents from raw text. If we are to improve upon this, we must find a way to identify constituents from their component parts, as well as by contextual analysis. The constituent-context distributions in Table 4 give us some clues as to where to start: both noun-phrases and prepositional-phrases show very significant constituent-starts, with articles and pronouns starting 58% of NP, and prepositions starting 94% of all PP. These functional types would be identified as contexts in Directed Alignment, but the strong relation to their containing constituents would be ignored.</Paragraph> <Paragraph position="1"> One method for achieving such an internal relationship might be to attach contexts to the expressions with which they co-occur, and we propose using such a method here. However, this requires that we have some criterion for deciding when and how expressions should be attached to their contexts. We use a measure based on STOP arguments (Collins, 1999), which allows us to condition the decision to insert a constituent boundary on the evidence we see for doing so. For raw text, the only boundaries that are explicitly marked are at the start and end of sentences, and it is this information we use to decide when to attach contexts to expressions. 
In other words, if a context is likely to start a sentence, we assume it is also likely to start a constituent at other positions within a sentence.</Paragraph> <Paragraph position="2"> In order to calculate the likelihood of a particular context word w occurring at the start or end of a sentence, we simply use the bigram probabilities between w and the special symbols START and END, which denote the start and end of a sentence respectively. From these probabilities, we calculate Mutual Information MI(START,w) and MI(w,END).</Paragraph> <Paragraph position="3"> We prefer MI because it describes the strength of the relation between w and these special symbols without bias towards more probable words. From these MI values, we calculate a Directional Preference (DP) for the context word:
DP(w) = MI(w,END) - MI(START,w)</Paragraph> <Paragraph position="5"> This yields a number representing whether w is more likely to start or end a sentence. This number will be zero if we are equally likely to see w at the start or end of a sentence, negative if w is more likely to start a sentence, and positive if w is more likely to end a sentence.</Paragraph> <Paragraph position="6"> Using DP, we can decide how to attach an expression to its contexts. For a given alignment, we consider the possibility of attaching the expression to neither context, the left-context, or the right-context, by comparing the DP for the left- and right-contexts.</Paragraph> <Paragraph position="7"> (Footnote 4: For this method to work, we assume that our corpus is segmented into sentences. This is not the case for speech, but for learning from text it seems a reasonable assumption.) [Table caption fragment: categorized by gold-type.]</Paragraph> <Paragraph position="8"> If the left-context shows a strong tendency to start sentences, and the right-context does not show a strong tendency to end sentences (i.e. the overall DP is negative), we attach the expression to its left-context; if the reverse situation is true, we attach the expression to its right context. 
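The DP computation can be sketched as follows (an illustrative reconstruction, not the authors' implementation; the toy sentences and the add-one smoothing that keeps unseen bigrams finite are assumptions):

```python
import math
from collections import Counter

START, END = "<s>", "</s>"

def directional_preference(sentences, w):
    """DP(w) = MI(w, END) - MI(START, w), with MI computed from unigram and
    bigram probabilities over sentences padded with START/END symbols.
    Negative DP: w prefers sentence-initial position; positive: final."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = [START] + s.split() + [END]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def mi(x, y):
        # Add-one smoothing so that unseen bigrams yield a finite MI.
        p_xy = (bigrams[(x, y)] + 1) / (n_bi + 1)
        return math.log(p_xy / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))

    return mi(w, END) - mi(START, w)

sentences = ["the cat sat", "the dog ran", "a man saw the cat"]
print(directional_preference(sentences, "the") < 0)  # 'the' prefers sentence starts
```

On this toy corpus, 'the', which often starts and never ends a sentence, receives a negative DP, so an expression following it would be attached leftwards, matching the worked 'the NT1 of' example.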
Should the difference between these DPs fall below a threshold, neither context is preferred, and the expression remains unattached.</Paragraph> <Paragraph position="9"> Let us consider a specific example of attachment.</Paragraph> <Paragraph position="10"> The first alignment considered by the system (when applied to ICE-GB) is:
the NT1 of
Here, we need to compare the likelihood of seeing a constituent start with 'the' with the likelihood of seeing a constituent end with 'of'. Intuitively, 'the' occurs frequently at the start of a sentence, and never at the end. Consequently, it has a high negative DP. Meanwhile 'of' has a small negative DP. In combination, there is a high negative DP, so we attach the expression to the left-context, 'the'.</Paragraph> <Section position="1" start_page="271" end_page="271" type="sub_section"> <SectionTitle> 6.1 Experimental Analysis </SectionTitle> <Paragraph position="0"> We applied Directed Alignment with attachment based on STOP arguments (DASTOP) to ICE-GB as before, running for 500 iterations. These results are shown in Table 5. The results are encouraging.</Paragraph> <Paragraph position="1"> Unlabeled precision increased by almost 50%, from 23.6% for DAcluster to 33.6%. Likewise, system recall increased dramatically, from 8.1% to 14.1%, up some 75%. Crossing-brackets increased slightly, but remained relatively low at 0.42.</Paragraph> <Paragraph position="2"> Table 6 shows the breakdown of EVALB scores for the major non-terminal types, as before. The improvement in EVALB scores is attributable to a marked increase in success at identifying prepositional-phrases, with a lesser increase in noun-phrase identification.</Paragraph> </Section> <Section position="2" start_page="271" end_page="271" type="sub_section"> <SectionTitle> 6.2 Discussion </SectionTitle> <Paragraph position="0"> The attachment procedure described above is more successful at discovering nested constituents than distributional methods. 
There are good reasons why this should be the case. First, attachment compresses the corpus, removing the bias towards shorter sequences. Indeed, the algorithm seems capable of retrieving complex constituents of up to ten words in length during the first 500 iterations.</Paragraph> <Paragraph position="1"> Second, the STOP-conditioning criterion, while somewhat ad hoc in relation to distributional methods, allows us to assess where constituent boundaries are likely to occur. As such, this can be seen as a rudimentary method for establishing argument relations, such as those observed in (Klein, 2005, Chapter 6).</Paragraph> <Paragraph position="2"> Despite these improvements, the attachment process also makes some systematic mistakes. Some of these may be attributed to discrepancies between the syntactic theory used to annotate the treebank and the attachment process. For example, verbs are routinely attached to their subjects before objects, contradicting the more traditional interpretation present in treebanks. Some of the remaining mistakes can be attributed to misalignment, due to the orthographic-match problem described in Section 3.</Paragraph> </Section> </Section> <Section position="7" start_page="271" end_page="271" type="metho"> <SectionTitle> 7 Future Work </SectionTitle> <Paragraph position="0"> The major problem when applying distributional methods to raw text is that of orthographic matching, which causes misalignments between alternative senses of a particular word-form. To reduce this problem, context-units must be classified in some way to disambiguate these different senses. Such classification could be used as a precursor to alignment in the system we have described.</Paragraph> <Paragraph position="1"> In addition, to better evaluate the quality of attachment, dependency representations and treebanks could be used, which do not have an explicit order on attachment. 
This would give a more accurate evaluation where subject-verb attachment is concerned.</Paragraph> </Section> </Paper>