File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/w01-0714_intro.xml
Size: 10,641 bytes
Last Modified: 2025-10-06 14:01:10
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0714"> <Title>References</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Approach </SectionTitle> <Paragraph position="0"> At the heart of any iterative grammar induction system is a method, implicit or explicit, for deciding how to update the grammar. Two linguistic criteria for constituency in natural language grammars form the basis of this work (Radford, 1988): 1. External distribution: A constituent is a sequence of words which appears in various structural positions within larger constituents.</Paragraph> <Paragraph position="1"> paper are documented in Manning and Sch&quot;utze (1999, 413). 2. Substitutability: A constituent is a sequence of words with (simple) variants which can be substituted for that sequence.</Paragraph> <Paragraph position="2"> To make use of these intuitions, we use a distributional notion of context. Let a2 be a part-of-speech tag sequence. Every occurence of a2 will be in some context a3a4a2a6a5 , where a3 and a5 are the adjacent tags or sentence boundaries. The distribution over contexts in which a2 occurs is called its signature, which we denote by a7a9a8a10a2a12a11 .</Paragraph> <Paragraph position="3"> Criterion 1 regards constituency itself. Consider the tag sequences IN DT NN and IN DT. The former is a canonical example of a constituent (of category PP), while the later, though strictly more common, is, in general, not a constituent. Frequency alone does not distinguish these two sequences, but Criterion 1 points to a distributional fact which does. In particular, IN DT NN occurs in many environments. It can follow a verb, begin a sentence, end a sentence, and so on. On the other hand, IN DT is generally followed by some kind of a noun or adjective. This example suggests that a sequence's constituency might be roughly indicated by the entropy of its signature, a13a14a8a15a7a9a8a10a2a16a11a17a11 . This turns out to be somewhat true, given a few qualifications. Figure 1 shows the actual most frequent constituents along with their rankings by several other measures. Tag entropy by itself gives a list that is not particularly impressive. There are two primary causes for this.</Paragraph> <Paragraph position="4"> One is that uncommon but possible contexts have little impact on the tag entropy value. Given the skewed distribution of short sentences in the treebank, this is somewhat of a problem. To correct for this, let a7a19a18a20a8a10a2a16a11 be the uniform distribution over the observed contexts for a2 . Using a13a21a8a15a7a20a18a22a8a10a2a16a11a17a11 would have the obvious effect of boosting rare contexts, and the more subtle effect of biasing the rankings slightly towards more common sequences. However, while a13a21a8a15a7a9a8a10a2a16a11a17a11 presumably converges to some sensible limit given infinite data, a13a14a8a15a7a19a18a22a8a10a2a16a11a17a11 will not, as noise eventually makes all or most counts non-zero. Let a23 be the uniform distribution over all contexts. The scaled entropy a13a25a24a26a8a15a7a9a8a10a2a16a11a17a11a28a27a29a13a14a8a15a7a9a8a10a2a16a11a17a11a31a30a13a21a8a15a7a20a18a32a8a10a2a16a11a17a11a17a33a34a13a21a8a35a23a36a11a38a37 turned out to be a useful quantity in practice. Multiplying entropies is not theoretically meaningful, but this quantity does converge to a13a14a8a15a7a9a8a10a2a16a11a17a11 given infinite (noisy) data. 
<Paragraph position="8"> The other fundamental problem with these entropy-based rankings stems from the context features themselves. The entropy values will change dramatically if, for example, all noun tags are collapsed, or if functional tags are split. This dependence on the tagset for constituent identification is very undesirable. One appealing way to remove this dependence is to distinguish only two tags: one for the sentence boundary (#) and another for words.</Paragraph> <Paragraph position="9"> Scaling entropies by the entropy of this reduced signature produces the improved list labeled "Boundary." This quantity was not used in practice because, although it is an excellent indicator of NP, PP, and intransitive S constituents, it gives too strong a bias against other constituents. However, neither system is driven exclusively by the entropy measure used, and duplicating the above rankings more accurately did not always lead to better end results.</Paragraph> <Paragraph position="10"> Criterion 2 regards the similarity of sequences.</Paragraph> <Paragraph position="11"> Assume the data were truly generated by a categorically unambiguous PCFG (i.e., whenever a token of a sequence is a constituent, its label is determined) and that we were given infinite data. If so, then two sequences, restricted to those occurrences where they are constituents, would have the same signatures. In practice, the data is finite, not statistically context-free, and even short sequences can be categorically ambiguous. However, it remains true that similar raw signatures indicate similar syntactic behavior. For example, DT JJ NN and DT NN have extremely similar signatures, and both are common NPs. Also, NN IN and NN NN IN have very similar signatures, and both are primarily non-constituents.</Paragraph> <Paragraph position="12"> For our experiments, the metric of similarity between sequences was the Jensen-Shannon divergence of the sequences' signatures: D_JS(σ(α), σ(β)) = 1/2 [ D_KL(σ(α) ‖ (σ(α) + σ(β))/2) + D_KL(σ(β) ‖ (σ(α) + σ(β))/2) ], where D_KL is the Kullback-Leibler divergence between probability distributions. Of course, just as various notions of context are possible, so are various metrics between signatures. The issues of tagset dependence and data skew did not seem to matter for the similarity measure, and unaltered Jensen-Shannon divergence was used.</Paragraph>
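As an illustrative companion to the divergence just defined (again not code from the paper; signatures are assumed to be dictionaries mapping contexts to counts, and the function names are hypothetical), a minimal sketch of the Jensen-Shannon computation:

    from math import log2

    def normalize(counts):
        """Turn a raw signature (context -> count) into a probability distribution."""
        total = float(sum(counts.values()))
        return {c: n / total for c, n in counts.items()}

    def kl(p, q):
        """D_KL(p || q) for distributions given as {context: probability} dicts."""
        return sum(pc * log2(pc / q[c]) for c, pc in p.items() if pc > 0)

    def js_divergence(sig_a, sig_b):
        """D_JS(p, q) = 1/2 [ D_KL(p || m) + D_KL(q || m) ], with m = (p + q) / 2."""
        p, q = normalize(sig_a), normalize(sig_b)
        m = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in set(p) | set(q)}
        return 0.5 * (kl(p, m) + kl(q, m))

Reusing the signature sketch above, js_divergence(signature(sents, ('DT', 'JJ', 'NN')), signature(sents, ('DT', 'NN'))) would then quantify how interchangeable the two sequences are, with smaller values indicating more similar distributional behavior.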
<Paragraph position="15"> Given these ideas, section 4.1 discusses a system whose grammar induction steps are guided by sequence entropy and interchangeability, and section 4.2 discusses a maximum likelihood system where the objective being maximized is the quality of the constituent/non-constituent distinction, rather than the likelihood of the sentences.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Problems with ML/MDL </SectionTitle> <Paragraph position="0"> Viewing grammar induction as a search problem, there are three principal ways in which one can induce a "bad" grammar: - Optimize the wrong objective function.</Paragraph> <Paragraph position="1"> - Choose bad initial conditions.</Paragraph> <Paragraph position="2"> - Be too sensitive to initial conditions.</Paragraph> <Paragraph position="3"> Our current systems primarily attempt to address the first two points. Common objective functions include maximum likelihood (ML), which asserts that a good grammar is one which best encodes or compresses the given data. This is potentially undesirable for two reasons. First, it is strongly data-dependent. The grammar G which maximizes P(D | G) depends on the corpus D, which, in some sense, the core of a given language's phrase structure should not. Second, and more importantly, in an ML approach, there is pressure for the symbols and rules in a PCFG to align in ways which maximize the truth of the conditional independence assumptions embodied by that PCFG. The symbols and rules of a natural language grammar, on the other hand, represent syntactically and semantically coherent units, for which a host of linguistic arguments have been made (Radford, 1988). None of these arguments have anything to do with conditional independence; traditional linguistic constituency reflects only grammatical possibility of expansion. Indeed, there are expected to be strong connections across phrases (such as are captured by argument dependencies). For example, in the treebank data used, CD CD is a common object of a verb, but a very rare subject. However, a linguist would take this as a selectional characteristic of the data set, not an indication that CD CD is not an NP. Of course, it could be that the ML and linguistic criteria align, but in practice they do not always seem to, and one should not expect that, by maximizing the former, one will also maximize the latter.</Paragraph> <Paragraph position="6"> Another common objective function is minimum description length (MDL), which asserts that a good analysis is a short one, in that the joint encoding of the grammar and the data is compact. The "compact grammar" aspect of MDL is perhaps closer to some traditional linguistic argumentation which at times has argued for minimal grammars on grounds of analytical (Harris, 1951) or cognitive (Chomsky and Halle, 1968) economy. However, some CFGs which might possibly be seen as the acquisition goal are anything but compact; take the Penn treebank covering grammar for an extreme example.
Another serious issue with MDL is that the target grammar is presumably bounded in size, while adding more and more data will on average cause MDL methods to choose ever larger grammars.</Paragraph> <Paragraph position="7"> In addition to optimizing questionable objective functions, many systems begin their search procedure from an extremely unfavorable region of the grammar space. For example, the randomly weighted grammars in Carroll and Charniak (1992) rarely converged to remotely sensible grammars. As they point out, and quite independently of whether ML is a good objective function, the EM algorithm is only locally optimal, and it seems that the space of PCFGs is riddled with numerous local maxima.</Paragraph> <Paragraph position="8"> Of course, the issue of initialization is somewhat tricky in terms of the bias given to the system; for example, Brill (1994) begins with a uniformly right-branching structure. For English, right-branching structure happens to be astonishingly good both as an initial point for grammar learning and even as a baseline parsing model. However, it would be unlikely to perform nearly as well for a VOS language like Malagasy or VSO languages like Hebrew.</Paragraph> </Section> </Section> </Paper>
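As a small illustration of the uniformly right-branching initialization/baseline mentioned above (a sketch only; the nested-tuple representation and function name are not from the paper), such a bracketing can be built mechanically with no learned structure at all:

    def right_branching(words):
        """Uniformly right-branching binary bracketing, e.g. [a, b, c] -> (a, (b, c))."""
        if len(words) <= 1:
            return words[0] if words else ()
        return (words[0], right_branching(words[1:]))

    # right_branching(["DT", "JJ", "NN", "VBD"]) -> ("DT", ("JJ", ("NN", "VBD")))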