<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0713">
  <Title>Unsupervised Induction of Stochastic Context-Free Grammars using Distributional Clustering</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Mutual Information
</SectionTitle>
    <Paragraph position="0"> The criterion I propose is that with real constituents, there is high mutual information between the symbol occurring before the putative constituent and the symbol after - i.e. they are not independent. Note that this is unrelated to Magerman and Marcus's MI criterion which is the (generalised) mutual information of the sequence of symbols itself. I will justify this in three ways intuitively, mathematically and empirically.</Paragraph>
    <Paragraph position="1"> Intuitively, a true constituent like a noun phrase can appear in a number of different contexts. This is one of the traditional constituent tests. A noun phrase, for example, appears frequently either as the subject or the object of a sentence. If it appears at the beginning of a sentence it is accordingly quite likely to be followed by a finite verb. If on the other hand it appears after the finite verb, it is more likely to be followed by the end of the sentence or a preposition. A spurious constituent like PRP AT0 will be followed by an N-bar regardless of where it occurs. There is therefore no relation between what happens immediatly before it, and what happens immediately after it. Thus there will be a higher dependence or correlation with the true constituent than with the erroneous one.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Mathematical Justification
</SectionTitle>
    <Paragraph position="0"> We can gain some insight into the significance of the MI criterion by analysing it within the framework of SCFGs. We are interested in looking at the properties of the two-dimensional distributions of each non-terminal. The terminals are the part of speech tags of which there are a7 . For each terminal or non-terminal symbol a8 we define four distributions, a4a10a9a11a8a13a12a15a14a17a16a18a9a11a8a19a12a15a14a15a20a6a9a11a8a13a12a15a14a17a21a22a9a11a8a19a12 , over a7 or equivalently a7 -dimensional vectors.</Paragraph>
    <Paragraph position="1"> Two of these, a16a18a9a11a8a13a12 and a20a6a9a11a8a19a12 are just the prefix and suffix probability distributions for the symbol(Stolcke, 1995): the probabilities that the string derived from a8 begins (or ends) with a particular tag. The other two a4a10a9a11a8a13a12a15a14a17a21a22a9a11a8a19a12 for left distribution and right distribution, are the distributions of the symbols before and after the nonterminal. Clearly if a8 is a terminal symbol, the strings derived from it are all of length 1, and thus begin and end with a8 , giving a16a18a9a11a8a13a12 and a20a6a9a11a8a19a12 a very simple form.</Paragraph>
    <Paragraph position="2"> If we consider each non-terminal a23 in a SCFG, we can associate with it two random variables which we can call the internal and external variables. The internal random variable is the more familiar and ranges over the set of rules expanding that non-terminal. The external random variable, a24a26a25 , is defined as the context in which the non-terminal appears. Every non-root occurrence of a non-terminal in a tree will be generated by some rule a27 , that it appears on the right hand side of. We can represent this as a9a11a27a28a14a30a29a31a12 where a27 is the rule, and a29 is the index saying where in the right hand side it occurs. The index is necessary since the same non-terminal symbol might occur more than once on the right hand side of the same rule.</Paragraph>
    <Paragraph position="3"> So for each a23 , a24a26a25 can take only those values of a9a11a27a32a14a30a29a33a12 where a23 is the a29 th symbol on the right hand side of a27 .</Paragraph>
    <Paragraph position="4"> The independence assumptions of the SCFG imply that the internal and external variables are independent, i.e. have zero mutual information.</Paragraph>
    <Paragraph position="5"> This enables us to decompose the context distribution into a linear combination of the set of marginal distributions we defined earlier.</Paragraph>
    <Paragraph position="6"> Let us examine the context distribution of all occurrences of a non-terminal a23 with a particular value of a24a26a25 . We can distinguish three situations: the non-terminal could appear at the beginning, middle or end of the right hand side. If it occurs at the beginning of a rule a27 with left hand side a8 , and the rule is a8a35a34 a23a37a36a39a38a40a38a40a38 . then the terminal symbol that appears before a23 will be distributed exactly according to the symbol that occurs before a8 , i.e. a4a41a9a42a23a19a12a44a43a45a4a10a9a11a8a13a12 . The non-terminal symbol that occurs after a23 will be distributed according to the symbol that occurs at the beginning of the symbol that occurs after a23 in the right hand side of the rule, so a21a22a9a42a23a19a12a46a43a47a16a18a9a42a36a22a12 . By the independence assumption, the joint distribution is just the product of the two marginals.</Paragraph>
    <Paragraph position="8"> The total distribution of a23 will be the normalised expectation of these three with respect to a16a18a9a71a24 a25 a12 . Each of these distributions will have zero mutual information, and the mutual information of the linear combination will be less than or equal to the entropy of the variable combining them, a77a78a9a71a24a26a25a10a12 .</Paragraph>
    <Paragraph position="9"> In particular if we have</Paragraph>
    <Paragraph position="11"> We will have equality when the context distributions are sufficiently distinct. Therefore</Paragraph>
    <Paragraph position="13"> Thus a non-terminal that appears always in the same position on the right hand side of a particular rule, will have zero MI, whereas a non-terminal that appears on the right hand side of a variety of different rules will, or rather may, have high MI.</Paragraph>
    <Paragraph position="14"> This is of limited direct utility, since we do not know which are the non-terminals and which are other strings, but this establishes some circumstances under which the approach won't work. Some of these are constraints on the form of the grammar, namely that no non-terminal can appear in just a single place on the right hand side of a single rule. Others are more substantive constraints on the sort of languages that can be learned.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experimental Verification
</SectionTitle>
    <Paragraph position="0"> To implement this, we need some way of deciding a threshhold which will divide the sheep from the goats. A simple fixed threshhold is undesirable for a number of reasons. One problem with the current approach is that the maximum likelihood estimator of the mutual information is biased, and tends to over-estimate the mutual information with sparse data (Li, 1990). A second problem is that there is a &amp;quot;natural&amp;quot; amount of mutual information present between any two symbols that are close to each other, that decreases as the symbols get further apart. Figure 1 shows a graph of how the distance between two symbols affects the MI between them. Thus if we have a sequence of length 2, the symbols before and after it will have a distance of 3, and we would expect to have a MI of 0.05. If it has more than this, we might hypothesise it as a constituent; if it has less, we discard it.</Paragraph>
    <Paragraph position="1"> In practice we want to measure the MI of the clusters, since we will have many more counts, and that will make the MI estimate more accurate. We therefore compute the weighted average of this expected MI, according to the lengths of all the sequences in the clusters, and use that as the criterion. Table 4 shows how this criterion separates valid from invalid clusters. It eliminated 55 out of 100 clusters In Table 4, we can verify this empirically: this criterion does in fact filter out the undesirable sequences. Clearly this is a powerful technique for  is greater than the expected MI, and four invalid clusters which fail the test. The four invalid clusters clearly are not constituents according to traditional criteria.</Paragraph>
    <Paragraph position="2"> identifying constituents.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Minimum Description Length
</SectionTitle>
    <Paragraph position="0"> This technique can be incorporated into a grammar induction algorithm. We use the clustering algorithm to identify sets of sequences that can be derived from a single non-terminal. The MI criterion allows us to find the right places to cut the sentences up; we look for sequences where there are interesting long-range dependencies. Given these potential sequences, we can then hypothesise sets of rules with the same right hand side. This naturally suggests a minimum description length (MDL) or Bayesian approach (Stolcke, 1994; Chen, 1995). Starting with the maximum likelihood grammar, which has one rule for each sentence type in the corpus, and a single nonterminal, at each iteration we cluster all frequent strings, and filter according to the MI criterion discussed above.</Paragraph>
    <Paragraph position="1"> We then greedily select the cluster that will give the best immediate reduction in description length, calculated according to a theoretically optimal code. We add a new non-terminal with rules for each sequence in the cluster. If there is a sequence of length 1 with a non-terminal in it, then instead of adding a new nonterminal, we add rules expanding that old nonterminal. Thus, if we have a cluster which consists of the three sequences NP, NP PRP NP and NP PRF NP we would merely add the two rules NPa34 NP PRP NP and NPa34 NP PRF NP, rather than three rules with a new non-terminal on the left hand side. This allows the algorithm to learn recursive rules, and thus context-free grammars. null We then perform a partial parse of all the sentences in the corpus, and for each sentence select the path through the chart that provides the shortest description length, using standard dynamic programming techniques. This greedy algorithm is not ideal, but appears to be unavoidable given the computational complexity. Following this, we aggregate rules with the same right hand sides and repeat the operation.</Paragraph>
    <Paragraph position="2"> Since the algorithm only considers strings whose frequency is above a fixed threshhold, the application of a rule in rewriting the corpus will often result in a large number of strings being rewritten so that they are the same, thus bringing a particular sequence above the threshhold. Then at the next iteration, this sequence will be examined by the algorithm. Thus the algorithm progressively probes deeper into the structure of the corpus as syntactic variation is removed by the partial parse of low level constituents.</Paragraph>
    <Paragraph position="3"> Singleton rules require special treatment; I have experimented with various different options, without finding an ideal solution. The results presented here use singleton rules, but they are only applied when the result is necessary for the application of a further rule. This is a natural consequence of the shortest description length choice for the partial parse: using a singleton rule increases the description length.</Paragraph>
    <Paragraph position="4"> The MDL gain is very closely related to the mutual information of the sequence itself under standard assumptions about optimal codes (Cover and Thomas, 1991). Suppose we have two symbols a80 and a82 that occur a135a137a136 and a135a137a138 times in a corpus of length a23 and that the sequence a80a69a82 occurs a135a137a136a40a138 times. We could instead create a new symbol that represents a80a139a82 , and rewrite the corpus using this abbreviation. Since we would use it a135a137a136a40a138 times, each symbol would require  which is the point-wise mutual information between a80 and a82 .</Paragraph>
    <Paragraph position="5"> I ran the algorithm for 40 iterations. Beyond this point the algorithm appeared to stop producing plausible constituents. Part of the problem is to do with sparseness: it requires a large number of samples of each string to estimate the distributions reliably.</Paragraph>
  </Section>
class="xml-element"></Paper>