<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0713"> <Title>Unsupervised Induction of Stochastic Context-Free Grammars using Distributional Clustering</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Distributional clustering </SectionTitle> <Paragraph position="0"> Distributional clustering has been used in many applications at the word level, but as has been noticed before (Finch et al., 1995), it can also be applied to the induction of grammars. Sets of tag sequences can be clustered together based on the contexts they appear in. In the work here I consider the context to consist of the part-of-speech tag immediately preceding the sequence and the tag immediately following it. The dependency between these two tags is critical, as we shall see; the context distribution therefore has n^2 parameters, where n is the number of tags, rather than the 2n parameters it would have under an independence assumption. The context distribution can be thought of as a distribution over a two-dimensional matrix.</Paragraph> <Paragraph position="1"> The data set for all the results in this paper consisted of 12 million words of the British National Corpus, tagged according to the CLAWS-5 tag set, with punctuation removed.</Paragraph> <Paragraph position="2"> There are 76 tags; I introduced an additional tag to mark sentence boundaries. I operate exclusively with tags, ignoring the actual words. My initial experiment clustered all of the tag sequences in the corpus that occurred more than 5000 times, of which there were 753, using the k-means algorithm with the L1-norm or city-block metric applied to the context distributions. Thus sequences of tags will end up in the same cluster if their context distributions are similar; that is to say, if they appear predominantly in similar contexts. 
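The clustering step just described can be sketched as follows. This is an illustrative reconstruction, not the author's code: the function names are hypothetical, and each sequence's context distribution is assumed to be a flattened n-by-n matrix of left-tag/right-tag probabilities. Under the L1 (city-block) metric the component-wise median, rather than the mean, minimises within-cluster distortion, so the sketch uses medians as centroids.

```python
import random

def l1(p, q):
    """City-block (L1) distance between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kmeans_l1(dists, k, iters=50, seed=0):
    """Cluster context distributions with k-means under the L1 metric.
    dists: list of flattened context distributions (lists of floats).
    Returns (centers, clusters); a cluster groups sequences whose
    context distributions are similar."""
    rng = random.Random(seed)
    centers = rng.sample(dists, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each distribution to its nearest centre under L1.
        clusters = [[] for _ in range(k)]
        for d in dists:
            j = min(range(k), key=lambda i: l1(d, centers[i]))
            clusters[j].append(d)
        # Recompute centres as component-wise medians (L1 centroid).
        new_centers = []
        for j, cl in enumerate(clusters):
            if not cl:
                new_centers.append(centers[j])  # keep an empty cluster's centre
                continue
            cols = list(zip(*cl))
            new_centers.append([sorted(c)[len(c) // 2] for c in cols])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

In practice one would run this over the 753 frequent sequences' context distributions with k = 100, as in the experiment above.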
I chose the cutoff of 5000 counts to be of the same order as the number of parameters of the distribution, and chose the number of clusters to be 100.</Paragraph> <Paragraph position="3"> To identify the frequent sequences and to calculate their distributions, I used the standard technique of suffix arrays (Gusfield, 1997), which allows rapid location of all occurrences of a desired substring.</Paragraph> <Paragraph position="4"> As expected, the results of the clustering showed clear clusters corresponding to syntactic constituents, two of which are shown in Table 1. Of course, since we are clustering all of the frequent sequences in the corpus, we will also have clusters corresponding to parts of constituents, as can be seen in Table 2. We obviously would not want to hypothesise these as constituents; we therefore need some criterion for filtering out these spurious candidates.</Paragraph> </Section> </Paper>