<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1085"> <Title>Contextual Dependencies in Unsupervised Word Segmentation</Title>
<Section position="4" start_page="673" end_page="674" type="intro"> <SectionTitle> 2 NGS and MBDP </SectionTitle>
<Paragraph position="0"> The NGS and MBDP systems are similar in some ways: both are designed to find word boundaries in a corpus of phonemically transcribed utterances, with known utterance boundaries. Both also use approximate online search procedures, choosing and fixing a segmentation for each utterance before moving on to the next. In this section, we focus on the very different probabilistic models underlying the two systems. We show that the optimal solution under the NGS model is the unsegmented corpus, and suggest that this problem stems from the fact that the model assumes a uniform prior over hypotheses. We then present the MBDP model, which uses a non-uniform prior but is difficult to extend beyond the unigram case.</Paragraph>
<Section position="1" start_page="673" end_page="673" type="sub_section"> <SectionTitle> 2.1 NGS </SectionTitle>
<Paragraph position="0"> NGS assumes that each utterance is generated independently via a standard n-gram model. For simplicity, we will discuss the unigram version of the model here, although our argument is equally applicable to the bigram and trigram versions. The unigram model generates an utterance u according to the grammar in Figure 1, so</Paragraph>
<Paragraph position="1"> P(u) = p$ (1 - p$)^{n-1} \prod_{j=1}^{n} P(w_j) </Paragraph>
<Paragraph position="2"> where u consists of the words w_1 ... w_n and p$ is the probability of the utterance boundary marker $. This model can be used to find the highest probability segmentation hypothesis h given the data d by using Bayes' rule:</Paragraph>
<Paragraph position="3"> P(h|d) \propto P(d|h) P(h) </Paragraph>
<Paragraph position="4"> NGS assumes a uniform prior P(h) over hypotheses, so its goal is to find the solution that maximizes the likelihood P(d|h).</Paragraph>
<Paragraph position="5"> Using this model, NGS's approximate search technique delivers competitive results. However, the true maximum likelihood solution is not competitive, since it contains no utterance-internal word boundaries. To see why not, consider the solution in which p$ = 1 and each utterance is a single 'word', with probability equal to the empirical probability of that utterance. Any other solution will match the empirical distribution of the data less well. In particular, a solution with additional word boundaries must have 1 - p$ > 0, which means it wastes probability mass modeling unseen data (which can now be generated by concatenating observed utterances together).</Paragraph>
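To make this argument concrete, here is a minimal sketch (ours, not part of NGS; the toy corpus, the candidate segmentation, and the use of maximum-likelihood parameter estimates for both hypotheses are all illustrative choices) comparing the unigram log likelihood of a small corpus under the unsegmented hypothesis with p$ = 1 against a hypothesis with utterance-internal word boundaries:

```python
from collections import Counter
from math import log

# Toy corpus of unsegmented utterances (invented for illustration).
corpus = ["lookatthedoggie", "lookatthedoggie", "thedoggie"]

def log_likelihood(segmented_corpus, p_boundary):
    """Log P(d|h) under the unigram model
    P(u) = p$ (1 - p$)^(n-1) * prod_j P(w_j),
    with word probabilities set to their maximum-likelihood estimates."""
    words = [w for utt in segmented_corpus for w in utt]
    counts = Counter(words)
    total = len(words)
    ll = 0.0
    for utt in segmented_corpus:
        n = len(utt)
        ll += log(p_boundary)
        if n > 1:  # guard: avoids log(0) when p$ = 1
            ll += (n - 1) * log(1 - p_boundary)
        ll += sum(log(counts[w] / total) for w in utt)
    return ll

# Hypothesis A: no utterance-internal boundaries, p$ = 1,
# each utterance is a single "word".
unsegmented = [[u] for u in corpus]

# Hypothesis B: an arbitrary segmentation with internal boundaries;
# p$ takes its maximum-likelihood value, #utterances / #word tokens.
segmented = [["look", "at", "the", "doggie"],
             ["look", "at", "the", "doggie"],
             ["the", "doggie"]]
p_b = len(corpus) / sum(len(u) for u in segmented)

print(log_likelihood(unsegmented, 1.0))  # ~ -1.91: matches empirical distribution
print(log_likelihood(segmented, p_b))    # ~ -19.8: strictly lower
```

The unsegmented hypothesis reproduces the empirical distribution over utterances exactly, so any hypothesis that reserves probability mass for unseen concatenations scores strictly lower, as the printed values show.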
<Paragraph position="6"> Intuitively, the NGS model considers the unsegmented solution to be optimal because it ranks all hypotheses equally probable a priori. We know, however, that hypotheses that memorize the input data are unlikely to generalize to unseen data, and are therefore poor solutions. To prevent memorization, we could restrict our hypothesis space to models with fewer parameters than the number of utterances in the data. A more general and mathematically satisfactory solution is to assume a non-uniform prior, assigning higher probability to hypotheses with fewer parameters. This is in fact the route taken by Brent in his MBDP model, as we shall see in the following section.</Paragraph> </Section>
<Section position="2" start_page="673" end_page="674" type="sub_section"> <SectionTitle> 2.2 MBDP </SectionTitle>
<Paragraph position="0"> MBDP assumes a corpus of utterances is generated as a single probabilistic event with four steps:
1. Generate L, the number of lexical types.
2. Generate a phonemic representation for each type (except the utterance boundary type, $).
3. Generate a token frequency for each type.
4. Generate an ordering for the set of tokens.
In a final deterministic step, the ordered tokens are concatenated to create an unsegmented corpus. This means that certain segmented corpora will produce the observed data with probability 1, and all others will produce it with probability 0. The posterior probability of a segmentation given the data is thus proportional to its prior probability under the generative model, and the best segmentation is that with the highest prior probability. There are two important points to note about the MBDP model. First, the distribution over L assigns higher probability to models with fewer lexical items. We have argued that this is necessary to avoid memorization, and indeed the unsegmented corpus is not the optimal solution under this model, as we will show in Section 3. Second, the factorization into four separate steps makes it theoretically possible to modify each step independently in order to investigate the effects of the various modeling assumptions. However, the mathematical statement of the model and the approximations necessary for the search procedure make it unclear how to modify the model in any interesting way. In particular, the fourth step uses a uniform distribution, which creates a unigram constraint that cannot easily be changed. Since our research aims to investigate the effects of different modeling assumptions on lexical acquisition, we develop in the following sections a far more flexible model that also incorporates a preference for sparse solutions.</Paragraph> </Section> </Section> </Paper>
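As a closing illustration of the four-step MBDP generative process described in Section 2.2, the following sketch (ours) samples a corpus from placeholder distributions. Only the uniform ordering in step 4 is stated in the text above; every other distribution (the priors over L, phoneme strings, and token frequencies, and the toy phoneme inventory) is an illustrative stand-in, not Brent's actual choice.

```python
import random

# Placeholder phoneme inventory (an assumption; not from the paper).
PHONEMES = "abdefghiklmnoprstuwz"

def sample_geometric(rng, p):
    """1 + number of failures before the first success (placeholder prior)."""
    k = 1
    while rng.random() > p:
        k += 1
    return k

def generate_corpus(seed=0):
    rng = random.Random(seed)

    # Step 1: generate L, the number of lexical types
    # (placeholder prior favouring small lexicons).
    L = sample_geometric(rng, p=0.2)

    # Step 2: generate a phonemic representation for each type, except
    # the utterance boundary type $ (placeholder: geometric length,
    # phonemes drawn uniformly; duplicate strings collapse into one type).
    lexicon = ["".join(rng.choice(PHONEMES)
                       for _ in range(sample_geometric(rng, p=0.4)))
               for _ in range(L)]

    # Step 3: generate a token frequency for each type, including $
    # (placeholder: geometric frequencies).
    freqs = {w: sample_geometric(rng, p=0.3) for w in lexicon}
    freqs["$"] = sample_geometric(rng, p=0.3)

    # Step 4: generate an ordering for the set of tokens (uniform, as
    # stated in the text), then concatenate deterministically; $ tokens
    # mark utterance boundaries in the resulting unsegmented corpus.
    tokens = [w for w, f in freqs.items() for _ in range(f)]
    rng.shuffle(tokens)
    utterances, current = [], []
    for tok in tokens:
        if tok == "$":
            if current:
                utterances.append("".join(current))
            current = []
        else:
            current.append(tok)
    if current:
        utterances.append("".join(current))
    return lexicon, utterances

if __name__ == "__main__":
    lexicon, utterances = generate_corpus()
    print("lexicon:   ", lexicon)
    print("utterances:", utterances)
```

The point of the sketch is structural: because the final concatenation step is deterministic, the probability a segmented corpus assigns to the observed data is either 1 or 0, so comparing segmentations reduces to comparing their prior probabilities under steps 1-4.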