<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1307">
  <Title>Statistics Learning and Universal Grammar: Modeling Word Segmentation</Title>
  <Section position="3" start_page="50" end_page="51" type="metho">
    <SectionTitle>
3 The Model
</SectionTitle>
    <Paragraph position="0"> To give a precise evaluation of SL in a realistic setting, we constructed a series of (embarrassingly simple) computational models tested on child-directed English.</Paragraph>
    <Paragraph position="1"> The learning data consists of a random sample of child-directed English sentences from the CHILDES database [19] The words were then phonetically transcribed using the Carnegie Mellon Pronunciation Dictionary, and were then grouped into syllables. Spaces between words are removed; however, utterance breaks are available to the modeled learner. Altogether, there are 226,178 words, consisting of 263,660 syllables.</Paragraph>
    <Paragraph position="2"> Implementing SL-based segmentation is straightforward. One first gathers pair-wise TPs from the training data, which are used to identify local minima and postulate word boundaries in the on-line processing of syllable sequences. Scoring is done for each utterance and then averaged. Viewed as an information retrieval problem, it is customary [20] to report both precision and recall of the performance. null The segmentation results using TP local minima are remarkably poor, even under the assumption that the learner has already syllabified the input perfectly. Precision is 41.6%, and recall is 23.3%; over half of the words extracted by the model are not actual English words, while close to 80% of actual words fail to be extracted. And it is straightforward why this is the case. In order for SL to be effective, a TP at an actual word boundary must be lower than its neighbors. Obviously, this condition cannot be met if the input is a sequence of monosyllabic words, for which a space must be postulated for every syllable; there are no local minima to speak of. While the pseudowords in [8] are uniformly three-syllables long, much of child-directed English consists of sequences of monosyllabic words: corpus statistics reveals that on average, a monosyllabic word is followed by another monosyllabic word 85% of time. As long as this is the case, SL cannot, in principle, work.</Paragraph>
  </Section>
  <Section position="4" start_page="51" end_page="51" type="metho">
    <SectionTitle>
4 Statistics Needs UG
</SectionTitle>
    <Paragraph position="0"> This is not to say that SL cannot be effective for word segmentation. Its application, must be constrained-like that of any learning algorithm however powerful-as suggested by formal learning theories [1-3]. The performance improves dramatically, in fact, if the learner is equipped with even a small amount of prior knowledge about phonological structures. Specifically, we assume, uncontroversially, that each word can have only one primary stress. (This would not work for functional words, however.) If the learner knows this, then it may limit the search for local minima only in the window between two syllables that both bear primary stress, e.g., between the two a's in the sequence &amp;quot;languageacquisition&amp;quot;. This assumption is plausible given that 7.5-month-old infants are sensitive to strong/weak prosodic distinctions [14].</Paragraph>
    <Paragraph position="1"> When stress information suffices, no SL is employed, so &amp;quot;bigbadwolf&amp;quot; breaks into three words for free. Once this simple principle is built in, the stress-delimited SL algorithm can achieve the precision of 73.5% and 71.2%, which compare favorably to the best performance reported in the literature [20]. (That work, however, uses an computationally prohibitive and psychological implausible algorithm that iteratively optimizes the entire lexicon.) null The computational models complement the experimental study that prosodic information takes priority over statistical information when both are available [21]. Yet again one needs to be cautious about the improved performance, and a number of unresolved issues need to be addressed by future work. It remains possible that SL is not used at all in actual word segmentation. Once the oneword-one-stress principle is built in, we may consider a model that does not use any statistics, hence avoiding the computational cost that is likely to be considerable. (While we don't know how infants keep track of TPs, there are clearly quite some work to do. Syllables in English number in the thousands; now take the quadratic for the potential number of pair-wise TPs.) It simply stores previously extracted words in the memory to bootstrap new words. Young children's familiar segmentation errors-&amp;quot;I was have&amp;quot; from be-have, &amp;quot;hiccing up&amp;quot; from hicc-up, &amp;quot;two dults&amp;quot;, from a-dult-suggest that this process does take place. Moreover, there is evidence that 8-month-old infants can store familiar sounds in the memory [22]. And finally, there are plenty of single-word utterances-up to 10% [23]that give many words for free. The implementation of a purely symbolic learner that recycles known words yields even better performance: a precision of 81.5% and recall of 90.1%.</Paragraph>
  </Section>
class="xml-element"></Paper>