<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-2012">
  <Title>GraSp: Grammar learning from unlabelled speech corpora</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Learning from spoken language
</SectionTitle>
    <Paragraph position="0"> The current GraSp implementation completes a learning session in about one hour when fed with our main corpus.6 Such a session spans 2500-4000 iterations and delivers a lexicon rich
4 For perspicuity, two of the GraSp'ed categories - viz. 'can': (c2\c5)*(c5\c1) and 'we': (c2/c6)*c6 - are replaced in the table by functional equivalents.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 A caveat: Even if we do share some tools with other
</SectionTitle>
    <Paragraph position="0"> CG-based NL learning programmes, our goals are distinct, and our results do not compare easily with e.g. Kanazawa (1994), Watkinson (2000). In terms of philosophy, GraSp seems closer to connectionist approaches to NLL.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 The Danish corpus BySoc (person interviews). Size:
</SectionTitle>
    <Paragraph position="0"> 1.0 million words. Duration: 100 hours. Style: Labovian interviews. Transcription: enriched orthography.</Paragraph>
    <Paragraph position="1"> Tagging: none. Ref.: http://www.cphling.dk/BySoc
in microparadigms and microstructure. Lexical structure develops mainly around content words, while most function words retain their initial category. The structure grown is almost fractal in character, with many inter-connected categories, while the traditional large open classes - nouns, verbs, prepositions, etc. - are absent as such. The following sections present some samples from the main corpus session (Henrichsen 2000 has a detailed description).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Microparadigms
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> These four lexemes - or rather lexeme clusters - chose to co-categorize. The collection does not resemble a traditional syntactic paradigm, yet the connection is quite clear: all four items appeared in the training corpus as names of primary schools.</Paragraph>
      <Paragraph position="3">  The final categories are superficially different, but are easily seen to be functionally equivalent. The same session delivered several other microparadigms: a collection of family members (in English translation: brother, grandfather, younger-brother, stepfather, sister-in-law, etc.), a class of negative polarity items, a class of mass terms, a class of disjunctive operators, etc.</Paragraph>
      <Paragraph position="4"> (Henrichsen 2000, sect. 6.4.2).</Paragraph>
      <Paragraph position="5"> GraSp-paradigms are usually small and almost always intuitively 'natural' (not unlike the small categories of L1 learners reported by e.g. Lucariello 1985).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Microgrammars
</SectionTitle>
      <Paragraph position="0"> GraSp'ed grammar rules are generally not of the kind studied within traditional phrase structure grammar. Still, PSG-like 'islands' do occur, in the form of isolated networks of connected lexemes.</Paragraph>
      <Paragraph position="1">  Centred around the lexeme 'Pauls', a microgrammar (of street names) has evolved that is almost directly translatable into rewrite rules:7</Paragraph>
      <Paragraph position="3"/>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Idioms and locutions
</SectionTitle>
      <Paragraph position="0"> Consider the five utterances of the main corpus containing the word 'rafle' (cast-diceINF):8
det gor den der er ikke noget at rafle om der
der er ikke sa meget at rafle om
der er ikke noget og rafle om
saette sig ned og rafle lidt med fyrene der
at rafle om der
In most of its occurrences, 'rafle' takes part in the idiom &amp;quot;der er ikke noget/meget og/at rafle om&amp;quot;, often followed by a resumptive 'der' (literally: there is not anything/much and/to
7 The lexemes 'Sankt', 'Sct.', and 'Skt.' have in effect co-categorized, since it holds that (x/y)*y = x. This co-categorization is quite neat considering that GraSp is blind to the interior of lexemes. c9 and c22 are the categories of 'i' (in) and 'pa' (on).</Paragraph>
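      <Paragraph> The identity (x/y)*y = x invoked in footnote 7 is ordinary rightward categorial application. A minimal sketch, in toy notation and with category names of our own choosing (not GraSp's data structures), shows why any two lexemes assigned the same x/y behave interchangeably:

```python
def rapply(functor, argument):
    """Rightward application: (x/y) * y = x; returns None if inapplicable."""
    if isinstance(functor, tuple) and len(functor) == 3 and functor[1] == '/':
        result, _, arg = functor
        if arg == argument:
            return result
    return None

# Hypothetical categories: if 'Sankt', 'Sct.' and 'Skt.' all end up with the
# same category x/y, each combines with a following name of category y to
# yield the same result x.
sankt = ('streetname', '/', 'saint')
assert rapply(sankt, 'saint') == 'streetname'
assert rapply(sankt, 'noun') is None
```

Co-categorization of the three spelling variants thus follows from their categories alone, without any access to lexeme-internal form.</Paragraph>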
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
8 In writing, only two out of five would probably
</SectionTitle>
    <Paragraph position="0"> qualify as syntactically well-formed sentences.</Paragraph>
    <Paragraph position="1"> cast-diceINF about (there), meaning: this is not a subject of negotiations). The lexeme 'ikke' (category c8) occurs in the left context of 'rafle' more often than not, and this fact is reflected in the final category of 'rafle': rafle: ((c12\(c8\(c5\(c7\c5808))))/c7)/c42 Similarly for the lexemes 'der' (c7), 'er' (c5), 'at' (c12), and 'om' (c42), which are also present in the argument structure of the category, while the top functor is the initial 'rafle' category (c5808).</Paragraph>
    <Paragraph position="2"> The minimal context motivating the full rafle category is: ... der ... er ... ikke ... at ... rafle ... om ... der ... (&amp;quot;...&amp;quot; means that any amount and kind of material may intervene). This template is a quite accurate description of an acknowledged Danish idiom.</Paragraph>
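    <Paragraph> The template can be read as an ordering constraint with arbitrary intervening material. A direct rendering as a regular expression - our own illustration, not GraSp's matching mechanism, with the 'og/at' variation included and the resumptive 'der' left optional:

```python
import re

# "... der ... er ... ikke ... at/og ... rafle ... om ..." requires these
# words in this order; ".*" stands in for the "..." of the template.
TEMPLATE = re.compile(
    r'\bder\b.*\ber\b.*\bikke\b.*\b(?:at|og)\b.*\brafle\b.*\bom\b')

def matches_template(utterance):
    return TEMPLATE.search(utterance) is not None

assert matches_template("der er ikke noget at rafle om der")
assert matches_template("der er ikke sa meget at rafle om")
assert not matches_template("saette sig ned og rafle lidt med fyrene der")
```

Three of the five corpus utterances above satisfy this constraint, matching the claim that the idiom accounts for most occurrences of 'rafle'.</Paragraph>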
    <Paragraph position="3"> Such idioms have a specific categorial signature in the GraSp'ed lexicon: a rich but flat argument structure (i.e. analyzed solely by sR) centred around a single low-frequency functor (analyzed by sL). Further examples with the same signature:
... det ... kan ... man ... ikke ... fortaenke ... i ...
... det ... vil ... blaese ... pa ...</Paragraph>
    <Paragraph position="4"> ... ikke ... en ... kinamands ... chance ...</Paragraph>
    <Paragraph position="5"> - all well-known Danish locutions.9
There are of course plenty of simpler and faster algorithms available for extracting idioms. Most such algorithms, however, include specific knowledge about idioms (topological and morphological patterns, concepts of mutual information, heuristic and statistical rules, etc.). Our algorithm has no such inclination: it does not search for idioms, but merely finds them. Observe also that GraSp may induce idiom templates like the ones shown even from corpora without a single verbatim occurrence.</Paragraph>
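    <Paragraph> The signature can also be probed mechanically. A rough sketch, resting on an assumption of ours rather than a rule of GraSp: in the sessions described, the rare functor carries the highest category index (as c5808 does for 'rafle'), so the head can be separated from its arguments by index alone:

```python
import re

def head_and_args(cat):
    """Split a category string like '((c12\\(c8\\(c5\\(c7\\c5808))))/c7)/c42'
    into its atomic symbols. Heuristic (our assumption): the head is the
    highest-numbered atom, since fresh initial categories of low-frequency
    lexemes get high indices; all other atoms are arguments."""
    atoms = re.findall(r'c\d+', cat)
    head = max(atoms, key=lambda a: int(a[1:]))
    return head, [a for a in atoms if a != head]

head, args = head_and_args("((c12\\(c8\\(c5\\(c7\\c5808))))/c7)/c42")
assert head == "c5808"
assert set(args) == {"c5", "c7", "c8", "c12", "c42"}
```

For the 'rafle' category this recovers exactly the functor and argument lexemes listed in the text.</Paragraph>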
    <Paragraph position="6"> 9 For the entry rafle, the Danish-Danish dictionary Politiken has this paradigmatic example: &amp;quot;Der er ikke noget at rafle om&amp;quot;. The entries fortaenke, blaese, and kinamands also have examples near-identical to the learned templates.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Learning from exotic corpora
</SectionTitle>
    <Paragraph position="0"> In order to test GraSp as a general-purpose learner we have used the algorithm on a range of non-verbal data. We have had GraSp study melodic patterns in musical scores and prosodic patterns in spontaneous speech (and even the DNA structure of the banana fly). Results are not yet conclusive, but encouraging (Henrichsen 2002).</Paragraph>
    <Paragraph position="1"> When fed with HTML-formatted text, GraSp delivers a lexical patchwork of linguistic structure and HTML structure. GraSp's uncritical appetite for context-free structure makes it a candidate for intelligent web crawling. We are preparing an experiment with a large number of cloned learners to be let loose on the internet, reporting back on the structure of the documents they see. Since GraSp produces formatting definitions as output (rather than requiring them as input), the algorithm could save the www-programmer the trouble of preparing a web crawler for this or that format.</Paragraph>
    <Paragraph position="2"> Of course such experiments are side-issues.</Paragraph>
    <Paragraph position="3"> However, as discussed in the next section, learning from non-verbal sources may serve as an inspiration in the L1 learning domain also.</Paragraph>
    <Paragraph position="4"> 4 Towards a model of L1 acquisition</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Artificial language learning
</SectionTitle>
      <Paragraph position="0"> Training infants in language tasks within artificial (i.e. semantically empty) languages is an established psycholinguistic method. Infants have been shown to be able to extract structural information - e.g. rules of phonemic segmentation, prosodic contour, and even abstract grammar (Cutler 1994, Gomez 1999, Ellefson 2000) - from streams of carefully designed nonsense. Such results are an important source of inspiration for us, since the experimental conditions are relatively easy to simulate. We are conducting a series of 'retakes' with the GraSp learner in the subject's role.</Paragraph>
      <Paragraph position="1"> Below we present an example.</Paragraph>
      <Paragraph position="2"> In an often-quoted experiment, psychologist Jenny Saffran and her team had eight-month-old infants listen to continuous streams of nonsense syllables: ti, do, pa, bu, la, go, etc. Some streams were organized in three-syllable 'words' like padoti and golabu (repeated in random order) while others consisted of the same syllables in random order. After just two minutes of listening, the subjects were able to distinguish the two kinds of streams.</Paragraph>
      <Paragraph position="3"> Conclusion: Infants can learn to identify compound words on the basis of structural clues alone, in a semantic vacuum.</Paragraph>
      <Paragraph position="4"> Presented with similar streams of syllables, the GraSp learner too discovers word-hood.</Paragraph>
      <Paragraph position="5">  ... ... ...</Paragraph>
      <Paragraph position="6"> It may be objected that such streams of presegmented syllables do not represent the experimental conditions faithfully, leaping over the difficult task of segmentation. While we do not yet have a definitive answer to this objection, we observe that replacing &amp;quot;pa do ti go la bu (..)&amp;quot; by &amp;quot;p a d o t i g o l a b u (..)&amp;quot; has the GraSp learner discover syllable-hood and word-hood on a par.11</Paragraph>
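      <Paragraph> The standard statistical account of the Saffran result is transitional probability between adjacent syllables: within a word it is high, across word boundaries it is low, so dips mark boundaries. A small simulation of our own (not GraSp's calculus; 'padoti' and 'golabu' are from the text, 'bidaku' is our filler word):

```python
import random
from collections import Counter

def transition_probs(stream):
    """P(next syllable | current syllable) for each adjacent pair."""
    pairs = Counter(zip(stream, stream[1:]))
    firsts = Counter(stream[:-1])
    return {(a, b): n / firsts[a] for (a, b), n in pairs.items()}

# Three nonsense 'words' repeated in random order, as in the experiment.
words = [('pa', 'do', 'ti'), ('go', 'la', 'bu'), ('bi', 'da', 'ku')]
random.seed(0)
stream = [syl for w in random.choices(words, k=300) for syl in w]

tp = transition_probs(stream)
assert tp[('pa', 'do')] == 1.0 and tp[('do', 'ti')] == 1.0   # word-internal
assert all(p < 1.0 for (a, b), p in tp.items()
           if a in ('ti', 'bu', 'ku'))                        # across boundaries
```

Any learner sensitive to this distributional asymmetry can recover word-hood from the structured streams, while the fully random streams offer no such asymmetry.</Paragraph>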
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Naturalistic language learning
</SectionTitle>
      <Paragraph position="0"> Even if human learners can demonstrably learn structural rules without access to semantic and pragmatic cues, this is certainly not the typical L1 acquisition scenario. Our current learning model fails to reflect the natural conditions in a number of ways, being a purely syntactic calculus working on symbolic input organized in well-delimited strings. Natural learning, in contrast, draws on far richer input sources. Any model of first language acquisition must be prepared to integrate such information sources.
... head, and golabu, bu. These choices are arbitrary.
11 The very influential Eimas (1971) showed one-month-old infants to be able to distinguish /p/ and /b/. Many follow-ups have established that phonemic segmentation develops very early and may be innate.</Paragraph>
      <Paragraph position="1"> Among these, the extra-linguistic sources are perhaps the most challenging, since they introduce a syntactic-semantic interface in the model. As it seems, the formal simplicity of one-dimensional learning (cf. sect. 1.5) is at stake. If, however, semantic information (such as sensory data) could be 'syntactified' and included in the lexical structure in a principled way, single-stratum learning could be regained. We are currently working on a formal upgrading of the calculus using a framework of constructive type theory (Coquand 1988, Ranta 1994). In CTT, the radical lexicalism of categorial grammar is taken a step further, representing semantic information in the same data structure as grammatical and lexical information. This formal upgrading requires a substantial refinement of the Dis function (cf.</Paragraph>
      <Paragraph position="2"> sect. 1.3 E) as the determination of 'structural disorder' must now include contextual reasoning (cf. Henrichsen 1998). We are pursuing a design with s+ and s- as instructions to respectively insert and search for information in a CTT-style context.</Paragraph>
      <Paragraph position="3"> These formal considerations are reflections of our cognitive hypotheses. Our aim is to study learning as a radically data-driven process drawing on linguistic and extra-linguistic information sources on a par - and we should like our formal system to fit like a glove.</Paragraph>
    </Section>
  </Section>
</Paper>