<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0613">
  <Title>Learning Word Meanings and Descriptive Parameter Spaces from Music</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> In the general audio domain, recent work (Slaney, 2002) links sound samples to description using the labeled descriptions on the sample sets. In the visual domain, several efforts have attempted to learn a link between language and multimedia. (Duygulu et al., 2002) studies lexicon learning over a fixed vocabulary applied to an image database, using a method similar to EM (expectation-maximization) to discover where in each image the terms (nouns) appear. (Barnard and Forsyth, 2000) outlines similar work.</Paragraph>
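The EM-style lexicon learning mentioned above can be illustrated with an IBM-Model-1-style alignment between caption words and image-region clusters ("blobs"). This is a minimal sketch on toy data, not the cited authors' implementation; the corpus and function names are hypothetical:

```python
from collections import defaultdict

def em_align(pairs, iterations=20):
    """Learn p(word | blob) by EM from (blobs, caption-words) pairs."""
    blobs = {b for bs, _ in pairs for b in bs}
    words = {w for _, ws in pairs for w in ws}
    # Uniform initialization of the translation table.
    t = {b: {w: 1.0 / len(words) for w in words} for b in blobs}
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts c(w, b)
        total = defaultdict(float)   # expected blob counts c(b)
        for bs, ws in pairs:
            for w in ws:
                # E-step: distribute each word over the blobs in its image.
                norm = sum(t[b][w] for b in bs)
                for b in bs:
                    frac = t[b][w] / norm
                    count[(w, b)] += frac
                    total[b] += frac
        # M-step: renormalize per blob.
        for b in blobs:
            for w in words:
                t[b][w] = count[(w, b)] / total[b] if total[b] else 0.0
    return t

# Toy corpus: each image is (region-cluster ids, caption words).
corpus = [
    (["sky", "grass"], ["blue", "green"]),
    (["sky", "sea"], ["blue", "blue"]),
    (["grass", "tree"], ["green", "green"]),
]
t = em_align(corpus)
best = max(t["sky"], key=t["sky"].get)  # word most associated with 'sky'
```

On this toy corpus, EM concentrates 'blue' on the regions it consistently co-occurs with, which is the essence of the lexicon-learning idea described above.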
    <Paragraph position="1"> Regier has studied the visual grounding of spatial terms across languages, finding subtle effects that depend on the relative shape, size, and orientation of objects (Regier, 1996). Work on motion verb semantics includes both procedural (action-based) representations building on Petri Net formalisms (Bailey, 1997; Narayanan, 1997) and encodings of salient perceptual features (Siskind, 2001). In (Roy, 1999), we explored aspects of learning shape and color terms, and took first steps in perceptually-grounded grammar acquisition.</Paragraph>
    <Paragraph position="2"> We refer to a word as &amp;quot;grounded&amp;quot; if we are able to determine reliable perceptual or procedural associations of the word that agree with normal usage. However, encoding single terms in isolation is only a first step in sensory-motor grounding. Lexicographers have traditionally studied lexical semantics in terms of lexical relations such as opposition, hyponymy, and meronymy (Cruse, 1986).</Paragraph>
    <Paragraph position="3"> We have made initial investigations into the perceptual grounding of lexical relations. We argue that gradations, or linguistic parameter spaces (such as fast ... slow or big ... small), are necessary to describe high-dimensional perceptual input.</Paragraph>
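One simple way to realize such a parameter space, sketched here with hypothetical two-dimensional feature vectors, is to anchor the two antonymous poles at the centroids of their example observations and project a new observation onto the axis between them:

```python
import numpy as np

def gradation_score(x, pole_a, pole_b):
    """Project feature vector x onto the axis from pole_a to pole_b.
    Returns 0.0 at pole_a's centroid and 1.0 at pole_b's centroid."""
    a, b = np.mean(pole_a, axis=0), np.mean(pole_b, axis=0)
    axis = b - a
    return float(np.dot(x - a, axis) / np.dot(axis, axis))

# Hypothetical perceptual features (e.g. tempo-like and loudness-like axes).
slow_examples = np.array([[0.1, 0.2], [0.2, 0.1]])
fast_examples = np.array([[0.9, 0.8], [0.8, 0.9]])
score = gradation_score(np.array([0.5, 0.5]), slow_examples, fast_examples)
print(score)  # midpoint of the fast...slow gradation -> 0.5
```

A scalar position along the antonym axis is exactly the kind of gradation a single isolated label cannot express.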
    <Paragraph position="4"> [Figure caption fragment: "... learning language and music models."] A bank of systems (with a distributed computing back-end connecting them) listens to multiple genres of radio streams and hones an acoustic model. When a new artist is detected from the metadata, our cultural representation crawler extracts language used to describe the artist and adds it to our language model. Concurrently, we learn relations between the music and language models to ground language terms in perception. Our first approach to this problem was in (Whitman and Rifkin, 2002), in which we learned descriptions of music through a combination of automated web crawls for artist description and analysis of the spectral content of the artists' music. The results of that work, which appear in Figure 1 and Table 1, show that we can accurately predict (well above an impossibly low baseline) a label on a held-out test set of music. We also see encouraging results in the set of terms that were accurately predicted. In effect, we can draw an imaginary line, in the form of a confidence threshold, around our results and label certain types of terms 'grounded' while others remain 'ungroundable.' In Table 1 we note that terms like 'electronic' and 'vocal,' which would appear in the underlying perceptual feature space, get high scores, while more culturally influenced terms like 'gorgeous' and 'sexy' do not do as well. We have recently extended this work (Whitman et al., 2003) by learning parameters in the same manner. Even if we know the spectral shape of 'quiet' and 'loud' (as in Figure 1), we cannot infer any connecting space between them unless we know that they are antonyms. In this work, we infer such gradation spaces through the use of a lexical knowledge base, 'grounding' such parameters through perception. As well, to capture important time-aware gradations such as 'fast ... slow,' we introduce a new machine listening representation that allows far more perceptual generality in the time domain than our previous work's single frame-based power spectral density. [Figure caption fragment: "... the musical group 'Portishead' from community metadata."] Our current platform for retrieving audio and description is shown in Figure 2. We acknowledge previous work on the computational study of adjectival scales, as in (Hatzivassiloglou and McKeown, 1993), where a system grouped gradation scales using a clustering algorithm. The polar representation of adjectives discussed in (Miller, 1990) also influenced our system.</Paragraph>
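The "imaginary line" described above amounts to thresholding each term's held-out prediction score. A minimal sketch with made-up scores (not the paper's reported numbers; the threshold value is illustrative):

```python
def split_grounded(term_scores, threshold=0.6):
    """Partition terms into 'grounded' vs 'ungroundable' by a
    confidence threshold on their held-out prediction scores."""
    grounded = {t: s for t, s in term_scores.items() if s >= threshold}
    ungroundable = {t: s for t, s in term_scores.items() if s < threshold}
    return grounded, ungroundable

# Illustrative scores only: perceptual terms score high, cultural terms low.
scores = {"electronic": 0.81, "vocal": 0.74, "gorgeous": 0.32, "sexy": 0.28}
grounded, ungroundable = split_grounded(scores)
```

The interesting modeling question, as the paragraph notes, is not the threshold itself but which kinds of terms land on each side of it.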
  </Section>
</Paper>