<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0104">
  <Title>Automatic Acquisition of Feature-Based Phonotactic Resources</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> This paper combines two hitherto distinct areas of research, namely automata induction and typed feature theory, for the purpose of acquiring phonotactic resources for use in speech technology. In order to illustrate the methodology, a small annotated data set for Italian has been chosen1; however, given annotated data, the techniques can be applied to any language, thus supporting language documentation at the phonotactic level and eventually building up a catalogue of reusable multilingual phonotactic resources. There are numerous ways in which phonotactic information has been represented for use in speech technology applications, ranging from phrase structure rules to n-grams. In this paper, the feature-based phonotactic automaton of the Time Map model (Carson-Berndsen, 1998) is used as the representational device. A phonotactic automaton describes all permissible sound combinations of a language within the domain of a syllable in terms of a finite state automaton, covering not only actual lexicalised syllables but also idiosyncratic gaps which would be considered well-formed by a native speaker of the language. The advantage of this representation of phonotactic constraints in the context of speech recognition is that it allows out-of-vocabulary items (new words) to be classified as well-formed if they adhere to the constraints.</Paragraph>
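The idea of a phonotactic automaton as a finite-state acceptor over phone sequences can be sketched as follows. This is not the Time Map implementation; the transition table below is a purely illustrative toy fragment of onset-nucleus-coda syllable structure, with invented state names.

```python
# Minimal sketch of a phonotactic automaton: a finite-state acceptor
# over phone sequences within the syllable domain. The transition
# table is a toy, illustrative fragment (not real Italian phonotactics).

TRANSITIONS = {
    ("start", "s"): "onset1",
    ("start", "t"): "onset2",
    ("onset1", "t"): "onset2",   # s + stop cluster, e.g. "st-"
    ("onset2", "r"): "onset3",   # stop + liquid cluster, e.g. "tr-"
    ("onset2", "a"): "nucleus",
    ("onset3", "a"): "nucleus",
    ("start", "a"): "nucleus",
    ("nucleus", "n"): "coda",
}
FINAL_STATES = {"nucleus", "coda"}

def accepts(phones):
    """Return True if the phone sequence is a well-formed syllable."""
    state = "start"
    for phone in phones:
        state = TRANSITIONS.get((state, phone))
        if state is None:          # no licensed transition: ill-formed
            return False
    return state in FINAL_STATES

print(accepts(["s", "t", "r", "a"]))   # True
print(accepts(["t", "r", "a", "n"]))   # True
print(accepts(["r", "t", "a"]))        # False: no onset "rt-"
```

An out-of-vocabulary form such as ["t", "r", "a", "n"] is classified as well-formed because every transition is licensed, even if that syllable was never observed as a whole; this is the classification advantage described above.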
    <Paragraph position="1"> Furthermore, since the phonotactic automaton constrains with respect to the syllable domain, it provides a more flexible and linguistically motivated</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Multilingual European Speech Database.
</SectionTitle>
      <Paragraph position="0"> context than n-grams which restrict their context to a domain of fixed length (the n-1 preceding units).</Paragraph>
      <Paragraph position="1"> A phonotactic automaton describes language-specific constraints. Therefore, in order to develop multilingual phonotactic resources, phonotactic automata for different languages must be produced.</Paragraph>
      <Paragraph position="2"> Phonotactic automata for German and English have already been constructed for the Time Map model using manual techniques (Carson-Berndsen, 1998; Carson-Berndsen and Walsh, 2000). Since manual construction of phonotactic automata is time-consuming and laborious, more recently the focus has been on combining manual and automatic techniques in order to reduce the level of human linguistic expertise required. This will become more important as lesser-studied languages are addressed, since an expert may not always be available. The techniques presented here are regarded as support tools for language documentation which allow inferences to be made based on generalisations found in an annotated data set. The linguist is free to accept or reject the suggestions made by the system.</Paragraph>
      <Paragraph position="3"> In what follows, a technique is described in which phonotactic automata are acquired automatically from annotated data for a language. While an acquired automaton describes all forms found in the data, it cannot be considered complete, since the data is likely to be sparse (in this paper we illustrate this using a small data sample). However, by combining phonotactic automata with a typed feature classification of the sounds encountered in the data, it is possible to highlight not only distributional similarities but also phonetic similarities, which can be used to predict gaps in the representation. These predictions can be presented to a user (at least a native speaker of the language), who can accept or reject them. Accepted forms are then integrated into the phonotactic automaton.</Paragraph>
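The two steps just described, inducing transitions from annotated data and using feature similarity to propose gaps, can be sketched as below. This is a hedged simplification of the technique, not the paper's algorithm: the feature table, phone inventory, and data are invented for illustration, and "typed feature classification" is reduced to flat feature sets.

```python
# Sketch (illustrative only): induce attested phone-to-phone transitions
# from annotated syllables, then propose unattested transitions whose
# first phone shares all features with the first phone of an attested
# transition. The feature table below is a toy stand-in for a typed
# feature classification.

FEATURES = {
    "p": {"plosive", "voiceless"},
    "t": {"plosive", "voiceless"},
    "b": {"plosive", "voiced"},
    "r": {"liquid"},
    "l": {"liquid"},
}

def induce_bigrams(syllables):
    """Collect attested transitions (bigrams) from annotated syllables."""
    attested = set()
    for syl in syllables:
        for a, b in zip(syl, syl[1:]):
            attested.add((a, b))
    return attested

def predict_gaps(attested):
    """Propose (x, y) when x shares all features with some a where (a, y) is attested."""
    suggestions = set()
    for (a, b) in attested:
        for x, feats in FEATURES.items():
            if x != a and feats == FEATURES.get(a) and (x, b) not in attested:
                suggestions.add((x, b))
    return suggestions

data = [["p", "r", "a"], ["p", "l", "a"], ["t", "r", "a"]]
print(sorted(predict_gaps(induce_bigrams(data))))   # [('t', 'l')]
```

Here the cluster "pl" is attested and /t/ shares its (toy) features with /p/, so the unattested "tl" is suggested as a candidate gap, to be accepted or rejected by the linguist before being integrated into the automaton.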
    </Section>
  </Section>
</Paper>