<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1009"> <Title>Name pronunciation in German text-to-speech synthesis</Title> <Section position="3" start_page="50" end_page="51" type="metho"> <SectionTitle> 3 Productive name components </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 3.1 Database </SectionTitle> <Paragraph position="0"> Our training material is based on publicly available data extracted from a phone and address directory of Germany. The database is provided on CD-ROM (D-Info, 1995). It lists all customers of Deutsche Telekom by name, street address, city, phone number, and postal code. The CD-ROM contains data retrieval and export software. The database is somewhat inconsistent: information for some fields is occasionally missing, more than one person may be listed in the name field, business information is added to the name field, and first names and street names are abbreviated. Yet, because it lists more than 30 million customer records, it provides exhaustive coverage of name-related phenomena in German.</Paragraph> </Section> <Section position="2" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 3.2 City names </SectionTitle> <Paragraph position="0"> The data retrieval software did not provide a way to export a complete list of cities, towns, and villages; thus we searched for all records listing city halls, township and municipality administrations and the like, and then exported the pertinent city names. This method yielded 3,837 city names, approximately 15% of all the cities (including urban districts) covered in the database. It is reasonable to assume, however, that this corpus provided sufficient coverage of lexical and morphological subcomponents of city names.</Paragraph> <Paragraph position="1"> We extracted graphemic substrings of different lengths from all city names. The length of the strings varied from 3 to 7 graphemes. 
Useful substrings were selected using frequency analysis (automatically) and native speaker intuition (manually). The final list of morphologically meaningful substrings consisted of 295 entries. In a recall test, these 295 strings accounted for 2,969 of the 3,837 city names in the original list, yielding a coverage of 2,969/3,837 = 77.4%.</Paragraph> </Section> <Section position="3" start_page="50" end_page="50" type="sub_section"> <SectionTitle> 3.3 First names </SectionTitle> <Paragraph position="0"> The training corpus for first names and street names was assembled based on data from the four largest cities in Germany: Berlin, Hamburg, Köln (Cologne) and München (Munich). These four cities also provide an approximately representative geographical and regional/dialectal coverage. The size and geography criteria were also applied to the selection of the test material, which was extracted from the cities of Frankfurt am Main and Dresden (see Evaluation).</Paragraph> <Paragraph position="1"> We retrieved all available first names from the records of the four cities and collected those whose frequency exceeded 100. To this corpus we added the ten most popular male and the ten most popular female names given to newborn children in the years 1995/96, in both the former East and West Germany, according to an official statistical source on the internet. The corpus also contains interesting spelling variants (Helmut/Hellmuth) as well as peculiarities attributable to regional tastes and fashions (Maik, Maia). 
The total number of first names in our list is 754.</Paragraph> <Paragraph position="2"> No attempt was made to arrive at some form of morphological decomposition despite several obvious recurring components, such as &lt;-hild&gt;, &lt;-bert&gt;, &lt;-fried&gt;; the number of these components is very small, and they are no longer productive in name-forming processes.</Paragraph> </Section> <Section position="4" start_page="50" end_page="51" type="sub_section"> <SectionTitle> 3.4 Streets </SectionTitle> <Paragraph position="0"> We retrieved all available street names from the records of the four cities. The street names were split up into their individual word-like components, i.e., a street name like Konrad-Adenauer-Platz created three separate entries: Konrad, Adenauer, and Platz. This list was then sorted and made unique.</Paragraph> <Paragraph position="1"> The type inventory of street name components was then used to collect lexically and semantically meaningful components, which we will henceforth conveniently call 'morphemes'. In analogy to the procedure for city names, these morphemes were used in a recall test on the original street name component type list. This approach was successively applied to the street name inventory of the four cities, starting with München, exploiting the result of this first round in the second city, Berlin, applying the combined result of this second round to the third city, and so on.</Paragraph> <Paragraph position="2"> Table 1 gives the numbers corresponding to the steps of the procedure just described. The number of morphemes collected from the four cities is 1,940. The selection criterion was frequency: component types occurring repeatedly within a city database were considered productive or marginally productive. 
The 1,940 morphemes recall 11,241 component types out of the total of 26,841 (or 41.9%), leaving 15,600 types (or 58.1%) that are not accounted for by the morphemes ('residuals').</Paragraph> <Paragraph position="3"> Residuals that occur in at least two out of four cities (2,008) were then added to the list of 1,940 morphemes. The reasoning behind this is that there are component types that occur exactly once in a given city but do occur in virtually every city. To give a concrete example: there is usually only one Hauptstraße ('main street') in any given city, but you almost certainly do find a Hauptstraße in every city. After some editing and data clean-up, the final list of linguistically motivated street name morphemes contained 3,124 entries.</Paragraph> </Section> </Section> <Section position="4" start_page="51" end_page="53" type="metho"> <SectionTitle> 4 Compositional model of street names </SectionTitle> <Paragraph position="0"> In this section we will present a compositional model of street names that is based on a morphological word model and also includes a phonetic syllable model. We will also describe the implementation of these models in the form of a finite-state transducer.</Paragraph> <Section position="1" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 4.1 Naming schemes for streets in German </SectionTitle> <Paragraph position="0"> Evidently, there is a finite list of lexical items that almost unambiguously mark a name as a street name; among these items are Straße, Weg, Platz, Gasse, Allee, Markt and probably a dozen more. These street name markers are used to construct street names involving persons (Stephan-Lochner-Straße, Kennedyallee), geographical places (Tübinger Allee), or objects (Chrysanthemenweg, Containerbahnhof); street names with local, regional or dialectal peculiarities (Söbendieken, Höglstieg); and finally intransparent street names (Krüsistraße, Damaschkestraße). 
Some names of the latter type may actually refer to persons' names but the origin is not transparent to the native speaker.</Paragraph> <Paragraph position="1"> [Figure 1 caption, truncated in the source: ... street name decomposition in German.]</Paragraph> </Section> <Section position="2" start_page="51" end_page="53" type="sub_section"> <SectionTitle> 4.2 Building a generative transducer for street names </SectionTitle> <Paragraph position="0"> The component types collected from the city, first name and street databases were integrated into a combined list of 4,173 productive name components: 295 from city names, 754 from first names, 3,124 from street names. Together with the basic street name markers, these components were used to construct a name analysis module. The module was implemented as a finite-state transducer using Richard Sproat's lextools (Sproat, 1995), a toolkit for creating finite-state machines from linguistic descriptions. The module is therefore compatible with the other text analysis components in the German TTS system (Möbius, 1997) that were all developed in the same FSM technology framework.</Paragraph> <Paragraph position="1"> One of the lextools, the program arclist, is particularly well suited for name analysis. The tool facilitates writing a finite-state grammar that describes words of arbitrary morphological complexity and length (Sproat, 1995). In the TTS system it is also applied to the morphological analysis of compounds and unknown words.</Paragraph> <Paragraph position="2"> Figure 1 shows parts of the arclist source file for street name decomposition. The arc which describes the transition from the initial state &quot;START&quot; to the state &quot;ROOT&quot; is labeled with ε (Epsilon, the empty string). 
The transition from &quot;ROOT&quot; to the state &quot;FIRST&quot; is defined by three large families of arcs which represent the lists of first names, productive city name components, and productive street name components, respectively, as described in the previous section.</Paragraph> <Paragraph position="3"> The transition from &quot;ROOT&quot; to &quot;FIRST&quot; which is labeled SyllModel is a placeholder for a phonetic syllable model. This syllable model reflects the phonotactics and the segmental structure of syllables in German, or rather their correlates on the orthographic surface. This allows the module to analyze substrings of names that are unaccounted for by the explicitly listed name components (see 'residuals' in the previous section) in arbitrary locations in a complex name. A detailed discussion of the syllable model is presented elsewhere (Möbius, 1997).</Paragraph> <Paragraph position="4"> From the state &quot;FIRST&quot; there is a transition back to &quot;ROOT&quot;, either directly or via the state &quot;FUGE&quot;, thereby allowing arbitrarily long concatenations of name components. Labels on the arcs to &quot;FUGE&quot; represent infixes ('Fugen') that German word formation requires as insertions between components within a compounded word in certain cases, such as Wilhelm+s+platz or Linde+n+hof. The final state &quot;END&quot; can only be reached from &quot;FIRST&quot; by way of &quot;SUFFIX&quot;. This transition is defined by a family of arcs which represents common inflectional and derivational suffixes. On termination the word is tagged with the label 'name', which can be used as part-of-speech information by other components of the TTS system.</Paragraph> <Paragraph position="5"> Most arc labels are weighted by being assigned a cost. Weights are a convenient way to describe and predict linguistic alternations. 
In general, such a description can be based on an expert's analysis of linguistic data and his or her intuition, or on statistical probabilities derived from annotated corpora. Works by Riley (Riley, 1994) and Yarowsky (Yarowsky, 1994) are examples of inferring models of linguistic alternation from large corpora. However, these methods require a database that is annotated for all relevant factors and for the levels of these factors. Despite our large raw corpus, we lack the type of database resources required by these methods.</Paragraph> <Paragraph position="6"> Thus, all weights in the text analysis components of GerTTS are currently based on linguistic intuition; they are assigned such that after integration of the name component in the general text analysis system, direct hits in the general-purpose lexicon will be less expensive than name analyses (see Discussion). No weights or costs are assigned to the most frequently occurring street name components, previously introduced as street name markers, making them more likely to be used during name decomposition. [Figure 2 caption, truncated in the source: ... street name Dachsteinhohenheckenalleenplatz.]</Paragraph> <Paragraph position="7"> The orthographic strings are annotated with symbols for primary (') and secondary (&quot;) lexical stress. The symbol {++} indicates a morpheme boundary.</Paragraph> <Paragraph position="8"> The finite-state transducer that this grammar is compiled into is far too complex to be usefully diagrammed here. For the sake of exemplification, let us instead consider the complex fictitious street name Dachsteinhohenheckenalleenplatz. Figure 2 shows the transducer corresponding to the sub-grammar that performs the decomposition of this name. The path through the graph is as follows: the arc between the initial state &quot;START&quot; and &quot;ROOT&quot; is labeled with a word boundary {##} and zero cost (0). From here we take the arc with the label d'ach and a cost of 0.2 to state &quot;FIRST&quot;. 
The next name component that can be found in the grammar is stein; we have to return to &quot;ROOT&quot; by way of an arc that is labeled with a morph boundary and a cost of 0.1. The next known component is hecke, leaving a residual string hohen which has to be analyzed by means of the syllable model. Applying the syllable model is expensive because we want to cover the name string with as many known components as possible. The costs actually vary depending upon the number of syllables in the residual string and the number of graphemes in each syllable; the string hohen would thus have to be decomposed into a root hohe and the 'Fuge' n. For the sake of simplicity we assign a flat cost of 10.0 in our toy example. In the transition between hecke and allee a 'Fuge' (n) has to be inserted. The cost of the following morph boundary is higher (0.5) than usual in order to favor components that do not require infixation.</Paragraph> <Paragraph position="9"> Another Fuge has to be inserted after allee. The cost of the last component, platz, is zero because this is one of the customary street name markers. Finally, the completely analyzed word is tagged as a name, and a word boundary is appended on the way to the final state &quot;END&quot;.</Paragraph> <Paragraph position="10"> The morphological information provided by the name analysis component is exploited by the phonological or pronunciation rules. This component of the linguistic analysis is implemented using a modified version of the Kaplan and Kay rewrite rule algorithm (Kaplan and Kay, 1994).</Paragraph> </Section> </Section> </Paper>
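<!-- Illustrative note, not part of the original paper. The cost-based path selection walked through in Section 4.2 can be approximated by the minimal Python sketch below. It is not the lextools/arclist implementation: it only mirrors the ROOT, FIRST and FUGE states as a dynamic program over string positions and omits the SUFFIX and END states. The costs for dach, the plain morph boundary, the boundary after a Fuge, and the zero-cost markers come from the text; the costs for stein and hecke, the Fuge inventory "ns", and the residual price of 2.0 per grapheme (a stand-in for the syllable model) are assumptions made for illustration, as is the function name decompose.

```python
import math

# From the worked example in the text: dach = 0.2, plain morph boundary = 0.1,
# boundary after a Fuge = 0.5, street name markers (allee, platz) = 0.
# Assumed for illustration: the costs of stein and hecke, the Fuge inventory,
# and a residual cost of 2.0 per grapheme standing in for the syllable model.
COMPONENTS = {"dach": 0.2, "stein": 0.2, "hecke": 0.2, "allee": 0.0, "platz": 0.0}
BOUNDARY = 0.1
FUGE_BOUNDARY = 0.5
FUGEN = "ns"
RESIDUAL_PER_GRAPHEME = 2.0


def decompose(name):
    """Return (cost, pieces) of the cheapest analysis of a lowercase name."""
    n = len(name)
    root = [math.inf] * (n + 1)    # cheapest cost of being in ROOT at position i
    first = [math.inf] * (n + 1)   # cheapest cost of being in FIRST at position i
    back_root = [None] * (n + 1)   # (previous FIRST position, Fuge or None)
    back_first = [None] * (n + 1)  # (previous ROOT position, piece consumed)
    root[0] = 0.0
    for i in range(n + 1):
        # FIRST to ROOT directly, paying for a plain morph boundary.
        if first[i] + BOUNDARY < root[i]:
            root[i] = first[i] + BOUNDARY
            back_root[i] = (i, None)
        # FIRST to ROOT via FUGE, consuming one linking grapheme.
        if i < n and name[i] in FUGEN and first[i] + FUGE_BOUNDARY < root[i + 1]:
            root[i + 1] = first[i] + FUGE_BOUNDARY
            back_root[i + 1] = (i, name[i])
        if root[i] == math.inf:
            continue
        # ROOT to FIRST: a listed component, else a residual priced per grapheme.
        for j in range(i + 1, n + 1):
            piece = name[i:j]
            cost = root[i] + COMPONENTS.get(piece, RESIDUAL_PER_GRAPHEME * len(piece))
            if cost < first[j]:
                first[j] = cost
                back_first[j] = (i, piece)
    if first[n] == math.inf:
        return math.inf, []
    # Walk the back-pointers from the end of the name to recover the path.
    pieces, pos = [], n
    while True:
        pos, piece = back_first[pos]
        pieces.append(piece)
        if pos == 0:
            break
        pos, fuge = back_root[pos]
        if fuge is not None:
            pieces.append("+" + fuge)
    return first[n], pieces[::-1]
```

Under these assumed costs, decompose("dachsteinhohenheckenalleenplatz") selects dach, stein, hohe +n, hecke +n, allee +n, platz at a total cost of about 10.3. Pricing the residual per grapheme rather than at a flat 10.0 keeps a single long residual from undercutting the listed components, in the spirit of the remark that the real costs depend on the size of the residual string. -->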