<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2156">
<Title>Machine-Readable Dictionaries in Text-to-Speech Systems</Title>
<Section position="3" start_page="972" end_page="974" type="metho">
<SectionTitle> 3 Methodology and Results </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="972" end_page="972" type="sub_section">
<SectionTitle> 3.1 Collecting Data </SectionTitle>
<Paragraph position="0"> As stated above, out of almost 89,000 headwords in the dictionary, 874 phonemic pairs, representing 71% of the total, were found. This is because (a) the lookup occurs only on non-inflected words, and thus on a limited sample of the language, and (b) the dictionary consists of a list of isolated words and so does not account for inter-word boundary phenomena. Since liaison plays such an important role in the phonology of French, phonetic data from a corpus must be examined in order to achieve full coverage. A portion of the Hansard French corpus (over 2.3 million words) was used for this purpose. Grapheme-to-phoneme software [14] was used to convert French orthography into phonemes. For the sake of comparison, both the phonetic transcription from the corpus and the one from the MRD were converted into a unique set of phonemes.</Paragraph>
<Paragraph position="1"> Typical output from the dictionary looks like: ABACA [abaka] n. m.</Paragraph>
<Paragraph position="2"> ABASOURDIR [abazuRdiR]; [abasuRdiR] ABASOURDISSANT, ANTE [abazuRdisA, At] ABATTEUR, EUSE [abat8R, 7z] n.</Paragraph>
<Paragraph position="3"> ABCE'S [absE; apsE] n. m.</Paragraph>
<Paragraph position="4"> ABDOMINAL, ALE, AUX [abd>minal, el adj.</Paragraph>
<Paragraph position="5"> ABDOMINO- ABDUCTION [abdyksjO] n. 
f.</Paragraph>
<Paragraph position="6"> A small sample of the Hansard followed by the ASCII transcription is shown below: Pre'sident de la Compagnie d'Ame'nagement du barreau de.</Paragraph>
<Paragraph position="7"> X Monsieur X De'pute' ancien Ministre Pre'sident du Conseil.</Paragraph>
<Paragraph position="8"> prezidA d&amp; la kOpaNi d amenaZmA dy bare d&amp; iks m&amp;sju dis depyte Asjl ministr prezid dy kOsEj As an experiment, we compared triphones extracted from dictionary data and corpora. A greedy algorithm to locate the most common co-occurrences between orthography and transcription was run on the data sets. A sample of the corpus and dictionary results is given in the Table below. The two leftmost columns of the table show the top twenty triphones and their frequencies of occurrence extracted from the Hansard corpus, whereas the right-hand columns show the dictionary results. Notice the discrepancy between these lists: for the top twenty triphones, there are only two overlaps, sjO and jO*. The levels of commonality between the triphones of the Hansard and those of the dictionary (5% for the top 100 triphones and 15% for the top 1000 triphones) are interesting to observe.</Paragraph>
<Paragraph position="9"> The preliminary results indicate that the coarticulatory effects derived from the corpus data will be useful, in particular for languages like French where liaison plays a major role. This remains to be tested in the TTS system.</Paragraph>
</Section>
<Section position="2" start_page="972" end_page="974" type="sub_section">
<SectionTitle> 3.2 Related Work </SectionTitle>
<Paragraph position="0"> Although the statistical analysis of MRDs has focussed primarily on definitions and translations, [5] used the pronunciation field as data. 
A dictionary of over 110,000 entries containing 51,219 common words and 59,625 proper nouns [17] was used for selecting candidate units that were further utilized in the set of concatenative units (diphones, triphones, and longer units) for synthesis. The phonemic string was split according to ten language-dependent segmentation principles. For example, the word &quot;abacus&quot; ['ab-o-kos] was first transformed into cuttable units as follows: [#'a,'ab,bo,ok,ko,os,s#]. Once each dictionary word was split, the duplicates were removed and the remaining units formed the set of concatenative units. At the end of this operation, a rather long list was obtained, which was pruned by methods such as the reduction of secondary and primary stress into one stress in order to keep only one +stress/-stress distinction. Techniques were shown that allow the selection of a minimal set of word pairs for inter-word junctures; every candidate unit inside and across word sequences was included. The same strategy was replicated on the Collins Spanish-English dictionary by [6]. In this fashion, the dictionary was used as a sample of the language, in the sense that most of the phonemic combinations of the language are assumed to be present.</Paragraph>
<Paragraph position="1"> The most straightforward way, though in the long run not the most flexible, is to parse the phonetic information out of the pronunciation field. The pronunciation field information can generally be consulted by a TTS system within the grapheme-to-phoneme module. Additional rules for processes such as inter-word assimilation, juncture, and prosodic contouring need to be added, since isolated word pronunciation could already be handled by look-up table. Although this approach is appealing, it has two major drawbacks: (a) dictionary pronunciation fields are often not phonetically fine-grained enough for acceptable speech output. 
For example, the pronunciation for &quot;inquest&quot; is given in W7 as /'in-,kwest/, but of course the nasal will assimilate in place to the velar, giving /iŋ-kwest/. Without assimilation, the perceptual effect is of two words, &quot;in quest&quot;, which would be misleading. Again, the human user will assimilate naturally, but a text-to-speech system must figure out such details, since articulatory ease is not a factor in most synthesis systems. One way to solve this problem is to impose such assimilation on input from the pronunciation field by a set of post-processing rules. Although this solution would be correct in the majority of cases, blanket application of such rules is not always appropriate for lexical exceptions. For example, assimilation is optional for words like &quot;uncaring&quot;, in this case because of the morphological structure of the lexical item. A TTS system will probably already have such rules, since they are inherent in the grapheme-to-phoneme approach. Thus, it could be argued that there is no need for the dictionary pronunciation, since with a complete and comprehensive grapheme-to-phoneme conversion system, a list which requires post-processing is simply inadequate and unnecessary. This is the approach taken, for example, by [14], who makes use of small word lists (the main dictionary being 25K stored forms) and several affix tables to recognize graphemic forms, which are then transformed into phonemic representations; (b) only a small percentage of possible words are listed with pronunciations in a dictionary. For example, Webster's Seventh contains about 70,000 headwords, but is missing words like &quot;computerize&quot; and &quot;computerization&quot;, since they came into frequent use in the language after the 1963 publication date. Two solutions to this problem present themselves. 
One is to expand the word list from the dictionary to include run-ons, as illustrated in examples (3) and (4), and discussed in Section 2.3.</Paragraph>
<Paragraph position="2"> The other is to build a morphological generator, using headwords, part of speech, and other information as input, as discussed in Section 2.2, that would be invoked when the word does not figure in the headword list.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="974" end_page="974" type="metho">
<SectionTitle> 5 Final Remarks </SectionTitle>
<Paragraph position="0"> Although limitations clearly constrain the use of MRDs in TTS, we have demonstrated in this paper that it is more cost-efficient to post-process underspecified dictionary information such as inflection, pronunciation, and part-of-speech than to generate rules from scratch to arrive at the same end point. For speech synthesis, the data is not always perfect, and often must be post-processed. This paper has demonstrated ways we have successfully used dictionary data in TTS systems, ways we have post-processed data to make it more useful, and ways data cannot be easily post-processed or used.</Paragraph>
<Paragraph position="1"> Of course, for any TTS system, the power of the dictionary data can be found at the lexical, phrasal, and idiom level. Although any word list such as a dictionary is by definition closed, whereas language is open-ended, dictionary data has proven to be useful from both a theoretical and practical point of view.</Paragraph>
</Section>
</Paper>