Phoneme-to-Text Transcription System with an Infinite Vocabulary
Shinsuke Mori Daisuke Takuma Gakuto Kurata
IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd.
1623-14 Shimotsuruma Yamato-shi, 242-8502, Japan
mori@fw.ipsj.or.jp
Abstract
The noisy channel model approach has been successfully applied to various natural language processing tasks. Currently the main research focus of this approach is adaptation methods: how to capture the characteristics of words and expressions in a target domain given example sentences in that domain. As a solution we describe a method of enlarging the vocabulary of a language model to an almost infinite size and capturing the context information of all of its entries. The new method is especially suitable for languages in which words are not delimited by whitespace. We applied our method to a phoneme-to-text transcription task in Japanese and eliminated about 10% of the errors in the results of an existing method.
1 Introduction
The noisy channel model approach is being suc-
cessfully applied to various natural language pro-
cessing (NLP) tasks, such as speech recognition
(Jelinek, 1985), spelling correction (Kernighan
et al., 1990), machine translation (Brown et al.,
1990), etc. In this approach an NLP system
is composed of two modules: one is a task-
dependent part (an acoustic model for speech
recognition) which describes a relationship be-
tween an input signal sequence and a word, the
other is a language model (LM) which measures
the likelihood of a sequence of words as a sen-
tence in the language. Since the LM is a common
part, its improvement augments the accuracies of
all NLP systems based on a noisy channel model.
Recently the main focus of LM research has been shifting to adaptation methods: how to capture the characteristics of words and expressions in a target domain. The standard adaptation method is to
prepare a corpus in the application domain, count
the frequencies of words and word sequences, and
manually annotate new words with their input sig-
nal sequences to be added to the vocabulary. It is
now easy to gather machine-readable sentences in
various domains because of the ease of publication
and access via the Web (Kilgarriff and Grefen-
stette, 2003). In addition, traditional machine-
readable forms of medical reports or business re-
ports are also available. Thus, when we need to develop an NLP system in a new domain, a huge but unannotated corpus is often available.
For languages, such as Japanese and Chinese, in
which the words are not delimited by whitespace,
one encounters a word identification problem be-
fore counting the frequencies of words and word
sequences. To solve this problem one must have a
good word segmenter in the domain of the corpus.
The only robust and reliable word segmenter in the
domain is, however, a word segmenter based on
the statistics of the lexicons in the domain! Thus
we are obliged to pay a high cost for the manual
annotation of a corpus for each new subject do-
main.
In this paper, we propose a novel framework for
building an NLP system based on a noisy chan-
nel model with an almost infinite vocabulary. In
our method, first we estimate the probability of a
word boundary existing between two characters at
each point of a raw corpus in the target domain.
Using these probabilities we regard the corpus as
a stochastically segmented corpus (SSC). We then
estimate word n-gram probabilities from the SSC.
Then we build an NLP system, the phoneme-to-
text transcription system in this paper. To de-
scribe the stochastic relationship between a char-
acter sequence and its phoneme sequence, we also
propose a character-based unknown word model.
With this unknown word model and a word n-gram model estimated from the SSC, the vocabulary of our LM, a set of known words with their context information, is expanded from the words in a
small annotated corpus to an almost infinite size,
including all substrings appearing in the large cor-
pus in the target domain. In experiments, we esti-
mated LMs from a relatively small annotated cor-
pus in the general domain and a large raw corpus
in the target domain. A phoneme-to-text transcrip-
tion system based on our LM and unknown word
model eliminated about 10% of the errors in the
results of an existing method.
2 Task Complexity
In this section we explain the phoneme-to-text
transcription task which our new framework is ap-
plied to.
2.1 Phoneme-to-text Transcription
To input a sentence in a language using a device
with fewer keys than the alphabet we need some
kind of transcription system. In French stenotypy,
for example, a special keyboard with 21 keys is
used to input French letters with accents (Der-
ouault and Merialdo, 1986). A similar problem
arises when we write an e-mail in any language
with a mobile phone or a PDA. For languages
with a much larger character set, such as Chi-
nese, Japanese, and Korean, a transcription system
called an input method is indispensable for writing
on a computer (Lunde, 1998).
The task we chose for the evaluation of our method is phoneme-to-text transcription in Japanese, which can also be regarded as pseudo-speech recognition with a perfect acoustic model. In order to input Japanese to a computer, the user types phoneme sequences and the computer offers possible transcription candidates in descending order of their estimated similarity to the characters the user wants to input.¹ Then the user chooses the proper one.
2.2 Ambiguities
A phoneme sequence in Japanese (written in sans-serif font in this paper) is highly ambiguous for a computer: there are many possible word sequences with similar pronunciations. These ambiguities are mainly due to three factors:
• Homonyms: Many words share the same phoneme sequence. In the spoken language they are less ambiguous, since they are pronounced with different intonations, but intonational signals are omitted in the input of phoneme-to-text transcription.
• Lack of word boundaries: A long phoneme sequence forming a single word can also be split into several shorter words, such as frequent content words, particles, etc. (ex. a-ri-ga-to-u/thanks vs. a-ri/ant ga/is to-u/ten).
• Variations in writing: Some words have more than one acceptable spelling. For example, 振り込み/fu-ri-ko-mi/bank-transfer is often written as 振込/fu-ri-ko-mi, omitting the two verbal endings, especially in business writing.
Most of these ambiguities are not difficult to resolve for a native speaker who is familiar with the domain. So the transcription system should offer candidate word sequences appropriate to each context and domain.
¹ Generally one of the Japanese phonogram sets is used as the phoneme inventory. A phonogram is input by a combination of unambiguous ASCII characters.
2.3 Available Resources
Generally speaking, three resources are available for phoneme-to-text transcription based on the noisy channel model:
• annotated corpus: a small corpus in the general domain annotated with word boundary information and a phoneme sequence for each word
• single character dictionary: a dictionary containing all possible phoneme sequences for each single character
• raw corpus in the target domain: a collection of text samples in the target domain extracted from the Web or from documents in machine-readable form
3 Language Model and its Application
A stochastic LM $M$ is a function from a character sequence $x \in \mathcal{X}^*$ to a probability. The summation over all possible character sequences must be less than or equal to 1. This probability is used as the likelihood in the NLP system.
3.1 Word n-gram Model
The most famous LM is an n-gram model based on words. In this model, a sentence is regarded as a word sequence $w_1^h$ ($= w_1 w_2 \cdots w_h$) and the words are predicted from the beginning to the end:

  $M_{w,n}(\boldsymbol{w}) = \prod_{i=1}^{h+1} P(w_i \mid w_{i-n+1}^{i-1}),$

where $w_i$ ($i \leq 0$) and $w_{h+1}$ are a special symbol BT (boundary token). Since it is impossible to define the complete vocabulary, we prepare a special token UW for unknown words, and an unknown word spelling $x_1^{h'}$ is predicted by the following character-based n-gram model after UW is predicted by $M_{w,n}$:

  $M_{x,n}(x_1^{h'}) = \prod_{i=1}^{h'+1} P(x_i \mid x_{i-n+1}^{i-1}),$  (1)

where $x_i$ ($i \leq 0$) and $x_{h'+1}$ are the special symbol BT. Thus, when $w_i$ is outside of the vocabulary $\mathcal{W}$,

  $P(w_i \mid w_{i-n+1}^{i-1}) = M_{x,n}(w_i)\,P(\mathrm{UW} \mid w_{i-n+1}^{i-1}).$
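As a concrete illustration, the following minimal Python sketch implements the bigram (n = 2) case of this prediction rule with the UW fallback; the probability tables, the vocabulary, and the crude character model are toy stand-ins, not the models estimated in the paper.

# Bigram (n = 2) word prediction with the unknown-word fallback
# P(w_i | w_{i-1}) = M_x(w_i) * P(UW | w_{i-1}).  All tables below are
# toy stand-ins for corpus-estimated models.

BT, UW = "<BT>", "<UW>"              # boundary token and unknown-word token
VOCAB = {"the", "cat", "sat"}

P_BIGRAM = {                         # P(w_i | w_{i-1}), toy values
    (BT, "the"): 0.5, ("the", "cat"): 0.4, ("cat", "sat"): 0.3,
    ("sat", BT): 0.6, ("the", UW): 0.1, (UW, BT): 0.4,
}

def m_x(word, p_char=0.01):
    # Crude character-based unknown word model standing in for Eq. (1):
    # a uniform character probability with a fixed stop probability.
    return (p_char ** len(word)) * 0.5

def cond_prob(w, prev):
    """P(w | prev), falling back to UW for out-of-vocabulary words."""
    prev_key = prev if prev in VOCAB or prev == BT else UW
    if w in VOCAB or w == BT:
        return P_BIGRAM.get((prev_key, w), 0.0)
    return m_x(w) * P_BIGRAM.get((prev_key, UW), 0.0)

def sentence_prob(words):
    prob, prev = 1.0, BT
    for w in words + [BT]:           # predict each word and the final BT
        prob *= cond_prob(w, prev)
        prev = w
    return prob

print(sentence_prob(["the", "cat", "sat"]))  # all known words
print(sentence_prob(["the", "dog"]))         # "dog" goes through UW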
3.2 Automatic Word Segmentation
Nagata (1994) proposed a stochastic word segmenter based on a word n-gram model to solve the word segmentation problem. According to this method, the word segmenter divides a sentence $x$ into the word sequence with the highest probability:

  $\hat{w} = \arg\max_{w : w = x} M_{w,n}(w).$
Nagata (1994) reported an accuracy of about 97%
on a test corpus in the same domain using a learn-
ing corpus of 10,945 sentences in Japanese.
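The following sketch shows this argmax search as a simple dynamic program; for brevity it uses a word uni-gram model (a bigram version would carry the previous word in the DP state), and the probability table is a hypothetical toy.

# Stochastic word segmentation: split the character string x into the
# word sequence maximizing a word-based LM (uni-gram here for brevity).

P_WORD = {"th": 0.01, "the": 0.2, "cat": 0.1, "at": 0.05, "c": 0.001}

def segment(x, max_len=8):
    # best[i] holds (probability, segmentation) of the prefix x[:i]
    best = [None] * (len(x) + 1)
    best[0] = (1.0, [])
    for i in range(1, len(x) + 1):
        for j in range(max(0, i - max_len), i):
            w = x[j:i]
            if best[j] is not None and w in P_WORD:
                p = best[j][0] * P_WORD[w]
                if best[i] is None or p > best[i][0]:
                    best[i] = (p, best[j][1] + [w])
    return best[len(x)]

print(segment("thecat"))  # -> (~0.02, ['the', 'cat'])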
3.3 Phoneme-to-text Transcription
A phoneme-to-text transcription system based on an LM (Mori et al., 1999) receives a phoneme sequence $y$ and returns a list of candidate sentences $(x_1, x_2, \ldots)$ in descending order of the probability $P(x \mid y)$:

  $T(y) = (x_1, x_2, \ldots)$, where $i < j \Rightarrow P(x_i \mid y) \geq P(x_j \mid y)$.
Similar to speech recognition, the probability is decomposed into two independent parts, a pronunciation model (PM) and an LM:

  $P(x_i \mid y) \geq P(x_j \mid y)
   \Leftrightarrow \frac{P(y \mid x_i)P(x_i)}{P(y)} \geq \frac{P(y \mid x_j)P(x_j)}{P(y)}
   \Leftrightarrow P(y \mid x_i)P(x_i) \geq P(y \mid x_j)P(x_j)$  (2)

($\because$ $P(y)$ is independent of $x_i$ and $x_j$.)
In this formula $P(x)$ is an LM representing the likelihood of a sentence $x$. For the LM, we can use the word n-gram model explained above.
The other part in the above formula, $P(y \mid x)$, is a PM representing the probability that a given sentence $x$ is pronounced as $y$. Since it is impossible to collect the phoneme sequences $y$ for all possible sentences $x$, the model is decomposed into a word-based model $M_{y,w}$ in which the words are pronounced independently:

  $M_{y,w}(y \mid w) = \prod_{i=1}^{h} P(y_i \mid w_i),$  (3)

where $y_i$ is the phoneme sequence corresponding to the word $w_i$ and the condition $y = y_1^h$ is met.
The probabilities $P(y_i \mid w_i)$ are estimated from a corpus in which each word is annotated with its phoneme sequence as follows:

  $P(y_i \mid w_i) = \frac{f(y_i, w_i)}{f(w_i)},$  (4)

where $f(e)$ stands for the frequency of an event $e$
in the corpus. For unknown words no transcription
model has been proposed and the phoneme-to-text
transcription system (Mori et al., 1999) simply re-
turns the phoneme sequence itself.²
This is done
by replacing the unknown word model based on the Japanese character set, $M_{x,n}(x)$, with a model based on the phonemic alphabet, $M_{y,n}(y)$.
Thus the candidate evaluation metric of the phoneme-to-text transcription system (Mori et al., 1999), composed of the word n-gram model and the word-based pronunciation model, is as follows:

  $P(y \mid x)P(x) = \prod_{i=1}^{h} P(y_i \mid w_i)\,P(w_i),$  (5)

  $P(y_i \mid w_i)\,P(w_i) =
   \begin{cases}
     P(w_i \mid w_{i-n+1}^{i-1})\,P(y_i \mid w_i) & \text{if } w_i \in \mathcal{W}, \\
     P(\mathrm{UW} \mid w_{i-n+1}^{i-1})\,M_{y,n}(y_i) & \text{if } w_i \notin \mathcal{W}.
   \end{cases}$
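A sketch of this per-word score, with hypothetical toy tables: a known word contributes P(y_i|w_i) · P(w_i|w_{i-1}) per Equations (4) and (5), and an out-of-vocabulary word contributes P(UW|w_{i-1}) · M_{y,n}(y_i), here with a crude stand-in for the phoneme n-gram model.

# Per-candidate score of Eq. (5).  P_PRON and P_BIGRAM are toy
# stand-ins for the corpus-estimated PM and LM tables.

VOCAB = {"日"}
P_PRON = {("nichi", "日"): 0.6, ("hi", "日"): 0.3}       # Eq. (4)
P_BIGRAM = {("<BT>", "日"): 0.2, ("<BT>", "<UW>"): 0.05}

def m_y(y, p=0.05):
    return p ** len(y)        # crude stand-in for the phoneme model M_y,n

def word_score(prev, w, y):
    """P(y_i | w_i) * P(w_i | w_{i-1}) with the UW branch of Eq. (5)."""
    if w in VOCAB:
        return P_PRON.get((y, w), 0.0) * P_BIGRAM.get((prev, w), 0.0)
    return P_BIGRAM.get((prev, "<UW>"), 0.0) * m_y(y)

print(word_score("<BT>", "日", "nichi"))          # known word: 0.6 * 0.2
print(word_score("<BT>", "ニッテレ", "nittere"))  # OOV: 0.05 * 0.05**7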
4 LM Estimation from a Stochastically
Segmented Corpus (SSC)
To cope with segmentation errors, the concept of stochastic segmentation was proposed (Mori and Takuma, 2004). In this section, we briefly explain the method of calculating word n-gram probabilities on a stochastically segmented corpus in the target domain. For a detailed explanation and proofs of its mathematical soundness, please refer to the paper (Mori and Takuma, 2004).
² One of the Japanese syllabaries, katakana, is used to spell out imported words by imitating their Japanese-constrained pronunciation, and for such words the phoneme sequence itself is the correct transcription result. Mori et al. (1999) reported that approximately 33.0% of the unknown words in a test corpus were imported words.
[Figure 1: Word n-gram frequency in a stochastically segmented corpus (SSC).]
4.1 Stochastically Segmented Corpus (SSC)
A stochastically segmented corpus (SSC) is defined as a combination of a raw corpus $C_r$ (hereafter referred to as the character sequence $x_1^{n_r}$) and word boundary probabilities $P_i$ that a word boundary exists between the characters $x_i$ and $x_{i+1}$. Since there are word boundaries before the first character and after the last character of the corpus, $P_0 = P_{n_r} = 1$.
In (Mori and Takuma, 2004), the word boundary probabilities are defined as follows. First, the word boundary estimation accuracy $\alpha$ of an automatic word segmenter is calculated on a test corpus with word boundary information. Then the raw corpus is segmented by the word segmenter. Finally, $P_i$ is set to $\alpha$ for each $i$ where the word segmenter put a word boundary and to $1 - \alpha$ for each $i$ where it did not. We adopted the same method in our experiments.
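A small sketch of this assignment, where `segmented_words` is the output of the automatic segmenter on one raw sentence and `alpha` is its boundary estimation accuracy (the value 0.97 is only illustrative):

# Assign boundary probabilities to a raw sentence: alpha where the
# automatic segmenter put a boundary, 1 - alpha where it did not.

def boundary_probs(segmented_words, alpha=0.97):
    """Return [P_1, ..., P_{n-1}] for the n-character concatenation;
    P_0 = P_n = 1 are implicit (sentence edges are sure boundaries)."""
    cuts, pos = set(), 0
    for w in segmented_words[:-1]:
        pos += len(w)
        cuts.add(pos)                 # boundary after the pos-th character
    n = sum(len(w) for w in segmented_words)
    return [alpha if i in cuts else 1.0 - alpha for i in range(1, n)]

print(boundary_probs(["日テレ", "を", "見る"]))
# -> [0.03, 0.03, 0.97, 0.97, 0.03] (within float rounding)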
4.2 Word n-gram Frequency
Word n-gram frequencies on an SSC are calculated as follows:

Word 0-gram frequency: This is defined as the expected number of words in the SSC:

  $f_r(\cdot) = 1 + \sum_{i=1}^{n_r - 1} P_i.$

Word n-gram frequency ($n \geq 1$): Consider a situation (see Figure 1) in which a word sequence $w_1^n$ occurs in the SSC as a subsequence beginning at the $(i+1)$-th character and ending at the $k$-th character, and each word $w_m$ in the word sequence is equal to the character sequence beginning at the $b_m$-th character and ending at the $e_m$-th character ($x_{b_m}^{e_m} = w_m$, $1 \leq m \leq n$; $e_m + 1 = b_{m+1}$, $1 \leq m \leq n-1$; $b_1 = i+1$; $e_n = k$). The word n-gram frequency $f_r(w_1^n)$ of the word sequence in the SSC is defined as the sum of the stochastic frequencies at each occurrence of the character sequence of the word sequence $w_1^n$ over all of its occurrences in the SSC:

  $f_r(w_1^n) = \sum_{(i,\, e_1^n) \in O_n} P_i \prod_{m=1}^{n} \Big[ \prod_{j=b_m}^{e_m - 1} (1 - P_j) \Big] P_{e_m},$

where $e_1^n = (e_1, e_2, \ldots, e_n)$ and $O_n = \{(i, e_1^n) \mid x_{b_m}^{e_m} = w_m,\ 1 \leq m \leq n\}$.
4.3 Word n-gram Probability
Similar to word n-gram probability estimation from a decisively segmented corpus, word n-gram probabilities in an SSC are estimated by the maximum likelihood method as relative values of word n-gram frequencies:

  $P_r(w) = \frac{f_r(w)}{f_r(\cdot)}, \qquad
   P_r(w_n \mid w_1^{n-1}) = \frac{f_r(w_1^n)}{f_r(w_1^{n-1})} \quad (n \geq 2).$
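The definitions above translate directly into the following (unoptimized) sketch; `chars` is the raw corpus x_1^{n_r} and `P` is a list of boundary probabilities with P[0] = P[n_r] = 1, e.g. built from the segmenter output as in Subsection 4.1.

# Expected n-gram frequency f_r (Section 4.2) and the relative-frequency
# probability P_r (Section 4.3) on a stochastically segmented corpus.
# P[i] is the probability of a boundary between chars[i-1] and chars[i].

def f_r(words, chars, P):
    if not words:                               # 0-gram: expected #words
        return 1.0 + sum(P[1:len(chars)])
    s = "".join(words)
    total = 0.0
    for i in range(len(chars) - len(s) + 1):    # occurrence starts here
        if chars[i:i + len(s)] != s:
            continue
        prob, b = P[i], i + 1                   # boundary before w_1
        for w in words:
            e = b + len(w) - 1
            for j in range(b, e):               # no boundary inside w
                prob *= 1.0 - P[j]
            prob *= P[e]                        # boundary after w
            b = e + 1
        total += prob
    return total

def P_r(words, chars, P):
    """P_r(w_n | w_1^{n-1}); the uni-gram probability when n == 1."""
    return f_r(words, chars, P) / f_r(words[:-1], chars, P)

chars = "abab"
P = [1.0, 0.5, 0.5, 0.5, 1.0]
print(f_r(["ab"], chars, P))   # 1*0.5*0.5 + 0.5*0.5*1 = 0.5
print(P_r(["ab"], chars, P))   # 0.5 / (1 + 0.5 + 0.5 + 0.5) = 0.2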
5 Phoneme-to-Text Transcription with
an Infinite Vocabulary
The vocabulary of an LM estimated from an
SSC consists of all subsequences occurring in it.
By adding a module describing a stochastic relation-
ship between these subsequences and input signal
sequences, we can build a phoneme-to-text tran-
scription system equipped with an almost infinite
vocabulary.
5.1 Word Candidate Enumeration
Given a phoneme sequence as input, the dictionary of the phoneme-to-text transcription system described in Subsection 3.3 returns pairs of a word and a probability per Equation (4). Similarly, the dictionary of a phoneme-to-text system with an infinite vocabulary must be able to take a phoneme sequence $y$ and return all possible pairs of a character sequence $w$ and the probability $P(y \mid w)$ as word candidates. This is done as follows:
1. First we prepare a single character dictionary containing all characters $x$ in the language, each annotated with all of its possible phoneme sequences $Y_x = \{y_1, y_2, \ldots, y_k\}$. For example, the Japanese single character dictionary contains the character $x$ = 日 annotated with all of its possible phoneme sequences $Y_{日} = \{bi, hi, jitsu, ka, ni, nichi, nit\}$.
2. Then we build a phoneme-to-text transcription module for single characters equipped with the vocabulary consisting of the union of the phoneme sequence sets of all the characters. Given a phoneme sequence $y$, this module returns all possible character sequences $w$ along with their generation probabilities $P(y \mid w)$. For example, given the subsequence $y$ = nittere of an input phoneme sequence, this module returns a word candidate set containing 日テレ among others, along with their generation probabilities.
3. There are various methods to calculate the probability $P(y \mid w)$. The only condition is that, given $w = x_1 x_2 \cdots x_m$, $P(y \mid w)$ must be a stochastic language model (cf. Section 3) on the phoneme alphabet. In the experiments, we assumed a uniform distribution over the phoneme sequences of each character as follows:³

  $P(y \mid w) = P(y \mid x_1 x_2 \cdots x_m) = \prod_{i=1}^{m} \frac{1}{|Y_{x_i}|}.$  (6)
The module described above receives a phoneme sequence and enumerates its decompositions into subsequences contained in the single character dictionary. It is implemented using dynamic programming. In the experiments we limited the maximum length of the input to 16 phonemes.
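The enumeration can be sketched as follows; the tiny dictionary `Y` is a hypothetical fragment of the single character dictionary, and each character contributes 1/|Y_x| per Equation (6).

# Enumerate all character-sequence candidates for a phoneme sequence y
# by dynamic programming over a single character dictionary.

Y = {                       # character -> its possible phoneme strings
    "日": {"bi", "hi", "jitsu", "ka", "ni", "nichi", "nit"},
    "テ": {"te"},
    "レ": {"re"},
}
BY_READING = {}             # phoneme string -> characters so pronounced
for x, readings in Y.items():
    for r in readings:
        BY_READING.setdefault(r, []).append(x)

def enumerate_candidates(y, max_phonemes=16):
    """All (character sequence, P(y|w)) pairs for the whole input y."""
    table = [[] for _ in range(len(y) + 1)]   # table[i]: covers y[:i]
    table[0] = [("", 1.0)]
    for i in range(1, len(y) + 1):
        for j in range(max(0, i - max_phonemes), i):
            for x in BY_READING.get(y[j:i], []):
                for w, p in table[j]:
                    table[i].append((w + x, p / len(Y[x])))
    return table[len(y)]

print(enumerate_candidates("nittere"))
# -> [('日テレ', 0.14285714285714285)]  ("nit" + "te" + "re", P = 1/7)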
5.2 Modeling Contexts of Word Candidates
Word n-gram probabilities estimated from an SSC may not be as accurate as an LM estimated from a corpus segmented appropriately by hand. Thus we use the following interpolation technique:

  $P(w_i \mid h_i) = \lambda_s P_s(w_i \mid h_i) + \lambda_r P_r(w_i \mid h_i),$

where $h_i$ is the history before $w_i$, $P_s$ is the probability estimated from the segmented corpus $C_s$, and $P_r$ is the probability estimated by our method from the raw corpus $C_r$. The interpolation coefficients $\lambda_s$ and $\lambda_r$ are estimated by the deleted interpolation method (Jelinek et al., 1991).
³ More precisely, the same phoneme sequence may be generated from a character sequence in multiple ways. In this case the generation probability is calculated as the summation over all possible generations.
In the experiments, the word bi-gram model in our phoneme-to-text transcription system is combined with word bi-gram probabilities estimated from the SSC. Thus the phoneme-to-text transcription system of our new framework refers to the following LM to measure the likelihood of word sequences:

  $P(w_i) =
   \begin{cases}
     \lambda_s P_s(w_i \mid w_{i-1}) + \lambda_r P_r(w_i \mid w_{i-1}) & \text{if } w_i \in \mathcal{W}, \\
     \lambda_s P_s(\mathrm{UW} \mid w_{i-1})\,M_{x,n}(w_i) + \lambda_r P_r(w_i \mid w_{i-1}) & \text{if } w_i \notin \mathcal{W} \wedge w_i \in S_r, \\
     \lambda_s P_s(\mathrm{UW} \mid w_{i-1})\,M_{x,n}(w_i) & \text{if } w_i \notin \mathcal{W} \wedge w_i \notin S_r \ (\because P_r(w_i) = 0),
   \end{cases}$  (7)

where $S_r$ is the set of all subsequences appearing in the SSC.
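A sketch of this three-case LM; `p_s`, `p_r`, and `m_x` stand for the components above (the segmented-corpus bigram, the SSC bigram of Section 4, and the character model of Equation (1)), and the interpolation weights are placeholders for deleted-interpolation estimates.

# Three-case interpolated LM of Eq. (7).  vocab is W; ssc_substrings
# is S_r (in practice a substring test against the raw corpus).

def lm_prob(w, prev, p_s, p_r, m_x, vocab, ssc_substrings,
            lam_s=0.7, lam_r=0.3):
    if w in vocab:                                # known word
        return lam_s * p_s(w, prev) + lam_r * p_r(w, prev)
    spelled = lam_s * p_s("<UW>", prev) * m_x(w)  # spell w out, Eq. (1)
    if w in ssc_substrings:                       # OOV but seen in SSC
        return spelled + lam_r * p_r(w, prev)
    return spelled                                # truly unseen: P_r = 0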
Our LM based on Equation (7) and the existing LM (cf. Equation (5)) behave differently when they predict an out-of-vocabulary word appearing in the SSC, that is, $w_i \notin \mathcal{W} \wedge w_i \in S_r$. In this case our LM has reliable context information on the OOV word to help the system choose the proper word. Our system also clearly functions better than an LM interpolated with a word n-gram model estimated from the automatic segmentation result of the corpus when that result is a wrong segmentation. For example, when the automatic segmentation result of the sequence 日テレ (the abbreviation of a Japanese TV broadcasting corporation) has a word boundary between 日 and テ, the uni-gram probability P(日テレ) is equal to 0 and the OOV word 日テレ is never enumerated as a candidate.⁴ To the contrary, using our method P(日テレ) > 0 when the sequence 日テレ appears in the SSC at least once. Thus the sequence is enumerated as a candidate word. In addition, when the sequence appears frequently in the SSC, P(日テレ) ≫ 0 and the word may appear at a high position in the candidate list even if the automatic segmenter always wrongly segments the sequence into 日 and テレ.
⁴ Two word fragments 日 and テレ may be enumerated as word candidates. The notion of word may be necessary for the user's convenience; however, we do not discuss the necessity of the notion of word in the phoneme-to-text transcription system.
5.3 Default Character for Phoneme
In very rare cases, the input phoneme sequence cannot be decomposed into phoneme sequences in the vocabulary or those corresponding to subsequences of the SSC, and as a result the transcription system does not output any candidate sentence. To avoid this situation, we prepare a default character for every phoneme, and the transcription system also enumerates the default character for each phoneme. In Japanese, from the viewpoint of transcription accuracy, it is better to set the default characters to katakana, which are used mainly for the transliteration of imported words. Since a katakana character is pronounced uniquely ($|Y_{x_i}| = 1$),

  $P(y \mid w) = P(y \mid x_1 x_2 \cdots x_m) = 1.$  (8)
From Equations (4), (6), and (8), the PM of our transcription system is as follows:

  $P(y_i \mid w_i) =
   \begin{cases}
     \frac{f(y_i, w_i)}{f(w_i)} & \text{if } w_i \in \mathcal{W}, \\
     \prod_{j=1}^{m} \frac{1}{|Y_{x_j}|} & \text{if } w_i \notin \mathcal{W} \wedge w_i \in S_r, \\
     1 & \text{if } w_i \notin \mathcal{W} \wedge w_i \notin S_r,
   \end{cases}$  (9)

where $w_i = x_1 x_2 \cdots x_m$.
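The PM can be sketched in the same three-case shape; `f` stands for the annotated-corpus counts of Equation (4) and `Y` for the single character dictionary of Subsection 5.1, both assumed to be given.

# Three-case PM of Eq. (9) for a candidate word w with phonemes y.

def pm_prob(y, w, f, Y, vocab, ssc_substrings):
    if w in vocab:                      # Eq. (4): annotated-corpus counts
        return f((y, w)) / f(w)
    if w in ssc_substrings:             # Eq. (6): uniform over readings
        p = 1.0
        for x in w:                     # w = x_1 x_2 ... x_m
            p /= len(Y[x])
        return p
    return 1.0                          # Eq. (8): katakana default character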
5.4 Phoneme-to-Text Transcription with an
Infinite Vocabulary
Finally, the transcription system with an infinite vocabulary enumerates candidate sentences $x = w_1 w_2 \cdots w_h$ in descending order of the value of the following evaluation function, composed of the LM $P(w_i)$ defined by Equation (7) and the PM $P(y_i \mid w_i)$ defined by Equation (9):

  $P(y \mid x)P(x) = \prod_{i=1}^{h} P(y_i \mid w_i)\,P(w_i).$

Note that there are only three cases in total, since the case decompositions in Equation (7) and Equation (9) are identical.
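Putting the pieces together, the search can be sketched as a dynamic program over the phoneme sequence; `candidates(y_sub)` is assumed to return (word, PM value) pairs covering the three cases above (e.g. combining the annotated-corpus dictionary with `enumerate_candidates` and the katakana defaults), and `lm` is the LM of Equation (7). For brevity this keeps only one best hypothesis per position, which is an approximation: an exact bigram search would index the DP state by the previous word as well.

# Approximate Viterbi search for the best sentence under
# prod_i P(y_i | w_i) * P(w_i), cf. Eqs. (7) and (9).

def transcribe(y, candidates, lm, max_phonemes=16):
    n = len(y)
    best = [None] * (n + 1)          # best[i] = (prob, words) for y[:i]
    best[0] = (1.0, ["<BT>"])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_phonemes), i):
            if best[j] is None:
                continue
            prob_j, words_j = best[j]
            for w, pm in candidates(y[j:i]):
                p = prob_j * pm * lm(w, words_j[-1])
                if best[i] is None or p > best[i][0]:
                    best[i] = (p, words_j + [w])
    return None if best[n] is None else best[n][1][1:]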
6 Evaluation
As an evaluation of our phoneme-to-text transcrip-
tion system, we measured transcription accuracies
of several systems on test corpora in two domains: a general domain, in which we have a small corpus annotated with word boundary information and the phoneme sequence of each word, and a target domain, in which only a large raw corpus is available. As the transcription result, we took the word sequence with the highest probability.
In this section we show the results and evaluate
our new framework.
Table 1: Annotated corpus in general domain
#sentences #words #chars
learning 20,808 406,021 598,264
test 2,311 45,180 66,874
Table 2: Raw corpus in the target domain
#sentences #words #chars
learning 797,345 — 17,645,920
test 1,000 — 20,935
6.1 Conditions on the Experiments
The segmented corpus used in our experiments is
composed of articles extracted from newspapers
and example sentences in a dictionary of daily
conversation. Each sentence in the corpus is seg-
mented into words and each word is annotated
with a phoneme sequence. The corpus was di-
vided into ten parts. The parameters of the model
were estimated from nine of them (learning) and
the model was tested on the remaining one (test).
Table 1 shows the corpus size. Another corpus
we used in the experiments is composed of daily
business reports. This corpus is annotated neither with word boundary information nor with a phoneme sequence for each word. For evaluation, we randomly selected 1,000 sentences and annotated them with phoneme sequences to be used as a test set. The rest was used for LM estimation (see Table 2).
6.2 Evaluation Criterion
The criterion we used for transcription systems is
precision and recall based on the number of char-
acters in the longest common subsequence (LCS)
(Aho, 1990). Let $N_{\mathrm{COR}}$ be the number of characters in the correct sentence, $N_{\mathrm{SYS}}$ that in the output of a system, and $N_{\mathrm{LCS}}$ that of the LCS of the correct sentence and the system output; then the recall is defined as $N_{\mathrm{LCS}}/N_{\mathrm{COR}}$ and the precision as $N_{\mathrm{LCS}}/N_{\mathrm{SYS}}$.
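For concreteness, this criterion can be computed as follows (a standard LCS dynamic program; the example strings are illustrative only):

# Character-level precision and recall based on the longest common
# subsequence (LCS) of the reference sentence and the system output.

def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if ca == cb
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def precision_recall(reference, output):
    n_lcs = lcs_len(reference, output)
    return n_lcs / len(output), n_lcs / len(reference)

print(precision_recall("日テレを見る", "日テレで見る"))  # (5/6, 5/6)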
6.3 Models for Comparison
In order to clarify the difference in the usages of
the target domain corpus, we built four transcrip-
tion systems and compared their accuracies. Be-
low we explain the models in detail.
Model B: Baseline
A word bi-gram model built from the segmented general-domain corpus.
Table 3: Phoneme-to-text transcription accuracy.
Model | word bi-gram from the annotated corpus | raw corpus usage | unknown word model | General domain (Prec. / Rec.) | Target domain (Prec. / Rec.)
B | Yes | No | No | 89.80% / 92.30% | 68.62% / 78.40%
D | Yes | Auto. Seg. | No | 92.67% / 93.42% | 80.59% / 86.19%
D′ | Yes | Auto. Seg. | Yes | 92.52% / 93.17% | 90.35% / 93.48%
S | Yes | Stoch. Seg. | Yes | 92.78% / 93.40% | 91.10% / 94.09%
The vocabulary contains the 10,728 words appearing in more than one of the nine learning corpora. The automatic word segmenter used to build the other three models is based on the method explained in Section 3 with this LM.
Model D: Decisive segmentation
A word bi-gram model estimated from the automatic segmentation result of the target corpus, interpolated with Model B.
Model D′: Decisive segmentation with an unknown word model
Model D extended with our PM for unknown words.
Model S: Stochastic segmentation
A word bi-gram model estimated from the SSC in the target domain, interpolated with Model B and equipped with our PM for unknown words.
6.4 Evaluation
Table 3 shows the transcription accuracies of the models. A comparison of the target-domain accuracies of Model B and Model D confirms the well-known fact that even an automatic segmentation result containing errors helps an LM improve its performance. The accuracy of Model D in the general domain is also higher than that of Model B. From this result we can say that over-adaptation has not occurred.
Model D′, equipped with our PM for unknown words, is a natural extension of Model D, a model based on an existing method. The accuracy of Model D′ is higher than that of Model D in the target domain, but worse in the general domain. This is because the vocabulary of Model D′ is enlarged with the words and word fragments contained in the automatic segmentation result. Though no study has been reported on the method of Model D′, below we take Model D′ as the existing method for a more severe evaluation.
Comparing the accuracies of Model D′ and Model S in both domains, it can be said that with our method we can build a more accurate model than with the existing methods. The main reason is that
Table 4: Relationship between the raw corpus size and the accuracies.
Raw corpus size | Precision | Recall
1.765×10^5 chars (1/100) | 89.18% | 92.32%
1.765×10^6 chars (1/10) | 90.33% | 93.40%
1.765×10^7 chars (1/1) | 91.10% | 94.09%
our PM is able to enumerate transcription candidates for out-of-vocabulary words and the word n-gram probabilities estimated from the SSC help the model choose the appropriate ones.
A detailed study of Table 3 tells us that the reduction rate of the character error rate (100% − recall) achieved by Model S in the target domain (9.36%) is much larger than that in the general domain (3.37%). The reason is that the automatic word segmenter tends to make mistakes around words and expressions characteristic of the target domain, and our method is much less influenced by such segmentation errors than the existing method is.
In order to clarify the relationship between the size of the SSC and the transcription accuracy, we calculated the accuracies while changing the size of the SSC (1/1, 1/10, 1/100). The results, shown in Table 4, indicate that we can still achieve further improvements simply by gathering more example sentences in the target domain.
The main difference between the models is the LM part, so the accuracy increase comes from the LM improvements. This fact indicates that we can expect similar improvements in other generative NLP systems based on the noisy channel model by expanding the LM vocabulary, with context information, to an almost infinite size.
7 Related Work
The well-known methods for the unknown word
problem are classified into two groups: one is to
use an unknown word model and the other is to
extract word candidates from a corpus before the
application. Below we describe the relationship between these methods and the proposed method.
In the method using an unknown word model,
first the generation probability of an unknown
word is modeled by a character D2-gram, and then
an NLP system, such as a morphological analyzer,
searches for the best solution considering the pos-
sibility that all subsequences might be unknown
words (Nagata, 1994; Bazzi and Glass, 2000).
In the same way, we can build a phoneme-to-
text transcription system which can enumerate un-
known word candidates, but the LM is not able to
refer to lexical context information to choose the
appropriate word, since the unknown words are
modeled to be generated from a single state. We
solved this problem by allowing the LM to refer to
information from an SSC.
When a machine-readable corpus in the target domain is available, we can extract word candidates from the corpus with a certain criterion and use them in the application. An advantage of this method is that all of the occurrences of each candidate in the corpus are considered. Nagata (1996) proposed a method of calculating word candidates with their uni-gram frequencies using a forward-backward algorithm, and reported that the accuracy of a morphological analyzer can be improved by adding the extracted words to its vocabulary.
Comparing our method with this research, it can
be said that our method executes the word can-
didate enumeration and their context calculation
dynamically at the time of the solution search for
an NLP task, phoneme-to-text transcription here.
One of the advantages of our framework is that the system considers all substrings in the corpus as word candidates (that is, the recall of the word extraction is 100%), and a higher accuracy is expected because a consistent criterion, namely the generation probability, is used for both the word candidate enumeration process and the solution search process.
The framework we propose in this paper, en-
larging the vocabulary to an almost infinite size,
is general and applicable to many other NLP sys-
tems based on the noisy channel model, such as
speech recognition, statistical machine translation,
etc. Our framework is potentially capable of im-
proving the accuracies in these tasks as well.
8 Conclusion
In this paper we proposed a generative NLP sys-
tem with an almost infinite vocabulary for lan-
guages without obvious word boundary informa-
tion in written texts. In the experiments we com-
pared four phoneme-to-text transcription systems
in Japanese. The transcription system equipped
with an infinite vocabulary showed a higher accu-
racy than the baseline model and the model based
on the existing method. These results show the
efficacy of our method and tell us that our ap-
proach is promising for the phoneme-to-text tran-
scription task or other NLP systems based on the
noisy channel model.
References
Alfred V. Aho. 1990. Algorithms for finding pat-
terns in strings. In Handbook of Theoretical Com-
puter Science, volume A: Algorithms and Complex-
ity, pages 273–278. Elsevier Science Publishers.
Issam Bazzi and James R. Glass. 2000. Modeling out-
of-vocabulary words for robust speech recognition.
In Proc. of the ICSLP2000.
Peter F. Brown, John Cocke, Stephen A. Della Pietra,
Vincent J. Della Pietra, Frederick Jelinek, John D.
Lafferty, Robert L. Mercer, and Paul S. Roossin.
1990. A statistical approach to machine translation.
Computational Linguistics, 16(2):79–85.
Anne-Marie Derouault and Bernard Merialdo. 1986.
Natural language modeling for phoneme-to-text
transcription. IEEE PAMI, 8(6):742–749.
Frederick Jelinek, Robert L. Mercer, and Salim
Roukos. 1991. Principles of lexical language
modeling for speech recognition. In Advances in
Speech Signal Processing, chapter 21, pages 651–
699. Dekker.
Frederick Jelinek. 1985. Self-organized language
modeling for speech recognition. Technical report,
IBM T. J. Watson Research Center.
Mark D. Kernighan, Kenneth W. Church, and
William A. Gale. 1990. A spelling correction pro-
gram based on a noisy channel model. In Proc. of
the COLING90, pages 205–210.
Adam Kilgarriff and Gregory Grefenstette. 2003. In-
troduction to the special issue on the web as corpus.
Computational Linguistics, 29(3):333–347.
Ken Lunde. 1998. CJKV Information Processing.
O’Reilly & Associates.
Shinsuke Mori and Daisuke Takuma. 2004. Word
n-gram probability estimation from a Japanese raw
corpus. In Proc. of the ICSLP2004.
Shinsuke Mori, Masatoshi Tsuchiya, Osamu Yamaji, and Makoto Nagao. 1999. Kana-kanji conversion by a stochastic model. Transactions of IPSJ, 40(7):2946–2953. (in Japanese).
Masaaki Nagata. 1994. A stochastic Japanese morphological analyzer using a forward-DP backward-A* n-best search algorithm. In Proc. of the COLING94, pages 201–207.
Masaaki Nagata. 1996. Automatic extraction of
new words from Japanese texts using generalized
forward-backward search. In EMNLP.
