Unsupervised Learning of Dependency Structure for Language Modeling  
Jianfeng Gao  
Microsoft Research, Asia  
49 Zhichun Road, Haidian District 
Beijing 100080 China 
jfgao@microsoft.com  
Hisami Suzuki  
Microsoft Research 
One Microsoft Way 
 Redmond WA 98052 USA  
hisamis@microsoft.com 
 
Abstract 
This paper presents a dependency language 
model (DLM) that captures linguistic con-
straints via a dependency structure, i.e., a set 
of probabilistic dependencies that express 
the relations between headwords of each 
phrase in a sentence by an acyclic, planar, 
undirected graph. Our contributions are 
three-fold. First, we incorporate the de-
pendency structure into an n-gram language 
model to capture long distance word de-
pendency. Second, we present an unsuper-
vised learning method that discovers the 
dependency structure of a sentence using a 
bootstrapping procedure. Finally, we 
evaluate the proposed models on a realistic 
application (Japanese Kana-Kanji conver-
sion). Experiments show that the best DLM 
achieves an 11.3% error rate reduction over 
the word trigram model. 
1 Introduction 
In recent years, many efforts have been made to 
utilize linguistic structure in language modeling, 
which for practical reasons is still dominated by 
trigram-based language models. There are two 
major obstacles to successfully incorporating lin-
guistic structure into a language model: (1) captur-
ing longer distance word dependencies leads to 
higher-order n-gram models, where the number of 
parameters is usually too large to estimate; (2) 
capturing deeper linguistic relations in a language 
model requires a large annotated training corpus 
and a decoder that assigns linguistic structure, 
which are not always available. 
This paper presents a new dependency language 
model (DLM) that captures long distance linguistic 
constraints between words via a dependency 
structure, i.e., a set of probabilistic dependencies 
that capture linguistic relations between headwords 
of each phrase in a sentence. To deal with the first 
obstacle mentioned above, we approximate 
long-distance linguistic dependency by a model that 
is similar to a skipping bigram model in which the 
prediction of a word is conditioned on exactly one 
other linguistically related word that lies arbitrarily 
far in the past. This dependency model is then in-
terpolated with a headword bigram model and a 
word trigram model, keeping the number of pa-
rameters of the combined model manageable. To 
overcome the second obstacle, we used an unsu-
pervised learning method that discovers the de-
pendency structure of a given sentence using an 
Expectation-Maximization (EM)-like procedure. In 
this method, no manual syntactic annotation is 
required, thereby opening up the possibility for 
building a language model that performs well on a 
wide variety of data and languages. The proposed 
model is evaluated using Japanese Kana-Kanji 
conversion, achieving significant error rate reduc-
tion over the word trigram model.   
2 Motivation 
A trigram language model predicts the next word 
based only on two preceding words, blindly dis-
carding any other relevant word that may lie three 
or more positions to the left. Such a model is likely 
to be linguistically implausible: consider the Eng-
lish sentence in Figure 1(a), where a trigram model 
would predict cried from next seat, which does not 
agree with our intuition. In this paper, we define a 
dependency structure of a sentence as a set of 
probabilistic dependencies that express linguistic 
relations between words in a sentence by an acyclic, 
planar graph, where two related words are con-
nected by an undirected graph edge (i.e., we do not 
differentiate the modifier and the head in a de-
pendency). The dependency structure for the sen-
tence in Figure 1(a) is as shown; a model that uses 
this dependency structure would predict cried from 
baby, in agreement with our intuition. 
 
 
(a) [A baby] [in the next seat] cried [throughout the flight]

(b) [Japanese equivalent of (a); characters lost in extraction]
Figure 1. Examples of dependency structure. (a) A 
dependency structure of an English sentence. Square 
brackets indicate base NPs; underlined words are the 
headwords. (b) A Japanese equivalent of (a). Slashes 
demarcate morpheme boundaries; square brackets 
indicate phrases (bunsetsu).  
A Japanese sentence is typically divided into 
non-overlapping phrases called bunsetsu. As shown 
in Figure 1(b), each bunsetsu consists of one con-
tent word, referred to here as the headword H, and 
several function words F. Words (more precisely, 
morphemes) within a bunsetsu are tightly bound 
with each other, which can be adequately captured 
by a word trigram model. However, headwords 
across bunsetsu boundaries also have dependency 
relations with each other, as the diagrams in Figure 
1 show. Such long distance dependency relations are expected to provide useful information that is complementary to the word trigram model in the task of next word prediction.
In constructing language models for realistic 
applications such as speech recognition and Asian 
language input, we are faced with two constraints 
that we would like to satisfy: First, the model must 
operate in a left-to-right manner, because (1) the 
search procedures for predicting words that corre-
spond to the input acoustic signal or phonetic string 
work left to right, and (2) it can be easily combined 
with a word trigram model in decoding. Second, the 
model should be computationally feasible both in 
training and decoding. In the next section, we offer 
a DLM that satisfies both of these constraints.  
3 Dependency Language Model 
The DLM attempts to generate the dependency 
structure incrementally while traversing the sen-
tence left to right. It will assign a probability to 
every word sequence W and its dependency struc-
ture D. The probability assignment is based on an 
encoding of the (W, D) pair described below. 
Let W be a sentence of n words to which we have prepended <s> and appended </s>, so that w_0 = <s> and w_{n+1} = </s>. In principle, a language model recovers the probability of a sentence P(W) over all possible D given W by estimating the joint probability P(W, D): P(W) = \sum_D P(W, D). In practice, we used the so-called maximum approximation, where the sum is approximated by a single term P(W, D^*):

P(W) = \sum_{D} P(W, D) \approx P(W, D^*) .    (1)
Here, D^* is the most probable dependency structure of the sentence, which is generally discovered by maximizing P(W, D):

D^* = \arg\max_{D} P(W, D) .    (2)

Below we restrict the discussion to the most probable dependency structure of a given sentence, and simply use D to represent D^*. In the remainder of
this section, we first present a statistical dependency 
parser, which estimates the parsing probability at 
the word level, and generates D incrementally while 
traversing W left to right. Next, we describe the 
elements of the DLM that assign probability to each 
possible W and its most probable D, P(W, D). Fi-
nally, we present an EM-like iterative method for 
unsupervised learning of dependency structure.  
3.1 Dependency parsing 
The aim of dependency parsing is to find the most 
probable D of a given W by maximizing the prob-
ability P(D|W). Let D be a set of probabilistic de-
pendencies d, i.e. d ∈ D. Assuming that the de-
pendencies are independent of each other, we have 
P(D \mid W) = \prod_{d \in D} P(d \mid W)    (3)

where P(d|W) is the dependency probability conditioned on a particular sentence.¹ It is impossible to estimate P(d|W) directly because the same sentence is very unlikely to appear in both training and test data. We thus approximated P(d|W) by P(d), and estimated the dependency probability from the training corpus. Let d_ij = (w_i, w_j) be the dependency between w_i and w_j. The maximum likelihood estimation (MLE) of P(d_ij) is given by

P(d_{ij}) = \frac{C(w_i, w_j, R)}{C(w_i, w_j)} ,    (4)

where C(w_i, w_j, R) is the number of times w_i and w_j have a dependency relation in a sentence in the training data, and C(w_i, w_j) is the number of times w_i and w_j are seen in the same sentence. To deal with the data sparseness problem of MLE, we used a backoff estimation strategy similar to the one proposed in Collins (1996), which backs off to estimates that use less conditioning context. More specifically, we used the following three estimates:

E_1 = \frac{\eta_1}{\delta_1}, \quad E_{23} = \frac{\eta_2 + \eta_3}{\delta_2 + \delta_3}, \quad E_4 = \frac{\eta_4}{\delta_4} ,    (5)

where

\eta_1 = C(w_i, w_j, R),   \delta_1 = C(w_i, w_j),
\eta_2 = C(w_i, *, R),     \delta_2 = C(w_i, *),
\eta_3 = C(*, w_j, R),     \delta_3 = C(*, w_j),
\eta_4 = C(*, *, R),       \delta_4 = C(*, *),

in which * indicates a wild-card matching any word. The final estimate E is given by linearly interpolating these estimates:

E = \lambda_1 E_1 + (1 - \lambda_1)(\lambda_2 E_{23} + (1 - \lambda_2) E_4) ,    (6)

where λ_1 and λ_2 are smoothing parameters.

¹ The model in Equation (3) is not strictly probabilistic because it drops the probabilities of illegal dependencies (e.g., crossing dependencies).
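To make the estimation concrete, the following sketch (our illustration, not the authors' code; the count tables and the λ values are hypothetical) computes the interpolated estimate E of Equations (4)-(6) from raw counts:

from collections import Counter

# Hypothetical count tables, indexed as in Equation (4):
# dep_count[(wi, wj)]  ~ C(wi, wj, R): times wi, wj are linked in training data
# cooc_count[(wi, wj)] ~ C(wi, wj):    times wi, wj occur in the same sentence
dep_count = Counter()
cooc_count = Counter()

def estimate_dependency_prob(wi, wj, lam1=0.8, lam2=0.9):
    """Backoff estimate E of P(d_ij), Equations (5) and (6)."""
    eta1, delta1 = dep_count[(wi, wj)], cooc_count[(wi, wj)]
    # Wild-card counts; a real implementation would precompute these sums.
    eta2 = sum(c for (a, _), c in dep_count.items() if a == wi)     # C(wi, *, R)
    delta2 = sum(c for (a, _), c in cooc_count.items() if a == wi)  # C(wi, *)
    eta3 = sum(c for (_, b), c in dep_count.items() if b == wj)     # C(*, wj, R)
    delta3 = sum(c for (_, b), c in cooc_count.items() if b == wj)  # C(*, wj)
    eta4, delta4 = sum(dep_count.values()), sum(cooc_count.values())
    e1 = eta1 / delta1 if delta1 else 0.0
    e23 = (eta2 + eta3) / (delta2 + delta3) if delta2 + delta3 else 0.0
    e4 = eta4 / delta4 if delta4 else 0.0
    return lam1 * e1 + (1 - lam1) * (lam2 * e23 + (1 - lam2) * e4)  # Eq. (6)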
Given the above parsing model, we used an approximation parsing algorithm that is O(n^2). Traditional techniques use an optimal Viterbi-style algorithm (e.g., a bottom-up chart parser) that is O(n^5).² Although the approximation algorithm is not guaranteed to find the most probable D, we opted for it because it works in a left-to-right manner, and it is very efficient and simple to implement. In our experiments, we found that the algorithm performs reasonably well on average, and its speed and simplicity make it a better choice in DLM training, where we need to parse a large amount of training data iteratively, as described in Section 3.3.
The parsing algorithm is a slightly modified version of the one proposed in Yuret (1998). It reads a sentence left to right; after reading each new word w_j, it tries to link w_j to each of its preceding words w_i, and pushes the generated dependency d_ij onto a stack. When a dependency crossing or a cycle is detected in the stack, the conflicting dependency with the lowest dependency probability is eliminated. The algorithm is outlined in Figures 2 and 3.

² For parsers that use bigram lexical dependencies, Eisner and Satta (1999) present parsing algorithms that are O(n^4) or O(n^3). We thank Joshua Goodman for pointing this out.
DEPENDENCY-PARSING(W)
1  for j ← 1 to LENGTH(W)
2    for i ← j-1 downto 1
3      PUSH d_ij = (w_i, w_j) into the stack D_j
4      if a dependency cycle (CY) is detected in D_j (see Figure 3(a))
5        REMOVE d, where d = argmin_{d ∈ CY} P(d)
6      while a dependency crossing (CR) is detected in D_j (see Figure 3(b)) do
7        REMOVE d, where d = argmin_{d ∈ CR} P(d)
8  OUTPUT(D)

Figure 2. Approximation algorithm of dependency parsing
 
 
 
 
 
 
Figure 3. (a) An example of a dependency cycle: given that P(d_23) is smaller than P(d_12) and P(d_13), d_23 is removed (represented as a dotted line). (b) An example of a dependency crossing: given that P(d_13) is smaller than P(d_24), d_13 is removed.
Let the dependency probability be the measure of the strength of a dependency, i.e., higher probabilities mean stronger dependencies. Note that when a strong new dependency crosses multiple weak dependencies, the weak dependencies are removed even if the new dependency is weaker than the sum of the old dependencies.³ Although this action results in a lower total probability, it was implemented because multiple weak dependencies connected to the beginning of the sentence often prevented a strong, meaningful dependency from being created. In this manner, the directional bias of the approximation algorithm was partially compensated for.⁴

³ This operation leaves some headwords disconnected; in such a case, we assumed that each disconnected headword has a dependency relation with its preceding headword.

⁴ Theoretically, we should arrive at the same dependency structure whether we parse the sentence left to right or right to left. However, this is not the case with the approximation algorithm. This problem is called directional bias.
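The following is a minimal runnable sketch of the Figure 2 procedure, assuming a dep_prob(wi, wj) function such as estimate_dependency_prob above; the helper names and the data layout are ours, not the authors':

def crosses(e1, e2):
    # Two undirected links (i, j), i < j, cross iff their endpoints interleave.
    (a, b), (c, d) = (e1[0], e1[1]), (e2[0], e2[1])
    return a < c < b < d or c < a < d < b

def cycle_path(deps, start, goal, banned):
    """Edges of a path start..goal that avoids `banned`; together with the
    banned edge such a path closes a cycle. None if not connected."""
    adj = {}
    for e in deps:
        if e is banned:
            continue
        adj.setdefault(e[0], []).append((e[1], e))
        adj.setdefault(e[1], []).append((e[0], e))
    stack, seen = [(start, [])], {start}
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt, e in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, path + [e]))
    return None

def dependency_parse(words, dep_prob):
    deps = []  # stack of links (i, j, p), i < j
    for j in range(1, len(words)):               # line 1 of Figure 2
        for i in range(j - 1, -1, -1):           # line 2
            new = (i, j, dep_prob(words[i], words[j]))
            deps.append(new)                     # line 3
            path = cycle_path(deps, i, j, banned=new)
            if path is not None:                 # lines 4-5: break the cycle
                deps.remove(min(path + [new], key=lambda e: e[2]))
            while True:                          # lines 6-7: resolve crossings
                clash = [(a, b) for a in deps for b in deps if crosses(a, b)]
                if not clash:
                    break
                a, b = clash[0]
                deps.remove(min((a, b), key=lambda e: e[2]))
    return deps                                  # line 8

The quadratic push loop mirrors the paper's O(n^2) behavior; the naive crossing scan above is kept simple for clarity rather than optimized.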
 
3.2 Language modeling 
The DLM together with the dependency parser 
provides an encoding of the (W, D) pair into a se-
quence of elementary model actions. Each action 
conceptually consists of two stages. The first stage 
assigns a probability to the next word given the left 
context. The second stage updates the dependency 
structure given the new word using the parsing 
algorithm in Figure 2. The probability P(W, D) is 
calculated as:  
P(W, D) = \prod_{j=1}^{n} P(w_j \mid \Phi(W_{j-1}, D_{j-1})) \, P(D_{j-1}^{j} \mid \Phi(W_{j-1}, D_{j-1}), w_j)    (7)

P(D_{j-1}^{j} \mid \Phi(W_{j-1}, D_{j-1}), w_j) = \prod_{i} P(p_i^j \mid W_{j-1}, D_{j-1}, p_1^j, \ldots, p_{i-1}^j) .    (8)
Here (W_{j-1}, D_{j-1}) is the word-parse (j-1)-prefix, where D_{j-1} is a dependency structure containing only those dependencies whose two related words are included in the word (j-1)-prefix W_{j-1}; w_j is the word to be predicted. D_{j-1}^j is the incremental dependency structure that generates D_j = D_{j-1} || D_{j-1}^j (|| stands for concatenation) when attached to D_{j-1}; it is the dependency structure built on top of D_{j-1} and the newly predicted word w_j (see the for-loop of line 2 in Figure 2). p_i^j denotes the i-th action of the parser at position j in the word string: to generate a new dependency d_ij, and to eliminate the conflicting dependencies with the lowest dependency probability (see lines 4-7 in Figure 2). Φ is a function that maps the history (W_{j-1}, D_{j-1}) onto equivalence classes.
The model in Equation (8) is unfortunately infeasible because it is extremely difficult to estimate the probability of p_i^j, due to the large number of parameters in the conditional part. According to the parsing algorithm in Figure 2, the probability of each action p_i^j depends on the entire history (e.g., for detecting a dependency crossing or cycle), so any mapping Φ that limits the equivalence classification to less context suitable for model estimation would be very likely to drop conditional information critical for predicting p_i^j. In practice, we approximated P(D_{j-1}^j | Φ(W_{j-1}, D_{j-1}), w_j) by the P(D_j | W_j) of Equation (3), yielding P(W_j, D_j) ≈ P(w_j | Φ(W_{j-1}, D_{j-1})) P(D_j | W_j). This approximation is probabilistically deficient, but our goal is to apply the DLM to a decoder in a realistic application, and the performance gain achieved by this approximation justifies the modeling decision.
Now we describe how P(w_j | Φ(W_{j-1}, D_{j-1})) is estimated. As described in Section 2, headwords and function words play different syntactic and semantic roles, capturing different types of dependency relations, so they are better predicted separately. Assuming that each word token can be uniquely classified as a headword or a function word in Japanese, the DLM can be conceived of as a cluster-based language model with two clusters, headword H and function word F. We can then define the conditional probability of w_j based on its history as the product of two factors: the probability of the category given its history, and the probability of w_j given its category. Let h_j or f_j be the actual headword or function word in a sentence, and let H_j or F_j be the category of the word w_j. P(w_j | Φ(W_{j-1}, D_{j-1})) can then be formulated as:

P(w_j \mid \Phi(W_{j-1}, D_{j-1})) = P(H_j \mid \Phi(W_{j-1}, D_{j-1})) \, P(w_j \mid \Phi(W_{j-1}, D_{j-1}), H_j)
    + P(F_j \mid \Phi(W_{j-1}, D_{j-1})) \, P(w_j \mid \Phi(W_{j-1}, D_{j-1}), F_j) .    (9)
We first describe the estimation of the headword probability P(w_j | Φ(W_{j-1}, D_{j-1}), H_j). Let HW_{j-1} be the headwords in the (j-1)-prefix, i.e., containing only those headwords that are included in W_{j-1}. Because HW_{j-1} is determined by W_{j-1}, the headword probability can be rewritten as P(w_j | Φ(W_{j-1}, HW_{j-1}, D_{j-1}), H_j). The problem is to determine the mapping Φ so as to identify the related words in the left context that we would like to condition on. Based on the discussion in Section 2, we chose a mapping function that retains (1) the two preceding words w_{j-1} and w_{j-2} in W_{j-1}, (2) one preceding headword h_{j-1} in HW_{j-1}, and (3) one linguistically related word w_i according to D_{j-1}. w_i is determined in two stages: First, the parser updates the dependency structure D_{j-1} incrementally to D_j, assuming that the next word is w_j. Second, when there are multiple words that have dependency relations with w_j in D_j, w_i is selected using the following decision rule:

w_i = \arg\max_{w_i : (w_i, w_j) \in D_j} P(w_j \mid w_i, R) ,    (10)

where the probability P(w_j | w_i, R) of the word w_j given its linguistically related word w_i is computed using MLE by Equation (11):

P(w_j \mid w_i, R) = \frac{C(w_i, w_j, R)}{\sum_{w_j} C(w_i, w_j, R)} .    (11)
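As an illustration of Equations (10) and (11), continuing the hypothetical sketch above (dep_count is the C(w_i, w_j, R) table of Equation (4)):

def p_word_given_related(wj, wi):
    """Equation (11): MLE of P(wj | wi, R)."""
    total = sum(c for (a, _), c in dep_count.items() if a == wi)  # sum over wj
    return dep_count[(wi, wj)] / total if total else 0.0

def select_related_word(wj, linked_words):
    """Equation (10): among the words linked to wj in D_j, pick the one
    that best predicts wj."""
    return max(linked_words, key=lambda wi: p_word_given_related(wj, wi))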
We thus have the mapping function Φ(W_{j-1}, HW_{j-1}, D_{j-1}) = (w_{j-2}, w_{j-1}, h_{j-1}, w_i). The estimate of the headword probability is an interpolation of three probabilities:

P(w_j \mid \Phi(W_{j-1}, D_{j-1}), H_j) = \lambda_1 (\lambda_2 P(w_j \mid h_{j-1}, H_j) + (1 - \lambda_2) P(w_j \mid w_i, R)) + (1 - \lambda_1) P(w_j \mid w_{j-2}, w_{j-1}, H_j) .    (12)

Here P(w_j | w_{j-2}, w_{j-1}, H_j) is the word trigram probability given that w_j is a headword, P(w_j | h_{j-1}, H_j) is the headword bigram probability, and λ_1, λ_2 ∈ [0, 1] are the interpolation weights optimized on held-out data.
We now come back to the estimation of the other three probabilities in Equation (9). Following the work in Gao et al. (2002b), we used the unigram estimate for word category probabilities (i.e., P(H_j | Φ(W_{j-1}, D_{j-1})) ≈ P(H_j) and P(F_j | Φ(W_{j-1}, D_{j-1})) ≈ P(F_j)), and the standard trigram estimate for function word probability (i.e., P(w_j | Φ(W_{j-1}, D_{j-1}), F_j) ≈ P(w_j | w_{j-2}, w_{j-1}, F_j)). Let C_j be the category of w_j; we approximated P(C_j) × P(w_j | w_{j-2}, w_{j-1}, C_j) by P(w_j | w_{j-2}, w_{j-1}). By separating the estimates for the probabilities of headwords and function words, the final estimate is given below:
estimate is given below: 
P(w
j 
| Φ(W
j-1
,
 
D
j-1
))= (13) 
)|()(((
121 −jjj
hwPHP λλ
)),|()1(
2
RwwP
ij
λ−+
),|()1(
121 −−
−+
jjj
wwwPλ  
 
w
j
: headword  
),|(
12 −− jjj
wwwP   









 
w
j
: function word  
All conditional probabilities in Equation (13) are obtained using MLE on training data. In order to deal with the data sparseness problem, we used a backoff scheme (Katz, 1987) for parameter estimation. This backoff scheme recursively estimates the probability of an unseen n-gram by utilizing (n-1)-gram estimates. In particular, the probability of Equation (11) backs off to the estimate of P(w_j | R), which is computed as:

P(w_j \mid R) = \frac{C(w_j, R)}{N} ,    (14)

where N is the total number of dependencies in the training data, and C(w_j, R) is the number of dependencies that contain w_j. To keep the model size manageable, we removed all n-grams with a count of less than 2 from the headword bigram model and the word trigram model, but kept all long-distance dependency bigrams that occurred in the training data.
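Putting Equations (13) and (14) together, one possible implementation of the final word probability looks as follows (a sketch; the component estimators, e.g. Katz-backoff models, are assumed to be given, and the bundle type is ours):

from dataclasses import dataclass
from typing import Callable

@dataclass
class DLMComponents:
    # Hypothetical bundle of the component estimators in Equation (13).
    p_trigram: Callable[[str, str, str], float]  # P(wj | w_{j-2}, w_{j-1})
    p_hw_bigram: Callable[[str, str], float]     # P(wj | h_{j-1})
    p_dep: Callable[[str, str], float]           # P(wj | wi, R), Eqs. (11)/(14)
    p_h: float                                   # unigram P(Hj)

def dlm_word_prob(wj, w_prev2, w_prev1, h_prev, wi, is_headword,
                  m, lam1=0.3, lam2=0.7):
    """Equation (13); lam1, lam2 default to the PW-DLM_2 values of Table 2."""
    if is_headword:
        return m.p_h * (lam1 * (lam2 * m.p_hw_bigram(wj, h_prev)
                                + (1 - lam2) * m.p_dep(wj, wi))
                        + (1 - lam1) * m.p_trigram(wj, w_prev2, w_prev1))
    return m.p_trigram(wj, w_prev2, w_prev1)  # function words: word trigram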
3.3 Training data creation 
This section describes two methods that were used to tag a raw text corpus for DLM training: (1) a method for headword detection, and (2) an unsupervised learning method for dependency structure acquisition.

In order to classify a word uniquely as H or F, we used a mapping table created in the following way. We first assumed that the mapping from part-of-speech (POS) to word category is unique and fixed;⁵ we then used a POS tagger to generate a POS-tagged corpus, which was then turned into a category-tagged corpus.⁶ Based on this corpus, we created a mapping table which maps each word to a unique category: when a word can be mapped to either H or F, we chose the more frequent category in the corpus. This method achieved 98.5% accuracy of headword detection on the test data we used.

⁵ The tag set we used included 1,187 POS tags, of which 102 counted as headwords in our experiments.

⁶ Since the POS tagger does not identify phrases (bunsetsu), our implementation identifies multiple headwords in phrases headed by compounds.
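A sketch of the mapping-table construction (our illustration; HEADWORD_POS stands for the 102 headword POS tags of footnote 5):

from collections import Counter, defaultdict

HEADWORD_POS = set()  # to be filled with the POS tags counted as headwords

def build_category_table(tagged_corpus):
    """Map each word to a unique category H or F; a word seen under both
    categories is assigned the more frequent one in the corpus."""
    counts = defaultdict(Counter)
    for word, pos in tagged_corpus:
        counts[word]['H' if pos in HEADWORD_POS else 'F'] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}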
Given a headword-tagged corpus, we then used an EM-like iterative method for the joint optimization of the parsing model and the dependency structure of the training data. This method uses the maximum likelihood principle, which is consistent with language model training. There are three steps in the algorithm: (1) initialize, (2) (re-)parse the training corpus, and (3) re-estimate the parameters of the parsing model. Steps (2) and (3) are iterated until the improvement in the probability of the training data is less than a threshold.
Initialize: We set a window of size N and assumed that each headword pair within a headword N-gram constitutes an initial dependency. The optimal value of N is 3 in our experiments. That is, given a headword trigram (h_1, h_2, h_3), there are three initial dependencies: d_12, d_13, and d_23. From the initial dependencies, we computed an initial dependency parsing model by Equation (4).
(Re-)parse the corpus: Given the parsing model, 
we used the parsing algorithm in Figure 2 to select 
the most probable dependency structure for each 
sentence in the training data. This provides an up-
dated set of dependencies. 
Re-estimate the parameters of the parsing model: We then re-estimated the parsing model parameters based on the updated dependency set.
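The three steps can be summarized in code as follows (a sketch reusing the hypothetical dep_count/cooc_count tables, estimate_dependency_prob, and dependency_parse defined above):

import math

def train_dependency_model(corpus, n=3, max_iter=5, tol=1e-4):
    """EM-like bootstrapping of Section 3.3; corpus is a list of headword lists."""
    # C(wi, wj): pairs seen in the same sentence (fixed across iterations).
    cooc_count.clear()
    for sent in corpus:
        for a in range(len(sent)):
            for b in range(a + 1, len(sent)):
                cooc_count[(sent[a], sent[b])] += 1
    # Initialize: every headword pair inside an N-word window is a dependency.
    dep_count.clear()
    for sent in corpus:
        for a in range(len(sent)):
            for b in range(a + 1, min(a + n, len(sent))):
                dep_count[(sent[a], sent[b])] += 1
    prev_ll = -math.inf
    for _ in range(max_iter):
        # (Re-)parse: most probable structure under the current model.
        parses = [dependency_parse(s, estimate_dependency_prob) for s in corpus]
        ll = sum(math.log(p) for d in parses for (_, _, p) in d if p > 0)
        # Re-estimate: rebuild C(wi, wj, R) from the selected dependencies.
        dep_count.clear()
        for sent, deps in zip(corpus, parses):
            for i, j, _ in deps:
                dep_count[(sent[i], sent[j])] += 1
        if ll - prev_ll < tol:  # stop when the likelihood gain is small
            break
        prev_ll = ll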
4 Evaluation Methodology 
In this study, we evaluated language models on the 
application of Japanese Kana-Kanji conversion, 
which is the standard method of inputting Japanese 
text by converting the text of a syllabary-based 
Kana string into the appropriate combination of 
Kanji and Kana. This is a similar problem to speech 
recognition, except that it does not include acoustic 
ambiguity. Performance on this task is measured in 
terms of the character error rate (CER), given by the 
number of characters wrongly converted from the 
phonetic string divided by the number of characters 
in the correct transcript.  
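For reference, CER can be computed with standard edit distance (our formulation of this definition; the paper does not specify the alignment procedure):

def cer(hyp: str, ref: str) -> float:
    """Character error rate: edit distance from the converted string to the
    correct transcript, divided by the length of the transcript."""
    dp = list(range(len(ref) + 1))
    for i in range(1, len(hyp) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(ref) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (hyp[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[len(ref)] / len(ref)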
For our experiments, we used two newspaper 
corpora, Nikkei and Yomiuri Newspapers, both of 
which have been pre-word-segmented. We built 
language models from a 36-million-word subset of 
the Nikkei Newspaper corpus, performed parameter 
optimization on a 100,000-word subset of the Yo-
miuri Newspaper (held-out data), and tested our 
models on another 100,000-word subset of the 
Yomiuri Newspaper corpus. The lexicon we used 
contains 167,107 entries.  
Our evaluation was done within the framework of the so-called "N-best rescoring" method, in which a list of hypotheses is generated by the baseline language model (a word trigram model in this study) and then rescored using a more sophisticated language model. We used the N-best list with N = 100, whose "oracle" CER (i.e., the CER of the hypotheses with the minimum number of errors) is presented in Table 1, indicating the upper bound on performance. We also note in Table 1 that the per-
formance of the conversion using the baseline tri-
gram model is much better than the state-of-the-art 
performance currently available in the marketplace, 
presumably due to the large amount of training data 
we used, and to the similarity between the training 
and the test data.  
Baseline Trigram    Oracle of 100-best
3.73%               1.51%

Table 1. CER results of the baseline and the 100-best list
5 Results 
The results of applying our models to the task of Japanese Kana-Kanji conversion are shown in Table 2. The baseline result was obtained by using a conventional word trigram model (WTM).⁷ HBM stands for the headword bigram model, which does not use any dependency structure (i.e., λ_2 = 1 in Equation (13)). DLM_1 is the DLM that does not use the headword bigram (i.e., λ_2 = 0 in Equation (13)). DLM_2 is the model where the headword probability is estimated by interpolating the word trigram probability, the headword bigram probability, and the probability given one previous linguistically related word in the dependency structure.
Although Equation (7) suggests that the word probability P(w_j | Φ(W_{j-1}, D_{j-1})) and the parsing model probability can be combined through simple multiplication, some weighting is desirable in practice, especially since our parsing model is estimated using an approximation by the parsing score P(D|W). We therefore introduced a parsing model weight PW: both the DLM_1 and DLM_2 models were built with and without PW. In Table 2, the PW- prefix refers to the DLMs with PW = 0.5, and the DLMs without the PW- prefix refer to DLMs with PW = 0. For both DLM_1 and DLM_2, the models with the parsing weight achieve better performance; we therefore discuss only the DLMs with the parsing weight in the rest of this section.

⁷ For a detailed description of the baseline trigram model, see Gao et al. (2002a).
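Under this setup, rescoring an N-best list amounts to the following (a sketch; the two log-probability interfaces are hypothetical):

def rescore(nbest, word_logprob, parse_logprob, pw=0.5):
    """Pick the hypothesis maximizing log P(W) + PW * log P(D|W);
    PW = 0 recovers the unweighted DLMs, PW = 0.5 the PW- models."""
    return max(nbest, key=lambda w: word_logprob(w) + pw * parse_logprob(w))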
Model       λ_1    λ_2    CER      CER reduction
WTM         ----   ----   3.73%    ----
HBM         0.2    1      3.40%    8.8%
DLM_1       0.1    0      3.48%    6.7%
PW-DLM_1    0.1    0      3.44%    7.8%
DLM_2       0.3    0.7    3.33%    10.7%
PW-DLM_2    0.3    0.7    3.31%    11.3%

Table 2. Comparison of CER results
By comparing both the HBM and PW-DLM_1 models with the baseline model, we can see that the use of headword dependencies contributes greatly to the CER reduction: HBM outperformed the baseline model by 8.8% in CER reduction, and PW-DLM_1 by 7.8%. By combining the headword bigram and the dependency structure, we obtained the best model, PW-DLM_2, which achieves an 11.3% CER reduction over the baseline. The improvement achieved by PW-DLM_2 over the HBM is statistically significant according to the t-test (P < 0.01). These results demonstrate the effectiveness of our parsing technique and of the use of dependency structure for language modeling.
6 Discussion 
In this section, we relate our model to previous 
research and discuss several factors that we believe 
to have the most significant impact on the per-
formance of DLM. The discussion includes: (1) the 
use of DLM as a parser, (2) the definition of the 
mapping function Φ, and (3) the method of unsu-
pervised dependency structure acquisition. 
One basic approach to using linguistic structure for language modeling is to extend the conventional language model P(W) to P(W, T), where T is a parse tree of W. The extended model can then be used as a parser to select the most likely parse by T^* = \arg\max_T P(W, T). Many recent studies (e.g., Chelba and Jelinek, 2000; Charniak, 2001; Roark, 2001) adopt this approach. Similarly, dependency-based models (e.g., Collins, 1996; Chelba et al., 1997) use a dependency structure D of W instead of a parse tree T, where D is extracted from syntactic trees.
Both of these models can be called grammar-based models, in that they capture the syntactic structure of a sentence, and their parameters are estimated from syntactically annotated corpora such as the Penn Treebank. DLM, on the other hand, is a non-grammar-based model, because it is not based on any syntactic annotation: the dependency structure used in language modeling is learned directly from data in an unsupervised manner, subject to two weak syntactic constraints (i.e., the dependency structure is acyclic and planar).⁸ As a result, our model captures dependency relations that are not precisely syntactic in nature. For example, in the conversion of the string below, the word ban 'evening' was correctly predicted by the DLM using the long-distance bigram asa~ban 'morning~evening', even though these two words are not in any direct syntactic dependency relationship:

[Japanese example sentence; characters lost in extraction]
'asks for instructions in the morning and submits daily reports in the evening'

Though there is no doubt that syntactic dependency relations provide useful information for language modeling, the most linguistically related word in the previous context may stand in various linguistic relations to the word being predicted, not limited to syntactic dependency. This opens up new possibilities for exploring the combination of different knowledge sources in language modeling.

⁸ In this sense, our model is an extension of the dependency-based model proposed in Yuret (1998). However, that work was not evaluated as a language model in terms of error rate reduction.
Regarding the function Φ that maps the left context onto equivalence classes, we used a simple approximation that takes into account only one linguistically related word in the left context. An alternative is to use the maximum entropy (ME) approach (Rosenfeld, 1994; Chelba et al., 1997). Although ME models provide a nice framework for incorporating arbitrary knowledge sources that can be encoded as a large set of constraints, training and using ME models is extremely computationally expensive. Our working hypothesis is that the information for predicting the new word is dominated by a very limited set of words that can be selected heuristically: in this paper, Φ is defined as a heuristic function that maps D to the one word in D that has the strongest linguistic relation with the word being predicted, as in (8). This hypothesis is borne out by an additional experiment we conducted, in which we used the two words from D that had the strongest relation with the word being predicted; this resulted in a very limited gain in CER reduction of 0.62%, which is not statistically significant (P > 0.05 according to the t-test).
The EM-like method for learning dependency 
relations described in Section 3.3 has also been 
applied to other tasks such as hidden Markov model 
training (Rabiner, 1989), syntactic relation learning 
(Yuret, 1998), and Chinese word segmentation 
(Gao et al., 2002a). In applying this method, two 
factors need to be considered: (1) how to initialize 
the model (i.e. the value of the window size N), and 
(2) the number of iterations. We investigated the impact of these two factors empirically on the CER of Japanese Kana-Kanji conversion. We built a series of DLMs using different window sizes N and different numbers of iterations. Some sample results are shown in Table 3: the improvement in CER begins to saturate at the second iteration. We also find that a larger N results in a better initial model but makes the following iterations less effective. A possible reason is that a larger N generates more initial dependencies, which leads to a better initial model, but also introduces noise that prevents the initial model from being improved. All DLMs in Table 2 are initialized with N = 3 and are run for two iterations.
Iteration   N = 2    N = 3    N = 5    N = 7    N = 10
Init.       3.552%   3.523%   3.540%   3.514%   3.511%
1           3.531%   3.503%   3.493%   3.509%   3.489%
2           3.527%   3.481%   3.483%   3.492%   3.488%
3           3.526%   3.481%   3.485%   3.490%   3.488%

Table 3. CER of DLM_1 models initialized with different window sizes N, for 0-3 iterations
7 Conclusion 
We have presented a dependency language model 
that captures linguistic constraints via a dependency 
structure – a set of probabilistic dependencies that 
express the relations between headwords of each 
phrase in a sentence by an acyclic, planar, undi-
rected graph. Promising results of our experiments 
suggest that long-distance dependency relations can 
indeed be successfully exploited for the purpose of 
language modeling.   
There are many possibilities for future im-
provements. In particular, as discussed in Section 6, 
syntactic dependency structure is believed to cap-
ture useful information for informed language 
modeling, yet further improvements may be possi-
ble by incorporating non-syntax-based dependen-
cies. Correlating the accuracy of the dependency 
parser as a parser vs. its utility in CER reduction 
may suggest a useful direction for further research.  
References
Charniak, Eugene. 2001. Immediate-head parsing for language models. In ACL/EACL 2001, pp. 124-131.
Chelba, Ciprian and Frederick Jelinek. 2000. Structured 
Language Modeling. Computer Speech and Language, 
Vol. 14, No. 4. pp 283-332.  
Chelba, C., D. Engle, F. Jelinek, V. Jimenez, S. Khudanpur, L. Mangu, H. Printz, E. S. Ristad, R. Rosenfeld, A. Stolcke and D. Wu. 1997. Structure and performance of a dependency language model. In Proceedings of Eurospeech, Vol. 5, pp. 2775-2778.
Collins, Michael John. 1996. A new statistical parser 
based on bigram lexical dependencies. In ACL 
34:184-191. 
Eisner, Jason and Giorgio Satta. 1999. Efficient parsing 
for bilexical context-free grammars and head 
automaton grammars. In ACL 37: 457-464. 
Gao, Jianfeng, Joshua Goodman, Mingjing Li and 
Kai-Fu Lee. 2002a. Toward a unified approach to sta-
tistical language modeling for Chinese. ACM Trans-
actions on Asian Language Information Processing, 
1-1: 3-33.  
Gao, Jianfeng, Hisami Suzuki and Yang Wen. 2002b. 
Exploiting headword dependency and predictive 
clustering for language modeling. In EMNLP 2002: 
248-256. 
Katz, S. M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3): 400-401.
Rabiner, Lawrence R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2): 257-286.
Roark, Brian. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2): 249-276.
Rosenfeld, Ronald. 1994. Adaptive statistical language 
modeling: a maximum entropy approach. Ph.D. thesis, 
Carnegie Mellon University.  
Yuret, Deniz. 1998. Discovery of linguistic relations 
using lexical attraction. Ph.D. thesis, MIT. 
