Proceedings of the Ninth International Workshop on Parsing Technologies (IWPT), pages 65–73,
Vancouver, October 2005. ©2005 Association for Computational Linguistics
Machine Translation as Lexicalized Parsing with Hooks
Liang Huang
Dept. of Computer & Information Science
University of Pennsylvania
Philadelphia, PA 19104
Hao Zhang and Daniel Gildea
Computer Science Department
University of Rochester
Rochester, NY 14627
Abstract
We adapt the “hook” trick for speeding up
bilexical parsing to the decoding problem
for machine translation models that are
based on combining a synchronous context-free
grammar as the translation model
with an n-gram language model. This
dynamic programming technique yields
lower complexity algorithms than have
previously been described for an impor-
tant class of translation models.
1 Introduction
In a number of recently proposed synchronous
grammar formalisms, machine translation of new
sentences can be thought of as a form of parsing on
the input sentence. The parsing process, however,
is complicated by the interaction of the context-free
translation model with an m-gram1 language model
in the output language. While such formalisms ad-
mit dynamic programming solutions having poly-
nomial complexity, the degree of the polynomial is
prohibitively high.
In this paper we explore parallels between transla-
tion and monolingual parsing with lexicalized gram-
mars. Chart items in translation must be augmented
with words from the output language in order to cap-
ture language model state. This can be thought of as
a form of lexicalization with some similarity to that
of head-driven lexicalized grammars, despite being
unrelated to any notion of syntactic head. We show that techniques for parsing with lexicalized grammars can be adapted to the translation problem, reducing the complexity of decoding with an inversion transduction grammar and a bigram language model from O(n^7) to O(n^6). We present background on this translation model as well as the use of the technique in bilexicalized parsing before describing the new algorithm in detail. We then extend the algorithm to general m-gram language models, and to general synchronous context-free grammars for translation.

1 We speak of m-gram language models to avoid confusion with n, which here is the length of the input sentence for translation.
2 Machine Translation using Inversion
Transduction Grammar
The Inversion Transduction Grammar (ITG) of Wu
(1997) is a type of context-free grammar (CFG) for
generating two languages synchronously. To model
the translational equivalence within a sentence pair,
ITG employs a synchronous rewriting mechanism to
relate two sentences recursively. To deal with the
syntactic divergence between two languages, ITG
allows the inversion of rewriting order going from
one language to another at any recursive level. ITG
in Chomsky normal form consists of unary produc-
tion rules that are responsible for generating word
pairs:
X → e/f
X → e/ε
X → ε/f
where e is a source language word, f is a foreign language
word, and ε means the null token, and binary
production rules in two forms that are responsible
for generating syntactic subtree pairs:
X → [Y Z]
and
X → 〈Y Z〉
The rules with square brackets enclosing the
right-hand side expand the left-hand side symbol
into the two symbols on the right-hand side in the
same order in the two languages, whereas the rules
with angled brackets expand the left hand side sym-
bol into the two right-hand side symbols in reverse
order in the two languages. Rules of the first class
are called straight rules; rules of the second class
are called inverted rules.
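The difference between the two rule types can be illustrated with a minimal Python sketch. The function name `combine` and the representation of a constituent as a pair of word lists are assumptions made for illustration only; they do not come from the paper.

```python
def combine(left, right, inverted):
    """Combine two (foreign, english) word-list pairs under one binary ITG rule.

    The foreign side is concatenated in order for both rule types.
    The English side is concatenated in the same order for a straight rule
    X -> [Y Z] and in reverse order for an inverted rule X -> <Y Z>.
    """
    (f1, e1), (f2, e2) = left, right
    foreign = f1 + f2
    english = e2 + e1 if inverted else e1 + e2
    return foreign, english


# Hypothetical one-word constituents Y and Z.
y = (["f1"], ["e1"])
z = (["f2"], ["e2"])
straight = combine(y, z, inverted=False)   # foreign f1 f2, English e1 e2
inverted = combine(y, z, inverted=True)    # foreign f1 f2, English e2 e1
```

Applying `combine` recursively with a mix of the two rule types produces the hierarchical reorderings that ITG allows.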
One special case of 2-normal ITG is the so-called
Bracketing Transduction Grammar (BTG) which
has only one nonterminal A and two binary rules
A → [AA]
and
A → 〈AA〉
By mixing instances of the inverted rule with
those of the straight rule hierarchically, BTG can
meet the alignment requirements of different lan-
guage pairs. There exists a more elaborate version
of BTG that has 4 nonterminals working together
to guarantee the property of one-to-one correspon-
dence between alignments and synchronous parse
trees. Table 1 lists the rules of this BTG. In the
discussion of this paper, we will consider ITG in 2-
normal form.
By associating probabilities or weights with the
bitext production rules, ITG becomes suitable for
weighted deduction over bitext. Given a sentence
pair, searching for the Viterbi synchronous parse
tree, of which the alignment is a byproduct, turns out
to be a two-dimensional extension of PCFG parsing,
having time complexity of O(n^6), where n is the
length of the English string and the foreign language
string. A more interesting variant of parsing over bi-
text space is the asymmetrical case in which only the
foreign language string is given so that Viterbi pars-
ing involves finding the English string “on the fly”.
The process of finding the source string given its tar-
get counterpart is decoding. Using ITG, decoding is
a form of parsing.
2.1 ITG Decoding
Wu (1996) presented a polynomial-time algorithm
for decoding ITG combined with an m-gram lan-
guage model. Such language models are commonly
used in noisy channel models of translation, which
find the best English translation e of a foreign sen-
tence f by finding the sentence e that maximizes the
product of the translation model P(f|e) and the lan-
guage model P(e).
It is worth noting that since we have specified ITG
as a joint model generating both e and f, a language
model is not theoretically necessary. Given a foreign
sentence f, one can find the best translation e∗:
\[
e^{*} = \arg\max_{e} P(e, f) = \arg\max_{e} \sum_{q} P(e, f, q)
\]
by approximating the sum over parses q with the
probability of the Viterbi parse:
\[
e^{*} = \arg\max_{e} \max_{q} P(e, f, q)
\]
This optimal translation can be computed using
standard CKY parsing over f by initializing the
chart with an item for each possible translation of
each foreign word in f, and then applying ITG rules
from the bottom up.
However, ITG’s independence assumptions are
too strong to use the ITG probability alone for ma-
chine translation. In particular, the context-free as-
sumption that each foreign word’s translation is cho-
sen independently will lead to simply choosing each
foreign word’s single most probable English trans-
lation with no reordering. In practice it is beneficial
to combine the probability given by ITG with a local
m-gram language model for English:
\[
e^{*} = \arg\max_{e} \max_{q} P(e, f, q) \, P_{lm}(e)^{\alpha}
\]
with some constant language model weight α. The
language model will lead to more fluent output by
influencing both the choice of English words and the
reordering, through the choice of straight or inverted
rules. While the use of a language model compli-
cates the CKY-based algorithm for finding the best
translation, a dynamic programming solution is still
possible. We extend the algorithm by storing in each
chart item the English boundary words that will af-
fect the m-gram probabilities as the item’s English
string is concatenated with the string from an adjacent item. Due to the locality of the m-gram language model, only m−1 boundary words need to be stored to compute the new m-grams produced by combining two substrings. Figure 1 illustrates the combination of two substrings into a larger one in straight order and inverted order.

Structural Rules                          Lexical Rules
S → A    A → [AB]    B → 〈AA〉             C → e_i/f_j
S → B    A → [BB]    B → 〈BA〉             C → ε/f_j
S → C    A → [CB]    B → 〈CA〉             C → e_i/ε
         A → [AC]    B → 〈AC〉
         A → [BC]    B → 〈BC〉
         A → [CC]    B → 〈CC〉
Table 1: Unambiguous BTG
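To see why m−1 boundary words on each side are enough, the following sketch lists the m-grams that straddle the junction of two substrings. The helper name `boundary_mgrams` is hypothetical, and the sketch assumes that each substring contains at least m−1 words.

```python
def boundary_mgrams(left_suffix, right_prefix, m):
    """Return the m-grams created by concatenating two substrings.

    left_suffix:  the last  m-1 words of the left substring
    right_prefix: the first m-1 words of the right substring
    Every new m-gram must contain words from both sides, so it lies
    entirely inside this window of 2(m-1) words.
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    window = list(left_suffix[-(m - 1):]) + list(right_prefix[:m - 1])
    return [tuple(window[i:i + m]) for i in range(len(window) - m + 1)]


# Trigram case, using the boundary-word names of Figure 1(a).
trigrams = boundary_mgrams(["v11", "v12"], ["u21", "u22"], 3)
# Bigram case: a single new bigram.
bigrams = boundary_mgrams(["v1"], ["u2"], 2)
```

For m = 3 the window holds four words and yields exactly two new trigrams, which is why each trigram chart item carries two words on each side.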
3 Hook Trick for Bilexical Parsing
A traditional CFG generates words at the bottom of
a parse tree and uses nonterminals as abstract rep-
resentations of substrings to build higher level tree
nodes. Nonterminals can be made more specific to
the actual substrings they are covering by associ-
ating a representative word from the nonterminal’s
yield. When the maximum number of lexicalized
nonterminals in any rule is two, a CFG is bilexical.
A typical bilexical CFG in Chomsky normal form
has two types of rule templates:
A[h] → B[h] C[h′]
or
A[h] → B[h′] C[h]
depending on which child is the head child that
agrees with the parent on head word selection.
Bilexical CFG is at the heart of most modern statisti-
cal parsers (Collins, 1997; Charniak, 1997), because
the statistics associated with word-specific rules are
more informative for disambiguation purposes. If
we use A[i, j, h] to represent a lexicalized con-
stituent, β(·) to represent the Viterbi score function
applicable to any constituent, and P(·) to represent
the rule probability function applicable to any rule,
Figure 2 shows the equation for the dynamic pro-
gramming computation of the Viterbi parse. The two
terms of the outermost max operator are symmetric
cases for heads coming from left and right. Containing
five free variables i, j, k, h′, h, each ranging over 1 to
n, the length of the input sentence, both terms can be
instantiated in n^5 possible ways, implying that the
complexity of the parsing algorithm is O(n^5).
Eisner and Satta (1999) pointed out that we do not have to enumerate k and h′ simultaneously. The trick, shown in mathematical form in Figure 2 (bottom), is very simple. When maximizing over h′, j is irrelevant. After getting the intermediate result of maximizing over h′, we have one less free variable than before. Throughout the two steps, the maximum number of interacting variables is 4, implying that the algorithmic complexity is O(n^4) after binarizing the factors cleverly. The intermediate result
\[
\max_{h', B} \left[ \beta(B[i, k, h']) \cdot P(A[h] \to B[h'] \, C[h]) \right]
\]
can be represented pictorially as [icon: an item for A over the span (i, k), still awaiting a sibling C[h]]. The same trick works for the second max term in Equation 1. The intermediate result coming from binarizing the second term can be visualized as [icon: an item for A over the span (k, j), still awaiting a sibling B[h]]. The shape of the intermediate results gave rise to the nickname of “hook”. Melamed
(2003) discussed the applicability of the hook trick
for parsing bilexical multitext grammars. The anal-
ysis of the hook trick in this section shows that it is
essentially an algebraic manipulation. We will for-
mulate the ITG Viterbi decoding algorithm in a dy-
namic programming equation in the following sec-
tion and apply the same algebraic manipulation to
produce hooks that are suitable for ITG decoding.
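The algebraic nature of the hook trick can be checked numerically. The sketch below fills toy tables with random scores and compares the first term of Equation 1 with its factored form from Figure 2 (bottom). The table layout, the names `naive`, `hook`, and `factored`, and the toy sizes are assumptions for illustration; as in the complexity analysis, head positions are allowed to range over the whole sentence.

```python
import functools
import math
import random

random.seed(0)
n = 4                       # toy sentence length
NT = ["A", "B", "C"]        # toy nonterminal set
heads = range(n)            # head positions

# beta[X, i, k, h]: toy Viterbi score of the constituent X[i, k, h]
beta = {(x, i, k, h): random.random()
        for x in NT for i in range(n) for k in range(i + 1, n + 1) for h in heads}
# P[A, B, C, h2, h]: toy probability of the rule A[h] -> B[h2] C[h]
P = {(a, b, c, h2, h): random.random()
     for a in NT for b in NT for c in NT for h2 in heads for h in heads}


def naive(a, i, j, h):
    """Maximize over k, h', B and C at once: five indices interact."""
    return max(beta[b, i, k, h2] * beta[c, k, j, h] * P[a, b, c, h2, h]
               for k in range(i + 1, j) for h2 in heads for b in NT for c in NT)


@functools.lru_cache(maxsize=None)
def hook(a, c, i, k, h):
    """Inner maximization over h' and B; j does not appear here."""
    return max(beta[b, i, k, h2] * P[a, b, c, h2, h] for h2 in heads for b in NT)


def factored(a, i, j, h):
    """Outer maximization over k and C, reusing the cached hooks."""
    return max(hook(a, c, i, k, h) * beta[c, k, j, h]
               for k in range(i + 1, j) for c in NT)


all_agree = all(math.isclose(naive(a, i, j, h), factored(a, i, j, h))
                for a in NT for i in range(n)
                for j in range(i + 2, n + 1) for h in heads)
```

Each of `hook` and `factored` touches at most four sentence-length indices, which is where the O(n^4) bound comes from.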
4 Hook Trick for ITG Decoding
We start from the bigram case, in which each decoding constituent keeps a left boundary word and a right boundary word. The dynamic programming equation is shown in Figure 3 (top), where i, j, k range over 1 to n, the length of the input foreign sentence, and u, v, v1, u2 (or u, v, v2, u1) range over 1 to V, the size of the English vocabulary. Usually we will constrain the vocabulary to be a subset of words that are probable translations of the foreign words in the input sentence, so V is proportional to n. There are seven free variables related to input size for doing the maximization computation (i, j, k, u, v, v1, and u2 in the straight case). Hence the algorithmic complexity is O(n^7).

[Figure 1 diagrams: (a) X → [Y Z] over input spans (s, S) and (S, t), English order u11 u12 ... v11 v12 u21 u22 ... v21 v22; (b) X → 〈Y Z〉 over the same input spans, English order u21 u22 ... v21 v22 u11 u12 ... v11 v12]
Figure 1: ITG decoding using a 3-gram language model. Two boundary words need to be kept on the left (u) and right (v) of each constituent. In (a), two constituents Y and Z spanning substrings (s, S) and (S, t) of the input are combined using a straight rule X → [Y Z]. In (b), two constituents are combined using an inverted rule X → 〈Y Z〉. The dashed line boxes enclosing three words are the trigrams produced from combining two substrings.

\[
\beta(A[i, j, h]) = \max \left\{
\begin{array}{l}
\max_{k, h', B, C} \left[ \beta(B[i, k, h']) \cdot \beta(C[k, j, h]) \cdot P(A[h] \to B[h'] \, C[h]) \right], \\
\max_{k, h', B, C} \left[ \beta(B[i, k, h]) \cdot \beta(C[k, j, h']) \cdot P(A[h] \to B[h] \, C[h']) \right]
\end{array}
\right\}
\tag{1}
\]
\[
\max_{k, h', B, C} \left[ \beta(B[i, k, h']) \cdot \beta(C[k, j, h]) \cdot P(A[h] \to B[h'] \, C[h]) \right]
= \max_{k, C} \left[ \max_{h', B} \left[ \beta(B[i, k, h']) \cdot P(A[h] \to B[h'] \, C[h]) \right] \cdot \beta(C[k, j, h]) \right]
\]
Figure 2: Equation for bilexical parsing (top), with an efficient factorization (bottom)
The two terms in Figure 3 (top) within the first
level of the max operator, corresponding to straight
rules and inverted rules, are analogous to the two
terms in Equation 1. Figure 3 (bottom) shows how to
decompose the first term; the same method applies
to the second term. Counting the free variables en-
closed in the innermost max operator, we get five: i,
k, u, v1, and u2. The decomposition eliminates one
free variable, v1. In the outermost level, there are
six free variables left: i, j, k, u, v, and u2. The maximum number of
interacting variables is six overall. So, we have reduced the
complexity of ITG decoding using a bigram language
model from O(n^7) to O(n^6).
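The saving can be made tangible by counting how many product terms each formulation evaluates on a toy instance. The sketch below does this for the straight-rule term of Equation 2 only. The random tables, the toy sizes, and the names `naive`, `hook`, and `factored` are assumptions for illustration, not part of the original algorithm description.

```python
import itertools
import math
import random

random.seed(1)
n, V = 3, 3                  # toy input length and English vocabulary size
NT = ["X", "Y"]              # toy nonterminal set
words = range(V)

beta = {(y, i, k, u, v): random.random()
        for y in NT for i in range(n) for k in range(i + 1, n + 1)
        for u in words for v in words}
rule = {(x, y, z): random.random() for x in NT for y in NT for z in NT}  # P(X -> [Y Z])
bigram = {(a, b): random.random() for a in words for b in words}


def naive(x, i, j, u, v):
    """Straight-rule term of Equation 2; returns (score, number of terms)."""
    scores = [beta[y, i, k, u, v1] * beta[z, k, j, u2, v] * rule[x, y, z] * bigram[v1, u2]
              for k in range(i + 1, j)
              for v1, u2, y, z in itertools.product(words, words, NT, NT)]
    return max(scores), len(scores)


hooks = {}        # cache: (x, z, i, k, u, u2) -> hook score
hook_terms = 0    # terms evaluated while building hooks


def hook(x, z, i, k, u, u2):
    """Inner maximization over v1 and Y, computed once per key and cached."""
    global hook_terms
    key = (x, z, i, k, u, u2)
    if key not in hooks:
        scores = [beta[y, i, k, u, v1] * rule[x, y, z] * bigram[v1, u2]
                  for v1, y in itertools.product(words, NT)]
        hook_terms += len(scores)
        hooks[key] = max(scores)
    return hooks[key]


def factored(x, i, j, u, v):
    """Outer maximization over k, u2 and Z; returns (score, number of terms)."""
    scores = [hook(x, z, i, k, u, u2) * beta[z, k, j, u2, v]
              for k in range(i + 1, j)
              for u2, z in itertools.product(words, NT)]
    return max(scores), len(scores)


agree, naive_total, factored_total = True, 0, 0
for x, i, u, v in itertools.product(NT, range(n), words, words):
    for j in range(i + 2, n + 1):
        a, count_a = naive(x, i, j, u, v)
        b, count_b = factored(x, i, j, u, v)
        agree = agree and math.isclose(a, b)
        naive_total += count_a
        factored_total += count_b
factored_total += hook_terms   # include the cost of building the hooks
```

Both formulations return the same scores, while `factored_total` stays below `naive_total` even at this tiny size; the gap widens as n and V grow.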
The hooks that we have built for decoding with a bigram language model (items for X over the input span (i, k) with left boundary word u, awaiting a sibling Z whose left boundary word is u2) turn out to be similar to the hooks for bilexical parsing if we focus on the two boundary words v1 and u2 (or v2 and u1) that are interacting between two adjacent decoding constituents and relate them with the h′ and h that are interacting in bilexical parsing. In terms of algebraic manipulation, we are also rearranging three factors (ignoring the non-lexical rules), trying to reduce the maximum number of interacting variables in any computation step.

\[
\beta(X[i, j, u, v]) = \max \left\{
\begin{array}{l}
\max_{k, v_1, u_2, Y, Z} \left[ \beta(Y[i, k, u, v_1]) \cdot \beta(Z[k, j, u_2, v]) \cdot P(X \to [Y Z]) \cdot \mathrm{bigram}(v_1, u_2) \right], \\
\max_{k, v_2, u_1, Y, Z} \left[ \beta(Y[i, k, u_1, v]) \cdot \beta(Z[k, j, u, v_2]) \cdot P(X \to \langle Y Z \rangle) \cdot \mathrm{bigram}(v_2, u_1) \right]
\end{array}
\right\}
\tag{2}
\]
\[
\max_{k, v_1, u_2, Y, Z} \left[ \beta(Y[i, k, u, v_1]) \cdot \beta(Z[k, j, u_2, v]) \cdot P(X \to [Y Z]) \cdot \mathrm{bigram}(v_1, u_2) \right]
= \max_{k, u_2, Z} \left[ \max_{v_1, Y} \left[ \beta(Y[i, k, u, v_1]) \cdot P(X \to [Y Z]) \cdot \mathrm{bigram}(v_1, u_2) \right] \cdot \beta(Z[k, j, u_2, v]) \right]
\]
Figure 3: Equation for ITG decoding (top), with an efficient factorization (bottom)
4.1 Generalization to m-gram Cases
In this section, we will demonstrate how to use the hook trick for trigram decoding, which leads us to a general hook trick for any m-gram decoding case.

We will work only on straight rules, writing β(·) for the score of a complete constituent and β⁺₁(·) and β⁺₂(·) for the scores of level-1 and level-2 hooks.

The straightforward dynamic programming equation is:
\[
\beta(X[i, j, u_1, u_2, v_1, v_2]) = \max_{v_{11}, v_{12}, u_{21}, u_{22}, k, Y, Z}
\left[ \beta(Y[i, k, u_1, u_2, v_{11}, v_{12}]) \cdot \beta(Z[k, j, u_{21}, u_{22}, v_1, v_2]) \cdot P(X \to [Y Z]) \cdot \mathrm{trigram}(v_{11}, v_{12}, u_{21}) \cdot \mathrm{trigram}(v_{12}, u_{21}, u_{22}) \right]
\tag{3}
\]

By counting the variables that are dependent on input sentence length on the right hand side of the equation, we know that the straightforward algorithm’s complexity is O(n^11). The maximization computation is over four factors that are dependent on n: β(Y[i, k, u1, u2, v11, v12]), β(Z[k, j, u21, u22, v1, v2]), trigram(v11, v12, u21), and trigram(v12, u21, u22). As before, our goal is to cleverly bracket the factors.

By bracketing trigram(v11, v12, u21) and β(Y[i, k, u1, u2, v11, v12]) together and maximizing over v11 and Y, we can build the level-1 hook:
\[
\beta^{+}_{1}(X, Z;\, i, k, u_1, u_2, v_{12}, u_{21}) = \max_{v_{11}, Y}
\left[ \beta(Y[i, k, u_1, u_2, v_{11}, v_{12}]) \cdot P(X \to [Y Z]) \cdot \mathrm{trigram}(v_{11}, v_{12}, u_{21}) \right]
\]
The complexity is O(n^7).

Grouping the level-1 hook and trigram(v12, u21, u22), maximizing over v12, we can build the level-2 hook:
\[
\beta^{+}_{2}(X, Z;\, i, k, u_1, u_2, u_{21}, u_{22}) = \max_{v_{12}}
\left[ \beta^{+}_{1}(X, Z;\, i, k, u_1, u_2, v_{12}, u_{21}) \cdot \mathrm{trigram}(v_{12}, u_{21}, u_{22}) \right]
\]
The complexity is O(n^7). Finally, we can use the level-2 hook to combine with Z[k, j, u21, u22, v1, v2] to build X[i, j, u1, u2, v1, v2]. The complexity is O(n^9) after reducing v11 and v12 in the first two steps.
\[
\beta(X[i, j, u_1, u_2, v_1, v_2]) = \max_{u_{21}, u_{22}, k, Z}
\left[ \beta^{+}_{2}(X, Z;\, i, k, u_1, u_2, u_{21}, u_{22}) \cdot \beta(Z[k, j, u_{21}, u_{22}, v_1, v_2]) \right]
\tag{4}
\]

Using the hook trick, we have reduced the complexity of ITG decoding using bigrams from O(n^7) to O(n^6), and from O(n^11) to O(n^9) for the trigram case. We conclude that for m-gram decoding of ITG, the hook trick can change the time complexity from O(n^{3+4(m−1)}) to O(n^{3+3(m−1)}). To get an intuition of the reduction, we can compare Equation 3 with Equation 4. The variables v11 and v12 in Equation 3, which are independent of v1 and v2 for maximizing the product, have been concealed under the level-2 hook in Equation 4. In general, by building m − 1 intermediate hooks, we can reduce m − 1 free variables in the final combination step, hence having the reduction from 4(m − 1) to 3(m − 1).
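As a quick arithmetic check of the two exponents, the sketch below evaluates them for a few language model orders. The function name `itg_decoding_exponents` is hypothetical.

```python
def itg_decoding_exponents(m):
    """Exponents of n for ITG decoding with an m-gram language model (m >= 2).

    Returns (without hooks, with hooks).
    """
    # i, j, k plus four groups of m-1 boundary words:
    # left and right of the new item, and the two groups meeting at k.
    naive = 3 + 4 * (m - 1)
    # The hooks hide one group of m-1 boundary words from the final step.
    hooked = 3 + 3 * (m - 1)
    return naive, hooked


table = {m: itg_decoding_exponents(m) for m in (2, 3, 4)}
# {2: (7, 6), 3: (11, 9), 4: (15, 12)}
```

The bigram and trigram rows reproduce the O(n^7) to O(n^6) and O(n^11) to O(n^9) reductions derived above.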
5 Generalization to Non-binary Bitext
Grammars
Although we have presented our algorithm as a de-
coder for the binary-branching case of Inversion
Transduction Grammar, the same factorization tech-
nique can be applied to more complex synchronous
grammars. In this general case, items in the dy-
namic programming chart may need to represent
non-contiguous spans in either the input or output
language. Because synchronous grammars with in-
creasing numbers of children on the right hand side
of each production form an infinite, non-collapsing
hierarchy, there is no upper bound on the number
of discontinuous spans that may need to be repre-
sented (Aho and Ullman, 1972). One can, however,
choose to factor the grammar into binary branching
rules in one of the two languages, meaning that dis-
continuous spans will only be necessary in the other
language.
If we assume m is larger than 2, it is likely that the language model combinations dominate computation. In this case, it is advantageous to factor the grammar in order to make it binary in the output language, meaning that the subrules will only need to represent adjacent spans in the output language. Then the hook technique will work in the same way, yielding O(n^{2(m−1)}) distinct types of items with respect to language model state, and 3(m−1) free indices to enumerate when combining a hook with a complete constituent to build a new item. However, a larger number of indices pointing into the input language will be needed now that items can cover discontinuous spans. If the grammar factorization yields rules with at most R spans in the input language, there may be O(n^{2R}) distinct types of chart items with respect to the input language, because each span has an index for its beginning and ending points in the input sentence. Now the upper bound of the number of free indices with respect to the input language is 2R + 1, because otherwise if one rule needs 2R + 2 indices, say i_1, ..., i_{2R+2}, then there are R + 1 spans (i_1, i_2), ..., (i_{2R+1}, i_{2R+2}), which contradicts the above assumption. Thus the time complexity at the input language side is O(n^{2R+1}), yielding a total algorithmic complexity of O(n^{3(m−1)+(2R+1)}).
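To be more concrete, we will work through a 4-ary translation rule, using a bigram language model. We write the rule as A → B C D E in output-language order, and denote hook scores by H and partial-constituent scores by G, with the nonterminals still to be consumed shown as superscripts. The standard DP equation is:
\[
\beta(A[i, j, u, v]) = \max_{\substack{v_3, u_1, v_1, u_4, v_4, u_2, \\ k_1, k_2, k_3, B, C, D, E}}
\left[
\begin{array}{l}
\beta(B[k_2, k_3, u, v_3]) \cdot \beta(C[i, k_1, u_1, v_1]) \cdot \beta(D[k_3, j, u_4, v_4]) \cdot \beta(E[k_1, k_2, u_2, v]) \\
\cdot \; P(A \to B\,C\,D\,E) \cdot \mathrm{bigram}(v_3, u_1) \cdot \mathrm{bigram}(v_1, u_4) \cdot \mathrm{bigram}(v_4, u_2)
\end{array}
\right]
\tag{5}
\]
This 4-ary rule is a representative difficult case. The underlying alignment pattern for this rule is as follows: the output order is B, C, D, E, while in the input C covers (i, k1), E covers (k1, k2), B covers (k2, k3), and D covers (k3, j).

It is a rule that cannot be binarized in the bitext space using ITG rules. We can only binarize it in one dimension and leave the other dimension having discontinuous spans. Without applying binarization and the hook trick, decoding with it according to Equation 5 requires time complexity of O(n^13). However, we can build the following partial constituents and hooks to do the combination gradually.

The first step finishes a hook by consuming one bigram. Its time complexity is O(n^5):
\[
H_1^{A; C, D, E}(k_2, k_3, u, u_1) = \max_{v_3, B}
\left[ \beta(B[k_2, k_3, u, v_3]) \cdot P(A \to B\,C\,D\,E) \cdot \mathrm{bigram}(v_3, u_1) \right]
\]
The second step utilizes the hook we just built and builds a partial constituent. The time complexity is O(n^7):
\[
G_1^{A; D, E}(i, k_1, k_2, k_3, u, v_1) = \max_{u_1, C}
\left[ H_1^{A; C, D, E}(k_2, k_3, u, u_1) \cdot \beta(C[i, k_1, u_1, v_1]) \right]
\]
By “eating” another bigram, we build the second hook using O(n^7):
\[
H_2^{A; D, E}(i, k_1, k_2, k_3, u, u_4) = \max_{v_1}
\left[ G_1^{A; D, E}(i, k_1, k_2, k_3, u, v_1) \cdot \mathrm{bigram}(v_1, u_4) \right]
\]
We use the last hook. This step has higher complexity, O(n^8):
\[
G_2^{A; E}(i, k_1, k_2, j, u, v_4) = \max_{u_4, k_3, D}
\left[ H_2^{A; D, E}(i, k_1, k_2, k_3, u, u_4) \cdot \beta(D[k_3, j, u_4, v_4]) \right]
\]
The last bigram involved in the 4-ary rule is completed and leads to the third hook, with time complexity of O(n^7):
\[
H_3^{A; E}(i, k_1, k_2, j, u, u_2) = \max_{v_4}
\left[ G_2^{A; E}(i, k_1, k_2, j, u, v_4) \cdot \mathrm{bigram}(v_4, u_2) \right]
\]
The final combination is O(n^7):
\[
\beta(A[i, j, u, v]) = \max_{u_2, k_1, k_2, E}
\left[ H_3^{A; E}(i, k_1, k_2, j, u, u_2) \cdot \beta(E[k_1, k_2, u_2, v]) \right]
\]
The overall complexity has been reduced to O(n^8) after using binarization on the output side and using the hook trick all the way to the end. The result is one instance of our general analysis: here R = 2, m = 2, and 3(m − 1) + (2R + 1) = 8.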
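The per-step complexities can be verified by listing the free variables of each step and counting them. In the sketch below the variable lists are read off the indices in Equation 5 and in the six steps; the step labels and the dictionary layout are assumptions for illustration.

```python
# Free variables (dependent on n) involved in each step of the 4-ary example.
steps = {
    "naive (Equation 5)": "i j k1 k2 k3 u v v3 u1 v1 u4 v4 u2",
    "hook 1":             "k2 k3 u v3 u1",
    "partial 1":          "i k1 k2 k3 u u1 v1",
    "hook 2":             "i k1 k2 k3 u v1 u4",
    "partial 2":          "i k1 k2 k3 j u u4 v4",
    "hook 3":             "i k1 k2 j u v4 u2",
    "final":              "i k1 k2 j u u2 v",
}

# The degree of each step is the number of distinct free variables.
degree = {name: len(set(names.split())) for name, names in steps.items()}

# The factored algorithm is dominated by its most expensive step.
factored_degree = max(d for name, d in degree.items()
                      if not name.startswith("naive"))

R, m = 2, 2                                   # spans per item, language model order
general_bound = 3 * (m - 1) + (2 * R + 1)     # exponent from the general analysis
```

The most expensive step is the one that introduces j while k3 is still free, and its degree matches `general_bound`.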
6 Implementation
The implementation of the hook trick in a practi-
cal decoder is complicated by the interaction with
pruning. If we build hooks looking for all words
in the vocabulary whenever a complete constituent
is added to the chart, we will build many hooks
that are never used, because partial hypotheses with
many of the boundary words specified by the hooks
may never be constructed due to pruning. In-
stead of actively building hooks, which are inter-
mediate results, we can build them only when we
need them and then cache them for future use. To
make this idea concrete, we sketch the code for bigram integrated decoding using ITG in Algorithm 1. It is worth noting that, for clarity, we are building hooks in the shape of [icon: a hook over the span (k, j) labeled with Z, v, and v′], instead of [icon: a hook over the span (k, j) labeled with X, Y, v, and v′] as we have been showing in the previous sections. That is, the probability for the grammar rule is multiplied in when a complete constituent is built, rather than when a hook is created. If we chose the original representation, we would have to create both straight hooks and inverted hooks, because the straight rules and inverted rules would be merged with the “core” hooks, creating more specified hooks.
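The lazy, cached hook construction of Algorithm 1 can be sketched in runnable form as follows. This is a minimal illustration under several assumptions that are not in the paper: spans are processed in order of increasing length, the chart is initialized from a hypothetical `lexicon` of one-to-one word translations (no null tokens), no pruning is applied, and the names `itg_decode`, `get_hooks`, and `update` as well as the toy data are invented for the example.

```python
from collections import defaultdict


def itg_decode(foreign, lexicon, rules, bigram):
    """Bigram-integrated Viterbi ITG decoding with lazily built, cached hooks.

    foreign: list of foreign words
    lexicon: dict f -> list of (X, english_word, prob) for rules X -> e/f
    rules:   list of (X, Y, Z, inverted, prob) for X -> [Y Z] or X -> <Y Z>
    bigram:  function (previous_word, next_word) -> probability
    Returns a chart: span (s, t) -> {(X, u, v): best score}.
    """
    n = len(foreign)
    beta = defaultdict(dict)
    hooks = {}   # (s, t, X, v_prev) -> {v: best bigram(v_prev, u) * beta}

    def get_hooks(s, t, X, v_prev):
        key = (s, t, X, v_prev)
        if key not in hooks:                  # build on demand, then cache
            table = {}
            for (label, u, v), score in beta[(s, t)].items():
                if label == X:
                    candidate = bigram(v_prev, u) * score
                    if candidate > table.get(v, 0.0):
                        table[v] = candidate
            hooks[key] = table
        return hooks[key]

    def update(span, item, score):
        if score > beta[span].get(item, 0.0):
            beta[span][item] = score

    for s, f in enumerate(foreign):           # lexical initialization
        for X, e, p in lexicon[f]:
            update((s, s + 1), (X, e, e), p)

    for length in range(2, n + 1):            # shorter spans are finished first
        for s in range(n - length + 1):
            t = s + length
            for S in range(s + 1, t):
                for X, Y, Z, inverted, p in rules:
                    # The child whose English comes first is enumerated;
                    # the other child is reached through a hook.
                    if inverted:
                        first, first_label, second, second_label = (S, t), Z, (s, S), Y
                    else:
                        first, first_label, second, second_label = (s, S), Y, (S, t), Z
                    for (label, u, v_prev), score in list(beta[first].items()):
                        if label != first_label:
                            continue
                        hook_table = get_hooks(second[0], second[1], second_label, v_prev)
                        for v, hook_score in hook_table.items():
                            update((s, t), (X, u, v), score * hook_score * p)
    return beta


# Toy example with a single nonterminal A and two foreign words.
lexicon = {"f1": [("A", "world", 1.0)], "f2": [("A", "hello", 1.0)]}
rules = [("A", "A", "A", False, 0.5), ("A", "A", "A", True, 0.5)]
lm = lambda a, b: 0.9 if (a, b) == ("hello", "world") else 0.1
chart = itg_decode(["f1", "f2"], lexicon, rules, lm)
best = max(chart[(0, 2)], key=chart[(0, 2)].get)
```

In the toy example the language model prefers "hello world", so the best item over the whole input is built with the inverted rule. Because hooks are only created when some enumerated item asks for them, a pruned chart would simply request fewer hooks.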
Algorithm 1 ITGDecode(Nt)
for all s, t such that 0 ≤ s < t ≤ Nt do
  for all S such that s < S < t do
    ▷ straight rule
    for all rules X → [Y Z] ∈ G do
      for all (Y, u1, v1) possible for the span of (s, S) do
        ▷ a hook on (S, t) with nonterminal Z and outside expectation v1 is required
        if not exist hooks(S, t, Z, v1) then
          build hooks(S, t, Z, v1)
        end if
        for all v2 possible for the hooks in (S, t, Z, v1) do
          ▷ combining a hook and a hypothesis, using straight rule
          β(s, t, X, u1, v2) = max { β(s, t, X, u1, v2), β(s, S, Y, u1, v1) · β+(S, t, Z, v1, v2) · P(X → [Y Z]) }
        end for
      end for
    end for
    ▷ inverted rule
    for all rules X → 〈Y Z〉 ∈ G do
      for all (Z, u2, v2) possible for the span of (S, t) do
        ▷ a hook on (s, S) with nonterminal Y and outside expectation v2 is required
        if not exist hooks(s, S, Y, v2) then
          build hooks(s, S, Y, v2)
        end if
        for all v1 possible for the hooks in (s, S, Y, v2) do
          ▷ combining a hook and a hypothesis, using inverted rule
          β(s, t, X, u2, v1) = max { β(s, t, X, u2, v1), β(S, t, Z, u2, v2) · β+(s, S, Y, v2, v1) · P(X → 〈Y Z〉) }
        end for
      end for
    end for
  end for
end for

routine build hooks(s, t, X, v′)
for all (X, u, v) possible for the span of (s, t) do
  ▷ combining a bigram with a hypothesis
  β+(s, t, X, v′, v) = max { β+(s, t, X, v′, v), bigram(v′, u) · β(s, t, X, u, v) }
end for

7 Conclusion
By showing the parallels between lexicalization for language model state and lexicalization for syntactic heads, we have demonstrated more efficient algorithms for previously described models of machine translation. Decoding for Inversion Transduction Grammar with a bigram language model can be done in O(n^6) time. This is the same complexity as the ITG alignment algorithm used by Wu (1997) and others, meaning complete Viterbi decoding is possible without pruning for realistic-length sentences. More generally, ITG with an m-gram language model is O(n^{3+3(m−1)}), and a synchronous context-free grammar with at most R spans in the input language is O(n^{3(m−1)+(2R+1)}). While this improves on previous algorithms, the degree in n is probably still too high for complete search to be practical with such models. The interaction of the hook technique with pruning is an interesting area for future work. Building the chart items with hooks may take more time than it saves if many of the hooks are never combined with complete constituents due to aggressive pruning. However, it may be possible to look at the contents of the chart in order to build only those hooks which are likely to be useful.

References
Aho, Albert V. and Jeffrey D. Ullman. 1972. The Theory
of Parsing, Translation, and Compiling, volume 1.
Englewood Cliffs, NJ: Prentice-Hall.
Charniak, Eugene. 1997. Statistical parsing with a
context-free grammar and word statistics. In Proceed-
ings of the Fourteenth National Conference on Arti-
ficial Intelligence (AAAI-97), pages 598–603, Menlo
Park, August. AAAI Press.
Collins, Michael. 1997. Three generative, lexicalised
models for statistical parsing. In Proceedings of the
35th Annual Conference of the Association for Compu-
tational Linguistics (ACL-97), pages 16–23, Madrid,
Spain.
Eisner, Jason and Giorgio Satta. 1999. Efficient parsing
for bilexical context-free grammars and head automa-
ton grammars. In 37th Annual Meeting of the Associ-
ation for Computational Linguistics.
Melamed, I. Dan. 2003. Multitext grammars and syn-
chronous parsers. In Proceedings of the 2003 Meeting
of the North American chapter of the Association for
Computational Linguistics (NAACL-03), Edmonton.
Wu, Dekai. 1996. A polynomial-time algorithm for sta-
tistical machine translation. In 34th Annual Meeting
of the Association for Computational Linguistics.
Wu, Dekai. 1997. Stochastic inversion transduction
grammars and bilingual parsing of parallel corpora.
Computational Linguistics, 23(3):377–403.
