? 
Comparing Lexicalized Treebank Grammars Extracted from 
Chinese, Korean, and English Corpora 
Fei Xia, Chung-hye Han, Martha Palmer, and Aravind Joshi 
University of Pennsylvania 
Philadelphia PA 19104, USA 
{fxia, chunghye, mpalmer, j oshi}@linc, cis. upenn, edu 
Abstract 
In this paper, we present a method 
for comparing Lexicalized Tree Ad- 
joining Grammars extracted from 
annotated corpora for three lan- 
guages: English, Chinese and Ko- 
rean. This method makes it possi- 
ble to do a quantitative comparison 
between the syntactic structures of 
each , thereby providing a 
way of testing the Universal Gram- 
mar Hypothesis, the foundation of 
modern linguistic theories. 
1 Introduction 
The comparison of the grammars extracted 
from annotated corpora (i.e., Treebanks) is 
important on both theoretical and engineer- 
ing grounds. Theoretically, it allows us to do 
a quantitative testing of the Universal Gram- 
mar Hypothesis. One of the major concerns 
in modern linguistics is to establish an ex- 
planatory basis for the similarities and varia- 
tions among s. The working assump- 
tion is that s of the world share a set 
of universal linguistic principles and the ap- 
parent structural differences attested among 
s can be explained as variation in 
the way the universal principles are instan- 
tiated. Comparison of the extracted syntac- 
tic trees allows us to quantitatively evaluate 
how similar the syntactic structures of differ- 
ent s are. From an engineering per- 
spective the extracted grammars and the links 
between the syntactic structures in the gram- 
mars are valuable resources for NLP applica- 
tions, such as parsing, computational lexicon 
development, and machine translation (MT), 
to name a few. 
In this paper we first briefly discuss some 
linguistic characteristics of English, Chinese, 
and Korean, and introduce the Treebanks for 
the three s. We then describe a 
tool that extracts Lexicalized Tree Adjoin- 
ing Grammars (LTAGs) from Treebanks and 
the results of its application to these three 
Treebanks. Next, we describe our methodol- 
ogy for automatic comparison of the extracted 
Treebank grammars, This consists primar- 
ily of matching syntactic structures (namely, 
templates and sub-templates) in each pair 
of Treebank grammars. The ability to per- 
form this type of comparison for different lan- 
guages has a definite positive impact on the 
possibility of sorting out the universal ver- 
sus -dependent features of s. 
Therefore, our grammar extraction tool is not 
only an engineering tool of great value in im- 
proving the efficiency and accuracy of gram- 
mar development, but it is also very useful for 
investigating theoretical linguistics. 
2 Three Languages and Three 
'rreebanks 
In this section, we briefly discuss some lin- 
guistic characteristics of English, Chinese, 
and Korean, and introduce the Treebanks for 
these s. 
2.1 Three Languages 
These three s belong to different lan- 
guage families: English is Germanic, Chinese 
is Sino-Tibetan, and Korean is Altaic (Com- 
rie, 1987). There are several major differences 
between these s. First, both English 
52 
and Chinese have predominantly subject- 
verb-object (SVO) word order, whereas Ko- 
rean has underlying SOV order. Second, the 
word order in Korean is freer than in English 
and Chinese in the sense that argument NPs 
are freely permutable (subject to certain dis- 
course constraints). Third, Korean and Chi- 
nese freely allow subject and object deletion, 
but English does not. Fourth, Korean has 
richer inflectional morphology than English, 
whereas Chinese has little, if any, inflectional 
morphology. 
2.2 Three Treebanks 
The Treebanks that we used in this paper are 
the English Penn Treebank II (Marcus et al., 
1993), the Chinese Penn Treebank (Xia et 
al., 2000b), and the Korean Penn Treebank 
(Chung-hye Han, 2000). The main param- 
eters of these Treebanks are summarized in 
Table 1.1 The tags in each tagset can be 
classified into one of four types: (1) syntac- 
tic tags for phrase-level annotation, (2) Part- 
Of-Speech (POS) tags for head-level annota- 
tion, (3) function tags for grammatical func- 
tion annotation, and (4) empty category tags 
for dropped arguments, traces, and so on. 
We chose these Treebanks because they all 
use phrase structure annotation and their an- 
notation schemata are similar, which facili- 
tates the comparison between the extracted 
Treebank grammars. Figure 1 shows an an- 
notated sentence from the Penn English Tree- 
bank. 
3 LTAGs and Extraction 
Algorithm 
In this section, we give a brief introduction to 
the LTAG formalism and to a system named 
LexTract, which we build to extract LTAGs 
from Treeb~.nks. 
1The reason why the average sentence length for 
Korean is much shorter than those for English and 
Chinese is that a big portion of the corpus for Ko- 
rean Treebank includes dialogues that contain many 
one-word replies, whereas English and Chinese cor- 
pora consist of newspaper articles. 
((S (ppoLOC (IN at) 
(NP (NNP FNX)) 
(NP-SBJ-1 (bINS underwriters)) 
(ADVP (RB stin)) 
(VP (VBP draft) 
(NP (bINS policies)) 
(S-MNR 
(NP-SBJ (-NONE- *-1 )) 
(VP (VBG using) 
(NP 
(NP (iNN fountain) (NNS pens)) 
(CO and) 
(NP (VBG blotting) (NN papers)))))))) 
Figure 1: An example from Penn English 
Treebank 
3.1 LTAG formalism 
LTAGs are based on the Tree Adjoining 
Grammar formalism developed by Joshi, 
Levy, and Takahashi (Joshi et al., 1975; Joshi 
and Schabes, 1997). The primitive elements 
of an LTAG are elementary trees (etrees). 
Each etree is associated with a lexical item 
(called the anchor of the tree) on its fron- 
tier. LTAGs possess many desirable proper- 
ties, such as the Extended Domain of Local- 
ity, which allows the encapsulation of all argu- 
ments of the anchor associated with an etree. 
There are two types of etrees: initial trees and 
auxiliary trees. An auxiliary tree represents 
a recursive structure and has a unique leaf 
node, called the foot node, which has the same 
syntactic category as the root node. Leaf 
nodes other than anchor nodes and foot nodes 
are substitution nodes. Etrees are combined 
by two operations: substitution and adjunc- 
tion. The resulting structure of the combined 
etrees is called a derived tree. The combina- 
tion process is expressed as a derivation tree. 
Figure 2 shows the etrees, the derived tree, 
and the derivation tree for the sentence un- 
derwriters still draft policies. Foot and sub- 
stitution nodes are marked by ,, and $, re- 
spectively. The dashed and solid lines in the 
derivation tree are for adjunction and substi- 
tution operations, respectively. 
3.2 The Form of Target Grammars 
Without further constraints, the etrees in 
the target grammar (i.e., the grammar to be 
extracted by LexTract) could be of various 
shapes. LexTract recognizes three types of 
53 
# of POS # ofsyntac- 
tags tic tags 
Language corpus size (words) 
English 1,174K 
Chinese 100K 
Korean 30K 
average sen- 
tence length 
23.85 words 
|Ib"~'-ll~,)_o~ 
34 
17 
of ftmc- # of empty cat- 
tion tags egory tags 
20 12 
26 7 
17 4 
Table 1: Size of the Treebanks and the tagsets used in each Treebank 
#h ~ vP -~,~ #3: S #4: 
I ADVP VP" ", ~, vP / I ~s I ', ..... ~ /Nns 
vBP ~/ I I ? ' v" ,,. 
m~.,ria.~ ~,ill draR pc u.mm,..~ 
(a) etre~ 
NF VP 
Jail~ draft NN$ 
I ptfliei~ 
draft(#3) 
--"--'"T'----"---- undcxwrRcrs(# 1 ) \ policies(#4) 
still(#2) 
(h) dcrivcd trc¢ (c) dczivatitm trcc 
Figure 2: Etrees, derived tree, and derivation 
tree for underwriters still draft policies 
x TM 
x ~ wq 
y~ X~= w~ X m )¢,. cc I X ~ 
xo z,~ /~ x, 
{ xo z~ x/~'X.z ,~ knti.~l 
itgm I 
(a) spinc-ctr~ (b) rm~-ctrce (c) c,,nj-etree 
Figure 3: Three types of elementary trees in 
the target grammar 
relation (namely, predicate-argument, modi- 
fication, and coordination relations) between 
the anchor of an etree and other nodes in the 
etree, and imposes the constraint that all the 
etrees to be extracted should fall into exactly 
one of the three patterns in Figure 3. 
The spine-etrees for predicate-argument 
relations. X ° is the head of X m and the 
anchor of the etree. The etree is formed 
by a spine X m ~ X m-1 ~ .. ~ X ° and 
the arguments of X °. 
The mod-etrees for modification rela- 
tions. The root of the etree has two chil- 
dren, one is a foot node with the label 
Wq, and the other node X m is a modifier 
of the foot node. X m is further expanded 
into a spine-etree whose head X ° is the 
anchor of the whole mod-etree. 
The conj-etrees for coordination rela- 
tions. In a conj-etree, the children of the 
root are two conjoined constituents and 
a node for a coordination conjunction. 
One conjoined constituent is marked as 
the foot node, and the other is expanded 
into a spine-etree whose head is the an- 
chor of the whole tree. 
Spine-etrees are initial trees, whereas mod- 
etrees and conj-etrees are auxiliary trees. 
3.3 Extraction algorithm 
The core of LexTract is an extraction algo- 
rithm that takes a Treebank sentence such as 
the one in Figure 1 and Treebank-specific in- 
formation provided by the user of LexTract, 
and produces a set of etrees as in Figure 4 
and a derivation tree. We have described 
LexTract's architecture, its extraction algo- 
rithm, and its applications in (Xia, 1999; Xia 
et al., 2000a). Therefore, we shall not re- 
peat them in this paper other than point- 
ing out that LexTract is completely - 
independent. 
3.4 Experiments 
The results of running LexTract on English, 
Chinese, and Korean Treebanks are shown in 
Table 2. Templates are etrees with the lexical 
items removed. For instance, #3, #6, and #9 
in Figure 4 are three distinct etrees but they 
share the same template. 
Figure 5 shows the log frequency of tem- 
plates in the English Treebank and percent- 
age of template tokens covered by template 
54 
m 
m ~i\]lml~iJ 
template etree word etree types 
types types types per word type 
6926 131,397 49,206 
1140 21,125 10,772 
etree types 
per word token 
CFG rules 
(unlexicalized) 
2.67 34.68 1524 
1.96 9.13 515 
1.45 2.76 : 177 
Table 2: Grammars extracted from three Treebanks 
#1: re2: #3: #4: #5: #6: 
S NP NP VP S NP 
PP S* NNS ADVP VP* NP| VP NNS /~ NNP 
,..P, I I R', vBf'~-.~ I 
l FNX enderwri,~,~ { { polid~ 
s0ll draft at 
#7: #8: #9: #I0: #ll: #12: 
VP NP NP NP NP 
VP" S NN NP" N VBG Np~ NIP" CC| NP 
NP VP { ~ finmtain bltXting 
e V~O NP; pen.~ I pap~ 
~ing 
Figure 4: The extracted etrees from the fully 
bracketed ttree 
types. 2 In both cases, template types are 
sorted according to their frequencies and plot- 
ted on the X-axis. The figure shows that 
a small subset of template types, which oc- 
curs very frequently in the Treebank and can 
be seen as the core of the Treebank gram- 
mar, covers the majority of template tokens 
in the Treebank. For instance, the most 
frequent template type covers 9.37% of the 
template tokens and the top 100 (500, 1000 
and 1500, respectively) template types cover 
87.1% (96.6%, 98.4% and 99.0%, respectively) 
of the tokens, whereas about half (3440) of 
the template types occur once, accounting for 
only 0.32% of template tokens in total. 
4 Comparing Three Treebank 
Grammars 
In this section, we describe our methodology 
for comparing Treebank gr3.mmars and the 
experimental results. 
4.1 Methodology 
To compare Treeb~nb grammars, we need to 
ensure that the Treebank grammars are based 
on the same tagset. To achieve that, we first 
create a new tagset that includes all the tags 
2If a template occurs n times in the corpus, it is 
counted as one template type but n template tokens. 
(a) Frequency (b) Coverage 
Figure 5: Etree template types and template 
tokens in the Penn English Treebank 
(X-axes: (a) and (b) template types 
Y-axes: (a) log frequency of templates; (b) 
percentage of template token covered by tem- 
plate types) 
from the three Treebanks. Then we merge 
some tags in this new tagset into a single tag. 
This step is necessary because certain distinc- 
tions among some tags in one  do not 
exist in another . For example, the 
English Treebank has distinct tags for verbs 
in past tense, past participals, gerunds, and 
so on; however, no such distinction is mor- 
phologically marked in Chinese and, there- 
fore, the Chinese Treebank uses the same tag 
for verbs regardless of the tense and aspect. 
To make the conversion straightforward for 
verbs, we use a single tag for verbs in the new 
tagset. Next, we replace the tags in the origi- 
nal Treebanks with the tags in the new tagset, 
and then re-run LexTract to build Treebank 
gr~mraars from those Treebanks. 
Now that the Treebank grammars are based 
on the same tagset, we can compare them ac- 
cording to the templates and sub-templates 
that appear in more than one 'rreebank m 
that is, given a pair of Treebank grammars, 
we first calculate how many templates oc- 
cur in both grammars; 3 Next, we decompose 
SIdeally, to get more accurate comparison results, 
we would like to compare etrees, rather than templates 
(which are non-lexicalized); however, comparing etrees 
requires bilingual parallel corpora, which we are cur- 
55 
templates: sub-templates: 
~ spine: S -> VP -> V NP| 
~ subca~ (:NP, V@. NP) 
V@ NP! with root S 
(a) spine-etree template 
VP spine: PP-> P 
VP'~ PP ~ subeat: (P@, NP) with root PP 
P@ NP~ rood-pair: (VP*, PP) 
(b) mod-etree template 
I .......~ spine: NP->N 
NP* cc~ r~P ~ subeat ~@) with root NP 
lq@ conj-tuple: (NP*, CC, NP) 
(c) conj-etree template 
Figure 6: The decompositions of etree tem- 
plates (In sub-templates, @ marks the anchor 
in subcategorization frame, * marks the mod- 
ifiee in a modifier-modifiee pair.) 
each template into a list of sub-templates (e.g., 
spines and subcategorization frames) and cal- 
culate how many of those sub-templates occur 
in both grammars. A template is decomposed 
as follows: A spine-etree template is decom- 
posed into a spine and a subcategorization 
frame; a mod-etree template is decomposed 
into a spine, a subcategorization frame, and a 
modifier-modifiee pair; a conj-etree template 
is decomposed into a spine, a subcategoriza- 
tion frame, and a coordination tuple. Figure 
6 shows examples of this decomposition for 
each type of template. 
4.2 Experiments 
After tags in original Treebn.nks being re- 
placed with the tags in the new tagset, the 
numbers of templates in the new Treebank 
gra.mmars decrease by about 50%, as shown 
in the second colnmn of Table 3 (cf. the sec- 
ond column in Table 2). Table 3 also lists the 
numbers of sub-templates, such as spines and 
subcategorization frames, for each grammar. 
Table 4 lists the numbers of template types 
shared by each pair of Treeba.nk gr3.mmars 
and the percentage of the template tokens 
rently building. 
in each Treebank which are covered by these 
common template types. For example, there 
are 237 template types that appear in both 
English and Chinese Treebank grammars. 
These 237 template types account for 80.1% 
of template tokens in the English Treebank, 
and 81.5% of template tokens in the Chi- 
nese Treebank. The table shows that, al- 
though the number of matched templates are 
not very high, they are among the most fre- 
quent templates and they account for the ma- 
jority of template tokens in the Treebanks. 
For instance, in the (Eng, Ch) pair, the 237 
template types that appear in both gram- 
mars is only 77.5% of all the English template 
types, but they cover 80.1% of template to- 
kens in the English Treebank. If we define the 
core grammar of a  as the set of the 
templates that occur very often in the Tree- 
bnnk, the data suggest that the majority of 
the core grammars are easily inter-mappable 
structures for these three s. 
If we compare sub-templates, rather than 
templates, in the Treebank grammars, the 
percentages of matched sub-template tokens 
(as in Table 5) are higher than the percent- 
ages of matched template tokens. This is be- 
cause two distinct templates may share com- 
mon sub-templates. 
4.3 Unmatched templates 
Our previous experiments (see Table 4) show 
that the percentages of unmatched template 
tokens in three Treebanks range from 16.0% 
to 43.8%, depending on the  pairs. 
Given a  pair, there are many pos- 
sible reasons why a template appears in one 
Treebank grammar, but not in the other. We 
divide those unmatched templates into two 
categories: spuriously unmatched templates 
and truly unmatched templates. 
Spuriously unmatched templates Spu- 
riously unmatched templates are templates 
that either should have found a matched tem- 
plate in the other gra.mmar or should not have 
been created by LexTract in the first place 
if the Treebanks were complete, uniformly 
annotated, and error-free. A spuriously un- 
matched template exists because of one of the 
56 
templates subtemplates 
spines subcat~ames mod-pairs 
Eng 3139 500 541 332 53 
Ch 547 108 180 152 18 
Kor 271 55 58 53 6 
(Eng,Ch) 
(Eng, Kor) 
(Ch, Kor) 
conj-tuples total 
1426 
458 
172 
Table 3: Treebank grammars with the new tagset 
type (#) 
token (%) 
type (#) 
token (%) 
type (:~) 
token (%) 
matched templates 
(237, 237) 
(80.1, 81.5) 
(83, 83) 
(57.7, 82.8) 
(59,59) 
(57.2, 84.0) 
templates with 
unique tags 
(536, 99) 
(2.8, 12.3) 
(2075, 6) 
(28.1, 0.1) 
(324,6) 
(29.4, 0.1) 
other unmatched 
templates 
(2366, 211) 
(17.1, 6.2) 
(981, 182) 
(14.2, 17.1) 
(164, 206) 
(13.4, 16.0) 
Table 4: Comparisons of templatea in three Treebank grammars 
following reasons: 
(Sl) Treebank size: The template is lin- 
guistically sound in both s, and, 
therefore, should belong to the grarnmars 
for these s. However, the tem- 
plate appears in only one Treebank gram- 
mar because the other Treebank is too 
small to include such a template. Figure 
7(S1) shows a template that is valid for 
both English and Chinese, but it appears 
only in the English Treebank, not in the 
Chinese Treebank. 
($2) Annotation difference: Treebanks 
may choose different annotations for 
the same constructions; consequentially, 
the templates for those constructions 
look different. Figure 7($2) shows the 
templates used in English and Chinese 
for a VP such as "surged 7 (dollars)". 
In the template for English, the QP 
projects to an NP, but in the template 
for Chinese, it does not. 
($3) Treeb~nk annotation error: A tem- 
plate in a Treebank may result from an- 
notation errors in that Treebank. If no 
corresponding mistakes are made in the 
other Treebank, the template in the first 
Treebank will not match any template in 
the second 'I~reebank. For instance, in the 
English Treebank the word about in the 
sentence About 5 people showed up is of- 
ten mis-tagged as a preposition, resulting 
in the template in Figure 7($3). Not sur- 
prisingly, that template does not match 
any template in the Chinese Treebank. 
Truly unmatched templates A truly un- 
matched template is a template that does not 
match any template in the other Treebank 
even if we assume both Treebanks are per- 
fectly annotated. Here, we list three reasons 
why a truly unmatched template exist. 
(T1) Word order: The word order deter- 
mines the positions of arguments w.r.t. 
their heads, and the positions of modi- 
fiers w.r.t, their modifiees. If two lan- 
guages have different word orders, their 
templates which include arguments of a 
head or a modifier are likely to look dif- 
ferent. For example, Figure 8(T1) show 
the templates for transitive verbs in Chi- 
nese and Korean grammars. The tem- 
plates do not match because of the dif- 
ferent positions of the object of the verb. 
(T2) Unique tags: For each pair of lan- 
guages, some Part-of-speech tags and 
syntactic tags may appear in only one 
. Therefore, the templates with 
those tags will not match any templates 
in the other . For instance, in 
Korean the counterparts of preposition 
phrases in English and Chinese are noun 
phrases (with postpositions attaching to 
them, not preposition phrases); there- 
fore, the templates with PP in Chinese, 
57 
(Eng,Ch) 
(Eng, Kor) 
(Ch, Kor) 
Table 
spines subcat frames rood-pairs conj-tuples total 
type (60,60) (92, 92) (83,83) (II,II) (246,246) 
token (94.7,87.2) (94.0, 86.3) (82.6, 80.0) (84.2, 99.1) (91.4, 85.2) 
type (39, 39) (40, 40) (46, 46) (1, 1) (126,126) 
token (70.3, 96.9) (62.1, 96.6) (56.8, 99.5) (9.3, 52.3) (63.4,97.3) 
type (28, 28) (25,25) (29,29) (I, I) (83, 83) 
token I (74.2, 99.2) (63.1, 98.1) (60.2, 93.4) (0.i, 0.4) (66.1, 96.9) 
5: Comparisons of sub-templates in three Treebank grammars 
VP 
VP* CC! VP 
V @ NIL Nl-~ 
English 
vp yP A 
VP* NP " VP* 
I QP Qp I 
I cD~ 
CD~ 
English Chinese 
QP 
P@ QP* 
English 
(S 1) Treebank size ($2) annotation difference ($3) annotation crmr 
Figure 7: Examples of spuriously unmatched templates 
such as the left one in Figure 8(T2), do 
not match any template in Korean. 
(T3) Unique syntactic relations: Some 
syntactic relations may be present in 
only one of the pair of s being 
compared. For instance, the template 
in Figure 8(T3) is used for the sentence 
such as "You should go," said John, 
where the subject of the verb said ap- 
pears after the verb. No such template 
exists in Chinese. 
So far, we have listed six possible reasons 
for unmatched templates. Without manually 
examining all the unmatched templates, it is 
difficult to tell how many unmatched tem- 
plates are caused by a particular reason. Nev- 
ertheless, these reasons help us to interpret 
the results in Table 4. For instance, the ta- 
ble shows that Korean grammars cover only 
57.7% of template tokens in the English Tree- 
bank, and 57.2% in the Chinese Treebank, 
whereas the coverages for other  pairs 
are all above 80%. We suspect that this 
difference of coverage is mainly caused by 
(S1), (T1), and (T2). That is, first, Ko- 
rean Treebank is much smaller than the En- 
glish and the Chinese Treebanks, English and 
Chinese Treebanks may have many tree tem- 
plates that simply was not found in the Ko- 
rean Treebank; Second, English and Chinese 
are predominantly head-initial, whereas Ko- 
rean is head-final, therefore, many templates 
in English and Chinese can not find matched 
templates in Korean because of the word or- 
der difference; Third, Korean does not have 
preposition phrases, causing all the templates 
in English and Chinese with PPs become un- 
matched. To measure the effect of the word 
order factor to the matching rate, we re-did 
the experiment in Section 4.2, but this time 
we ignored the word order -- that is, we treat 
templates as unordered trees. The results are 
given in Table 6. Comparing this table with 
Table 4, we can clearly see that, the percent- 
ages of matched templates increase substan- 
tially for (Eng, Kor) and (Ch, Kor) when the 
word order is ignored. Notice that the match- 
ing percentage for (Eng, Ch) does not change 
as much because the word orders in English 
and Chinese are much similar than the orders 
in English and Korean. 
5 Conclusion 
We have presented a method of quantitatively 
comparing LTAGs extracted from Treebanks. 
Our experimental results show a high pro- 
portion of easily inter-mappable structures, 
giving a positive implications for Universal 
Grammar hypothesis, We have also described 
a number of reasons why a particular tern- 
58 
$ $ 
A 
V~ NPt NPt V~ 
Chinese Korean 
(TI) word order 
vP A 
VP* NP VP* I 
P@ NPt N@ 
Chinese Korean 
(T2) unique rags 
s s("'s 
NPt 
V~ $ ! 
£ 
English 
(T3) unique relation 
Figure 8: Truly unmatched templates 
(Eng,Ch) 
(Eng, Kor) 
(Ch, Kor) 
matched templates 
type (334, 259) 
token (82.8, 82.2) 
type (222, 167) 
token (66.4, 92.4) 
type (126,125) 
token (68.3, 97.3) 
tag mismatches 
i (536, 99) 
! (2.8, 12.3) 
I (2075, 6) 
! (28.1, 0.1) 
(324,6) 
(29.4, 0.1) 
other mismatches 
(2269, 189) 
(14.4, 5.5) 
(842, 98) 
(5.5, 7.5) 
(97, 140) 
(2.3, 2.6) 
Table 6: Comparisons of templates w/o orders 
plate does not match any template in other 
s and tested the effect of word order 
on matching percentages. 
There are two natural extensions of this 
work. First, running an alignment algorithm 
on parallel bracketed corpora to produce 
word-to, word mappings. Given such word-to- 
word mappings and our template matching 
algorithm, we can automatically create lexi- 
calized etree-to-etree mappings, which can be 
used for semi-automatic transfer lexicon con- 
struction. Second, LexTract can build deriva- 
tion trees for each sentence in the corpora. By 
comparing derivation trees for parallel sen- 
tences in two s, instances of struc- 
tural divergences (Dorr, 1993; Dorr, 1994; 
Palmer et al., 1998) can be automatically de- 
tected. 

References 
Chung-hye Hart. 2000. Bracketing Guide- 
lines for the Penn Korean Treebank (draft). 
www.cis.upenn.edu/xtag/korean.tag. 
Bernard Comrie. 1987. The World's Major Lan- 
guages. Oxford University Press, New York. 
B. J. Dorr. 1993. Machine ~D'anslation: a View 
from the Lexicon. MIT Press, Boston, Mass. 
B. J. Dorr. 1994. Machine translation diver- 
gences: a formal description and proposed so- 
lution. Computational Linguistics, 20(4):597- 
635. 
Aravind Joshi and Yves Schabes. 1997. Tree 
Adjoining Grammars. In A. Salomma and 
G. Rosenberg, editors, Handbook off For- 
mal Languages and Automata. Springer-Verlag, 
Herdelberg. 
Aravind K. Joshi, L. Levy, and M. Takahashi. 
1975. Tree Adjunct Grammars. Journal off 
Computer and System Sciences. 
M. Marcus, B. Santorini, and M. A. 
Marcinkiewicz. 1993. Building a Large 
Annotated Corpus of English: the Penn 
Treebank. Computational Lingustics. 
Martha Palmer, Owen Rainbow, and Alexis Nasr. 
1998. Rapid Prototyping of Domain-Specific 
Machine Translation System. In Proc. of 
AMTA-1998, Langhorne, PA. 
Fei Xia, Martha Palmer, and Aravind Joshi. 
2000a. A Uniform Method of Grammar Ex- 
traction and its Applications. In Proc. off Joint 
SIGDAT Conference on Empirical Methods in 
Natural Language Processing and Very Large 
Corpora (EMNLP/VLC). 
Fei Xia, Martha Palmer, Nianwen Xue, 
Mary Ellen Okurowski, John Kovarik, Shizhe 
Huang, Tony Kroch, and Mitch Marcus. 
2000b. Developing Guidelines and Ensuring 
Consistency for Chinese Text Annotation. 
In Proc. off the 2nd International Confer- 
ence on Language Resources and Evaluation 
(LREC-2000), Athens, Greece. 
Fei Xia. 1999. Extracting Tree Adjoining Gram- 
mars from Bracketed Corpora. In Proc. off 5th 
Natural Language Processing Pacific Rim Sym- 
posium (NLPRS-99), Beijing, China. 
