Building A Large Chinese Corpus 
Annotated With Semantic Dependency 
 
LI Mingqin 
Department of Electronic Engineering, 
Tsinghua University,  
Beijing 100084, China 
lmq@thsp.ee.tsinghua
.edu.cn  
LI Juanzi 
Department of Computer Science 
and Technology,  
Tsinghua University, 
 Beijing 100084, China 
ljz@thsp.ee.tsinghu
a.edu.cn 
DONG Zhendong 
Research Centre of Computer & 
Language Engineering, 
Chinese Academy of Sciences, 
Beijing, 100084,China 
dzd@keenage.com 
WANG Zuoying 
Department of Electronic Engineering,  
Tsinghua University, 
 Beijing 100084, China 
wzy-dee@tsinghua.edu.cn  
LU Dajin 
Department of Electronic Engineering,  
Tsinghua University, 
 Beijing 100084, China 
ludj@mail.tsinghua.edu.cn 
 
 
 
Abstract 
At present most of corpora are annotated 
mainly with syntactic knowledge. In this 
paper, we attempt to build a large corpus 
and annotate semantic knowledge with 
dependency grammar. We believe that 
words are the basic units of semantics, 
and the structure and meaning of a 
sentence consist mainly of a series of 
semantic dependencies between 
individual words. A 1,000,000-word-
scale corpus annotated with semantic 
dependency has been built. Compared 
with syntactic knowledge, semantic 
knowledge is more difficult to annotate, 
for ambiguity problem is more serious. In 
the paper, the strategy to improve 
consistency is addressed, and congruence 
is defined to measure the consistency of 
tagged corpus.. Finally, we will compare 
our corpus with other well-known 
corpora. 
1   Introduction 
As basic research tools for investigators in natural 
language processing, large annotated corpora play 
an important role in investigating diverse lan-
guage phenomena, building statistical language 
models, evaluating and comparing kinds of pars-
ing models. At present most of corpora are anno-
tated mainly with syntactic knowledge, though 
some function tags are added to annotate semantic 
knowledge. For example, the Penn Treebank 
(Marcus et al., 1993) was annotated with skeletal 
syntactic structure, and many syntactic parsers 
were evaluated and compared on the corpus. For 
Chinese, some corpora annotated with phrase 
structure also have been built, for instance the 
Penn Chinese Treebank (Xia et al., 2000) and Sina 
Corpus (Huang and Chen, 1992). A syntactic an-
notation scheme based on dependency was pro-
posed by (Lai and Huang, 2000), and a small 
corpus was built for testing. However, very lim-
ited work has been done with annotation semantic 
knowledge in all languages. From 1999, Berkeley 
started FrameNet project (Baker et al., 1998), 
which produced the frame-semantic descriptions 
of several thousand English lexical items and 
backed up these description with semantically an-
notated attestations from contemporary English 
corpus. Although few corpora annotated with se-
mantic knowledge are available now, there are 
some valuable lexical databases describing the 
lexical semantics in dictionary form, for example 
English WordNet (Miller et al., 1993) and Chinese 
HowNet (Dong and Dong, 2001). 
For Chinese, many attentions have been natu-
rally paid to researches on semantics, because 
Chinese is a meaning-combined language, its syn-
tax is very flexible, and semantic rules are more 
stable than syntactic rules. For instance, in Chi-
nese it is very pervasive that more than one part-of 
-speeches a word has, and a word does not have 
tense or voice flectional transition under different 
tenses or voices. Nevertheless, no large Chinese 
corpus annotated with semantic knowledge has 
ever been built at present. In Semantic Depend-
ency Net (SDN), we try to describe deeper seman-
tic dependency relationship between individual 
words and represent the meaning and structure of 
a sentence by these dependencies. 
Compared with syntactic corpus, it is more dif-
ficult to build a semantic corpus, for the granular-
ity of semantic knowledge is smaller, and 
behaviors of different words differ more greatly. 
Furthermore, ambiguity in semantics is commoner. 
Different people may have different opinions on 
understanding the same word in the same sentence, 
and even the same people may have different 
opinions on understanding the same word in dif-
ferent occasions. In this paper, we emphatically 
discuss the strategy to improve the consistency of 
Semantic Dependency Net. 
The paper is organized as follows. The tagging 
scheme is discussed in Section 2, which describes 
the semantic dependency grammar and the tag set 
of semantic relations. In section 3, we describe the 
tagging task. First, we briefly introduce the text of 
this corpus, which has been tagged with semantic 
classes. Second, we describe the strategy to im-
prove consistency during tagging and checking. 
At last, congruence is defined to measure the con-
sistency of tagged corpus. In Section 4, we briefly 
introduce some of the works on the corpus, and 
indicate the directions that the project is likely to 
take in the future. Finally, we compare SDN cor-
pus with some other well-known corpora.   
  
 
Figure 1:  A sample sentence from the corpus. (a) The sentence tagged with semantic classes; (b) The sen-
tence annotated with semantic dependency; (c) The semantic dependency tree of the sentence, headwords 
are linked with bold lines, and modifier words are linked with arrow lines. 
2 The tagging scheme of semantic depend-
ency 
2.1 Semantic dependency grammar 
Like Word grammar (Hudson, 1998), We believe 
that words are the basic units of semantics, and 
the structure and meaning of a sentence consist 
mainly of semantic dependencies between indi-
vidual words. So a sentence could be annotated 
with a series of semantic dependency relations (Li 
Juanzi and Wang, 2002). Let S be a sentence 
composed of words tagged with semantic classes, 
},,,,,,{
2211
><><><=
nn
swswswS L
. A 
list of semantic dependency relations is defined as: 
)}(,),2(),1({ nSRSRSRSRL L=
, 
where 
),()(
ii
rhiSR =
. SR stands for ‘ semantic 
relation’. 
),()(
ii
rhiSR =
 states that the 
i
h
-th 
word is the headword to the i-th word with seman-
tic relation 
i
r
. If the word j is the root, 
)( jSR
 is 
defined to be (-1, “kernel word”).  
For example, a sample sentence from the cor-
pus is shown in Figure 1 (a). The semantic de-
pendency relation list and semantic dependency 
tree are shown in Figure 1 (b) and (c) respectively. 
More samples will be seen in Appendix A. 
In semantic dependency grammar, the head-
word of sentence represents the main meaning of 
the whole sentence, and the headword of constitu-
ent represents the main meaning of the constituent. 
In a compound constituent, the headword inherits 
the headword of the head sub-constituent, and 
headwords of other sub-constituents are dependent 
on that headword. We select the word that can 
represent the meaning of the constituent to the 
most extent as headword. For example, the verb is 
the headword of verb phrase, the object is the 
headword of preposition phrase, and the location 
noun is the headword of the location phrase.  
At the same time, semantic dependency rela-
tions do not damage the phrase structure, that is, 
all words in the same phrase are in the same sub-
tree whose root is the headword of the phrase. 
Therefore, when tagging dependency relations, 
semantic and syntactic restrictions are both taken 
into account. The structures of dependency tree 
are mainly determined by syntactic restrictions, 
and the semantic relations are mainly determined 
by semantic restrictions. For example, in Figure 1 
the phrase “g1866g2469g7138g6116g7536g11352”( of his invention pro-
duction) modifies the phrase “g6524g5203g1363g11004” (popu-
larization and application) in syntax, so the word 
“g6524g5203” (popularization) governs the word “g6116g7536” 
(production).  However, the production is the con-
tent of the action popularization in semantics, so 
the relation between them is “content”. 
Our tagging scheme is more concise compared 
with phrase structure grammar, in which the 
boundaries of all phrases have to be marked and 
the corresponding labels have to be tagged. In the 
semantic dependency grammar, phrases are im-
plicit, but play no part in grammar. More empha-
sis is paid to the syntactic and semantic functions 
of the word, especially of the headword. 
2.2 The dependency relation tag set 
The dependency relation tag set mainly consists of 
three kinds of relations: semantic relations, syn-
tactic relations and special relations. Semantics is 
the main content of this corpus, so semantic rela-
tions are in the majority, and syntactic relations 
are used to annotate the special structures that do 
not have exact sense in terms of semantics. In ad-
dition, there are two special relations: “kernel 
word” is to indicate the headword of a sentence, 
and “failure” is to indicate the word that cannot be 
annotated with dependency relations because the 
sentence is not completed. 
The selections of semantic relations were re-
ferred to HowNet (Dong and Dong, 2001). 
HowNet is a lexical database, which describes the 
relations among words and concepts as a network. 
In HowNet, senventy-six semantic relations are 
defined to describe all relations among various 
concepts, and most of them describe the semantic 
relations between action and other concepts. With 
these semantic relations, necessary role frame is 
further defined. The roles in the necessary role 
frame must take part in the action in real word, 
while these roles may not appear in the same sen-
tence. Hong Kong Technology University has 
successfully tagged a news corpus with the neces-
sary role frame (Yan and Tan, 1999), which 
shows that these roles can describe all semantic 
phenomena in real texts. 
 In order to make tagging task easier and the 
corpus more suitable for statistical learning, we 
have pared down some relations in HowNet and 
got fifty-nine semantic relations. Some HowNet 
relations seldom occurred in the corpus, and their 
semantic functions are somewhat similar, so they 
are merged. Some relations are ambiguous, for 
example “degree” and “range”. In order to im-
prove the consistency, we also merge these two 
relations. 
Semantic relations can describe the relations 
between notional words, but they cannot annotate 
function words in some special phrase structures. 
So nine syntactic relations are added. 
The tag set is listed in table 1. Full definition 
of each dependency relation can be seen in (Li 
Mingqin et al., 2002). 
   
g53g72g79g72g89g68g81g87g3
g18g1863g13007g1039g1319g3
g51g82g86g86g72g86g86g82g85g3
g18g20058g7389g13785g3
g40g91g76g86g87g72g81g87g3
g18g4396g10628g1319g3
g40g91g83g72g85g76g72g81g70g72g85g3
g18g13475g20576g13785g3
g36g74g72g81g87g3
g18g7057g1119 
g39g72g86g70g85g76g83g87g76g89g72g3
g18g6563g1901g1319 
g51g82g86g86g72g86g86g76g82g81g3
g18g2356g7389g10301g3
g51g68g87g76g72g81g87g3
g18g2475g1119g3
g51g68g85g87g50g73g55g82g88g70g75g3
g18g16314g2462g18108g1226g3
g38g82g81g87g72g81g87g3
g18g1881g4493g3
g44g86g68g3
g18g12879g6363g3
g51g68g85g87g50g73g3
g18g18108g2010g3
g58g75g82g79g72g3
g18g6984g1319g3
g53g72g86g88g79g87g3
g18g13479g7536g3
g55g68g85g74g72g87g3
g18g11458g7643g3
g38g82g86g87g3
g18g1207g1227 
g53g72g86g88g79g87g40g89g72g81g87g3
g18g13479g7536g1119g1226g3
g40g89g72g81g87g51g85g82g70g72g86g86g3
g18g1119g1226g17819g12255g3
g54g88g70g70g72g72g71g76g81g74g3
g18g6521g13505g3
g36g70g70g82g80g83g68g81g76g80g72g81g87g3
g18g1288g19555g3
g48g82g71g76g73g76g72g85g3
g18g6563g17860g3
g53g72g86g87g85g76g70g87g76g89g72g3
g18g19492g4462g3
g52g88g68g81g87g76g87g92g3
g18g6980g18339g3
g36g83g83g82g86g76g87g76g89g72g3
g18g2528g1313g16833g3
g39g72g74g85g72g72g3
g18g12255g5242 
g48g68g81g81g72g85g3
g18g7053g5347g3
g41g85g72g84g88g72g81g70g92g3
g18g20069g10587g3
g55g76g80g72g86g3
g18g2172g18339g3
g54g70g82g83g72g3
g18g14551g3272g3
g38g82g80g80g72g81g87g3
g18g16792g16782g3
g55g76g80g72g3
g18g7114g19400g3
g55g76g80g72g44g81g76g3
g18g17227g3999g7114g19400g3
g55g76g80g72g41g76g81g3
g18g13468g8502g7114g19400g3
g39g88g85g68g87g76g82g81g36g73g87g72g85g40g89g72g81g87g3
g18g2530g5322g7114g8585g3
g39g88g85g68g87g76g82g81g3
g18g17839g12255g7114g8585g3
g55g76g80g72g3
g18g7114g19400 
g55g76g80g72g44g81g76g3
g18g17227g3999g7114g19400g3
g55g76g80g72g41g76g81g3
g18g13468g8502g7114g19400g3
g39g88g85g68g87g76g82g81g3
g18g17839g12255g7114g8585g3
g39g88g85g68g87g76g82g81g36g73g87g72g85g40g89g72g81g87g3
g18g2530g5322g7114g8585g3
g55g76g80g72g53g68g81g74g72g3
g18g7114g17329 
g47g82g70g68g87g76g82g81g3
g18g3800g6164g3
g47g82g70g68g87g76g82g81g44g81g76g3
g18g2419g3800g6164g3
g47g82g70g68g87g76g82g81g41g76g81g3
g18g13468g3800g6164g3
g47g82g70g68g87g76g82g81g55g75g85g88g3
g18g17902g17819g3800g6164g3
g54g87g68g87g72g44g81g76g3
g18g2419g10378g5589g3
g54g87g68g87g72g41g76g81g3
g18g13468g10378g5589g3
g51g68g85g87g81g72g85g3
g18g11468g1288g1319g3
g38g82g81g87g85g68g86g87g3
g18g2454g10043g1319g3
g38g82g81g87g72g81g87g38g82g80g83g68g85g72g3
g18g8616g17751g1881g4493 
g52g88g68g81g87g76g87g92g38g82g80g83g68g85g72g3
g18g8616g17751g18339g3
g48g72g68g81g86g3
g18g6175g8585g3
g44g81g86g87g85g88g80g72g81g87g3
g18g5049g1867g3
g48g68g87g72g85g76g68g79g3
g18g7460g7021g3
g54g82g88g85g70g72g3
g18g7481g9316 
g38g68g88g86g72g3
g18g2419g3252g3
g51g88g85g83g82g86g72g3
g18g11458g11352g3
g36g70g70g82g88g71g76g81g74g87g82g3
g18g7693g6466g3
g39g76g85g72g70g87g76g82g81g3
g18g7053g2533g3
g38g82g81g71g76g87g76g82g81g3
g18g7477g1226g3
Semantic 
Relations 
g70859g709 
g38g82g81g70g72g86g86g76g82g81g3
g18g16765g8505 
g37g72g86g76g71g72g86g3
g18g17894g17839g3
g38g82g81g77g88g81g70g87g88g85g72g3
g18g5194g2027g3
g40g91g70g72g83g87g3
g18g19512g1114g3
  
g254g39g72g255g3
g71g72g83g72g81g71g72g81g70g92g3
g18“”g11352g4395g1393g4396g3
g258g255g54g75g76g255g258g255g71g72g255g3
g18g258g7171g258g11352g1393g4396g3
g38g82g81g81g72g70g87g76g82g81g3
g18g17842g6521g1393g4396g3
g47g82g70g68g87g76g82g81g51g85g72g83g82g86g76g87g76g82g81g3
g18g7053g1313g16801g1393g4396g3
g55g72g81g86g72g9g57g82g76g70g72g3
g18g3g7114g5589g16833g5589g3
g1393g4396 
g48g82g82g71g3
g18g16833g8680g1393g4396 
Syntacitc 
Relations 
g7089g709 
g38g82g85g85g72g79g68g87g76g89g72g3
g18g1863g13864g16801g1393g4396g3
g55g85g72g81g71g3
g18g17247g2533g2172g16801g1393g4396g3
g51g85g72g83g82g86g76g87g76g82g81g3
g18g1183g16801g1393g4396g3
  
Special 
relaions
g7082g709 
g46g72g85g81g72g79g3g90g82g85g71g3
g18g7692g5527g6116g2010g3
g41g68g76g79g88g85g72g3
g18g1393g4396g3845g17145g3
g3    
Table 1: The dependency relation tag set. 
3   The tagging and checking of semantic 
dependency relations 
3.1 Texts of corpus 
A part of Tsinghua Corpus (Zhang, 1999) anno-
tated with semantic classes was selected as raw 
data of our corpus. The texts of Tsinghua corpus 
come from the news of People’s Daily. The se-
lected part consists of about 1,000,000 words, ap-
proximately 1,500,000 Chinese characters. Its 
domain covers the politics, economy, science, 
sports, etc. The proportion of different domains is 
shown in figure 2. 
 
Figure 2: The proportion of texts of different do-
mains 
Because Chinese is not written with word de-
limiters, first the text was segmented into words 
according to the lexicon of 100,000 words. Then 
each word was tagged with semantic class, whose 
definition follows Tongyici Cilin (Dictionary of 
Synonymous Words) (Mei  et al., 1983).The seman-
tic classes are organized as a tree, which has three 
levels. The first level contains 18 classes, the sec-
ond level contains 101 classes, and the third level 
contains 1434 classes. These hierarchical semantic 
classes are helpful to express the superordinate and 
subordinate information among words. 
All the text in Tsinghua Corpus was seg-
mented, tagged and checked manually. Since the 
corpus was built in 1998, it has been used for sev-
eral years in the researches on automatic sense tag-
ging and class-based language model. Now, the 
accuracy of tagging system has reached to 92.7% 
(Zhang, 1999). 
3.2 Tagging tools  
A computer-aided tagging tool was developed to 
assist annotators in tagging semantic dependency 
relations. To tag a word, annotator only need to 
select its headword and their relation by clicking 
mouse. After a sentence has been tagged, the cor-
responding semantic dependency tree will be dis-
played to help annotators check the sentence 
structure. 
Two additional functions are also provided in 
the tool: dependency grammar checking and on-
line reference of HowNet. Dependency grammar 
checking guarantees that the tagged sentence con-
forms to four axioms of dependency grammar 
(Robinson, 1970): 
(a) One and only one element is independent; 
(b) All others depend directly on some ele-
ment; 
(c) No element depends directly on more than 
one other 
(d) If A depends directly on B and some ele-
ment C intervenes between them (in linear or-
der of string), then C depends directly on A or 
B or some other intervening element. 
During annotating procedure, the tool checks 
whether the tagged relation conforms to depend-
ency grammar, and prompts the grammar errors in 
time. 
On-line HowNet reference facilitates looking 
up semantic knowledge and helps to ensure the 
consistency of tagging. Semantic knowledge is 
more difficult to grasp than syntactic knowledge. 
Even for annotators majored in linguistics, it is too 
difficult to grasp all semantic relations of words 
only after a short-term training. And different opin-
ions about relations will lead to the inconsistency. 
However, HowNet defines the necessary role 
frame for verbs frequently used in real world, and 
these roles can be mapped to our semantic relations, 
so HowNet has set up a detail annotating manual 
for us. For example, in HowNet the role frame of 
the verb “g18337g16282” (pay attention to) is defined as 
{experiencer, target, cause}. With basic semantic 
knowledge, annotators can easily identify the rela-
tion between “g2350g3775” (doctor) and “g18337g16282” (pay at-
tention to) as “experiencer”, and the relation 
between “g6524g5203” (popularization) and “g18337g16282” (pay 
attention to) as “target”. We integrated the on-line 
reference of HowNet to the tool, which has been 
proved in practice to be very helpful in improving 
the consistency and speed of tagging. 
3.3 Checking  
Our work is the first attempt to annotate semantic 
dependency relations on a large corpus, and no 
prior knowledge is available, so the whole corpus 
is tagged manually. But in checking procedure we 
have learned some experience and knowledge, 
which should be used as possible as we can. So we 
adopt two checking modes. In the first mode—
manual checking, checkers correct all errors by 
hand; in second mode—semiautomatic checking, 
computer-aided checking tool automatically 
searches for the errors and then human checkers 
correct them, and it means checkers need to read 
only about 1/3 or less questionable sentences.  
In semiautomatic checking, all the files are 
scanned automatically to search for three kinds of 
errors: 
1. To check whether the semantic relations 
conform to the necessary role frame defined 
by HowNet. 
2. To check whether the relations conform to 
error rules. Some errors frequently occurred 
during manual checking. For example, the re-
lation between words “g1889/g2460”(again) and a 
verb must be “frequency”, but in incorrect 
sentences it was tagged otherwise. We sum-
marized these errors, and wrote them as rules. 
3. To check whether the score of semantic 
dependency model (equation 1) is below some 
threshold. A simple semantic dependency 
model was built on the corpus. Although the 
score of tagged sentence cannot be the crite-
rion of correctness, at least it can show the 
consistency of a kind of sentences. 
∏
=
=
n
k
kkk
rwhwPTP
1
)),(,()(                     (1) 
where n  is the length of the sentence, 
k
w  is the  
k-th word in the sentence, )(
k
wh  is the headword 
to 
k
w  with semantic relation 
k
r . 
The semiautomatic checking interface could 
prompt some possible errors, but the necessary role 
frames defined by HowNet may be not complete, 
the error rules may be not restrict, and the score of 
semantic structure model may be not credible. The 
prompted errors may be false, so the decision 
whether the error is true and how to correct it must 
be made by human checkers. This is the reason 
why it is called semiautomatic checking. 
The checking procedure consisted of five 
rounds of selective manual checking and a round 
of semiautomatic checking. In tagging procedure, 
we dispatched the raw files to annotators in a 
group of 10 files. In a round of selective manual 
checking, one file in every group was selected to 
check. All corrections were recorded by the check-
ing interface, and the reasons for corrections were 
explained by the checker. If too many error sen-
tences occurred in the selected file, all files in this 
group needed correcting by original annotators 
after referring to the corrected sentences and their 
explanations. 
After four rounds of selective manual check-
ing, most of errors have been corrected, but there 
were still some files that have not been checked or 
corrected. We semi-automatically checked all files. 
Finally, the fifth round of manual checking was 
taken.  
Fourteen graduate studentstook part in anno-
tating, most of them are majored in linguistics. 
Seven excellent students were elected for checking 
among annotators, and they were not allowed to 
check their own files. According to our statistics, 
the average speed to annotate by hand is about 1.15 
hours per 100 sentences; the average speed to 
check by hand is about 0.25 hours per 100 sen-
tences; and the speed to check half automatically is 
about 0.08 hours per 100 sentences. In manual 
checking procedure, there were 50% of all files 
that were manually checked, 75.45% that were 
turned to the original annotator to correct. (When 
counting the files corrected by original annotators, 
if the same group of files were corrected in two 
rounds, we count them as two groups.) And all 
files were checked in semiautomatic checking pro-
cedure. 
3.4 Congruence 
Under the given annotating manual, consistency is 
an important criterion to evaluate the corpus. If 
tagged sentence is independently checked and 
passed by several experts, the annotation may be 
credible; otherwise, if some experts do not agree to 
the annotation, it may be not credible enough. If 
several experts evaluate tagged sentences inde-
pendently, the inter-checker agreement is defined 
as the measure of consistency.  
Relation Congruence (RCn) and Sentence 
Congruence (SCn) are defined. RCn is the number 
of relations for which n judges agreed, divided by 
the total number of relations, in which n can be 1, 
2, 3. SCn is the number of sentences for which n 
judges agreed, divided by the total number of sen-
tences, in which n can be 1, 2, 3. For example, if 
three experts take part in evaluating, RC3 is the 
percentage of the annotated relation that all three 
experts are agree to one annotation, and SC1 is the 
percentage of the annotated sentence for which all 
three judges’ opinions are different from one an-
other. 
Before checking, 500 sentences were evalu-
ated by three experts. After checking, 1,400 sen-
tences were evaluated by three experts. In order to 
balance the coverage and workload of evaluation, 
another 4,900 sentences were evaluated by two 
experts. The congruency is shown in table 2. 
 RC3 RC2 RC1
Unchecked(500) 
90.78%g3 8.65%g3 0.57%g3
Checked(1400) 
96.24% 3.64% 0.12% 
Checked(4900) 
--------- 97.56% 2.44% 
 SC3 SC2 SC1 
Unchecked(500) 
69.20%g3 23.00%g3 7.80%g3
Checked(1400) 
83.43% 14.07% 2.50% 
Checked(4900) 
---------- 89.10% 10.90% 
Table 2: the congruency of data before and after 
checking 
The results show that the quality of corpus is 
improved greatly after checking, and high rela-
tion/sentence congruency of 96.24%/83.43% 
among three experts was satisfactory. 
4   Future works 
Although the tagging task is completed, much fur-
ther work will be needed. A user-friendly, interac-
tive interface for corpus investigation is needed to 
search the example sentences and to maintain the 
tagged data. Inconsistencies still exist in the corpus, 
and it may become more apparent with time. How 
to reduce inconsistencies is a challenging problem. 
The role frame of verbs can to be extracted from 
the corpus, which could be integrated with 
HowNet to build a larger database. The correlation 
frame of nouns, which can represent the order of 
modifier phrases, can be extracted, too. 
More statistical researches could be carried 
out on the corpus. Researches on Chinese informa-
tion structure have been carried out on the corpus 
(You et al., 2002). Auto-tagging the semantic de-
pendency structure of this kind is under going. And 
we hope the SDN corpus could be exploited in 
more areas: speech recognition, natural language 
understanding, machine translation, information 
extraction, and so on. 
5   Comparison with other corpora 
Our Corpus is compared with other famous cor-
pora for English and Chinese in table 3. 
Corpus Languge Content Scale Institution 
SDN Chinese 
Semantic 
dependency 
1,000,000 
words 
Tsinghua 
FrameNet English 
Frame 
Semantics 
250,000 
sentences 
Berkeley 
TreeBank(C) Chinese 
Phrase 
structure 
500,000 
words 
UPenn 
Table 3: Comparison with other corpora 
The FrameNet is annotated with semantic 
knowledge, which emphasizes on describing the 
frame and scene of several thousands verbs. They 
first build a frame database, which contains de-
scriptions of each frame of the verbs, and then an-
notated example sentences of these frames. Unlike 
FrameNet, we first annotated semantic dependency 
relations of sentences according to HowNet, and 
hope to extract frames from the corpus later. Fra-
meNet only described the frame of verbs, while 
from Semantic Dependency Net the correlation 
frame of nouns and verbs could be automatically 
learned by machine. 

References 
Collin F. Baker, Charles J. Fillmore, John B. Lowe. 
1998. The Berkeley FrameNet Project, Proceedings 
of the COLING-ACL, Montreal, Canada.  
 
Zhendong Dong and Qiang Dong. 2001. Construction of 
a Knowledge System and its Impact on Chinese Re-
search, Contemporary Linguistics, 3: 33-44, Beijing. 
 
Chu-Ren Huang and Keh-jiann Chen. 1992. A Chinese 
Corpus for Linguistics Research. In the Proceedings 
of the COLING. 1214-1217. Nantes, France 
 
Richard Hudson. 1998. Word Grammars, Dependency 
and Valency, An International Handbook of Con-
temporary Research. Edited by Vilmos Agel, 
Ludwig M. Eichinger, etc. Berlin, Walter de Gruyter. 
 
Tom B. Y. Lai and Changning Huang. 2000. Depend-
ency-based syntactic analysis of Chinese and Anno-
tation of Parsed Corpus. The 38th Annual Meeting of 
the Association for Computational Linguistics, Hong 
Kong. 
 
Juanzi Li and Zuoying Wang. 2002. Chinese Statistcal 
Parser Based on Semantic Dependecies, Tsinghua 
Science and Technology, 7(6): 591-595. 
 
Mingqin Li, Fang You, Juanzi Li, Zuoying Wang. 2002. 
Manual of Tagging Semantic Dependency (third 
Version), Technical Report, Tsinghua University, 
Department of Electronic Engineering. 
 
Mitchell P. Marcus, Beatrice Santorini, Mary Ann Mar-
cinkiewicz. 1993. Building a large annotated corpus 
of English: The Penn Treebank. Computational Lin-
guistics, 19(2): 313-330. 
 
Jiaju Mei, Yiming Zhu and YunQi Gao, Yin Hongxiang, 
Edited. 1983. Tongyici Cilin (Dictionary of 
Synonymous Words), Shanghai Cishu Publisher. 
 
George A. Miller, Richard Beckwith, Christiane Fell-
baum, Derek Gross, and Katherine Miller. 1993. In-
troduction to WordNet: An On-line Lexical Database, 
Five papers on WordNet, CSL Report 43, Cognitive 
Science Laboratory. Princeton University.  
 
Jane J. Robinson. 1970. Dependency Structures and 
Transformation Rules. Lanuage, 46: 259-285.  
 
Fei Xia, Martha Palmer, Nianwen Xue, Mary Ellen 
Okurowski, John Kovarik, Fu-Dong Chiou, Shizhe 
Huang, Tony Kroch, and Mitch Marcus. 2000. Pro-
ceedings of the second International Conference on 
Language Resources and Evaluation (LREC 2000), 
Athens, Greece. 
 
Guowei Yan and Huimin Tan. 1999. Corpus Annotating 
Mannual Based on HowNet (Jiyu ZhiWang de 
Yuliao Biaozhu Shouce), Technical Report, the De-
partment of computer science, Hong Kong Univer-
sity of Sience of Techonolgy. 
http://www.keenage.com 
 
Fang You, Juanzi Li and Zuoying Wang. 2002. An ap-
proach Based HowNet for Extracting Chinese Mes-
sage Structure, Computer Engineering and 
Applications, 38: 56-58. 
 
Jianping Zhang. 1999. A Study of Language Model and 
Understanding Algorithm for Large Vocabulary 
Spontaneous Speech Recognition. Doctor Disserta-
tion, The Department of Electronic Engineering, 
Tsinghua University, Beijing. 
