<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1712">
  <Title>Building A Large Chinese Corpus Annotated With Semantic Dependency</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> As basic research tools for investigators in natural language processing, large annotated corpora play an important role in investigating diverse language phenomena, building statistical language models, evaluating and comparing kinds of parsing models. At present most of corpora are annotated mainly with syntactic knowledge, though some function tags are added to annotate semantic knowledge. For example, the Penn Treebank (Marcus et al., 1993) was annotated with skeletal syntactic structure, and many syntactic parsers were evaluated and compared on the corpus. For Chinese, some corpora annotated with phrase structure also have been built, for instance the Penn Chinese Treebank (Xia et al., 2000) and Sina Corpus (Huang and Chen, 1992). A syntactic annotation scheme based on dependency was proposed by (Lai and Huang, 2000), and a small corpus was built for testing. However, very limited work has been done with annotation semantic knowledge in all languages. From 1999, Berkeley started FrameNet project (Baker et al., 1998), which produced the frame-semantic descriptions of several thousand English lexical items and backed up these description with semantically annotated attestations from contemporary English corpus. Although few corpora annotated with semantic knowledge are available now, there are some valuable lexical databases describing the lexical semantics in dictionary form, for example English WordNet (Miller et al., 1993) and Chinese HowNet (Dong and Dong, 2001).</Paragraph>
    <Paragraph position="1"> For Chinese, many attentions have been naturally paid to researches on semantics, because Chinese is a meaning-combined language, its syntax is very flexible, and semantic rules are more stable than syntactic rules. For instance, in Chinese it is very pervasive that more than one part-of -speeches a word has, and a word does not have tense or voice flectional transition under different tenses or voices. Nevertheless, no large Chinese corpus annotated with semantic knowledge has ever been built at present. In Semantic Dependency Net (SDN), we try to describe deeper semantic dependency relationship between individual words and represent the meaning and structure of a sentence by these dependencies.</Paragraph>
    <Paragraph position="2"> Compared with syntactic corpus, it is more difficult to build a semantic corpus, for the granularity of semantic knowledge is smaller, and behaviors of different words differ more greatly.</Paragraph>
    <Paragraph position="3"> Furthermore, ambiguity in semantics is commoner.</Paragraph>
    <Paragraph position="4"> Different people may have different opinions on understanding the same word in the same sentence, and even the same people may have different opinions on understanding the same word in different occasions. In this paper, we emphatically discuss the strategy to improve the consistency of Semantic Dependency Net.</Paragraph>
    <Paragraph position="5"> The paper is organized as follows. The tagging scheme is discussed in Section 2, which describes the semantic dependency grammar and the tag set of semantic relations. In section 3, we describe the tagging task. First, we briefly introduce the text of this corpus, which has been tagged with semantic classes. Second, we describe the strategy to improve consistency during tagging and checking.</Paragraph>
    <Paragraph position="6"> At last, congruence is defined to measure the consistency of tagged corpus. In Section 4, we briefly introduce some of the works on the corpus, and indicate the directions that the project is likely to take in the future. Finally, we compare SDN corpus with some other well-known corpora.</Paragraph>
    <Paragraph position="7">  tence annotated with semantic dependency; (c) The semantic dependency tree of the sentence, headwords are linked with bold lines, and modifier words are linked with arrow lines. 2 The tagging scheme of semantic dependency null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Semantic dependency grammar
</SectionTitle>
      <Paragraph position="0"> Like Word grammar (Hudson, 1998), We believe that words are the basic units of semantics, and the structure and meaning of a sentence consist mainly of semantic dependencies between individual words. So a sentence could be annotated with a series of semantic dependency relations (Li Juanzi and Wang, 2002). Let S be a sentence composed of words tagged with semantic classes,</Paragraph>
      <Paragraph position="2"> states that the</Paragraph>
      <Paragraph position="4"> defined to be (-1, &amp;quot;kernel word&amp;quot;).</Paragraph>
      <Paragraph position="5"> For example, a sample sentence from the corpus is shown in Figure 1 (a). The semantic dependency relation list and semantic dependency tree are shown in Figure 1 (b) and (c) respectively.</Paragraph>
      <Paragraph position="6"> More samples will be seen in Appendix A.</Paragraph>
      <Paragraph position="7"> In semantic dependency grammar, the head-word of sentence represents the main meaning of the whole sentence, and the headword of constituent represents the main meaning of the constituent. In a compound constituent, the headword inherits the headword of the head sub-constituent, and headwords of other sub-constituents are dependent on that headword. We select the word that can represent the meaning of the constituent to the most extent as headword. For example, the verb is the headword of verb phrase, the object is the headword of preposition phrase, and the location noun is the headword of the location phrase.</Paragraph>
      <Paragraph position="8"> At the same time, semantic dependency relations do not damage the phrase structure, that is, all words in the same phrase are in the same sub-tree whose root is the headword of the phrase.</Paragraph>
      <Paragraph position="9"> Therefore, when tagging dependency relations, semantic and syntactic restrictions are both taken into account. The structures of dependency tree are mainly determined by syntactic restrictions, and the semantic relations are mainly determined by semantic restrictions. For example, in Figure 1 the phrase &amp;quot;g1866g2469g7138g6116g7536g11352&amp;quot;( of his invention production) modifies the phrase &amp;quot;g6524g5203g1363g11004&amp;quot; (popularization and application) in syntax, so the word &amp;quot;g6524g5203&amp;quot; (popularization) governs the word &amp;quot;g6116g7536&amp;quot; (production). However, the production is the content of the action popularization in semantics, so the relation between them is &amp;quot;content&amp;quot;.</Paragraph>
      <Paragraph position="10"> Our tagging scheme is more concise compared with phrase structure grammar, in which the boundaries of all phrases have to be marked and the corresponding labels have to be tagged. In the semantic dependency grammar, phrases are implicit, but play no part in grammar. More emphasis is paid to the syntactic and semantic functions of the word, especially of the headword.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The dependency relation tag set
</SectionTitle>
      <Paragraph position="0"> The dependency relation tag set mainly consists of three kinds of relations: semantic relations, syntactic relations and special relations. Semantics is the main content of this corpus, so semantic relations are in the majority, and syntactic relations are used to annotate the special structures that do not have exact sense in terms of semantics. In addition, there are two special relations: &amp;quot;kernel word&amp;quot; is to indicate the headword of a sentence, and &amp;quot;failure&amp;quot; is to indicate the word that cannot be annotated with dependency relations because the sentence is not completed.</Paragraph>
      <Paragraph position="1"> The selections of semantic relations were referred to HowNet (Dong and Dong, 2001).</Paragraph>
      <Paragraph position="2"> HowNet is a lexical database, which describes the relations among words and concepts as a network.</Paragraph>
      <Paragraph position="3"> In HowNet, senventy-six semantic relations are defined to describe all relations among various concepts, and most of them describe the semantic relations between action and other concepts. With these semantic relations, necessary role frame is further defined. The roles in the necessary role frame must take part in the action in real word, while these roles may not appear in the same sentence. Hong Kong Technology University has successfully tagged a news corpus with the necessary role frame (Yan and Tan, 1999), which shows that these roles can describe all semantic phenomena in real texts.</Paragraph>
      <Paragraph position="4"> In order to make tagging task easier and the corpus more suitable for statistical learning, we have pared down some relations in HowNet and got fifty-nine semantic relations. Some HowNet relations seldom occurred in the corpus, and their semantic functions are somewhat similar, so they are merged. Some relations are ambiguous, for example &amp;quot;degree&amp;quot; and &amp;quot;range&amp;quot;. In order to improve the consistency, we also merge these two relations.</Paragraph>
      <Paragraph position="5"> Semantic relations can describe the relations between notional words, but they cannot annotate function words in some special phrase structures.</Paragraph>
      <Paragraph position="6"> So nine syntactic relations are added.</Paragraph>
      <Paragraph position="7"> The tag set is listed in table 1. Full definition of each dependency relation can be seen in (Li Mingqin et al., 2002).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Texts of the corpus
</SectionTitle>
      <Paragraph position="0"> A part of Tsinghua Corpus (Zhang, 1999) annotated with semantic classes was selected as raw data of our corpus. The texts of Tsinghua corpus come from the news of People's Daily. The selected part consists of about 1,000,000 words, approximately 1,500,000 Chinese characters. Its domain covers the politics, economy, science, sports, etc. The proportion of different domains is shown in figure 2.</Paragraph>
      <Paragraph position="1">  according to the lexicon of 100,000 words. Then each word was tagged with semantic class, whose definition follows Tongyici Cilin (Dictionary of Synonymous Words) (Mei et al., 1983).The semantic classes are organized as a tree, which has three levels. The first level contains 18 classes, the second level contains 101 classes, and the third level contains 1434 classes. These hierarchical semantic classes are helpful to express the superordinate and subordinate information among words.</Paragraph>
      <Paragraph position="2"> All the text in Tsinghua Corpus was segmented, tagged and checked manually. Since the corpus was built in 1998, it has been used for several years in the researches on automatic sense tagging and class-based language model. Now, the accuracy of tagging system has reached to 92.7% (Zhang, 1999).</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Tagging tools
</SectionTitle>
      <Paragraph position="0"> A computer-aided tagging tool was developed to assist annotators in tagging semantic dependency relations. To tag a word, annotator only need to select its headword and their relation by clicking mouse. After a sentence has been tagged, the corresponding semantic dependency tree will be displayed to help annotators check the sentence structure.</Paragraph>
      <Paragraph position="1"> Two additional functions are also provided in the tool: dependency grammar checking and on-line reference of HowNet. Dependency grammar checking guarantees that the tagged sentence conforms to four axioms of dependency grammar (Robinson, 1970):  (a) One and only one element is independent; (b) All others depend directly on some element; null (c) No element depends directly on more than one other (d) If A depends directly on B and some element C intervenes between them (in linear or null der of string), then C depends directly on A or B or some other intervening element.</Paragraph>
      <Paragraph position="2"> During annotating procedure, the tool checks whether the tagged relation conforms to dependency grammar, and prompts the grammar errors in time.</Paragraph>
      <Paragraph position="3"> On-line HowNet reference facilitates looking up semantic knowledge and helps to ensure the consistency of tagging. Semantic knowledge is more difficult to grasp than syntactic knowledge. Even for annotators majored in linguistics, it is too difficult to grasp all semantic relations of words only after a short-term training. And different opinions about relations will lead to the inconsistency. However, HowNet defines the necessary role frame for verbs frequently used in real world, and these roles can be mapped to our semantic relations, so HowNet has set up a detail annotating manual for us. For example, in HowNet the role frame of the verb &amp;quot;g18337g16282&amp;quot; (pay attention to) is defined as {experiencer, target, cause}. With basic semantic knowledge, annotators can easily identify the relation between &amp;quot;g2350g3775&amp;quot; (doctor) and &amp;quot;g18337g16282&amp;quot; (pay attention to) as &amp;quot;experiencer&amp;quot;, and the relation between &amp;quot;g6524g5203&amp;quot; (popularization) and &amp;quot;g18337g16282&amp;quot; (pay attention to) as &amp;quot;target&amp;quot;. We integrated the on-line reference of HowNet to the tool, which has been proved in practice to be very helpful in improving the consistency and speed of tagging.</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Checking
</SectionTitle>
      <Paragraph position="0"> Our work is the first attempt to annotate semantic dependency relations on a large corpus, and no prior knowledge is available, so the whole corpus is tagged manually. But in checking procedure we have learned some experience and knowledge, which should be used as possible as we can. So we adopt two checking modes. In the first mode-manual checking, checkers correct all errors by hand; in second mode--semiautomatic checking, computer-aided checking tool automatically searches for the errors and then human checkers correct them, and it means checkers need to read only about 1/3 or less questionable sentences.</Paragraph>
      <Paragraph position="1"> In semiautomatic checking, all the files are scanned automatically to search for three kinds of errors:  1. To check whether the semantic relations conform to the necessary role frame defined by HowNet.</Paragraph>
      <Paragraph position="2"> 2. To check whether the relations conform to error rules. Some errors frequently occurred during manual checking. For example, the relation between words &amp;quot;g1889/g2460&amp;quot;(again) and a verb must be &amp;quot;frequency&amp;quot;, but in incorrect sentences it was tagged otherwise. We summarized these errors, and wrote them as rules. 3. To check whether the score of semantic dependency model (equation 1) is below some threshold. A simple semantic dependency model was built on the corpus. Although the score of tagged sentence cannot be the criterion of correctness, at least it can show the consistency of a kind of sentences.</Paragraph>
      <Paragraph position="4"> where n is the length of the sentence, k w is the k-th word in the sentence, )( k wh is the headword to</Paragraph>
      <Paragraph position="6"> The semiautomatic checking interface could prompt some possible errors, but the necessary role frames defined by HowNet may be not complete, the error rules may be not restrict, and the score of semantic structure model may be not credible. The prompted errors may be false, so the decision whether the error is true and how to correct it must be made by human checkers. This is the reason why it is called semiautomatic checking.</Paragraph>
      <Paragraph position="7"> The checking procedure consisted of five rounds of selective manual checking and a round of semiautomatic checking. In tagging procedure, we dispatched the raw files to annotators in a group of 10 files. In a round of selective manual checking, one file in every group was selected to check. All corrections were recorded by the checking interface, and the reasons for corrections were explained by the checker. If too many error sentences occurred in the selected file, all files in this group needed correcting by original annotators after referring to the corrected sentences and their explanations.</Paragraph>
      <Paragraph position="8"> After four rounds of selective manual checking, most of errors have been corrected, but there were still some files that have not been checked or corrected. We semi-automatically checked all files.</Paragraph>
      <Paragraph position="9"> Finally, the fifth round of manual checking was taken.</Paragraph>
      <Paragraph position="10"> Fourteen graduate studentstook part in annotating, most of them are majored in linguistics. Seven excellent students were elected for checking among annotators, and they were not allowed to check their own files. According to our statistics, the average speed to annotate by hand is about 1.15 hours per 100 sentences; the average speed to check by hand is about 0.25 hours per 100 sentences; and the speed to check half automatically is about 0.08 hours per 100 sentences. In manual checking procedure, there were 50% of all files that were manually checked, 75.45% that were turned to the original annotator to correct. (When counting the files corrected by original annotators, if the same group of files were corrected in two rounds, we count them as two groups.) And all files were checked in semiautomatic checking procedure. null</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Congruence
</SectionTitle>
      <Paragraph position="0"> Under the given annotating manual, consistency is an important criterion to evaluate the corpus. If tagged sentence is independently checked and passed by several experts, the annotation may be credible; otherwise, if some experts do not agree to the annotation, it may be not credible enough. If several experts evaluate tagged sentences independently, the inter-checker agreement is defined as the measure of consistency.</Paragraph>
      <Paragraph position="1"> Relation Congruence (RCn) and Sentence Congruence (SCn) are defined. RCn is the number of relations for which n judges agreed, divided by the total number of relations, in which n can be 1, 2, 3. SCn is the number of sentences for which n judges agreed, divided by the total number of sentences, in which n can be 1, 2, 3. For example, if three experts take part in evaluating, RC3 is the percentage of the annotated relation that all three experts are agree to one annotation, and SC1 is the percentage of the annotated sentence for which all three judges' opinions are different from one another. null Before checking, 500 sentences were evaluated by three experts. After checking, 1,400 sentences were evaluated by three experts. In order to balance the coverage and workload of evaluation, another 4,900 sentences were evaluated by two experts. The congruency is shown in table 2.</Paragraph>
      <Paragraph position="2">  checking The results show that the quality of corpus is improved greatly after checking, and high relation/sentence congruency of 96.24%/83.43% among three experts was satisfactory.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Future work
</SectionTitle>
    <Paragraph position="0"> Although the tagging task is completed, much further work will be needed. A user-friendly, interactive interface for corpus investigation is needed to search the example sentences and to maintain the tagged data. Inconsistencies still exist in the corpus, and it may become more apparent with time. How to reduce inconsistencies is a challenging problem.</Paragraph>
    <Paragraph position="1"> The role frame of verbs can to be extracted from the corpus, which could be integrated with HowNet to build a larger database. The correlation frame of nouns, which can represent the order of modifier phrases, can be extracted, too.</Paragraph>
    <Paragraph position="2"> More statistical researches could be carried out on the corpus. Researches on Chinese information structure have been carried out on the corpus (You et al., 2002). Auto-tagging the semantic dependency structure of this kind is under going. And we hope the SDN corpus could be exploited in more areas: speech recognition, natural language understanding, machine translation, information extraction, and so on.</Paragraph>
    <Paragraph position="3">  The FrameNet is annotated with semantic knowledge, which emphasizes on describing the frame and scene of several thousands verbs. They first build a frame database, which contains descriptions of each frame of the verbs, and then annotated example sentences of these frames. Unlike FrameNet, we first annotated semantic dependency relations of sentences according to HowNet, and hope to extract frames from the corpus later. FrameNet only described the frame of verbs, while from Semantic Dependency Net the correlation frame of nouns and verbs could be automatically learned by machine.</Paragraph>
  </Section>
class="xml-element"></Paper>