<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1001">
  <Title>Data Homogeneity and Semantic Role Tagging in Chinese</Title>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 The Data
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Materials
</SectionTitle>
      <Paragraph position="0"> In this study, we used two datasets: sentences from primary school textbooks were taken as examples of simple data, while sentences from a large corpus of newspaper texts were taken as complex examples. Two sets of primary school Chinese textbooks popularly used in Hong Kong were taken for reference. The two publishers were Keys Press and Modern Education Research Society Ltd. Texts for Primary One to Six were digitised, segmented into words, and annotated with parts-of-speech (POS). This resulted in a text collection of about 165K character tokens and, upon segmentation, about 109K word tokens (about 15K word types).</Paragraph>
      <Paragraph position="1"> There were about 2,500 transitive verb types, with frequency ranging from 1 to 926.</Paragraph>
      <Paragraph position="2"> The complex examples were taken from a subset of the LIVAC synchronous corpus (Tsou et al., 2000; Kwong and Tsou, 2003). The subcorpus consists of newspaper texts from Hong Kong, including local news, international news, financial news, sports news, and entertainment news, collected in 1997-98. The texts were segmented into words and POS-tagged, resulting in about 1.8M character tokens and, upon segmentation, about 1M word tokens (about 47K word types). There were about 7,400 transitive verb types, with frequency ranging from 1 to just over 6,300.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Training and Testing Data
</SectionTitle>
      <Paragraph position="0"> For the current study, a set of 41 transitive verbs common to the two corpora (hereafter referred to as textbook corpus and news corpus), with frequency over 10 and over 50 respectively, was sampled.</Paragraph>
      <Paragraph position="1"> Sentences in the corpora containing the sampled verbs were extracted. Constituents corresponding to semantic roles with respect to the target verbs were annotated by a trained human annotator and the annotation was verified by another. In this study, we worked with a set of 11 predicate-independent abstract semantic roles.</Paragraph>
      <Paragraph position="2"> According to the Dictionary of Verbs in Contemporary Chinese (Xiandai Hanyu Dongci Dacidian; Lin et al., 1994), our semantic roles include the necessary arguments for most verbs, such as agent and patient, or goal and location in some cases; and some optional arguments realised by adjuncts, such as quantity, instrument, and source. Some examples of semantic roles with respect to a given predicate are shown in Figure 1.</Paragraph>
      <Paragraph position="3"> Altogether 980 sentences covering 41 verb types in the textbook corpus were annotated, resulting in 1,974 marked semantic roles (constituents); and 2,122 sentences covering 41 verb types in the news corpus were annotated, resulting in 4,933 marked constituents.</Paragraph>
      <Paragraph position="4"> The role labelling system was trained on 90% of the sample sentences from the textbook corpus and the news corpus separately; and tested on the remaining 10% of both corpora.</Paragraph>
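The 90/10 split just described can be sketched as follows. This is a minimal illustration only; the function name, the shuffling step, and the fixed seed are our own assumptions, not details from the paper:

```python
import random

def train_test_split(sentences, test_fraction=0.1, seed=0):
    """Shuffle annotated sentences and split off a held-out test portion."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 980 annotated textbook sentences, as reported above
textbook = [f"tb-{i}" for i in range(980)]
train, test = train_test_split(textbook)
print(len(train), len(test))  # 882 98
```

The paper trains on the two corpora separately; the same split would simply be applied to each corpus in turn.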
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="4" type="metho">
    <SectionTitle>
4 Automatic Role Labelling
</SectionTitle>
    <Paragraph position="0"> The automatic labelling was based on the statistical approach in Gildea and Jurafsky (2002). In Section 4.1, we will briefly mention the features used in the training process. Then in Sections 4.2 and 4.3, we will explain our approach for locating headwords in candidate constituents associated with semantic roles, in the absence of parse information. (Footnote: these figures only refer to the samples used in the current study. In fact, over 35,000 sentences in the LIVAC corpus have been semantically annotated, covering about 1,500 verb types, with about 80,000 constituents marked.)</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Training
</SectionTitle>
      <Paragraph position="0"> In this study, our probability model was based mostly on parse-independent features extracted from the training sentences, namely: Headword (head): The headword from each constituent marked with a semantic role was identified. For example, in the second sentence in Figure 1, Xue Xiao (school) is the headword in the constituent corresponding to the agent of the verb Ju Hang (hold), and Bi Sai (contest) is the headword of the noun phrase corresponding to the patient.</Paragraph>
      <Paragraph position="1"> Position (posit): This feature shows whether the constituent being labelled appears before or after the target verb. In the first example in Figure 1, the experiencer and time appear on the left of the target, while the theme is on its right.</Paragraph>
      <Paragraph position="2"> POS of headword (HPos): Without features provided by the parse, such as phrase type or parse tree path, the POS of the headword of the labelled constituent could provide limited syntactic information. Preposition (prep): Certain semantic roles like time and location are often realised by prepositional phrases, so the preposition introducing the relevant constituents would be an informative feature. Hence for automatic labelling, given the target verb t, the candidate constituent, and the above features, the role r which has the highest probability for P(r | head, posit, HPos, prep, t) will be assigned to that constituent. In this study, however, we are also testing the unknown-boundary condition, where candidate constituents are not available in advance. To start with, we attempt to partially locate them by identifying their headwords first, as explained in the following sections. Figure 1: Examples of semantic roles with respect to a given predicate.</Paragraph>
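The assignment rule above, choosing the role r that maximises P(r | head, posit, HPos, prep, t), can be sketched with simple relative-frequency counts. The class and variable names below are ours, and smoothing and back-off are deliberately omitted:

```python
from collections import Counter, defaultdict

class RoleLabeller:
    """Relative-frequency estimate of P(role | head, posit, HPos, prep, target)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # feature tuple -> Counter of roles

    def train(self, examples):
        # examples: iterable of (head, posit, hpos, prep, target, role) tuples
        for *features, role in examples:
            self.counts[tuple(features)][role] += 1

    def label(self, head, posit, hpos, prep, target):
        seen = self.counts.get((head, posit, hpos, prep, target))
        if not seen:
            return None  # unseen feature combination; back-off would apply here
        # the role maximising P(r | features) is simply the most frequent one
        return seen.most_common(1)[0][0]

model = RoleLabeller()
model.train([
    ("school", "pre", "N", None, "hold", "agent"),
    ("school", "pre", "N", None, "hold", "agent"),
    ("school", "pre", "N", None, "hold", "location"),
])
print(model.label("school", "pre", "N", None, "hold"))  # agent
```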
    </Section>
    <Section position="2" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
4.2 Locating Candidate Headwords
</SectionTitle>
      <Paragraph position="0"> In the absence of parse information, and with constituent boundaries unknown, we attempt to partially locate the candidate constituents by identifying their corresponding headwords first.</Paragraph>
      <Paragraph position="1"> Sentences in our test data were segmented into words and POS-tagged. We thus divide the recognition process into two steps: locating the headword of a candidate constituent first, and then expanding from the headword to determine its boundaries.</Paragraph>
      <Paragraph position="2"> Basically, if we consider every word in the same sentence as the target verb (both to its left and to its right) a potential headword for a candidate constituent, what we need to do is to find the most probable words in the sentence to match against individual semantic roles. We start with a feature set with more specific distributions, and back off to feature sets with less specific distributions. Hence in each round we look for argmax_r P(r | feature set) for every candidate word. Ties are resolved by giving priority to the word nearest to the target verb in the sentence.</Paragraph>
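One reading of this round-by-round back-off search can be sketched as follows, using a back-off chain from most to least specific feature combinations (with prep dropped, since constituent boundaries are unknown). All names here are ours, and the exact scoring and tie-breaking details are assumptions rather than the paper's specification:

```python
from collections import Counter

# Back-off chain, most to least specific (prep omitted: boundaries unknown).
BACKOFF = [
    ("head", "posit", "hpos", "target"),
    ("head", "posit", "target"),
    ("head", "target"),
    ("hpos", "posit", "target"),
    ("hpos", "target"),
]

def locate_headwords(candidates, target, models):
    """candidates: (token, hpos, posit, distance) tuples, one per word in the
    sentence, with distance measured from the target verb.
    models: maps a feature-name tuple to {feature values: Counter of roles}.
    Returns {role: token}; a role is fixed in the earliest round it appears."""
    found = {}
    for feats in BACKOFF:
        best = {}  # role -> ((probability, -distance), token) this round
        for token, hpos, posit, distance in candidates:
            value = {"head": token, "hpos": hpos, "posit": posit, "target": target}
            key = tuple(value[f] for f in feats)
            roles = models.get(feats, {}).get(key)
            if not roles:
                continue
            total = sum(roles.values())
            for role, n in roles.items():
                if role in found:
                    continue  # already located in a more specific round
                score = (n / total, -distance)  # ties go to the nearest word
                if role not in best or score > best[role][0]:
                    best[role] = (score, token)
        for role, (_, token) in best.items():
            found[role] = token
    return found

models = {("head", "target"): {("problem", "discover"): Counter({"patient": 3})},
          ("hpos", "target"): {("PN", "discover"): Counter({"agent": 2})}}
cands = [("we", "PN", "pre", 2), ("problem", "N", "post", 1)]
print(locate_headwords(cands, "discover", models))
```

In this toy run, "problem" is claimed as Patient in the more specific head-based round, so the later HPos-only round can only add the Agent, mirroring the behaviour described for Figure 2.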
      <Paragraph position="3"> Figure 2 shows an example illustrating the procedure for locating candidate headwords. The target verb is Fa Xian (discover). In the first round, using features head, posit, HPos, and t, Shi Hou (time) and Wen Ti (problem) were identified as Time and Patient respectively. In the fourth round, backing off with features posit and HPos, Wo Men (we) was identified as a possible Agent. In this round a few other words were also identified as potential Patients; however, they were not considered, since the Patient had already been located in a previous round. So in the end the headwords identified for the test sentence are Wo Men for Agent, Wen Ti for Patient, and Shi Hou for Time.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.3 Constituent Boundary
</SectionTitle>
      <Paragraph position="0"> Upon the identification of headwords for potential constituents, the next step is to expand from these headwords for constituent boundaries. Although we are not doing this step in the current study, it can potentially be done via some finite state techniques, or better still, with shallow syntactic processing like simple chunking if available.</Paragraph>
      <Paragraph position="1"> In this experiment, we back off in the following order: P(r | head, posit, HPos, prep, t), P(r | head, posit, t), P(r | head, t), P(r | HPos, posit, t), P(r | HPos, t). However, the prep feature becomes obsolete when constituent boundaries are unknown.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="4" end_page="7" type="metho">
    <SectionTitle>
5 The Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
5.1 Testing
</SectionTitle>
      <Paragraph position="0"> The system was trained on the textbook corpus and the news corpus separately, and tested on both corpora (the data is homogeneous when the system is trained and tested on materials from the same source). The testing was done under the "known constituent" condition and the "unknown constituent" condition. The former essentially corresponds to the known-boundary condition in related studies, whereas in the unknown-constituent condition, hereafter called the "headword location" condition, we tested our method of locating candidate headwords as explained in Section 4.2.</Paragraph>
      <Paragraph position="1"> In this study, every noun, verb, adjective, pronoun, classifier, and number within the test sentence containing the target verb was considered a potential headword for a candidate constituent corresponding to some semantic role. The performance was measured in terms of precision (defined as the percentage of correct outputs among all outputs), recall (defined as the percentage of correct outputs among expected outputs), and the F1 score, which is the harmonic mean of precision and recall.</Paragraph>
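The three measures can be computed as follows. This is a minimal sketch, and the worked numbers are illustrative, not results from the paper:

```python
def precision_recall_f1(n_correct, n_output, n_expected):
    """Precision: correct/all outputs; recall: correct/expected; F1: harmonic mean."""
    p = n_correct / n_output if n_output else 0.0
    r = n_correct / n_expected if n_expected else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. 8 correctly labelled constituents out of 10 output, with 16 expected
p, r, f1 = precision_recall_f1(8, 10, 16)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.5 0.62
```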
    </Section>
    <Section position="2" start_page="4" end_page="5" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> The results are shown in Tables 1 and 2, for training on a homogeneous dataset and on a different dataset respectively, and testing under the known constituent condition and the headword location condition.</Paragraph>
      <Paragraph position="1"> When trained on homogeneous data, the results were good on both datasets under the known constituent condition, with an F1 score of about 90.</Paragraph>
      <Paragraph position="2"> This is comparable to, or even better than, the results reported in related studies for the known-boundary condition. The difference is that we did not use any parse information in the training, not even phrase type. When trained on a different dataset, however, the accuracy was maintained for textbook data but decreased for news data, under the known constituent condition.</Paragraph>
      <Paragraph position="3"> For the headword location condition, the performance in general was expectedly inferior to that for the known constituent condition. Moreover, this degradation seemed to be quite consistent in most cases, regardless of the nature of the training set. In fact, despite the effect of the training set on news data, as mentioned above, the degradation was similar whether the system was trained on the same or on different materials. [Tables 1 and 2 appeared here, showing results for textbook and news data under the various training conditions.]</Paragraph>
      <Paragraph position="4"> Hence the effect of training data is only obvious in the news corpus. In other words, both sets of training data work similarly well with textbook test data, but the performance on news test data is worse when trained on textbook data. This is understandable, as the textbook data contain fewer examples and the sentence structures are usually much simpler than those in newspapers. Hence the system tends to miss many secondary roles like location and time, which are not sufficiently represented in the textbook corpus. The conclusion that training on news data gives better results might be due to the difference in the corpus size of the two datasets.</Paragraph>
      <Paragraph position="5"> Nevertheless, the deterioration of results on textbook sentences, even when trained on news data, is simply reinforcing the importance of data homogeneity, if nothing else. More on data homogeneity will be discussed in the next section.</Paragraph>
      <Paragraph position="6"> In addition, the surprisingly low precision under the headword location condition is attributable to a technical inadequacy in the way we break ties. In this study we only make an effort to eliminate multiple tagging of the same role with respect to the same target verb when the candidates appear on one side of the target verb, but not when they appear on both sides. This should certainly be dealt with in future experiments.</Paragraph>
      <Paragraph position="7"/>
    </Section>
    <Section position="4" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.1 Role of Parse Information
</SectionTitle>
      <Paragraph position="0"> According to Carreras and Marquez (2004), the state-of-the-art results for semantic role labelling systems based on shallow syntactic information are about 15 points lower than those with access to gold-standard parse trees, i.e., around 60. With homogeneous training and testing data, our experimental results for the headword location condition, with no syntactic information available at all, give F1 scores of 52.89 and 44.35 for textbook data and news data respectively. Such results are in line with and comparable to those reported for the unknown-boundary condition with automatic parses in Gildea and Palmer (2002), for instance. Moreover, when they used simple chunks instead of full parses, performance dropped to below 50% precision and 35% recall with relaxed scoring, hence their conclusion on the necessity of a parser.</Paragraph>
      <Paragraph position="1"> The greater degradation in performance observed on the news data is nevertheless within expectation, and it suggests that simple and complex data have varied dependence on parse information.</Paragraph>
      <Paragraph position="2"> We will further discuss this below in relation to data homogeneity.</Paragraph>
    </Section>
    <Section position="5" start_page="5" end_page="5" type="sub_section">
      <SectionTitle>
6.2 Data Homogeneity
</SectionTitle>
      <Paragraph position="0"> The usefulness of parse information for semantic role labelling is especially interesting in the case of Chinese, given the flexibility in its syntax-semantics interface (e.g. the object after Chi 'eat' could refer to the patient as in Chi Pin Guo 'eat apple', location as in Chi Shi Tang 'eat canteen', duration as in Chi San Nian 'eat three years', etc.).</Paragraph>
      <Paragraph position="1"> As reflected in the results, the nature of the training data is obviously more important for the news data than for the textbook data; the main reason might be the failure of the simple training data to capture the many complex structures of the news sentences, as we suggested earlier. The relative flexibility in the syntax-semantics interface of Chinese is particularly salient; hence when a sentence gets more complicated, there might be more intervening constituents, and the parse information would be useful to help identify the relevant ones in semantic role labelling.</Paragraph>
      <Paragraph position="2"> With respect to the data used in the experiment, we tried to explore the complexity in terms of the average sentence length and the number of semantic role patterns exhibited. For the news data, the average sentence length is around 59.7 characters (syllables), and the number of semantic role patterns varies from 4 (e.g. Da Suan 'to plan') to as many as 25 (e.g. Jin Hang 'to proceed with some action'), with an average of 9.5 patterns per verb. On the other hand, the textbook data give an average sentence length of around 39.7 characters, and the number of semantic role patterns only varies from 1 (e.g. Jue Ding 'to decide') to 11 (e.g. Ju Hang 'to hold some event'), with an average of 5.1 patterns per verb. Interestingly, the verb Jin Hang, being very polymorphous in news texts, only shows 5 different patterns in textbooks.</Paragraph>
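The per-verb averages quoted above are simple means over the sampled verbs. As a tiny illustration (with made-up counts, not the paper's full data):

```python
def average_patterns(patterns_per_verb):
    """Mean number of distinct semantic-role patterns per verb."""
    return sum(patterns_per_verb.values()) / len(patterns_per_verb)

# illustrative pattern counts for three hypothetical news-corpus verbs
news_like = {"Da Suan": 4, "Jin Hang": 25, "other": 8}
print(round(average_patterns(news_like), 1))  # 12.3
```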
      <Paragraph position="3"> Thus the nature of the dataset for semantic role labelling is worth further investigation. The design of the method and the feature set should benefit from more linguistic analysis and input.</Paragraph>
    </Section>
    <Section position="6" start_page="5" end_page="7" type="sub_section">
      <SectionTitle>
6.3 Future Work
</SectionTitle>
      <Paragraph position="0"> In terms of future development, apart from improving the handling of ties in our method, as mentioned above, we plan to expand our work in several respects. The major part would be on the generalization to unseen headwords and unseen predicates. As with other related studies, the examples available for training for each target verb are very limited; and the availability of training data is also insufficient in the sense that we cannot expect them to cover all target verb types. Hence it is very important to be able to generalize the process to unseen words and predicates. To this end we will experiment with a semantic lexicon like Tongyici Cilin (a Chinese thesaurus) in both training and testing, which we expect to improve the overall performance.</Paragraph>
      <Paragraph position="1"/>
      <Paragraph position="2"/>
      <Paragraph position="3"> Another area of interest is to look at the behaviour of near-synonymous predicates in the tagging process. Many predicates may be unseen in the training data; while the probability estimation could be generalized from near-synonyms as suggested by a semantic lexicon, whether the similarity and subtle differences between near-synonyms with respect to the argument structure and the corresponding syntactic realisation could be distinguished would also be worth studying. Related to this is the possibility of augmenting the feature set. Xue and Palmer (2004), for instance, looked into new features such as syntactic frame, lexicalized constituent type, etc., and found that enriching the feature set improved the labelling performance. In particular, given the importance of data homogeneity as observed from the experimental results, and the challenges posed by the characteristic nature of Chinese, we intend to improve our method and augment the feature set with more linguistic consideration.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="7" end_page="7" type="metho">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> The study reported in this paper has thus tackled semantic role labelling in Chinese in the absence of parse information, by attempting to locate the corresponding headwords first. We experimented with both simple and complex data, and have explored the effect of training on different datasets.</Paragraph>
    <Paragraph position="1"> Using only parse-independent features, our results under the known boundary condition are comparable to those reported in related studies. The headword location method can be further improved.</Paragraph>
    <Paragraph position="2"> More importantly, we have observed the importance of data homogeneity, which is especially salient given the relative flexibility of Chinese in its syntax-semantics interface. As a next step, we plan to explore some class-based techniques for the task with reference to existing semantic lexicons, and to modify the method and augment the feature set with more linguistic input.</Paragraph>
  </Section>
</Paper>