File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/06/p06-3008_metho.xml
Size: 9,539 bytes
Last Modified: 2025-10-06 14:10:31
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-3008"> <Title>Discursive Usage of Six Chinese Punctuation Marks</Title> <Section position="5" start_page="44" end_page="44" type="metho"> <SectionTitle> SOLUTIONHOOD-S, SOLUTIONHOOD-M and </SectionTitle> <Paragraph position="0"> SOLUTIONHOOD-N are regarded as 3 relations.</Paragraph> <Paragraph position="1"> Following the examples of Carlson et al. (2001) and Marcu (1999), we've composed a 60-page Chinese RST annotation manual, which includes preprocessing procedures, segmentation rules, definitions and examples of the relations, tag definitions for structural elements, tagging conventions for special structures, and a relation selection protocol. When annotating, we choose the most indicative relation according to the manual. Trees are constructed with binary branches except for multinuclear relations.</Paragraph> <Paragraph position="2"> One experienced annotator had sketched trees for all 395 files before the completion of the manual. She then annotated the 97 shortest files from 197 randomly selected texts, working independently and with constant reference to the manual. After a one-month break, she re-annotated the 97 files, with reference to the manual and with occasional consultation with Chinese journalists and linguists. The last version, though far from error-free, is currently taken as the correct version for reliability tests and other statistics.</Paragraph> <Paragraph position="3"> Parentheses, and other PMs used in structural elements of CJPL texts, are of high relevance to discourse parsing, since they can be used in a preprocessor to filter out text fragments that do not need to be annotated in terms of RST trees. 
The intra-coder accuracy rate (R</Paragraph> </Section> <Section position="6" start_page="44" end_page="45" type="metho"> <SectionTitle> 3 Results </SectionTitle> <Paragraph position="0"> The 97 double-annotated files have in the main body of their texts a total of 677 paragraphs and 1,914 EUDAs. Relational patterns of those PMs are reported in Tables 2-7 below. The &quot;N&quot;, &quot;S&quot; or &quot;M&quot; tags after each relation indicate the nuclearity status of each EUDA ending with a certain PM. The number of those PMs used in structural elements of CJPL texts is also reported, as they contribute to the total percentage. (Table notes: Based on data from the 2nd version of the annotated texts. The rate in our finished portion is higher than the overall 42.93% rate for colons used in structural elements, for we've only finished the 97 shortest of the 197 randomly selected files.) The above data suggest at least the following: 1) There is no one-to-one mapping between any of the PMs studied and a rhetorical relation. But some PMs have dominant rhetorical usages. 2) C-Question Mark is most frequently related not to SOLUTIONHOOD, but to CONJUNCTION. That is because a high percentage of questions in our corpus are rhetorical and used in groups to achieve a certain argumentative force.</Paragraph> <Paragraph position="1"> 3) C-Colon is most frequently related to ATTRIBUTION and ELABORATION, apart from its usage in structural elements. 4) C-Semicolon is overwhelmingly associated with multinuclear relations, particularly with CONJUNCTION.</Paragraph> <Paragraph position="2"> 5) C-Dash usually indicates an ELABORATION relation. 
But since it is often used in pairs, it is often bound to both the Nucleus and Satellite units of a relation.</Paragraph> <Paragraph position="3"> 6) 82.3% of the tokens of the six Chinese PMs are uniquely related to EUDAs of a certain nuclearity status in a rhetorical relation, taking even C-Dash into account.</Paragraph> <Paragraph position="4"> 7) The following relations have more than 10% of their instances related to one of the six PMs studied here: ADDITION,</Paragraph> </Section> <Section position="7" start_page="45" end_page="45" type="metho"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> How useful are these six PMs in the prediction of rhetorical relations in Chinese texts? In our opinion, this question can be answered partly through a comparison with Chinese cue phrases.</Paragraph> <Paragraph position="1"> Cue phrases are widely discussed and exploited in the literature of both Chinese studies and RST applications as a major surface device.</Paragraph> <Paragraph position="2"> Unfortunately, Chinese cue phrases in natural texts are difficult to identify automatically. As is well known, Chinese words are made up of 1, 2, or more characters, but there is no explicit word delimiter between any pair of adjacent words in a string of characters. Thus, the words are not known before tokenization (&quot;fenci&quot; in Chinese, meaning &quot;separating into words&quot;, or &quot;word segmentation&quot;, so as to recognize meaningful words out of possible overlaps or combinations). The task may sound simple, but it has been the focus of considerable research efforts (e.g. Webster and Kit, 1992; Guo, 1997; Wu, 2003).</Paragraph> <Paragraph position="3"> Since many cue phrases are made up of high-frequency characters (e.g. 
&quot;Er -ER&quot; in &quot;Er -er&quot; meaning &quot;but/so/and&quot;, &quot;Ran Er -ran'er&quot; meaning &quot;but/however&quot;, &quot;Yin Er -yin'er&quot; meaning &quot;so/because of this&quot;, &quot;Er Qie -erqie&quot; meaning &quot;in addition&quot;, etc.; &quot;Ci -ci&quot; in &quot;Ci Hou -cihou&quot; meaning &quot;later/hereafter&quot;, &quot;Yin Ci -yinci&quot; meaning &quot;as a result&quot;, &quot;You Ci Kan Lai -youcikanlai&quot; meaning &quot;on this ground/hence&quot;, etc.), a considerable amount of computation must be done before these cue phrases can ever be exploited.</Paragraph> <Paragraph position="4"> Apart from tokenization, part-of-speech (POS) tagging and word sense disambiguation (WSD) are other necessary steps that should be taken before making use of some common cue phrases. They are all hard nuts to crack in Chinese language engineering.</Paragraph> <Paragraph position="5"> Interestingly, much of the research done in these three areas has made use of the information carried by PMs (e.g. Sun et al., 1998).</Paragraph> <Paragraph position="6"> Chan et al. (2000) did a study on identifying Chinese connectives as signals of rhetorical relations for their Chinese summarizer. Their tests were successful. But like PMs, Chinese cue phrases are not in a one-to-one mapping relationship with rhetorical relations, either.</Paragraph> <Paragraph position="7"> In our finished portion of the CJPL corpus, we've identified 161 types of cue phrases at or above our EUDA level, recording 539 tokens. These cue phrases are scattered in 477 EUDAs, indicating 20.5% of the total relations in our finished portion of the corpus. Our six PMs, on the other hand, have 551 tokens in the same finished portion, delimiting 345 EUDAs (and 206 structural elements), and indicating 14.8% of the total relations. 
However, since there are far more types of cue phrases than types of punctuation marks, 90.1% of cue phrases are sparser at or above our EUDA level than the least frequently used PM, Ellipsis in this case.</Paragraph> <Paragraph position="8"> And Chinese cue phrases don't signal all the rhetorical relations at all levels. For instance, CONJUNCTION is the most frequently used relation in our annotated text (accounting for 22.1% of all the discursive relations), but it doesn't have a strong correlation with any lexical item. Its most frequent lexical cue is &quot;Ye -ye&quot;, accounting for 2.4%. ELABORATION is another common relation in CJPL, but it is rarely marked by cue phrases.</Paragraph> </Section> <Section position="8" start_page="45" end_page="45" type="metho"> <SectionTitle> ATTRIBUTION, SOLUTIONHOOD and </SectionTitle> <Paragraph position="0"> DISJUNCTION are among the other least-marked relations in Chinese, yet they happen to be signaled quite significantly by a punctuation mark.</Paragraph> <Paragraph position="1"> Given the cost of recognizing Chinese cue phrases accurately, the sparseness of many of these cues, and the risk of missing all cue phrases for a particular discursive relation, punctuation marks with strong rhetorical preferences appear to be useful supplements to cue phrases.</Paragraph> </Section> <Section position="9" start_page="45" end_page="47" type="metho"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> Because rhetorical structure in Chinese texts is not explicit by itself, systematic and quantitative evaluation of the various factors that can contribute to the automatic analysis of texts is quite necessary. 
The purpose of this study is to look into the discursive patterns of Chinese PMs, to see if they can facilitate discourse parsing without deep semantic analysis.</Paragraph> <Paragraph position="1"> We have in this study observed the discursive usage of six Chinese PMs, from their overall distribution in our Chinese discourse corpus and their syntax in context to their rhetorical roles at or above our EUDA level. Current statistics seem to suggest clear patterns in their rhetorical roles, and a distinctive correlation with nuclearity in most relations. These patterns and correlations may be useful in NLP projects.</Paragraph> <Paragraph position="2"> We have yet to give a theoretical definition of cue phrases in our study, but the ones identified cover a range similar to that of the English cue phrases listed in Marcu (1997).</Paragraph> </Section> </Paper>