<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1213"> <Title>Annotating information structures in Chinese texts using HowNet</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Corpora are essential resources for any research in language engineering. For Chinese, efforts to build large corpora began in the 1990s. For instance, the PH corpus of 4 million Chinese characters with word boundary information was released in 1993 (Guo, 1993). The first version of the Sinica corpus, two million words marked with word boundaries and parts-of-speech, was released in 1995 (CKIP, 1995). In 1996, a small corpus of 5266 distinct words (including punctuation marks) with a total occurrence frequency of 51870 was released (Yu et al., 1996). This corpus was derived from the Singapore Primary School Chinese Language Textbooks. It contained information on word boundaries, parts-of-speech and syntactic structures. In 2000, two additional bracketed corpora were announced. The first one, the Chinese Penn Treebank, includes 100 thousand words (Xia et al., 2000). The second one, the Sinica Treebank, which is derived from the Sinica corpus, contains 38,725 sentences, 1000 of which have been released to the public 1 (CKIP, 2000).</Paragraph> <Paragraph position="1"> The historical development of Chinese corpus construction shows a consensus on incorporating more powerful linguistic structures into corpora. As noted by Marcus (1997), more powerful linguistic structures help to improve the accuracy of parsing.</Paragraph> <Paragraph position="2"> This is especially true for isolating languages such as Chinese. However, there is very little work on annotating corpora with semantic information.</Paragraph> <Paragraph position="3"> To the best of our knowledge, there is only one report of this kind.
The work by Lua 2 annotated 340,000 words with semantic class information as defined in a thesaurus of synonyms (Mei, 1983). With the release of HowNet 3, a bilingual general knowledge base, Gan and Tham (1999) reported the first corpus of 30,000 words annotated with the general knowledge structure defined in HowNet. This paper reports an extension of the work in Gan and Tham (1999) on the annotation of information structures in Chinese texts. Section 2 gives an overview of HowNet. Section 3 describes information structure and provides an illustration.</Paragraph> </Section> </Paper>