<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1107">
  <Title>Chinese Chunking with another Type of Spec</Title>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
(3) Structural ambiguities
</SectionTitle>
    <Paragraph position="0"> In Chinese, some structural ambiguities in phrase level are impossible or unnecessary to be distinguished during chunking. There is an example of 'a_n_n':</Paragraph>
    <Paragraph position="2"> identically acceptable. English also has such problem. The solution of CoNLL2000 is not to distinguish inner structure and group the given sequence as a single chunk. For example, the inner structure of '[NP heavy truck production]' is '{{heavy truck} production}', whereas one reading of '[NP heavy quake damage]' is '{heavy {quake damage}}'.</Paragraph>
    <Paragraph position="3"> Besides, 'a_n_n', 'm_n_n' and 'm_q_n_n' also have the similar problem.</Paragraph>
  </Section>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Chinese Chunking Spec
</SectionTitle>
    <Paragraph position="0"> As a kind of shallow parsing, the principles of chunking are to make chunking much more efficient and precise than full parsing. Obviously, one can shorten the length of chunks to leave ambiguities outside of chunks. For example, if we let noun-noun sequences always chunk into single word, those ambiguities listed in Table 1 would not be encountered and the performance would be greatly improved. In fact, there is an implicit requirement in chunking, no matter which language it is, the average length of chunks is as longer as possible without violating the general principle of chunking. So a trade-off between the average chunk length and the chunking performance exists.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Why another type of spec is needed
</SectionTitle>
      <Paragraph position="0"> A convenient spec is to extract the lowest non-terminal nodes from a Treebank (e.g. CTB) as Chinese chunked data. But there are some problems. The trees are designed for full parsing instead of shallow parsing, thus some of these problems listed in section 2 could not be resolved well in chunking. Maybe we can compile some rules to prune the tree or break some non-terminal nodes in order to properly resolve these problems just like CoNLL2000. However, just as (Kim Sang and Buchholz, 2000) noted: &amp;quot;some trees are very complex and some annotations are inconsistent&amp;quot;.</Paragraph>
      <Paragraph position="1"> So these rules are complex, the extracted data are inconsistent and manual check is also needed. In addition, the resource of Chinese Treebank is limited and the extracted data is not enough for chunking.</Paragraph>
      <Paragraph position="2"> So we compile another type of chunking spec according to the observation from un-bracket corpus instead of Treebank. The only shortcoming is the cost of annotation, but there are some advantages for us to explore.</Paragraph>
      <Paragraph position="3"> 1) It coincides with auto chunking procedure, and we can select proper solutions to these problems without constraints of the exist Treebank. The purpose of drafting another type of chunking spec is to keep chunking consistency as high as possible without hurting the performance of autochunking in whole.</Paragraph>
      <Paragraph position="4"> 2) Through spec drafting and text annotating most frequent and significant syntactic ambiguities could be studied, and those observations are in turn described in the spec carefully.</Paragraph>
      <Paragraph position="5"> 3) With a proper spec and certain mechanical approaches, a large-scale chunked data could be produced without supporting from the Treebank.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Our spec
</SectionTitle>
      <Paragraph position="0"> Our spec and chunking annotation are based on PK corpus  (Yu et al. 1996). The PK corpus is unbracketed, but in which all words are segmented and only one POS tag is assigned to each word. We define 11 chunk types that are similar with CoNLL2000. They are NP (noun chunk), VP (verb chunk), ADJP (adjective chunk), ADVP (adverb chunk), PP (prepositional chunk), CONJP (conjunction), MP (numerical chunk), TP (temporal chunk), SP (spatial chunk), INTJP (interjection) and INDP (independent chunk). During spec drafting we try to find a proper chunk spec to solve these problems by two ways: either merging neighboring chunks into one chunk or shortening them. Besides those structural ambiguities, we also extend boundary of the chunks with minor structural ambiguities in order to make the chunks close to the constituents.  The auxiliary 'g11352/of' is one of the most frequent words in Chinese and used to connect a pre-modifier with its nominal head. However the left boundary of such a g8859 -construction is quite complicated: almost all kinds of preceding clauses, phrases and words can be combined with it to form such a pre-modifier, and even one g8859-construction can embed into another. So we definitely leave it outside any chunk. Similarly, conjunctions, 'g3706 /and', 'g5783/or' and 'g2568/and' et al., are also left outside any chunk no matter they are word-level or  Can be downloaded from www.icl.pku.edu.cn phrase-level coordinations. For instances, the examples in Section 2 are chunked as '[NP g6931g12586g5627</Paragraph>
      <Paragraph position="2"> Similar with the shared task of CoNLL2000, we define noun compound that is formed by a noun-sequence: 'a_n_n', 'm_n_n' or 'm_q_n_n', as one chunk, even if there are sub-compounds, sub-phrase or coordination relations inside it. For instances, '[NP g19750g5192 g5547g5907g13785 g12197g6228 g7393g2165g19443]',</Paragraph>
      <Paragraph position="4"> respectively.</Paragraph>
      <Paragraph position="5"> However, it does not mean that we blindly bind all neighboring nouns into a flat NP. If those neighboring nouns are not in one constituent or cross the phrase boundary, they will be chunked separately, such as following two examples in  g11263g11198] g8859/u [NP g13942g1319] [NP g2163g14033]'. So our solution does not break the grammatical phrase structure in a given sentence.</Paragraph>
      <Paragraph position="6"> With this chunking strategy, we not only properly resolved these problems, but also get longer chunks. Longer chunks can make successive parsing easier based on chunking. For example, if we chunked the sentence as: [NP g8424g7567] g8859 [NP g9497g2773 g2660g12642] [NP g6777g6686] [VP g5063 g3275g3157g16280g7080 g2636] g1941/w There would be three possible syntactic trees which are difficult to be distinguished:</Paragraph>
      <Paragraph position="8"> Whereas with above chunking strategy of our spec, there is only one syntactic tree remained: {{[NP g8424g7567] g8859 [NP g9497g2773 g2660g12642 g6777g6686]} [VP g5063 g3275g3157g16280g7080 g2636]} g1941/w Another reason of the chunking strategy is that for some NLP applications such as IR, IE or QA, it is unnecessary to analyze these ambiguities at the  early stage of text analysis.</Paragraph>
      <Paragraph position="9"> (2) PP Most PP consists of only the preposition itself  because the right boundary of a preposition phrase is hard to identify or far from the preposition. But certain prepositional phrases in Chinese are formed with a frame-like construction, such as [PP g3324/p 'at' ...g1025/f 'middle'], [PP g3324/p ...g990/f 'top'], etc. Statistics shows that more than 90% of those frame-like PPs are un-ambiguous, and others commonly have certain formal features such as an auxiliary g11352 or a conjunction immediately following the localizer. Table 2 shows the statistic result. Thus with those observations, those frame-like constructions could be chunked as PP. The length of such kind of PP frames is restricted to be at most two words inside in order to keep the distribution of chunk length more even and the chunking annotation more consistent.</Paragraph>
      <Paragraph position="10">  as a chunk without any ambiguity (3) SP  Most spatial chunks consist of only the localizer(with POS tag '/s' or '/f'). But if the spatial phrase is in the beginning of a sentence, or there is a punctuation (except &amp;quot;g1940&amp;quot;) in front of it, then the localizer and its preceding words could be chunked as a SP. And the number of words in front of the localizer is also restricted to at most two for the same reason.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
(4) VP
</SectionTitle>
    <Paragraph position="0"> Commonly, a verb chunk VP is a pre-modifier verb construction, or a head-verb with its following verb-particles which form a morphologically derived word sequence. The pre-modifier is formed by adverbial phrases and/or auxiliary verbs.</Paragraph>
    <Paragraph position="1"> In order to keep the annotation consistent those verb particles and auxiliary verbs could be found in a closed list respectively only. Post-modifiers of a verb such as object and complement should be excluded in a verb chunk.</Paragraph>
    <Paragraph position="2"> We find that although a head verb groups more than one preceding adverbial phrases, auxiliary verbs and following verb-particles into one VP, its chunking performance is still high. For example:</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Spec Comparison
</SectionTitle>
      <Paragraph position="0"> We compare our spec with the Treebank-derived spec, named as S1, which is to extract the lowest non-terminal nodes from CTB as chunks from the  aspect of the solutions of these problems in section 2. Noun-noun compound and the coordination  which has no conjunction are chunked identically in both specs. But for others, there are different. In S1, the conjunctions of phrase-level coordination are outside of chunks and the ones of word-level are inside a chunk, all adjective or numerical modifiers are separate from noun head. According to S1, the example in 3.2.1 should be chunked as following.</Paragraph>
      <Paragraph position="2"> But these phrases that are impossible to distinguish inner structures during the early stage of text analysis are hard to be chunked and would cause some inconsistency. '[ADJP g7380g1314] [NP g5049g17176] g2656 [NP g10995g8975g17165]' or '[ADJP g7380g1314] [NP g5049g17176 g2656 g10995 g8975g17165]', '[ADJP g10628g1207] [NP g1237g1006] [NP g2058g5242]' or '[ADJP g10628g1207] [NP g1237g1006 g2058g5242]', are hard to make decisions with S1.</Paragraph>
      <Paragraph position="3"> In addition, with our spec outside words are only punctuations, structural auxiliary ' g11352 /of', or conjunctions, whereas with S1, outside words are defined as all left words after lowest non-terminal extraction.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="2" end_page="3" type="metho">
    <SectionTitle>
4 Chunking Annotation
</SectionTitle>
    <Paragraph position="0"> Four graduate students of linguistics were assigned to annotate manually the PK corpus with the proposed chunking spec. Many discussions between authors and those annotators were conducted in order to define a better chunking spec for Chinese. Through the spec drafting and text annotating most significant syntactic ambiguities in Chinese, such as those structural ambiguities discussed in section 2 and 3, have been studied, and those observations are carefully described in the spec in turn.</Paragraph>
    <Paragraph position="1"> Consistency control is another important issue during annotation. Besides the common methods: manual checking, double annotation, post annotation checking, we explored a new consistency measure to help us find the potential inconsistent annotations, which is hinted by (Kenneth and Ryszard. 2000), who defined consistency gain as a measure of a rule in learning from noisy data.</Paragraph>
    <Paragraph position="2"> The consistency of an annotated corpus in whole could be divided down into consistency of each chunk. If the same chunks appear in the same context, they should be identically annotated. So we define the consistency of one special chunk as the ratio of identical annotation in the same context. corpusin ))context( ,( of No.</Paragraph>
    <Paragraph position="3"> )context(in annotation same of No.</Paragraph>
    <Paragraph position="5"> (2) Where P represents a pattern of the chunk (POS or/and lexical sequence), context(P) represents the needed context to annotate this chunk, N represents the number of chunks in the whole corpus S. In order to improve the efficiency we also develop a semi-automatic tool that not only check mechanical errors but also detect those potential inconsistent annotations. For example, one inputs a POS pattern: 'a_n_n', and an expected annotation result: 'B-NP_I-NP_E-NP  ', the tool will list all the consistent and inconsistent sentences in the annotated text respectively. Based on the output one can revise those inconsistent results one by one, and finally the consistency of the chunked text will be improved step by step.</Paragraph>
  </Section>
  <Section position="8" start_page="3" end_page="3" type="metho">
    <SectionTitle>
5 Chunking Model
</SectionTitle>
    <Paragraph position="0"> After annotating the corpus, we could use various learning algorithms to build the chunking model. In this paper, HMM is selected because not only its training speed is fast, but also it has comparable performance (Xun and Huang, 2000).</Paragraph>
    <Paragraph position="1"> Automatic chunking with HMM should conduct the following two steps. 1) Identify boundaries of each chunk. It is to assign each word a chunk mark, named M, which contains 5 classes: B, I, E, S (a single word chunk) and O (outside all chunks). 2) Tag the chunk type, named X, which contains 11 types defined in Section 3.</Paragraph>
    <Paragraph position="2"> So each word will be tagged with two tags: M and X (the words excluding from any chunk only have M). So the result after chunking is a sequence of triples (t, m, x), where t, m, x represent POS tag, chunk mark and chunk type respectively. All the triples of a chunk are combined as an item n</Paragraph>
    <Paragraph position="4"> which also could be named as a chunk rule. Let W as the word segmentation result of a given sentence,</Paragraph>
    <Paragraph position="6"> ) as the chunking result. The statistical chunking model could be described as following:  B, E, I represent the left/right boundary of a chunk and inside a chunk respectively, B-NP means this word is the beginning of NP.</Paragraph>
    <Paragraph position="7">  Smoothing follows the method of (Gao et al., 2002).</Paragraph>
    <Paragraph position="8"> In order to improve the performance we use N-fold error correction (Wu, 2004) technique to reduce the error rate and TBL is used to learn the error correction rules based on the output of HMM.</Paragraph>
  </Section>
class="xml-element"></Paper>