<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1818">
  <Title>Chinese Base-Phrases Chunking</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Definitions of Chinese Base Phrases
</SectionTitle>
    <Paragraph position="0"> The idea of parsing by chunks goes back to Abney (1991). In his definition of chunks in English, he assumed that a chunk has syntactic structure and he defined chunks in terms of major heads, which are all content words except those that appear between a function word and the content word which selects. A major head is the 'semantic' head (s-head) for the root of the chunk headed by it. However, s-heads can be defined in terms of syntactic heads.</Paragraph>
    <Paragraph position="1"> If the syntactic head h of a phrase P is a content word, is also the s-head of P. If h is a function word, the s-head of P is the s-head of the phrase</Paragraph>
    <Paragraph position="3"> The research enlightens us about the definition of Chinese base phrases. In this paper, a Chinese base phrase consists of a single content word surrounded by a cluster of function words. The single content word is the semantic head of the base phrase. The forms of base phrases can be expressed as follows.</Paragraph>
    <Paragraph position="5"> Coordinate structure The components of 'modifier' and 'complement' are optional. A head could be a simple word as well as the structure of &amp;quot;modifier + head&amp;quot; or &amp;quot;head + complement&amp;quot;, but not &amp;quot;modifier + head + complement&amp;quot;. Coordinate structure could not consist of coordinate symbols such as comma and co-ordinating conjunction. The type of base phrases is congruent with its head's semantic information.</Paragraph>
    <Paragraph position="6"> In most cases, the type accords with the head's syntactical information, for example, when the head is a noun, the phrase is a noun phrase. However, when a head is a noun that denotes a place, the base phrase including that head is not a noun phrase, but a location phrase.</Paragraph>
    <Paragraph position="7"> We consider 9 types of Chinese base phrases in our research: namely adjective phrase (ap), distinguisher phrase (bp), adverbial phrase (dp), noun phrase (np), temporal phrase (tp), location phrase (sp), verb phrase (vp), quantity phrase (mp), quasi quantity phrase (mbar). The inner grammar structures of every base phrase are very important too, but we will discuss that in another paper.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Overview
</SectionTitle>
    <Paragraph position="0"> The frame of Chinese base phrase parsing is composed of two parts: one is the &amp;quot;Type and bracket tagging model&amp;quot;, the other is the &amp;quot;Base phrases acquisition model&amp;quot; which consists of two modules which are &amp;quot;brackets matching &amp;quot;and &amp;quot;correct the types of base phrases&amp;quot;. (See figure 1.) The input to the system is a sequence of POS. In the &amp;quot;Predict the phrase boundary&amp;quot; module, we predict the type, which each word belongs to, and the position of each word in a base phrase with Memory-Based Learning (MBL)(Using the software package provided by Tilburg University.).</Paragraph>
    <Paragraph position="1"> And the result is expressed as a pair formed by base phrase type and position information. Because our Chinese base phrases are non-recursive and non-overlapping, the left and right boundaries of base phrases must match with each other which means they should be a pair and alternative.</Paragraph>
    <Paragraph position="2"> However, the errors involving in the first part will lead to incorrect base phrases because the boundaries do not match, for example &amp;quot;[...[...]&amp;quot;. In the second part, grammar rules that indicate the inner structures of base phrases are used to resolve the boundary ambiguities. Furthermore, it also takes lexical information into account to correct the type mistakes.</Paragraph>
    <Paragraph position="3"> The corpus used in the experiment includes 7606 sentences. It comes from the Chinese Balance Corpus including about 2000 thousand words with four types: literature (44%), news (30%), academic article (20%) and spoken Chinese (6%). These 7606 sentences are split into 6846 training sentences and 760 held out for testing.</Paragraph>
    <Paragraph position="4"> Input Type and bracket tagging  based, supervised learning approach: a memory-based learning algorithm constructs a classifier for a task by storing a set of examples. Each example associates a finite number of classes. Given a new feature vector, the classifier extrapolates its class from those of the most similar feature vectors in memory (Daelemans et, al., 1999). The input to the &amp;quot;Predict the phrase boundary&amp;quot; module is some feature vectors, which compose of a sequence of POS. The solution of the module is to</Paragraph>
    <Paragraph position="6"> (Wojciech and Thorsten, 1998), a duple formed by a type tag and a boundary tag for each word t . Here r indicates the boundary tag, while denotes the type tag.</Paragraph>
    <Paragraph position="8"> the word is not in any type of base phrases.) [?] The indicates the position of the word in a base phrase as shown below:</Paragraph>
    <Paragraph position="10"> 'L': the left boundary, 'R': the right boundary, 'I': the middle position, 'O': outside any base phrases, 'LR': the left and right boundary.</Paragraph>
    <Paragraph position="11"> What information is used to represent data in feature vectors is an important aspect in MBL algorithms. We tried many feature vectors with various lengths. And it is interesting to note that the feature window is not the bigger the better. When the feature window is (-2, +2) in the context, the result is the best. So the feature vector in the experiment is: (POS-2, POS-1, POS0, POS+1, POS+2). The pattern describes the combination of feature vector and result duple &gt;&lt;</Paragraph>
    <Paragraph position="13"> For the experiment in the first step, we use  TiMBL , an MBL software package developed in the ILK-group (Daelemans et, al., 2001). The results of phrase boundary prediction with MBL shows in table 1.</Paragraph>
    <Paragraph position="14"> Table1: The result of word boundary prediction Table 1 shows that there is much difference between the results of various types of base phrases. The precisions and recalls of np, vp, mp, ap and dp are all almost over 90%. Comparatively, the results of sp, tp, bp and mbar are much lower, especially their recalls. This is due to some resemblances between sp, tp and np in Chinese syntactical grammars. Sp and tp may be considered as belong to NP, however, in the definition of Chinese base phrases, sp, tp and np are defined separately for the semantic difference. And the separation can also help in other tasks such as proper noun identification, information retrieval etc.</Paragraph>
    <Paragraph position="15">  processing model.</Paragraph>
    <Paragraph position="16"> (1) Boundary ambiguity: the r 's mistakes will cause the multiple choices regarding the boundaries. For example: &amp;quot; i {np Zhe /rN } [?] /m Chuang Ju /n } , /, Dui /p {np Hou Shi /t {np Zhen Jiu /n } De /u {np Fa Zhan /vN Ying Xiang /vN } {ap Hen /dD Da /a } . /. &amp;quot;. (Please pay attention to the '__' part.) There are altogether three modalities: &amp;quot;{ ...{ ...}&amp;quot;, &amp;quot;{ ...}...}&amp;quot; and &amp;quot;{ ...{ ...}...}&amp;quot;. These are caused by the redundancy and absence of boundaries.</Paragraph>
    <Paragraph position="18"> (2) The type mistake of base phrases: For example: in the sentence of &amp;quot;{np Cang Yi /n } {dp Ji Ben Shang /d } {vp Shi /vC } {np Qing Cang Gao Yuan /nS } Shang /f {tp Cang Zu /nR Ren Min /n } Zai /p...&amp;quot;, the parser mistakes the type of &amp;quot;{ Cang Zu /nR Ren Min /n }&amp;quot; ,which is np, for tp. This error type commonly appears between sp, tp and np, as well as mbar and mp.</Paragraph>
    <Paragraph position="19">  should be &amp;quot;{np Wai Yong /n } {np Yao Wu /n }&amp;quot;. It is very difficult to correct this type of errors because the boundary distribution accords with the definition of Chinese base phrases. The left and right boundaries alternate with each other. Therefore, it is very difficult to find the errors in the sequence from the modalities.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.2 Obtaining the whole base phrases with grammar rules
</SectionTitle>
      <Paragraph position="0"> with Grammar rules With the bracket (boundary) representation, incorrect bracket will be generated but these will be eliminated in the bracket combination process. In the experiment, we attempt to apply grammar rules that represent the inner structures of Chinese base phrases to get rid of the boundary ambiguities.</Paragraph>
      <Paragraph position="1"> These grammar rules are derived from the corpus.</Paragraph>
      <Paragraph position="2"> On the other hand, boundary predictions can find many base phrases that do not accord with the limited grammar rules.</Paragraph>
      <Paragraph position="3"> Figure 2 shows the main strategy of how to use the grammar rules. When if ()&gt;1, there are more than one pair of combined brackets in which the sequences accord with the grammar rules. We are apt to choose the longest possible because the shorter sequences appear more in the corpus. The longer the sequence, the more weight it should carry. When there is only the shorter sequence according with grammar rules, it is more possible to be the correct one. In this case, one or more boundaries will be left. They often need some other boundaries to match, so we try to retrieve some missing boundaries through the partitions in the sentences that should not belong to any base phrases. These partitions are the marks of base phrase boundaries.</Paragraph>
      <Paragraph position="4"> If we find these partitions between two ambiguous boundaries, we will know where to place the new boundary.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.3 Correcting type mistakes with lexical information
</SectionTitle>
      <Paragraph position="0"> In the Chinese language, some POS sequences may belong to different types. For example, &amp;quot;{vN n}&amp;quot; could be np, sp or tp. These sequences often appear in np, sp, tp, mp and mbar. It is difficult to know its right type even with the grammar rules, as we have done in section 5.2. In order to resolve this problem, we attempt to use lexical information because it implies semantic information to some extent.</Paragraph>
      <Paragraph position="1"> The lexical information is distinctive between mp and mbar. mbar is often composed of numbers such as &amp;quot;1200&amp;quot; and numbers in Chinese such as &amp;quot;Si &amp;quot;. The lexical information between tp and np is also obvious, such as &amp;quot;Shi Hou &amp;quot;, &amp;quot;Shi Dai &amp;quot; and &amp;quot;Shi Ji &amp;quot; etc. For sp and np, the words are &amp;quot;Di Qu &amp;quot;, &amp;quot;Liu Yu &amp;quot; etc.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.4 Experimental results
</SectionTitle>
      <Paragraph position="0"/>
      <Paragraph position="2"> if (Only the sequence with the shortest length accords with the grammar rules).</Paragraph>
      <Paragraph position="3"> then { Find partitions such as conjunctions, localizers, punctuations and some prepositions between the ambiguous boundaries in sequences; if (The partitions exist) then {Add boundaries to generate whole base phrases according to the  The simplest bracket combination algorithm is very strict: it only uses adjacent brackets if they appear next to each other in the correct order (first open and then close) without any intervening brackets.</Paragraph>
      <Paragraph position="4"> The result of the algorithm is shown in table 2, as the baseline of the boundary combination experiment.</Paragraph>
      <Paragraph position="5">  From the table 2, we could see the recalls are commonly low. We change another strategy to obtain the whole base phrases as described in section 5.2. The result of using the grammar rules is shown in table 3.</Paragraph>
      <Paragraph position="6"> With the help of grammar rules, all kinds of base phrases improved their f-measures though the precisions or recalls of some types decrease slightly. Comparing with the baseline results in table 2, all the recalls increase significantly. However, the recalls of sp, tp and mp still do not satisfy us. There are more than twenty structures of np which also belong to tp or sp. Except in the case where mp and mbar have the same structure {m}, they are easily distinguished in other structures. (Mbar is always composed of numerals and mp always ends with a quantifier.) In order to distinguish tp from np, sp from np and mbar from mp, we use lexical information for the type disambiguation. The results are shown in table 4.</Paragraph>
      <Paragraph position="7">  From the table 4, we could see improvement in all the results (precisions and recalls) of mp and mbar. It shows that the lexical information is effective for distinguishing between them. On the contrary, although the f-measures of np and sp increase, their precisions decline. Thus, those words marking tp and sp are not appropriate for disambiguation. We could see the effect of lexical information is limited because it is difficult to find the words that could distinguish different types of base phrases.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>