<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0113"> <Title>Development of a Partially Bracketed Corpus with Part-of-Speech Information Only</Title> <Section position="3" start_page="0" end_page="162" type="metho"> <SectionTitle> 2. Experimental Framework </SectionTitle> <Paragraph position="0"> Because the probabilistic chunker proposed in this paper is based on syntactic tags (parts of speech), a part-of-speech tagger is needed. A word sequence W is input to the part-of-speech tagger and a part-of-speech sequence P is generated. The output of the tagger is the input of the chunker. The probabilistic chunker partitions P into C, i.e., a sequence of chunks. Each chunk contains one or more parts of speech. Consider the example &quot;Attorneys for the mayor said that an amicable property settlement has been agreed upon .&quot;. This 15-word sentence is input to the part-of-speech tagger and a part-of-speech sequence</Paragraph> </Section> <Section position="4" start_page="162" end_page="162" type="metho"> <SectionTitle> &quot;NNS IN ATI NPT VBD CS AT JJ NN NN HVZ BEN VBN IN .&quot; is generated. </SectionTitle> <Paragraph position="0"> The probabilistic chunker then partitions this sequence into several chunks. The chunked result is shown as follows.</Paragraph> <Paragraph position="1"> \[NNS\] \[IN ATI NPT\] \[VBD\] \[CS\] \[AT\] \[JJ NN NN\] \[HVZ BEN\] \[VBN\] \[IN\] \[.\] However, the performance evaluation of the chunker is a thorny problem. To evaluate the performance of the chunker, the Susanne Corpus, which is a modified and condensed version of the Brown Corpus, is adopted. However, the tagging sets \[6,7\] of the LOB Corpus and the Susanne Corpus are different: the latter has finer tags than the former. Thus, a tag mapper is introduced into the experimental framework shown in Figure 1.</Paragraph> <Paragraph position="2"> \[Figure 1: the experimental framework\]</Paragraph> </Section> <Section position="5" start_page="162" end_page="162" type="metho"> <SectionTitle> C </SectionTitle> <Paragraph position="0"> In our experiments, the test sentence Ps comes from the Susanne Corpus. It is a part-of-speech sequence.</Paragraph> <Paragraph position="1"> The corresponding syntactic structure T is regarded as an evaluation criterion for the probabilistic chunker.</Paragraph> <Paragraph position="2"> It is sent to the performance evaluation model. The tag mapper in this figure is used to transform Susanne parts of speech into LOB parts of speech. Through the tag mapper, Ps is converted into Pl. Then, Pl is input to the probabilistic chunker and a chunk sequence C is produced. Finally, the performance evaluation model reports the evaluation results according to C and T.</Paragraph> </Section> <Section position="6" start_page="162" end_page="169" type="metho"> <SectionTitle> 3. A Tag Mapper </SectionTitle> <Paragraph position="0"> The tagging set of the Susanne Corpus is extended and modified from that of the LOB Corpus. They have 424 and 153 tags, respectively. Mapping Susanne tags into LOB tags manually is tedious work, so an automatic tag mapping algorithm is provided. Our investigation found that words are good clues for relating these two tagging sets. Therefore, the first step in automatic tag mapping is to collect words from the Susanne Corpus for each Susanne tag. Table 1 lists some examples.</Paragraph> <Paragraph position="1"> Column three in Table 1 denotes the correct mapping to LOB tags. The second step is to find the corresponding LOB tags from the LOB Corpus for each word collected at the first step. 
Table 2 shows the sample results.</Paragraph> <Paragraph position="2"> and ( CC RB&quot; RB NC )
plus ( IN JJ NN &amp;FW )
&amp; ( CC )
with ( IN IN&quot; RI NC )
without ( IN RI )
physics ( NN )
politics ( NN NNS )
mathematics ( NN )
associates ( NNS VBZ )
am ( BEM &amp;FW )
ai ( HVZ BEZ BER )
Those words which cannot be found in the LOB Corpus are removed. The symbol * denotes that none of the words can be found in the LOB Corpus. The third step is to find the corresponding LOB tag for each Susanne tag. For each Susanne tag, the frequency of each candidate LOB tag is calculated and the most frequent LOB tag is regarded as the result. For example, LOB tags NN and NNS in row three of Table 2 appear three times and once, respectively. Thus, Susanne tag NNlux is mapped to LOB tag NN. After examining all the Susanne tags by these three steps, three cases have to be considered: (1) Unique Tag. Only one LOB tag remains.</Paragraph> <Paragraph position="3"> (2) Multiple Tags. More than one LOB tag remains.</Paragraph> <Paragraph position="4"> (3) No Match.</Paragraph> <Paragraph position="5"> When none of the words extracted from the Susanne Corpus for a Susanne tag can be found in the LOB Corpus, the Susanne tag is mapped to &quot;No Match&quot;. Some of these words are characteristic words such as YTL 1.</Paragraph> <Paragraph position="6"> The experimental results are shown in Table 3.</Paragraph> <Paragraph position="7"> In Table 3, &quot;Include&quot; denotes that the correct tag belongs to the remaining multiple tags and &quot;Exclude&quot; denotes that the correct tag is not included in the remaining tags. Note that the ditto tags are not considered in this experiment, because the mapping for ditto tags can easily be obtained by hand. Therefore, only 310 Susanne tags are resolved in this experiment. The experimental results show that the number of multiple tags is large. 
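The three-step mapping procedure described above can be sketched in Python as follows. This is a minimal illustration: `lob_lexicon` (a word-to-LOB-tags dictionary built from the LOB Corpus) and the function names are our own assumptions, not part of the original system.

```python
from collections import Counter

def map_susanne_tag(susanne_words, lob_lexicon):
    """Majority-vote mapping. susanne_words: words collected from the Susanne
    Corpus for one Susanne tag (step 1). lob_lexicon: maps a word to the list
    of LOB tags it carries in the LOB Corpus (step 2)."""
    votes = Counter()
    for word in susanne_words:
        # Words not found in the LOB Corpus are simply removed.
        for lob_tag in lob_lexicon.get(word, []):
            votes[lob_tag] += 1
    if not votes:
        return "No Match"                           # case (3): No Match
    top = votes.most_common()
    best_count = top[0][1]
    winners = [tag for tag, count in top if count == best_count]
    # case (1): Unique Tag, or case (2): Multiple Tags
    return winners[0] if len(winners) == 1 else winners

# Toy example echoing row three of Table 2 (Susanne tag NNlux).
lexicon = {"physics": ["NN"], "politics": ["NN", "NNS"], "mathematics": ["NN"]}
print(map_susanne_tag(["physics", "politics", "mathematics"], lexicon))  # -> NN
```

Here NN receives three votes and NNS one, so the unique winner NN is returned, matching the NNlux example in the text.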
Thus, two heuristic rules are introduced to reduce the number of multiple tags.</Paragraph> <Paragraph position="8"> First, those LOB tags which are similar to the Susanne tag are selected. For example, Susanne tag NNJ2 can be mapped to LOB tags NNS or VBZ in the above experiment. NNS has two characters in common with NNJ2, so Susanne tag NNJ2 is mapped to LOB tag NNS. Under this heuristic rule, the experimental results are shown in Table 4.</Paragraph> <Paragraph position="9"> Next, let us consider an example. Susanne tag IW can be mapped to LOB tags IN or RI in the above experiment, so the first heuristic rule has no effect. We examine the tag mapping for the three preceding and three subsequent tags of IW. They are listed as follows.</Paragraph> <Paragraph position="10"> (-1) Susanne Tag IIt is mapped to LOB Tag IN.
(-2) Susanne Tag IIx is mapped to LOB Tag IN.
(-3) Susanne Tag IO is mapped to LOB Tag IN.
(**) Susanne Tag IW is mapped to LOB Tags IN and RI.
(+2) Susanne Tag JB is mapped to LOB Tag JJ.
(+3) Susanne Tag JBo is mapped to LOB Tag AP.</Paragraph> <Paragraph position="16"> Note that only tags which have the same first character as IW are considered, that is, only (-1), (-2), and (-3) are considered. In these three mappings, LOB tag IN is the most frequent (indeed the only) mapping, so IN is a candidate for IW. Thus, Susanne tag IW is mapped to LOB tag IN. The above procedure forms the second heuristic rule. The experimental results after applying the two heuristic rules are shown as follows. Three tags, namely FA, FB, and GG, must be treated specially. For example, the Susanne Corpus tags a genitive-case noun as \[John_NP 's_GG\], but the LOB Corpus tags it as \[John's_PN$\]. Two Susanne tags may thus be mapped into one LOB tag. 
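The first heuristic rule can be sketched as follows. Counting shared characters with multiplicity is our reading of "common characters" (it reproduces both examples in the text: NNS shares two characters with NNJ2, while IN and RI each share one with IW, leaving a tie); the function name is ours.

```python
from collections import Counter

def select_by_similarity(susanne_tag, candidates):
    """First heuristic: among the remaining candidate LOB tags, keep those
    sharing the most characters (counted with multiplicity) with the
    Susanne tag. A tie keeps several candidates, i.e., the rule has no effect."""
    def overlap(lob_tag):
        common = Counter(susanne_tag) & Counter(lob_tag)  # multiset intersection
        return sum(common.values())
    best = max(overlap(tag) for tag in candidates)
    return [tag for tag in candidates if overlap(tag) == best]

print(select_by_similarity("NNJ2", ["NNS", "VBZ"]))  # -> ['NNS']
print(select_by_similarity("IW", ["IN", "RI"]))      # -> ['IN', 'RI'] (tie)
```

In the NNJ2 case the rule resolves the ambiguity; in the IW case both candidates survive, which is exactly when the second (context-based) heuristic rule is needed.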
Ignoring these three special tags, only nineteen Susanne tags have a wrong mapping in the Unique-Tag case.</Paragraph> <Paragraph position="17"> 4. A Probabilistic Chunker Gale and Church \[8\] propose φ², a χ²-like statistic, to measure the association between two words. Table 6 illustrates a two-by-two contingency table for words w1 and w2. Cell a counts the number of sentences that contain both w1 and w2. Cell b (c) counts the number of sentences that contain w1 (w2) but not w2 (w1). Cell d counts the number of sentences that contain neither w1 nor w2. That is, if N is the total number of sentences, d = N - a - b - c. Based on this contingency table, φ² is defined as follows:
φ²(p1,p2) = (ad - bc)² / ((a + b)(a + c)(b + d)(c + d)), with a = F(p1,p2), b = F(p1) - a, c = F(p2) - a, and d = N - a - b - c,</Paragraph> <Paragraph position="19"> where pi denotes part-of-speech i, F(p1,p2) is the frequency with which p2 follows p1, F(p1) and F(p2) are the frequencies of p1 and p2, and N is the corpus size in terms of the number of words in the training corpus. Based on this definition and the φ² measure, consider the sentence &quot;The Fulton County Grand Jury said Friday an investigation ...&quot;, which has tag sequence &quot;ATI NP NPL JJ NN VBD NR AT NN ...&quot;. Its syntactic structure for the first seven words is shown in Figure 2.</Paragraph> <Paragraph position="20"> The φ² distribution for these parts of speech is shown in Figure 3. Position i (x axis) is the location between parts of speech pi and pi+1; the local minima are regarded as the boundaries of chunks. That is, ATI and NP belong to different chunks. Similarly, (NPL and JJ), (NN and VBD), and (VBD and NR) have the same situation. Let us discuss these concepts formally. For a probabilistic chunker, the generalized contingency table is defined as follows:
a = F(c1,c2), b = F(c1) - a, c = F(c2) - a, d = N - a - b - c,</Paragraph> <Paragraph position="22"> where ci denotes chunk i, F(c1,c2) is the frequency with which c2 follows c1, F(c1) and F(c2) are the frequencies of c1 and c2, and N is the corpus size in terms of the number of words in the training corpus. Let the tag sequence P be p1, p2, ..., pn. 
Assume there are two possible chunked results. The first is composed of two chunks, i.e., \[p1, p2, ..., pi\] and \[pi+1, pi+2, ..., pn\], and is regarded as a correct result. The second is also composed of two chunks, i.e., \[p1, p2, ..., pi-1\] and \[pi, pi+1, ..., pn\], but is regarded as a wrong result. Since \[p1, p2, ..., pi\] is a chunk, \[p1, p2, ..., pi-1\] is very likely to be followed by pi. In other words, F(\[p1, p2, ..., pi-1\],\[pi\]) is close to F(\[p1, p2, ..., pi-1\]).</Paragraph> <Paragraph position="24"> For the first chunked result, we can obtain the following contingency table:</Paragraph> <Paragraph position="26"> Similarly, the following contingency table is obtained for the second chunked result:</Paragraph> <Paragraph position="28"> The above derivation tells us that the local minima of the φ² distribution denote plausible boundaries between two chunks. To simplify Definition 2, Definitions 3 and 4 are formulated. Definition 3:
φ²(\[pi\],\[pi+1\]) = (ad - bc)² / ((a + b)(a + c)(b + d)(c + d)), with a = F(\[pi\],\[pi+1\]), b = F(\[pi\]) - a, c = F(\[pi+1\]) - a, and d = N - a - b - c,</Paragraph> <Paragraph position="30"> where pi denotes part-of-speech i, F(\[pi\],\[pi+1\]) is the frequency with which pi+1 follows pi, F(\[pi\]) and F(\[pi+1\]) are the frequencies of pi and pi+1, and N is the corpus size in terms of the number of words in the training corpus. It is clear that Definition 3 is the same as Definition 1. Based on Definition 3, the probabilistic chunker is presented as follows. 
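The algorithm figure itself is not reproduced here; the following is a minimal sketch of a Definition-3-style chunker, under our assumptions: unigram and bigram tag counts come from a training corpus, the pairwise φ² score is assigned to each position between adjacent tags, and chunk boundaries are cut at local minima of that score. Function and variable names are ours, and the toy counts below are fabricated for illustration only.

```python
from collections import Counter

def phi2(a, b, c, d):
    # phi^2 statistic of a 2x2 contingency table (Gale-and-Church style).
    den = (a + b) * (a + c) * (b + d) * (c + d)
    return (a * d - b * c) ** 2 / den if den else 0.0

def chunk(tags, unigram, bigram, corpus_size):
    """Position i gets the phi^2 score of the adjacent pair
    (tags[i], tags[i+1]); a boundary is placed at each local minimum."""
    scores = []
    for p, q in zip(tags, tags[1:]):
        a = bigram[(p, q)]                 # q follows p
        b = unigram[p] - a
        c = unigram[q] - a
        d = corpus_size - a - b - c
        scores.append(phi2(a, b, c, d))
    chunks, start = [], 0
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("inf")
        if s < left and s < right:         # local minimum -> chunk boundary
            chunks.append(tags[start:i + 1])
            start = i + 1
    chunks.append(tags[start:])
    return chunks

# Fabricated counts: ATI-NN and IN-ATI cohere strongly, NN-IN does not.
unigram = Counter({"ATI": 10, "NN": 12, "IN": 8})
bigram = Counter({("ATI", "NN"): 9, ("NN", "IN"): 1, ("IN", "ATI"): 7})
print(chunk(["ATI", "NN", "IN", "ATI", "NN"], unigram, bigram, 100))
# -> [['ATI', 'NN'], ['IN', 'ATI', 'NN']]
```

The weak NN-IN transition produces the single local minimum of the φ² curve, so the sequence is cut there, mirroring how the text reads boundaries off Figure 3.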
Note that N is the length of the tag sequence and the last chunk is always a one-tag chunk.</Paragraph> <Paragraph position="32"> φ²(\[pi, pi+1\],\[pi+2\]) = (ad - bc)² / ((a + b)(a + c)(b + d)(c + d)), with a = F(\[pi, pi+1\],\[pi+2\]), b = F(\[pi, pi+1\]) - a, c = F(\[pi+2\]) - a, and d = N - a - b - c,
where pi denotes part-of-speech i, F(\[pi, pi+1\],\[pi+2\]) is the frequency with which pi+2 follows the pair (pi, pi+1), F(\[pi, pi+1\]) and F(\[pi+2\]) are the frequencies of (pi, pi+1) and pi+2, and N is the corpus size in terms of the number of words in the training corpus.</Paragraph> <Section position="1" start_page="168" end_page="169" type="sub_section"> <SectionTitle> Right Chunk </SectionTitle> <Paragraph position="0"/> <Paragraph position="2"> φ²(\[pi\],\[pi+1, pi+2\]) = (ad - bc)² / ((a + b)(a + c)(b + d)(c + d)), with a = F(\[pi\],\[pi+1, pi+2\]), b = F(\[pi\]) - a, c = F(\[pi+1, pi+2\]) - a, and d = N - a - b - c,
where pi denotes part-of-speech i, F(\[pi\],\[pi+1, pi+2\]) is the frequency with which the pair (pi+1, pi+2) follows pi, F(\[pi\]) and F(\[pi+1, pi+2\]) are the frequencies of pi and (pi+1, pi+2), and N is the corpus size in terms of the number of words in the training corpus.</Paragraph> <Paragraph position="3"> In each while loop, the probabilistic chunker based on Definition 4 processes three parts of speech and considers the φ² distribution among them.</Paragraph> </Section> </Section> </Paper>