<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2126"> <Title>Word Order Acquisition from Corpora</Title> <Section position="4" start_page="873" end_page="876" type="evalu"> <SectionTitle> 3 Experiments and Discussion </SectionTitle> <Paragraph position="0"> In our experiment, we used the Kyoto University text corpus (Version 2) (Kurohashi and Nagao, 1997), a tagged corpus of the Mainichi newspaper. For training, we used 17,562 sentences from newspaper articles appearing in 1995, from January 1st to January 8th and from January 10th to June 9th. For testing, we used 2,394 sentences from articles appearing on January 9th and from June 10th to June 30th.</Paragraph> <Section position="1" start_page="873" end_page="873" type="sub_section"> <SectionTitle> 3.1 Definition of Word Order in a Corpus </SectionTitle> <Paragraph position="0"> In the Kyoto University corpus, each bunsetsu has only one modifiee. When a bunsetsu Bm depends on a bunsetsu Bd and there is a bunsetsu Bp that depends on and is coordinate with Bd, Bp has not only the information that its modifiee is Bd but also a label indicating a coordination, that is, the information that it is coordinate with Bd. This information indirectly shows that the bunsetsu Bm can depend on both Bp and Bd. In this case, we consider Bm a modifier of both Bp and Bd.</Paragraph> <Paragraph position="1"> Under this condition, modifiers of a bunsetsu B are identified in the following steps.</Paragraph> <Paragraph position="2"> 1. Bunsetsus that depend on a bunsetsu B are classified as modifiers of B.</Paragraph> <Paragraph position="3"> 2. When B has a label indicating a coordination, bunsetsus that are to the left of B and depend on the same modifiee as B are classified as modifiers of B.</Paragraph> <Paragraph position="4"> 3. Bunsetsus that depend on a modifier of B and have a label indicating a coordination are classified as modifiers of B. The third step is repeated. 
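The three identification steps above can be sketched as follows. This is a minimal illustration under a hypothetical encoding (not taken from the paper or the corpus format): each bunsetsu is a dict holding the index of its modifiee under "head" and its coordination label under "coord".

```python
def modifiers_of(b, sentence):
    """Collect indices of bunsetsus treated as modifiers of bunsetsu b."""
    mods = set()
    # Step 1: bunsetsus that directly depend on b.
    for i, u in enumerate(sentence):
        if u["head"] == b:
            mods.add(i)
    # Step 2: if b carries a coordination label, bunsetsus to its left
    # sharing b's modifiee also count as modifiers of b.
    if sentence[b]["coord"]:
        for i, u in enumerate(sentence[:b]):
            if u["head"] == sentence[b]["head"]:
                mods.add(i)
    # Step 3 (repeated until no change): coordinated bunsetsus that
    # depend on a modifier of b are themselves modifiers of b.
    changed = True
    while changed:
        changed = False
        for i, u in enumerate(sentence):
            if u["coord"] and u["head"] in mods and i not in mods:
                mods.add(i)
                changed = True
    return sorted(mods)
```

Under this toy encoding, a four-bunsetsu sentence in which bunsetsu 0 is coordinated with bunsetsu 1, and bunsetsus 1 and 2 depend on the final bunsetsu 3, yields all of 0, 1, and 2 as modifiers of 3, mirroring the expansion of coordinations described above.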
When the above procedure is completed, all bunsetsus that coordinate with each other are identified as modifiers which depend on the same modifiee. For example, from the data listed on the left side of Table 2, the modifiers listed in the right-hand column are identified for each bunsetsu. &quot;Taro_to (Taro and),&quot; &quot;Hanako_to (Hanako),&quot; and &quot;dete, (participate,)&quot; are all identified as modifiers which depend on the same modifiee &quot;yusyo_sita. (won.).&quot;</Paragraph> </Section> <Section position="2" start_page="873" end_page="875" type="sub_section"> <SectionTitle> 3.2 Experimental Results </SectionTitle> <Paragraph position="0"> The features used in our experiment are listed in Tables 3 and 4. Each feature consists of a type and a value. The features consist basically of some attributes of the bunsetsu itself, and syntactic and contextual information. We call the features listed in Table 3 'basic features.' We selected them manually so that they reflect the basic conditions governing word order that were summarized by Saeki (Saeki, 1998). The features in Table 4 are combinations of basic features ('combined features') and were also selected manually. They are represented by the name of the target bunsetsu plus the feature type of the basic features. The total number of features was about 190,000, and 51,590 of them were observed in the training corpus three or more times. These were the ones we used in our experiment.</Paragraph> <Paragraph position="1"> The following terms are used in these tables: Mdfr1, Mdfr2, Mdfe: The word order model described in Section 2.1 estimates the probability that modifiers are in the appropriate order as the product of the probabilities of all pairs of modifiers. 
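The pairwise decomposition just described can be sketched as follows. This is a simplified illustration: `pair_prob` stands in for the model's learned estimate that a given pair of modifiers is correctly ordered, and is hypothetical.

```python
from itertools import combinations, permutations

def order_probability(modifiers, pair_prob):
    """Probability that a modifier sequence is in the appropriate order,
    approximated as the product over all left-right pairs."""
    p = 1.0
    for left, right in combinations(modifiers, 2):
        p *= pair_prob(left, right)
    return p

def best_order(modifiers, pair_prob):
    """Choose the permutation of modifiers with the highest product score."""
    return max(permutations(modifiers),
               key=lambda seq: order_probability(seq, pair_prob))
```

For instance, with a toy `pair_prob` that prefers alphabetical order, `best_order` recovers the alphabetical permutation, since any inversion multiplies in a low pairwise probability.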
When estimating the probability for each pair of modifiers, the model assumes that the two modifiers are in the appropriate order.</Paragraph> <Paragraph position="2"> Here we call the left modifier Mdfr1, the right modifier Mdfr2, and their modifiee Mdfe.</Paragraph> <Paragraph position="3"> Head: the rightmost word in a bunsetsu other than those whose major part-of-speech category1 is 1Part-of-speech categories follow those of JUMAN (Kurohashi and Nagao, 1998).</Paragraph> <Paragraph position="5"> &quot;special marks,&quot; &quot;post-positional particles,&quot; or &quot;suffixes.&quot; Head-Lex: the fundamental form (uninflected form) of the head word. Only words with a frequency of five or more are used.</Paragraph> <Paragraph position="6"> Head-Inf: the inflection type of a head.</Paragraph> <Paragraph position="7"> SemFeat: We use the upper three layers of bunrui goihyou (NLRI (National Language Research Institute), 1964) as semantic features. Bunrui goihyou is a Japanese thesaurus that has a tree structure and consists of seven layers. The tree has words in its leaves, and each word has a figure indicating its category number. For example, the figure in parentheses of a feature &quot;Head-SemFeat(110)&quot; in Table 3 shows the upper three digits of the category number of the head word, i.e., the ancestor node of the head word in the third layer of the tree.</Paragraph> <Paragraph position="8"> Type: the rightmost word other than those whose major part-of-speech category is &quot;special marks.&quot; If the major category of the word is neither &quot;post-positional particles&quot; nor &quot;suffixes,&quot; and the word is inflectable,2 then the type is represented by the inflection type.</Paragraph> <Paragraph position="9"> JOSHI1, JOSHI2: JOSHI1 is the rightmost post-positional particle in the bunsetsu. 
And if there are two or more post-positional particles in the bunsetsu, JOSHI2 is the second-rightmost post-positional particle.</Paragraph> <Paragraph position="10"> NumberOfMdfrs: number of modifiers.</Paragraph> <Paragraph position="11"> 2The inflection types follow those of JUMAN.</Paragraph> <Paragraph position="12"> Mdfr1-MdfrType, Mdfr2-MdfrType: the types of the modifiers of Mdfr1 and Mdfr2.</Paragraph> <Paragraph position="13"> X-IDto-Y: X is identical to Y.</Paragraph> <Paragraph position="14"> Repetition-Head-Lex: a repetition word appearing in a preceding sentence.</Paragraph> <Paragraph position="15"> ReferencePronoun: a reference pronoun appearing in the target bunsetsu or in its modifiers. Categories 1 to 6 in Table 3 represent attributes in a bunsetsu, categories 7 to 10 represent syntactic information, and categories 11 and 12 represent contextual information.</Paragraph> <Paragraph position="16"> The results of our experiment are listed in Table 5. The first line shows the agreement rate when we estimated word order for 5,278 bunsetsus that have two or more modifiers and were extracted from the 2,394 sentences appearing on January 9th and from June 10th to June 30th. We used bunsetsu boundary information and syntactic and contextual information which were derivable from the test corpus and related to the input bunsetsus. As syntactic information we used dependency information, coordinate structure, and information on whether the target bunsetsu is at the end of a sentence. As contextual information we used the preceding sentence. The values in the row labeled Baseline1 in Table 5 are the agreement rates obtained when the order of every pair of modifiers was selected randomly. And the values in the Baseline2 row are the agreement rates obtained when we used the following equation instead of Eq. 
(5):</Paragraph> <Paragraph position="18"> Here we assume that B1 and B2 are modifiers, their modifiee is B, and the word types of B1 and B2 are respectively w1 and w2. The values freq(w12) and freq(w21) then respectively represent the frequencies with which w1 and w2 appeared in the orders &quot;w1, w2, and w&quot; and &quot;w2, w1, and w&quot; in Mainichi newspaper articles from 1991 to 1997.3 Equation (7) means that given the sentence &quot;Taro_wa / tennis_wo / sita.,&quot; the one of the two possibilities &quot;wa / wo / sita.&quot; and &quot;wo / wa / sita.&quot; that has the higher frequency is selected.</Paragraph> </Section> <Section position="3" start_page="875" end_page="875" type="sub_section"> <SectionTitle> 3.3 Features and Agreement Rate </SectionTitle> <Paragraph position="0"> This section describes how much each feature set contributes to improving the agreement rate.</Paragraph> <Paragraph position="1"> The values listed in the rightmost columns in Tables 3 and 4 show the performance of the word order estimation without each feature set. The values in parentheses are the percentage of improvement or degradation relative to the formal experiment. In the experiments, when a basic feature was deleted, the combined features that included the basic feature were also deleted. The most useful feature is the type of 3When w1 and w2 were the same word, we used the head words in B1 and B2 as w1 and w2. When one of freq(w12) and freq(w21) was zero and the other was five or more, we used the frequencies with which they appeared in the orders &quot;w1 w2&quot; and &quot;w2 w1,&quot; respectively, instead of freq(w12) and freq(w21). When both freq(w12) and freq(w21) were zero, we instead used random figures between 0 and 1.</Paragraph> <Paragraph position="2"> bunsetsu, which basically signifies the case marker or inflection type. This result is close to our expectations. 
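The Baseline2 decision rule can be sketched as follows; `freq` is a hypothetical count table standing in for the Mainichi frequencies described above.

```python
def baseline2_order(w1, w2, freq):
    """Return the word-type pair in whichever order was more frequent
    in the corpus (ties kept in the given order)."""
    if freq.get((w1, w2), 0) >= freq.get((w2, w1), 0):
        return (w1, w2)
    return (w2, w1)
```

With illustrative counts in which "wa ... wo" appears more often than "wo ... wa," the rule orders the pair as "wa / wo" regardless of the order it is given in.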
We selected features that, according to linguistic studies, reflect as much as possible the basic conditions governing word order. The rightmost column in Tables 3 and 4 shows the extent to which each condition contributes to improving the agreement rate.</Paragraph> <Paragraph position="3"> However, each category of features might be coarser than what is linguistically interesting. For example, all case markers such as &quot;wa&quot; and &quot;wo&quot; were classified into the same category, and were deleted together in the experiment when single categories were removed. An experiment that considers each of these markers separately would help us verify the importance of each marker. If we find new features in future linguistic research on word order, experiments lacking each feature separately would help us verify their importance in the same manner.</Paragraph> </Section> <Section position="4" start_page="875" end_page="876" type="sub_section"> <SectionTitle> 3.4 Training Corpus and Agreement Rate </SectionTitle> <Paragraph position="0"> The agreement rates for the training corpus and the test corpus are shown in Figure 1 as a function of the amount of training data (number of sentences).</Paragraph> <Paragraph position="1"> (Figure 1: Relation between the amount of training data and the agreement rate.)</Paragraph> <Paragraph position="2"> The agreement rates in the &quot;pair of modifiers&quot; and &quot;Complete agreement&quot; measurements were respectively 82.54% and 68.40%. These values were obtained with very small training sets (250 sentences). These rates are considerably higher than those of the baselines, indicating that word order in Japanese can be acquired from newspaper articles even with a small training set.</Paragraph> <Paragraph position="3"> With 17,562 training sentences, the agreement rate in the &quot;Complete agreement&quot; measurement was 75.41%. 
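As a rough sketch of the two measurements reported above (our own illustration, not the paper's evaluation code): "pair of modifiers" agreement counts modifier pairs whose relative order matches the original text, while "Complete agreement" requires the whole modifier sequence of a modifiee to match.

```python
from itertools import combinations

def pair_agreement(predicted, reference):
    """Fraction of modifier pairs whose relative order matches the reference."""
    pos = {m: i for i, m in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    ok = sum(1 for a, b in pairs if pos[a] < pos[b])
    return ok / len(pairs)

def complete_agreement(predicted, reference):
    """1 if the predicted modifier sequence matches the original text, else 0."""
    return int(list(predicted) == list(reference))
```

For three modifiers with one inversion, two of the three pairs still agree, so the pairwise rate (2/3) exceeds the complete-agreement rate (0), which is why the "pair of modifiers" figures above are consistently higher.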
We randomly selected and analyzed 100 modifiees from the 1,298 modifiees whose modifiers' word order did not agree with that in the original text. We found that 48 of them were in a natural order and 52 of them were in an unnatural order. The former result shows that the word order was relatively free and several orders were acceptable. The latter result shows that the word order acquisition was not sufficient. To complete the acquisition we need more training corpora and features which take into account different information than that in Tables 3 and 4. We found many idiomatic expressions in the unnatural word order results, such as &quot;houchi-kokka_ga (a country under the rule of law) / kiite (to listen) / akireru (to be disgusted),&quot; &quot;souan-sita-no_ga (origination) / somosomo-no (at all) / hajimari (the beginning),&quot; and &quot;(taste) / seikon (one's heart and soul) / komeru (to put something into something).&quot; We think that the appropriate word order for these idiomatic expressions could be acquired if we had more training data. We also found several coordinate structures in the unnatural word order results, suggesting that we should survey linguistic studies on coordinate structures and try to find efficient features for acquiring word order from coordinate structures.</Paragraph> <Paragraph position="4"> We did not use the results of semantic and contextual analyses as input because corpora with semantic and contextual tags were not available. If such corpora were available, we could more efficiently use features dealing with semantic features, reference pronouns, and repetition words. 
We plan to make corpora with semantic and contextual tags and use these tags as input.</Paragraph> <Paragraph position="5"> 3.5 Acquisition from a Raw Corpus In this section, we show that a raw corpus instead of a tagged corpus can be used to train the model, if it is first analyzed by a parser. We used the morphological analyzer JUMAN and the parser KNP (Kurohashi, 1998), which is based on a dependency grammar, in order to extract information from a raw corpus for detecting whether or not each feature is found.</Paragraph> <Paragraph position="6"> The accuracy of JUMAN for detecting morphological boundaries and part-of-speech tags is about 98%, and the parser's dependency accuracy is about 90%.</Paragraph> <Paragraph position="7"> These results were obtained from analyzing Mainichi newspaper articles.</Paragraph> <Paragraph position="8"> We used 217,562 sentences for training. When these sentences were all extracted from a raw corpus, the agreement rate was 87.64% for &quot;pair of modifiers&quot; and 75.77% for &quot;Complete agreement.&quot; When the 217,562 training sentences were sentences from the tagged corpus (17,562 sentences) used in our formal experiment and from a raw corpus, the agreement rate for &quot;pair of modifiers&quot; was 87.66% and for &quot;Complete agreement&quot; was 75.88%. These rates were about 0.5% higher than those obtained when we used only sentences from a tagged corpus. Thus, we can acquire word order by adding information from a raw corpus even if we do not have a large tagged corpus. The results also indicate that the parser accuracy is not so significant for word order acquisition and that an accuracy of about 90% is sufficient.</Paragraph> </Section> </Section> </Paper>