<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1023"> <Title>A Fully-Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis</Title> <Section position="3" start_page="176" end_page="177" type="metho"> <SectionTitle> 2 Automatically Constructed Case Frames </SectionTitle> <Paragraph position="0"> We employ automatically constructed case frames (Kawahara and Kurohashi, 2002) for our model of case structure analysis. This section outlines the method for constructing the case frames.</Paragraph> <Paragraph position="2"> [Figure 1: examples of the constructed case frames; for instance, one case frame of "(be given)" has a "wo" slot with examples such as advice, instruction and address, and a "kara" slot with examples such as <agent>, president and circle.]</Paragraph> <Paragraph position="3"> A large corpus is automatically parsed, and case frames are constructed from modifier-head examples in the resulting parses. The problems in automatic case frame construction are syntactic and semantic ambiguity: parsing results inevitably contain errors, and verb senses are intrinsically ambiguous. To cope with these problems, case frames are gradually constructed from reliable modifier-head examples.</Paragraph> <Paragraph position="4"> First, modifier-head examples that have no syntactic ambiguity are extracted, and they are disambiguated by couples of a verb and its closest case component. Such couples are explicitly expressed on the surface of text, and can be considered to play an important role in sentence meaning. For instance, examples are distinguished not by verbs alone (e.g., "tsumu" (load/accumulate)), but by couples (e.g., "nimotsu-wo tsumu" (load baggage) and "keiken-wo tsumu" (accumulate experience)). Modifier-head examples aggregated in this way yield basic case frames.</Paragraph> <Paragraph position="5"> Thereafter, the basic case frames are clustered to merge similar case frames. For example, since "nimotsu-wo tsumu" (load baggage) and "busshi-wo tsumu" (load supplies) are similar, they are merged. The similarity is measured using a thesaurus (Ikehara et al., 1997).</Paragraph> <Paragraph position="6"> Using this gradual procedure, we constructed case frames from a web corpus (Kawahara and Kurohashi, 2006). The case frames were obtained from approximately 470M sentences extracted from the web. They covered 90,000 verbs, and the average number of case frames for a verb was 34.3.</Paragraph> <Paragraph position="7"> Figure 1 shows some examples of the resulting case frames. In the figure, 'CS' means a case slot. <agent> is a generalized example, which is given to a case slot in which half of the examples belong to <agent> in a thesaurus (Ikehara et al., 1997). <agent> is also given to a "ga" case slot that has no examples, because "ga" case components are usually agentive and often omitted.</Paragraph> </Section> <Section position="4" start_page="177" end_page="180" type="metho"> <SectionTitle> 3 Integrated Probabilistic Model for Syntactic and Case Structure Analysis </SectionTitle> <Paragraph position="0"> The proposed method gives a probability to each possible syntactic structure T and case structure L of the input sentence S, and outputs the syntactic and case structure that have the highest probability. That is to say, the system selects the syntactic structure Tbest and the case structure Lbest that maximize the probability P(T,L|S):</Paragraph> <Paragraph position="2"> (Tbest, Lbest) = argmax_(T,L) P(T,L|S) = argmax_(T,L) P(T,L,S) / P(S) = argmax_(T,L) P(T,L,S)   (1) The last equation is derived because P(S) is constant.</Paragraph>
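<Paragraph position="3"> As a rough illustration of this selection step (a minimal sketch under our own assumptions, not the authors' implementation), the argmax over candidate analyses can be realized as follows. The Analysis container and the candidate probabilities are hypothetical, and log-probabilities are used only to avoid numerical underflow.

import math
from dataclasses import dataclass

@dataclass
class Analysis:
    # A candidate pair of syntactic structure T and case structure L,
    # together with its joint log-probability log P(T, L, S).
    tree: str
    case_structure: str
    log_joint: float

def select_best(candidates):
    # P(S) is constant across candidates for a fixed sentence S, so
    # maximizing P(T, L | S) reduces to maximizing the joint P(T, L, S).
    return max(candidates, key=lambda a: a.log_joint)

# Toy usage with two hypothetical analyses of one sentence.
left = Analysis("bentou-wa -> tabete", "CS-1", math.log(3e-7))
right = Analysis("bentou-wa -> syuppatsu-shita", "CS-2", math.log(1e-7))
assert select_best([left, right]) is left
</Paragraph>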
<Section position="1" start_page="177" end_page="180" type="sub_section"> <SectionTitle> 3.1 Generative Model for Syntactic and Case Structure Analysis </SectionTitle> <Paragraph position="0"> We propose a generative probabilistic model based on the dependency formalism. This model considers a clause as a unit of generation, and generates the input sentence clause by clause from the end of the sentence. P(T,L,S) is defined as the product of the probabilities of generating each clause Ci given its modifying head bunsetsu bhi:</Paragraph> <Paragraph position="2"> P(T,L,S) = prod_i P(Ci|bhi)   (2) The last clause of a sentence does not have a modifying head, but we handle it by assuming bhn = EOS (End Of Sentence).</Paragraph> <Paragraph position="3"> For example, consider the sentence in Figure 1. There are two possible dependency structures, and for each structure the product of the probabilities indicated below the tree is calculated. Finally, the model chooses the structure with the highest probability (in this case, the left one).</Paragraph> <Paragraph position="4"> Ci is decomposed into its predicate type fi (including the predicate's inflection) and the remaining case structure CSi. This means that the predicate included in CSi is lemmatized. Bunsetsu bhi is also decomposed into its content part whi and its type fhi:</Paragraph> <Paragraph position="6"> P(Ci|bhi) = P(CSi, fi|whi, fhi) = P(CSi|fi, whi, fhi) x P(fi|whi, fhi) ≈ P(CSi|fi, whi) x P(fi|fhi)   (3) The last equation is derived because the content part in CSi is independent of the type of its modifying head (fhi), and in most cases the type fi is independent of the content part of its modifying head (whi). For example, P(bentou-wa tabete|syuppatsu-shita) is decomposed into the case structure probability P(CS(bentou-wa taberu)|te, syuppatsu-suru) and the corresponding predicate type probability.</Paragraph> <Paragraph position="8"> We call P(CSi|fi,whi) the generative model for case structure and P(fi|fhi) the generative model for predicate type. The following two sections describe these models.</Paragraph> <Paragraph position="9"> 3.2 Generative Model for Case Structure We propose a generative probabilistic model of case structure. This model selects a case frame that matches the input case components, and makes correspondences between input case components and case slots.</Paragraph> <Paragraph position="10"> A case structure CSi consists of a predicate vi, a case frame CFl and a case assignment CAk. Case assignment CAk represents correspondences between input case components and case slots, as shown in Figure 2. Note that there are various possible case assignments in addition to that of Figure 2, such as corresponding "bentou" (lunchbox) with the "ga" case. Accordingly, the index k of CAk ranges up to the number of possible case assignments. By splitting CSi into vi, CFl and CAk,</Paragraph> <Paragraph position="12"> P(CSi|fi, whi) = P(vi, CFl, CAk|fi, whi) ≈ P(vi|whi) x P(CFl|vi) x P(CAk|CFl, fi)   (4) The above approximation is given because it is natural to consider that the predicate vi depends on its modifying head whi, that the case frame CFl only depends on the predicate vi, and that the case assignment CAk depends on the case frame CFl and the predicate type fi.</Paragraph> <Paragraph position="13"> The probabilities P(vi|whi) and P(CFl|vi) are estimated from the case structure analysis results of a large raw corpus. The remainder of this section illustrates P(CAk|CFl,fi) in detail.</Paragraph>
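<Paragraph position="14"> To make the decomposition in equation (4) concrete, the following sketch (our illustration under assumed data structures, not the authors' code) scores every candidate case frame and case assignment for a predicate and keeps the best one. The case_frames dictionary, the three probability functions, and enumerate_assignments are hypothetical stand-ins for the models estimated from corpora.

from itertools import permutations

def enumerate_assignments(slots, components):
    # Enumerate injective mappings from input case components to case slots;
    # slots left unmapped remain vacant (A(s_j) = 0).
    for chosen in permutations(slots, len(components)):
        yield dict(zip(chosen, components))

def best_case_structure(v, wh, f, components, case_frames, p_v, p_cf, p_ca):
    # Maximize P(v|wh) * P(CF|v) * P(CA|CF, f) over the case frames CF of
    # predicate v and over the case assignments CA, following equation (4).
    best, best_p = None, 0.0
    for cf in case_frames[v]:
        for ca in enumerate_assignments(cf["slots"], components):
            p = p_v(v, wh) * p_cf(cf, v) * p_ca(ca, cf, f)
            if p > best_p:
                best, best_p = (cf, ca), p
    return best, best_p

# Toy usage with one hypothetical case frame of "taberu" (eat).
cfs = {"taberu": [{"id": "taberu1", "slots": ["ga", "wo"]}]}
uniform = lambda *args: 0.5  # hypothetical constant probability tables
print(best_case_structure("taberu", "syuppatsu-suru", "te", ["bentou"],
                          cfs, uniform, uniform, uniform))
</Paragraph>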
<Paragraph position="15"> Let us consider case assignment CAk for each case slot sj in case frame CFl. P(CAk|CFl,fi) can be decomposed into the following product, depending on whether a case slot sj is filled with an input case component (content part nj and type fj):</Paragraph> <Paragraph position="16"> P(CAk|CFl, fi) = prod_{sj:A(sj)=1} { P(A(sj)=1|CFl, sj, fi) x P(nj, fj|CFl, sj, A(sj)=1, fi) } x prod_{sj:A(sj)=0} P(A(sj)=0|CFl, sj, fi)   (5) where the function A(sj) returns 1 if a case slot sj is filled with an input case component, and 0 otherwise.</Paragraph> <Paragraph position="18"> The case slot assignment probabilities are approximated as P(A(sj)=1|CFl, sj, fi) ≈ P(A(sj)=1|CFl, sj)   (6) and P(A(sj)=0|CFl, sj, fi) ≈ P(A(sj)=0|CFl, sj)   (7) because the evaluation of case slot assignment depends only on the case frame. We call these probabilities the generative probability of a case slot, and they are estimated from the case structure analysis results of a large corpus.</Paragraph> <Paragraph position="19"> Let us calculate P(CSi|fi,whi) using the example in Figure 1. In the sentence, "wa" is a topic-marking (TM) postposition, and hides the case marker. The generative probability of the case structure varies depending on the case slot to which the topic-marked phrase is assigned.</Paragraph> <Paragraph position="20"> For example, when a case frame of "taberu" (eat), CFtaberu1, with "ga" and "wo" case slots is used, and the topic-marked phrase is assigned to the "wo" slot, P(CS(bentou-wa taberu)|te,syuppatsu-suru) is calculated as follows:</Paragraph> <Paragraph position="21"> P(CS(bentou-wa taberu)|te, syuppatsu-suru) = P(taberu|syuppatsu-suru) x P(CFtaberu1|taberu) x P(A(ga)=0|CFtaberu1, ga) x P(A(wo)=1|CFtaberu1, wo) x P(bentou|CFtaberu1, wo, A(wo)=1) x P(fj|wo, te) where fj is the type of "bentou-wa". Such probabilities are computed for each case frame of "taberu" (eat), and the case frame and its corresponding case assignment that have the highest probability are selected.</Paragraph> <Paragraph position="24"> We now describe the generative probability of a case component, P(nj, fj|CFl, sj, A(sj)=1, fi).</Paragraph> <Paragraph position="25"> We approximate the generative probability of a case component, assuming that: * the generative probability of the content part nj is independent of that of the type fj, * and the interpretation of the surface case included in fj does not depend on case frames.</Paragraph> <Paragraph position="27"> Taking these assumptions into account, the generative probability of a case component is approximated as follows: P(nj, fj|CFl, sj, A(sj)=1, fi) ≈ P(nj|CFl, sj, A(sj)=1) x P(fj|sj, fi)   (8) P(nj|CFl,sj,A(sj)=1) is the probability of generating a content part nj from a case slot sj in a case frame CFl. This probability is estimated from the case frames.</Paragraph> <Paragraph position="29"> Let us consider P(fj|sj,fi) in equation (8). This is the probability of generating the type fj of a case component that corresponds to the case slot sj. Since the type fj consists of a surface case cj (a surface case means a postposition sequence at the end of a bunsetsu, such as "ga", "wo", "koso" and "demo"), a punctuation mark (comma) pj and a topic marker "wa" tj, P(fj|sj,fi) is rewritten as follows (using the chain rule):</Paragraph> <Paragraph position="30"> P(fj|sj, fi) = P(cj, pj, tj|sj, fi) ≈ P(cj|sj) x P(pj|fi) x P(tj|fi, pj) This approximation is given by assuming that cj only depends on sj, that pj only depends on fi, and that tj depends on fi and pj. P(cj|sj) is estimated from the Kyoto Text Corpus (Kawahara et al., 2002), in which the relationship between a surface case marker and a case slot is annotated by hand.</Paragraph> <Paragraph position="31"> In Japanese, a punctuation mark and a topic marker are likely to be used when their bunsetsu has a long-distance dependency. To consider this tendency, fi can be regarded as (oi, ui), where oi indicates whether the dependent bunsetsu gets over another head candidate before its modifying head vi, and ui is the clause type of vi. The value of oi is binary, and ui is one of the clause types described in (Kawahara and Kurohashi, 1999). Accordingly, P(pj|fi) and P(tj|fi, pj) are regarded as P(pj|oi, ui) and P(tj|oi, ui, pj), respectively.</Paragraph>
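<Paragraph position="33"> The per-slot product of equations (5)-(8) can be sketched as follows (again our own hypothetical illustration; the dictionary-backed probability tables stand in for the models estimated from the corpora above):

def p_case_assignment(assignment, cf, f_i, models):
    # Sketch of P(CA|CF, f_i): multiply, over all case slots of the case
    # frame cf, either the vacant-slot probability P(A(s_j)=0|CF, s_j) or
    # the filled-slot factors P(A(s_j)=1|CF, s_j) * P(n_j|CF, s_j, A=1)
    # * P(f_j|s_j, f_i), following equations (5)-(8).
    p = 1.0
    for slot in cf["slots"]:
        component = assignment.get(slot)
        if component is None:
            p *= models["p_vacant"][(cf["id"], slot)]
        else:
            n_j, f_j = component
            p *= models["p_filled"][(cf["id"], slot)]
            p *= models["p_content"][(cf["id"], slot, n_j)]
            # P(f_j|s_j, f_i), factored as P(c_j|s_j) * P(p_j|o_i, u_i)
            # * P(t_j|o_i, u_i, p_j), with f_i regarded as (o_i, u_i).
            c_j, p_j, t_j = f_j
            o_i, u_i = f_i
            p *= models["p_case"][(c_j, slot)]
            p *= models["p_comma"][(p_j, o_i, u_i)]
            p *= models["p_topic"][(t_j, o_i, u_i, p_j)]
    return p
</Paragraph>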
<Paragraph position="34"> 3.3 Generative Model for Predicate Type Now, consider P(fi|fhi) in equation (3). This is the probability of generating the predicate type of a clause Ci that modifies bhi. This probability varies depending on the type of bhi.</Paragraph> <Paragraph position="35"> When bhi is a predicate bunsetsu, Ci is a subordinate clause embedded in the clause of bhi. As for the types fi and fhi, it is necessary to consider punctuation marks (pi, phi) and clause types (ui, uhi). To capture a long-distance dependency indicated by punctuation marks, ohi (whether Ci has a possible head candidate before bhi) is also considered.</Paragraph> <Paragraph position="37"> When bhi is a noun bunsetsu, Ci is a clause embedded in bhi. In this case, the clause type and punctuation mark of the modifiee do not affect the probability.</Paragraph> <Paragraph position="39"> [Table 2: the probabilities of the model, what each of them generates, and the data from which each is estimated.]</Paragraph> <Paragraph position="41"/> </Section> </Section> <Section position="5" start_page="180" end_page="181" type="metho"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We evaluated the syntactic structures and case structures output by our model. Each parameter is estimated using maximum likelihood from the data described in Table 2. None of these data exist beforehand or are obtainable by a single process; they are acquired by applying syntactic analysis, case frame construction and case structure analysis in turn. The case structure analysis in this table is the similarity-based method of Kawahara and Kurohashi (2002). The case frames were automatically constructed from the web corpus comprising 470M sentences, and the case structure analysis results were obtained from 6M sentences in the web corpus.</Paragraph> <Paragraph position="1"> The rest of this section first describes the experiments for syntactic structure, and then reports the experiments for case structure.</Paragraph> <Section position="1" start_page="180" end_page="181" type="sub_section"> <SectionTitle> 4.1 Experiments for Syntactic Structure </SectionTitle> <Paragraph position="0"> We evaluated the syntactic structures analyzed by the proposed model. Our experiments were run on 675 hand-annotated web sentences (the test set was not used for case frame construction or probability estimation). The web sentences were manually annotated using the same criteria as the Kyoto Text Corpus. The system input was tagged automatically using the JUMAN morphological analyzer (Kurohashi et al., 1994). The obtained syntactic structures were evaluated with regard to dependency accuracy -- the proportion of correct dependencies out of all dependencies, except for the last dependency at the sentence end (since Japanese is head-final, the second-to-last bunsetsu unambiguously depends on the last bunsetsu, and the last bunsetsu has no dependency).</Paragraph> <Paragraph position="2"> Table 3 shows the dependency accuracy. In the table, "baseline" means the rule-based syntactic parser, KNP (Kurohashi and Nagao, 1994), and "proposed" represents the proposed method. The proposed method significantly outperformed the baseline method (McNemar's test; p < 0.05). The dependency accuracies are classified into four types according to the bunsetsu classes (VB: verb bunsetsu, NB: noun bunsetsu) of a dependent and its head. The "NB-VB" type is further divided into two types: "TM" and "others". The type most related to case structure is "others" in "NB-VB": its accuracy was improved by 1.6%, and the error rate was reduced by 10.9%. This result indicates that the proposed method is effective in analyzing dependencies related to case structure.</Paragraph>
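<Paragraph position="3"> As a small illustration of the metric (our sketch, not the authors' scorer; the head-index representation is assumed), dependency accuracy can be computed as follows:

def dependency_accuracy(gold_heads, pred_heads):
    # gold_heads / pred_heads give, for each bunsetsu i, the index of the
    # bunsetsu it depends on. The last bunsetsu has no head (None), and the
    # second-to-last one unambiguously depends on the last, so the final
    # dependency is excluded from scoring.
    assert len(gold_heads) == len(pred_heads)
    scored = range(len(gold_heads) - 2)
    correct = sum(1 for i in scored if gold_heads[i] == pred_heads[i])
    return correct / max(len(scored), 1)

# Toy usage with a four-bunsetsu sentence (0-indexed heads).
gold = [1, 3, 3, None]
pred = [2, 3, 3, None]
print(dependency_accuracy(gold, pred))  # 0.5: one of two scored heads is correct
</Paragraph>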
<Paragraph position="4"> Figure 3 shows some analysis results, where the dotted lines represent the analyses by the baseline method, and the solid lines represent the analyses by the proposed method. Sentences (1) and (2) were incorrectly analyzed by the baseline method but correctly analyzed by the proposed method.</Paragraph> <Paragraph position="5"> There are two major causes of analysis errors.</Paragraph> <Paragraph position="6"> Mismatch between analysis results and annotation criteria In sentence (3) in Figure 3, the baseline method correctly recognized the head of "iin-wa" (commissioner-TM) as "hirakimasu" (open). However, the proposed method incorrectly judged it to be "oujite-imasuga" (offer). Both analysis results can be considered correct semantically, but from the viewpoint of our annotation criteria, the latter is not a syntactic relation, but an ellipsis relation. To address this problem, it is necessary to simultaneously evaluate not only syntactic relations but also indirect relations, such as ellipses and anaphora.</Paragraph> <Paragraph position="7"> Linear weighting on each probability We proposed a generative probabilistic model, and thus cannot optimize the weight of each probability. Such optimization could improve the system performance; in the future, we plan to employ a machine learning technique for this optimization.</Paragraph> </Section> </Section> <Section position="6" start_page="181" end_page="181" type="metho"> <SectionTitle> 4.2 Experiments for Case Structure </SectionTitle> <Paragraph position="0"> We applied case structure analysis to 215 web sentences that were manually annotated with case structure, and evaluated the case markers of TM phrases and clausal modifiees by comparing them with the gold standard in the corpus. The experimental results are shown in Table 4, in which the baseline refers to the similarity-based method of Kawahara and Kurohashi (2002). [Table 4: case structure analysis results; for clausal modifiees, 107/155 (69.0%) for the baseline and 121/155 (78.1%) for the proposed method.] The proposed method clearly outperformed the baseline. It is difficult to compare these results with the previous work described in the next section because of the different experimental settings (e.g., our evaluation counts parse errors as incorrect cases).</Paragraph> </Section> <Section position="7" start_page="181" end_page="182" type="metho"> <SectionTitle> 5 Related Work </SectionTitle> <Paragraph position="0"> There have been several approaches to syntactic analysis that handle lexical preference on a large scale.</Paragraph> <Paragraph position="1"> Shirai et al. (1998) proposed a PGLR-based syntactic analysis method using large-scale lexical preference. Their system learned lexical preferences, such as P(pie|wo,taberu), from a large newspaper corpus (five years of articles), but did not deal with verb sense ambiguity. They reported 84.34% accuracy on 500 relatively short sentences from the Kyoto Text Corpus.</Paragraph> <Paragraph position="2"> Fujio and Matsumoto (1998) presented a syntactic analysis method based on lexical statistics.
They made use of a probabilistic model defined as the product of a probability of two cooccurring words having a dependency and a distance probability. The model was trained on the EDR corpus, and performed with 86.89% accuracy on 10,000 sentences from the EDR corpus 5.</Paragraph> <Paragraph position="3"> On the other hand, there have been a number of machine learning-based approaches using lexical preference as their features. Among these, Kudo and Matsumoto (2002) yielded the best performance. They proposed a chunking-based dependency analysis method using Support Vector Machines, and achieved state-of-the-art accuracy 5. However, it is very hard to learn sufficient lexical preference from a hand-tagged corpus of only several tens of thousands of sentences.</Paragraph> <Paragraph position="4"> There has been some related work analyzing clausal modifiees and TM phrases. For example, Torisawa (2001) analyzed TM phrases using predicate-argument cooccurrences and word classifications induced by the EM algorithm. The accuracy was approximately 88% for "wa" and 84% for "mo". It is difficult to compare the accuracy of their system with ours, because the ranges of target expressions are different. Unlike this related work, our approach can utilize the resulting case frames for subsequent analyses, such as ellipsis or discourse analysis.</Paragraph> </Section> </Paper>