File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/i05-4002_metho.xml
Size: 11,822 bytes
Last Modified: 2025-10-06 14:09:43
<?xml version="1.0" standalone="yes"?> <Paper uid="I05-4002"> <Title>Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with Respect to Dependency Measures</Title> <Section position="4" start_page="9" end_page="10" type="metho"> <SectionTitle> 2 Annotation Policy </SectionTitle> <Paragraph position="0"> In this section, we briefly introduce our policy for annotating a Japanese syntactically annotated corpus. The details are given in (Noro et al., 2004a; Noro et al., 2004b). Although a large-scale CFG can be easily derived from a syntactically annotated corpus, such a CFG has the problem that it produces a large number of parse results during syntactic parsing (i.e. high ambiguity). A syntactically annotated corpus should be built so that the derived CFG creates less ambiguity.</Paragraph> <Paragraph position="1"> We have been building such a Japanese corpus using the following method (Figure 1): 1. Derive a CFG from an existing corpus.</Paragraph> <Paragraph position="2"> 2. Analyze major causes of ambiguity.</Paragraph> <Paragraph position="3"> 3. Determine a policy for modifying the corpus. 4. Modify the corpus according to the policy and derive a CFG from it again.</Paragraph> <Paragraph position="4"> 5. 
Repeat steps (2) - (4) until most problems are solved.</Paragraph> <Paragraph position="5"> We focused on two major causes of ambiguity: Lack of Syntactic Information: Some syntactic information which is important for syntactic parsing might be lost during the CFG derivation, since CFG rules generally represent only subtree structures of depth 1 (the relation between a parent node and its child nodes).</Paragraph> <Paragraph position="6"> Need for Semantic Information: Not only syntactic information but also semantic information is necessary for disambiguation in some cases.</Paragraph> <Paragraph position="7"> To avoid the first cause, we considered which syntactic information is necessary for syntactic parsing and added that information to each intermediate node in the structure. On the other hand, we decided that ambiguity due to the second cause is better left to the subsequent semantic processing, since it is difficult to reduce such ambiguity without recourse to semantic information during syntactic parsing. This can be achieved by representing the ambiguous cases as the same structure. We assume that syntactic analysis based on a large-scale CFG is followed by semantic analysis, and that ambiguity due to the second cause is resolved in the subsequent semantic processing.</Paragraph> <Paragraph position="8"> The main aspects of our policy are as follows: Verb Conjugation: Information about verb conjugation is added to each intermediate node related to the verb (cf. &quot;SPLIT-VP&quot; in (Klein and Manning, 2003) and &quot;Verb Form&quot; in (Schiehlen, 2004)).</Paragraph> <Paragraph position="9"> Compound Noun Structure: Structure ambiguity of compound nouns is represented as the same structure regardless of the meaning or word-formation, as Shirai et al. 
described in (Shirai et al., 1995).</Paragraph> <Paragraph position="10"> Adnominal and Adverbial Phrase Attachment: Structure ambiguity of adnominal phrase attachment is represented as the same structure regardless of the meaning, while structure ambiguity of adverbial phrase attachment is distinguished by meaning.</Paragraph> <Paragraph position="11"> In the case of a phrase like &quot;watashi no chichi no hon (my father's book)&quot;, the structure is the same whether the adnominal phrase &quot;watashi no (my)&quot; attaches to the noun &quot;chichi (father)&quot; or the noun &quot;hon (book)&quot;. On the other hand, in the case of a sentence like &quot;kare ga umi wo egaita e wo katta&quot;, we distinguish the structure according to whether the adverbial phrase &quot;kare ga (he)&quot; attaches to the verb &quot;egaita (paint)&quot; (it means &quot;I bought a picture of a sea painted by him&quot;) or the verb &quot;katta (buy)&quot; (it means &quot;he bought a picture of a sea&quot;).</Paragraph> <Paragraph position="12"> Conjunctive Structure: Conjunctive structure is not specified during syntactic parsing; instead, its analysis is left for the subsequent processing (contrary to (Kurohashi and Nagao, 1994)).</Paragraph> <Paragraph position="13"> We have decided to deal with adnominal phrase attachment and adverbial phrase attachment separately in our policy since we believe that a different algorithm should be used to disambiguate each of them. In the subsequent processing, we assume that adverbial phrase attachment would be disambiguated first by choosing one parse tree among the results, and adnominal phrase attachment would then be disambiguated by choosing one interpretation among all the interpretations which the parse tree represents (Figure 2).</Paragraph> <Paragraph position="14"> We used the EDR corpus (EDR, 1994) for developing our annotation policy, and annotated 8,911 sentences in the corpus and 20,190 sentences in the RWC corpus (Hashida et al., 1998). 
In the following evaluation, we used the latter one.</Paragraph> </Section> <Section position="5" start_page="10" end_page="12" type="metho"> <SectionTitle> 3 Experimental Setup </SectionTitle> <Paragraph position="0"> As mentioned previously, in general, analyzing dependency relations between bunsetsu is preferred in Japanese, which makes it difficult to compare the result obtained with the CFG against the result of dependency analysis. In order to compare with other dependency analyzers, we evaluated our derived CFG with respect to the dependency measures shown in Figure 3. Note that sentences which are not segmented into bunsetsu correctly are dropped from the evaluation data when we evaluate dependency accuracy and sentence accuracy. A CFG is derived from all sentences in our corpus, and with it we parsed 6,931 sentences (POS sequences) in the Kyoto corpus using the MSLR parser (Shirai et al., 2000). The Kyoto corpus has annotation in terms of dependency relations among bunsetsu, and it is usually used for evaluation of dependency analysis. The parser is trained according to the probabilistic generalized LR (PGLR) model (Inui et al., 2000) (all sentences are used for training), and parse results are ranked by the model.</Paragraph> <Paragraph position="1"> The experiment was carried out as follows (Figure 4): 1. Convert POS tags automatically to the RWC tag set.</Paragraph> <Paragraph position="2"> 2. Parse the POS sequence using a CFG derived from our corpus.</Paragraph> <Paragraph position="3"> 3. Rank the parse results by the PGLR model and pick the top-n parse results.</Paragraph> <Paragraph position="4"> 4. Extract dependency relations among bunsetsu for each result.</Paragraph> <Paragraph position="5"> 5. Choose the result which is closest to the gold-standard and evaluate it.</Paragraph> <Paragraph position="6"> Since the tag set of the Kyoto corpus is different from that of the RWC corpus, a POS conversion in step (1) is necessary. 
It is a rule-based conversion, and its accuracy is about 80%. Such low conversion accuracy is likely to damage the evaluation result. We will discuss this issue in the next section.</Paragraph> <Paragraph position="7"> In the 4th step of the experimental procedure, we determine the boundaries of bunsetsu and the dependency relations among the bunsetsu in a sentence using the CFG rules that appear in the phrase structure of the sentence. Some CFG rules in our CFG indicate positions of bunsetsu boundaries. For example, a CFG rule &quot;NP → AdnP NP&quot; (&quot;NP&quot; and &quot;AdnP&quot; stand for a noun phrase and an adnominal phrase respectively) indicates that there is a bunsetsu boundary between the two phrases on the right-hand side of the CFG rule (i.e. between the adnominal phrase and the noun phrase), and that a bunsetsu including the head word of the adnominal phrase depends on a bunsetsu including the head word of the noun phrase. An example for &quot;Nihon teien no nagame ga subarashii (The view of the Japanese garden is wonderful)&quot; is shown in Figure 5.</Paragraph> <Paragraph position="8"> Structure ambiguity of adnominal phrase attachment needs to be disambiguated when extracting dependency relations in step (4), since it is represented as the same structure according to our policy (footnote 2). We disambiguate adnominal phrase attachment based on one of the following assumptions: NEAREST: Every ambiguous adnominal phrase attaches to the nearest noun among the nouns which the phrase could attach to.</Paragraph> <Paragraph position="9"> BEST: Choose the best noun among the nouns which could be attached to (assume that disambiguation of adnominal phrase attachment [...]). [Footnote 2: ... Kyoto corpus, it is difficult to know how many relations representing adnominal phrase attachment are included in the evaluation data. 
On the other hand, among the top parse results ranked by the PGLR model (i.e. in the case of n = 1 in section 4), about 34.1% of all dependency relations represent adnominal phrase attachment, and about 23.4% of them (i.e. about 8.0% of all relations) remain ambiguous.]</Paragraph> <Paragraph position="10"> &quot;NEAREST&quot; is a quite simple method of disambiguation, and it serves as the baseline model. On the other hand, since we assume that structure ambiguity of adnominal phrase attachment is resolved in the subsequent semantic processing, &quot;BEST&quot; is the upper bound: its accuracy could not be exceeded even if the disambiguation were done perfectly in the subsequent processing.</Paragraph> <Paragraph position="11"> Take the two noun phrases &quot;watashi no chichi no hon (my father's book)&quot; and &quot;watashi no kagaku no hon (my book on science)&quot; as examples (the correct answer is that the adnominal phrase &quot;watashi no (my)&quot; attaches to the noun &quot;chichi (father)&quot; in the former case, and to the noun &quot;hon (book)&quot; in the latter case). &quot;NEAREST&quot; attaches the adnominal phrase &quot;watashi no&quot; to the nouns &quot;chichi&quot; and &quot;kagaku (science)&quot; respectively, regardless of their meanings. &quot;BEST&quot; attaches the adnominal phrase to the noun &quot;chichi&quot; in the former case, and to the noun &quot;hon&quot; in the latter case.</Paragraph> <Paragraph position="12"> [Footnote 3: ... the Kyoto corpus. If the noun which is attached to in the Kyoto corpus is not in the candidates, we choose the nearest noun (i.e. &quot;NEAREST&quot;).]</Paragraph> <Paragraph position="13"> Although structure ambiguity of compound nouns is also represented as the same structure regardless of the meaning or word-formation, we do not need to deal with this structure ambiguity since a bunsetsu is a larger unit than a compound noun. 
Furthermore, since dependency relations are not categorized, we do not have to care about whether two bunsetsu have a conjunctive relation with each other or not.</Paragraph> <Paragraph position="14"> In order to compare our result with that of other dependency analyzers, we used two well-known Japanese dependency analyzers, KNP and CaboCha, and analyzed the dependency structure of the sentences in the same evaluation data set. In both cases, POS-tagged sentences are used as the input. Since CaboCha uses the same tagset as the RWC corpus, we converted POS tags in the same way as step (1) in our experimental procedure. On the other hand, since KNP uses the tagset adopted by the Kyoto corpus, POS tags do not have to be converted for analysis by KNP.</Paragraph> </Section> </Paper>
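<!--
The NEAREST and BEST attachment choices and step 5 of the experimental procedure (choosing, among the top-n parses, the one closest to the gold standard) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each bunsetsu is identified by its index in the sentence, a parse is a list giving the head index of every bunsetsu except the last, and dependency accuracy and sentence accuracy are taken to be the usual ratios of correct relations and of fully correct sentences. The measures of Figure 3 are not reproduced in this text, so these definitions and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def nearest(candidates: Sequence[int]) -> int:
    """NEAREST: attach to the closest candidate noun.

    Candidates are bunsetsu indices to the right of the adnominal
    phrase, so the closest one is the smallest index (assumption).
    """
    return min(candidates)


def best(candidates: Sequence[int], gold_head: int) -> int:
    """BEST: use the gold-standard head if it is a candidate,
    otherwise fall back to NEAREST."""
    return gold_head if gold_head in candidates else nearest(candidates)


def count_correct(heads: Sequence[int], gold: Sequence[int]) -> int:
    """Number of bunsetsu whose predicted head matches the gold head."""
    return sum(1 for h, g in zip(heads, gold) if h == g)


def pick_closest(top_n: Sequence[Sequence[int]],
                 gold: Sequence[int]) -> Sequence[int]:
    """Step 5: among the top-n parses (given in rank order), return the
    one with the most correct relations; ties keep the higher-ranked parse."""
    return max(top_n, key=lambda heads: count_correct(heads, gold))


def evaluate(results: List[Sequence[int]],
             golds: List[Sequence[int]]) -> Tuple[float, float]:
    """Return (dependency accuracy, sentence accuracy) under the
    assumed definitions described in the lead-in."""
    total = sum(len(g) for g in golds)
    correct = sum(count_correct(r, g) for r, g in zip(results, golds))
    exact = sum(1 for r, g in zip(results, golds) if list(r) == list(g))
    return correct / total, exact / len(golds)
```

For "watashi no kagaku no hon" with bunsetsu indexed 0, 1, 2, the candidates for "watashi no" are [1, 2]; nearest([1, 2]) picks "kagaku no" (index 1), while best([1, 2], 2) picks "hon" (index 2), matching the behaviour described in the text.
-->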