<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2046">
  <Title>Japanese Idiom Recognition: Drawing a Line between Literal and Idiomatic Meanings</Title>
  <Section position="4" start_page="3" end_page="353" type="metho">
    <SectionTitle>
2 Two Challenges of Idiom Recognition
</SectionTitle>
    <Paragraph position="0"> Two factors make idiom recognition difficult: ambiguity between literal and idiomatic meanings, and the &quot;transformations&quot; that idioms can undergo. In fact, the mistranslation in (1) is caused by a failure to disambiguate between the two meanings. Some idioms have two or three idiomatic meanings, but we do not distinguish among them; we are concerned only with whether a phrase is used as an idiom or not. For a detailed discussion of what constitutes the notion of (Japanese) idiom, see Miyaji (1982), which details usages of commonly used Japanese idioms.</Paragraph>
    <Paragraph position="1">  The term &quot;transformation&quot; in this paper is unrelated to the Chomskyan term in Generative Grammar.</Paragraph>
    <Paragraph position="2">  &quot;Transformation&quot; also causes mistranslation. The sentences in (2) and (3a) contain the idiom yaku-ni tatu (part-DAT stand) &quot;serve the purpose.&quot; Sentence (3a) is mistranslated as in (3b), which does not make sense, though (2) is translated successfully. The only difference between (2) and (3a) is that in (3a) the bunsetu constituents of the idiom are detached from each other.</Paragraph>
  </Section>
  <Section position="5" start_page="353" end_page="356" type="metho">
    <SectionTitle>
3 Knowledge for Idiom Recognition
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="353" end_page="353" type="sub_section">
      <SectionTitle>
3.1 Classification of Japanese Idioms
</SectionTitle>
      <Paragraph position="0"> The lexical knowledge required to recognize an idiom depends on how difficult the idiom is to recognize. Thus, we first classify idioms based on recognition difficulty, which is determined by two factors: ambiguity and transformability.</Paragraph>
      <Paragraph position="1"> Consequently, we identify three classes (Figure 1).</Paragraph>
      <Paragraph position="2">  Class A is neither transformable nor ambiguous. Class B is transformable but not ambiguous. Class C is both transformable and ambiguous. Class A amounts to unambiguous single words, which are easy to recognize, while Class C is the most difficult to recognize. Only Class C needs further classification, since only Class C needs disambiguation, and the lexical knowledge for disambiguation depends on an idiom's part-of-speech (POS) and internal structure. The POS of a Class C idiom is either verbal or adjectival, as in Figure 1. Internal structure represents the constituent words' POS and the dependency between bunsetus.  In fact, the idiom has no literal interpretation.  A bunsetu is a syntactic unit in Japanese, consisting of one independent word and zero or more ancillary words. The sentence in (3a) consists of four bunsetu constituents.  The blank space at the upper left of the figure implies that there is no idiom that undergoes no transformation and yet is ambiguous. In fact, we have not come up with an example that would fill the blank space.  Anonymous reviewers pointed out that Class A and B could also be ambiguous. Indeed, one can devise a context that makes a literal interpretation of those classes possible. However, virtually no phrase of Class A or B is interpreted literally in real texts, and we think our generalization safely captures the reality of idioms.</Paragraph>
    </Section>
    <Section position="2" start_page="353" end_page="353" type="sub_section">
      <SectionTitle>
Recognition Difficulty
</SectionTitle>
      <Paragraph position="0"> The internal structure of hone-o oru (bone-ACC break), for instance, is &quot;(Noun/Particle Verb),&quot; abbreviated as &quot;(N/P V).&quot; Let us now give a full account of the further classification of Class C. We exploit grammatical differences between literal and idiomatic usages for disambiguation, and we call the knowledge of these differences the disambiguation knowledge.</Paragraph>
      <Paragraph position="1"> For instance, the phrase hone-o oru does not allow passivization when used as an idiom, though it does when used literally. Thus, (4), in which the phrase is passivized, cannot be an idiom.</Paragraph>
      <Paragraph position="2">  In this case, passivizability can be used as disambiguation knowledge. The detachability of the two bunsetu constituents can also serve to disambiguate the idiom, since in the idiomatic usage they cannot be separated. In general, usages applicable to idioms are also applicable to literal phrases, but the reverse is not always true (Figure 2).</Paragraph>
    </Section>
    <Section position="3" start_page="353" end_page="354" type="sub_section">
      <SectionTitle>
Usages Applicable to Both
Idioms and Literal Phrases
</SectionTitle>
      <Paragraph position="0"> Thus, finding the disambiguation knowledge amounts to finding usages applicable only to literal phrases.</Paragraph>
      <Paragraph position="1"> Naturally, the disambiguation knowledge for an idiom depends on its POS and internal structure.  As for POS, verbal idioms can be disambiguated by the knowledge of passivizability, while adjectival idioms cannot. Regarding internal structure, detachability should be annotated at every boundary between bunsetus. Thus, the number of detachability annotations depends on the number of bunsetus in an idiom.</Paragraph>
      <Paragraph position="2"> There is no need for further classification of Class A and B, since the lexical knowledge for them is invariable. Section 3.2 discusses this invariability. In summary, Japanese idioms are classified as in Figure 3. The whole picture of the subclasses of Class C remains to be seen.</Paragraph>
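      <Paragraph> As a minimal sketch of the classification above, the two factors can be treated as boolean flags on an idiom entry. The Idiom class and the idiom_class function are hypothetical names introduced only for illustration, and the flag values in the usage lines are assumptions read off the surrounding discussion rather than dictionary data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idiom:
    surface: str         # romanized citation form, e.g. "hone-o oru"
    transformable: bool  # can the idiom undergo transformations?
    ambiguous: bool      # does a literal reading also occur?


def idiom_class(idiom):
    """Map the two recognition-difficulty factors to Class A, B, or C."""
    if not idiom.transformable and not idiom.ambiguous:
        return "A"
    if idiom.transformable and not idiom.ambiguous:
        return "B"
    if idiom.transformable and idiom.ambiguous:
        return "C"
    # Untransformable yet ambiguous: the blank cell of Figure 1.
    return None


# hone-o oru is discussed in the text as an idiom that needs disambiguation.
print(idiom_class(Idiom("hone-o oru", transformable=True, ambiguous=True)))
```

The None branch mirrors the blank space of Figure 1, for which no example is known.</Paragraph>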
    </Section>
    <Section position="4" start_page="354" end_page="354" type="sub_section">
      <SectionTitle>
3.2 Knowledge for Each Class
</SectionTitle>
      <Paragraph position="0"> What lexical knowledge is needed for each class? Class A needs only string information; idioms of this class amount to unambiguous single words.</Paragraph>
      <Paragraph position="1"> String information is undoubtedly invariable across all kinds of POS and internal structure.</Paragraph>
      <Paragraph position="2"> Class B requires not only a string but also knowledge that normalizes the transformations idioms can undergo, such as passivization and detachment of bunsetus. We identify three types of transformations that are relevant to idioms: 1) Detachment of Bunsetu Constituents, 2) Predicate's Change, and 3) Particle's Change. Predicate's change includes inflection, attachment of a negative morpheme, a passive morpheme, or modal verbs, and so on. Particle's change represents attachment of topic or restrictive particles. (5b) is an example of predicate's change from (5a), in which a negative morpheme is added to the verb. (5c) is an example of particle's change from (5a), in which a topic particle is added to the preexisting particle of the idiom.</Paragraph>
      <Paragraph position="3">  tatu. stand &quot;He serves the purpose.&quot;</Paragraph>
      <Paragraph position="4"> To normalize the transformations, we utilize the dependency relations between constituent words, which we call the dependency knowledge. This amounts to checking the presence of all the constituent words of an idiom. Note that, among the constituent words, we ignore the endings of a predicate and the case particles ga (NOM) and o (ACC), since they may change form or disappear.</Paragraph>
      <Paragraph position="5"> The dependency knowledge is also invariable across all kinds of POS and internal structure.</Paragraph>
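      <Paragraph> The presence check described above can be sketched as follows. This is a simplified illustration, not the actual matcher: it assumes that a morphological analyzer has already reduced each word to its lemma (so that predicate endings no longer matter), it ignores the dependency structure itself, and the names IGNORED_PARTICLES, required_constituents, and has_all_constituents as well as the romanized lemma lists are hypothetical.

```python
# Case particles that are ignored during matching, as stated in the text.
IGNORED_PARTICLES = {"ga", "o"}


def required_constituents(idiom_lemmas):
    """Drop the case particles ga and o from an idiom's constituent lemmas."""
    return [w for w in idiom_lemmas if w not in IGNORED_PARTICLES]


def has_all_constituents(idiom_lemmas, sentence_lemmas):
    """True if every required constituent lemma occurs in the sentence."""
    present = set(sentence_lemmas)
    return all(w in present for w in required_constituents(idiom_lemmas))


yaku_ni_tatu = ["yaku", "ni", "tatu"]

# Plain use, a negated predicate, and an added topic particle all match,
# because the extra morphemes do not remove any required lemma.
print(has_all_constituents(yaku_ni_tatu, ["kare", "wa", "yaku", "ni", "tatu"]))
print(has_all_constituents(yaku_ni_tatu, ["yaku", "ni", "tatu", "nai"]))
print(has_all_constituents(yaku_ni_tatu, ["yaku", "ni", "wa", "tatu"]))
```

Working on lemmas is what lets one entry cover inflected, negated, and particle-extended variants without listing each surface string.</Paragraph>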
      <Paragraph position="6"> Class C requires the disambiguation knowledge, as well as all the knowledge for Class B. As a result, all the requisite knowledge for idiom recognition is summarized as in Table 1.</Paragraph>
      <Paragraph position="7">  As discussed in Section 3.1, the disambiguation knowledge for an idiom depends on which subclass it belongs to. A comprehensive idiom recognizer calls for the disambiguation knowledge for all the subclasses, but we have not yet identified all of it. We therefore decided to blaze a trail by investigating the disambiguation knowledge for the most commonly used idioms.</Paragraph>
    </Section>
    <Section position="5" start_page="354" end_page="355" type="sub_section">
      <SectionTitle>
3.3 Disambiguation Knowledge for the
Verbal (N/P V) Idioms
</SectionTitle>
      <Paragraph position="0"> What type of idiom is used most commonly? The answer is the verbal (N/P V) type, like hone-o oru (bone-ACC break); it is the most abundant in terms of both type and token. Specifically, 1,834 out of 4,581 idioms (approximately 40%) in Kindaichi and Ikeda (1989), a Japanese dictionary with more than 100,000 words, are of this type.</Paragraph>
      <Paragraph position="1">  Also, 167,268 out of 220,684 idiom tokens (approximately 76%) in ten years of the Mainichi newspaper ('91-'00) are of this type. We now discuss what can be used to disambiguate the verbal (N/P V) type. First, we examined the linguistics literature (Miyaji, 1982; Morita, 1985; Ishida, 2000) that observes characteristics of Japanese idioms. Then, among those characteristics, we picked the ones that could help disambiguate this type. (6) summarizes them.  Counting was performed automatically by means of the morphological analyzer ChaSen (Matsumoto et al., 2000) with no human intervention. Note that Kindaichi and Ikeda (1989) contains 4,802 idioms, but 221 of them were ignored since they contained words unknown to ChaSen.  We counted idiom tokens by string matching with inflection taken into account, and we referred to Kindaichi and Ikeda (1989) for a comprehensive idiom list. Note that counting was performed fully automatically.</Paragraph>
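      <Paragraph> The proportions quoted above can be checked with a few lines of arithmetic. The variable names are illustrative only; the figures are the ones given in the text.

```python
# Idioms in the dictionary that ChaSen could analyze: 4,802 minus 221 ignored.
analyzable_idioms = 4802 - 221

# Share of verbal (N/P V) idioms among idiom types in the dictionary.
type_share = 1834 / analyzable_idioms

# Share of verbal (N/P V) idioms among idiom tokens in the newspaper.
token_share = 167268 / 220684

print(analyzable_idioms, round(type_share * 100), round(token_share * 100))
```

The first value reproduces the 4,581 idioms used as the denominator for the type count.</Paragraph>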
      <Paragraph position="2">  That is, the Genitive Phrase Prohibition, (6aII), is in effect for the idiom. Likewise, the idiom does not allow its case particle o (ACC) to be substituted with restrictive particles such as dake (only). This means the Restrictive Particle Constraint, (6b), is also in effect. Also, (4) shows that the Passivization Prohibition, (6cI), is in effect, too. Note that the constraints in (6) are not always in effect for an idiom. For instance, the Causativization Prohibition, (6cII), does not hold for the idiom hone-o oru. In fact, (9a) can be interpreted both literally and idiomatically.  &quot;Volitional Modality&quot; represents verbal expressions of order, request, permission, prohibition, and volition.</Paragraph>
      <Paragraph position="3"> (9) a. kare-ni</Paragraph>
      <Paragraph position="5"> b. &quot;(Someone) makes him break a bone.&quot; c. &quot;(Someone) makes him make an effort.&quot;</Paragraph>
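      <Paragraph> The way such constraints separate the two readings can be sketched as a lookup of usages that block the idiomatic reading. This is an illustration under stated assumptions: the feature labels (passivized, genitive_modifier, and so on), the table PROHIBITED_FOR_IDIOM, and the function may_be_idiomatic are hypothetical, and the sketch assumes that some earlier analysis step has already detected which usages occur in the sentence.

```python
# Usages that the text reports as blocking the idiomatic reading of
# hone-o oru. Causativization is deliberately absent, because the
# Causativization Prohibition does not hold for this idiom.
PROHIBITED_FOR_IDIOM = {
    "hone-o oru": {
        "passivized",            # Passivization Prohibition (6cI)
        "genitive_modifier",     # Genitive Phrase Prohibition (6aII)
        "restrictive_particle",  # Restrictive Particle Constraint (6b)
        "detached",              # the two bunsetus cannot be separated
    },
}


def may_be_idiomatic(idiom, observed_usages):
    """False if any observed usage is available only to the literal reading."""
    blocked = PROHIBITED_FOR_IDIOM.get(idiom, set())
    return blocked.isdisjoint(observed_usages)


print(may_be_idiomatic("hone-o oru", {"passivized"}))
print(may_be_idiomatic("hone-o oru", {"causativized"}))
```

A True result only means that the idiomatic reading is not ruled out; as with (9a), the phrase may still be readable literally.</Paragraph>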
    </Section>
    <Section position="6" start_page="355" end_page="356" type="sub_section">
      <SectionTitle>
3.4 Implementation
</SectionTitle>
      <Paragraph position="0"> We implemented an idiom dictionary based on the outcome above and a recognizer that exploits the dictionary. This section illustrates how they work, and we focus on Class B and C hereafter.</Paragraph>
      <Paragraph position="1"> The idiom recognizer looks up dependency patterns in the dictionary that match a part of the dependency structure of a sentence (Figure 4). A dependency pattern is equipped with all the requisite knowledge for idiom recognition. A rough sketch of the recognition algorithm is as follows:  1. Analyze the morphology and dependency structures of an input sentence.</Paragraph>
      <Paragraph position="2"> 2. Look up dependency patterns in the dictionary that match a part of the dependency structure of the input sentence.</Paragraph>
      <Paragraph position="3"> 3. Mark constituents of an idiom in the sentence if any.</Paragraph>
      <Paragraph position="4">  Constituents that are marked are constituent words and bunsetu constituents that include one of those constituent words.  As a constituent marker, we use an ID that is assigned to each idiom in the dictionary.</Paragraph>
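      <Paragraph> The three steps above can be sketched as follows, assuming that step 1 has already been carried out by external analyzers and that its output is available as a list of bunsetus with head indices. The Bunsetu and Pattern classes, the function mark_idioms, the idiom ID I001, and the romanized lemmas are all hypothetical; the actual system matches TGrep2 patterns, which this sketch does not reproduce, and it marks bunsetus only rather than individual words.

```python
from dataclasses import dataclass, field


@dataclass
class Bunsetu:
    lemmas: list   # lemmas of the words in this bunsetu
    head: int      # index of the bunsetu it depends on, -1 for the root
    idiom_ids: set = field(default_factory=set)


@dataclass
class Pattern:
    idiom_id: str     # ID assigned to the idiom in the dictionary
    dependent: tuple  # lemmas required in the dependent bunsetu
    governor: str     # lemma required in the governing bunsetu


def mark_idioms(bunsetus, patterns):
    """Steps 2 and 3: match patterns on dependency edges and mark matches."""
    for b in bunsetus:
        if b.head == -1:
            continue
        g = bunsetus[b.head]
        for p in patterns:
            dependent_ok = all(w in b.lemmas for w in p.dependent)
            if dependent_ok and p.governor in g.lemmas:
                b.idiom_ids.add(p.idiom_id)
                g.idiom_ids.add(p.idiom_id)
    return bunsetus


# kare-wa yaku-ni tatu: both non-root bunsetus depend on the predicate.
sentence = [
    Bunsetu(["kare", "wa"], head=2),
    Bunsetu(["yaku", "ni"], head=2),
    Bunsetu(["tatu"], head=-1),
]
mark_idioms(sentence, [Pattern("I001", ("yaku", "ni"), "tatu")])
print([sorted(b.idiom_ids) for b in sentence])
```

Because matching follows dependency edges rather than adjacency, the pattern still matches when other bunsetus intervene between the idiom's constituents.</Paragraph>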
      <Paragraph position="5">  As in Figure 5, we use ChaSen as a morphological analyzer and CaboCha (Kudo and Matsumoto, 2002) as a dependency analyzer. Dependency matching is performed by TGrep2 (Rohde, 2005), which finds syntactic patterns in a sentence or treebank. A dependency pattern tends to be complicated because it must conform to the specification of TGrep2. Thus, we developed the Dependency Pattern Generator, which compiles the pattern database from a human-readable idiom dictionary. The only difference in the treatment of Class B and Class C lies in their dependency patterns. The dependency pattern of Class B consists of only its dependency knowledge, while that of Class C consists of not only its dependency knowledge but also its disambiguation knowledge (Figure 6).</Paragraph>
      <Paragraph position="6"> The idiom dictionary consists of 100 idioms, all of which are verbal (N/P V) and belong to either Class B or C. Among the knowledge in (6), the Selectional Restriction has not been implemented yet. The 100 idioms are those that are used most frequently. To be precise, 50 idioms in Kindaichi and Ikeda (1989) and 50 in Miyaji (1982) were extracted by the following steps:  We counted idiom tokens by string matching with inflection taken into account. Note that counting was performed automatically without human intervention.</Paragraph>
      <Paragraph position="7"> 1. From Miyaji (1982), the 50 idioms that were used most frequently in ten years of the Mainichi newspaper ('91-'00) were extracted.</Paragraph>
      <Paragraph position="8"> 2. From Kindaichi and Ikeda (1989), 50 idioms that were used most frequently in the newspaper of 10 years but were not included in the 50 idioms from Miyaji (1982) were extracted.</Paragraph>
      <Paragraph position="9"> As a result, 66 out of the 100 idioms were Class B, and the other 34 idioms were Class C.</Paragraph>
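      <Paragraph> The two extraction steps can be sketched as a pair of frequency-ranked selections, the second excluding what the first already chose. The function select_idioms, its parameters, and the toy data are hypothetical; the frequency table stands in for the newspaper token counts, which are not reproduced here.

```python
def select_idioms(miyaji, kindaichi, freq, n=50):
    """Pick the n most frequent idioms from each source without overlap."""

    def by_freq(idiom):
        return freq.get(idiom, 0)

    # Step 1: the n most frequent idioms listed in the first source.
    first = sorted(miyaji, key=by_freq, reverse=True)[:n]
    chosen = set(first)
    # Step 2: the n most frequent idioms in the second source that were
    # not already selected in step 1.
    rest = [i for i in kindaichi if i not in chosen]
    second = sorted(rest, key=by_freq, reverse=True)[:n]
    return first, second


# Toy data with n=2 instead of 50.
toy_freq = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
print(select_idioms(["a", "c", "e"], ["a", "b", "d", "e"], toy_freq, n=2))
```

With n=50 and the two idiom lists as input, the two returned lists together give the 100 dictionary entries.</Paragraph>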
    </Section>
  </Section>
</Paper>