<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2125">
  <Title>Identifying Syntactic Role of Antecedent in Korean Relative Clause Using Corpus and Thesaurus Information</Title>
  <Section position="4" start_page="757" end_page="759" type="metho">
    <SectionTitle>
3 Extraction of Statistical Information
from Corpus
</SectionTitle>
    <Paragraph position="0"> First, for each of the 100 verbs selected by order of frequency in the KLIB (Korean Language Information Base) corpus of 6 million words, its syntactic relational patterns (SRPs) of the form (Noun, Syntactic relation, Verb) are extracted from the corpus. Then, the nominal words in the SRPs are substituted with their corresponding concept codes at level 4 of the Kadokawa thesaurus. A nominal word may have multiple meanings C1, C2, ..., Cn. Since we cannot determine which meaning of the nominal word is used in an SRP, we uniformly add 1/n to the frequency of each concept code. Through this processing, the syntactic relational pattern (SRP) changes into the conceptual frequency pattern (CFP), ({&lt;C1, f1&gt;, &lt;C2, f2&gt;, ..., &lt;Cm, fm&gt;}, SRj, Vk), where Ci represents a concept code at level four of the Kadokawa thesaurus, fi indicates the frequency of the code Ci, and SRj is a syntactic relation between these concept codes and verb Vk.</Paragraph>
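The CFP construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the mini-lexicon, and the sample nouns and codes are all hypothetical, chosen only to show the 1/n fractional counting over a noun's candidate senses.

```python
from collections import defaultdict

def build_cfp(srps, sense_codes):
    """Aggregate (noun, relation, verb) SRPs into conceptual frequency
    patterns: a noun with n candidate level-4 Kadokawa codes contributes
    a fractional count of 1/n to each of its codes."""
    cfp = defaultdict(lambda: defaultdict(float))  # (rel, verb) -> {code: freq}
    for noun, rel, verb in srps:
        codes = sense_codes.get(noun, [])
        if not codes:
            continue  # noun missing from the lexicon: no evidence added
        share = 1.0 / len(codes)
        for code in codes:
            cfp[(rel, verb)][code] += share
    return cfp

# Hypothetical mini-lexicon: noun -> level-4 concept codes (invented values)
senses = {"haksayng": ["505"], "pay": ["972", "133"]}
srps = [("haksayng", "subj", "ttena-ta"), ("pay", "obj", "ttena-ta")]
cfp = build_cfp(srps, senses)
```

With the two-sense noun "pay", each of its codes receives 0.5, while the unambiguous "haksayng" contributes a full count of 1.0 to its single code.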
    <Paragraph position="1"> These patterns are then generalized by a concept type filter into more abstract conceptual patterns (CPs), {({C1, C2, ..., Cn}, SRj, Vk) | 1 &lt;= j &lt;= 5, 1 &lt;= k &lt;= 100}. Unlike in CFPs, a concept code in the more general CPs may be not only at level four (denoted L4) but also at level three (L3) or two (L2). In addition to the CPs, we also extract the syntactic role distribution of antecedents.</Paragraph>
    <Section position="1" start_page="757" end_page="757" type="sub_section">
      <SectionTitle>
3.1 Retrieving Syntactic Relational Patterns
</SectionTitle>
      <Paragraph position="0"> Unlike the conventional parsing problem whose main goal is to completely analyze a whole sentence, the extraction of syntactic relational patterns (SRPs) aims to partially analyze sentences and thus to get the syntactic relations between nominals and verbs. For this, we designed a partial parser, the analysis result of which is obviously not as precise as that of a full-parser.</Paragraph>
      <Paragraph position="1"> However, it can provide much useful information. For the set of 100 verbs, a total of 282,216 syntactic relational patterns (SRPs) were extracted from the KLIB corpus. During the generalization step, problematic patterns are filtered out.</Paragraph>
      <Paragraph position="2"> In Korean, the syntactic relation of nominal words toward a verb is mainly determined by case particles. During the extraction of SRPs (Ni, SRj, Vk), we only consider the syntactic relations SRj determined by five types of case particles: nominative (-i/-ka/-kkeyse), accusative (-ul/-lul), and three adverbial (-ey/-eynun, -eyse/-eysenun, -lo/-ulo/-ulonun).</Paragraph>
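The particle-to-relation mapping described above can be written as a simple lookup table. This is an illustrative sketch only: the romanized particle forms follow the listing above, and the table is not an exhaustive treatment of Korean case marking.

```python
# Hedged sketch: map romanized case particles to the five syntactic
# relations used in SRP extraction (adv1/adv2/adv3 follow the -ey,
# -eyse, -lo grouping described in the text).
PARTICLE_RELATION = {
    "i": "subj", "ka": "subj", "kkeyse": "subj",
    "ul": "obj", "lul": "obj",
    "ey": "adv1", "eynun": "adv1",
    "eyse": "adv2", "eysenun": "adv2",
    "lo": "adv3", "ulo": "adv3", "ulonun": "adv3",
}

def relation_of(particle):
    """Return the syntactic relation signalled by a case particle,
    or None for particles outside the five considered types."""
    return PARTICLE_RELATION.get(particle.lstrip("-"))
```

A nominal marked with -lul would thus be recorded as the object of its governing verb, while particles outside these five types are simply skipped during SRP extraction.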
    </Section>
    <Section position="2" start_page="757" end_page="759" type="sub_section">
      <SectionTitle>
3.2 Conceptual Pattern Extraction
</SectionTitle>
      <Paragraph position="0"> For the purpose of type generalization of nominal words in SRPs, the Kadokawa thesaurus titled New Synonym Dictionary (Ohno and Hamanishi, 1981) is used, which has a four-level hierarchy with about 1,000 semantic classes.</Paragraph>
      <Paragraph position="1"> Each class at the upper three levels is further divided into 10 subclasses, and is encoded with a unique number. For example, the class 'stationery' at level three is encoded with the number 96 and classified into ten subclasses. Figure 3 shows the structure of the Kadokawa thesaurus.</Paragraph>
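Because each level splits a class into ten numbered subclasses, a level-4 code is a digit string whose prefixes name its ancestors. Under that assumption (a sketch of the decimal hierarchy described above, not the thesaurus's own API), generalizing a code to a higher level is just prefix truncation:

```python
def generalize(code, level):
    """Truncate a Kadokawa concept code to the given level of the
    decimal hierarchy: level-4 codes have three digits, level-3 codes
    two, and level-2 codes one, so each prefix names an ancestor."""
    if not 2 <= level <= 4:
        raise ValueError("levels 2-4 carry digit codes in this sketch")
    return code[: level - 1]
```

For example, the level-4 code 964 sits under the level-3 class 96 ('stationery' in the example above), which in turn sits under the level-2 class 9.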
      <Paragraph position="2"> To assign the concept code of Kadokawa thesaurus to Korean words, we take advantage of the existing Japanese-Korean bilingual dictionary (JKBD) that was developed for a Japanese-Korean MT system called COBALT-J/K. The bilingual dictionary contains more than 120,000 words, the meaning of which is encoded with the concept codes that are at level four in the Kadokawa thesaurus. Thus, Korean words in the SRPs are automatically assigned their corresponding concept codes of level four through JKBD.</Paragraph>
      <Paragraph position="3">  We encoded the nouns in SRPs extracted by the parser with concept codes from the Kadokawa thesaurus, and examined histograms of the frequency of concept codes. We observed that the frequencies of codes for different syntactic relations of a verb showed very different distribution shapes. This means that we can use the distribution of concept codes, together with their frequencies, as clues for conceptual pattern extraction.</Paragraph>
      <Paragraph position="5"> From the histograms of codes of both subject and object relational patterns for the verb ttena-ta (leave), we observed that concept codes for human (codes 500 to 599) appear most frequently in the subject role, while codes for position (100 to 109), place (700 to 709), and building (940 to 949) appear most often in the object role.</Paragraph>
      <Paragraph position="6"> For each verb Vk, we first analyzed the co-occurrence frequencies fi of the concept codes Ci of nouns N, and then computed an average frequency f_ave,l and standard deviation sigma_l around f_ave,l at level l (denoted as Ll) of the concept hierarchy. We then replaced fi with its associated z-score k_f,l. k_f,l is the strength of the code frequency f at Ll, and represents the number of standard deviations by which f exceeds the average frequency f_ave,l. Following Smadja's definition (Smadja, 1993), the standard deviation sigma_l at Ll and the strength k_f,l of the code frequencies are defined as shown in formulas 1 and 2.</Paragraph>
      <Paragraph position="8"> sigma_l = sqrt( (1/n_l) * SUM_i (f_i,l - f_ave,l)^2 )   (1)
k_f,l = (f_i,l - f_ave,l) / sigma_l   (2)
where f_i,l is the frequency of concept code Ci at Ll of the Kadokawa thesaurus, f_ave,l is the average frequency of codes at Ll, and n_l is the number of concept codes at Ll.</Paragraph>
      <Paragraph position="9">  The standard deviation sigma_l at Ll characterizes the shape of the distribution of code frequencies. If sigma_l is small, the histogram tends to be flat, which means that every concept code is roughly equally likely as an argument of a verb with syntactic role SRi. If sigma_l is large, one or more codes form peaks in the histogram, and the nouns carrying these concept codes are likely to be used as arguments of the verb. The filter in our system selects the patterns whose deviation is larger than a threshold sigma_0,l, and pulls out the concept codes whose strength of frequency is larger than a threshold k_0,l. If the deviation is small, we can assume there is no peak frequency among the nouns. The patterns produced by the filter represent the concept types of extracted words that appear most frequently in syntactic role SRi with verb Vk.</Paragraph>
      <Paragraph position="10"> We then analyzed the distribution of the frequencies f_i in the CFPs to produce an average frequency f_ave,l and standard deviation sigma_l. Through experimentation, we decided the threshold of standard deviation sigma_0,l and strength of frequency k_0,l as shown in Table 1. The lower the threshold k_0,l, the more concept codes are extracted as conceptual patterns from the CFPs. We maintained a balance between extracting concept codes at low levels of the conceptual hierarchy, for specific usages of a concept type, and extracting general concept types, for enhancing overall system performance. These values may vary in different applications.</Paragraph>
      <Paragraph position="11"> In Table 2, we list the concept types that have more than 5 appearances in the CFP of the verb ttena-ta (leave). The strength of frequencies for generalization is calculated with formula 2.</Paragraph>
      <Paragraph position="13"> (Table 2: concept codes with their frequencies.) </Paragraph>
      <Paragraph position="15"/>
      <Paragraph position="17"> Since the value of k_0,4 is set at 4.0, as shown in Table 1, the concept codes with frequencies of more than 13, as the equation for k_14,4 shows, are selected as generalized concept types at L4.</Paragraph>
      <Paragraph position="18"> After abstraction at L4, the system performs generalization at L3. It removes the already-selected frequencies, such as the frequency 14 of code 411 in Table 2, and sums up the frequencies of the remaining concept codes to form the frequency of the higher-level group. For example, the system removes the frequency for code 411 from the group {410(12), 411(14), 412(3), 413(0), 414(0), 415(0), 416(1), 417(0), 418(0), 419(0)}, then sums up the frequencies of the remaining codes into the more abstract code 41. The frequency of code 41 then becomes 16. Through this process, the system performs a generalization at L3 for more abstract concept types. The system calculates sigma_l and strength k_f,l, selects the most promising codes, and stores conceptual patterns ({C1, C2, C3, ...}, SRj, Vk) as the knowledge source for syntactic role determination in real texts, where each concept type Ci is created by the generalization procedure. After generalization of the CFP patterns for the subject role of the verb ttena-ta (leave), the produced conceptual pattern is: ({411, 430, 500, ..., 06, 11, ..., 99, 1}, subj, ttena-ta).</Paragraph>
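The L4-to-L3 roll-up in the worked example above can be sketched in a few lines (the function name is hypothetical; the frequencies are taken verbatim from the 41x group in the text):

```python
def roll_up(group_freqs, selected):
    """Sum the frequencies of an L4 code group into its parent L3 code,
    after dropping the codes already selected at L4 (their evidence has
    been consumed by the L4 abstraction step)."""
    return sum(f for code, f in group_freqs.items() if code not in selected)

# The 41x group from the text; code 411 (freq 14) was already selected at L4.
group = {"410": 12, "411": 14, "412": 3, "413": 0, "414": 0,
         "415": 0, "416": 1, "417": 0, "418": 0, "419": 0}
freq_41 = roll_up(group, selected={"411"})
```

The remaining frequencies 12 + 3 + 1 sum to 16, reproducing the frequency the text assigns to the more abstract code 41.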
    </Section>
    <Section position="3" start_page="759" end_page="759" type="sub_section">
      <SectionTitle>
3.3 Syntactic Role Distribution of
Antecedents
</SectionTitle>
      <Paragraph position="0"> Yang et al. (1993) defined the subcategorization score (SS) of a verb based on the verb argument structures in a corpus. They asserted that the SS of a verb represents how likely the verb is to take a specific grammatical complement.</Paragraph>
      <Paragraph position="1"> From analyzing the corpus, we observed that we cannot infer the syntactic roles of antecedents from subcategorization scores, since the syntactic role distribution of verb arguments in a corpus differs greatly from the syntactic role distribution of antecedents, owing to Korean's free word order. In Korean, an argument of a verb can be omitted, so the subcategorization score does not indicate the likely role of the antecedent in many cases. For example, 26.8% of the arguments of the verb ttena-ta (leave) are used as subjects and 54.4% as objects, but 74.41% of the antecedents of the verb have the subject role and only 6.9% the object role.</Paragraph>
      <Paragraph position="2"> Although the distribution of antecedents is necessary for our task, we cannot automatically retrieve their syntactic role distribution from the corpus. We extracted relative clauses for the specific verbs from the corpus, and then the syntactic roles of the antecedents were counted manually by trained annotators. Since there are about 200 to 500 relative clauses for each verb in the corpus, it is feasible to collect this information. It is represented by the relative score RSk(SRi) of syntactic role SRi for antecedents of verb Vk, as shown below, and is used in syntactic role determination as described in section 4:
RSk(SRi) = freqk(SRi) / freq(Vk)   (3)
where freq(Vk) is the frequency of relative clauses containing verb Vk, and freqk(SRi) is the frequency of syntactic role SRi among antecedents of relative clauses containing verb Vk in the corpus.</Paragraph>
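Formula 3 is a simple ratio, sketched below with hypothetical counts (the function name and the sample numbers are invented; they are merely sized to reproduce the 74.4% subject-role figure reported for ttena-ta):

```python
def relative_score(freq_sr, freq_verb):
    """Formula 3: RS_k(SR_i) = freq_k(SR_i) / freq(V_k), the share of
    relative clauses of verb V_k whose antecedent fills role SR_i."""
    if freq_verb == 0:
        raise ValueError("verb must occur in at least one relative clause")
    return freq_sr / freq_verb

# Hypothetical counts: 500 relative clauses of ttena-ta, 372 with a
# subject-role antecedent.
rs_subj = relative_score(372, 500)
```

With these counts, rs_subj is 0.744, matching the roughly 74.4% subject-role share observed for the verb in the corpus study.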
    </Section>
  </Section>
  <Section position="5" start_page="759" end_page="760" type="metho">
    <SectionTitle>
4 Identifying Deep Syntactic
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="759" end_page="760" type="sub_section">
      <SectionTitle>
Relation
</SectionTitle>
      <Paragraph position="0"> When determining the syntactic relation of the antecedent of a relative clause, the system first checks the argument structure of the verb in the relative clause, and then records the empty (or omitted) arguments of the verb by referring to the verb's valency information. The antecedent that the verb phrase modifies can be one of these empty arguments. An antecedent (a noun) usually has one or more meanings, which causes ambiguity in determining the correct syntactic relation between the antecedent and the verb.</Paragraph>
      <Paragraph position="1"> We assume that an antecedent has meanings C1, C2, C3, ..., Cn, and that CPi is a conceptual pattern ({P1, P2, ..., Pm}, SRi, Vk) corresponding to syntactic relation SRi of verb Vk. The evaluation score SIMi(Np, Vk) of an antecedent Np as a candidate for syntactic role SRi with verb Vk is defined by formula 4, and the conceptual similarity Csim(Cw, Pj) between concepts Cw and Pj by formula 5.</Paragraph>
      <Paragraph position="3"> where MSCA(Cw, Pj) in Csim(Cw, Pj) represents the most specific common ancestor (MSCA) of concepts Cw and Pj in the Kadokawa concept hierarchy. Level(Cw) refers to the depth of concept Cw from the root node of the concept hierarchy. Is_a Penalty is a weight factor reflecting that Cw being a descendant of Pj is preferable to other cases. Conceptual similarity computation with formula 5 is illustrated in the figure. Based on these definitions, the syntactic relation SRj between antecedent Np and verb Vk is determined as follows:  1. Let R = {SRi | SRi is a syntactic relation of an empty (or omitted) argument in the relative clause of Vk, 1 &lt;= i &lt;= 5}.</Paragraph>
      <Paragraph position="4"> 2. For each conceptual pattern CPi of verb Vk such that SRi is in R, and for each concept code Pj in CPi, compute SIMi(Np, Vk).</Paragraph>
      <Paragraph position="5"> 3. Determine the syntactic relation of antecedent Np as SRj on the condition that SIMj(Np, Vk) has the largest value in {SIMi(Np, Vk) | 1 &lt;= i &lt;= 5} and SRj is in R.</Paragraph>
      <Paragraph position="6"> If two or more SIMi(Np, Vk) have the same value, decide the syntactic role by referring to the higher relative score RSk(SRi) of the syntactic roles for the verb Vk.</Paragraph>
      <Paragraph position="7"> Here, the syntactic relation can be one of subj, obj, adv1, adv2, and adv3. The symbols adv1, adv2, and adv3 represent adverbials with case particles -ey, -eyse, and -lo, respectively.</Paragraph>
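Steps 1-3 above, together with the RS tie-break, can be sketched as a small selection function. This is an illustration under stated assumptions: the function name is hypothetical, and the SIM and RS values below are invented scores, not outputs of formulas 4 and 5.

```python
def choose_role(sim_scores, rs_scores, candidates):
    """Among the relations of empty arguments (the set R of step 1),
    pick the role with the highest SIM score (step 3); ties are broken
    by the higher relative score RS_k of the role."""
    if not candidates:
        raise ValueError("the relative clause must have an empty argument")
    return max(candidates, key=lambda sr: (sim_scores[sr], rs_scores[sr]))

# Invented scores: subj and obj tie on SIM, so RS decides.
sims = {"subj": 0.8, "obj": 0.8, "adv1": 0.3}
rs = {"subj": 0.744, "obj": 0.069, "adv1": 0.05}
role = choose_role(sims, rs, ["subj", "obj", "adv1"])
```

In this toy case subj and obj have equal SIM scores, and the far higher relative score of the subject role for the verb settles the tie in its favour.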
    </Section>
  </Section>
  <Section position="6" start_page="760" end_page="761" type="metho">
    <SectionTitle>
5 Experimental Evaluation
</SectionTitle>
    <Paragraph position="0"> An informal way to evaluate the correctness of syntactic relation determination is to have an expert examine the test patterns and the source sentences in which the patterns appear, and give his or her judgment about the correctness of the results produced by the system. In our experiment, the correctness of syntactic and conceptual relation determination was evaluated manually by judges who were well trained in dependency syntax.</Paragraph>
    <Paragraph position="1"> As a test set, we extracted 1,772 sentences that included relative clauses for the 100 verbs from 1.5 million words of corpora: an integrated Korean information base and primary school textbooks. The distribution of syntactic relations of antecedents among them and the test results are shown in Table 3. There were 1,087 antecedents (61.34%) with the subject role.</Paragraph>
    <Paragraph position="2"> The baseline accuracy of the problem is 61.34%.</Paragraph>
    <Paragraph position="3"> That is, if we always select the subject role for antecedents, the accuracy reaches 61.34%.</Paragraph>
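The baseline figure follows directly from the counts above and can be checked in two lines:

```python
# Majority-class baseline: always guess the subject role.
subject_antecedents = 1087
total_antecedents = 1772
baseline_accuracy = round(100 * subject_antecedents / total_antecedents, 2)
```

Dividing 1,087 subject-role antecedents by the 1,772 test sentences gives 61.34%, the baseline quoted in the text.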
    <Paragraph position="4">  Our system showed an average accuracy of 90.4% in syntactic relation identification, which shows that the conceptual patterns and the relative scores of syntactic relations produced in the first phase can be a good source for determining the syntactic relation of an antecedent.</Paragraph>
    <Paragraph position="5"> Through the experiments, we observed several factors that affect the performance of the system. First, the multiple meanings of a noun affect the frequency distribution of concept codes. In our system, we cope with this problem by adjusting the thresholds of standard deviation and strength. The second problem is the domain sparseness of the corpus. If the corpus for learning were restricted to a certain domain, it would greatly increase the validity of the conceptual patterns. If we used a sense-tagged corpus in the learning stage, we could achieve higher accuracy in syntactic relation determination.</Paragraph>
  </Section>
</Paper>