<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1093">
  <Title>[Table 1 residue: pattern matchers 1.1 to 1.7 over key A, kanji sequence B and delimiter D; original layout not recoverable]</Title>
  <Section position="3" start_page="550" end_page="553" type="intro">
    <SectionTitle>
2. Text Scanning Approach
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="550" end_page="551" type="sub_section">
      <SectionTitle>
2.1 Overview
</SectionTitle>
      <Paragraph position="0"> Figure 1 illustrates the processing flow. An input compound noun is first analyzed by JMA and segmented into a sequence of registered words. The output is stored as the initial value of a list called WORDLIST (WL).</Paragraph>
      <Paragraph position="1"> For every word in WL, a search for its collocational patterns is conducted, and the results are stored in the evidence database (EDB). Importantly, there is a feedback loop from EDB to WL through which newly found words can be added to WL. The search continues until every word in WL has been used as a key. This feedback enables the bootstrapping acquisition of evidence.</Paragraph>
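The feedback loop between WL and EDB can be read as a worklist algorithm. The following is a minimal Python sketch under that reading, not the authors' implementation: the callback search_patterns and the function name bootstrap_evidence are hypothetical, and the pattern search itself is left abstract.

```python
from collections import deque


def bootstrap_evidence(initial_words, search_patterns):
    """Run the WL/EDB feedback loop until every word has served as a key.

    search_patterns(key) is a hypothetical callback that returns a pair
    (evidence_items, newly_found_words) for one key.
    """
    wordlist = list(initial_words)   # WL, kept in discovery order
    seen = set(wordlist)
    pending = deque(wordlist)        # words not yet used as keys
    evidence_db = {}                 # EDB: maps each key to its evidence
    while pending:
        key = pending.popleft()
        evidence, new_words = search_patterns(key)
        evidence_db[key] = list(evidence)
        for word in new_words:
            if word not in seen:     # feedback from EDB to WL
                seen.add(word)
                wordlist.append(word)
                pending.append(word)
    return wordlist, evidence_db
```

Because each word enters the pending queue at most once, the loop terminates as soon as no search produces a word that is not already in WL.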
      <Paragraph position="2"> Figure 1: Architecture of the Direct Scanning Method (diagram showing the input, the result of the initial JMA analysis, and newly found words).</Paragraph>
      <Paragraph position="4"> After the searches, the input is re-analyzed using the newly found words. The final result of JMA is then passed to a CFG parser, which calculates the cost of the possible structures and the attribute values attached to each node in a solution. If the final morphological analysis of a given compound noun is ambiguous, the morphological analyzer picks the solution with the fewest segments.</Paragraph>
      <Paragraph position="5"> The procedure for calculating the cost of a dependency structure is basically the same as in Kobayashi et al. (1994). The cost of the dependency between two nodes is given by the mutual information between the lexical heads of the nodes (Fig. 2).</Paragraph>
      <Paragraph position="6"> Here two kinds of attributes are used: head, which records the head of a node as its value, and mod-rel, which records the kind of relationship found between the heads of the two children.</Paragraph>
      <Paragraph position="7"> In Japanese, if the two children are both content words, the value of the head attribute of the parent node is usually identical to the value of the head attribute of the right daughter.</Paragraph>
      <Paragraph position="8"> Figure 2 Dependency Representation Using</Paragraph>
      <Paragraph position="10"> head: α; head: β; mod-rel: {r, ..., r}; mod-rel: {r, ..., r}</Paragraph>
    </Section>
    <Section position="2" start_page="551" end_page="551" type="sub_section">
      <SectionTitle>
2.2 Basic CFG Rules
</SectionTitle>
      <Paragraph position="0"> The category which the morphological analyzer assigns to a word is one of the following: sn (stem of a sino-verb), n (noun), pn (proper noun), num (number), adj (stem of an adjective or an adjectival verb), prfx (nominal prefix), sfix (nominal suffix), num-prfx (numerical prefix), and num-sfix (numerical suffix). CFG rules for compound noun construction use these categories as non-terminals. The following two rules are the most basic: [np → np np] and [np → n]. These rules construct the basic framework of the dependency structure of a compound noun. We assume that the structure of a compound noun can be represented in the framework of a binary-tree grammar by using attribute-value pairs.</Paragraph>
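To make the effect of the two basic rules concrete, the Python sketch below enumerates every binary bracketing of a word sequence and percolates the head from the right daughter. It is an illustration only: it assumes every word already has category n, ignores the other categories and all costs, and the names binary_trees and head are hypothetical.

```python
def binary_trees(words):
    """Enumerate all bracketings licensed by np -> np np and np -> n.

    A leaf is the word itself; an internal node is a (left, right) tuple.
    """
    if len(words) == 1:
        return [words[0]]
    trees = []
    for split in range(1, len(words)):
        for left in binary_trees(words[:split]):
            for right in binary_trees(words[split:]):
                trees.append((left, right))
    return trees


def head(tree):
    """Head of a node: the word for a leaf, else the right daughter's head."""
    if isinstance(tree, str):
        return tree
    return head(tree[1])
```

For three words this yields the two structures ("w1", ("w2", "w3")) and (("w1", "w2"), "w3"); the number of candidates grows with the Catalan numbers, which is why a cost is needed to select among them.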
    </Section>
    <Section position="3" start_page="551" end_page="552" type="sub_section">
      <SectionTitle>
2.3 Co-occurrence Data Collection by Direct
Text Scanning
</SectionTitle>
      <Paragraph position="0"> This subsection describes the most important part of our method: the pattern matchers and heuristics on unregistered word treatment.</Paragraph>
      <Paragraph position="1"> Table 1 shows the main part of the pattern matchers. We will describe the procedure for collecting evidence by using the example mentioned previously, "改正大店法施行". The initial segmentation of the compound noun is "改正 sn / 大 adj / 店 n / 法 n / 施行 sn". Thus the WL initially contains these five words. The words are used as keys for the search. As mentioned in the previous section, this solution contains an over-segmentation error, which is the most likely error when unregistered words appear. Therefore this example captures the typical problem faced in our task.</Paragraph>
      <Paragraph position="2"> In Table 1, 'A' stands for a given key, 'B' stands for a sequence of kanji characters (we only treat kanji compound nouns in this paper), and 'D' stands for an "extended" delimiter: D is a space, a symbol, a katakana or a hiragana other than "の" (no; 'of'). After preliminary experiments, we decided to exclude "の" from the delimiters because if it is used, a pattern such as</Paragraph>
      <Paragraph position="4"> collocation of A and B. If the length of A is greater than or equal to 2, the length of B is limited to at most 3. If the length of A is 1, the length of B is limited to at most 2. Additional explanation will be given later in this subsection.</Paragraph>
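The length limits stated above can be written as a small predicate. This is a sketch with hypothetical names (max_len_b, b_is_admissible); it counts characters with Python's len and covers only the limits given in this paragraph.

```python
def max_len_b(len_a):
    """Upper bound on the length of B, given the length of the key A."""
    if len_a >= 2:
        return 3
    if len_a == 1:
        return 2
    raise ValueError("A must contain at least one character")


def b_is_admissible(a, b):
    """True if B is non-empty and within the limit that A's length allows."""
    return max_len_b(len(a)) >= len(b) >= 1
```

The tighter bound for single-character keys reflects the caution, discussed later in this subsection, that a word of length 1 is likely to stem from an over-segmentation error.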
      <Paragraph position="5"> Patterns in 1.2 collect the evidence of particlecombined collocation of A and B. A and B are combined by a particle &amp;quot;(c)&amp;quot; which is similar to &amp;quot;of' in English. Note that no part of a phrase such as &amp;quot;A(c)B(c)C&amp;quot; is picked up so that erroneous evidence can be to avoided. The length of B is limited to less than or equal to 3 (in 1.3, .... 1.7, the same condition on B is used).</Paragraph>
      <Paragraph position="6"> Patterns in 1.3 collect the evidence of an adjectival modifier-modifiee relationship between an adjective (or an adjectival noun) and a noun.</Paragraph>
      <Paragraph position="7"> Patterns in 1.4 collect the evidence of a predicate-argument relation between a sino-verb and a noun. The particles "が" (ga), "を" (wo) and "に" (ni) roughly indicate AGENT, OBJECT and GOAL, respectively.</Paragraph>
      <Paragraph position="8"> Patterns in 1.5 collect the evidence of a modifier-modifiee relationship between a sino-verb and a noun, where the sino-verb appears at the tail of a noun-modifying phrase and the noun is modified by that phrase.</Paragraph>
      <Paragraph position="9"> Patterns in 1.6 collect the evidence of a coordination relationship between two words.</Paragraph>
      <Paragraph position="10"> Patterns in 1.7 collect phrases such as "A about B" and "B about A".</Paragraph>
      <Paragraph position="11"> Here we omit the others. One can add any pattern as long as it supplies reliable evidence.</Paragraph>
      <Paragraph position="12"> In the following part of this subsection, we will illustrate the search procedure using the initial value of WL: {(改正 sn), (大 adj), (店 n), (法 n), (施行 sn)}.</Paragraph>
      <Paragraph position="13"> From the first item "改正", the evidence shown in 3.1 of Figure 3 is collected, and the result is stored in the form shown in 3.1'. Note that the number of occurrences and the observed relationships are recorded. At this stage, the unregistered word "大店法" is already captured by a pattern matcher in 1.5.</Paragraph>
      <Paragraph position="14"> As for the second word, however, one has to be careful because a word of length 1 is very likely to appear through an over-segmentation error. The pattern matchers gather evidence such as "大きな変化" (大きな: big; 変化: change), "大学" (university), "大型" (large), "大店法" (large retail-shop law), etc., as given in 3.2. This evidence contains not only correct examples (such as "大きな変化") but also registered words (such as "大学", "大型") and unregistered words (such as "大店法"). To classify the evidence, we developed the following rules: R-(a) If (1) the length of A is 1 and the length of B is 1, and (2) there is no entry for the concatenated string AB (BA) in the dictionary used by JMA, then recognize the concatenated string as an unregistered word, and apply R-(c).</Paragraph>
      <Paragraph position="15"> R-(b) If (1) the length of A is 1 and the length of B is 2, (2) there is no entry for the concatenated string AB (BA) in the dictionary, (3) the category of B is not 'sn' (the condition for AB), and (4) the concatenated string AB (BA) cannot be segmented as a sequence of two registered words A'B' (B'A'), where A' ≠ A, then recognize the concatenated string as an unregistered word and apply R-(c).</Paragraph>
      <Paragraph position="16"> R-(c) If (1) the character string consisting of B is identical to the concatenated string of the first or the first two words following A in the initial solution (the condition for AB), or (2) the character string consisting of B is identical to the concatenated string of the first previous or the first two previous words preceding A in the initial solution (the condition for BA), then record AB in WL as an unregistered word, which will invoke pattern matching using AB as a key.</Paragraph>
      <Paragraph position="17"> R-(d) If (1) the length of A is greater than or equal to 2, and (2) the concatenated string AB (BA) cannot be segmented as a sequence of two registered words A'B' (B'A'), where A' ≠ A, then record evidence of inner-word co-occurrence of A and B.</Paragraph>
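The rules R-(a) to R-(d) can be sketched in Python as follows. This is a simplified illustration, not the authors' code: it handles only the AB order (A followed by B), treats the dictionary as a plain set of strings, assumes the key occurs once in the initial segmentation, and all function names and return labels are hypothetical.

```python
def splits_into_two_registered(s, a, dictionary):
    """True if s can be cut into registered words A'B' with A' different from a."""
    for i in range(1, len(s)):
        left, right = s[:i], s[i:]
        if left != a and left in dictionary and right in dictionary:
            return True
    return False


def classify_ab(a, b, dictionary, category_b=None):
    """Apply R-(a), R-(b) and R-(d) to the concatenation AB."""
    ab = a + b
    registered = ab in dictionary
    resegmentable = splits_into_two_registered(ab, a, dictionary)
    if len(a) == 1 and len(b) == 1:                      # R-(a)
        return "other" if registered else "unregistered"
    if len(a) == 1 and len(b) == 2:                      # R-(b)
        if not registered and category_b != "sn" and not resegmentable:
            return "unregistered"
        return "other"
    if len(a) >= 2 and not resegmentable:                # R-(d)
        return "inner-word"
    return "other"


def passes_rc_ab(a, b, initial_segmentation):
    """R-(c), AB case: B equals the next word or the next two words after A."""
    if a not in initial_segmentation:
        return False
    i = initial_segmentation.index(a)
    following = initial_segmentation[i + 1:i + 3]
    return b in ("".join(following[:1]), "".join(following[:2]))


def should_add_to_wl(a, b, dictionary, category_b, initial_segmentation):
    """AB is added to WL only if it is unregistered and also passes R-(c)."""
    label = classify_ab(a, b, dictionary, category_b)
    return label == "unregistered" and passes_rc_ab(a, b, initial_segmentation)
```

The R-(c) check is what ties a candidate back to the input: a string that looks unregistered is used as a new key only when it lines up with adjacent words of the initial segmentation.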
      <Paragraph position="18"> We admit that the definition of a word might be controversial. However, we do not discuss the arguments here for lack of space. We only note that the standpoint we chose is simple and machine-tractable, and works well for our purpose.</Paragraph>
      <Paragraph position="19"> "大きな変化" is recorded as evidence of a straightforward adjectival modifier-modifiee relationship between "大" and "変化".</Paragraph>
      <Paragraph position="20"> According to R-(a), "大学" and "大型" are neglected. According to R-(b) and R-(c)-(1), "大店法" is recorded as an unregistered word and stored in WL, which invokes a search of the patterns around it.</Paragraph>
      <Paragraph position="21"> Having worked through all the elements in WL, the evidence given in 3.1', 3.2', 3.3', 3.4', 3.5' and finally 3.6' is obtained.</Paragraph>
      <Paragraph position="22"> At this stage, JMA re-analyzes the input compound noun by using the newly found words. Thus the correct segmentation "改正 sn / 大店法 n / 施行 sn" is obtained.</Paragraph>
    </Section>
    <Section position="4" start_page="552" end_page="552" type="sub_section">
      <SectionTitle>
2.4 Selection of Proper Analysis
2.4.1 Cost Calculation and Mutual
Information
</SectionTitle>
      <Paragraph position="0"> The rest of the procedure is straightforward. An augmented bottom-up CFG parser chooses the minimum-cost tree for the given word sequence. Let NP_3 be the parent of NP_1 and NP_2 in a subtree. Each node has three kinds of attributes: head, mod-rel and accum-cost. head has the lexical head of the subtree under NP_i as its value. mod-rel keeps the observed relationships captured by the pattern matchers between the two lexical heads of the child nodes (this value is not actually used in the following experiments). accum-cost c_i records the accumulated cost of the subtree which has NP_i as its root. c_3 is calculated as follows:</Paragraph>
      <Paragraph position="2"> where N(head_i) stands for the number of patterns containing head_i, and N(head_1, head_2) stands for the number of patterns containing both head_1 and head_2. The value of accum-cost of each leaf node is set to 0.</Paragraph>
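The equation itself did not survive in this copy of the text, so the Python fragment below is not a reconstruction of it. It only sketches one common way to turn the counts defined above into a mutual-information-based cost: the parent cost is assumed to be the sum of the child costs minus the pointwise mutual information of the two heads. The normaliser n_total (total number of patterns), the infinite cost for unobserved pairs, and both function names are assumptions made for illustration.

```python
import math


def pmi(n_joint, n_head1, n_head2, n_total):
    """Pointwise mutual information estimated from pattern counts."""
    return math.log((n_joint * n_total) / (n_head1 * n_head2))


def accum_cost(c1, c2, n_joint, n_head1, n_head2, n_total):
    """Assumed form: child costs plus the negated PMI of the two heads.

    c1 and c2 are the accumulated costs of the children (0 for leaves).
    An unobserved pair (n_joint == 0) is given infinite cost here, which
    is a choice of this sketch and not something stated in the text.
    """
    if n_joint == 0:
        return math.inf
    return c1 + c2 - pmi(n_joint, n_head1, n_head2, n_total)
```

Under this assumed form, head pairs that co-occur more often than chance lower the accumulated cost, so the minimum-cost tree prefers dependencies that are well supported by the collected evidence.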
    </Section>
    <Section position="5" start_page="552" end_page="553" type="sub_section">
      <SectionTitle>
Observed Evidence
</SectionTitle>
      <Paragraph position="0"> The corpus-based approach inevitably encounters the sparseness problem. Our approach also encounters this problem, although it turned out not to be serious, as will be explained in section 3.3. This subsection describes the heuristic that is employed when the evidence cannot cover any of the entire trees.</Paragraph>
      <Paragraph position="1"> Figure 4 shows two possible dependency structures in a three-word compound noun. For simplicity, the values of the head attribute are indicated instead of the non-terminal symbols. For three noun words, the following rule is applied: if only the dependency between H_1 and H_2 was observed, then 4-(a) is chosen; else if only the dependency between H_1 and H_3 was observed, then 4-(b) is chosen; else if only the dependency between H_2 and H_3 was observed, then 4-(b) is chosen.</Paragraph>
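The three-word rule is a small decision table, sketched here in Python. The function name choose_structure and the encoding of heads as the strings "H1", "H2" and "H3" are hypothetical; the function returns only the labels 4-(a) and 4-(b) and makes no claim about the shape of the trees in Figure 4.

```python
def choose_structure(observed):
    """Pick 4-(a) or 4-(b) when exactly one dependency was observed.

    observed is an iterable of head pairs such as ("H1", "H2"); the order
    inside a pair is ignored. Returns None when the rule does not apply.
    """
    pairs = {frozenset(pair) for pair in observed}
    if len(pairs) != 1:
        return None                      # the rule covers a single observation
    (pair,) = pairs
    if pair == frozenset({"H1", "H2"}):
        return "4-(a)"
    if pair in (frozenset({"H1", "H3"}), frozenset({"H2", "H3"})):
        return "4-(b)"
    return None
```

Cases with no observed dependency, or with more than one, fall outside this rule and are left to the general priority and cost comparison.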
      <Paragraph position="2"> In general, priority is given to the solution containing more subtrees which directly reflect the observed evidence.</Paragraph>
      <Paragraph position="3"> In our experiments, the analysis which has multiple minimum cost solutions was considered to have failed.</Paragraph>
    </Section>
  </Section>
</Paper>