<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1057">
  <Title>Robust Segmentation of Japanese Text into a Lattice for Parsing</Title>
  <Section position="4" start_page="390" end_page="391" type="metho">
    <SectionTitle>
3 Japanese Orthographic Variation
</SectionTitle>
    <Paragraph position="0"> Over the centuries, Japanese has evolved a complex writing system that gives the writer a great deal of flexibility when composing text.</Paragraph>
    <Paragraph position="1"> Four scripts are in common use (kanji, hiragana, katakana and roman), and can co-occur within lexical entries (as shown in Table 1).</Paragraph>
    <Paragraph position="2"> Some mixed-script entries could be handled as syntactic compounds; for example, ID kaado ("ID card") could be derived from ID NOUN + kaado NOUN. However, many such items are preferably treated as lexical entries because they have non-compositional syntactic or semantic attributes.</Paragraph>
    <Paragraph position="3"> [Table 1: examples of script combinations in lexical entries, including atarashii "new" (kanji-hiragana), honyuurui "mammal" (kanji-hiragana), haburashi "toothbrush" (kanji-katakana), juunigatsu "December" and gamma sen "gamma rays" (kanji-symbol), otoire "toilet" (mixed kana), aidii kaado "ID card" and messeejaa RNA "messenger RNA" (kana-alpha), sutoronchiumu 90 "Strontium 90" and hoeru yonjuu do "roaring forties" (kana-symbol), keshigomu "eraser", arufa kentauri sei "Alpha Centauri", and togaki "stage directions" (other mixed).]</Paragraph>
    <Paragraph position="4"> In addition, many Japanese verbs and adjectives (and words derived from them) have a variety of accepted spellings associated with okurigana, optional characters representing inflectional endings. For example, the present tense of kiriotosu ("to prune") can be written in any of several ways, depending on which optional okurigana characters are spelled out.</Paragraph>
    <Paragraph position="5"> Matters become even more complex when one script is substituted for another at the word or sub-word level. This can occur for a variety of reasons: to replace a rare or difficult kanji (rachi = "kidnap" written partly in kana); to highlight a word in a sentence (henna kakkou = "strange appearance"); or to indicate a particular, often technical, sense (watatte = "crossing over" written in katakana, to emphasize the domain-specific sense of "connecting two groups" in Go literature).</Paragraph>
    <Paragraph position="6"> More colloquial writing allows for a variety of contracted forms like ore-tacha (= ore-tachi + wa = "we" + TOPIC) and phonological mutations such as deesu for desu ("is").</Paragraph>
    <Paragraph position="7"> This is only a sampling of the orthographic issues present in Japanese. Many of these variations pose serious sparse-data problems, and lexicalization of all variants is clearly out of the question.</Paragraph>
  </Section>
  <Section position="5" start_page="391" end_page="392" type="metho">
    <SectionTitle>
4 Segmenter Design
</SectionTitle>
    <Paragraph position="0"> Given the broad long-term goals for the overall system, we address the issues of recall/precision and orthographic variation by narrowly defining the responsibilities of the segmenter as: (1) maximize recall; (2) normalize word variants.</Paragraph>
    <Section position="1" start_page="391" end_page="392" type="sub_section">
      <SectionTitle>
4.1 Maximize Recall
</SectionTitle>
      <Paragraph position="0"> Maximal recall is imperative: any recall mistake made in the segmenter prevents the parser from reaching a successful analysis. Since the parser in our NL system is designed to handle ambiguous input in the form of a word lattice of potentially overlapping records, we can accept lower precision if that is what is necessary to achieve high recall. Conversely, high precision is specifically not a goal for the segmenter. While desirable, high precision may be at odds with the primary goal of maximizing recall. Note that the lower bound for precision is constrained by the lexicon.</Paragraph>
      <Paragraph position="1"> Given the extensive amount of orthographic variability present in Japanese, some form of normalization into a canonical form is a prerequisite for any higher-order linguistic processing. The segmenter performs two basic kinds of normalization: lemmatization of inflected forms and orthographic normalization.</Paragraph>
      <Paragraph position="2"> [Table 3: examples of normalization: okurigana variants (fukinuke = "drafty", kamiawaseru = "to engage (gears)", mitsumori = "estimate"); non-standard script (onnanoko = "girl", disuko = "disco", ID kaado = "ID card", ikkagetsu = "one month", kagi tabako = "snuff", Kasumigaseki); numerals for kanji (gorin = "Olympics", hitori = "one person"); vowel variants (oniisan = "older brother").] Examples are given in Table 3. Two cases of special interest are okurigana and inline yomi/kanji normalizations. The okurigana normalization expands shortened forms into fully specified forms (i.e., forms with all optional characters present). The yomi/kanji handling takes inline kanji readings (e.g. gesshirui = "rodent") and normalizes them. LEMMATIZATION in Japanese is the same as that for any language with inflected forms: a lemma, or dictionary form, is returned along with the inflection attributes. 
So, a form like tabeta ("ate") would return a lemma of taberu ("eat") along with a PAST attribute.</Paragraph>
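Conceptually, this lemmatization step can be sketched as suffix rewriting plus feature attachment. The rules below are hypothetical toy rules for a regular ichidan verb like taberu; the paper's actual lemmatizer is lexicon-driven, so this is only an illustration:

```python
# Toy suffix-rewriting lemmatizer (hypothetical rules, not the paper's).
# Each rule: (inflected suffix, dictionary-form suffix, features).
SUFFIX_RULES = [
    ("た", "る", {"PAST"}),    # tabe-ta  -> tabe-ru, PAST
    ("て", "る", {"GERUND"}),  # tabe-te  -> tabe-ru, GERUND
]

def lemmatize(form):
    """Return (lemma, inflection features) for a surface form."""
    for suffix, repl, feats in SUFFIX_RULES:
        if form.endswith(suffix):
            return form[: -len(suffix)] + repl, feats
    return form, set()
```

Contracted forms would be expanded first and each piece fed through the same machinery.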
      <Paragraph position="3"> Contracted forms are expanded and lemmatized individually, so that tabetecchatta ("has eaten and gone") is returned as: taberu ("eat") GERUND + iku ("go") GERUND + shimau ASPECT, with a PAST attribute.  ORTHOGRAPHIC NORMALIZATION smoothes out orthographic variations so that words are returned in a standardized form. This facilitates lexical lookup and allows the system to map the variant representations to a single lexicon entry.</Paragraph>
      <Paragraph position="4"> We distinguish two classes of orthographic normalization: character type normalization and script normalization.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="392" end_page="392" type="metho">
    <SectionTitle>
CHARACTER TYPE NORMALIZATION takes the
</SectionTitle>
    <Paragraph position="0"> various representations allowed by the Unicode specification and converts them into a single consistent form. Table 2 summarizes this class of normalization.</Paragraph>
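In modern terms, much of this class of normalization corresponds to Unicode compatibility folding. A minimal sketch using Python's standard unicodedata module (our assumption for illustration, not the paper's implementation) folds full-width ASCII and half-width katakana into one consistent form:

```python
import unicodedata

def normalize_char_types(text):
    """Fold the multiple Unicode representations of 'the same' characters
    (full-width ASCII, half-width katakana, compatibility forms) into a
    single consistent form via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)
```

NFKC also composes half-width katakana with trailing voiced-sound marks into single precomposed characters, which is the behavior character type normalization needs here.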
    <Paragraph position="1"> SCRIPT NORMALIZATION rewrites the word so that it conforms to the script and spelling used in the lexical entry; infixed parenthetical material is normalized out (after using the parenthetical information to verify segmentation accuracy).</Paragraph>
  </Section>
  <Section position="7" start_page="392" end_page="393" type="metho">
    <SectionTitle>
5 Lexicon Structures
</SectionTitle>
    <Paragraph position="0"> Several special lexicon structures were developed to support these features. The most significant is an orthography lattice* that concisely encapsulates all orthographic variants for each lexicon entry and implicitly specifies the normalized form. This has the advantage of compactness and facilitates lexicon maintenance, since lexicographic information is stored in one location.</Paragraph>
    <Paragraph position="1"> The orthography lattice stores kana information about each kanji or group of kanji in a word. For example, the lattice for the verb taberu ("eat", 食べる) is [食,た]べる, because the first character (ta) can be written as either kanji 食 or kana た. A richer lattice is needed for entries with okurigana variants, like kiriotosu ("to prune") cited earlier: commas separate each okurigana grouping. The lattice for kiriotosu is [切,り][落,と]す. Table 4 contains more lattice examples.</Paragraph>
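One plausible reading of the okurigana lattice notation — each [kanji,okurigana] group may be written with or without the optional okurigana character — can be sketched as follows. The function name and the exact notation handling are our assumptions; alternation groups like [食,た] (kanji-or-kana) would need separate treatment:

```python
import re
from itertools import product

def expand_lattice(lattice):
    """Expand an okurigana-style orthography lattice such as
    '[切,り][落,と]す' into its spelling variants, reading each
    [kanji,okurigana] group as: kanji, optionally followed by the
    okurigana character. Literal text outside brackets is kept as-is."""
    parts = re.findall(r'\[([^,\]]+),([^\]]+)\]|([^\[\]]+)', lattice)
    choices = []
    for kanji, okuri, literal in parts:
        if literal:
            choices.append([literal])            # fixed segment
        else:
            choices.append([kanji, kanji + okuri])  # with/without okurigana
    return {"".join(p) for p in product(*choices)}
```

Under this reading, the kiriotosu lattice expands to the four okurigana spellings the section describes, which is how a single lexicon entry can anticipate all variants.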
    <Paragraph position="2"> Enabling all possible variants can proliferate records and confuse the analyzer (see [Kurohashi 94]). We therefore suppress pathological variants that cause confusion with more common words and constructions. For example, nagai ("a long visit", 長居) never occurs fully in kana, since ながい is ambiguous with the highly frequent adjective nagai ("long", 長い). Likewise, a word like nihon ("Japan", 日本) is constrained to inhibit invalid variants like に本, which cause confusion with に POST + 本 NOUN [ni PARTICLE + hon = "book"]. We default to enabling all possible orthographies for each entry and disable only those that are required. This saves us from having to update the lexicon whenever we encounter a novel orthographic variant, since the lattice anticipates all possible variants.</Paragraph>
    <Paragraph position="3"> * Not to be confused with the word lattice, which is the set of records passed from the segmenter to the parser.</Paragraph>
  </Section>
  <Section position="8" start_page="393" end_page="395" type="metho">
    <SectionTitle>
6 Unknown Words
</SectionTitle>
    <Paragraph position="0"> Unknown words pose a significant recall problem in languages that don't place spaces between words. The inability to identify a word in the input stream of characters can cause neighboring words to be misidentified.</Paragraph>
    <Paragraph position="1"> We have divided this problem space into six categories: variants of lexical entries (e.g., okurigana variations, vowel extensions, etc.); non-lexicalized proper nouns; derived forms; foreign loanwords; mimetics; and typographical errors. This allows us to devise focused heuristics to attack each class of unfound words.</Paragraph>
    <Paragraph position="2"> The first category, variants of lexical entries, has been addressed through the script normalizations discussed earlier.</Paragraph>
    <Paragraph position="3"> Non-lexicalized proper nouns and derived words, which account for the vast majority of unfound words, are handled in the derivational assembly component. This is where compounds like furansugo ("French (language)") are assembled from their base components furansu ("France") and go ("language"). Unknown foreign loanwords are identified by a simple maximal-katakana heuristic that returns the longest run of katakana characters. Despite its simplicity, this algorithm appears to work quite reliably when used in conjunction with the other mechanisms in our system.</Paragraph>
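The maximal-katakana heuristic amounts to returning the longest contiguous run of characters from the katakana Unicode block. A minimal sketch (the exact character range used is our assumption):

```python
import re

# Katakana block U+30A0..U+30FF, which includes the long-vowel mark ー.
KATAKANA_RUN = re.compile(r'[\u30A0-\u30FF]+')

def longest_katakana_run(text):
    """Return the longest contiguous run of katakana characters,
    proposed as an unknown-loanword candidate; '' if none found."""
    runs = KATAKANA_RUN.findall(text)
    return max(runs, key=len) if runs else ""
```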
    <Paragraph position="4"> Mimetic words in Japanese tend to follow simple ABAB or ABCABC patterns in hiragana or katakana, so we look for these patterns and propose them as adverb records.</Paragraph>
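The ABAB/ABCABC test is a reduplication check over kana strings, which can be sketched as follows (is_mimetic is a hypothetical name; the kana check is our assumption):

```python
def is_mimetic(word):
    """Propose ABAB (length 4) or ABCABC (length 6) reduplications in
    hiragana or katakana as mimetic-adverb candidates."""
    n = len(word)
    if n in (4, 6) and word[: n // 2] == word[n // 2 :]:
        half = word[: n // 2]
        # Hiragana U+3040..U+309F and katakana U+30A0..U+30FF only.
        return all(ord(c) in range(0x3040, 0x3100) for c in half)
    return False
```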
    <Paragraph position="5"> The last category, typographical errors, remains mostly a subject for future work. Currently, we only address basic kanji ↔ katakana and hiragana ↔ katakana substitutions between visually similar characters.</Paragraph>
    <Paragraph position="6">  7 Evaluation  Our goal is to improve parser coverage by improving the recall in the segmenter. Evaluation of this component is appropriately conducted in the context of its impact on the entire system. 7.1 Parser Evaluation  Running on top of our segmenter, our current parsing system reports ~71% coverage (i.e., input strings for which a complete and acceptable sentential parse is obtained), and ~97% accuracy for POS-labeled breaking accuracy. A full description of these results is given in [Suzuki 00]. 7.2 Segmenter Evaluation  Three criteria are relevant to segmenter performance: recall, precision, and speed.</Paragraph>
      <Paragraph position="8"> 7.2.1 Recall  Analysis of a randomly chosen set of tagged sentences gives a recall of 99.91%. This result is not surprising, since maximizing recall was a primary focus of our efforts.</Paragraph>
      <Paragraph position="9"> The breakdown of the recall errors is as follows: missing proper nouns = 47%, missing nouns = 15%, missing verbs/adjectives = 15%, orthographic idiosyncrasies = 15%, archaic inflections = 8%.</Paragraph>
      <Paragraph position="10"> It is worth noting that for derived forms (those that are handled in the derivational assembly component), the segmenter is considered correct as long as it produces the necessary base records needed to build the derived form. (Tested on a 15,000-sentence blind, balanced corpus; see [Suzuki 00] for details.)</Paragraph>
      <Paragraph position="11"> [Figure: characters-per-second performance against sentence length (x-axis) for the segmenter alone (upper curve) and our NL system as a whole (lower curve).]</Paragraph>
    <Section position="1" start_page="394" end_page="395" type="sub_section">
      <SectionTitle>
7.2.2 Precision
</SectionTitle>
      <Paragraph position="0"> Since we focused our efforts on maximizing recall, a valid concern is the impact of the extra records on the parser, that is, the effect of lower segmenter precision on the system as a whole.</Paragraph>
      <Paragraph position="1"> Figure 2 shows the baseline segmenter precision plotted against sentence length using the 3888 tagged sentences. For comparison, data for Chinese is included. These are baseline values in the sense that they represent the number of records looked up in the lexicon without application of any heuristics to suppress invalid records. Thus, these numbers represent worst-case segmenter precision.</Paragraph>
      <Paragraph position="2"> The baseline precision for the Japanese segmenter averages 24.8%, which means that a parser would need to discard 3 records for each record it used in the final parse. This value stays fairly constant as the sentence length increases. The baseline precision for Chinese averages 37.1%. The disparity between the Japanese and Chinese worst-case scenarios is believed to reflect the greater ambiguity inherent in the Japanese writing system, owing to orthographic variation and the use of a syllabic script.</Paragraph>
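The "discard 3 records for each record used" figure follows directly from the precision value: with precision p, roughly 1/p records are looked up for each record kept, so 1/p − 1 are discarded. A one-line check:

```python
def discards_per_used(precision):
    """Records discarded per record used in the final parse, given
    segmenter precision (fraction of looked-up records that are used)."""
    return 1.0 / precision - 1.0
```

At p = 24.8% this gives about 3.03 discarded records per used record; at the Chinese baseline of 37.1%, about 1.7.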
      <Paragraph position="3"> ++ The tags were obtained by using the results of the parser on untagged sentences.</Paragraph>
      <Paragraph position="4"> [Footnote: sentences tagged in a similar fashion using our Chinese system.] Using conservative pruning heuristics, we are able to bring the precision up to 34.7% without affecting parser recall. Primarily, these heuristics work by suppressing the hiragana form of short, ambiguous words (like ki = "tree, air, spirit, season, record, yellow, ...", which is normally written using kanji to identify the intended sense). 7.2.3 Speed  Another concern with lower precision values has to do with performance measured in terms of speed.</Paragraph>
      <Paragraph position="5"> Figure 3 summarizes characters-per-second performance of the segmentation component and our NL system as a whole (including the segmentation component). As expected, the system takes more time for longer sentences. Crucially, however, the system slowdown is shown to be roughly linear. Figure 4 shows how much time is spent in each component during sentence analysis. As the sentence length increases, lexical lookup, derivational morphology, and "other" stay approximately constant while the percentage of time spent in the parsing component increases.</Paragraph>
      <Paragraph position="6"> Table 5 compares parse-time performance for tagged and untagged sentences. This table quantifies the potential speed improvement that the parser could realize if segmenter precision were improved. Column A provides baseline lexical lookup and parsing times based on untagged input.</Paragraph>
      <Paragraph position="7"> Note that segmenter time is not given in this table because it would not be comparable to the hypothetical segmenters devised for columns B and C.</Paragraph>
      <Paragraph position="8"> [Table 5: results of a timing experiment where untagged input (A) is compared with space-broken input (B) and space-broken input with POS tags (C).]</Paragraph>
      <Paragraph position="9"> Columns B and C give timings based on a (hypothetical) segmenter that correctly identifies all word boundaries (B) and one that identifies all word boundaries and POS (C). C represents the best-case parser performance since it assumes perfect precision and recall in the segmenter. The bottom portion of Table 5 restates these improvements as percentages.</Paragraph>
      <Paragraph position="10"> This table suggests that adding conservative pruning to enhance segmenter precision may improve overall system performance. It also provides a metric for evaluating the impact of heuristic rule candidates. The parse-time improvements from a rule candidate can be weighed against the cost of implementing the additional code to determine the overall benefit to the entire system.</Paragraph>
    </Section>
  </Section>
</Paper>