Restructuring Tagged Corpora with Morpheme Adjustment Rules 
Toshihisa Tashiro Noriyoshi Uratani Tsuyoshi Morimoto
ATR Interpreting Telecommunications Research Laboratories
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, JAPAN
{tashiro,uratani,morimoto}@itl.atr.co.jp
Abstract
A part-of-speech tagged corpus is a very important knowledge source for natural language processing researchers.
Today, several part-of-speech tagged corpora are readily available for research use. However, because there is a wide
diversity of morphological information systems (word-segmentation, part-of-speech system, etc.), it is difficult to
use tagged corpora with an incompatible morphological information system. This paper proposes a method of
converting tagged corpora from one morpheme system to another.
1 Introduction 
Recently, many natural language processing researchers have
concentrated on corpus-based approaches. Linguistic corpora can be
classified as word-segmented corpora, part-of-speech tagged corpora,
and parsed corpora. Because a part-of-speech tagged corpus is the most
important of these, much corpus-based natural language processing
research has been performed using part-of-speech tagged corpora.
However, building a large part-of-speech tagged corpus is very
difficult. It is even more difficult to build a corpus for languages
without explicit word boundary characters, such as Japanese.
Therefore, researchers always complain of the scarcity of data in the
corpus.
To solve this data scarcity problem, previous work proposed methods
of increasing the productivity of the labor required for building a
part-of-speech tagged corpus [1].
This paper proposes another method of acquiring large part-of-speech
tagged corpora: restructuring tagged corpora by using morpheme
adjustment rules. This method makes good use of the sharable
part-of-speech tagged corpora that are already available, such as the
ATR Dialog Database (ADD) [2; 3].
Ideally, these corpora could be used by all researchers and research
groups without any modifications. However, actual part-of-speech
tagged corpora have the following problems:
• Diversity of orthography:
A word can be spelled in various ways. In Japanese, there are three
types of character sets: kanji (漢字), hiragana (ひらがな), and
katakana (カタカナ). Also, people can use these character sets at
their discretion.
• Diversity of word segmentation:
Because the Japanese language has no word boundary characters (i.e.,
blank spaces), there are no standards of word segmentation. A single
word in a certain corpus may be considered multiple words in other
corpora, and vice versa.
• Diversity of part-of-speech systems:
There are no standards for part-of-speech systems. It is true that a
detailed part-of-speech system can help the application of
part-of-speech information, but the labor required for building
corpora will continue to increase. This problem is
language-independent.
Diversities of word-segmentation and part-of-speech systems are fatal
problems. The simplest way to solve these problems is to perform a
morphological analysis on the raw text in the corpus, with no regard
to the word-segmentation and part-of-speech information. However,
making a high-quality morphological analyzer demands much time and
care. Additionally, it is wasteful to ignore the word-segmentation
and part-of-speech information that has been acquired with much
effort.
In restructuring tagged corpora with morpheme adjustment rules, the
word-segmentation and part-of-speech information of the original
corpus is rewritten, making good use of the original corpus
information. This method is characterized by reduced manual effort.
In the next section, the method of restructuring tagged corpora is
described in detail. Section 3 reports the result of an experiment in
rewriting the corpus using this method.
2 Restructuring Tagged Corpora
Restructuring tagged corpora involves the following three steps:
• preparation of training set
• extraction of morpheme adjustment rules
• rewriting of corpora
2.1 Preparation of Training Set 
First, sentences for the training set are chosen from 
the corpus to be rewritten. New word-segmentation 
and part-of-speech information (morphological infor- 
mation) is given to the sentences by a morphologi- 
cal analyzer or by hand. Consequently, the training 
set has two sets of morphological information for the 
same raw text. Figure 1 shows an example of the
training set.
A large number of training sentences is desirable, but preparing many
sentences requires much time and effort. A vast number of sentences
would be required to extend coverage to content words (such as nouns,
verbs, etc.), but function words (such as particles, auxiliary verbs,
etc.) can be covered with a smaller number of sentences.
[Figure 1 content: one raw Japanese sentence annotated twice, as
morphological information A (tags such as n-com, postp-topic, vinfl
and auxv-sfp-1) and as morphological information B (Japanese
part-of-speech labels).]
Figure 1: An Example of the Training Set
2.2 Extraction of Morpheme Adjustment Rules
The method of extracting morpheme adjustment rules from the training
set involves finding correspondences between rewriting units and
extracting rules for unknown words.
2.2.1 Correspondences of Rewriting Units
In languages without explicit word boundary characters, such as
Japanese, the word segmentation of one morphological information
system can differ from that of another in three ways: a single word
may be divided into multiple words (one-to-many correspondence),
multiple words may be unified into a single word (many-to-one
correspondence), or the segmentation of multiple words may be changed
(many-to-many correspondence). Figure 2 shows these correspondences.
We developed an algorithm to find these correspon- 
dences (Appendix A). By using this algorithm, mor- 
pheme rewriting rules (Figure 3) can be extracted. 
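The idea behind the correspondence search can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `find_correspondences`, the tuple representation of words, and the sample sentence with its tags are assumptions made for the example. Two analyses of the same raw text are walked in parallel, and a rule is emitted whenever the accumulated character lengths on both sides agree.

```python
from typing import List, Tuple

Word = Tuple[str, str]              # (surface string, part-of-speech tag)
Rule = Tuple[List[Word], List[Word]]


def find_correspondences(a: List[Word], b: List[Word]) -> List[Rule]:
    """Split two segmentations of the same text into minimal aligned groups."""
    if "".join(s for s, _ in a) != "".join(s for s, _ in b):
        raise ValueError("both analyses must cover the same raw text")
    rules: List[Rule] = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Start a new group with one word from each side.
        lhs, rhs = [a[i]], [b[j]]
        len_l, len_r = len(a[i][0]), len(b[j][0])
        i += 1
        j += 1
        # Extend the shorter side until both cover the same characters.
        while len_l != len_r:
            if len_l < len_r:
                lhs.append(a[i])
                len_l += len(a[i][0])
                i += 1
            else:
                rhs.append(b[j])
                len_r += len(b[j][0])
                j += 1
        rules.append((lhs, rhs))
    return rules


# Hypothetical sample: the same three characters segmented in two ways.
a = [("見", "VERB"), ("て", "PARTICLE"), ("いる", "VERB")]
b = [("見", "V-STEM"), ("てい", "AUXV-STEM"), ("る", "INFL")]
for lhs, rhs in find_correspondences(a, b):
    print(lhs, "<=>", rhs)
```

In this sample the first word yields a one-to-one rule and the remaining words yield a many-to-many rule, because their boundaries only coincide again at the end of the text.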
[one-to-one]
"…" (NOUN) ⇔ "…" (NOUN)
[one-to-many]
"…" (VERB) ⇔ "…" (V-STEM) "…" (INFL)
[many-to-one]
"…" (NOUN) "…" (NOUN) ⇔ "…" (NOUN)
[many-to-many]
"…" (PARTICLE) "…" (VERB) ⇔ "…" (AUXV-STEM) "…" (INFL)
Figure 2: Various correspondences
[Figure 3 content: rewriting rules that pair a tagged word sequence in
one system with a tagged word sequence in the other; the legible tags
include auxvstem-aspc, vinfl, postp-oblg, n-digit and
digit-suffix-zyuu.]
Figure 3: An Example of Extracted Rules
2.2.2 Rules for Unknown Words
Rewriting rules such as those shown in Figure 3 can rewrite only the
words that appeared in the training set. If the training set is
small, the coverage of the rules will be limited. However, because
this morpheme adjustment is a method of rewriting part-of-speech
tagged corpora, the treatment of unknown words is easier than with an
ordinary morphological analyzer: our method can make good use of the
part-of-speech information of the original corpus. Rules for unknown
words without word-segmentation changes between two morphological
information systems can be extracted automatically from one-to-one
correspondence rules in the rewriting rules.
Rules for unknown words with word-segmentation changes can also be
extracted automatically by using information concerning the length of
the word in characters. For example, when a single verb with two
characters in a certain morphological information system corresponds
to two words (verb-stem with one character and verb-inflection with
one character) in another morphological information system, the
following rewriting rule is extracted.
2(verb) → 1(verb-stem) 1(verb-inflection)
Figure 4 shows sample rules for unknown words.
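A length-based rule of this kind can be applied to an unknown word by slicing its characters. The sketch below is only illustrative: the rule encoding (a source length and tag paired with a list of target lengths and tags), the function name `apply_length_rule`, and the sample verb are assumptions, not part of the original system.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, str]                 # (surface string, part-of-speech tag)
Part = Tuple[int, str]                 # (length in characters, tag)
LengthRule = Tuple[Part, List[Part]]   # source part -> target parts


def apply_length_rule(word: Word, rule: LengthRule) -> Optional[List[Word]]:
    """Rewrite one word with a length-based rule, or return None if it does not match."""
    surface, tag = word
    (src_len, src_tag), targets = rule
    if sum(n for n, _ in targets) != src_len:
        raise ValueError("target lengths must add up to the source length")
    if tag != src_tag or len(surface) != src_len:
        return None
    pieces: List[Word] = []
    start = 0
    for n, new_tag in targets:
        # Cut the next n characters and give them the target tag.
        pieces.append((surface[start:start + n], new_tag))
        start += n
    return pieces


# 2(verb) -> 1(verb-stem) 1(verb-inflection), applied to a hypothetical unseen verb.
rule = ((2, "verb"), [(1, "verb-stem"), (1, "verb-inflection")])
print(apply_length_rule(("書く", "verb"), rule))
```

Because the rule refers only to the tag and the character count, it applies to any two-character verb, whether or not that verb occurred in the training set.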
The heuristic knowledge of character sets that an ordinary Japanese
morphological analyzer uses (such as "katakana words are usually
proper nouns", "verb inflection words are spelled using hiragana",
etc.) is also available in this morpheme adjustment technique.
2.3 Rewriting of Tagged Corpora
2.3.1 Application of Rewriting Rules
[Figure 4 content: rules that map a part-of-speech label, optionally
with word lengths in characters, from one system to the other, for
example a two-character word rewritten as 1(POSTP-OPTN)
1(POSTP-CONTR).]
Figure 4: An Example of Rules for Unknown Words
[Figure 5 content: a small word lattice whose legible edge tags
include POSTP-OPTN, AUXV-STEM, ADV-KERNEL, VSTEM and AUXV.]
Figure 5: An Example of the Lattice Formed by the Morpheme Adjuster
[Figure 6 content: a larger word lattice over the same text whose
legible edge tags include POSTP-OPTN, POSTP-OBLG, AUXV-STEM, AUXV,
VINFL, ADV-KERNEL and VSTEM.]
Figure 6: An Example of the Lattice Formed by an Ordinary
Morphological Analyzer
By applying the rewriting rules described in the last subsection to
the tagged corpus, a lattice structure (Figure 5) is formed because
of the ambiguity in rewriting rules (footnote 1).
However, this ambiguity is not as great as the ambiguity that occurs
in ordinary morphological analysis because our method makes good use
of the information of the original corpus. Figure 6 shows the lattice
structure formed when using the ordinary morphological analysis on
the same raw text. Note that the size of this lattice is greater than
the size of the lattice made by our method.
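How rule application produces a lattice can be sketched as follows. This is a minimal illustration under stated assumptions: rules are stored in a dictionary keyed by sequences of original tagged words, edges are indexed by character offsets, and the function name `build_lattice`, the `max_span` parameter, and the sample rules (including which tags a word maps to) are invented for the example.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, str]                       # (surface string, tag)
Rules = Dict[Tuple[Word, ...], List[List[Word]]]
Edge = Tuple[int, int, str, str]             # (start, end, surface, tag)


def build_lattice(sentence: List[Word], rules: Rules, max_span: int = 3) -> List[Edge]:
    """Apply every matching rewriting rule and collect the resulting lattice edges."""
    offsets = [0]
    for surface, _ in sentence:
        offsets.append(offsets[-1] + len(surface))
    edges = set()
    for i in range(len(sentence)):
        for k in range(1, max_span + 1):
            if i + k > len(sentence):
                break
            key = tuple(sentence[i:i + k])
            # One left-hand side may have several rewrites: this is the ambiguity.
            for rewrite in rules.get(key, []):
                pos = offsets[i]
                for surface, tag in rewrite:
                    edges.add((pos, pos + len(surface), surface, tag))
                    pos += len(surface)
    return sorted(edges)


# Hypothetical rules: one many-to-many rule and two single-word rules.
te, iru = ("て", "PARTICLE"), ("いる", "VERB")
rules: Rules = {
    (te, iru): [[("てい", "AUXV-STEM"), ("る", "INFL")]],
    (te,): [[("て", "POSTP-OPTN")], [("て", "POSTP-OBLG")]],
    (iru,): [[("い", "VSTEM"), ("る", "VINFL")]],
}
for edge in build_lattice([te, iru], rules):
    print(edge)
```

Only rewrites licensed by the original words and tags enter the lattice, which is why such a lattice stays smaller than one built from the raw characters alone. A word covered by no rule would leave a gap, so a fuller version would fall back on the rules for unknown words.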
2.3.2 Lattice Search
The last step in restructuring tagged corpora can be considered a
lattice search problem. In this step, all of the following knowledge
sources for ambiguity resolution used in ordinary morphological
analysis are also available in our method:
• connection matrix
• heuristic preferences (longest word preference, minimum phrase
preference, etc.)
• stochastic preferences (word n-gram, HMM, etc.)
By using these knowledge sources, the most plausible candidate is
chosen. In effect, the original corpus is converted to a new corpus
that uses a different morphological information system.
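As one concrete instance of a stochastic preference, the search can be sketched as a Viterbi pass over the lattice with a part-of-speech bigram score. The sketch assumes edges of the form (start, end, surface, tag) with non-empty surfaces and a caller-supplied `score(previous_tag, tag)` returning a log-probability; the function name `best_path`, the sample lattice, and the scores are hypothetical, and the paper does not state that its search was implemented this way.

```python
from typing import Callable, Dict, List, Optional, Tuple

Edge = Tuple[int, int, str, str]   # (start, end, surface, tag)


def best_path(edges: List[Edge], length: int,
              score: Callable[[str, str], float]) -> Optional[List[Tuple[str, str]]]:
    """Return the highest-scoring tagged segmentation covering positions 0..length."""
    bos: Edge = (0, 0, "", "<s>")
    best: Dict[Edge, Tuple[float, Optional[Edge]]] = {bos: (0.0, None)}
    ending_at: Dict[int, List[Edge]] = {0: [bos]}
    # Sorting by start position guarantees predecessors are scored first.
    for edge in sorted(edges):
        start, end, _, tag = edge
        candidates = [(best[prev][0] + score(prev[3], tag), prev)
                      for prev in ending_at.get(start, [])]
        if candidates:
            best[edge] = max(candidates, key=lambda c: c[0])
            ending_at.setdefault(end, []).append(edge)
    finals = ending_at.get(length, [])
    if length == 0 or not finals:
        return None
    edge = max(finals, key=lambda e: best[e][0])
    path = []
    while edge != bos:
        path.append((edge[2], edge[3]))
        edge = best[edge][1]
    path.reverse()
    return path


# Hypothetical lattice and bigram log-probabilities.
edges = [(0, 1, "て", "POSTP-OPTN"), (0, 2, "てい", "AUXV-STEM"),
         (1, 2, "い", "VSTEM"), (2, 3, "る", "VINFL"), (2, 3, "る", "INFL")]
logp = {("<s>", "AUXV-STEM"): -1.0, ("AUXV-STEM", "INFL"): -1.0}
print(best_path(edges, 3, lambda prev, tag: logp.get((prev, tag), -10.0)))
```

A connection matrix or a longest-word preference could be folded into the same `score` function, for example by returning negative infinity for tag pairs that may not be adjacent.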
3 Experiment 
3.1 Experimental Condition 
Footnote 1: This ambiguity mainly comes from the difference in
part-of-speech granularity between the two morphological information
systems.
The targets in our experiment are a morphological information system
for the ATR Dialog Database [2; 3] and a morphological information
system for the unification-based Japanese grammar used in ATR's
spoken language parser [4]. These two morphological information
systems have the following characteristics.
• The ATR Dialog Database was developed as material for analyzing the
characteristics of spoken-style Japanese. Therefore, the
part-of-speech granularity is coarse. Additionally, because the
word-segmentation is based on a morphological and etymological
criterion, compound nouns and compound words that function as a
single auxiliary verb are divided into several shorter word units. On
the other hand, because this database gives little consideration to
mechanical processing, stems and inflections of inflectional words
are not segmented.
• The unification-based Japanese grammar has a medium-grained
part-of-speech (pre-terminal) system to make it both efficient and
easy to maintain [5]. Because the objective of the grammar is to
extract the syntactic structures of Japanese sentences automatically
and efficiently, compound words that function as a single word are
usually recognized as a single word. On the other hand, stems and
inflections of inflectional words are segmented for convenience of
mechanical processing.
The above descriptions show that these morphological information
systems differ. The objective of this experiment is to examine
whether our method can adjust the differences between the two
morphological information systems to a considerable extent.
First, we chose 1,000 sentences from the ATR Dialog Database as the
training set and provided the morphological information
(word-segmentation and part-of-speech) of the unification-based
Japanese grammar. We prepared 350 sentences as the test set, separate
from the training set. The test sentences were also given the
morphological information.
We extracted 1,548 correspondences of rewriting units (i.e. rewriting
rules) and 428 rules for unknown words. These rules can be used for
the bi-directional rewriting experiment.
As the knowledge source in searching lattices, word bigrams and
part-of-speech bigrams were trained with the training set. To perform
the bi-directional rewriting experiment, these bigrams were trained
in both morphological information systems.
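Training such bigrams amounts to counting adjacent pairs in the tagged training sentences. The following sketch is illustrative only: the paper does not say how the bigram statistics were estimated, so the add-one smoothing, the sentence boundary symbols, and the function name `train_bigrams` are all assumptions. The same routine yields word bigrams or part-of-speech bigrams depending on the `key` function.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Word = Tuple[str, str]   # (surface string, part-of-speech tag)


def train_bigrams(sentences: List[List[Word]],
                  key: Callable[[Word], str]) -> Callable[[str, str], float]:
    """Count bigrams over key(word) and return an add-one smoothed log-probability."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in sentences:
        seq = ["<s>"] + [key(w) for w in sentence] + ["</s>"]
        contexts.update(seq[:-1])           # every symbol that has a successor
        bigrams.update(zip(seq, seq[1:]))
    vocab_size = len(set(contexts) | {"</s>"})

    def logprob(prev: str, cur: str) -> float:
        # Add-one smoothing keeps unseen pairs from scoring minus infinity.
        return math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))

    return logprob


# Hypothetical one-sentence training set.
training = [[("てい", "AUXV-STEM"), ("る", "INFL")]]
pos_logprob = train_bigrams(training, key=lambda w: w[1])
word_logprob = train_bigrams(training, key=lambda w: w[0])
print(pos_logprob("AUXV-STEM", "INFL"), pos_logprob("AUXV-STEM", "AUXV-STEM"))
```

For the bi-directional experiment, the routine would simply be run once on each of the two annotations of the training set.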
To compare our method with ordinary morphological analysis, we
developed a simple stochastic morphological analyzer that uses the
same bigrams as the knowledge sources (footnote 2). Because this
morphological analyzer has been developed for the comparative
experiment, it cannot manage unknown words. Therefore, the rewriting
test was performed by using not only the test sentences, but also the
training sentences (close experiment) and the sentences having no
unknown words (a subset of the test set).
Table 1 shows the experimental conditions in detail.
Footnote 2: Of course, the ordinary morphological analyzer can
rewrite the corpus much more accurately by using richer knowledge
sources. However, it must be noted that our method can also use such
knowledge sources.
Morphological Information   Unification-Based    ATR Dialog
                            Japanese Grammar     Database
Training Set
  sentences                 1,000                1,000
  (words)                   (10,510)             (10,723)
Test Set (Full)
  sentences                 350                  350
  (words)                   (3,804)              (4,060)
Test Set (Sub)
  sentences                 148                  148
  (words)                   (904)                (949)
Vocabulary                  1,284                1,168
POS System                  75                   26
Word Bigram                 4,325                4,292
POS Bigram                  503                  262
Table 1: Experimental Condition
3.2 Rewriting of Morphological Information
The experiment was performed bi-directionally between the
morphological information system of the ATR Dialog Database (ADD) and
the morphological information system of the unification-based
Japanese grammar.
3.2.1 From Unification-Based Grammar to ADD
This experiment rewrites from a medium-grained morphological
information system to a coarse-grained morphological information
system. The segmentation error rate and part-of-speech error rate
were calculated using the same definitions as in [1]. Table 2 shows
the result.
The error rates seem to be rather large, but it should be noted that
only simple knowledge sources are used both in our method (the
morpheme adjuster) and by the morphological analyzer. Also, it is
significant that our targets are spoken-style Japanese sentences.
Ordinary morphological analyzers can analyze written-style Japanese
sentences with a less than 5% error rate by using richer knowledge
sources [1]. However, previous work reported that the error rate for
automatic morphological analysis of the ADD text is more than 15% [6].
In comparing the two methods, the part-of-speech error rates of our
method are clearly better than those of the morphological analyzer.
This shows that our method can make good use of the original
part-of-speech information.
3.2.2 From ADD to Unification-Based Japanese Grammar
This experiment is more difficult because this rewriting is from the
coarse-grained morphological information system to the medium-grained
morphological information system. Table 3 shows the result.
The part-of-speech error rates of our method are better in this
rewriting experiment, too.
4 Conclusion 
This paper proposed restructuring of tagged corpora by using morpheme
adjustment rules. The eventual goal of this work is to make precious
knowledge sources truly sharable among many researchers. The results
of the experiment seem promising.
Our morpheme adjustment method has some resemblance to Brill's
part-of-speech tagging method [7]. Brill's simple part-of-speech
tagger can be considered a morpheme adjuster that adjusts differences
between initial (default) tags and correct tags.
As Brill applied his part-of-speech tagging technique to the
syntactic bracketing technique [8], we believe that our method can be
applied to the adjustment of parsed corpora. In the work of Grishman
et al. [9], tree rewriting rules to adjust differences between Tree
Bank and their grammar were probably prepared manually. By applying
our method to parsed corpora, such rewriting rules can be extracted
automatically.
Acknowledgments 
The authors would like to thank Dr. Yasuhiro Yamazaki, President of
ATR Interpreting Telecommunications Laboratories, for his constant
support and encouragement.
References 
[1] Maruyama, H., Ogino, S., Hidano, M., "The Mega-Word
Tagged-Corpus Project," TMI-93, pp.15-23, 1993.
[2] Ehara, T., Ogura, K. and Morimoto, T., "ATR Dialogue
Database," ICSLP-90, pp.1093-1096, 1990.
[3] Sagisaka, Y., Uratani, N., "ATR Spoken Language
Database," The Journal of the Acoustical Society of Japan,
Vol.48, No.12, pp.878-882, 1992. (in Japanese)
[4] Nagata, M. and Morimoto, T., "A Unification-Based
Japanese Parser for Speech-to-Speech Translation," IEICE
Trans. Inf. & Syst., Vol.E76-D, No.1, pp.51-61, 1993.
[5] Nagata, M., "An Empirical Study on Rule Granularity and
Unification Interleaving - Toward an Efficient Unification-
Based Parsing System," in Proc. of COLING-92, 1992.
[6] Kita, K., Ogura, K., Morimoto, T., Yano, Y., "Automatically
Extracting Frozen Patterns from Corpora Using Cost
Criteria," IPSJ Trans., Vol.34, No.9, pp.1937-1943, 1993. (in
Japanese)
[7] Brill, E., "A Simple Rule-Based Part of Speech Tagger,"
Proceedings of the Third Conference on Applied Natural
Language Processing, 1992.
[8] Brill, E., "Automatic Grammar Induction and Parsing
Free Text: Transformation-Based Error-Driven Parsing,"
ACL-93, 1993.
[9] Grishman, R., Macleod, C. and Sterling, J.,
"Evaluating Parsing Strategies Using Standardized Parse
Files," Proceedings of the Third Conference on Applied Natural
Language Processing, pp.156-161, 1992.
Method                       segmentation   part-of-speech   Total
                             error rate     error rate
Test Set (Full)              7.8%           2.8%             10.6%
Test Set (Sub)
  Morpheme Adjuster          5.1%           1.5%             6.6%
  (Morphological Analyzer)   (6.3%)         (3.4%)           (9.7%)
Training Set (close test)
  Morpheme Adjuster          0.2%           1.5%             1.7%
  (Morphological Analyzer)   (1.3%)         (3.7%)           (5.0%)
Table 2: From Unification-Based Grammar to ADD
Method                       segmentation   part-of-speech   Total
                             error rate     error rate
Test Set (Full)              8.2%           6.9%             15.1%
Test Set (Sub)
  Morpheme Adjuster          4.2%           3.1%             7.3%
  (Morphological Analyzer)   (8.5%)         (6.8%)           (15.3%)
Training Set (close test)
  Morpheme Adjuster          0.5%           3.3%             3.8%
  (Morphological Analyzer)   (0.5%)         (6.3%)           (6.8%)
Table 3: From ADD to Unification-Based Grammar
Appendix 
A. The Rule Extraction Algorithm
type
  word = record
    symbol : string;          {the word's character string}
    part_of_speech : string;  {ex. "NOUN"}
  end;
  wordlist = record
    elem : array[1..MAXLENGTH] of word;
    last : integer            {index of the last word in elem}
  end;

procedure FIND_CORRESPONDENCES (A, B : wordlist);
{The arguments of this procedure are
 two kinds of morphological information
 of the same sentence.
 For example:
   A : (… vstem)(… vinfl)
       (… auxv-stem)(… auxv-infl)
       (… auxv-tense)
   B : the same text in the other system
 The OUTPUT subroutine outputs each correspondence
 found between A and B.
 Because the total "LENGTHs" of the two arguments
 are the same, this algorithm is guaranteed to
 be completed normally.}
var
  lhs, rhs : wordlist;
  cur_a, cur_b : integer; {cursors}
begin
  cur_a := 1; cur_b := 1;
  lhs.last := 0; rhs.last := 0; {Initialize}
  lhs.last := lhs.last+1;
  lhs.elem[lhs.last] := A.elem[cur_a];
  rhs.last := rhs.last+1;
  rhs.elem[rhs.last] := B.elem[cur_b];
  while ( A.last > cur_a ) do begin
    if LENGTH(lhs) = LENGTH(rhs) then begin
      {Both sides cover the same characters: emit a rule}
      OUTPUT(lhs, rhs);
      cur_a := cur_a+1; cur_b := cur_b+1;
      Initialize(lhs, rhs); {sets lhs.last and rhs.last to 0}
      lhs.last := lhs.last+1;
      lhs.elem[lhs.last] := A.elem[cur_a];
      rhs.last := rhs.last+1;
      rhs.elem[rhs.last] := B.elem[cur_b];
    end
    else if LENGTH(lhs) > LENGTH(rhs) then begin
      {The right-hand side is shorter: take the next word of B}
      cur_b := cur_b+1;
      rhs.last := rhs.last+1;
      rhs.elem[rhs.last] := B.elem[cur_b];
    end
    else begin
      {The left-hand side is shorter: take the next word of A}
      cur_a := cur_a+1;
      lhs.last := lhs.last+1;
      lhs.elem[lhs.last] := A.elem[cur_a];
    end
  end;
  {A is exhausted: the remaining words of B complete the last rule}
  while ( B.last > cur_b ) do begin
    cur_b := cur_b+1;
    rhs.last := rhs.last+1;
    rhs.elem[rhs.last] := B.elem[cur_b];
  end;
  OUTPUT(lhs, rhs);
end;

function LENGTH(A : wordlist) : integer;
{This function returns the total length of a
 wordlist in characters. When the argument is a
 two-character vstem followed by a one-character
 vinfl, this function returns 3.}
var
  length, count : integer;
begin
  length := 0;
  for count := 1 to A.last do
    length := length + |A.elem[count].symbol|;
  return length;
end
