Machine Translation Based on NLG from XML-DB
Yohei Seki
Aoyama Gakuin University /
Department of Informatics,
The Graduate University for Advanced Studies (Sokendai)
Ken'ichi Harada
Department of Computing Science
Keio University
Abstract
The purpose of this study is to propose a new method for machine
translation. We have carried out two report generation projects
(Kittredge and Polguere, 2000): Weather Forecast and Monthly Economic
Report, each producing output in four languages: English, Japanese,
French, and German. Their input data is stored in an XML-DB. We applied
a three-stage pipelined architecture (Reiter and Dale, 2000) and
implemented each stage as an XML transformation process. We regard the
data stored as XML as a language-neutral intermediate form and employ
the so-called `sublanguage approach' (Somers, 2000). The machine
translation process is thus implemented via the XML-DB as a kind of
interlingua approach, instead of the conventional structure transfer
approach.
1 Introduction
As the variety of users accessing common resources on the World Wide
Web grows, multimedia and multilingual information presentation
technology has become more important. Machine translation technology is
essential for multilingual presentation, and many researchers pursue
language-independent structures, e.g. Rassinoux et al. (1998). Many
semantic structures, such as `semantic frames' and `feature
structures', have been developed, and common language attributes have
been embedded in them.
On the other hand, many databases around the world have been built to
store such resources. Databases that store numerical data in relational
form are necessarily language independent to a large degree. There has
been a gap, however, between DB structures and language-independent
semantic structures. Recently, XML-DB technologies have been developed
to give databases richer structuring facilities. These structuring
techniques are useful not only for data structures but also for
representing linguistic structures.
We developed the XML-based report generation system shown in Figure 1.
The system is based on a three-stage pipelined architecture: document
planning, microplanning, and surface realization. It produces reports
in four languages (Japanese, English, French, and German) from common
resources. The system also supports publishing in VoiceXML format¹ and
speech synthesis using the IBM WebSphere Voice Server SDK². It is
therefore also useful for sharing information with people with
disabilities, such as visually impaired users.
Figure 1. XML-Based Report Generation System
This paper consists of four sections. In Section 2, we discuss
language-independent and language-dependent features. Section 3 details
the multilingual generation and voice synthesis technologies in our
system. Finally, in Section 4, we present our conclusions.
¹ http://www.voicexml.org
² http://www-3.ibm.com/software/speech/enterprise/ep_11.html
Hokkaido (a)
  inland areas ⟨Asahikawa, Obihiro, Iwamizawa, Sapporo, Kucchan⟩
  Sea of Okhotsk side ⟨Kitamiesashi, Oumu, Monbetsu, Abashiri⟩
  Sea of Japan side
    Sea of Japan north side ⟨Wakkanai, Haboro, Rumoi⟩
    Sea of Japan west side ⟨Otaru, Sutsutsu, Esashi⟩
  Pacific Ocean side
    Pacific Ocean east side ⟨Nemuro, Kushiro, Hiroo⟩
    Pacific Ocean west side ⟨Urakawa, Tomakomai, Muroran, Hakodate⟩

Hokkaido (b)
  Northern area
    Sea of Japan north side
    [Asahikawa, Kitamiesashi, Oumu, Monbetsu]
  Eastern area
    Pacific Ocean east side
    [Obihiro, Abashiri]
  Southwestern area
    Sea of Japan west side
    Pacific Ocean west side
    [Iwamizawa, Sapporo, Kucchan]
Fig. 2: The Ontologies with District Names for Weather Forecasts in Hokkaido
Monthly Economic Report
  Business Conditions
  Domestic Demands
    Personal Consumption
    Wages
    Housing Construction
    Plant and Equipment Investment
  Productivity and Employment
    Mining Industry
    State of Employment
    Bankruptcy
  International Payments
    Export
    Import
    International Balance of Payments
  Prices
    Domestic Wholesale Prices
    Bargain Prices for Enterprises
    Consumer Prices
  Finances
    Financial Situation
    Money Supply
Fig. 3: The Ontologies for Monthly Economic Reports
2 Language Independence and
Dependency on Discourse Unit
2.1 Language Independence
The intentional structure used to retrieve data from the DB is based on
a language-independent ontological structure (Rassinoux et al., 1998).
Although the input database is originally provided in RDB format³, we
restructure the input data in XML according to an ontologically
formulated structure. For the `weather forecast', the data from each
observatory point was structured according to place names. For the
`economy report', each numeric item was structured according to the
contents of each heading. These structures are shown in Figures 2 and 3.

³ Our weather domain input data is the Annual Report published by the
Japan Meteorological Agency (http://www.jmbsc.or.jp/offline/cd0040.htm).
Our economy domain input data is retrieved from NEEDS (Nikkei Economic
Electronic Databank System,
http://www.nqi.co.jp/english/needs/n_top.html).
2.2 Lexical Variation Depends on
Discourse Unit
To generate each individual language, lexical paraphrasing must be
carried out according to discourse units. For example, the Japanese
expressions for `increasing' and `decreasing' numerical data vary
according to the subject: for prices, increasing = `joushou' and
decreasing = `geraku'; for other subjects, increasing = `zou' and
decreasing = `gen'. We implement this paraphrasing as a surface
realization process within each discourse unit.
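The subject-dependent choice of terms described in this section can be illustrated with a minimal Java sketch. The class name `TrendLexicon`, the `Trend` enum, and the subject label "Prices" are hypothetical names introduced for illustration only; the four romanized Japanese terms are the ones quoted above. This is not the system's actual code, which performs the choice during surface realization.

```java
// Minimal sketch of subject-dependent lexical choice (hypothetical names).
class TrendLexicon {
    enum Trend { INCREASING, DECREASING }

    // Returns the romanized Japanese term for a trend, given the subject
    // of the discourse unit. "Prices" is an assumed label for price items.
    static String japaneseTerm(String subject, Trend trend) {
        boolean increasing = (trend == Trend.INCREASING);
        if ("Prices".equals(subject)) {
            return increasing ? "joushou" : "geraku";
        }
        // All other subjects share the generic pair of terms.
        return increasing ? "zou" : "gen";
    }

    // Classifies a signed rate of change; zero is treated as increasing
    // here purely to keep the sketch small.
    static Trend trendOf(double change) {
        return change < 0 ? Trend.DECREASING : Trend.INCREASING;
    }
}
```

For example, `TrendLexicon.japaneseTerm("Prices", TrendLexicon.trendOf(-0.5))` selects the price-specific term for a decrease, while any other subject falls through to the generic pair.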
3 Multilingual Generation and Voice
Synthesis
The system is based on a three-stage pipelined architecture: document
planning, microplanning, and surface realization. The document planning
stage is independent of individual languages. The microplanning stage
includes the conversion for lexical paraphrasing in each language. The
surface realization module depends on each language.
3.1 Document Planning
The document planning module consists of two tasks: `document
structuring' and `content determination'. A code fragment for each task
is shown in Appendix A. For the Economy Reports, the output data is
produced from the data for the previous one to three months. Our input
data is stored in Yggdrasill, the XML-DB product of Mediafusion
Corporation in Japan⁴, and the contents are retrieved with XPath
expressions and structured with DOM (Document Object Model). DOM trees
are used to remove data that overlaps between the overview and shallow
data. This corresponds to two-stage content determination (Sripada et
al., 2001). The output DTD (Document Type Definition) of this module is
shown in Appendix B.
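The retrieval and structuring step can be sketched with the standard JDK XML APIs. This is only an illustration under stated assumptions: it uses `javax.xml.xpath` and DOM on an in-memory string instead of the Yggdrasill API, it uses standard XPath predicate syntax rather than the notation in Appendix A, and the class name `DocumentPlanSketch`, the sample data, and the year 2001 are hypothetical. The element and attribute names follow Appendices A and B.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DocumentPlanSketch {
    // Hypothetical stand-in for the contents of the XML-DB.
    static final String SAMPLE =
          "<EconomyData Year=\"2001\">"
        + "<MonEcoRep month=\"8\">"
        + "<LivingExpenditures>-4.1</LivingExpenditures>"
        + "</MonEcoRep>"
        + "</EconomyData>";

    // Retrieves one item with XPath and wraps it in a small plan tree.
    static Element planLivingExpenditures(String sourceXml, int year, int month) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document source =
                builder.parse(new InputSource(new StringReader(sourceXml)));

            // Content determination: select the value for the given month.
            XPath xpath = XPathFactory.newInstance().newXPath();
            String query = "/EconomyData[@Year='" + year + "']"
                         + "/MonEcoRep[@month='" + month + "']"
                         + "//LivingExpenditures/text()";
            String value = xpath.evaluate(query, source);

            // Document structuring: build the plan as a new DOM tree.
            Document plan = builder.newDocument();
            Element event = plan.createElement("EconomyEvent");
            event.setAttribute("Type", "DomesticDemands");
            Element pc = plan.createElement("PersonalConsumption");
            Element item = plan.createElement("LivingExpenditures");
            item.setAttribute("Month", Integer.toString(month));
            item.setAttribute("ComparedTo", "LastYear");
            item.appendChild(plan.createTextNode(value));
            pc.appendChild(item);
            event.appendChild(pc);
            plan.appendChild(event);
            return event;
        } catch (Exception e) {
            // Checked XML exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `DocumentPlanSketch.planLivingExpenditures(DocumentPlanSketch.SAMPLE, 2001, 8)` yields an `EconomyEvent` element whose nested `LivingExpenditures` child carries the retrieved value and the `Month` and `ComparedTo` attributes.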
3.2 Microplanning
In the microplanning module, the XML tags and elements are replaced to
produce a text specification, which is based on the lexical constraints
of each language. More precisely, microplanning comprises tasks such as
determining the detailed (sentence-internal) organization and
considering alternative ways to group information into phrases
(McDonald, 2000, p. 156). This stage is implemented with SAX (Simple
API for XML). The lexicalisation task for each language and the
aggregation task for each sentence are modularized with SAX. The output
DTD of this module is shown in Appendix C.
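A SAX-based replacement of plan elements by text-specification elements might look like the following sketch. It covers only a lexicalisation-style rewrite of one element, not aggregation. The class name `MicroplanSketch`, the `Class` values "Increase" and "Decrease", and the title and subject strings are assumptions made for illustration; the `SubHeading` and `Phrase` elements and their `Title`, `Sbj`, `Class`, and `Time` attributes follow the DTD in Appendix C.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MicroplanSketch extends DefaultHandler {
    private static final String[] MONTHS = {
        "January", "February", "March", "April", "May", "June", "July",
        "August", "September", "October", "November", "December" };

    private final StringBuilder out = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private String time;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        text.setLength(0); // start collecting character data afresh
        if ("PersonalConsumption".equals(qName)) {
            out.append("<SubHeading Title=\"Personal Consumption\">");
        } else if ("LivingExpenditures".equals(qName)) {
            // Replace the numeric month with the name used in the DTD.
            time = MONTHS[Integer.parseInt(atts.getValue("Month")) - 1];
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("LivingExpenditures".equals(qName)) {
            double change = Double.parseDouble(text.toString().trim());
            // Hypothetical Class values derived from the sign of the change.
            String cls = change < 0 ? "Decrease" : "Increase";
            out.append("<Phrase Sbj=\"Living expenditures\" Class=\"")
               .append(cls).append("\" Time=\"").append(time).append("\">")
               .append(Math.abs(change)).append("</Phrase>");
        } else if ("PersonalConsumption".equals(qName)) {
            out.append("</SubHeading>");
        }
    }

    // Parses a document plan fragment and returns the text specification.
    static String toTextSpec(String planXml) {
        try {
            MicroplanSketch handler = new MicroplanSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(planXml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because SAX delivers events in document order, each language-specific rewrite can be kept in its own handler, which is one plausible reading of how the tasks are modularized.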
3.3 Surface Realization
We followed the sublanguage approach (Somers, 2000), because surface
lexical expression strongly depends on each discourse structure. We
implemented the surface realization stage with XSLT (eXtensible
Stylesheet Language Transformations) and Xalan⁵, and the output had two
variations: the XHTML and VoiceXML formats. The combination of the
xsl:param and xsl:choose instructions was used for lexical paraphrasing
based on discourse structure constraints. In addition, context-based
lexical paraphrasing is an important factor in avoiding repetitious
text. We use the Java extension function mechanism in Xalan to count
repeated elements and change the expression accordingly. The completed
texts of the Monthly Economic Reports and the Weather Forecasts are
shown in Appendices D and E, respectively.

⁴ http://www.mediafusion.co.jp/seihin/ygg/index.html
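The use of xsl:param together with xsl:choose for lexical paraphrasing can be sketched as follows. The stylesheet is a hypothetical one written for this illustration, not the system's stylesheet: the parameter name `lang`, the `Class` values, the subject label "Prices", and the English word "increased" are assumptions, while the Japanese terms come from Section 2.2. The sketch runs on the XSLT processor bundled with the JDK through `javax.xml.transform` and does not use Xalan extension functions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SurfaceRealizerSketch {
    // Hypothetical stylesheet: the target language arrives as a parameter,
    // and xsl:choose picks the wording from the phrase attributes.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:param name='lang' select=\"'en'\"/>"
        + "<xsl:template match='Phrase'>"
        + "<xsl:choose>"
        + "<xsl:when test=\"$lang='ja' and @Sbj='Prices'"
        + " and @Class='Increase'\">joushou</xsl:when>"
        + "<xsl:when test=\"$lang='ja' and @Sbj='Prices'\">geraku</xsl:when>"
        + "<xsl:when test=\"$lang='ja' and @Class='Increase'\">zou</xsl:when>"
        + "<xsl:when test=\"$lang='ja'\">gen</xsl:when>"
        + "<xsl:when test=\"@Class='Increase'\">increased</xsl:when>"
        + "<xsl:otherwise>decreased</xsl:otherwise>"
        + "</xsl:choose>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to one phrase for the given language code.
    static String realize(String phraseXml, String lang) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setParameter("lang", lang);
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(phraseXml)),
                new StreamResult(out));
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Passing the language as a stylesheet parameter keeps one stylesheet per output format, with the per-language and per-subject wording confined to the xsl:choose branches.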
4 Conclusions
We implemented a three-stage pipelined NLG architecture (Reiter and
Dale, 2000) as XML transformation processes. Our system demonstrated
the effectiveness of using XML to translate reports from a database,
based on the distinction between domain selection and linguistic
selection. XML is especially useful for content determination from a
hierarchically structured database. Our approach can also be applied to
NLG from time series or chronological data that is characterized by
dense information at the same reference time point.
Our system used a common document planner to translate into four
different languages. The document planning module depends only on the
domain of its input database. Our system therefore makes a distinction
between the data selection and linguistic selection processes in order
to produce reports from the DB.
A The Implementation Code in
Document Planning Stage
A.1 Code Fragment in Document
Structuring
// Open the XML-DB connection, log in, and open the basket. A return
// value of 0 is treated as success; each failure prints its reason.
if (fxoBasket.Open(Host) == 0) {
    if (fxoBasket.Login(Alias, UserID, Password) == 0) {
        if (fxoBasket.OpenBasket()) {
            // Build one plan element per report heading.
            DomesticDemands plan1 = new DomesticDemands(xmldoc);
            Element ecoElm1 = plan1.MakePlan(year, month_i, obj, fxoBasket);
            PandE plan2 = new PandE(xmldoc);
            Element ecoElm2 = plan2.MakePlan(year, month_i, obj, fxoBasket);
            InternationalPayments plan3 = new InternationalPayments(xmldoc);
            ...
            fxoBasket.CloseBasket();
        } else {
            System.out.println("OpenBasket ----- " + fxoBasket.Get_Reason());
        }
        fxoBasket.Logout();
    } else {
        System.out.println("Login ----- " + fxoBasket.Get_Reason());
    }
    fxoBasket.Close();
} else {
    System.out.println("Open ----- " + fxoBasket.Get_Reason());
}
A.2 Code Fragment in Content
Determination
// Content determination for the `Domestic Demands' heading.
public class DomesticDemands {
    private XmlDocument xmldoc;
    private Element subroot;

    public DomesticDemands(XmlDocument doc) {
        xmldoc = doc;
    }

    // Builds the EconomyEvent element for this heading.
    public Element MakePlan(String year, String month_i,
            String obj, JYggdrasill fxoBasket) {
        Element subroot = xmldoc.createElement("EconomyEvent");
        subroot.setAttribute("Type", "DomesticDemands");
        Element pc = PersonalConsumption(year, month_i, obj, fxoBasket);
        subroot.appendChild(pc);
        Element wg = Wages(year, month_i, obj, fxoBasket);
        subroot.appendChild(wg);
        .....

    private Element PersonalConsumption(String year,
            String month_i, String obj, JYggdrasill fxoBasket) {
        Element pc = xmldoc.createElement("PersonalConsumption");
        int month_lll = Integer.parseInt(month_i) - 3;
        int year_lll = Integer.parseInt(year);
        .....
        // Query the XML-DB with an XPath expression, then drop the first
        // 65 and the last 17 characters of the returned fragment.
        String item1 = fxoBasket.GetDocumentFragments(
                "/EconomyData[/@Year=\"" + year_lll + "\"]"
                + "/MonEcoRep[/@month=\"" + month_lll + "\"]"
                + "//LivingExpenditures/text()").substring(65);
        item1 = item1.substring(0, item1.length() - 17);
        .....
        Element elm1 = xmldoc.createElement("LivingExpenditures");
        elm1.setAttribute("Month", Integer.toString(month_lll));
        elm1.setAttribute("ComparedTo", "LastYear");
        elm1.appendChild(xmldoc.createTextNode(item1));
        .....
        return pc;
        .....
B The Document Plan DTD
Example
<?xml version="1.0" encoding="Shift_JIS" ?>
<!ELEMENT Set ( EconomyEvent+)>
<!ATTLIST Set Year NMTOKEN #REQUIRED >
<!ATTLIST Set Object NMTOKEN #REQUIRED >
<!ATTLIST Set Month NMTOKEN #REQUIRED >
<!ELEMENT EconomyEvent ( PersonalConsumption, Wages,
HousingConstruction, PlantandEquipmentInvestment,
MiningIndustory?, EmploymentState?, Bankruptcy?, Export?,
Import?, BalanceofPayments?,
DomesticWholesalePricesSituation?,
BargainPricesforEnterpriseSituation?,
ConsumerPricesSituation?,
FinancialSituation?, MoneySupply? ) >
<!ATTLIST EconomyEvent Type NMTOKEN #REQUIRED >
<!ELEMENT PersonalConsumption ( LivingExpenditures+,
LivingExpendituresforWorkers, LevelofConsumption,
LevelofConsumptionforWorkers ) >
<!ELEMENT LivingExpendituresforWorkers ( #PCDATA ) >
<!ATTLIST LivingExpendituresforWorkers Month
NMTOKEN #REQUIRED >
<!ATTLIST LivingExpendituresforWorkers ComparedTo
NMTOKEN #REQUIRED >
.....
C The Text Specication DTD
Example
<?xml version="1.0" encoding="Shift_JIS" ?>
<!ELEMENT Set ( EconomyEvent+)>
<!ATTLIST Set Year NMTOKEN #REQUIRED >
<!ATTLIST Set Object NMTOKEN #REQUIRED >
<!ATTLIST Set Time NMTOKEN #REQUIRED >
<!ELEMENT EconomyEvent ( Heading, SubHeading+ ) >
<!ELEMENT Heading ( #PCDATA ) >
<!ELEMENT SubHeading ( Phrase+ ) >
<!ATTLIST SubHeading Title CDATA #REQUIRED >
<!ELEMENT Phrase ( #PCDATA ) >
<!ATTLIST Phrase Use CDATA #IMPLIED >
<!ATTLIST Phrase Class NMTOKEN #IMPLIED >
<!ATTLIST Phrase Head ( True ) #IMPLIED >
<!ATTLIST Phrase Post CDATA #IMPLIED >
<!ATTLIST Phrase Household NMTOKEN #IMPLIED >
<!ATTLIST Phrase Sbj CDATA #REQUIRED >
<!ATTLIST Phrase Rhetoric ( Sequence | Embed ) #IMPLIED >
<!ATTLIST Phrase Product CDATA #IMPLIED >
<!ATTLIST Phrase Time ( September | May | August |
October | July ) #REQUIRED >
<!ATTLIST Phrase Prep CDATA #IMPLIED >
D Monthly Economic Report
D.1 English Output Example
1. Domestic Demands
Personal Consumption
Living expenditures (whole) for July decreased 2.6% compared to the
same period last year, and for August a 4.1% decrease compared to the
same period last year.
When you look at the change classified by household spending, there was
a 2.9% decrease compared to the same period last year for working
people in August.
The consumption level for August decreased 3.09% compared to the same
period last year.
The consumption level for working people in August decreased 2.09%
compared to the same period last year.
Wages
Income for August decreased 1.19% compared to the same period last year
for companies employing 30 or more people.
Additional allowances for August decreased 5.46% compared to the same
period last year for companies employing 30 or more people.
Real wages for August decreased 2.12% compared to the same period last
year for companies employing 30 or more people.
Housing Construction
The number of housing starts (seasonally adjusted rate) for July
decreased 2.44% compared to the last month, and a 0.53% decrease
compared to the same period last year. The number of housing starts
(seasonally adjusted rate) for August decreased 0.11% compared to the
same period last year.
The floor space of new houses for August decreased 0.93% compared to
the last month, and a 2.30% decrease compared to the same period last
year.
D.2 Japanese Output Example
[The Japanese output text could not be recovered from the source
encoding. The figures that survive (2.6, 4.1, 2.9, 3.09, 2.09, 1.19,
5.46, 2.12, 2.44, 0.53, 0.11, 0.93, and 2.30) match those in the
English output in D.1.]
E Weather Forecast
E.1 English Output Example
The weather at three o'clock in Okinawa is cloudy
throughout the archipelago .
Today's weather will be fair , but locally cloudy over
the Miyako Islands . It will be clear over the Daito Islands
at sunrise .
Tonight's weather will be fair throughout . It will be
cloudy over the Miyako Islands , and there will be showers
on the Daito Islands in the afternoon .
The outlook for tomorrow in Okinawa is fair , but
locally slightly cloudy over the Daito Islands .
Tomorrow night's weather will be fair , but partly
cloudy over the Yonaguni Islands .
E.2 French Output Example
Le temps a trois heures dans zone de l'Okinawaest
nuageux dans toutes les parties .
D'aujourd'hui temps dans zone de l'Okinawa sera par-
fait , mais sera nuageux localement dans
^
Iles de Miyako
. Il sera clair dans
^
Iles de Daito a l'aube .
Ce soir temps dans zone de l'Okinawa sera parfait
dans toutes les parties . Il sera nuageux dans
^
Iles de
Miyako , et il sera pluvieux dans
^
Iles de Daito l'apres-
midi .
Les perspectives pour le demain dans zone de l'Oki-
nawa sera parfait , mais dans
^
Iles de Daito .
Demain le temps de la nuit dans zone de l'Okinawa
sera parfait , mais sera nuageux en partie dans
^
Iles de
Yonaguni .
E.3 German Output Example
Das Wetter um drei Uhr in Okinawa ist es wolkig
überall in den Inseln .
Heute ist es sonnig , aber ist es vereinzelt wolkig über
den Inseln Miyako . Am Sonnenaufgang wird es frei über
den Inseln Daito .
Heute abend wird es sonnig überall . Am Nachmittag
gibt es Duschen auf den Inseln Daito , und wolkig über
den Inseln Miyako.
Morgen in Okinawa wird es sonnig , aber vereinzelt
etwas wolkig über den Inseln Daito .
Morgen abend wird es sonnig , aber örtlich wolkig
über den Inseln Yonaguni .

References

R. I. Kittredge and A. Polguere. 2000. The generation of reports from
databases. In R. Dale, H. Moisl, and H. Somers, editors, Handbook of
Natural Language Processing, chapter 11, pages 261–304. Marcel Dekker.

D. D. McDonald. 2000. Natural language generation. In R. Dale,
H. Moisl, and H. Somers, editors, Handbook of Natural Language
Processing, chapter 7, pages 147–179. Marcel Dekker.

A. Rassinoux, R. H. Baud, C. Lovis, J. C. Wagner, and J. Scherrer.
1998. Tuning up conceptual graph representation for multilingual
natural language processing in medicine. In M. Mugnier and M. Chein,
editors, Conceptual Structures: Theory, Tools and Applications, pages
390–397. Springer LNAI 1453, Montpellier, France, August.

E. Reiter and R. Dale. 2000. Building Natural Language Generation
Systems. Cambridge University Press.

H. Somers. 2000. Machine translation. In R. Dale, H. Moisl, and
H. Somers, editors, Handbook of Natural Language Processing, chapter
13, pages 329–346. Marcel Dekker.

S. G. Sripada, E. Reiter, J. Hunter, and J. Yu. 2001. A two-stage model
for content determination. In Proc. of the 8th European Workshop on
Natural Language Generation associated to ACL 39th Ann. Meeting and
10th Conf. of the European Chapter, pages 3–10, Toulouse, France,
July 6–7.

⁵ http://xml.apache.org/xalan-j/index.html
