From Controlled Document Authoring
to Interactive Document Normalization
Aurélien Max
Groupe d'Étude pour la Traduction Automatique
GETA-CLIPS
Grenoble, France
aurelien.max@imag.fr
Abstract
This paper presents an approach to normalize documents in constrained domains. This approach reuses resources developed for controlled document authoring and is decomposed into three phases. First, candidate content representations for an input document are automatically built. Then, the content representation that best corresponds to the document according to an expert of the class of documents is identified. This content representation is finally used to generate the normalized version of the document. The current version of our prototype system is presented, and its limitations are discussed.
1 Document normalization
The authoring of documents in constrained domains and their translation into other languages is a very important activity in industrial settings. In some cases, the distinction between technical writers and technical translators has started to blur, so as to minimize the time and effort needed to obtain multilingual documents. The paradigm of translation for monolinguals introduced by Kay in 1973 (Kay, 1997)1 led the way to a new conception of the authoring task, which first materialized with systems involving human disambiguation (e.g. (Boitet, 1989; Somers et al., 1990)). A related paradigm emerged in the 90s (Hartley and Paris, 1997), whereby a technical author is responsible for providing the content of a document and a generation system produces multilingual versions of it. Updating documents is then done by updating the document content, and only some postediting may take place instead of full translation by a human translator.
Systems implementing this paradigm range
from template-based multilingual document
1 This is a re-edition of the original article.
Figure 1: Architecture of a MDA system
creation to systems presenting the user with the evolving text of the document (often called the feedback or control text) in her language, following from the WYSIWYM (What You See Is What You Meant) approach (Power and Scott, 1998).2 Anchors (or active zones) in the text of the evolving document allow the user to further specify its semantics by making choices presented to her in her language. The underlying content representation is then used to generate the text of the document in as many languages as the system supports. The MDA (Multilingual Document Authoring) system (Dymetman et al., 2000; Brun et al., 2000) follows the WYSIWYM approach, but puts a strong emphasis on the well-formedness of document semantic content. More particularly, document content can be specified in terms of communicative goals, allowing the selection of messages which are contrastive within the modelled class of documents in no more steps than are needed to identify a predefined communicative goal. Figure 1 illustrates the architecture of an MDA system. An MDA grammar specifies the possible content representations of a document in terms of trees of typed semantic objects, in a formalism inspired by Definite Clause Grammars (Pereira and Warren, 1980).
2 We have done a review of these systems in (Max, 2003a), in which we identified and compared five families of approaches.
Figure 2: Document normalization in a given
class of documents
Considering all the possibilities offered by having the semantic description of a document, for example its exploitation within the Semantic Web, it seemed very interesting to reuse resources developed for controlled document authoring to analyze existing documents. Also, a corpus study of drug leaflets that we conducted (Max, 2003a) showed that documents from the same class could contain a lot of variation, which can hamper the reader's understanding. We defined document normalization as the process of identifying the content representation produced by an existing document model that corresponds best to an input document, followed by the automatic generation of the text of the normalized document from that content representation. This is illustrated in figure 2.
In the next section, we briefly describe our paradigm for document content analysis, which exploits the MDA formalism in a reverse way. Candidate content representations expressed in the MDA formalism are first produced and ranked automatically, and a human expert then identifies the one that best accounts for the communicative content of the original document. The core of this paper is devoted to our implementation of interactive negotiation for document normalization. Finally, we discuss our results and propose ways of improving the system.
2 Document normalization system
An MDA grammar can enumerate the well-formed content representations for documents of a given class and associate textual realizations to them (Dymetman et al., 2000). Content representations are typed abstract semantic trees in which dependencies can be established through unification of variables. Generation of text is done in a compositional manner from the semantic representation. Figure 3 shows an excerpt of an MDA grammar describing well-formed commands for the Unix shell. Such a grammar describes both the abstract semantic syntax and the concrete syntax for a particular language, English in this case. The first rule reads as follows: lsCommand is a semantic object of type shellCommand_type, which is composed of an object of type fileSelection_type, an object of type sortCriteria_type, and an object of type displayOptions_type. Text strings appearing in the right-hand side of the rules are used together with the strings associated with the present semantic objects to compose the normalized text associated with the described abstract semantic trees.
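The compositional text generation described above can be sketched as follows. This is a minimal Python illustration, not the MDA implementation itself: each rule contributes its literal strings, interleaved with the realizations of the children filling its slots. The rule templates below are simplified stand-ins for the grammar of figure 3.

```python
# Each rule maps a functor to a template: a list of literal strings and
# child slots (integers indexing the node's children).  Illustrative only.
RULES = {
    "lsCommand": ["List ", 0, 1, 2],
    "allFiles": ["all files"],
    "sortByDate": [". Sort results by date of last modification"],
    "longFormat": [" in long format."],
}

def realize(tree):
    """Compositionally generate text for a (functor, children) semantic tree."""
    functor, children = tree
    out = []
    for item in RULES[functor]:
        if isinstance(item, str):
            out.append(item)                     # literal string from the rule
        else:
            out.append(realize(children[item]))  # recurse into the child slot
    return "".join(out)

tree = ("lsCommand", [("allFiles", []), ("sortByDate", []), ("longFormat", [])])
print(realize(tree))
# -> List all files. Sort results by date of last modification in long format.
```

The actual formalism additionally handles semantic dependencies through variable unification, which this sketch omits.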
Our approach to normalizing documents has been described in (Max, 2003b). A heuristic search procedure in the space of content representations defined by an MDA grammar is first performed. Its evaluation function measures a similarity score between the document to be normalized and the normalized documents that can be produced from a partial content representation. The similarity score is inspired by information retrieval, and takes into account common descriptors and their relative informativity in the class of documents. The admissibility property of the search procedure guarantees that the first complete content representation found is the one with the best global similarity score. This process uses text generation to measure some kind of similarity, and has been called fuzzy inverted generation.
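An IR-inspired score of this kind can be sketched as an informativity-weighted overlap of descriptor profiles. The functions below are an illustrative assumption, not the system's actual measure: relative informativity is estimated IDF-style over the profiles of the modelled class, and the score is the weight of the common descriptors normalized by the weight of their union.

```python
import math

def idf_weights(corpus_profiles):
    """Estimate the relative informativity of each descriptor in the class
    of documents, IDF-style, from a list of descriptor profiles."""
    n = len(corpus_profiles)
    df = {}
    for profile in corpus_profiles:
        for d in set(profile):
            df[d] = df.get(d, 0) + 1
    return {d: math.log(n / c) + 1.0 for d, c in df.items()}

def similarity(input_profile, candidate_profile, weights):
    """Weighted overlap of common descriptors, normalized by the union
    so that identical profiles score 1.0."""
    common = set(input_profile) & set(candidate_profile)
    union = set(input_profile) | set(candidate_profile)
    num = sum(weights.get(d, 1.0) for d in common)
    den = sum(weights.get(d, 1.0) for d in union)
    return num / den if den else 0.0
```

A descriptor shared with few documents of the class thus contributes more to the score than one present almost everywhere.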
In order to better cover the space of texts conveying the same communicative content, the MDA formalism has been extended to support non-deterministic generation, allowing the production of competing texts from the same content representation, as illustrated in figure 4. For each considered content representation, texts are produced and compared to the document to be normalized, thus allowing the ranking of candidate content representations
% semantic object of type ’shellCommand_type’ describing the ’ls’ command
lsCommand(FileSelection, SortCriteria, DisplayOptions)::shellCommand_type-e-[] -->
[’List ’],
FileSelection::fileSelection_type-e-[],
SortCriteria::sortCriteria_type-e-[],
DisplayOptions::displayOptions_type-e-[].
fileSelection(ListOfFilesAndDirectories, HiddenFilesSelection, DirectoriesContentsListing, LinksReferenceListing)::
fileSelection_type-e-[] -->
ListOfFilesAndDirectories::listOfFilesAndDirectories_type-e-[], [’.’],
HiddenFilesSelection::hiddenFilesSelection_type-e-[],
DirectoriesContentsListing::directoriesContentsListing_type-e-[],
LinksReferenceListing::linksReferencesListing_type-e-[].
% ...
% description for the type ’linksReferencesListing_type’
type_display(e, linksReferencesListing_type, ’specifies how links are shown’).
% description for the objects ’displayLinksReferences’ and ’dontDisplayLinksReferences’
functor_display(e, displayLinksReferences,’show the files and directories that are referenced by links’).
functor_display(e, dontDisplayLinksReferences,’show links as such (not the files and directories they point to)’).
displayLinksReferences::linksReferencesListing_type-e-[] -->
[’ Display referenced files and directories instead of links. ’].
dontDisplayLinksReferences::linksReferencesListing_type-e-[] -->
[’ Display links as such. ’].
Figure 3: MDA grammar extract for the description of the ls Unix command
Figure 4: Identification of the best content representation through fuzzy inverted generation
by decreasing the similarity score.
Given the limitations of the similarity measure inspired by information retrieval, the search is continued to find the N first documents with the best similarity scores. The identification of the content representation that best represents the communicative content of the original document is then done by interactive negotiation between an expert of the class of documents and the system, based on the candidates previously extracted.
To demonstrate how the implemented system works, we will consider the normalization of the following description in English of a command for the Unix shell with the grammar of figure 3: List all files. Do not show hidden files and visit subdirectories recursively. Sort results by date of last modification in long format in single-column in reverse chronological order. Give file size in bytes.
2.1 Finding candidate document representations: fuzzy inverted generation
The MDA grammar used is first precompiled offline by a separate tool, in order to associate profiles of text descriptors to semantic objects and types in the grammar (see (Max, 2003b) for details). In our current implementation, descriptors are WordNet synsets. The text of the input document is then lemmatized and the descriptors are extracted, yielding the profile of descriptors for the input document. The grammar is then used to construct partial abstract semantic trees, which are ordered in a list of candidates according to the similarity score computed between their profile and that of the input document. At each iteration, the search algorithm considers the most promising candidate content representation and performs one step of derivation on it, which corresponds to instantiating a variable in the tree with a value for its type. The first complete candidate (i.e. an abstract tree not containing any variable) found is then kept, and the search continues until a given number of candidates has been found. This number defines a value of system confidence, which can be selected by the user of our normalization system: the higher the confidence, the fewer candidates are kept, at the risk that the best one according to an expert may not be present. Given the size of the grammar used and the complexity of the analysed document, a small number of candidates can be kept (20 in our example).
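The best-first search loop just described can be sketched as follows. This Python sketch is illustrative, not the system's code: a partial tree is a nested structure where `("?", type)` marks an uninstantiated variable, the grammar maps each type to its admissible objects, and an admissible (optimistic) score function guarantees that complete trees are collected in best-first order.

```python
import heapq
import itertools

def first_variable(tree, path=()):
    """Path to the leftmost uninstantiated variable in the tree, or None."""
    if tree[0] == "?":
        return path
    for i, child in enumerate(tree[1]):
        p = first_variable(child, path + (i,))
        if p is not None:
            return p
    return None

def get_at(tree, path):
    for i in path:
        tree = tree[1][i]
    return tree

def set_at(tree, path, value):
    """Return a copy of the tree with the node at `path` replaced."""
    if not path:
        return value
    functor, children = tree
    new_children = list(children)
    new_children[path[0]] = set_at(children[path[0]], path[1:], value)
    return (functor, new_children)

def search(grammar, score, root_type, n):
    """Best-first search: one derivation step at a time on the most
    promising partial tree, until n complete candidates are found."""
    tie = itertools.count()                  # tie-breaker, avoids comparing trees
    heap = [(-score(("?", root_type)), next(tie), ("?", root_type))]
    found = []
    while heap and len(found) < n:
        neg, _, tree = heapq.heappop(heap)
        path = first_variable(tree)
        if path is None:                     # complete candidate: keep it
            found.append((-neg, tree))
            continue
        vtype = get_at(tree, path)[1]
        for functor, child_types in grammar[vtype]:
            new = set_at(tree, path, (functor, [("?", t) for t in child_types]))
            heapq.heappush(heap, (-score(new), next(tie), new))
    return found

# Toy grammar and an admissible score: matched target functors plus an
# optimistic bonus of 1 per remaining variable.
grammar = {
    "cmd": [("ls", ["opt"]), ("rm", ["opt"])],
    "opt": [("long", []), ("short", [])],
}
TARGET = {"ls", "long"}
def score(tree):
    if tree[0] == "?":
        return 1
    return (tree[0] in TARGET) + sum(score(c) for c in tree[1])

best = search(grammar, score, "cmd", 2)
```

In the real system the score is the profile similarity described above, and the derivation steps are licensed by the MDA grammar.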
This process restricts the search space from a large collection of virtual documents3 to a comparatively smaller number of concrete textual documents, associated with their semantic structure. A factorization process then builds a unique content representation that contains all the different alternative subtrees found in the candidates. Each semantic object in the resulting factorized semantic tree is then decorated with a list of all the candidates to which it belongs. Competing semantic objects are ranked according to the score of the highest-scoring candidate to which they belong. This compact representation makes it possible to consider underspecifications from the analysis of the input document present at any depth in the candidate semantic trees.
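The factorization step can be sketched as a recursive merge of the ranked candidates. Again this is an illustrative Python sketch under simplifying assumptions (competing objects of the same functor have the same arity, as in a typed grammar), not the actual implementation: each slot of the result holds the list of competing objects, each decorated with the candidates it belongs to and ranked by its best candidate's score.

```python
def factorize(members):
    """Merge candidate subtrees into one compact representation.
    `members` is a list of (candidate_id, score, tree) triples, where a
    tree is (functor, [children]).  Returns the list of competing objects
    for this slot; more than one entry means an underspecification."""
    groups = {}
    for cid, score, (functor, children) in members:
        groups.setdefault(functor, []).append((cid, score, children))
    alternatives = []
    for functor, grp in groups.items():
        ids = sorted(c for c, _, _ in grp)            # decorating candidates
        best = max(s for _, s, _ in grp)              # best supporting score
        arity = len(grp[0][2])
        kids = [factorize([(c, s, ch[k]) for c, s, ch in grp])
                for k in range(arity)]
        alternatives.append({"functor": functor, "candidates": ids,
                             "score": best, "children": kids})
    alternatives.sort(key=lambda a: -a["score"])      # rank competing objects
    return alternatives

# Two candidates agreeing on lsCommand-like root but competing on a child.
members = [(0, 0.9, ("ls", [("long", [])])),
           (1, 0.7, ("ls", [("short", [])]))]
fact = factorize(members)
```

Here the root object belongs to both candidates, while its child slot exposes an underspecification between two competing objects, ranked 0.9 before 0.7.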
2.2 Identifying the best document representation: interactive negotiation
Document normalization implies a normative view on the analysis of a document. Because the communicative content that will ultimately be retained may not be exactly that of the original document, some negotiation must take place to determine which alternative semantic content, if any, is acceptable. This is analogous to what happens in translation. As (Kay et al., 1994) put it:
Translation is not a meaning-
preserving function from a source
to a target text. Indeed, it is probably
3 We call virtual documents all the documents that can be produced by a given grammar.
Figure 5: Resolving underspecifications by interactive negotiation
not helpful to think of it as a function
at all, but rather as a matter of
compromise.
In our view, a human expert should be responsible for making difficult decisions that the machine cannot make without significant interpretation capabilities. Furthermore, these decisions encompass cases where no explicit content in the input document can be used to determine content that is expected in order to obtain a well-formed representation in the semantic model used.4 This will be illustrated below with the negotiation dialogue of figure 8.
A naive way to select the candidate content representation found by the system that best corresponds to the input document would be to show an expert all the normalized texts corresponding to the candidates. This would however be a tedious and error-prone task. The compact representation built at the end of fuzzy inverted generation allows the discrimination of candidates based on local underspecifications corresponding to competing semantic choices. We have implemented three methods for supporting interactive negotiation, which are described below. They allow an expert to resolve underspecifications and thereby update the factorized content representation by eliminating incorrect hypotheses. This is iterated until the factorized content representation does not contain any underspecification, as illustrated in figure 5.
Figure 6 shows the main interface of our normalization system after the automatic selection
4 This suggests that document normalization can be used as a corrective mechanism applied to ill-formed documents that may be incomplete or semantically incoherent relative to a given semantic model.
Figure 6: Interface of our document normalization system
of candidate content representations and the
construction of the compact representation.
Semantic view The middle panel on the right of the window contains the semantic view, which is a graphical view of the factorized abstract semantic tree that can be interpreted by the expert. It uses the text descriptions for semantic objects and types as described by the functor_display and type_display predicates present in the original MDA formalism (see figure 3). The tick symbol represents a semantic object that dominates a semantic subtree containing no underspecifications. In our example, this is the case for the object described as output type and detail level for display. The arrow symbol describes a semantic object that does not take part in an underspecification, but which dominates a subtree that contains at least one. The exclamation mark symbol denotes a semantic type that is underspecified, and for which at least two semantic objects are in competition. Semantic objects in competition are denoted by the interrogation mark symbol, and are ordered according to the highest score of the candidate representations to which they belong.
This view can be used by the expert to
Figure 7: Validation of a semantic choice within
the semantic view
navigate at any depth inside the compact representation. By clicking on a semantic object in competition, the expert can decide whether this object belongs to the solution or not. In the example of figure 7, the expert has selected the first possibility (subdirectories are recursively visited) for an underspecified type (specifies how subdirectories are visited), which is itself dominated by another underspecified type (specifies whether only directory names are shown or...). The menu that pops up allows the validation of the selected object: this has the effect of pruning from the factorized tree any subtree that does not belong to at least one of the candidates of the validated object. In the present case, not only will it prune the alternative subtree dominated by subdirectories are not recursively visited, but also the subtree dominated by only show directory names (not their content), present at a shallower level in the tree. Furthermore, subtrees that would be incompatible elsewhere in the compact representation because of failed parameter unification would disappear. Conversely, the invalidation operation prunes all the subtrees which have at least one candidate in common with the invalidated object. The expert can also ask for a negotiation dialogue, which will be introduced shortly.

Figure 8: Negotiation dialogue about how links should be shown
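The two pruning operations can be made concrete in a small standalone sketch. The representation here is deliberately reduced (every competing object simply carries the set of candidate ids it belongs to), and all names are hypothetical: validation keeps only the candidates the chosen object belongs to, invalidation removes every candidate sharing the rejected object, and pruning then drops every object no surviving candidate supports, at any depth.

```python
def validate(alive, obj_ids):
    """Keep only the candidates to which the validated object belongs."""
    return alive & obj_ids

def invalidate(alive, obj_ids):
    """Prune every candidate that contains the invalidated object."""
    return alive - obj_ids

def prune(alternatives, alive):
    """Drop every competing object no surviving candidate supports."""
    kept = []
    for alt in alternatives:
        ids = alt["candidates"] & alive
        if ids:
            kept.append({"functor": alt["functor"], "candidates": ids,
                         "children": [prune(c, alive) for c in alt["children"]]})
    return kept

# Example: three candidates; the expert validates an object of {0, 1}.
alive = validate({0, 1, 2}, {0, 1})
tree = [{"functor": "recursive", "candidates": {0, 1}, "children": []},
        {"functor": "nonRecursive", "candidates": {2}, "children": []}]
print([a["functor"] for a in prune(tree, alive)])   # -> ['recursive']
```

Note how the alternative supported only by candidate 2 disappears everywhere in the tree, which mirrors the non-local pruning described above.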
MDA view It seemed very natural to propose a view with which a user of an MDA system would already be familiar. Such a view shows the normalized text corresponding to all the objects from the root object that are not in competition. Underspecified semantic types appear as underlined text spans called active zones, which trigger a pop-up menu when clicked. Whereas in the MDA authoring mode all the possible objects for the semantic type that do not violate any semantic dependencies are shown, our system only proposes those that belong to candidates that are still in competition. Furthermore, these semantic objects are not ordered by their order of appearance in the grammar, but by the score of their most likely candidate according to our system. Selecting an object corresponds to validating it, implying that the invalidation operation is not accessible from this view. Also, underspecified semantic types dominated by other underspecified types cannot be resolved using this view, as they do not appear in the text.5 However, dealing with a text in natural language corresponding to the normalized document may be a more intuitive interface for some users, although it may require more operations.
Negotiation dialogues The key element in
this task is the minimization of the number of
5 We thought that showing these types using cascading menus would be too confusing for the user.
operations by the user. The two previous views allow the expert to choose some underspecification to resolve. The List of underspecifications panel on the left of the window in figure 6 contains an enumeration of all underspecifications found in the compact representation. They are ordered by decreasing score, where the score can indicate the average score of the objects in competition, or the inverse of the average number of candidates per object in competition. Therefore, the expert can choose to first resolve underspecifications that contain likely objects, or underspecifications that involve few candidates, so that the validation of an object will prune more candidates from the compact representation. Clicking on an underspecification in the list triggers a negotiation dialogue similar to that of figure 8. The semantic type in that dialogue, specifies how links are shown, is not supported by any evidence in the input document. The expert can however choose a value for it. When the underspecification is resolved, all the views are refreshed to reflect the new state of the compact representation, and a negotiation dialogue for the underspecification then ranked first in the list is shown. The expert can either discard it, or continue in dialogue mode, with the possibility of skipping the resolution of an underspecification.
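The two ordering strategies for the list of underspecifications can be sketched as follows. This is an illustrative Python sketch with hypothetical names: an underspecification is a list of competing objects, each given as a (score, candidate_ids) pair; one strategy prefers high average object scores, the other prefers few candidates per object so that one validation prunes more.

```python
def rank_underspecifications(underspecs, strategy="likely"):
    """Order underspecifications for presentation to the expert.
    'likely': decreasing average score of the competing objects.
    'cheap':  decreasing inverse of the average number of candidates
              per competing object (fewer candidates rank first)."""
    def key(u):
        if strategy == "likely":
            return sum(score for score, _ in u) / len(u)
        return len(u) / sum(len(ids) for _, ids in u)
    return sorted(underspecs, key=key, reverse=True)

# Two objects with high scores and few candidates each:
u1 = [(0.9, {0}), (0.8, {1})]
# Two objects with higher scores but many candidates each:
u2 = [(0.95, {0, 1, 2, 3}), (0.9, {4, 5, 6})]
print(rank_underspecifications([u1, u2], "likely")[0] is u2)   # True
print(rank_underspecifications([u1, u2], "cheap")[0] is u1)    # True
```

The two strategies can thus disagree, which is why the choice is left as a presentation option rather than fixed.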
3 Discussion and perspectives
We have presented an approach to normalizing documents in constrained domains and its implementation. Our approach combines the strictness of well-formed content representations with the flexibility of information retrieval techniques, and makes use of human expertise to resolve difficult interpretation problems in an attempt to build an operational system. Although our initial results are promising, our approach could be improved in several ways.
First of all, an important evaluation factor of our approach is how much effort is required from the human expert. We have only conducted informal experiments of evaluation by the task, which have revealed that normalization can be performed quite fast when the user has a good command of the different views available. Nevertheless, it seems crucial to be able to present the expert with at least some evidence from the text of the input document to support competing semantic objects. Moreover, the evidence extracted from the input document could be used as the basis for learning new formulations for particular communicative goals that would better match subsequent similar input. Although our system already supports non-deterministic generation, we have not yet implemented a mechanism that would allow supervised learning of new formulations. We expect this "normalization memory" functionality to have an important impact on the normalization of documents from the same origin, as it should improve the automatic selection of content representations.
In case the candidates returned by fuzzy inverted generation do not contain the content representation that best represents the input document, the user can choose to reanalyze the document. This will restart the search from the (N+1)th content representation. But because the expert might already have resolved some underspecifications, and thus identified subparts that should belong to the solution, this information should be taken into account while reanalyzing the document, which is not the case in the current implementation. If the solution has to be present in the list of candidates returned, it should be as close to the top of the list as possible, so that the first choices for each underspecification represent the actual best choices. To this end, we intend to implement a second-pass analysis that would rerank the candidates produced by fuzzy inverted generation by computing text similarities over short passages, such as those proposed in (Hatzivassiloglou et al., 1999). These techniques were much harder to implement during the search in the virtual space of documents produced by the document model, because partial content representations are not actual texts.
4 Acknowledgements
Many thanks to Marc Dymetman, who supervised my work at Xerox Research Centre Europe and who originally came up with the concept of fuzzy inverted generation. Many thanks also to Christian Boitet, my university PhD supervisor, and to Anne-Lise Bully, Cédric Leray and Abdelkhalek Rherad for their programming work on the interface of the presented system. This work was funded by a PhD grant from ANRT and XRCE.

References

Christian Boitet. 1989. Speech Synthesis and Dialogue Based Machine Translation. In Proceedings of the ATR Symposium on Basic Research for Telephone Interpretation, Kyoto, Japan.

Caroline Brun, Marc Dymetman, and Veronika Lux. 2000. Document Structure and Multilingual Authoring. In Proceedings of INLG 2000, Mitzpe Ramon, Israel.

Marc Dymetman, Veronika Lux, and Aarne Ranta. 2000. XML and Multilingual Document Authoring: Convergent Trends. In Proceedings of COLING 2000, Saarbrücken, Germany.

Anthony F. Hartley and Cécile L. Paris. 1997. Multilingual Document Production - From Support for Translating to Support for Authoring. Machine Translation, 12:109-128.

Vasileios Hatzivassiloglou, Judith L. Klavans, and Eleazar Eskin. 1999. Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning. In Proceedings of EMNLP/VLC-99, College Park, United States.

Martin Kay, Jean Mark Gawron, and Peter Norvig. 1994. Verbmobil - A Translation System for Face-to-Face Dialog. CSLI Lecture Notes.

Martin Kay. 1997. The Proper Place of Men and Machines in Language Translation. Machine Translation, 12:3-23.

Aurélien Max. 2003a. De la création de documents normalisés à la normalisation de documents en domaine contraint. PhD thesis, Université Joseph Fourier, Grenoble.

Aurélien Max. 2003b. Reversing Controlled Document Authoring to Normalize Documents. In Proceedings of the EACL-03 Student Research Workshop, Budapest, Hungary.

Fernando Pereira and David Warren. 1980. Definite Clauses for Language Analysis. Artificial Intelligence, 13.

Richard Power and Donia Scott. 1998. Multilingual Authoring using Feedback Texts. In Proceedings of COLING/ACL-98, Montréal, Canada.

H. Somers, J.-I. Tsujii, and D. Jones. 1990. Machine Translation without a Source Text. In Proceedings of COLING-90, Helsinki, Finland, volume 3, pages 217-276.
