<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2007">
  <Title>Using an incremental robust parser to automatically generate semantic UNL graphs</Title>
  <Section position="4" start_page="1" end_page="2" type="metho">
    <SectionTitle>
2 The Universal Networking Language (UNL)
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.1 The language
</SectionTitle>
      <Paragraph position="0"> UNL is an artificial language that describes semantic networks. Sentence information is represented by hypergraphs that have universal words (UWs) as nodes and relations as arcs. A hypergraph can also be represented as a set of directed binary relations between the UWs in the sentence. Linguistic information is encoded by means of the UWs, the relations that hold between them and the attributes associated with them.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.2 Universal Words
</SectionTitle>
      <Paragraph position="0"> Universal Words represent simple or compound concepts. They denote interlingual acceptions (word senses) for a given lemma.</Paragraph>
      <Paragraph position="1"> An entry in the dictionary of Universal Words contains, as illustrated in Figure 1, a head word (the French lemma "membre" in this example) followed by a list of morpho-syntactic constraints. The last part of the entry contains the UW itself: a character string (an English-language lemma) between double quotes, which usually contains a list of semantic constraints in brackets.</Paragraph>
      <Paragraph position="3"> [Figure 1: entries in the dictionary of Universal Words.]</Paragraph>
      <Paragraph position="4"> When present, the list of semantic constraints describes conceptual restrictions. For example, the first three entries in Figure 1 define three different acceptions while the last one provides only the lemma and is thus more general.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
2.3 Relations
</SectionTitle>
      <Paragraph position="0"> Binary relations are the building blocks for UNL expressions. They link together two UWs in a linguistic utterance and have labels that depend on the roles the UWs play in the sentence.</Paragraph>
      <Paragraph position="1"> A UNL relation is represented by a headword (the label of the semantic relation) followed by a bracketed expression containing the UWs. The UWs are separated by a comma and decorated with different kinds of linguistic information.</Paragraph>
      <Paragraph position="2"> Figure 2 shows the UNL enconversion for the following French sentence: "Lors de la 29e session de la Conférence générale de l'Unesco, les 186 Etats membres ont ratifié à l'unanimité ce projet."</Paragraph>
      <Paragraph position="4"> The UNL expressions in Figure 2 encode relations such as agt (agent), qua (quantifier), mod (modifier), tim (instant time), man (manner) and obj (object).</Paragraph>
      <Paragraph position="5"> As can be seen in the figure, the information given by a UNL relation may be semantically very precise: for example, the notion of "time" is covered by six labels, corresponding to an instant time (tim), an initial time (tmf), a final time (tmt), a period (dur), a sequence (seq) or a simultaneous action (coo).</Paragraph>
      <Paragraph position="6"> English gloss of the sentence in Figure 2: In their 29th General Conference, the 186 member states of the Unesco ratified their unanimous support of the project.</Paragraph>
      <Paragraph position="7"> The two UWs in a relation carry different kinds of attributes: morphological information (def, pl, etc.), information about tense (past), etc.</Paragraph>
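      <Paragraph position="8"> The textual form of a relation described above can be sketched in a few lines of Python. This is a minimal illustration, not the notation of Figure 2: the class names, the dot-separated attribute syntax and the example UWs (ratify, state) are assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UW:
    """A universal word with its attributes (illustrative structure)."""
    headword: str
    attributes: tuple = ()

    def render(self) -> str:
        # Attributes are written to the right of the UW, dot-separated.
        return ".".join((self.headword,) + self.attributes)


@dataclass(frozen=True)
class Relation:
    """A directed binary relation of the form label(UW1, UW2)."""
    label: str
    source: UW
    target: UW

    def render(self) -> str:
        return f"{self.label}({self.source.render()},{self.target.render()})"


# Hypothetical UWs, loosely inspired by the example sentence.
ratify = UW("ratify", ("@entry", "@past"))
state = UW("state", ("@def", "@pl"))
print(Relation("agt", ratify, state).render())
```

Keeping the UW and the relation as separate immutable values means the same UW can appear in several relations, which is what a set of binary relations over shared nodes requires.</Paragraph>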
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.4 Representation of UNL graphs
</SectionTitle>
      <Paragraph position="0"> The list of UNL relations for a linguistic utterance is represented by a UNL hypergraph (a graph in which a node is either simple or recursively contains a hypergraph). The arcs bear semantic relation labels and the nodes are UWs with their attributes, as shown in Figure 3.</Paragraph>
      <Paragraph position="1">  UNL hypergraphs must contain one special node, called the entry of the graph (usually the finite verb). This information is encoded with the label entry in the list of UNL relations representing the corresponding hypergraph.</Paragraph>
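      <Paragraph position="2"> As a rough sketch of this representation, the Python fragment below turns a list of binary relations into labelled arcs and looks up the entry node. It assumes that the entry is marked by an @entry attribute on a UW, it handles only simple nodes (no nested hypergraphs), and the relations listed are illustrative rather than copied from Figure 3.

```python
from collections import defaultdict

# Each relation is (label, source, target); a node is (headword, attributes).
relations = [
    ("agt", ("ratify", ("@entry", "@past")), ("state", ("@def", "@pl"))),
    ("obj", ("ratify", ("@entry", "@past")), ("project", ())),
    ("qua", ("state", ("@def", "@pl")), ("186", ())),
]


def build_graph(rels):
    """Return the labelled arcs of the graph, keyed by source headword."""
    arcs = defaultdict(list)
    for label, (src, _), (tgt, _) in rels:
        arcs[src].append((label, tgt))
    return dict(arcs)


def find_entry(rels):
    """Return the single headword marked @entry, or raise ValueError."""
    entries = {
        head
        for _, src, tgt in rels
        for head, attrs in (src, tgt)
        if "@entry" in attrs
    }
    if len(entries) != 1:
        raise ValueError(f"expected exactly one entry node, found {len(entries)}")
    return next(iter(entries))


graph = build_graph(relations)
entry = find_entry(relations)
```

Checking that exactly one node carries the entry mark is a cheap validity test for a generated graph.</Paragraph>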
    </Section>
  </Section>
  <Section position="5" start_page="2" end_page="4" type="metho">
    <SectionTitle>
3 An incremental robust parser
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 Overview of the parser
</SectionTitle>
      <Paragraph position="0"> XIP (Ait-Mokhtar et al., 2002; Hagege and Roux, 2002) is a rule-based platform for building robust incremental parsers. It is developed at the Xerox Research Centre Europe (XRCE) and shares the same computational paradigm as the PNLPL approach (Jensen, 1992) and the FDGP approach (Tapanainen and Jarvinen, 1997).</Paragraph>
      <Paragraph position="1"> At present, various grammars for XIP have been built for English and French. The different phases of linguistic processing are organized incrementally: syntactic analysis is done by first chunking (Abney, 1991) a morphosyntactically annotated input text and then extracting functional dependencies (links between the words). The aim of the system is to produce a list of syntactic dependencies which may later be used in applications such as information retrieval, semantic disambiguation, coreference resolution, etc.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Incremental approach
</SectionTitle>
      <Paragraph position="0"> A XIP parser, like the French parser (which we will call XIPF hereafter), is composed of different modules that incrementally transform and process the linguistic information given as input. XIPF contains three main modules: one for morphological disambiguation (disambiguation of POS tags depending on contextual information), one for chunking (marking structural groups) and one for dependency calculus (identifying links between words).</Paragraph>
      <Paragraph position="1"> Each module may have a number of grammars, which are applied one after the other depending on the linguistic complexity of the phenomena present. For example, for French, the identification of verbal phrases comes after the identification of nominal phrases. The different rules in the grammars also apply incrementally. They are organized in levels so that they apply sequentially and enrich the linguistic analysis step by step. This strategy favors linguistic precision over recall.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Data representation
</SectionTitle>
      <Paragraph position="0"> Within the XIP formalism, information is represented by means of syntactic trees with terminal nodes or sequences of constituent nodes (such as nominal phrases (NPs), finite verbal phrases (FVs), etc.). The maximal node for each tree (sentence) is a virtual node called GROUPE.</Paragraph>
      <Paragraph position="1"> All nodes, lexical (membre) or not (NP), have a list of associated features of several kinds: typographical (capital letter [maj:+]), lexical (proper noun [proper:+]), morphological (number [plu:+]), syntactic (subcategorization with the preposition "à" [sfa:+]) or semantic (time [tim:+]).</Paragraph>
      <Paragraph position="2"> Since the complete linguistic information of a node is always present, even if it is not displayed in the output, it is simple to manipulate at any time during the analysis. The possibility of taking into account different kinds of features at any step of the analysis is therefore a considerable advantage when building a semantic application (the enconversion into UNL expressions). Indeed, semantic information can be enriched by adding new features when necessary (for example, a feature title has been added for use in titles).</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.4 XIPF output
</SectionTitle>
      <Paragraph position="0"> The final result of the parser (a list of syntactic dependencies) is obtained from the linguistic processing done by the different modules. Figure 4 shows the XIPF analysis for the French sentence given as an example in Section 2.3.</Paragraph>
      <Paragraph position="1"> For this sentence, the parser extracts relations such as subject (SUBJ), verbal subcategorization (VARG), verbal and nominal modification (VMOD, NMOD and NN), determination (DETERM) and verbal auxiliary (AUXIL). The head of the dependency appears as the first element except in the case of a determination relation. Relations usually have a list of morpho-syntactic features associated with them: the POS tag of the word linked to the head (NOUN in a SUBJ relation, ADJ in an NMOD, etc.), morphological details (NUM, DEM) or syntactic features (the position of the adjective, POSIT1, RIGHT, etc.).</Paragraph>
      <Paragraph position="2"> The process of dependency extraction is deterministic: the most plausible relation according to the system is extracted. The only exception is prepositional attachment (VMOD and NMOD): the linguistic information available to the parser is not enough to resolve structural ambiguities. In this case, all possible relations appear in the result.</Paragraph>
    </Section>
    <Section position="5" start_page="2" end_page="4" type="sub_section">
      <SectionTitle>
3.5 Parser evaluation
</SectionTitle>
      <Paragraph position="0"> Parsers built with the XIP engine (XIPF) are able to process about 2,000 words/s with a memory footprint of 10 MB (only grammars, without</Paragraph>
      <Paragraph position="2"> As for linguistic performance, an evaluation of XIPF subject and object (VARG) dependencies, conducted on French newspapers (Ait-Mokhtar et al., 2001), showed the following precision (P) and recall (R) rates:</Paragraph>
      <Paragraph position="4"> Another evaluation, carried out with XIPF+ (Gala, 2003), a second French parser containing more specialized grammars to handle complex phenomena such as punctuation, lists, titles, etc., on varied raw corpora of different types and domains, gives P = 94% for subjects, even in sentences that are or contain lists, enumerations, etc., and P = 93% and R = 89.6% for key words in titles (CLE relation).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="4" end_page="6" type="metho">
    <SectionTitle>
4 A French UNL enconverter
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Overview
</SectionTitle>
      <Paragraph position="0"> The principal motivation for creating a French enconverter is to easily obtain large amounts of UNL-enconverted corpora which can subsequently be used in other applications (for example, multilingual information retrieval). To achieve this objective, one of the main requirements was the reusability of existing robust linguistic resources.</Paragraph>
      <Paragraph position="1"> The choice of a XIP parser was motivated by several reasons. First, its robustness makes it possible to deal with large amounts of text (a result is always produced, whatever the complexity of the input). Second, its modular architecture facilitates the combination of different resources (it is easy to enrich the parser with new lexicons and grammars and to deactivate a particular module when necessary). Finally, the flexibility of the formalism makes it possible to enrich the rules and the features without harm. We have preferred XIPF+ over the standard XIPF because of its broader linguistic coverage.</Paragraph>
      <Paragraph position="2"> The French UNL enconverter is thus a processor that automatically transforms annotations provided by the XIPF+ parser into UNL expressions.</Paragraph>
      <Paragraph position="3"> Footnote (processing speed, Section 3.5): Obtained with a Pentium III 1 GHz.</Paragraph>
      <Paragraph position="4"> Footnote (XIPF+ evaluation corpora, Section 3.5): About 108,000 words extracted from the Web (end of 2000), covering general newspaper text (Le Monde) as well as specialized domains such as economics (the journal Les Echos), science (medicine, physics), law (draft legislation), etc.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.2 Remarks on terminology
</SectionTitle>
      <Paragraph position="0"> To avoid ambiguity, we use the term "dependency" to indicate XIPF+ syntactic links of the form D(x,y) or D(x,y,z), as shown in Figure 4, and the term "feature" to indicate linguistic information provided by the parser. XIPF+ provides twelve types of dependencies and more than two hundred and fifty features, of the types described in Section 3.3 (typographical, morphological, etc.).</Paragraph>
      <Paragraph position="1"> As for UNL, we use the term "relation" to denote a semantic link of the form label(UW1.attributes,UW2.attributes), as shown in Figure 2, while an "attribute" corresponds to a UNL annotation. Such an annotation appears to the right of a UW and adds particular linguistic information. The UNL formalism provides about forty relations and eighty attributes of different types.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="6" type="sub_section">
      <SectionTitle>
4.3 Generation of UNL expressions
</SectionTitle>
      <Paragraph position="0"> The first step of the enconversion consists in identifying the information provided by XIPF+ that will be translated into UNL relations.</Paragraph>
      <Paragraph position="1"> There are three kinds of mapping rules performing this task, depending on the input and the result of the transformation: a dependency giving an attribute, a dependency giving a relation, a feature giving a relation.</Paragraph>
      <Paragraph position="2"> 4.3.1 Dependency to attribute. The first kind of mapping rule transforms a XIPF+ dependency into a UNL attribute. An example is the dependency CLE (the head of a title). Within UNL it becomes @title and is included as an attribute of the UNL relation containing the head word of the title.</Paragraph>
      <Paragraph position="3"> The following example describes a title, its analysis with XIPF+ and its UNL enconversion: Le Forum Universel des Cultures. 4.3.2 Dependency to relation. The second kind of mapping rule transforms a XIPF+ dependency into a UNL relation. In some cases this transformation is not straightforward, since a number of lexical and semantic features have to be taken into account (and the parser does not always provide them). This is the case for dependencies involving the verb to be and, more generally, all verbs denoting a state. In the UNL formalism the verb to be is considered a copula and does not appear in the semantic representation, whereas the parser produces the syntactic dependencies in which the verb participates and marks it as a copula by means of features ([copula] as a lexical feature and SPRED, predicative, as a syntactic feature), as illustrated in the example. In this case, an aoj relation links the noun in the subject to the adjective. The parser feature that identifies a copula is thus crucial for mapping a SUBJ and a VARG precisely, either into an agt and an obj or into a single aoj.</Paragraph>
      <Paragraph position="4"> Table 1 summarizes the principal transformations performed by this second kind of mapping rule (as shown, in the case of modification, two types of XIPF+ dependencies produce a UNL mod):</Paragraph>
      <Paragraph position="6"> [Table 1: principal transformations of XIPF+ dependencies into UNL relations.]</Paragraph>
      <Paragraph position="7"> 4.3.3 Feature to relation. The last type of mapping rule identifies particular information encoded as features within the parser's output and transforms it into UNL relations with the appropriate words. This is the case for the notions of quantification and time.</Paragraph>
      <Paragraph position="8"> Regarding quantification, this feature, encoded within the dependency DETERM, is transformed to produce a qua UNL relation between a determiner and a noun.</Paragraph>
      <Paragraph position="9"> As for the relations involving the notion of time, the feature time encoded by XIPF+ is too general. Therefore, it is not possible to produce the semantically precise UNL relations expressing variations of the concept of time (duration, final time, sequence, etc.). In this case, we have chosen to create an intermediate UNL relation named time in order to keep this semantic information.</Paragraph>
      <Paragraph position="11"/>
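      <Paragraph position="12"> The three kinds of mapping rules can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the Dep record, the feature name quant, the placement of features on the dependency and the argument order of the output triples are all hypothetical, and the actual XIPF+ rule formalism is not shown here.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    """A parser dependency D(first, second) with its features (hypothetical layout)."""
    name: str
    first: str
    second: str = ""
    features: frozenset = frozenset()


def enconvert(deps):
    """Map dependencies to (relations, attributes) using the three rule kinds."""
    relations, attributes = [], {}
    copular_verbs = {d.first for d in deps if "copula" in d.features}
    subjects = {d.first: d.second for d in deps if d.name == "SUBJ"}
    for d in deps:
        if d.name == "CLE":
            # Dependency to attribute: the head of a title receives @title.
            attributes.setdefault(d.first, []).append("@title")
        elif d.name == "SUBJ" and d.first not in copular_verbs:
            # Dependency to relation: subject of a non-copular verb.
            relations.append(("agt", d.first, d.second))
        elif d.name == "VARG" and d.first in copular_verbs:
            # Copula: link predicate and subject directly, dropping the verb.
            if d.first in subjects:
                relations.append(("aoj", d.second, subjects[d.first]))
        elif d.name == "VARG":
            relations.append(("obj", d.first, d.second))
        elif d.name == "DETERM" and "quant" in d.features:
            # Feature to relation: DETERM lists the determiner first.
            relations.append(("qua", d.second, d.first))
    return relations, attributes


deps = [
    Dep("SUBJ", "ratifier", "Etat"),
    Dep("VARG", "ratifier", "projet"),
    Dep("DETERM", "186", "Etat", frozenset({"quant"})),
    Dep("CLE", "forum"),
]
relations, attributes = enconvert(deps)
```

The copula test is done once, before the loop, because the decision between agt/obj and aoj depends on a feature of the verb that both the SUBJ and the VARG dependencies share.</Paragraph>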
    </Section>
    <Section position="4" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.4 Accessing the UW base
</SectionTitle>
      <Paragraph position="0"> After identifying the UNL relations, the enconverter retrieves the UWs corresponding to each French word in a relation. UWs are stored in a UW database of 37,901 French lemmas.</Paragraph>
      <Paragraph position="1"> The major difficulty here concerns ambiguity, that is, accessing the right acception, since the database usually contains a list of UWs for a given lemma. The ambiguity can be semantic, when a French lemma corresponds to a single English lemma with different acceptions (cf. Figure 1), or lexical, when a French lemma corresponds to several English lemmas. Here is an example of lexical ambiguity with the pronoun il ("he" or "it" in English):</Paragraph>
      <Paragraph position="3"> To date, as the lexico-semantic information provided by the parser is not enough to choose the appropriate UW, the enconverter takes the most general acception (that is, the word sense without a constraint list: the last entry in the list shown in Figure 1). When all acceptions of an entry have such a list of constraints, the enconverter chooses the first one.</Paragraph>
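      <Paragraph position="4"> The selection heuristic just described is simple enough to sketch in Python. The UW strings below are invented for illustration (they are not the entries of Figure 1), and the sketch assumes that a constraint list is recognizable by an opening parenthesis in the UW string.

```python
def choose_uw(candidates):
    """Pick a UW for a lemma from its ordered list of candidates."""
    if not candidates:
        raise LookupError("no UW candidates for this lemma")
    for uw in candidates:
        # A UW without a constraint list is the most general acception.
        if "(" not in uw:
            return uw
    # Every candidate is constrained: fall back to the first one.
    return candidates[0]


# Hypothetical extract of a UW base, keyed by French lemma.
uw_base = {
    "membre": ["member(icl>person)", "member(icl>part)", "member"],
    "il": ["he(icl>person)", "it(icl>thing)"],
}

chosen_membre = choose_uw(uw_base["membre"])
chosen_il = choose_uw(uw_base["il"])
```

With these invented entries, the lemma membre resolves to the unconstrained UW, while il falls back to its first candidate because both candidates carry constraints.</Paragraph>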
    </Section>
    <Section position="5" start_page="6" end_page="6" type="sub_section">
      <SectionTitle>
4.5 Enrichment with lexical information
</SectionTitle>
      <Paragraph position="0"> The final step of the enconversion enriches the rough UNL expressions produced (UNL labels with simplified UWs) with more complete morphological information. A set of rules is thus specialized in translating different linguistic features from the parser into UNL descriptors completing the words in a relation.</Paragraph>
      <Paragraph position="1"> Some of this morphological information can also be extracted from the UW base (gender).</Paragraph>
      <Paragraph position="2"> However, we have preferred to extract a maximum of information from the parser because it produces a contextual analysis of the words appearing in a linguistic utterance.</Paragraph>
      <Paragraph position="3"> The features which enrich the UNL output concern definiteness (@def or @indef), number (@sg or @pl) and tense (@past, @present, @fut). A few labels (@ordinal, @complete, etc.) are absent from the XIPF+ output and are therefore not automatically enconverted in the UNL output. Finally, the attribute @entry is systematically added to the UW that heads its sentence (the verb) in relations such as agt, varg, aoj, etc.</Paragraph>
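      <Paragraph position="4"> A minimal Python sketch of this enrichment step is given below. The UNL attributes on the right-hand side are those named above; the parser feature names and values on the left (def, plu, tense) are assumptions made for the example and may not match the actual XIPF+ feature set.

```python
# (parser feature, value, UNL attribute); the feature names are hypothetical.
RULES = [
    ("def", "+", "@def"),
    ("def", "-", "@indef"),
    ("plu", "+", "@pl"),
    ("plu", "-", "@sg"),
    ("tense", "past", "@past"),
    ("tense", "pres", "@present"),
    ("tense", "fut", "@fut"),
]


def enrich(uw, features, is_entry=False):
    """Append the UNL attributes licensed by the parser features to a UW."""
    attrs = [attr for name, value, attr in RULES if features.get(name) == value]
    if is_entry:
        # The head of the sentence is always marked as the entry.
        attrs.append("@entry")
    return ".".join([uw] + attrs)


noun = enrich("state", {"def": "+", "plu": "+"})
verb = enrich("ratify", {"tense": "past"}, is_entry=True)
```

Keeping the correspondences in a table rather than in code makes it easy to add a rule when the parser starts providing a feature that is currently missing.</Paragraph>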
    </Section>
  </Section>
</Paper>