<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2702">
  <Title>Syntax to Semantics Transformation: Application to Treebanking</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 UAM Spanish Treebank and SESCO
</SectionTitle>
    <Paragraph position="0"> In order to understand the SST, it is interesting to consider the characteristics of the corpora we have used.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 UAM Spanish Treebank (source corpus).
</SectionTitle>
      <Paragraph position="0"> The UAM Spanish Treebank of the Universidad Autonoma de Madrid is a syntactically annotated corpus made up of 1600 sentences taken from Spanish newspapers (Moreno et al., 1999; Moreno et al., 2003). Since these sentences (particularly the first 500) were chosen as a sample of the complexity of Spanish syntax, they cover an important range of syntactic structures. Because the sample was taken selectively from different sections of the sources, reflecting different styles, it is considerably more complex than a random sample would be.</Paragraph>
      <Paragraph position="1"> The format was based on the Penn Treebank, although the tag set has been adapted to the characteristics of the Spanish language. The corpus has recently been converted to an XML format, which has helped us a lot in our work.</Paragraph>
      <Paragraph position="2"> The Treebank has four different types of information:

1. Part-of-Speech (noun, verb, etc.)
2. Syntactic functions (SUBJ, DO, ATTR, etc.)
3. Morpho-syntactic features (gender, number, person, etc.)
4. Semantic features. The UAM Spanish Treebank has a group of tags called "semantic features" which specify types of prepositional phrases (locative, time, etc.).

The aim of this annotation was to reflect the surface syntax. The designers were thus very cautious with regard to empty categories and ambiguities: they used the features only in those cases with the highest certainty. Additionally, the designers avoided redundancy as much as possible.</Paragraph>
      <Paragraph position="3"> The Treebank tag set has a flexible design allowing the addition of new features. However, as more features are added, annotation becomes more difficult, since the human tagger has to choose the suitable tag from among those available.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 SESCO (target corpus).
</SectionTitle>
      <Paragraph position="0"> SESCO is a tagging system which allows the semantic representation of a linguistic corpus (Alcantara, 2003).</Paragraph>
      <Paragraph position="1"> It is coded using an XML markup and offers a practical basis for tagging both spoken and written corpora.</Paragraph>
      <Paragraph position="2"> The main goal of SESCO is to provide an essential and flexible analysis that extracts the largest possible amount of data from a corpus without committing to an excessively restrictive theory, taking the argument structure of verbs as its starting point.</Paragraph>
      <Paragraph position="3"> We back J.C. Moreno's proposal (J.C. Moreno 1991a, 1991b, 1997) on event analysis, although we have also considered other very similar approaches (Pustejovsky, 1995; Tenny and Pustejovsky, 2000).</Paragraph>
      <Paragraph position="4"> The events expressed by verbs can be of three major types, forming a universal hierarchy (J.C. Moreno, 1997): states, processes and actions. These three types are divided into subtypes according to the arguments they require.</Paragraph>
      <Paragraph position="5"> This approach is compositional: a state has two arguments, a process is made up of a transition from one state to another, and an action is a process with an agent. This leads to the logical consequence that we need an annotation format for representing both the relation between events and the arguments of the sentence and its sub-event structure.</Paragraph>
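The compositional hierarchy just described (a state with two arguments, a process as a transition between states, an action as a process plus an agent) can be sketched as follows. This is a minimal illustration in Python; the class and field names are ours, not SESCO's actual schema.

```python
# A minimal illustration of the event hierarchy; class and field names
# are ours, not SESCO's actual schema.
from dataclasses import dataclass

@dataclass
class State:          # a state has two arguments
    first_arg: str
    second_arg: str

@dataclass
class Process:        # a process is a transition from one state to another
    before: State
    after: State

@dataclass
class Action:         # an action is a process with an agent
    agent: str
    process: Process

# "Se ha escapado de casa": an action whose process is a displacement
# from "at home" to an unspecified location; "X1" and "X2" stand for
# the variables the program uses when no phrase fills an argument.
escape = Action(
    agent="X1",
    process=Process(before=State("X1", "casa"), after=State("X1", "X2")),
)
print(escape.process.before.second_arg)  # casa
```

This nesting is what makes an annotation format for sub-event structure necessary: the same entity may fill roles at several levels at once.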
      <Paragraph position="6"> Most of the recent work on semantics focuses on ontologies. It is important to note that SESCO does not take an ontology as its basis; rather, an ontology can be a result of our work.</Paragraph>
      <Paragraph position="7"> SESCO has been developed taking as its point of reference the spoken corpus from the Computational Linguistics Laboratory of the Universidad Autonoma de Madrid (http://www.lllf.uam.es/), which, in turn, forms part of the European project "C-ORAL-ROM" (http://lablita.dit.unifi.it/coralrom/). Texts have been recorded following requirements of spontaneity, sound quality, and variety of speakers and contexts.</Paragraph>
      <Paragraph position="8"> At the beginning of our experiment, 49,500 spontaneous spoken words (4,100 sentences) had been analyzed in SESCO format. These sentences are our training corpus and the basis of our SESCO Data Base (SDB) of event structures.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Main differences.
</SectionTitle>
      <Paragraph position="0"> Besides the linguistic background, there are three main differences between the syntactically annotated UAM Treebank and SESCO: First, whereas the Treebank is a corpus of written texts, SESCO contains only orthographic transcriptions of spontaneous speech. As we expected, the vocabularies are not the same, and the upshot of this was an increase in the number of unknown lemmas. Even so, both corpora are designed to cover a wide range of topics and registers.</Paragraph>
      <Paragraph position="1"> Second, the UAM Treebank tagset is far more complex than that of SESCO. In this respect, the SST process is a reduction and it does not use all the features included in the Treebank. Syntactic functions and some semantic features are the only information that SST makes use of.</Paragraph>
      <Paragraph position="2"> Finally, SST raises fundamental questions about the concept of 'sentence'. In the Treebank, the key is the orthography: the limits of a sentence are always established by full stops. In SESCO, a sentence is a complete event. Because of this, the 1600 sentences of the UAM Spanish Treebank corpus produce 1666 sentences in the SESCO version. In spite of this, orthographic punctuation has been helpful in the task of recognizing the beginning of most of the sentences.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Methodology
</SectionTitle>
    <Paragraph position="0"> The input is a syntactically annotated sentence and the output is the same sentence semantically tagged. Both annotations are in XML, and the transformation involves five main stages. The first three stages are automatic, implemented in Perl. The fourth (optional) stage is semi-automatic, and the last one is a human revision.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Getting the event type.
</SectionTitle>
      <Paragraph position="0"> As pointed out earlier, our semantic tagging reflects argument structures related to verbs. Due to this theoretical framework, the first step is to find the lemma of the main verb. It is an easy task since the treebank format provides this information through a particular attribute ("lemma") in the element "verb". Once the lemma is found, the program searches the SDB for the most frequent event type for this lemma.</Paragraph>
      <Paragraph position="1"> This selection is made taking the syntactic structure into account: for example, if the event is a process and there is a locative complement, the most frequently used displacement event will be chosen.</Paragraph>
      <Paragraph position="2"> The SDB data come from the previous analysis (for more details about the SESCO corpus, see section 2.2.).</Paragraph>
      <Paragraph position="3"> That is, this stage is based on a probabilistic model and the automatic mapping is example-based, finding similar examples already in the training corpus.</Paragraph>
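Under the assumption of a toy SDB keyed by lemma (the real SDB stores full SESCO event structures for the 4,100 analysed sentences), this first stage reduces to a frequency lookup. The element name "V" and attribute name "Lemma" follow the treebank tag shown in section 5; everything else below is illustrative.

```python
# Stage one, sketched. "V" and "Lemma" follow the treebank tag for
# "tiene"; the toy SDB below is hypothetical.
import xml.etree.ElementTree as ET
from collections import Counter

# Build the verb element programmatically, as it appears in the treebank.
verb = ET.Element("V", {"Lemma": "tener", "Tensed": "Yes"})
verb.text = "tiene"

def most_frequent_event_type(lemma, sdb):
    """Return the most frequent event type recorded for this lemma."""
    examples = sdb.get(lemma)
    if not examples:
        return None  # unknown lemma: handled interactively (section 3.4)
    return Counter(examples).most_common(1)[0][0]

sdb = {"tener": ["attributive_state", "attributive_state", "action"]}
print(most_frequent_event_type(verb.get("Lemma"), sdb))  # attributive_state
```

This is the sense in which the mapping is probabilistic and example-based: the decision is simply a majority vote over similar examples already in the training corpus.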
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 From a syntactic structure to a semantic analysis.
</SectionTitle>
      <Paragraph position="1"> In order to understand this second step, first of all it is necessary to remark on some characteristics of the UAM Spanish Treebank. When the UAM Treebank was designed in 1997 (Moreno et al., 2003), the aim was only to build a syntactically annotated corpus following the Penn Treebank style - no consideration was given to the possibility of its translation into a semantic corpus. Therefore the Treebank included only those features needed for achieving a correct syntactic analysis. As mentioned above, the UAM Spanish Treebank uses the standard Penn Treebank scheme with the addition of some features. It provides a combination of Part of Speech information with specific grammatical features of words and phrases.</Paragraph>
      <Paragraph position="2"> In SST, this syntactic data is transformed into an event analysis through the application of a set of rules. Each rule corresponds to the most frequent correlation between a syntactic phrase and a part of the event structure. Some of the rules are general, but others depend on the lemma. In the current version, lemmas are classified into six different groups:  1. Standard-Type. These rules apply to most lemmas. By way of illustration, when the event is an action, they transform the subject (SUBJ) of the sentence into the agent and the direct object (DO) into the patient. If the event type is a state, the SUBJ will be the first argument of the state and the attribute will be the second argument. There is a subset of rules for passive sentences.</Paragraph>
      <Paragraph position="3"> 2. First-Type-Actions. The rules transform the indirect object (IO) into the patient. For instance, "pegar" (to hit).</Paragraph>
      <Paragraph position="4"> 3. Second-Type-Actions. The IO is transformed into the first argument of the states. For instance, "devolver" (to give back).</Paragraph>
      <Paragraph position="5"> 4. Third-Type-Actions. The DO is transformed into the second argument of the states. For instance, "otorgar" (to grant).</Paragraph>
      <Paragraph position="6"> 5. First-Type-States. The IO is transformed into the second argument of the states. For instance, "gustar" (to like).</Paragraph>
      <Paragraph position="7"> 6. Second-Type-States. The second argument of the state is a prepositional phrase. For instance, "coincidir con" (to coincide with).

3.3 References and variables.</Paragraph>
      <Paragraph position="8"> Lemmas of complex events (specifically actions) are additionally classified depending on their references. References are used in SESCO to link the arguments of an event with their functions in the arguments of sub-events. As we have seen in section 2.2, SESCO is based on a compositional semantic theory where actions and processes are made up of sub-events. In the case of actions, these references are determined by five different types of lemmas.</Paragraph>
      <Paragraph position="9"> Those parts of the event structure which have no correspondence with a phrase (for instance, the agent in a sentence without explicit SUBJ) are filled with variables by the program.</Paragraph>
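The six rule groups of section 3.2 and the variable filling just described might be sketched as a table of function-to-role mappings. The group names, role labels, and both helper functions below are illustrative rather than SESCO's actual tags.

```python
# Group names, role labels and helpers here are illustrative, not
# SESCO's actual tags. Each group maps syntactic functions to roles.
RULES = {
    "standard-action": {"SUBJ": "agent", "DO": "patient"},
    "standard-state": {"SUBJ": "state_arg1", "ATTR": "state_arg2"},
    "first-type-action": {"SUBJ": "agent", "IO": "patient"},         # "pegar"
    "second-type-action": {"SUBJ": "agent", "IO": "state_arg1"},     # "devolver"
    "third-type-action": {"SUBJ": "agent", "DO": "state_arg2"},      # "otorgar"
    "first-type-state": {"SUBJ": "state_arg1", "IO": "state_arg2"},  # "gustar"
    "second-type-state": {"SUBJ": "state_arg1", "PP": "state_arg2"}, # "coincidir con"
}

def map_phrases(group, phrases):
    """Map a dict of syntactic function to phrase onto event roles."""
    rules = RULES[group]
    return {rules[f]: text for f, text in phrases.items() if f in rules}

def fill_variables(event, needed=("agent", "patient")):
    """Roles with no corresponding phrase are filled with variables."""
    for i, role in enumerate(needed):
        event.setdefault(role, "X" + str(i + 1))
    return event

roles = map_phrases("first-type-action", {"SUBJ": "Juan", "IO": "a Pedro"})
print(roles)                      # {'agent': 'Juan', 'patient': 'a Pedro'}
print(fill_variables({"patient": "a Pedro"}))
```

A sentence without an explicit SUBJ, for example, yields an event whose agent slot is filled by a variable rather than left empty.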
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Unknown lemmas.
</SectionTitle>
      <Paragraph position="0"> As mentioned, the method requires a database of previous examples, something which is not available for all the potential lemmas of a language. When the program cannot find a model for a lemma, it prompts the user for the most basic information and tries to carry out the analysis. By this means, the final file contains all the sentences in SESCO format with the most likely structure. Since SESCO has a DTD-controlled tagset covering all possible analyses, the output file will always be a well-formed and valid XML file.</Paragraph>
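A minimal sketch of this fallback, with a hypothetical prompt and an in-memory SDB, could look like this:

```python
# Hypothetical prompt text and defaults; the real program asks for the
# "most basic information" about an unknown lemma.
def event_type_for(lemma, sdb, ask=input):
    if lemma in sdb:
        return sdb[lemma]
    answer = ask("Event type for unknown lemma '" + lemma + "' (state/process/action): ")
    sdb[lemma] = answer.strip() or "state"  # remember it for later sentences
    return sdb[lemma]

sdb = {"tener": "state"}
# Simulate the interactive prompt so the sketch runs non-interactively.
print(event_type_for("zigzaguear", sdb, ask=lambda prompt: "process"))  # process
```

Storing the answer back into the database mirrors the way revised sentences are later added to the SDB (section 3.5), so each unknown lemma has to be asked about only once.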
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3.5 Revision.
</SectionTitle>
    <Paragraph position="0"> The last step is a manual revision of the output file. As we have used the tagging of the UAM Spanish Treebank to develop our system, this step is of great importance.</Paragraph>
    <Paragraph position="1"> The program errors detected during the analysis have led us to implement new rules.</Paragraph>
    <Paragraph position="2"> Because a new rule is typically added whenever an error is detected, the corpus has been tagged in small groups of sentences (approx. 100 sentences each). Thus, we have performed sixteen re-examinations of our system, each time re-testing the reliability of the rules.</Paragraph>
    <Paragraph position="3"> Once the revision is completed, the new sentences are added to the SDB.</Paragraph>
    <Paragraph position="4"> 4 Main problems for SST.</Paragraph>
    <Paragraph position="5"> The last step of the SST process, the revision, provides us with a typology of problems in the automatic part of the system. Let us look at the four most important types and at the number of errors in the 1666 sentences:  1. Sentences without lemmas (69 errors).</Paragraph>
    <Paragraph position="6">  Newspapers have a lot of sentences (words between dots) which do not have a verb.</Paragraph>
    <Paragraph position="7"> Nominalization is frequently used by journalists with pragmatic functions. Taking into account that we are analysing argument structures of verbs, this sentence serves to illustrate this error: "Medidas desesperadas en China para frenar la crecida del Yangtze en la provincia de Hubei." ("Desperate measures in China to stop the Yangtze overflow in Hubei").</Paragraph>
    <Paragraph position="8"> 2. Verb Type (71 errors). The analysis of the verb is not correct because it is not in its right group (see section 3.2.). When the SST program does not recognize a lemma, it asks for the essential information, but it does not ask for types of references.</Paragraph>
    <Paragraph position="9"> 3. False analysis (66 errors). The most likely analysis (according to the SDB) does not correspond to the sentence. Since we are still developing SESCO, it would be naive to suppose that all these errors are due to SST problems. As we have seen, the SDB is based on a small corpus of 49,500 words, which is not enough to determine the most likely structure of some verbs (some of which have appeared only once or not at all).</Paragraph>
    <Paragraph position="10"> 4. Treebank errors (53 errors). We began our work with the last 100 sentences of the UAM Spanish Treebank (sentences 1500-1600). We proceeded in this inverse order because Manuel Alcantara had himself annotated the last sentences of the Treebank. In this process, we have noticed differences between the analyses of the sentences. These differences, even though they are not important for the syntactic analysis, have hindered the SST process since our program expects a particular structure. With the help of SST, we now have a revised version of the syntactic Treebank.</Paragraph>
    <Paragraph position="11"> In addition to these errors, there are others which we have not considered so important because they do not change the event type.</Paragraph>
    <Paragraph position="12"> The rules for the indirect relations (those phrases which are not arguments of the verb) depend on the semantic features of the Treebank tagset and they are not always enough to determine the right tag. It is worth remembering that both systems (Treebank and SESCO) are designed independently.</Paragraph>
    <Paragraph position="13"> Telicity of events is determined by the (indefinite/definite) articles of the phrases. When the head of a phrase is not at the very beginning, errors can occur.</Paragraph>
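The heuristic can be sketched as follows; the article lists are illustrative, and the real system reads the treebank's morpho-syntactic features rather than raw strings.

```python
# Article lists are illustrative; the real system uses the treebank's
# morpho-syntactic features rather than raw strings.
DEFINITE = {"el", "la", "los", "las"}
INDEFINITE = {"un", "una", "unos", "unas"}

def definiteness(phrase):
    first = phrase.split()[0].lower()
    if first in DEFINITE:
        return "definite"
    if first in INDEFINITE:
        return "indefinite"
    return "unknown"  # head not at the very beginning: error-prone

print(definiteness("el doble atentado"))  # definite
print(definiteness("muy buenas pistas"))  # unknown
```

The "unknown" branch is precisely where the errors mentioned above arise: when the head noun is preceded by other material, the article test fails.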
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Examples.
</SectionTitle>
    <Paragraph position="0"> Let us point out an uncomplicated example of the SST process: "EEUU tiene ya pistas sobre el doble atentado en Kenia y Tanzania." ("The United States already has leads about the double attack in Kenya and Tanzania"). First of all, SST searches for the main verb and its lemma. In this case, the verb is "tiene" (has) and the lemma is "tener" (to have). The Treebank tag for this verb is:

&lt;V Lemma="tener" Tensed="Yes" Form="PRES" Mode="IND" Number="SG" P="3"&gt;tiene&lt;/V&gt;

From this starting point, SST looks for the most likely structure of "tener" in the SDB. 99.5% of "tener" events are attributive states with a possessor and a property. The program checks if "tener" belongs to a special verb type. It does not, so the program checks if it is a normal sentence (it is not in passive voice) and follows the standard rules. These rules are the following:  1. The subject of the sentence ("EEUU") is the possessor.</Paragraph>
    <Paragraph position="1"> 2. If there is an attributive phrase or a direct object, it is the property. If there is not, the program looks for other possibilities (oblique complement, predicative complement, clauses and prepositional phrases). In our example, "pistas sobre el doble atentado en Kenia y Tanzania" is tagged as direct object.</Paragraph>
    <Paragraph position="2"> 3. In case no possessor or property was found, SST would assign a variable to these arguments.</Paragraph>
    <Paragraph position="3"> 4. The program checks if the arguments are definite or indefinite. "Pistas" is indefinite, and SST sets the event as indefinite.</Paragraph>
    <Paragraph position="4"> 5. Finally, SST looks for indirect relations (prepositional phrases which are not arguments).</Paragraph>
    <Paragraph position="5"> Once these rules are applied, the program determines whether it is a negative sentence, a question, etc. by looking for negative words and punctuation, and sets the appropriate features. It also determines the tense.</Paragraph>
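Assuming a phrase dictionary keyed by syntactic function and a toy indefiniteness lexicon, the standard rules applied to this example might be sketched as:

```python
# Phrase dictionary and indefiniteness lexicon are illustrative; the
# real system reads these facts from the treebank annotation.
INDEFINITE_HEADS = ("pistas",)  # toy lexicon: bare plural noun

def analyse_standard(phrases):
    event = {"type": "attributive_state"}
    event["possessor"] = phrases.get("SUBJ", "X1")                        # rules 1, 3
    event["property"] = phrases.get("DO") or phrases.get("ATTR") or "X2"  # rules 2, 3
    head = event["property"].split()[0]
    event["definite"] = head not in INDEFINITE_HEADS                      # rule 4
    return event

phrases = {"SUBJ": "EEUU",
           "DO": "pistas sobre el doble atentado en Kenia y Tanzania"}
print(analyse_standard(phrases))
```

Running this on the example yields an indefinite attributive state whose possessor is "EEUU" and whose property is the direct object phrase, matching the analysis described above.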
    <Paragraph position="6"> At the end, the final version of the sentence analysis is written in a target file following the SESCO format. To take a more difficult example, let us analyze the sentence "Se ha escapado de casa" ("He/she has escaped from home"). We have only one previous analysis of the lemma "escaparse" (to escape) in the SDB, and it is an action made up of a displacement.</Paragraph>
    <Paragraph position="7"> Regarding references, "escaparse" belongs to a particular group of events together with "ir", "irrumpir", "marchar", "presentarse", etc. For this group, the agent and patient of the action and the first argument of the displacement's states are the same entity.</Paragraph>
    <Paragraph position="8"> The SST checks if it is a normal sentence and applies the rules appropriate to this group:</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. The subject of the sentence will be the agent.
</SectionTitle>
    <Paragraph position="0"> In this case, there is no subject and the program establishes a variable (X) chosen arbitrarily.</Paragraph>
    <Paragraph position="1"> 2. Because it is a displacement, SST looks for prepositional phrases with "de" or "desde" ("from") in order to fill the second argument of the first state. It finds "de casa".</Paragraph>
    <Paragraph position="2"> 3. SST looks for prepositional phrases with "a" or "hasta" ("to") in order to fill the second argument of the second state. It does not find one.</Paragraph>
    <Paragraph position="3"> 4. The program establishes a number as identifier of the agent and links it together with the patient and the first arguments of the states.</Paragraph>
    <Paragraph position="4"> 5. SST looks for indirect relations.</Paragraph>
    <Paragraph position="5"> Finally, the program determines that it is not a negative sentence and gets time and mood information.</Paragraph>
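The displacement rules for this group might be sketched as follows; the function and its output format are illustrative, not SESCO's XML.

```python
# Illustrative sketch; the output format is ours, not SESCO's XML.
def analyse_displacement(subject, pps):
    agent = subject or "X"  # rule 1: a variable when there is no subject
    source = next((pp for pp in pps if pp.startswith(("de ", "desde "))), None)
    target = next((pp for pp in pps if pp.startswith(("a ", "hasta "))), None)
    return {
        "agent": agent, "patient": agent,  # rule 4: one shared identifier
        "state1": (agent, source),         # rule 2: "from" phrase
        "state2": (agent, target),         # rule 3: "to" phrase, or None
    }

print(analyse_displacement(None, ["de casa"]))
```

For "Se ha escapado de casa", the subject is absent, so agent, patient, and the first state arguments all share the variable identifier, while the target of the displacement remains unfilled.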
    <Paragraph position="6"> The annotated sentence in Treebank and SESCO formats can be found in the appendix. The most important data are underlined.</Paragraph>
  </Section>
</Paper>