<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1606">
  <Title>Normalization and Paraphrasing Using Symbolic Methods</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DESCRIPTION SMELL(1.3-butadiene,gasoline-like)
</SectionTitle>
    <Paragraph position="0"> expresses that the product 1.3-butadiene has a gasoline-like odor.</Paragraph>
    <Paragraph position="1"> SYNONYM/2. This predicate expresses that the second argument is a synonym of the first, which is the name of the toxic product. For instance SYNONYM(acetone,dimethyl ketone) expresses that dimethyl ketone is another name for acetone.</Paragraph>
    <Paragraph position="2"> PROPERTY/5. The PROPERTY predicate is the result of the normalization of strings expressing physical or chemical properties of the toxic product. For instance, PROP-ERTY(acrolein,dissolve,water,in,NONE) expresses that the product acrolein is soluble in water (instantiation of the four first arguments of the predicate), and that we do have precisions about the way this dissolution occurs (last argument NONE is not instantiated by a value). For the same product we have PROP-ERTY(acrolein,burn,NONE,NONE,easily) which expresses that the product is flammable and that the localization of the flammability is unspecified.</Paragraph>
    <Paragraph position="3"> ORIGIN/4 contains the normalized information whether the product is natural or not and where it can be found. For instance, ORIGIN(ammonia,manufactured,NONE,NONE) expresses that the product ammonia is manmade, and ORIGIN(amonnia,natural,soil,in) expressed that the same product can also be found naturally in soil.</Paragraph>
    <Paragraph position="4"> USE/6 is the result of the normalization of the uses of the described product. In this first stage we only concentrate in uses where the product is used alone2. For instance USE(benzidine,NONE,NONE,produce,dye,past) expresses that in the past (last argument is past) the product benzidine was used to produce dyes (4th and 5th arguments) while USE(ammonia,smelling salts,in,NONE,NONE,present) expresses that ammonia is now (last argument is present) used in smelling salts (the purpose of the use is not specified here).</Paragraph>
    <Paragraph position="5"> 2In the texts, uses of a product when it is mixed with another can also be described but we decided to ignore this information. To each of the above-mentioned predicates a suffix NEG can be added if there is a negation.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Paraphrase detection
</SectionTitle>
    <Paragraph position="0"> Paraphrasing means to be able, from some input text that convey a certain meaning, to express the same meaning in a different way. This subject has recently been receiving an increasing interest. For instance, Takahashi et. al. (Takahashi et al., 2000) developed a lexico-structural paraphrasing system. Kaji et al.</Paragraph>
    <Paragraph position="1"> developed a system which is able to produce verbal paraphrase using dictionary definitions (Kaji et al., 2000) and Barzilay and McKeown showed how, using parallel corpora of English literary translations, they extract paraphrases (Barzilay and McKeown, 2001). Paraphrase detection is a useful step in many NLP applications. For instance, in multi-document summarization, paraphrase detection helps to identify similar text segments in order that the summary become more concise (McKeown et al., 1999). Paraphrase detection can also be used to augment recall in different IE systems.</Paragraph>
    <Paragraph position="2"> In our experiment, paraphrase detection is a step in normalization, as we want to instantiate the same way the predicates presented above when the informative content is similar. For instance, we want to obtain the same normalized predicate for the two utterances ProductX is a colorless, nonflammable liquid and ProductX is a liquid that has no colour and that does not burn easily namely:</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DESCRIPTION COLOUR(ProductX,colorless)
PHYS FORM(ProductX,liquid)
PROPERTY NEG(ProductX,burn,NONE,NONE,NONE).
</SectionTitle>
    <Paragraph position="0"> The input to our paraphrase detection system is the whole paragraph that describes the toxic product.</Paragraph>
    <Paragraph position="1"> The analysis of the paragraph produces as output the set of normalized predicates. This output can be produced either in simple text format or in an XML format that can feed directly some database.</Paragraph>
    <Paragraph position="2"> The paraphrase detection system is based on three different modules that are described in the following subsections. As claimed in (Takahashi et al., 2000) and for the purpose of re-usability, we distinguish what is of general linguistic interest in the paraphrasing task from what is clearly domain dependent, so these three modules are: A general English dependency parser; A general morpho-syntactic normalizer; A specific- and application-oriented normalizer. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 General English dependency parser
</SectionTitle>
      <Paragraph position="0"> This component is a robust parser for English (XIP) (A&amp;quot;it-Mokhtar et al., 2002) that extract syntactic functionally labeled dependencies between lexical nodes in the text.</Paragraph>
      <Paragraph position="1"> Parsing includes tokenization, morpho-syntactic analysis, tagging which is performed via a combination of hand-written rules and HMM, chunking and finally, extraction of dependencies between lexical nodes.</Paragraph>
      <Paragraph position="2"> Dependencies are binary relations linking two lexical nodes of a sentence. They are established through what we call deduction rules.</Paragraph>
      <Paragraph position="3"> Deduction rules Deduction rules apply on a chunk tree and consist in three parts:  Context is a regular expression on chunk tree nodes that has to be matched with the rule to apply. Condition is a boolean condition on dependencies, on linear order between nodes of the chunk tree, or on a comparison of features associated with nodes.</Paragraph>
      <Paragraph position="4"> Extraction corresponds to a list of dependencies if the contextual description and the conditions are verified.</Paragraph>
      <Paragraph position="5"> For instance, the following rule establishes a SUBJ dependency between the head of a nominal chunk and a finite verb:</Paragraph>
      <Paragraph position="7"> SUBJ(#2,#1).</Paragraph>
      <Paragraph position="8"> The first three lines of the rule corresponds to context and describe a nominal chunk in which the last element is marked with the variable #1, followed by anything but a verb, followed by a verbal chunk in which the last element is marked with the variable #2. The fourth line (negative condition: ) verifies if a SUBJ dependency exists between the lexical nodes corresponding to the variable #2 (the verb) and #1 (the head of the nominal chunk). The test is true if the SUBJ dependency does not exist. If both context and condition are verified, then a dependency SUBJ is created between the verb and the noun (last line).</Paragraph>
      <Paragraph position="9"> An important feature is that our parser always provides a unique analysis (determinism), this analysis being potentially underspecified.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 General morpho-syntactic normalization
</SectionTitle>
      <Paragraph position="0"> The morpho-syntactic normalizer is a general module that is neither corpus- nor application-dedicated.</Paragraph>
      <Paragraph position="1"> It consists of hand-made rules that apply to the syntactic representation produced by our parser. It uses well known syntactic equivalences such as passiveactive transformation and verb alternations proposed in Levin. It also exploits the classification given by the COMLEX lexicon (Grishman et al., 1994) in order to calculate the deep-subject of infinitive verbs.</Paragraph>
      <Paragraph position="2"> For instance the utterance Antimony ores are mixed with other metals is finally represented with a  set of normalized syntactic relations expressing that the normalized subject (SUBJ-N) of the verb mix is unknown, and that mix has two second actants (OBJ-N) ore and metal :</Paragraph>
      <Paragraph position="4"> For this example, both passive transformation and reciprocal alternation transformation have been applied on the set of dependencies produced by the general parser.</Paragraph>
      <Paragraph position="5"> Deep syntactic rules are expressed using the same formalism than general syntactic rules presented in the previous section. For instance the following rule construct an OBJ-N (Normalized object) dependency between the surface syntactic subject and a verb in a passive form3.</Paragraph>
      <Paragraph position="7"> Unlike Ros'e's approach (Ros'e, 2000) which also developed a deep syntactic analyzer, this is done exclusively by hand-made rules based on the previous calculated dependencies on the one hand and syntactic and morphological properties of the nodes involved in the dependencies on the other hand.</Paragraph>
      <Paragraph position="8"> Together with the exploration of syntactic properties, we also take advantage of morphological properties in order enrich our deep syntactic analysis. This is done using the CELEX database (Celex Database, 2000) by pairing nouns and verbs that belong to the same morphological family, which allows us to obtain for the expression John's creation of the painting, the same deep syntactic representation as for John creates the painting.</Paragraph>
      <Paragraph position="9"> As a result of the second stage, we obtain new deep syntactic relations, together with the superficial syntactic relations calculated by the general parser: SUBJ-N (Normalized subject) that links the first actant of a verb (finite or non-finite) or of a predicative noun to this verb or noun.</Paragraph>
      <Paragraph position="10"> OBJ-N (Normalized object) that links the second actant of a verb (finite or non-finite) or of a predicative noun to this verb or noun.</Paragraph>
      <Paragraph position="11"> ATTRIB (General attribute) that links two nodes when the second one denotes a property of the first one.</Paragraph>
      <Paragraph position="12"> PURPOSE that links a verb to its actant expressing the purpose of the action.</Paragraph>
      <Paragraph position="13"> It is important to note that predicative nouns are represented by their underlying verbs. e.g. The invention of the process is represented by OBJN(invent,process). null 3VDOMAIN links the first element of a verbal chain to the last element of a verbal chain and passive is a feature that is added to this relation.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Application- and corpus-specific normalization
</SectionTitle>
      <Paragraph position="0"> Application- and corpus-specific normalization is a follow-up of the previous module. While general normalization is purely based on syntactic transformations and some derivational morphology properties, it exploits neither synonymy relations nor further possibilities of morphological derivation; this extension does, using the results obtained at the previous analysis level.</Paragraph>
      <Paragraph position="1"> The application- and corpus-oriented analysis is organized in two axes that are detailed below.</Paragraph>
      <Paragraph position="2"> corpus oriented linguistic processing; corpus oriented paraphrasing rules.</Paragraph>
      <Paragraph position="3">  We exploit the corpus specific properties at different stages of the processing chain in order to improve the results of the general syntactic analysis. Below are the additions we made: Specific tokenization rules.</Paragraph>
      <Paragraph position="4"> Since toxic products can have names like 2,3-Benzofuran, which the general tokenizer does not consider as one unique token, we add a local grammar layer dedicated to the detection of these kinds of names. In other words, this layer composes together tokens that have been separated by the general tokenizer. null Specific disambiguation rules valid for this kind of corpus but not necessarily valid for all kinds of texts.</Paragraph>
      <Paragraph position="5"> For instance, the word sharp has a priori two possible part-of-speech analyzes, noun and adjective, and we want to keep these two analyzes for the general parser. But, since the noun sharp belongs to a certain domain (music) that has no intersection with the domain handled by the corpus, we add specific disambiguation rules to remove the noun analysis for this word.</Paragraph>
      <Paragraph position="6"> Improved treatment of coordination for this kind of text.</Paragraph>
      <Paragraph position="7"> The corpus contains long chains of coordinated elements and especially coordination in which the last coordinated element is preceded by both a comma and the coordinator. Since some elements have been typed semantically, we can be more precise in the coordination treatment exploiting this semantic information. null Adding some lexical semantics information For the purpose of the application, we have semantically typed some lexical entries that are useful for paraphrase detection. For instance, colour names have the features colour : + added.</Paragraph>
      <Paragraph position="8"> Automatic contextual typing Some of the manually semantic typing (previous point) allows us to indirectly type new lexical units. For instance, as formulations like synonyms, call, name, designate are marked as possible synonymy introducers, we are able to infer that complements of these lexical units are synonyms. In a similar way, syntactic modifiers of lexical units that have been marked in the application lexicon like smell and odor are odor descriptions. In these cases, direct typing cannot be achieved. For example, the huge number of potential smellings (almond-like, unpleasant, etc.) cannot be code by hand. However, the inference mechanism enable us to extract the required information.</Paragraph>
      <Paragraph position="9"> Ad-hoc anaphora resolution.</Paragraph>
      <Paragraph position="10"> In our corpus, the pronoun it and the possessive its always refer to the toxic product that is described in the text. As we do not have any anaphora resolution device integrated to our parser, we take advantage of this specificity to resolve anaphora for it and its.  Paraphrases are detected by hand-made rules using lexical and structural information.</Paragraph>
      <Paragraph position="11"> Lexical relations for paraphrasing As mentioned before, in our general normalizer some nouns and verbs belonging to the same morphological family are related. We extend these relations to other classes of words that appear in the corpus. For instance, we want to link the adjective flammable and the verb burn, and we want the same kind of relation between the adjectives soluble, volatile, mixable and the verbs dissolve, evaporate and mix respectively. We declaratively create a relation (ISAJ relation) between these pairs of words, and this relation can then be handled by our parser exactly like a dependency relation which has been previously calculated. Other lexical relations between synonyms (e.g. call and name) or non-related morphological nouns and verbs (as for instance the noun flammability and burn) are created.</Paragraph>
      <Paragraph position="12"> The lexical relations we created are the following ISAJ links an adjective and a verb when the verb can be paraphrased by BE+adjective TURNTO links a noun and a verb when the verb can be paraphrased by TURN TO+noun HASN links a noun and a verb when the verb can be paraphrased by HAVE+noun SYNO links two words belonging to the same morpho-syntactic class when the first is a synonym of the second4.</Paragraph>
      <Paragraph position="13"> Normalization rules Once these relations are created, we can then exploit them in rules.</Paragraph>
      <Paragraph position="14"> For instance, the following rule5 (see below) allows for the creation of the predicate PROPERTY(aniline,dissolve,NONE,NONE,NONE) for the utterance aniline is soluble.</Paragraph>
      <Paragraph position="15"> if (  The rule formalism is the one used for the general syntactic grammar and the deep syntax grammar. In this case, we only have two parts in the rule (Condi null present example, since we have detected that aniline is the described toxic product (SUBSTANCE(aniline)), since an ISAJ relation exists between soluble and dissolve (ISAJ(soluble,dissolve)) and finally since the deep syntactic analysis of the sentence has given to us the dependency ATTRIB(aniline,soluble), the final predicate is created.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Example of output
</SectionTitle>
      <Paragraph position="0"> When applied on an input text describing a toxic substance, such as the following one : Acetone is a manufactured chemical that is also found naturally in the environment. It is a colorless liquid with a distinct smell and taste. It evaporates easily, is flammable, and dissolves in water. It is also called dimethyl ketone, 2-propanone, and beta-ketopropane. Acetone is used to make plastic, fibers, drugs, and other chemicals. It is also used to dissolve other substances. It occurs naturally in plants, trees, volcanic gases, forest fires, and as a product of the breakdown of body fat. It is present in vehicle exhaust, tobacco smoke, and landfill sites.</Paragraph>
      <Paragraph position="1"> Industrial processes contribute more acetone to the environment than natural processes.</Paragraph>
      <Paragraph position="2"> the system is able to extract the following list of predicates:  Most of the information present in the original text has been extracted and normalized: for example, flammable is normalized as PROP-ERTY(acetone,burn,NONE,NONE,easily). However, form the input ... as a product of the breakdown of body fat, the system extract the partial analysis ORIGIN(acetone,natural,a product,in). Such cases are discussed in section 4.</Paragraph>
      <Paragraph position="3"> In this section, we have shown how, extending a general parser with limited information (morphological and transformational) and adding specific domain knowledge for the corpora we consider, we were able to obtain a normalization of some knowledge enclosed in the texts. The next section is dedicated to the evaluation of the performances of this system.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML