<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1606">
  <Title>Normalization and Paraphrasing Using Symbolic Methods</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Corpus Analysis and Expected Output
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Corpus study
</SectionTitle>
      <Paragraph position="0"> The corpus on which we work consists of a collection of texts from ATSDR that present toxic products and are meant to be read by the general public. We have concentrated on the first paragraphs, which contain on average six to seven sentences and give a general presentation of a toxic product. They give information about the name, the appearance (colour, smell), some physical properties and possible synonyms of a toxic product. They also explain where the product comes from and for what purposes it is used. Because of the uniformity of the information conveyed in these different texts, the corpus is rich in paraphrases.</Paragraph>
      <Paragraph position="1"> For instance, in the text concerning acetone we read: It evaporates easily, is flammable, and dissolves in water.</Paragraph>
      <Paragraph position="2"> And in the text concerning acrolein we can read: It dissolves in water very easily and quickly, changes to a vapor when heated. It also burns easily.</Paragraph>
      <Paragraph position="3"> Even in the same text, there are some redundancies, and a similar idea can be expressed more than once in different ways. For instance, in the text describing 2-Butanone we can read: it is also present in the environment from natural sources.</Paragraph>
      <Paragraph position="4"> And later: 2-Butanone occurs as a natural product. These few examples illustrate that the texts we work with deal with a restricted semantic domain and contain a large number of reformulations.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Semantic focus of our paraphrase system
</SectionTitle>
      <Paragraph position="0"> Our goal is to detect and represent some selected information in the corpus presented above. To achieve this, we want to associate a uniform representation with the different wordings of the same information that appear in the texts. We focus on the different ways of expressing information about the appearance, physical properties, synonyms, use and origin of toxic products. Our representation consists of a list of predicates, which are detailed below.</Paragraph>
      <Paragraph position="1"> PHYS_FORM/2. This predicate is the result of the normalization of strings expressing the physical form of the toxic product. For instance, PHYS_FORM(ammonia,gas) expresses that the product ammonia is a gas.</Paragraph>
      <Paragraph position="2"> DESCRIPTION_COLOUR/2. This predicate is the result of the normalization of strings describing the colour of the toxic product. For instance, DESCRIPTION_COLOUR(antimony,silverywhite) expresses that antimony is a silvery-white product.</Paragraph>
      <Paragraph position="3"> DESCRIPTION_SMELL/2. This predicate is the result of the normalization of strings describing the smell of the toxic product. For instance</Paragraph>
    </Section>
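    <Paragraph position="1"> The idea of attaching one uniform predicate to several wordings can be illustrated with a minimal Python sketch. It is an illustration only, not the symbolic system described in this paper: the two surface patterns, the example sentences, the names Predicate and normalize_phys_form, and the underscore spelling of the predicate name are all assumptions made for the example.

```python
import re
from typing import NamedTuple, Optional


class Predicate(NamedTuple):
    """A binary predicate such as PHYS_FORM(ammonia,gas)."""
    name: str
    product: str
    value: str

    def render(self) -> str:
        # Render in the NAME(product,value) notation used in the text.
        return f"{self.name}({self.product},{self.value})"


# Hypothetical surface patterns, invented for this sketch.
# They are NOT the rules used by the system described in the paper.
PHYS_FORM_PATTERNS = [
    re.compile(r"^(\w+) is a (\w+)\.?$", re.IGNORECASE),
    re.compile(r"^(\w+) occurs as a (\w+)\.?$", re.IGNORECASE),
]


def normalize_phys_form(sentence: str) -> Optional[Predicate]:
    """Return a PHYS_FORM predicate if a toy pattern matches, else None."""
    for pattern in PHYS_FORM_PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            product, form = match.groups()
            return Predicate("PHYS_FORM", product.lower(), form.lower())
    return None


if __name__ == "__main__":
    # Two invented wordings that map to the same representation.
    for text in ["Ammonia is a gas.", "Ammonia occurs as a gas"]:
        print(normalize_phys_form(text).render())
```

Both invented sentences yield the same Predicate value, which is the property the uniform representation is meant to provide; a sentence that matches neither pattern returns None instead of a guess.</Paragraph>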
  </Section>
</Paper>