<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2147">
  <Title>The Week at a Glance - Cross-language Cross-document Information Extraction and Translation</Title>
  <Section position="3" start_page="1007" end_page="1007" type="metho">
    <SectionTitle>
2 Overview of Processing
</SectionTitle>
    <Paragraph position="0"> The system uses a set of pre-existing modules.</Paragraph>
    <Paragraph position="1"> These are: automatic language/codeset recognition for a text (Ludovik et al., 1999), sentence based summarization (biased towards domain keywords) (Cowie et al., 1998), part of speech tagging, noun phrase recognition, proper name recognition and classification (Cowie et al., 1993; Cowie, 1996), ontology based extraction, translation of the final filled template to English, and output generation.</Paragraph>
    <Paragraph position="2"> Additional document and template filters have been added at the front and back ends of the system to reduce the amount of text to be processed and to remove templates which are only sparsely filled.</Paragraph>
    <Paragraph position="3"> For example, when the web spider gathers a text in Spanish, the text is checked against a list of names in Spanish to see whether any of the person names of interest occur in it. If this is the case, the document is then part-of-speech tagged and noun phrases and proper names are recognized. In the present system proper names are handled using a table lookup process, rather than a more complex (and accurate) pattern based method. The ontology based extraction fills out the slots to produce a completed template. This is then translated by looking up words in the lexicon and by transliterating, or translating, proper names. The completed template is then stored with references to the original document.</Paragraph>
    <Paragraph position="4"> A set of templates is then used to produce one of a variety of reports, either for all events or for a single event type. These can be sorted on the different slots in the template. A table is then produced using HTML containing links to each original document, to document summarization and translation tools, and the slot fillers from each template.</Paragraph>
    <Paragraph position="5"> In the rest of this paper we focus on the configurable extraction method and the preliminary tests carried out on the system, and we close with a discussion of the improvements to resources and tools needed to make this a robust and useful technology.</Paragraph>
  </Section>
  <Section position="4" start_page="1007" end_page="1008" type="metho">
    <SectionTitle>
3 Extraction
</SectionTitle>
    <Paragraph position="0"> The three events used in the present system are &quot;election&quot;, &quot;travel&quot;, and &quot;meeting&quot;. For each of these a template was defined containing slots whose content would seem likely to occur in newspaper articles. Each of these slots was then mapped to one or more ontology concepts to produce a &quot;control template&quot;. The three events are currently defined as follows: The left hand label defines the name/role of the slot, the right hand defines one, or more, ontological concepts which should be found for any phrase in the text which is a potential filler for the slot. The method of template definition is completely generic, and should allow a user with a reasonable knowledge of the ontology to rapidly configure an extraction system for new simple event types.</Paragraph>
    <Paragraph position="1"> To perform an extraction, after the phrase recognition step, each headword in a sentence is looked up in the lexicon and its associated concepts found. Each lexicon entry is then matched with the concepts in the control template slots. A match may also be found using ancestors of the concept found in the lexicon entry. Thus for the lexicon entry &quot;Bishop&quot;, in English, the attached concept is &quot;RELIGIOUS-ROLE&quot;, which is a kind of &quot;SOCIAL-ROLE&quot;. The combination of lexical entries which has the highest match, and which contains the key concept for the event, is chosen and a completed extraction template is produced.</Paragraph>
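The ancestor-based matching step described above can be illustrated with a minimal Python sketch. It is not the system's implementation: the dictionaries PARENT, LEXICON and CONTROL_TEMPLATE, the slot names, and the concepts OBJECT and CITY are assumptions made for illustration. Only the link from RELIGIOUS-ROLE to SOCIAL-ROLE is taken from the text. The sketch covers slot matching for a single headword and does not attempt the scoring of competing combinations.

```python
# Hypothetical is-a links; only RELIGIOUS-ROLE to SOCIAL-ROLE comes from the text.
PARENT = {
    "RELIGIOUS-ROLE": "SOCIAL-ROLE",
    "SOCIAL-ROLE": "OBJECT",
}

# Hypothetical lexicon: headword mapped to its attached concepts.
LEXICON = {
    "bishop": ["RELIGIOUS-ROLE"],
    "santiago": ["CITY"],
}

# Hypothetical control template: slot name mapped to acceptable concepts.
CONTROL_TEMPLATE = {
    "person": ["SOCIAL-ROLE"],
    "location": ["CITY"],
}


def ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matching_slots(headword):
    """Return the slots matched by the headword's concepts or their ancestors."""
    slots = []
    for concept in LEXICON.get(headword.lower(), []):
        lineage = set(ancestors(concept))
        for slot, wanted in CONTROL_TEMPLATE.items():
            if lineage.intersection(wanted) and slot not in slots:
                slots.append(slot)
    return slots


print(matching_slots("Bishop"))
```

Under these assumptions "Bishop" fills the hypothetical person slot only because its concept inherits from SOCIAL-ROLE; a headword missing from the lexicon matches no slot.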
    <Paragraph position="2"> Lexical subcategorization patterns, which will also help increase the accuracy of this selection process, have not been used yet.</Paragraph>
    <Paragraph position="3"> The ontological lexicons for Japanese and Russian were created by joining a bilingual source-language-to-English lexicon with an English-to-ontology lexicon. This process adds a significant amount of artificial ambiguity to the final source-language-to-ontology lexicon.</Paragraph>
    <Paragraph position="4"> Using correctly created lexicons for each language and syntactic knowledge for each lexical entry would allow the extraction process to operate more accurately.</Paragraph>
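The lexicon join discussed above, and the artificial ambiguity it introduces, can be sketched in Python as follows. The toy entries and the concept names VEGETABLE, WEAPON and BOW-GESTURE are assumptions chosen for illustration and are not taken from the system's lexicons or ontology. The Russian word used here means either "onion" or "bow" (the weapon); because the English word "bow" also names a gesture, the joined entry acquires a concept that the Russian word does not have.

```python
from collections import defaultdict

# Toy bilingual lexicon (assumed entries): source word mapped to English translations.
SOURCE_TO_ENGLISH = {
    "лук": ["onion", "bow"],
}

# Toy English-to-ontology lexicon (assumed concept names).
ENGLISH_TO_CONCEPTS = {
    "onion": ["VEGETABLE"],
    "bow": ["WEAPON", "BOW-GESTURE"],
}


def join_lexicons(source_to_english, english_to_concepts):
    """Map each source word to every concept of every one of its English translations."""
    joined = defaultdict(list)
    for word, translations in source_to_english.items():
        for english in translations:
            for concept in english_to_concepts.get(english, []):
                if concept not in joined[word]:
                    joined[word].append(concept)
    return dict(joined)


print(join_lexicons(SOURCE_TO_ENGLISH, ENGLISH_TO_CONCEPTS))
```

The join cannot tell which senses of an English translation the source word actually shares, so every sense is carried over. This is the kind of spurious entry that a lexicon built directly for each language would avoid.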
    <Section position="1" start_page="1008" end_page="1008" type="sub_section">
      <SectionTitle>
Two Examples of Extraction
</SectionTitle>
      <Paragraph position="0"> The following examples are both produced by the extraction method operating on bracketed texts produced by part-of-speech tagging and phrase recognition.</Paragraph>
      <Paragraph position="1"> On Thursday April 16, Clinton began his two day state visit in Santiago, Chile to meet with Chilean President, Eduardo Frei, and then onto the Summit of the Americas.</Paragraph>
      <Paragraph position="3"/>
      <Paragraph position="5"/>
    </Section>
  </Section>
  <Section position="5" start_page="1008" end_page="1009" type="metho">
    <SectionTitle>
4 Testing
</SectionTitle>
    <Paragraph position="0"> Two weeks of news stories were gathered from two newspapers in each of our four languages: English, Spanish, Russian, and Japanese. We then filtered this document collection and kept only those documents which mentioned specific surnames, for eighteen different people. This entailed generating lists of these names in all four languages, including morphological variants for Russian. This was intended to focus the extraction process on specific domains (business and politics principally). The extraction process was then run on the remaining set of documents and the resulting templates translated and used to generate the final tables of events.</Paragraph>
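The surname-based document filter described above might look like the following Python sketch. The name lists are placeholders, since the eighteen surnames actually used are not listed here, and the function names are hypothetical. Plain substring matching is an assumption made for brevity; it can over-match inside longer words.

```python
# Placeholder name lists; the eighteen surnames actually used are not given here.
NAMES_OF_INTEREST = {
    "english": ["Clinton", "Frei"],
    # Morphological variants are listed explicitly for Russian.
    "russian": ["Клинтон", "Клинтона", "Клинтону"],
}


def mentions_name(text, language):
    """Return True if any listed surname for the language occurs in the text."""
    lowered = text.lower()
    names = NAMES_OF_INTEREST.get(language, [])
    return any(name.lower() in lowered for name in names)


def filter_documents(documents):
    """Keep only the (language, text) pairs that mention a name of interest."""
    return [(lang, text) for lang, text in documents if mentions_name(text, lang)]


docs = [
    ("english", "Clinton began his two day state visit in Santiago."),
    ("english", "Heavy rain is expected over the weekend."),
    ("russian", "Визит Клинтона в Чили."),
]
print(filter_documents(docs))
```

Only documents that pass this filter would go on to tagging and extraction, which is how the filter reduces the amount of text to be processed.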
    <Paragraph position="1"> Many of the entries are inaccurate. One of the principal causes is the lack of syntactic information to constrain the extraction process. Simple improvements could be made by adding constraints based on appositions, prepositions, particles and morphology. However, a significant number of entries do contain useful information and the ability to scan, in one language, the output from eight sources in four languages is obviously a useful one.</Paragraph>
  </Section>
</Paper>