<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-1005">
  <Title>Tagging for Learning: Collecting Thematic Relations from Corpus</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE LARGEST COMPANY ON THE LIST,
WHICH LAST PAID SHAREHOLDERS IN JANUARY,
SAID THE 5 PC STOCK DIVIDEND WOULD BE
PAYABLE FOLLOWING THE PAYMENT OF THE
CASH DIVIDEND. (DJ, October 27, 1988)
</SectionTitle>
    <Paragraph position="0"> For this sentence, which is not exotic or unusual in its complexity, there are 24 non-trivial different parse trees. Human readers, in contrast to most programs, can quickly identify groups of words that &amp;quot;hang together&amp;quot; such as COMPANY PAID A DIVI-DEND, STOCK DIVIDEND, and CASH DIVIDEND, and use these clusters to understand the sentence unambiguously. Moreover, a human reader can easily recognize SHAREHOLDERS as recipient and DIVIDEND as the object of PAY. Along these lines, our program develops the capability to identify such patterns by training on a large corpus of examples.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 The Training Corpus
</SectionTitle>
      <Paragraph position="0"> The training corpus, from which our lexical information is extracted, consists of more than ten rail34 i lion words from the Dow Jones newswire (10 months worth of stories). For the root PAY, for instance, we collected more than 6000 examples, 20 of which are given below.</Paragraph>
      <Paragraph position="1"> To exploit this data, a system must transform common patterns into operational templates, encoding a core relation between the words. The sections that follow describe the evolution and implementation of this acquisition technique.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Co-occurrence: Previous Work
</SectionTitle>
    <Paragraph position="0"> Garside \[4\] and Church et al. \[1\] provided a major impetus for this line of work. In Church's work, a collection of English collocations bootstrapped from a tagged corpus facilitated the construction of an adaptive &amp;quot;tagger&amp;quot;, a program that annotates a text with part-of-speech information.</Paragraph>
    <Paragraph position="1"> Frank Smadja \[7\] continued Church's effort by collecting operational pairs such as verb-noun and adjective-noun pairs. Smadja used these pairs to constrain \]\[exical choice in a language generator; for example, the system prefers &amp;quot;deposit a check&amp;quot; to &amp;quot;place a check&amp;quot; based on the frequency of co-occurrence of deposit and check.</Paragraph>
    <Paragraph position="2"> Ido Dagan \[3\] pursued this topic further by projecting co-occurrences beyond the local context, using collocations for anaphora resolution. For example in,</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE CAR WAS DRIVING ON THE ROAD.
SUDDENLY IT BRAKED.
</SectionTitle>
    <Paragraph position="0"> CAR is selected over ROAD as the anaphor of IT, since CAR BRAKE is a stronger collocation than ROAD BRAKE. Interestingly, this idea complements Wilks' preference semantics \[8\], in which preference is based on a semantic hierarchy. In Dagan's method, preferences are based on word patterns acquired from corpus. null Our work further emphasizes globM-sentence connections. An example that highlights the use of co-occurrence is given on the next page.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE CHAIRMAN AND CHIEF EXECUTIVE OF FRANKLIN
FIRST FEDERAL SAVINGS &amp; LOAN ASSOCIATION
OF WILKES-BARRE, \[SAID\] FRANKLIN FIRST
FEDERAL'S PLAN OF CONVERSION HAD BEEN
APPROVED BY THE FEDERAL HOME LOAN BANK
BOARD \[AND THAT\] THE OFFERING OF COMMON
SHARES IN FRANKLIN FIRST FINANCIAL CORP.
</SectionTitle>
    <Paragraph position="0"> HAD BEEN APPROVED BY THE BANK BOARD AND BY THE SEC. (D J, 07-25-88).</Paragraph>
    <Paragraph position="1"> What is the attachment of THAT? THAT could potentially attach to almost any preceding word, e.g., FEDERAL THAT, BOARD THAT, CONVERSION THAT, SAID THAT, etc. The affinity of the word  pair SAY THAT (although it does not appear in this sentence as a collocation) supports the appropriate attachment.</Paragraph>
    <Paragraph position="2"> Furthermore, co-occurrence relations support thematic-role assignment. This is important for our ultimate objective of producing more accurate conceptual information from news stories \[5\]. The text below illustrates one type of problem in role assignment: null</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
THE LARGEST COMPANY ON THE LIST,
WHICH LAST PAID SHAREHOLDERS IN JANUARY,
SAID THE 5 PC STOCK DIVIDEND WOULD BE
PAYABLE FOLLOWING THE PAYMENT OF THE
</SectionTitle>
    <Paragraph position="0"> CASH DIVIDEND. (D J, October 27, 1988) Who paid what to whom and when? Co-occurrence-based analysis generates lexical relations such as subj-verb, verb-obj, and verb-obj2, relations which are further mapped into appropriate thematic and semantic roles. The program thus determines that COMPANY is the payer of PAID, SHAREHOLDERS the payee, and DIVIDEND the payment.</Paragraph>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Lexical Representation
</SectionTitle>
    <Paragraph position="0"> An acquired lexical structure called a Thematic Relations (Figure 2) facilitates this analysis. For a pair of content words, a relation provides (1) a strength of association (or &amp;quot;mutual affinity&amp;quot;), and (2) a structure type.</Paragraph>
    <Paragraph position="1"> This table is acquired from corpus by a tagger based on morphology and local syntax.</Paragraph>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Extracting Co-occurrence
</SectionTitle>
    <Paragraph position="0"> The algorithm operates in three steps: (1) tag the corpus for morphology and part of speech, (2) collect collocations using relative frequency, and (3) use tagging to determine lexical relations within collocations. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Part-of-speech Tagging
</SectionTitle>
      <Paragraph position="0"> Since the corpus size is about 10-million words, a full-fledged global sentence parsing is prohibitively expensive, and tagging must be carried out by localist methods, i.e., by means of morphology and local syntactic markers. There are three degrees of difficulty of cases to be tagged.</Paragraph>
      <Paragraph position="1"> Morphology-based Tagging: Only a few words can be tagged using morphology alone. While PAYMENT and SHAREHOLDERS are unambiguously nouns, morphology-based tagging is ambiguous for most words. For example, PAID and SAID could be either verb or adjective (i.e. participle modifier); STOCK and CASH could be either noun or verb.</Paragraph>
      <Paragraph position="2">  AUG. 12 TO HOLDERS OF RECORD JULY 15.</Paragraph>
      <Paragraph position="3"> OCT. I WILL BE PAYED IN THE USUAL MAN AUG. 29 TO HOLDERS OF RECORD AUG. 12.</Paragraph>
      <Paragraph position="4"> SEPT. 14 TO HOLDERS OF RECORD AUG. 22 AUG. 18 TO HOLDERS OF RECORD AUG. 8.</Paragraph>
      <Paragraph position="5"> DATE ON OR AFrER AUG. 1, 1990, FOR TH OF 10.85 PENCE A SHARE. HEIGHTENING OVER A 12-MONTH PERIOD. DUE THURSDAY.</Paragraph>
      <Paragraph position="6">  help to remove most cases of ambiguity. For example, was SAID (read: the word SAID preceded by was) can be unambiguously tagged a verb; the PAID shareholders, is an adjective; and the STOCK is definitely a noun.</Paragraph>
      <Paragraph position="7"> Statistics-Based Tagging: Taggers reported by \[4; 1\] have capitalized on a large collection of bi-grams plus statistically weighted grammar rules. In this method, statistical properties are acquired from a large training corpus which was tagged manually. Statistical methods have proved very effective, and attained a high level of accuracy \[6\].</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Problematic Cases
</SectionTitle>
      <Paragraph position="0"> Some cases prove even more difficult and cannot be resolved by localist methods. Consider the following two examples.</Paragraph>
      <Paragraph position="1"> * &amp;quot;The company preferred stock PAID ...&amp;quot; . In this clause, PAID, could be either an adjective or a verb (see &amp;quot;the horse raced past the barn&amp;quot;). Indeed, this clause could probably be determined by a global parse, however, this would be too expensive computationally.</Paragraph>
      <Paragraph position="2"> * &amp;quot;CONVINCING MANAGEMENT proved tough&amp;quot; is even harder since it presents a Necker cube situation (i. e. changing the interpretation of either word seems immediately to change the interpretation of the pair). Is it an adjective-noun or is it a verb-noun pair? In general, the analysis of such pairs requires deeper understanding of word relationships. Consider another example:</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="0" end_page="0" type="metho">
    <SectionTitle>
LATER IN THE DAY BUYING INTEREST
</SectionTitle>
    <Paragraph position="0"> DIMINISHED . . .</Paragraph>
    <Paragraph position="1"> Again, it is difficult to tell whether INTEI~EST in BUYING diminished or the BUYING of IN-TERESTs diminished. Thus, local clues do not contribute towards the proper resolution of such cc'~3es.</Paragraph>
    <Paragraph position="2"> The incorrect resolution of such cases, which unfortunately are pervasive in the corpus, impinges on two objectives: performance and learning.</Paragraph>
    <Paragraph position="3"> In order to perform text analysis, in the first case one must determine whether management was convinced, or the management convinced some second party; in the second case, one must determine the subject of the main verb of the sentence, i.e., which is the ,subject of DIMINISHED? Many applications require an unambiguous result. Thus a call must be made one way or another. Statistical means might make that call slightly more judiciuos on the average. null However, when tagging is used for learning of thematic roles, inappropriate resolution of such cases can drastically contaminate the final results by biasing it in a certain direction. Results are far more accurate when ambiguous cases are left out altogether.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Tagging for Learning
</SectionTitle>
      <Paragraph position="0"> Our tagger is based on a 7,000-root lexicon that facilitates accurate morphological analysis, and about I00 local-syntax rules. It produces tagging for about 60% of the content words in the corpus. Tagged output for a sample sentence is given below.</Paragraph>
      <Paragraph position="1">  A 4-tuple in the sentence above is a word/root/affix/part-of-speech. As expected, many content words in this sentence cannot be unambiguously tagged, and are marked ?, i.e., undetermined. In particular, notice that PAID remains unresolved. Fortunately, most PAY cases in the corpus are simpler and are appropriately tagged.</Paragraph>
    </Section>
  </Section>
  <Section position="12" start_page="0" end_page="0" type="metho">
    <SectionTitle>
OF///PP THE///DT CASH///NN DIVIDEND///NN
THE///DT COMPANY///NN LAST///JJ PAID/PAY/ED/VA
A///DT 5///NN DIVIDEND///NN ON///PP JANUARY///DD ...
</SectionTitle>
    <Paragraph position="0"> For purposes of thematic role acquisition the identification of passive and active voice is crucial. In the sample sentence above, PAID is appropriately tagged as a verb in the active voice (marked as VA).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Collecting Collocations
</SectionTitle>
      <Paragraph position="0"> Based on the tagging above (the root field), all collocations in the corpus are counted, and the following table is generated.</Paragraph>
      <Paragraph position="1"> This table is similar to Smadja's \[7\], and it provides the position of collocative words relative to PAY, and the total count within 4 words in either direction.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Determining Lexical Relations
</SectionTitle>
      <Paragraph position="0"> Lexical relations are determined using the known functionality of the verb (see \[9\]) and supporting examples. PAY is marked in the lexicon as a dative verb.</Paragraph>
      <Paragraph position="1"> Consider 5 cases containing the pair PAY SHARE-HOLDER, from which the thematic relation is induced (VA stands for verb, active voice; VP for verb, passive voice; AD for adjective).</Paragraph>
      <Paragraph position="2">  (I) STINGHOUSE SAID IT INTENDS TO PAY/va THE TWO SHAREHOLDERS/nn $2.08 A SHARE PLUS A (2) ONTROL OF THE COMPANY WITHOUT PAYING/va ALL SHAREHOLDERS/rm A FAIR PRICE. THE (3) THE CASH PORTION OF THE PRICE PAID/?? TO POLYSAR COMMON SHAREHOLDERS/nn WILL, INCR (4) CIPATING SHAREHOLDERS/nn WILL BE PAID/vp $3 A SHARE CASH. NO BROKERAGE FEES OR T (5) PER SHARE. THE DIVIDEND IS PAYABLE/ad TO SHAREHOLDERS/nn OF RECORD JULY ~. (6) ENTS A SHARE FROM 37.5 CENTS, PAYABLE/ad SEPT. I TO SHAREHOLDERS/nn OF RECORD AUG</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>