File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/98/w98-1309_intro.xml

Size: 4,047 bytes

Last Modified: 2025-10-06 14:06:50

<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1309">
  <Title>Implementing Voting Constraints with Finite State Transducers</Title>
  <Section position="4" start_page="92" end_page="93" type="intro">
    <SectionTitle>
3 Preliminary Results from Tagging English
</SectionTitle>
    <Paragraph position="0"> We have experimented with this approach using the Wall Street Journal Corpus from the Penn Treebank CD. We used two classes of constraints: one class derived from the training corpus (a set of 5000 sentences, about 109,000 tokens in total, from the WSJ Corpus) and a second class of hand-crafted constraints, consisting mainly of negative constraints (demoting impossible or unlikely situations) or lexicalized positive constraints. The hand-crafted constraints were constructed by observing the failures of the statistical constraints on the training corpus and fixing them accordingly. A test corpus of 500 sentences (about 11,500 tokens in total) was set aside for testing.</Paragraph>
    <Paragraph position="1"> For the statistical constraints, we extracted tag k-grams from the tagged training corpus for k = 2, 3, 4, and 5. For each tag k-gram, we computed a vote which is very similar to the weights used by Tzoukermann et al. [14], except that we do not use their notion of genotypes in exactly the same way. Given a tag k-gram t_1, t_2, ..., t_k, let n = count(t_1 ∈ Tags(w_i), t_2 ∈ Tags(w_{i+1}), ..., t_k ∈ Tags(w_{i+k-1})), taken over all possible i in the training corpus, be the number of places where the tag sequence can possibly occur. Here Tags(w_i) is the set of tags associated with the token w_i. Let f be the number of times the tag sequence t_1, t_2, ..., t_k actually occurs in the tagged text, that is, f = count(t_1, t_2, ..., t_k). We smooth f/n by defining p = (f + 0.5)/(n + 1) so that neither p nor 1 - p is zero. The uncertainty of p is given by sqrt(p(1 - p)/n) [14]. We then compute the vote for this</Paragraph>
    <Paragraph position="3"> This formulation thus gives high votes to k-grams which are selected most of the time they are &quot;selectable.&quot; Among the k-grams which are equally good (same f/n), those with a higher n (hence less uncertainty) are given higher votes. The votes for negative and positive hand-crafted constraints are chosen to override any vote the statistical constraints may have. The initial lexical votes for the parse t_{i,j} of token w_i are obtained from the training corpus in the usual way, i.e., as count(w_i, t_{i,j})/count(w_i), normalized to between 0 and 100.</Paragraph>
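    The counts described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy corpus, and the choice to count k-grams only within sentences are assumptions made for illustration, and scaling the lexical ratio by 100 is one reading of "normalized to between 0 and 100". The vote formula itself is not reproduced, so the sketch stops at n, f, p, and the uncertainty of p.

```python
from collections import Counter
from itertools import product
from math import sqrt


def build_tags_of(sentences):
    """Map each word to the set of tags seen for it, i.e. Tags(w)."""
    tags_of = {}
    for sentence in sentences:
        for word, tag in sentence:
            tags_of.setdefault(word, set()).add(tag)
    return tags_of


def kgram_statistics(sentences, tags_of, k):
    """Return {kgram: (n, f, p, uncertainty)} for tag k-grams of length k."""
    n_counts = Counter()  # places where the k-gram could occur
    f_counts = Counter()  # places where the k-gram actually occurs
    for sentence in sentences:
        for i in range(len(sentence) - k + 1):
            window = sentence[i:i + k]
            f_counts[tuple(tag for _, tag in window)] += 1
            # Every combination of candidate tags is "selectable" here.
            options = [sorted(tags_of[word]) for word, _ in window]
            for candidate in product(*options):
                n_counts[candidate] += 1
    stats = {}
    for kgram, n in n_counts.items():
        f = f_counts[kgram]
        p = (f + 0.5) / (n + 1)  # smoothed so that p is never 0 or 1
        stats[kgram] = (n, f, p, sqrt(p * (1 - p) / n))
    return stats


def lexical_votes(sentences):
    """Return {(word, tag): 100 * count(word, tag) / count(word)}."""
    pair_counts = Counter()
    word_counts = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            pair_counts[(word, tag)] += 1
            word_counts[word] += 1
    return {
        (word, tag): 100.0 * count / word_counts[word]
        for (word, tag), count in pair_counts.items()
    }


# Hypothetical three-sentence corpus of (word, tag) pairs.
TOY_CORPUS = [
    [("the", "DT"), ("can", "NN"), ("rusts", "VBZ")],
    [("they", "PRP"), ("can", "MD"), ("fish", "VB")],
    [("the", "DT"), ("fish", "NN"), ("swim", "VBP")],
]

bigram_stats = kgram_statistics(TOY_CORPUS, build_tags_of(TOY_CORPUS), 2)
toy_lexical_votes = lexical_votes(TOY_CORPUS)
```

    In this toy corpus the bigram (DT, NN) is selected at both places where it is selectable, while (DT, MD) is selectable once and never selected, so the former gets a much higher smoothed p.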
    <Paragraph position="4"> After extracting the k-grams as described above for k = 2, 3, 4 and 5, we ordered each group by decreasing votes and ran an initial set of experiments with these, to select a small group of constraints that performed satisfactorily. Table 1 presents, for reference, the number of distinct k-grams extracted and how they performed when used as the only constraints.</Paragraph>
    <Paragraph position="5"> After this experimentation, we selected the first 200 (with highest votes) of the bi-gram constraints and the first 200 of the 3-gram constraints as the set of statistical constraints; inclusion of the 4- and 5-grams with highest votes did not have any meaningful impact on the results. It should be noted that the constraints obtained this way are purely constraints on tag sequences and do not use any lexical or genotype information. The initial lexical votes were obtained from the training corpus, as also described above. We started tagging the training set with this set of constraints and, by observing the errors made and introducing hand-crafted rules, arrived at a total of about 970 constraints. Most of the hand-crafted constraints were negative constraints (with large negative votes) that rule out certain tag sequences. Table 2 presents a set of tagging results from this experimentation. Although the results are quite preliminary, we feel that the results in the last row of Table 2 are quite satisfactory and warrant further extensive investigation.</Paragraph>
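    The selection step, ordering each k-gram group by decreasing vote and keeping the first few, can be sketched as follows. The function name, data layout, and vote values are hypothetical; the limits of 200 per group in the text are replaced by tiny limits so the toy example stays readable.

```python
def select_top_constraints(votes_by_k, limits):
    """Keep the highest-voted k-grams of each length.

    votes_by_k: {k: {kgram: vote}}
    limits:     {k: number of k-grams to keep for that k}
    """
    selected = []
    for k, limit in limits.items():
        ranked = sorted(
            votes_by_k[k].items(),
            key=lambda item: item[1],
            reverse=True,  # decreasing vote
        )
        selected.extend(ranked[:limit])
    return selected


# Hypothetical votes, for illustration only.
TOY_VOTES = {
    2: {("DT", "NN"): 80.0, ("DT", "MD"): -10.0, ("MD", "VB"): 60.0},
    3: {("DT", "NN", "VBZ"): 40.0, ("PRP", "MD", "VB"): 55.0},
}

statistical_constraints = select_top_constraints(TOY_VOTES, {2: 2, 3: 1})
```

    Leaving a value of k out of the limits, as is done here for k = 4 and 5, drops that group entirely, which matches the observation that the 4- and 5-grams did not change the results meaningfully.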
  </Section>
</Paper>