<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1113">
  <Title>Towards Unsupervised Extraction of Verb Paradigms from Large Corpora</Title>
  <Section position="3" start_page="110" end_page="111" type="intro">
    <SectionTitle>
2 Data
</SectionTitle>
    <Paragraph position="0"> This report describes our investigation of English text in which verbs are tagged for inflectional category. (The verbs were automatically tagged by the Brill tagger. Tag errors, such as &quot;\[ \VBG&quot;, tended to form isolated clusters.) The tags identify six inflectional categories: past tense (VBD), tenseless (VB), third-person singular present tense (VBZ), other present tense (VBP), -ing (VBG) and participle (VBN). The use of tagged verbs enables us to postpone the problem of resolving ambiguous inflectional forms, such as the homonyms &quot;work&quot; as tenseless and &quot;work&quot; as present tense, a conflation that is pervasive in English for these categories. We also do not address how to separate the past participle and passive participle uses of VBN.</Paragraph>
    <Paragraph position="1"> The methods reported in this paper were developed on two different corpora. The first corpus consisted of the 300 most frequent verbs from the 52 million word corpus of the New York Times, 1995. 2 For this corpus, both the verbs and the contexts consisted of tagged words. As a somewhat independent test, we applied our methods to the 400 most frequent verbs from a second corpus containing over 100 million words from the Wall Street Journal (1990-94). For the second corpus, the tags for context words were removed. The results for the two corpora are very similar. For reasons of space, only the results from the second corpus are reported here.</Paragraph>
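The selection of the most frequent tagged verbs described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes whitespace-separated tokens in word/TAG form and treats each (word, tag) pair as one "verb". The function name top_tagged_verbs and the sample sentence are hypothetical.

```python
from collections import Counter

# The six verb tags named in the text.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def top_tagged_verbs(tagged_text, n):
    """Return the n most frequent (word, tag) verb pairs with their counts.

    Assumes whitespace-separated tokens in word/TAG form; this format is an
    assumption for illustration, not something the paper specifies.
    """
    counts = Counter()
    for token in tagged_text.split():
        # Split on the last slash so words containing "/" stay intact.
        word, sep, tag = token.rpartition("/")
        if sep and tag in VERB_TAGS:
            counts[(word.lower(), tag)] += 1
    return counts.most_common(n)


# Hypothetical sample: "work" appears both as VBP and as VB.
sample = "They/PRP work/VBP and/CC we/PRP work/VBP ;/: she/PRP works/VBZ to/TO work/VB"
print(top_tagged_verbs(sample, 2))
```

Keying the counter on the (word, tag) pair keeps homonyms such as "work" as VB and "work" as VBP apart, which is the distinction the tagged input is meant to preserve.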
    <Paragraph position="2"> The distribution of verbs is very different for inflectional category and lemma. The distribution of verbs with respect to lemmas is typical of the distribution of tokens in a corpus. Of the 176 lemmas represented in the 400 most frequent verbs, 79 (45%) have only one verb. One lemma, BE, has 14 verbs. 3 Even in 100 million words, the 400th most frequent verb occurs only 356 times. We have not yet looked at the relation between corpus frequency and clustering behavior of an item. The distribution of verbs in inflectional categories has a different profile (see Table 1). This may be related to the fact that, unlike lemmas, inflectional categories form a small, closed class.</Paragraph>
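The lemma statistics in the preceding paragraph can be reproduced in outline with a short sketch. It assumes a mapping from tagged verbs to lemmas is already available; the paper does not say how lemmas were assigned, so the function lemma_profile and the toy mapping below are hypothetical.

```python
from collections import Counter


def lemma_profile(verb_to_lemma):
    """Summarise how many tagged verbs fall under each lemma.

    verb_to_lemma maps (word, tag) pairs to a lemma string. The mapping is
    hypothetical; the paper does not say how lemmas were assigned.
    """
    verbs_per_lemma = Counter(verb_to_lemma.values())
    singletons = sum(1 for size in verbs_per_lemma.values() if size == 1)
    return {
        "lemmas": len(verbs_per_lemma),
        "singleton_lemmas": singletons,
        "singleton_share": singletons / len(verbs_per_lemma),
        "largest": verbs_per_lemma.most_common(1)[0],
    }


# Toy data for illustration only; not taken from the corpus.
toy = {
    ("is", "VBZ"): "BE",
    ("was", "VBD"): "BE",
    ("been", "VBN"): "BE",
    ("said", "VBD"): "SAY",
    ("says", "VBZ"): "SAY",
    ("rose", "VBD"): "RISE",
}
profile = lemma_profile(toy)
print(profile)

# The share reported in the text: 79 of 176 lemmas have a single verb.
print(round(79 / 176 * 100))
```

The last line checks the rounding of the reported figure: 79 out of 176 is roughly 44.9%, which the text gives as 45%.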
  </Section>
</Paper>