<?xml version="1.0" standalone="yes"?>
<Paper uid="C92-2084">
  <Title>DERIVATION OF UNDERLYING VALENCY FRAMES FROM A LEARNER'S DICTIONARY</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
DERIVATION OF UNDERLYING VALENCY FRAMES
FROM A LEARNER'S DICTIONARY
ALEXANDR ROSEN, EVA HAJICOVA and JAN HAJIC
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Universita Karlova
Praha, Czechoslovakia
</SectionTitle>
      <Paragraph position="0"> ABSTRACT The authors collect lexical data for a module of English syntactic analysis in the context of a bilingual research project. The computer usable version of OALD (Hornby, 1974) is used as the primary source. The main focus is on the structure and derivation of valency frames for verbal entries in the target lexicon. Illustration of the complex relation between OALD's verb subcategorization codes and the target complementation paradigms is provided, and an approach to the derivation procedure design suggested.</Paragraph>
    </Section>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The present paper describes a part of a larger project, which should result in the extraction of lexical and structural correspondences between grammatical units in large parallel English and Czech texts. The correspondences will then be used to build a transfer module for an English-to-Czech (and possibly Czech-to-English) machine translation system. Final as well as partial results should also be useful as source data for text-oriented linguistic research, both bi- and monolingual. 1</Paragraph>
    <Paragraph position="1"> This task entails the need for tools to analyse unrestricted Czech and English texts. In the first stage of the project the goal is to produce Czech and English lexicons of adequate coverage and implemented analysis grammars, which will later be augmented with tools for preliminary disambiguation. The parser will build annotated dependency structures, usable for tagging word forms, clauses and sentences with morphological and syntactic information. The lexicon and grammars, enriched by feedback from the parsed texts, can later be used within the machine translation system proper.</Paragraph>
    <Paragraph position="2"> At present, the primary source of lexical data for the English analysis is a machine readable dictionary, preprocessed to contain only relevant information in a transparent format. This paper focusses on how valency frames for verbal entries are extracted from subcategorization codes in the machine readable dictionary.</Paragraph>
    <Paragraph position="3"> 2. ~ CHOICES Even though the correspondences between parallel text units can be established at an arbitrary level starting from word forms up to an elaborate logical representation, the practical solution seems to lie somewhere in between. The approach we have chosen is based on the representation of linguistic analysis in terms of underlying (tectogrammatical) structures, which are determined by the given language, but void of various irregularities of the surface strings, including the ambiguity of morphemic and surface syntactic units. 2 A &quot;deeper&quot; analysis would increase the risk of errors and introduce more theoretical bias while a very shallow level would require larger amounts of data to arrive at simple facts when parallel text units are compared.</Paragraph>
    <Paragraph position="4"> The (underlying) syntactic description is dependency-based (with coordination and apposition as relations of a different type) and the project described here makes it possible (i) to test the basic assumptions of the theory on a large data collection, and (ii) to formulate an implementable relation between the surface string and the underlying representation. ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 553 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992</Paragraph>
    <Paragraph position="5"> A constraint-based (unification) formalism was selected due to its declarativeness, conciseness and formal rigour, but its other interesting properties were also appreciated: i.a., the important role of the lexicon and the need to treat surface facts within the same rigorous framework as deeper concepts. 3</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. THE SOURCE
</SectionTitle>
    <Paragraph position="0"> As a shortcut towards a lexicon of reasonable coverage we decided to build upon an available machine readable dictionary, which we intend to augment later by hand and from other sources. Our primary source of English lexical data is now CUVOALD, or the Expanded Computer Usable Version of the Oxford Advanced Learner's Dictionary of Current English, 3rd edition (OALD, Hornby, 1974), which is available from the Oxford Text Archive (see Mitton, 1986). 4</Paragraph>
    <Paragraph position="1"> CUVOALD lists all headwords, headword variants and derivatives with simple codes denoting word classes and inflection patterns, supplemented by verb pattern codes for verbs. Sense distinctions from OALD are not retained.</Paragraph>
    <Paragraph position="2"> Whereas the derivation of lexical information as needed by the analysis from CUVOALD word class codes is relatively straightforward, the OALD verb pattern codes, which are crucial for our purpose, present a real challenge. The dictionary classifies verbs according to the number and form of complements into 51 &quot;verb patterns&quot;, marked by numbers 1-25, supplemented in some cases by letters (4A, 4B, 4C, 4D, 4F). The number of verbs in a single pattern is quite variable: starting from a single item in [VP4F] for be followed by an infinitive up to 4855 standard transitive verbs in [VP6A]. A pattern groups together verbs which exhibit the same behaviour in a standard context and are subject to the same set of transformations under specified conditions. So e.g. the class of intransitive verbs [VP2A] can take introductory there and postpone the subject if it is indefinite and &quot;heavy&quot;: There comes a time when we feel we must make a protest. A single pattern is also used for verbs which allow the same morphosyntactic variations of a complement. ([VP1]: She's dark / in good health / here / a pretty girl.) A different verb pattern is, however, used if only a subset of the relevant class permits the variation. ([VP6C]: She enjoys swimming / *to swim. vs. [VP6D]: She likes swimming / to swim.) Some variations may be treated as a different verb pattern. (This is the case of the above example: She likes swimming. [VP6D] and She likes to swim. [VP7A])</Paragraph>
    <Paragraph position="3"> Akkerman (1989) lists several shortcomings of the OALD verb patterns. As Sampson (1990) noted, some of them are arguable. For our purpose, the most problematic seems to be the treatment of compound verbs (with the resulting loss of information in CUVOALD) and the too surface-level definition of some verb patterns. These classes are quite a heterogeneous collection: [VP14] marks verbs in all of the following uses, the only requirement being that the verb is followed by a noun and a prepositional phrase: They accused him of stealing the book. I explained my difficulty to him. Compare the copy with the original.</Paragraph>
    <Paragraph position="4"> Another &quot;misbehaved&quot; pattern is VP4A where, depending on the verb, the infinitive can be complement or adjunct: The swimmer failed to reach the shore. He came to see that he was mistaken. She stood up to see better.</Paragraph>
    <Paragraph position="5"> Apart from these &quot;systemic&quot; blemishes we expect a number of other inconsistencies and errors to appear during the process of derivation and use of the target lexicon.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. THE TARGET
</SectionTitle>
    <Paragraph position="0"> The target lexicon contains the following information about the valency of a verb (or its complementation), grouped in an entry as a complementation paradigm:</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
SUBCATEGORIZATION LIST (SC)
</SectionTitle>
    <Paragraph position="0"> The subcategorization list gives syntactic and morphological categories for every dependent, i.e. either a participant (complement, may be obligatory or optional) or an obligatory free modification (obligatory adjunct). An item in the list is in fact an underspecified representation of the corresponding dependent. The ordering of items in the list corresponds to the unmarked word order in a declarative sentence.</Paragraph>
    <Paragraph position="1"> SYNTACTIC FRAME (SF), a feature structure with syntactic functions as attributes; values of these attributes are co-indexed with the corresponding items of the subcategorization list.</Paragraph>
    <Paragraph position="2"> UNDERLYING STRUCTURE (US), a feature structure with tectogrammatical functions as attributes; values of these attributes are identical with underlying structures within the corresponding items of the subcategorization list and syntactic frame. The value of the attribute GOV (governor) is identical with the value of the lexeme attribute of the verb's feature structure.</Paragraph>
    <Paragraph position="3"> The analysis will establish index links between saturated frame slots and their fillers in the analysis tree. This will provide easy access to the analysis results at the three levels of description, highlighting the structure of the sentential core. The following simple example gives the complementation paradigm for an intransitive verb. N[nom] is shorthand for a feature structure representing a noun in the nominative case with saturated subcategorization requirements; the numbers co-index feature structures which are shared as values of some attributes; the small index selects only a part of the structure, namely the nominal equivalent of the underlying structure; the attribute GOV gives the lexical value of the verb while ACT stands for actor/bearer, the function representing the subject of an active verb at the underlying level. Angle brackets enclose lists, square brackets (conjunctions of) feature structures, curly brackets disjunctions. Commas separate members of a conjunction, vertical bars members of a disjunction.</Paragraph>
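    <Paragraph> A minimal sketch of how such a paradigm could be held in memory, assuming that plain Python objects stand in for feature structures and that object identity stands in for co-indexing. The attribute name SUBJ, the verb "sleep" and all helper names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Dependent:
    """An underspecified item of the subcategorization list."""
    category: str                                   # e.g. "N"
    features: dict                                  # e.g. {"case": "nom"}
    underlying: dict = field(default_factory=dict)  # its underlying structure


def intransitive_paradigm(lexeme):
    """Build SC, SF and US for an intransitive verb around one shared subject."""
    subject = Dependent("N", {"case": "nom"})
    return {
        "SC": [subject],                  # list order = unmarked word order
        "SF": {"SUBJ": subject},          # same object as the SC item
        "US": {"GOV": lexeme,             # lexical value of the verb
               "ACT": subject.underlying},  # shares the subject's underlying structure
    }


paradigm = intransitive_paradigm("sleep")
# Filling the slot once (as a parser would) is visible at every level.
paradigm["SC"][0].underlying["lexeme"] = "child"
print(paradigm["US"]["ACT"])
```

Because the three levels refer to the same objects, information added to the subject during analysis becomes visible in SC, SF and US at once, which is the effect the co-indexing is meant to achieve.
    </Paragraph>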
    <Paragraph position="5"> Next, we give two possible complementation paradigms for a transitive verb. (PAT stands for patient; V[prespart, SC&lt;N3&gt;]4 is an abbreviation for the present participle form of a verb whose single valency slot for subject in the SC list is co-indexed with the actor/bearer of the matrix verb):</Paragraph>
    <Paragraph position="7"> As the value of the attribute PAT of enjoy is shared with the value of the attribute US of the object, the correct value for the dependent verb's ACT attribute is supplied via co-indexing of the subject of enjoy with the subject of the non-finite clause within the SC list of enjoy:</Paragraph>
    <Paragraph position="9"> The complementation paradigm, rather than being stated within full-fledged feature structures, is expressed in terms of templates, preferably allowing defaults and multiple inheritance. Accordingly, the above two paradigms will be expressed as follows:
transitive
transitive, 2ing, equi
Two verbal entries can be related by a lexical rule with the effect that one of these two entries need not be explicitly present (the other should then be marked by the rule's name).</Paragraph>
    <Paragraph position="10"> This will solve phenomena such as there preposing, dative alternation, and passivization.</Paragraph>
    <Paragraph position="11"> The collection of three &quot;levels&quot; of description within a single complementation paradigm provides a means to express rather subtle differences. Let us take as an example four superficially identical constructions: (a) I wanted him to see the monster.</Paragraph>
    <Paragraph position="12"> (b) I expected him to see the monster. (c) I elected him to see the monster.</Paragraph>
    <Paragraph position="13"> (d) I told him to see the monster.</Paragraph>
    <Paragraph position="14">  Following Quirk et al. (1985, p.1216), the verb is monotransitive in (a), complex-transitive in (b) and (c), and ditransitive in (d). The example (b) is closer to the monotransitive type while (c) is closer to the ditransitive type.</Paragraph>
    <Paragraph position="15"> If we have the subcategorization list</Paragraph>
    <Paragraph position="17"> to express the superficial identity of all the four cases, we can assume the above verbs to have the following syntactic frames:  The difference between the types (a) and (b) vs. (c) and (d) is that between the Raising and Equi types. Therefore, (b) will have only two participants at the level of underlying structure while (c) will have three:
(a) US [ACT [4], PAT [6]]
(b) US [ACT [4], PAT [6]]
(c) US [ACT [4], PAT [5], EFF [6]]
(d) US [ACT [4], PAT [6], ADDR [5]]
The respective templates will be:
(a) transitive, 3inf, raising
(b) complex-transitive, 3inf, raising
(c) complex-transitive, 3inf, equi
(d) ditransitive, 3inf, equi
A problem remains how to derive such information from OALD's verb patterns.</Paragraph>
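    <Paragraph> The four cases can be written down as data, which makes the contrast easy to check mechanically. This is only an illustrative sketch: the templates and US slots are the ones listed for (a) to (d), while the dictionary layout and the function names are assumptions made here.

```python
# Templates and underlying-structure slots as listed for cases (a)-(d).
# The integers 4, 5 and 6 are the indices of the shared subcategorization list.
ENTRIES = {
    "want":   {"templates": ("transitive", "3inf", "raising"),
               "US": {"ACT": 4, "PAT": 6}},
    "expect": {"templates": ("complex-transitive", "3inf", "raising"),
               "US": {"ACT": 4, "PAT": 6}},
    "elect":  {"templates": ("complex-transitive", "3inf", "equi"),
               "US": {"ACT": 4, "PAT": 5, "EFF": 6}},
    "tell":   {"templates": ("ditransitive", "3inf", "equi"),
               "US": {"ACT": 4, "PAT": 6, "ADDR": 5}},
}


def control_type(verb):
    """Return 'raising' or 'equi', read off the verb's template list."""
    return "raising" if "raising" in ENTRIES[verb]["templates"] else "equi"


def participant_count(verb):
    """Number of participants at the level of underlying structure."""
    return len(ENTRIES[verb]["US"])


for verb in ENTRIES:
    print(verb, control_type(verb), participant_count(verb))
```

In this encoding the Raising verbs have two underlying participants and the Equi verbs three, although all four share one subcategorization list.
    </Paragraph>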
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5. THE DERIVATION
</SectionTitle>
    <Paragraph position="0"> CUVOALD was not primarily intended for use with a syntactic parser, so a few modifications were necessary.</Paragraph>
    <Paragraph position="1"> First, the pronunciation field was deleted and homograph entries with different pronunciations merged. (In CUVOALD, each word, or word form, has only one entry, unless it has two different pronunciations.) Second, entries headed by regular forms within irregular paradigms as headwords were also deleted. And finally, reference to base forms was provided in entries of all the remaining nonbase (irregular) forms. Base forms of irregular paradigms were marked by a code specifying the paradigm type. After that, we tried to find a way to derive the complementation paradigms.</Paragraph>
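    <Paragraph> The first modification could be sketched as below. The record layout (spelling, pronunciation, word class codes, verb pattern codes), the placeholder code strings, and the decision to merge homographs by taking the union of their codes are all assumptions made for illustration; the paper does not specify how the merge was carried out.

```python
def merge_homographs(records):
    """Drop the pronunciation field and merge entries that share a spelling.

    records: iterable of (spelling, pronunciation, tags, patterns) tuples.
    Returns a dict mapping each spelling to its merged code sets.
    """
    merged = {}
    for spelling, _pronunciation, tags, patterns in records:
        entry = merged.setdefault(spelling, {"tags": set(), "patterns": set()})
        entry["tags"].update(tags)          # union of word class codes
        entry["patterns"].update(patterns)  # union of verb pattern codes
    return merged


# Hypothetical records: one spelling listed twice with different pronunciations.
sample = [
    ("record", "pron-1", ["noun"], []),
    ("record", "pron-2", ["verb"], ["6A"]),
    ("walk", "pron-3", ["verb"], ["2A"]),
]
lexicon = merge_homographs(sample)
print(sorted(lexicon["record"]["tags"]))
```

After the merge every spelling has exactly one entry, which is the property the parser lexicon relies on.
    </Paragraph>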
    <Paragraph position="2"> Ideally, templates of the sort described in Section 4 should correspond to OALD verb patterns while lexical rules would account for structures listed in Hornby (1975) as variants of the same verb pattern. Although this idea works in the case of the most frequent patterns ([VP2A], [VP6A]), there are many patterns where the relation between pattern and paradigm can be 1:n, n:1, or even n:n (n &gt; 1) (see Section 3).</Paragraph>
    <Paragraph position="3"> The case of n patterns : 1 paradigm reduces the number of paradigms and as such is a welcome situation. The case of 1:n can mean (i) ambiguity for all verbs listed under the pattern (and can possibly be accounted for by lexical rules), (ii) the possibility to subdivide the verbs of this class into n subclasses, or (iii) a combination of the two. For (i), the derivation of a complementation paradigm from a verb pattern will yield a disjunction. For (ii), verbs with different complementation paradigms should be distinguished. Boguraev and Briscoe (1989) used valency codes in LDOCE (Longman Dictionary of Contemporary English) to automatically extract the (explicitly unmarked) distinction between Equi and Raising verbs. A similar approach can be used to make this and other distinctions in OALD by taking into account co-occurrences of verb patterns.</Paragraph>
    <Paragraph position="4"> (Our situation is simpler in that we, as yet, make no attempt to treat distinct word senses, and more difficult in that the blurred sense distinctions can have a negative effect on any derivation procedure.) It remains to be seen whether such a method will lead to results of sufficient reliability. However, at the same time we have to supply more information to some classes of verbs, for which any possibility of automatic treatment is excluded. The current efforts include the specification of lexical values of particles and prepositions for compound verbs and assigning verbs marked by verb pattern codes such as VP14 to relevant subclasses.</Paragraph>
    <Paragraph position="5"> The correspondences between the OALD patterns and complementation paradigms are stated in the simple cases by rules relating one or more patterns to one or more paradigms - templates. Where possible, frequently co-occurring verb patterns are collapsed into a single paradigm with local disjunction, e.g. [VP6D] and [VP7A] for like (swimming / to swim) give the following template:  Now there are two possible strategies representing two extremes. The first strategy disregards the actual distribution of verb patterns in the dictionary and attempts to combine results of rule application into a compact and meaningful complementation paradigm. The second strategy starts from a list of all combinations of verb patterns within the dictionary and assigns a rule to every combination. Let us look at how the first approach works.</Paragraph>
    <Paragraph position="6"> The process of derivation of a complementation paradigm for a verb entry consists of the following steps: 1. Application of rules rewriting a verb pattern code (or more verb pattern codes if the resulting paradigms can be related by a lexical rule) by a template or a sequence of templates connected by logical operators &quot;and&quot; and &quot;or&quot;; the result may be marked by one or more lexical rule names. Rules rewriting more patterns are preferred to those rewriting fewer patterns.</Paragraph>
    <Paragraph position="7"> A rule may be supplemented by a condition stipulating the presence or absence of other paradigms within the same entry. A rule whose condition is satisfied is preferred to a rule without condition. Verbs with patterns which do not correspond to a single complementation paradigm, while co-occurring verb patterns do not indicate a preference for one paradigm or the other, have to be treated manually.</Paragraph>
    <Paragraph position="8"> 2. Simplification of the sequence of templates by making all disjunctions as local as possible.</Paragraph>
    <Paragraph position="9"> 3. Consistency check performed by expansion of the sequence of templates into feature structures.</Paragraph>
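    <Paragraph> Step 1 of the procedure above can be sketched as a greedy rule selection. This is one possible reading, not the authors' implementation: the rules and template strings are placeholders, a condition is reduced to a set of pattern codes that must be absent from the entry, and the greedy loop itself is an assumption.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    patterns: frozenset   # verb pattern codes rewritten by the rule
    absent: frozenset     # condition: codes that must not occur in the entry
    template: str         # resulting template (placeholder text)


# Hypothetical rules, for illustration only.
RULES = [
    Rule(frozenset({"6A"}), frozenset(), "transitive"),
    Rule(frozenset({"2A"}), frozenset(), "intransitive"),
    Rule(frozenset({"6D"}), frozenset(), "transitive, 2ing"),
    Rule(frozenset({"6D", "7A"}), frozenset(), "transitive, {2ing | 2inf}"),
    Rule(frozenset({"6A"}), frozenset({"14"}), "transitive, no PP variant"),
]


def derive(entry_patterns):
    """Rewrite an entry's pattern codes; return (templates, leftover codes)."""
    entry = set(entry_patterns)
    remaining = set(entry)
    templates = []
    while remaining:
        applicable = [r for r in RULES
                      if r.patterns.issubset(remaining)
                      and r.absent.isdisjoint(entry)]
        if not applicable:
            break  # whatever is left has to be treated manually
        # Prefer rules rewriting more patterns, then rules whose condition holds.
        best = max(applicable, key=lambda r: (len(r.patterns), bool(r.absent)))
        templates.append(best.template)
        remaining -= best.patterns
    return templates, sorted(remaining)


print(derive({"6D", "7A", "2A"}))
print(derive({"6A", "14"}))
```

Codes that no rule can rewrite are returned separately, mirroring the entries that the text says must be handled by hand.
    </Paragraph>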
    <Paragraph position="10">  transitive, 2n | transitive, 2cls, 2that | transitive, 2cls, 2wh- | complex-transitive, 3inf, raising }  a) This is a condition stipulating that neither of the patterns should be present; the character ^ stands for negation.</Paragraph>
    <Paragraph position="11"> b) This is the template of a prepositional verb. The lexical value of the preposition should be supplied.</Paragraph>
    <Paragraph position="12"> This looks like a principled solution, but step 1 can be a source of unforeseen complexities with the result that too many entries will have to be handled manually. The second strategy is much safer: if there are not too many different combinations of verb patterns it might not be too difficult to state rewriting rules for all of them, thus eliminating steps 2 and 3 from the above procedure.</Paragraph>
    <Paragraph position="13"> However, to make a decision, some statistical analysis is necessary.</Paragraph>
    <Paragraph position="14"> CUVOALD lists 5695 verbs with 633 different combinations of verb patterns. 5 4853 verbs (85.2%) are marked by one of the 56 most frequent combinations (each occurring seven and more times). The first ten most frequent combinations are given below: verb patterns frequency</Paragraph>
    <Paragraph position="16"> At the other end, there are 442 combinations occurring only once, 191 two and more times, 119 three and more times and 77 five and more times.</Paragraph>
    <Paragraph position="17"> Another survey was aimed at finding most frequent combinations as proper subsets of the full combinations treated above. E.g. the combination of three patterns 2A,3A,6A occurs alone in 54 entries, but as a proper subset of a larger combination already in 566 entries.</Paragraph>
    <Paragraph position="18"> From the above data it seems that a compromise between the treatment of individual verb patterns and of entire combinations would be most efficient. 119 combinations can already be treated by individual rules quite comfortably while the rest can be composed from results of rules applied independently, where more alert supervision is required. It also seems feasible to use the rules for combinations to treat parts of the remaining lists of verb patterns, and perhaps add a few more, selected according to the second statistics.</Paragraph>
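    <Paragraph> Both surveys amount to counting sets of pattern codes. The sketch below shows one way to compute them, assuming the lexicon is available as a mapping from verbs to their pattern codes; the five toy entries and the function names are hypothetical and do not reproduce the CUVOALD figures.

```python
from collections import Counter


def combination_stats(entries, threshold):
    """Count full pattern combinations and the share of verbs covered by
    combinations that occur at least `threshold` times."""
    combos = Counter(frozenset(codes) for codes in entries.values())
    frequent = {c: n for c, n in combos.items() if n >= threshold}
    coverage = sum(frequent.values()) / len(entries)
    return combos, frequent, coverage


def superset_count(combos, target):
    """Number of entries whose combination properly contains `target`."""
    target = frozenset(target)
    return sum(n for c, n in combos.items()
               if target.issubset(c) and c != target)


# Hypothetical toy lexicon: verb identifier -> verb pattern codes.
toy = {
    "v1": ["2A", "6A"],
    "v2": ["2A", "6A"],
    "v3": ["6A"],
    "v4": ["2A", "6A", "15B"],
    "v5": ["6A"],
}
combos, frequent, coverage = combination_stats(toy, threshold=2)
print(len(combos), len(frequent), coverage)
print(superset_count(combos, ["2A", "6A"]))
```

Varying the threshold shows how many combination rules would be needed for a given coverage, which is the trade-off the compromise strategy rests on.
    </Paragraph>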
    <Paragraph position="19"> 6. PERSPECTIVES Lexicon and grammar together form the basis for the extraction of lexical and structural correspondences. Other tools are necessary, however, and we are currently designing specifications for such tools.</Paragraph>
    <Paragraph position="20"> Besides the non-trivial task of text cleanup, for which no special tools will be used, two major needs remain: text unit alignment and data extraction methods.</Paragraph>
    <Paragraph position="21"> Automatic text unit alignment (on word, phrase, and sentence level) is also non-trivial. On the sentence level, we will employ a method for alignment based on sentence length (Gale 1991), for which we have developed a flexible front-end for recognizing sentence boundaries. We are considering an extension of Church's algorithm taking into account lexicon-based elementary word correspondences (as in Kay (1988) and Catizone et al. (1991)) for better accuracy, but this extension has not been implemented yet.</Paragraph>
    <Paragraph position="22"> Methods for data extraction are still under development. However, it is clear what such data should look like. As our output representation is far from the interlingual ideal, the data will basically be transfer data in a form fitting the structural transfer model, following the ideas of Kaplan et al. (1989). The actual implementation, however, will follow the pattern of the transfer module in the experimental machine translation system ELU (Russel et al. (1991)).
NOTES
1 This project, called MATRACE (from MAchine TRAnslation between Czech and English), is one of the projects carried out within the IBM Academic Initiative in Czechoslovakia.</Paragraph>
    <Paragraph position="23"> 2 It is not the aim of this paper to discuss and substantiate the repertoire of valency relations and their classification. The interested reader can find a detailed analysis of these issues and a comparison with other theories of deep (underlying) structure in Sgall, Hajičová and Panevová (1986, esp. Ch. 2).</Paragraph>
    <Paragraph position="24"> 3 As we are involved in the development of a practical constraint-based system, we are aware of the necessity to include some control or dynamic information in addition to the static description supported by traditional constraint-based formalisms. We expect to deal with this issue seriously in later stages of the project, when partial results will be available.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 CUVOALD comes in two versions: one
</SectionTitle>
    <Paragraph position="0"> lists base forms plus all forms of irregular words while the other contains all inflected forms explicitly. As we intend to have a morphological component, we are using the base forms version.</Paragraph>
    <Paragraph position="1"> 5 These and following numbers include base forms only, as well as 876 verbs which were not marked by any pattern and for which defaults were used: 6A for transitive verbs, 2A for intransitive verbs.</Paragraph>
  </Section>
</Paper>