<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0906">
  <Title>Learning Argument/Adjunct Distinction for Basque</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 The argument/adjunct distinction
</SectionTitle>
    <Paragraph position="0"> Talking about Subcategorization Frames (SCF) means talking about arguments. Many existing systems directly acquire a set of possible SCFs without any prior filtering of adjuncts.</Paragraph>
    <Paragraph position="1"> However, adjuncts are a substantial source of noise. To avoid this problem, our approach addresses the argument/adjunct distinction.</Paragraph>
    <Paragraph position="2"> The argument/adjunct distinction is probably one of the most unclear issues in linguistics. In the generativist tradition, for example, the distinction has been presented as follows: arguments are those elements participating in the event, and adjuncts are those elements contextualizing or locating the event.</Paragraph>
    <Paragraph position="3"> This definition seems to be quite clear, but when we deal with concrete examples this is not the case.</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Izaskun Aldezabal, Maxux Aranzabe, Koldo Gojenola, Kepa Sarasola. Dept. of Computer Languages and Systems, University of the Basque Country, 649 P. K. Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), Philadelphia, July 2002, pp. 42-50. Association for Computational Linguistics.
</SectionTitle>
    <Paragraph position="0"> For example, consider two verbs, talk and play:</Paragraph>
    <Paragraph position="1"> a. Yesterday I talked with Mary.</Paragraph>
    <Paragraph position="2"> b. Yesterday I played soccer with Mary.</Paragraph>
    <Paragraph position="3"> Here Mary is a participant of the event in both cases; therefore, under the given definition, both would be arguments. But this contradicts what traditional views consider in practice: the PP with Mary is considered an argument of talk but not an argument of play. It is true that the two differ, because playing does not require two participants (though it can have them), while talking (in the sense of communicating) seems to require two participants.</Paragraph>
    <Paragraph position="4"> Finer argument/adjunct distinctions have also been proposed, differentiating between basic arguments, pseudo-arguments and adjuncts. Basic arguments are those required by the verb. Pseudo-arguments are those that, even if they are not required by the verb, extend the verbal semantics when they appear, for example by adding new participants. Finally, adjuncts are contextualizers of the event. The most radical view is to consider the argument/adjunct distinction as a continuum, where the elements at the extremes of the continuum can be easily classified as arguments or adjuncts. In contrast, the elements in the central part of the continuum can be easily misclassified. For further reference see C. Schutze (1995), J.M. Gawron (1986), C. Verspoor (1997), J. Grimshaw (1990), and N. Chomsky (1995).</Paragraph>
    <Paragraph position="5"> Of the different diagnostics proposed in the literature, some are quite consistent across authors (R. Grishman et al. 1994, C. Pollard and I. Sag 1987, C. Verspoor 1997):</Paragraph>
    <Paragraph position="6"> 1) The Obligatoriness condition. When a verb obligatorily requires the appearance of an element, this element is an argument.</Paragraph>
    <Paragraph position="7"> a. John put the book on the table. b. *John put the book. 2) Frequency. Arguments of a verb occur more frequently with that verb than with other verbs.</Paragraph>
    <Paragraph position="8"> a. I came from home (argument).</Paragraph>
    <Paragraph position="9"> b. I heard it from you (adjunct).</Paragraph>
    <Paragraph position="10"> 3) Iterability: Several instances of the same adjunct can appear together with a verb, while several instances of the same argument cannot.</Paragraph>
    <Paragraph position="11"> a. I saw you in Washington, in the Kennedy Center.</Paragraph>
    <Paragraph position="12"> b. *I saw Alice John (John and Alice being two different persons). 4) Relative order: Arguments tend to appear closer to the verb than adjuncts.</Paragraph>
    <Paragraph position="13"> a. I put the book on the table at three. b. *I put at three the book on the table. 5) Implicational test: Arguments are semantically implied, even when they are optional.</Paragraph>
    <Paragraph position="14"> a. I came to your house (from x). b. I heard that (from x). The third and fourth tests were not very useful to us. The iterability test is quite weak because it seems to rely more on other semantic notions, such as the part/whole relation, than on the argument/adjunct distinction. For example, sentence 3.a would be grammatical due to semantic plausibility: the Kennedy Center is a part of Washington, so seeing somebody in the Kennedy Center and seeing them in Washington are not semantically incompatible. In the case of 3.b, John is not a part of Alice, and therefore it is not plausible to say I saw Alice John. But it is plausible, for example, to say I saw you the hand. The relative order test is difficult to apply to a language like Basque, which is a free word order language.</Paragraph>
    <Paragraph position="15"> The first and fifth tests are robust enough to be useful in practice. However, only the first two diagnostics can be captured statistically by applying association measures such as Mutual Information. We did not come up with any straightforward way to apply the fifth test computationally.</Paragraph>
    <Paragraph position="16"> Before discussing the different measures applied, we present step by step the whole process we followed to achieve the argument/adjunct distinction.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
2 The acquisition process
</SectionTitle>
    <Paragraph position="0"> Our starting point was a raw newspaper corpus of 1,337,445 words, containing instances of 1,412 verbs. From these, we selected 640 verbs as statistically relevant because they appear in more than 10 sentences.</Paragraph>
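    <Paragraph> The selection criterion above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used: the input format (one list of verb lemmas per sentence) and the name select_verbs are assumptions made for the example.
```python
from collections import defaultdict

def select_verbs(sentences, min_sentences=10):
    """Return the verbs that occur in more than `min_sentences` distinct sentences.

    `sentences` is assumed to be a list in which each item is the list of
    verb lemmas found in one sentence (a hypothetical input format).
    """
    sentence_ids = defaultdict(set)
    for sentence_id, verbs in enumerate(sentences):
        for verb in verbs:
            sentence_ids[verb].add(sentence_id)
    return {verb for verb, ids in sentence_ids.items() if len(ids) > min_sentences}

# Toy data: "atera" occurs in 11 sentences, "erabili" in only 10.
toy_corpus = [["atera"]] * 11 + [["erabili"]] * 10
selected = select_verbs(toy_corpus)
```
With the threshold of 10 mentioned in the text, a verb seen in exactly 10 sentences is discarded, so only atera survives in the toy data.</Paragraph>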
    <Paragraph position="1"> As we said earlier, our goal was to distinguish arguments from adjuncts. When starting from a raw corpus, as in this case, it is necessary to obtain instances of verbs together with their dependents (arguments and adjuncts). We obtained this information by applying a partial parser (section 2.1) to the corpus. Once we had the dependents, statistical measures helped us decide which were arguments and which were adjuncts (section 2.2).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The parsing phase
</SectionTitle>
      <Paragraph position="0"> To obtain the data to which the statistical filters would be applied, we analyzed the corpus using several available linguistic resources:</Paragraph>
      <Paragraph position="1"> * First, we performed morphological analysis of the corpus, based on two-level morphology (K. Koskenniemi 1983; I. Alegria et al. 1996), and disambiguation using the Constraint Grammar formalism (Karlsson et al. 1995, Aduriz et al. 1997).</Paragraph>
      <Paragraph position="2"> * Second, a shallow parser was applied (I. Aldezabal et al. 2000), which recognizes basic syntactic units including noun phrases, prepositional phrases and several types of subordinate sentences.</Paragraph>
      <Paragraph position="3"> * The third step consisted of linking each verb with its dependents. Basque lacks a robust parser such as those used in (T. Briscoe &amp; J. Carroll 1997, D. Kawahara et al. 2001); therefore, we used a finite state grammar to link the dependents (both arguments and adjuncts) with the verb (I. Aldezabal et al. 2001). This grammar was developed using the Xerox Finite State Tool (L. Karttunen et al. 1997).</Paragraph>
      <Paragraph position="4"> Figure 1 shows the result of the parsing phase. In this case, both the comitative and inessive cases (PPs) are adjuncts, while the ergative NP is an argument. Linking dependents to a verb is not trivial, considering that Basque is a language with free order of constituents, and any element appearing between two verbs could, in principle, depend on either of them. Many problems must be taken into account, such as ambiguity and the determination of clause boundaries, among others.</Paragraph>
      <Paragraph position="5"> We evaluated the accuracy up to this point, obtaining a precision over dependents of 87% and a recall of 66%.</Paragraph>
      <Paragraph position="6"> So the input data to the next phase was relatively noisy.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
2.2 The argument selection phase
</SectionTitle>
      <Paragraph position="0"> Figure 1. Example of the result of the parsing phase: (c) is the verb phrase and (a, b, d) are its verbal dependents (phrases); the case and head are given under each phrase.
1) ... (a) [EEBBetako lehendakariak] (b) [UEko 15 herrialdeetako merkataritza ministroekin] (c) [bazkaldu behar zuen] (d) [negoziazioen bilgunean] ...
2) ... the president of the USA had to eat with the ministers of Commerce of 15 countries of the UE in the negotiation center ...
(a) [EEBB-etako lehendakari-a-k] / [USA-of president-the-erg.] / NP-ergative(president, singular) / The president of the USA
(b) [UE-ko 15 herrialde-etako merkataritza ministro-ekin] / [UE-of 15 countries-of Commerce ministers-with] / PP(with)-comitative(minister, plural) / with the ministers of Commerce of 15 countries of the UE
(c) [bazkaldu behar zuen] / [to eat had] / verb(eat) / had to eat
(d) [negoziazio-en bilgune-an] / [negotiation-of center-in] / PP(in)-inessive(center, singular) / in the negotiation center</Paragraph>
      <Paragraph position="1"> In the data resulting from the shallow parsing phase we counted up to 65 different cases (types of arguments, including postpositions and different types of suffixes). These are divided into two main groups: * 43 correspond to postpositions. Some of them can be directly mapped to English prepositions, but in many cases several Basque postpositions correspond to just one English preposition (see Table 1). There are also postpositions that map to categories other than English prepositions, such as adverbs (Table 1b).</Paragraph>
      <Paragraph position="2"> * The remaining cases correspond to different types of suffixes (for instance, the English complementizer that corresponds to several subordination suffixes: -la, -n, -na, -nik).</Paragraph>
      <Paragraph position="3"> This shows how fine-grained the range of arguments is, in contrast to other work where the range is at the categorial level, such as NP or PP (M. Brent 1993, C. Manning 1993, P. Merlo &amp; M. Leybold 2001).</Paragraph>
      <Paragraph position="4"> Given the complexity of having such a high number of cases, we decided to group postpositions that are semantically equivalent or almost equivalent (for example, English between and among). Even if there are some semantic differences between them, they do not seem to be relevant at the syntactic level. Linguists were in charge of this grouping task. Despite the risk of making mistakes when grouping the cases, we concluded that the loss of accuracy due to overly sparse data (a consequence of having many cases) would be worse than the noise introduced by any mistake in the grouping. The resulting set contained 48 cases.</Paragraph>
      <Paragraph position="5"> The complexity is reduced but it is still considerable.</Paragraph>
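      <Paragraph> The effect of grouping on data sparseness can be illustrated with a short Python sketch. The grouping table below is purely hypothetical: it reuses the English between/among example from the text as a stand-in, and does not reproduce the actual Basque cases or the 48 groups defined by the linguists.
```python
from collections import Counter

# Hypothetical grouping table: the labels are illustrative English stand-ins,
# not the actual Basque cases or the 48 groups used in the paper.
CASE_GROUPS = {
    "between": "between/among",
    "among": "between/among",
}

def group_case(case):
    """Map a case to its group; cases without an entry form their own group."""
    return CASE_GROUPS.get(case, case)

def count_verb_case_pairs(observations):
    """Count (verb, grouped case) pairs from (verb, case) observations."""
    return Counter((verb, group_case(case)) for verb, case in observations)

observations = [("talk", "between"), ("talk", "among"), ("talk", "with")]
counts = count_verb_case_pairs(observations)
```
After grouping, the two near-equivalent cases contribute to a single count, which is the reduction in sparseness that motivates the grouping.</Paragraph>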
      <Paragraph position="6"> Most of the work on automatic acquisition of subcategorization information (J. Carroll &amp; T. Briscoe 1997, A. Sarkar &amp; D. Zeman 2000, A. Korhonen 2001) applies statistical methods (hypothesis testing). The basic idea is the following: &quot;possible subcategorization frames&quot; are obtained from automatically parsed data (either completely or partially parsed) or from a syntactically annotated corpus. Afterwards, a statistical filter is employed to decide whether or not those &quot;possible frames&quot; are real subcategorization frames.</Paragraph>
      <Paragraph position="7"> These statistical methods can be problematic, mostly because they perform badly on sparse data. To avoid data sparseness as much as possible, we decided to design a system that learns which are the arguments of a given verb instead of learning whole frames. Frames are combinations of arguments, and considering that our system deals with 48 cases, the number of possible combinations is high, which results in sparse data.</Paragraph>
      <Paragraph position="8"> So we decided to work at the level of the argument/adjunct distinction. Working on this distinction is also very useful for avoiding noise in the subcategorization frames, because in this task adjuncts are synonymous with noise. A system that tries to acquire subcategorization frames without first making the argument/adjunct distinction suffers from sparse and noisy data.</Paragraph>
      <Paragraph position="9"> To accomplish the argument/adjunct distinction we applied two measures: Mutual Information (MI) and Fisher's Exact Test (for more information on these measures, see C. Manning &amp; H. Schutze 1999). MI is a measure from Information Theory, defined as the logarithm of the ratio between the observed probability of the verb and the case co-occurring and the probability of their co-occurring that would be expected if they were independent (the product of their individual probabilities). Higher Mutual Information values therefore correspond to more strongly associated verb-case pairs (see table 2).</Paragraph>
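      <Paragraph> The definition of MI given above can be written as a small Python function. This is a sketch under stated assumptions: probabilities are estimated as relative frequencies over all verb-dependent pairs, the logarithm is taken in base 2 (the text does not specify the base), and the counts in the example are invented for illustration, not taken from table 2.
```python
import math

def mutual_information(pair_count, verb_count, case_count, total):
    """Pointwise MI (in bits) between a verb and a case, from raw counts.

    pair_count: times the verb occurs with the case
    verb_count: occurrences of the verb
    case_count: occurrences of the case
    total:      all verb-dependent pairs in the data
    """
    if pair_count == 0:
        return float("-inf")  # the pair was never observed together
    p_pair = pair_count / total
    p_verb = verb_count / total
    p_case = case_count / total
    return math.log2(p_pair / (p_verb * p_case))

# Hypothetical counts, not taken from the paper's tables.
mi_strong = mutual_information(pair_count=40, verb_count=100, case_count=400, total=10_000)
mi_weak = mutual_information(pair_count=4, verb_count=100, case_count=400, total=10_000)
```
In this toy example the first pair co-occurs ten times more often than independence would predict, giving a clearly positive value, while the second pair co-occurs exactly as often as expected by chance, giving a value of about zero.</Paragraph>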
      <Paragraph position="11"> Mutual Information shows higher values for atera-ablative (to go/take out) and erabili-gisa (to use-as). These pairs were manually tagged as arguments, so Mutual Information makes the right prediction. In contrast, atera-instrumental (to go/take out-with) and erabili-instrumental (to use-with) were manually tagged as adjuncts. The Mutual Information values in table 2 agree with the manual tagging for these pairs as well, because the values are low, as expected for adjuncts.</Paragraph>
      <Paragraph position="12"> Fisher's Exact Test is a hypothesis testing statistical measure. We used the left-side version of the test (see T. Pedersen, 1996). Under this version, the test gives the probability that, if the experiment were repeated and there were no relation between the verb and the case, the co-occurrence frequency found would be no higher than the one actually observed. So higher left-side Fisher values tell us that there is a correlation between the verb and the case (see table 3). The Fisher values are high for atera-ablative (to go/take out) and erabili-gisa (to use-as). These values correctly predict the association between the verbs and cases for these examples. The low values for the atera-instrumental (to go/take out-with) and erabili-instrumental (to use-with) pairs should be interpreted as non-association between the verbs and the cases in these examples; that is to say, they are adjuncts. Again, the prediction is right according to the taggers.</Paragraph>
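      <Paragraph> The left-side test described above can be sketched in Python using only the standard library. The layout of the 2x2 contingency table (verb versus other verbs, case versus other cases) is an assumption made for this illustration, as are the function name and the example counts; the actual tables used in the experiments are not reproduced here.
```python
from math import comb

def fisher_left(a, b, c, d):
    """Left-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration):
      a: the verb with the case        b: the verb with other cases
      c: other verbs with the case     d: other verbs with other cases
    Returns the probability, under independence and with fixed margins,
    that the top-left cell is at most as large as the observed value a.
    """
    row1 = a + b                        # occurrences of the verb
    col1 = a + c                        # occurrences of the case
    n = a + b + c + d                   # all verb-dependent pairs
    lowest = max(0, row1 + col1 - n)    # smallest feasible top-left cell
    favourable = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(lowest, a + 1)
    )
    return favourable / comb(n, row1)

# Hypothetical counts, not taken from the paper's tables.
associated = fisher_left(3, 1, 1, 3)      # verb and case tend to co-occur
not_associated = fisher_left(1, 3, 3, 1)  # verb and case rarely co-occur
```
For the first toy table the value is 69/70, close to 1, which under the interpretation given in the text signals an association; for the second it is 17/70, a low value that signals no association.</Paragraph>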
      <Paragraph position="13"> These tests are widely used to discover associations between words, but they behave differently depending on the nature of the data. We did not want to make any a priori decision on which measure to use; instead, we aimed to check which test behaved better on our data.</Paragraph>
    </Section>
  </Section>
</Paper>