<?xml version="1.0" standalone="yes"?> <Paper uid="C96-1028"> <Title>Cross-Serial Dependencies Are Not Hard to Process</Title> <Section position="3" start_page="0" end_page="158" type="intro"> <SectionTitle> 2 Preliminaries </SectionTitle> <Paragraph position="0"> To calibrate our discussion, we quickly review the salient terminology from formal language theory and the current understanding of the import for natural languages.</Paragraph> <Section position="1" start_page="0" end_page="157" type="sub_section"> <SectionTitle> 2.1 Terminology </SectionTitle> <Paragraph position="0"> Let $\mathcal{L}_i$ denote the hierarchy of languages generated by the corresponding hierarchy of grammars (according to the usual hierarchy (Hopcroft and Ullman, 1979)). Thus, $\mathcal{L}_0$ denotes the class of languages generated by type 0 grammars. They are characterized by unrestricted grammar production rules. $\mathcal{L}_1$ is the class of languages generated by context sensitive grammars; the sole restriction on production rules in this type of grammar is that the right hand side (RHS) of each rule is at least as long as the left hand side (LHS). $\mathcal{L}_{1.5}$ denotes the class of languages generated by indexed grammars. Gazdar (1985) provides the most perspicuous notation for the restricted forms that production rules may take in such grammars: 1. $A[\ldots] \rightarrow W[\ldots]$ 2. $A[\ldots] \rightarrow B[i, \ldots]$ 3. $A[i, \ldots] \rightarrow W[\ldots]$ Indexed grammars incorporate a notion of stacking; rules of the form in (2) describe push operations, and those of the form in (3) involve pops. Rules of the form (1) are copy operations. The ellipses indicate that the remainder of the stack is passed on from the LHS to each nonterminal (and only the nonterminals) on the RHS. $\mathcal{L}_2$ is the class of context free languages, generated by grammars whose productions are restricted such that the LHS of each is a single nonterminal symbol, and each RHS is a sequence of terminals and nonterminals. 
Finally, the regular languages, $\mathcal{L}_3$, are those produced by regular grammars, characterized by rules that have a single nonterminal symbol on the LHS and, on the RHS, either a terminal symbol or a terminal and a single nonterminal.</Paragraph> <Paragraph position="1"> These classes of languages can be arranged into a hierarchy based on proper containment relations among them: $\mathcal{L}_3 \subset \mathcal{L}_2 \subset \mathcal{L}_{1.5} \subset \mathcal{L}_1 \subset \mathcal{L}_0$ ($\mathcal{L}_0$ is the least restrictive, the most expressive). Aho (1968) shows the existence of languages that are a proper subset of the indexed languages and a proper superset of the context free languages. Joshi et al. (1989) conjecture that there is actually a convergence in expressive power among the 'mildly context sensitive' (MCS) languages, but other work points out exceptions (Savitch, 1989; Vogel and Erjavec, 1994). Since the reduplication languages (Savitch, 1989) are central to the point of this paper, we define them: the languages homomorphic to the set of strings $\{ww \mid w \in \{a,b\}^*\}$. The string duplication languages are not context free, although they are closely related to the string reversal languages ($\{ww^R \mid w \in \{a,b\}^*\}$, where the superscript $R$ indicates the reversal operator), which are context free. The two languages induce different dependency relationships, which are best described as nesting in the context free case and cross-serial in the indexed case. In the rule schemata for indexed grammars, the bracketed material denotes a stack of indices; W denotes a sequence of terminals and nonterminals; A, B denote nonterminals.</Paragraph> <Paragraph position="2"> An important property of each of the language classes is that it is closed under both intersection with regular languages (e.g., the intersection of a context free language and a regular language is no more expressive than a context free language) and homomorphism (e.g., an order preserving map of each symbol in a language to a single element (possibly a string) of a context free language implies that the first language is also context free). 
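To make the contrast between nesting and cross-serial dependencies concrete, the following is a minimal Python sketch. It is an illustration only: the function names are hypothetical and do not come from the paper, and the membership checks simply compare the two halves of a string rather than implementing any grammar-based recognition procedure.

```python
# Illustrative sketch only; all names here are assumptions, not from the paper.

def is_copy(s: str) -> bool:
    """True if s has the form ww (the two halves are identical)."""
    half, odd = divmod(len(s), 2)
    return odd == 0 and s[:half] == s[half:]


def is_mirror(s: str) -> bool:
    """True if s has the form w followed by the reversal of w."""
    half, odd = divmod(len(s), 2)
    return odd == 0 and s[:half] == s[half:][::-1]


def cross_serial_pairs(s: str) -> list:
    """Dependencies in ww: position i is linked to position i + n."""
    n = len(s) // 2
    return [(i, i + n) for i in range(n)]


def nested_pairs(s: str) -> list:
    """Dependencies in w w^R: position i is linked to position 2n - 1 - i."""
    n = len(s) // 2
    return [(i, 2 * n - 1 - i) for i in range(n)]


if __name__ == "__main__":
    print(is_copy("abbabb"), cross_serial_pairs("abbabb"))
    print(is_mirror("abbbba"), nested_pairs("abbbba"))
```

In the pairs produced for a string of the form ww, every link spans the same distance, so successive links cross one another; in the pairs produced for the reversal case, each link encloses the next, so the links nest.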
It is convenient to refer to languages with homomorphisms to $\{ww^R \mid w \in \{a,b\}^*\}$ and $\{ww \mid w \in \{a,b\}^*\}$ as $ww^R$ and $ww$, respectively.</Paragraph> <Paragraph position="3"> Corresponding to each expressivity class and its associated model of computation is the complexity of recognition for that class. Table 1 gives an informal ranking of the language classes with their corresponding worst case recognition complexity on the standard model of computation.</Paragraph> <Paragraph position="4"> Thus, given a context free grammar for $ww^R$ and a string of length $n$, in the worst case it will take an amount of time proportional to the cube of the length of the string to determine whether the string is in $ww^R$ (and identify its structure).</Paragraph> <Paragraph position="5"> While the expressivity hierarchy is useful for differentiating classes of languages in precise terms like worst-case recognition complexity, it is easy to use the hierarchy incorrectly. For instance, it is not valid to conclude that because a language is in a particular language class all subsets of that language are also included in that language class (e.g., $ww^R$ is a proper subset of $w$, yet $w \in \mathcal{L}_3$ and $ww^R \in \mathcal{L}_2$). Also, in most cases the structural descriptions that underlie strings of a language are of more interest than the string sets themselves.</Paragraph> <Paragraph position="6"> For this reason it is useful to distinguish weak and strong containment of a grammar in a language class: e.g., a grammar is weakly context free if its stringset is context free; a grammar is strongly context free if its treeset is also context free.</Paragraph> </Section> <Section position="2" start_page="157" end_page="158" type="sub_section"> <SectionTitle> 2.2 Applicability to Natural Language </SectionTitle> <Paragraph position="0"> Pullum and Gazdar (1982) survey the arguments made up to the time they wrote for the non-context-freeness of natural language. 
The most interesting were those that considered idealizations of linguistic phenomena in terms of the string duplicating language, $ww$. In each case they found the argument flawed: the phenomena in question did not yield languages whose stringsets were homomorphic to the duplication language. Bresnan et al. (1982) argue that Dutch is not strongly context free. Shieber (1985) provides a stringset argument about a dialect of Swiss-German, which has a class of verb phrases with cross-serial dependencies (through case marking) between NPs and their Vs; this establishes even the weak non-context-freeness of natural language because of homomorphism to $ww$. Manaster-Ramer (1987) re-analyzes an argument considered by Pullum and Gazdar (1982) about Dutch and produces a corrected stringset argument that Dutch licenses $a^n b^n c^n$ constructions, which are MCS. No known syntactic phenomenon requires greater than indexed language expressivity.</Paragraph> <Paragraph position="1"> The point of this paper is to emphasize that although a particular Swiss-German dialect renders natural language syntax non-context free, it does not follow that natural languages, including the ones that license cross-serial dependencies, incur the worst case recognition complexity costs for indexed languages. In fact, we argue in the next section that $ww$ is fairly straightforward to process. Essentially, we consider languages $xx$ homomorphic to $ww$, where $x$ can be in either $\mathcal{L}_3$ or $\mathcal{L}_2$, and argue that recognition for $xx$ is no worse than worst case recognition for $\mathcal{L}_3$ if $x \in \mathcal{L}_3$ and no worse than the worst case for $\mathcal{L}_2$ if $x \in \mathcal{L}_2$, even though $xx$ is itself indexed.</Paragraph> </Section> </Section> </Paper>