<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-2040">
  <Title>FINITE-STATE PARSING AND DISAMBIGUATION</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
FINITE-STATE SYNTAX
</SectionTitle>
    <Paragraph position="0"> The actual finite-state syntax consists of three components: * Syntactic disambiguation of word-forms which have multiple interpretations.</Paragraph>
    <Paragraph position="1"> * Determination of clause boundaries.</Paragraph>
    <Paragraph position="2"> * Determining the head-modifier relations of words and the surface syntactic functions of the heads.</Paragraph>
    <Paragraph position="3"> These components are well defined but they depend on each other in a nontrivial way. It is more convenient to write constraint rules for disambiguation and head-modifier relations if one can assume that the clause boundaries are already there. And conversely, the clause boundaries are easier to determine if we have the correct readings of words available. The approach adopted in this paper shows one solution where one may describe the constraints freely, ie. one may act as if the other modules had already done their work.</Paragraph>
    <Paragraph position="4"> Representation of sentences The way we have chosen in order to solve this interdependence, relies on the representation of sentences and the constraint rules. Each sentence is represented as a finite-state machine (fsm) that accepts all possible readings of the sentence. The task of the grammar is to accept the correct reading(s) and exclude incorrect ones. In a reading we include:  * One interpretation of each word-form.</Paragraph>
    <Paragraph position="5"> * One possible type of clause boundary or its absence for each word boundary.</Paragraph>
    <Paragraph position="6"> * One possible syntactic tag for each word.</Paragraph>
    <Paragraph position="7">  An example of a sentence in this representation is given in figure 2 on the next page. In the input sentence each word is represented as an analysis given by the morphological analyzer. The representation consists of one or mo;e interpretations, and each interpretation, in turn, of a base form and a set of morphosyntactic features, eg. &amp;quot;katto&amp;quot; N ELA SG.</Paragraph>
    <Paragraph position="8">  Word and clause boundaries For word boundaries we have four possibilities: @@ A sentence boundary, which occurs only at the very beginning and end of the sentence (and is the only possibility there). @ A normal word boundary (where there is no clause boundary).</Paragraph>
    <Paragraph position="9"> @/ A clause boundary separating two  clauses, where one ends and the other starts.</Paragraph>
    <Paragraph position="10"> @&lt; Beginning of a center embedding, where the preceding clause continues after the embedding has been completed.</Paragraph>
    <Paragraph position="11"> @&gt; End of a center embedding.</Paragraph>
    <Paragraph position="12"> Each word is assumed to belong to exactly one clause. This is taken strictly as a formal basis and implies that words in a subordinate clause only belong to the subordinate clause, not to their main clause. Furthermore, this implies a very flat structure to sentences. Tail recursion is treated as iteration.</Paragraph>
    <Paragraph position="13"> There has been a long dispute on the finite-state property of natural languages. We have observed that one level of proper center embedding is fairly common in our corpuses and that these instances also represent normal and unmarked language usage. We do not insist on the absence of a second or third level of center embedding. We only notice that there are very few examples of these in the corpuses, and even these are less clear examples of normal usage.</Paragraph>
    <Paragraph position="14"> The present version of the finite-state syntax accepts exactly one level of center embedding. The formalism and the implementation can be extended to handle a fixed number of recursive center-embeddings, but we will not pursue it further here.</Paragraph>
    <Paragraph position="15"> Grammatical tags One grammatical tag is attached with each word. Tags for heads indicate the syntactic role of the constituent, eg. MAIN-PRED, SUBJ, OBJ, ADV, PRED-COMP, and tags for modifiers reflect the part of speech of the head and the direction where it is located, eg. No, &lt;-N. This kind of simple tagging induces a kind of a constituent structure to the sentence closely resembling classical parsing.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
GRAMMAR
</SectionTitle>
    <Paragraph position="0"> The proposed grammar constructs no analysis for input sentences. Instead, the grammar excludes the incorrect readings. The ultimate resuit of the parsing is already present as one reading in the initial representation of the sentence which acts as an input to the parser. The result is just hidden among a large number of incorrect readings.</Paragraph>
    <Paragraph position="1"> Input sentences The following is an example of a sentence &amp;quot;kalle voisi uida paljonkin&amp;quot; (English glosses 'Char-</Paragraph>
    <Paragraph position="3"> This is an expression standing for a finite-state network. Alternatives are denoted by lists of the form:</Paragraph>
    <Paragraph position="5"> The input expression lists thus some 256 distinct readings in spite of its concise appearance.</Paragraph>
    <Paragraph position="6"> (The input is here still simplified because of the omission of the syntactic function tags.) Constraint rules Each constraint is formulated as a readable statement expressing some necessity in all grammatical sentences, eg.: NEG .... &gt; NEGV ..</Paragraph>
    <Paragraph position="7"> This constraint says that if we have an occurrence of a feature NEG (denoting a negative form of a verb), then we must also have a feature NEGV (denoting a negation) in the same clause. &amp;quot;..&amp;quot; denotes arbitrary features and stems, excluding clause boundaries except for full embeddings.</Paragraph>
    <Paragraph position="8"> Types of constraint rules Several types of constraint rules are needed: . Tect~nical constraints for feasible clause bracketing.</Paragraph>
    <Paragraph position="9"> - Disambiguation rules (eg. imperatives only in sentence initial positions, negative forms require a negation, AD-A requires an adjective or adverb to follow; etc.) deg Clause boundary constraints (relative pronouns and certain conjunctions are preceded by a boundary, even other boundaries need some explicit clue to justify their possibility).</Paragraph>
    <Paragraph position="10"> o Even/clause may have at most one finite verb and (roughlyspeaking) also must have one finite verb.</Paragraph>
    <Paragraph position="11"> Examples of constraint rules The following rule constrains the occurrence infinitives by requiring that they must be preceded by a verb taking an infinitive complement (signalled by the feature VCHAIN).</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
INFI NOM =&gt; VCHAIN ..
</SectionTitle>
    <Paragraph position="0"> Imperatives should occur only at the beginning of a sentence. A coordination of two or more imperatives is also permitted (if the first imperative is sentence initial):</Paragraph>
    <Paragraph position="2"> (Here COMMA is a feature associated with the punctuation token, and COORD a feature present in coordinating conjunctions.) The following disambiguation rule requires that modifiers of adjectives and adverbs must have their head present: AD-A :=&gt; . @ , \[A I ADV\] For clause boundaries we need a small set of constraint rules. Part of them specify that in certain contexts (such as before relative pronouns or subjunctions) there must be a boundary. The remaining rules specify converse constraints, ie. what kinds of clues must be present in order for a clause boundary to be present.</Paragraph>
    <Paragraph position="3"> All these constraints are ultimately implemented as finite-state machines which discard he corresponding ungrammatical readings. All constraint-automata together leave (hopefully) exactly one grammatical reading, the correct one. The grammar as a whole is logically an intersection of all constraints whereas the process of syntactic analysis corresponds to the intersection of the grammar and the input sentence.</Paragraph>
    <Paragraph position="4"> Output With a very small grammar consisting of about a dozen constraint rules, the input sentence given in the above example is reduCed into the following result:  The formalism and implementation proposed for the finite-state syntax is monotonic in the sense that no information is ever changed. Each constraint simply adds something to the discriminating power of the whole grammar. No constraint rule may ever forbid something that would later on be accepted as an exception.</Paragraph>
    <Paragraph position="5"> This, maybe, puts more strain for the grammar writer but gives us better hope of understanding the grammar we write.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
IMPLEMENTATION
</SectionTitle>
    <Paragraph position="0"> The constraint rules are implemented by using Ran Kaptan's finite-state package. In the preliminary phase constraints are hand-coded into expressions which are then converted into fsm's.</Paragraph>
    <Paragraph position="1"> We have planned to construct a compiler which would automatically translate rules in the proposed formalism into automata like the one used for morphological two-level rules (Karttunen et al. 1987).</Paragraph>
    <Paragraph position="2"> The actual run-time system needs only a very restricted set of finite-state operations, intersection of the sentence and the grammar. The grammar itself might be represented as one large intersection or as several smaller ones which are intersected in parallel. The sentence as a fsm is of a very restricted class of finite-state nelworks which simplifies the run-time process.</Paragraph>
    <Paragraph position="3"> An alternative and obvious framework for implementing constraint rules is Prolog which would be convenient for the testing phase. Prolog would, perhaps, have certain limitations for the production use of such parsers.</Paragraph>
  </Section>
class="xml-element"></Paper>