<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2923"> <Title>LingPars, a Linguistically Inspired, Language-Independent Machine Learner for Dependency Treebanks</Title> <Section position="4" start_page="171" end_page="173" type="metho"> <SectionTitle> 3 System architecture </SectionTitle> <Paragraph position="0"> The point of departure for pattern learning in Ling Pars were the fine-grained part of speech (PoS) tags (POSTAG) and the LEMMA tag. For those languages that did not provide a lemma tag, lower-cased word form was used instead. Also, where available from the FEATS field and not already in tegrated into the PoS tag, the following informa tion was integrated into the PoS tag: a) case, which was regarded as a good predictor for function, as well as a good dependency-indica tor for e.g. preposition- and adnominal attachment b) mood/finiteness, in order to predict subordina tion and verb chaining, especially in the absence of 2 Neither was there time (and for some languages: reading knowledge) to write the necessary converters to and from a normalized standard formalism for each treebank.</Paragraph> <Paragraph position="1"> auxiliary class information in the FEATS field c) pronoun subclass, in order to predict adnomi nal vs. independent function as well as subordinat ing function (relatives and interrogatives) A few treebanks did not classify subordinating words as conjunctions, relatives, interrogatives etc., but lumped them into the general adverb and pronoun classes. Danish is a case in point - here, the treebank classified all non-inflecting words as PoS 'U'3. Our solution, implemented only for Dan ish and Swedish, was to introduce a list of struc ture-words, that would get their PoS appended with an '-S', enabling the learner to distinguish between e.g. &quot;ordinary&quot; ADV, and &quot;structural&quot; ADV-S.</Paragraph> <Section position="1" start_page="171" end_page="172" type="sub_section"> <SectionTitle> 3.1 The parser </SectionTitle> <Paragraph position="0"> In a first round, our parser calculates a preference list of functions and dependencies for each word, examining all possible mother-daughter pairs and n-grams in the sentence (or paragraph). Next, de pendencies are adjusted for function, basically summing up the frequency-, distance- and direc tion-calibrated function-PoS attachment probabil ities for all contextually allowed functions for a given word. Finally, dependency probabilities are weighted using linked probabilities for possible mother-, daughter- and sister-tags in a second pass.</Paragraph> <Paragraph position="1"> The result are 2 arrays, one for possible daugh ter-mother pairs, one for word:function pairs.</Paragraph> <Paragraph position="2"> Values in both arrays are normalized to the 0..1 in terval, meaning that for instance even an originally low probability, long distance attachment will get high values after normalization if there are few or no competing alternatives for the word in question.</Paragraph> <Paragraph position="3"> LingPars then attempts to &quot;effectuate&quot; the de pendency (daughter-mother) array, starting with the - in normalized terms - highest value4. 
<Paragraph position="5"> In principle, one pass through the dependency array would suffice to parse a sentence. However, due to linguistic constraints like the uniqueness principle, barrier tags and "full" heads [5], some words may be left unattached or may create conflicts for their heads. In these cases, weights are reduced for the conflicting functions and increased for all daughter-mother values of the unattached word. The value arrays are then recomputed and rerun. In the case of unattached words, a complete rerun is performed, allowing problematic words to attach before those words that would otherwise have blocked them. In the case of a function conflict (e.g. subject uniqueness), only the words involved in the conflict are rerun. If no conflict-free solution is found after 19 runs, barrier, uniqueness and projectivity constraints are relaxed for a last run [6].</Paragraph>
<Paragraph position="6"> Finally, the daughter sequence for each head (with the head itself inserted) is checked against the probability of its function sequence (learned not from n-grams proper, but from daughter sequences in the training corpus). For instance, the constituents of a clause would make up such a sequence and allow the parser to correct a sequence like SUBJ VFIN ARG2 ARG1 into SUBJ VFIN ARG1 ARG2, where ARG1 and ARG2 are object functions with a preferred order (for the language learned) of ARG1 ARG2.</Paragraph>
</Section>
<Section position="2" start_page="172" end_page="172" type="sub_section">
<SectionTitle> 3.2 Learning functions (deprels) </SectionTitle>
<Paragraph position="0"> LingPars computes function probabilities (Vf, function value) at three levels. First, each lemma and PoS is assigned local (context-free) probabilities for all possible functions. Second, the probability of a given function occurring at a specific place in a function n-gram (func-gram, example (a)) is calculated, with n between 2 and 6. The learner only used endocentric func-grams, marking which of the function positions had their head within the func-gram. If no func-gram supported a given function, its probability for the word in question was set to zero. At the third level, for each endocentric n-gram of word classes (PoS), the probability of a given function occurring at a given position in the n-gram (position 2 in example (b)) was computed. Here, only the longest possible n-grams were used by the parser, and the first and last positions of the n-gram were used only to provide context, not to assign function probabilities.</Paragraph>
</Section>
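As a rough illustration of how the three levels might interact, the sketch below combines a local lemma/PoS table, a func-gram support test and longest-first PoS n-gram lookups. The table names, the back-off order and in particular the multiplicative combination are assumptions; the paper does not spell out how the levels are merged into Vf.

```python
# Hypothetical tables: 'local_prob[(lemma, pos)]' gives context-free function
# probabilities (level 1); 'funcgram_ok(func, i)' tests endocentric func-gram
# support (level 2); 'posgram_prob[(gram, pos_in_gram, func)]' holds PoS
# n-gram statistics (level 3). All three are stand-ins, not the real format.

def function_values(lemma, pos, i, sent_pos, local_prob, funcgram_ok, posgram_prob):
    values = {}
    for func, p_local in local_prob.get((lemma, pos), {}).items():
        # Level 2: functions unsupported by any func-gram get probability zero.
        if not funcgram_ok(func, i):
            values[func] = 0.0
            continue
        # Level 3: use only the longest PoS n-gram (n = 6 down to 2) in which
        # word i occupies a non-peripheral (context-only edges) position.
        p = 0.0
        for n in range(6, 1, -1):
            for start in range(max(0, i - n + 2), i):  # keeps i interior
                gram = tuple(sent_pos[start:start + n])
                if len(gram) == n and start < i < start + n - 1:
                    p = max(p, posgram_prob.get((gram, i - start, func), 0.0))
            if p > 0.0:
                break  # longest supported n-gram wins
        values[func] = p_local * (p if p > 0.0 else 1.0)
    return values
```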
<Section position="3" start_page="172" end_page="173" type="sub_section">
<SectionTitle> 3.3 Learning dependencies </SectionTitle>
<Paragraph position="0"> In a rule-based Constraint Grammar system, dependency would be expressed as attachment of functions to forms (i.e. subject to verb, or modifier to adjective). However, with empty deprel fields, LingPars cannot use functions directly, only their probabilities. Therefore, in a first pass, it computes the probability for the whole possible attachment matrix for a sentence, using learned mother- and daughter-normalized frequencies for attachments of type (a) PoS-PoS, (b) PoS-Lex, (c) Lex-PoS and (d) Lex-Lex, taking into account also the learned directional and distance preferences. Each matrix cell is then filled with a value Vfa ("function attachment value") - the sum of the individual normalized probabilities of all possible functions for that particular daughter given that particular mother, multiplied with the pre-established, attachment-independent Vf value for that token-function combination.</Paragraph>
<Paragraph position="1"> Inspired by the BARRIER conditions in CG rule contexts, our learner also records the frequency of those PoS and those functions (deprels) that may appear between a dependent of PoS A and a head of PoS B. The parser then regards all other, non-registered interfering PoS or functions as blocking tokens for a given attachment pair, reducing its attachment value by a factor of 1/100.</Paragraph>
<Paragraph position="2"> In a second pass, the attachment matrix is calibrated using the relative probabilities for dependent daughters, dependent sisters and the head's own mother. This way, the probabilities of object and object complement sisters will enhance each other, and - given the fact that treebanks differ as to which element of a verb chain arguments attach to - a verbal head can be treated differently depending on whether or not it has a high probability for another verb (with auxiliary, modal or main verb function) as mother or daughter.</Paragraph>
<Paragraph position="3"> Finally, as for functions, n-grams are used to calculate attachment probabilities. For each endocentric PoS n-gram (of length 6 or less), the probabilities of all treebank-supported PoS:function chains and their dependency arcs are learned, and the value for an attachment word pair occurring in the chain is corrected using both the chain/n-gram probability and the Vf value for the function associated with the dependent in that particular chain. For contextual reasons, arcs central to the n-gram are weighted higher than peripheral arcs [7].</Paragraph>
<Paragraph position="4"> [7] Due to BARRIER constraints, or simply because of insufficient training data in the face of a very detailed tag set, it may be impossible to assign all words n-gram-supported functions or dependencies. In the former case, local function probabilities are used; in the latter, attachment is computed as function-PoS probability only, using the most likely function.</Paragraph>
</Section>
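The first pass and the BARRIER penalty can be pictured as follows; this is a hedged sketch, where `attach_prob` and `allowed_between` stand in for the learned frequency tables, `vf` holds the per-word function values from 3.2, and the exact factorization is an assumption.

```python
# Sketch of the first-pass matrix: Vfa(d, h) = sum over possible functions f
# of P(attach | d, h, direction, distance) * Vf(d, f), with a 1/100 penalty
# when an unregistered PoS intervenes (the BARRIER heuristic). 'words' is a
# list of dicts with at least a "pos" key; all names are illustrative.

def attachment_matrix(words, vf, attach_prob, allowed_between):
    n = len(words)
    vfa = [[0.0] * n for _ in range(n)]
    for d in range(n):
        for h in range(n):
            if d == h:
                continue
            direction = "left" if h < d else "right"
            dist = abs(h - d)
            total = 0.0
            for func, v in vf[d].items():
                p = attach_prob(words[d], words[h], func, direction, dist)
                total += p * v
            # BARRIER: any intervening PoS not registered as permitted
            # between this daughter/head PoS pair blocks the attachment.
            between = {w["pos"] for w in words[min(d, h) + 1:max(d, h)]}
            if not between <= allowed_between(words[d]["pos"], words[h]["pos"]):
                total /= 100.0
            vfa[d][h] = total
    return vfa
```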
<Section position="4" start_page="173" end_page="173" type="sub_section">
<SectionTitle> 3.4 Non-projectivity and other language-specific problems </SectionTitle>
<Paragraph position="0"> As a general rule, non-projective arcs were only allowed if no other, projective head could be found for a given word. However, linguistic knowledge suggests that non-projective arcs are particularly likely in connection with verb-chain dependencies, where subjects attach to the finite verb but objects to the non-finite verb, which can create crossing arcs in the case of object fronting, chain inversion etc. Since we also noted an error risk from arguments being attached to the closest verb in a chain rather than the linguistically correct one [8], we chose to introduce systematic, after-parse raising of certain pre-defined arguments from the auxiliary to the main verb. This feature needs language-dependent parameters, and time constraints only allowed its implementation for Danish, Spanish, Portuguese and Czech. For Dutch, we also discovered word-class-related projectivity errors that could be remedied by exempting certain FEATS classes from the parser's general projectivity constraint altogether (prep-voor and V-hulp) [9].</Paragraph>
<Paragraph position="1"> In order to improve root accuracy, top-node probability was set to zero for verbs with a safe subordinator dependent. However, even those treebanks descriptively supporting this did not all PoS-mark subordinators. Therefore, FEATS information was used or, as a last resort - for Danish and Swedish - word forms.</Paragraph>
<Paragraph position="2"> A third language-specific error source was punctuation, because some treebanks (cz, sl, es) allowed punctuation as heads. Also, experiments for the Germanic and Romance languages showed that performance decreased when punctuation was allowed as a BARRIER, but increased when a fine-grained punctuation PoS [10] was included in function and dependency n-grams.</Paragraph>
<Paragraph position="3"> [8] Single verbs being more frequent than verb chains, the learner tended to generalize close attachment, and even (grand)daughter and (grand)mother conditions could not entirely remedy this problem.</Paragraph>
<Paragraph position="4"> [9] Though desirable, there was no time to implement this for other languages.</Paragraph>
<Paragraph position="5"> [10] Only for Spanish and Swedish was there a subdivision of punctuation PoS, so we had to supply this information in all other cases by adding token information to the POSTAG field.</Paragraph>
</Section>
</Section>
</Paper>