<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3207">
  <Title>Bilingual Parsing with Factored Estimation: Using English to Parse Korean</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Bilingual parsing
</SectionTitle>
    <Paragraph position="0"> The joint model used by our bilingual parser is an instance of a stochastic bilingual multitext grammar (2-MTG), formally defined by Melamed (2003). The 2-MTG formalism generates two strings such that each syntactic constituent--including individual words--in one side of the bitext corresponds either to a constituent in the other side or to [?].</Paragraph>
    <Paragraph position="1"> Melamed defines bilexicalized MTG (L2MTG), which is a synchronous extension of bilexical grammars such as those described in Eisner and Satta (1999) and applies the latter's algorithmic speedups to L2MTG-parsing.</Paragraph>
    <Paragraph position="2"> Our formalism is not a precise fit to either unlexicalized MTG or L2MTG since we posit lexical dependency structure only in one of the languages (English). The primary rationale for this is that we are dealing with only a small quantity of labeled data in language L and therefore do not expect to be able to accurately estimate its lexical affinities. Further, synchronous parsing is in practice computationally expensive, and eliminating lexicalization on one side reduces the run-time of the parser from O(n8) to O(n7). Our parsing algorithm is a simple transformation of Melamed's R2D parser that eliminates head information in all Korean parser items.</Paragraph>
    <Paragraph position="3"> The model event space for our stochastic &amp;quot;halfbilexicalized&amp;quot; 2-MTG consists of rewrite rules of the following two forms, with English above and L below:  where upper-case symbols are nonterminals and lower-case symbols are words (potentially [?]). One approach to assigning a probability to such a rule is to make an independence assumption, for example:</Paragraph>
    <Paragraph position="5"> There are two powerful reasons to model the bilingual grammar in this factored way. First, we know of no treealigned corpora from which bilingual rewrite probabilities could be estimated; this rules out the possibility of supervised training of the joint rules. Second, separating the probabilities allows separate estimation of the probabilities--resulting in two well-understood parameter estimation tasks which can be carried out independently.1 null This factored modeling approach bears a strong resemblance to the factored monolingual parser of Klein and Manning (2002), which combined an English dependency model and an unlexicalized PCFG. The generative model used by Klein and Manning consisted of multiplying the two component models; the model was therefore deficient.</Paragraph>
    <Paragraph position="6"> We go a step farther, replacing the deficient generative model with a log-linear model. The underlying parsing algorithm remains the same, but the weights are no longer constrained to sum to one. (Hereafter, we assume weights are additive real values; a log-probability is an example of a weight.) The weights may be estimated using discriminative training (as we do for the English model, SS3.1) or as if they were log-probabilities, using smoothed maximum likelihood estimation (as we do for the Korean model, SS3.3). Because we use this model only for inference, it is not necessary to compute a partition function for the combined log-linear model.</Paragraph>
    <Paragraph position="7"> In addition to the two monolingual syntax models, we add a word-to-word translation model to the mix. In this paper we use a translation model to induce only a single best word matching, but in principle the translation model could be used to weight all possible word-word links, and the parser would solve the joint alignment/parsing problem.2 As a testbed for our experiments, the Penn Korean Treebank (KTB; Han et al., 2002) provides 5,083 Korean constituency trees along with English translations and their trees. The KTB also analyzes Korean words into their component morphemes and morpheme tags, which allowed us to train a morphological disambiguation model.</Paragraph>
    <Paragraph position="8"> To make the most of this small corpus, we performed all our evaluations using five-fold cross-validation. Due to the computational expense of bilingual parsing, we 1Of course, it might be the case that some information is known about the relationship between the two languages. In that case, our log-linear framework would allow the incorporation of additional bilingual production features.</Paragraph>
    <Paragraph position="9"> 2Although polynomial, we found this to be too computationally demanding to do with our optimal parser in practice, but with pruning and/or A[?] heuristics it is likely to be feasible.</Paragraph>
    <Paragraph position="10"> produced a sub-corpus of the KTB limiting English sentence length to 10 words, or 27% of the full data. We then randomized the order of sentences and divided the data into five equal test sets of 280 sentences each ([?]1,700 Korean words, [?]2,100 English words). Complementing each test set, the remaining data were used for training sets of increasing size to simulate various levels of data scarcity.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Parameter estimation
</SectionTitle>
    <Paragraph position="0"> We now describe parameter estimation for the four component models that combine to make our full system (Table 1).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 English syntax model
</SectionTitle>
      <Paragraph position="0"> Our English syntax model is based on weighted bilexical dependencies. The model predicts the generation of a child (POS tag, word) pair, dependent upon its parent (tag, word) and the tag of the parent's most recent child on the same side (left or right). These events correspond quite closely to the parser described by Eisner's (1996) model C, but instead of the rules receiving conditional probabilities, we use a log-linear model and allow arbitrary weights. The model does not predict POS tags; it assumes they are given, even in test.</Paragraph>
      <Paragraph position="1"> Note that the dynamic program used for inference of bilexical parses is indifferent to the origin of the rule weights; they could be log-probabilities or arbitrary numbers, as in our model. The parsing algorithm need not change to accommodate the new parameterization.</Paragraph>
      <Paragraph position="2"> In this model, the probability of a (sentence, tree) pair (E,T) is given by:</Paragraph>
      <Paragraph position="4"> where th are the model parameters and f is a vector function such that fi is equal to the number of times a feature (e.g., a production rule) fires in (E,T).</Paragraph>
      <Paragraph position="5"> Parameter estimation consists of selecting weights th to maximize the conditional probability of the correct parses given observed sentences:3</Paragraph>
      <Paragraph position="7"> (2) Another important advantage of moving to log-linear models is the simple handling of data sparseness. The feature templates used by our model are shown in Table 2. The first feature corresponds to the fully-described child-generation event; others are similar but less informative. These &amp;quot;overlapping&amp;quot; features offer a kind of backoff, so that each child-generation event's weight receives a contribution from several granularities of description. null Feature selection is done by simple thresholding: if a feature is observed 5 times or more in the training set, its weight is estimated; otherwise its weight is locked at 3This can be done using iterative scaling or gradient-based numerical optimization methods, as we did.</Paragraph>
      <Paragraph position="8"> Model Formalism Estimation Role English syntax (SS3.1) bilexical dependency discriminative estimation combines with Korean grammar syntax for bilingual parsing Korean morphology (SS3.2) two-sequence discriminative estimation best analysis used as input trigram model over a lattice to TM training and to parsing Korean syntax (SS3.3) PCFG smoothed MLE combines with English syntax for bilingual parsing Translation model (SS3.4) IBM models 1-4, unsupervised estimation best analysis used as both directions (approximation to EM) input to bilingual parsing  dency parser. TX is a tag and WX is a word. P indicates the parent, A the previous child, and C the nextgenerated child. D is the direction (left or right). The last two templates correspond to stopping.</Paragraph>
      <Paragraph position="9"> 0. If a feature is never seen in training data, we give it the same weight as the minimum-valued feature from the training set (thmin). To handle out-of-vocabulary (OOV) words, we treat any word seen for the first time in the final 300 sentences of the training corpus as OOV. The model is smoothed using a Gaussian prior with unit variance on every weight.</Paragraph>
      <Paragraph position="10"> Because the left and right children of a parent are independent of each other, our model can be described as a weighted split head automaton grammar (Eisner and Satta, 1999). This allowed us to use Eisner and Satta's O(n3) parsing algorithm to speed up training.4 This speedup could not, however, be applied to the bilingual parsing algorithm since a split parsing algorithm will preclude inference of certain configurations of word alignments that are allowed by a non-split parser (Melamed, 2003).</Paragraph>
      <Paragraph position="11"> We trained the parser on sentences of 15 words or fewer in the WSJ Treebank sections 01-21.5 99.49% dependency attachment accuracy was achieved on the training set, and 76.68% and 75.00% were achieved on sections 22 and 23, respectively. Performance on the English side of our KTB test set was 71.82% (averaged across 5 folds, s = 1.75).</Paragraph>
      <Paragraph position="12"> This type of discriminative training has been applied to log-linear variants of hidden Markov models (Lafferty et al., 2001) and to lexical-functional grammar (Johnson et al., 1999; Riezler et al., 2002). To our knowledge, it has not been explored for context-free models (including bilexical dependency models like ours). A review  of discriminative approaches to parsing can be found in Chiang (2003).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Korean morphological analysis
</SectionTitle>
      <Paragraph position="0"> A Korean word typically consists of a head morpheme followed by a series of closed-class dependent morphemes such as case markers, copula, topicalizers, and conjunctions. Since most of the semantic content resides in the leading head morpheme, we eliminate for word alignment all trailing morphemes, which reduces the KTB's vocabulary size from 10,052 to 3,104.</Paragraph>
      <Paragraph position="1"> Existing morphological processing tools for many languages are often unweighted finite-state transducers that encode the possible analyses for a surface form word.</Paragraph>
      <Paragraph position="2"> One such tool, klex, is available for Korean (Han, 2004).</Paragraph>
      <Paragraph position="3"> Unfortunately, while the unweighted FST describes the set of valid analyses, it gives no way to choose among them. We treat this as a noisy channel: Korean morpheme-tag pairs are generated in sequence by some process, then passed through a channel that turns them into Korean words (with loss of information). The channel is given by the FST, but without any weights. To select the best output, we model the source process.</Paragraph>
      <Paragraph position="4"> We model the sequence of morphemes and their tags as a log-linear trigram model. Overlapping trigram, bigram, and unigram features provide backoff information to deal with data sparseness (Table 3). For each training sentence, we used the FST-encoded morphological dictionary to construct a lattice of possible analyses. The lattice has a &amp;quot;sausage&amp;quot; form with all paths joining between each word.</Paragraph>
      <Paragraph position="5"> We train the feature weights to maximize the weight of the correct path relative to all paths in the lattice. In contrast, Lafferty et al. (2001) train to maximize the the probability of the tags given the words. Over training sentences, maximize:</Paragraph>
      <Paragraph position="7"> where Ti is the correct tagging for sentence i, Mi is the correct morpheme sequence.</Paragraph>
      <Paragraph position="8"> There are a few complications. First, the coverage of the FST is of course not universal; in fact, it cannot analyze 4.66% of word types (2.18% of tokens) in the KTB.  ogy model. Tx is a tag, Mx is a morpheme.</Paragraph>
      <Paragraph position="9"> We tag such words as atomic common nouns (the most common tag). Second, many of the analyses in the KTB are not admitted by the FST: 21.06% of correct analyses (by token) are not admitted by the FST; 6.85% do not have an FST analysis matching in the first tag and morpheme, 3.63% do not have an FST analysis matching the full tag sequence, and 1.22% do not have an analysis matching the first tag. These do not include the 2.18% of tokens with no analysis at all. When this happened in training, we added the correct analysis to the lattice. To perform inference on new data, we construct a lattice from the FST (adding in any analyses of the word seen in training) and use a dynamic program (essentially the Viterbi algorithm) to find the best path through the lattice. Unseen features are given the weight thmin. Table 4 shows performance on ambiguous tokens in training and test data (averaged over five folds).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Korean syntax model
</SectionTitle>
      <Paragraph position="0"> Because we are using small training sets, parameter estimates for a lexicalized Korean probabilistic grammar are likely to be highly unreliable due to sparse data. Therefore we use an unlexicalized PCFG. Because the POS tags are given by the morphological analyzer, the PCFG need not predict words (i.e., head morphemes), only POS tags.</Paragraph>
      <Paragraph position="1"> Rule probabilities were estimated with MLE. Since only the sentence nonterminal S was smoothed (using add-0.1), the grammar could parse any sequence of tags but was relatively sparse, which kept bilingual run-time down. 6 When we combine the PCFG with the other models to do joint bilingual parsing, we simply use the logs of the PCFG probabilities as if they were log-linear weights. A PCFG treated this way is a perfectly valid log-linear model; the exponentials of its weights just happen to satisfy certain sum-to-one constraints.</Paragraph>
      <Paragraph position="2"> In the spirit of joint optimization, we might have also combined the Korean morphology and syntax models into one inference task. We did not do this, largely out of concerns over computational expense (see the discussion of translation models in SS3.4). This parser, independent of the bilingual parser, is evaluated in SS4.</Paragraph>
      <Paragraph position="3">  terminals gave indistinguishable results on monolingual parsing. Alternatively, we could have trained the PCFG discriminatively (treating the PCFG rules as log-linear features), but because our training sets are small we do not expect such training to be very different from training the PCFG as a generative model with probabilities.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Translation model
</SectionTitle>
      <Paragraph position="0"> In our bilingual parser, the English and Korean parses are mediated through word-to-word translational correspondence links. Unlike the syntax models, the translation models were trained without the benefit of labeled data.</Paragraph>
      <Paragraph position="1"> We used the GIZA++ implementation of the IBM statistical translation models (Brown et al., 1993; Och and Ney, 2003).</Paragraph>
      <Paragraph position="2"> To obtain reliable word translation estimates, we trained on a bilingual corpus in addition to the KTB training set. The Foreign Broadcast Information Service dataset contains about 99,000 sentences of Korean and 72,000 of English translation. For our training, we extracted a relatively small parallel corpus of about 19,000 high-confidence sentence pairs.</Paragraph>
      <Paragraph position="3"> As noted above, Korean's productive agglutinative morphology leads to sparse estimates of word frequencies. We therefore trained our translation models after replacing each Korean word with its first morpheme stripped of its closed-class dependent morphemes, as described in SS3.2.</Paragraph>
      <Paragraph position="4"> The size of the translation tables made optimal bilingual parsing prohibitive by exploding the number of possible analyses. We therefore resorted to using GIZA++'s hypothesized alignments. Since the IBM models only hypothesize one-to-many alignments from target to source, we trained using each side of the bitext as source and target in turn. We could then produce two kinds of alignment graphs by taking either the intersection or the union of the links in the two GIZA++ alignment graphs. All words not in the resulting alignment graph are set to align to [?].</Paragraph>
      <Paragraph position="5"> Our bilingual parser deals only in one-to-one alignments (mappings); the intersection graph yields a mapping. The union graph yields a set of links which may permit different one-to-one mappings. Using the union graph therefore allows for flexibility in the word alignments inferred by the bilingual parser, but this comes at computational expense (because more analyses are permitted). null Even with over 20,000 sentence pairs of training data, the hypothesized alignments are relatively sparse. For the intersection alignments, an average of 23% of non-punctuation Korean words and 17% of non-punctuation English words have a link to the other language. For the union alignments, this improves to 88% for Korean and 22% for English.</Paragraph>
      <Paragraph position="6"> A starker measure of alignment sparsity is the accuracy of English dependency links projected onto Korean. Following Hwa et al. (2002), we looked at dependency links in the true English parses from the KTB where both the dependent and the head were linked to words on the Korean side using the intersection alignment. Note that Hwa et al. used not only the true English trees, but also hand-produced alignments. If we hypothesize that, if English words i and j are in a parent-child relationship, then so are their linked Korean words, then we infer an incomplete dependency graph for the Korean sentences whose precision is around 49%-53% but whose recall is  standard deviations) are shown over five-fold cross-validation. Over 65% of word tokens are ambiguous. The accuracy of the first tag in each word affects the PCFG and the accuracy of the first morpheme affects the translation model (under our aggressive morphological lemmatization).</Paragraph>
      <Paragraph position="7"> an abysmal 2.5%-3.6%. 7</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> Having trained each part of the model, we bring them together in a unified dynamic program to perform inference on the bilingual text as described in SS2. In order to experiment easily with different algorithms, we implemented all the morphological disambiguation and parsing models in this paper in Dyna, a new language for weighted dynamic programming (Eisner et al., 2004).</Paragraph>
    <Paragraph position="1"> For parameter estimation, we used the complementary DynaMITE tool. Just as CKY parsing starts with words in its chart, the dynamic program chart for the bilingual parser is seeded with the links given in the hypothesized word alignment.</Paragraph>
    <Paragraph position="2"> All our current results are optimal under the model, but as we scale up to more complex data, we might introduce A[?] heuristics or, at the possible expense of optimality, a beam search or pruning techniques. Our agenda discipline is uniform-cost search, which guarantees that the first full parse discovered will be optimal--if none of the weights are positive. In our case we are maximizing sums of negative weights, as if working with log probabilities.8 null When evaluating our parsing output against the test data from the KTB, we do not claim credit for the single outermost bracketing or for unary productions. Since unary productions do not translate well from language to language (Hwa et al., 2002), we collapse them to their lower nodes.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Baseline systems
</SectionTitle>
      <Paragraph position="0"> We compare our bilingual parser to several baseline systems. The first is the Korean PCFG trained on the small 7We approximated head-words in the Korean gold-standard trees by assuming all structures to be head-final, with the exception of punctuation. That is, the head-words of sister constituents will elect the right-most, non-punctuation word among them as the head.</Paragraph>
      <Paragraph position="1"> 8In fact the English syntax model is not constrained to have nonpositive weights, but we decrement every parameter by thmax. For a given sentence, this will reduce every possible parse's weight by a constant value, since the same number of features fire in every parse; thus, the classification properties of the parser are not affected.</Paragraph>
      <Paragraph position="2"> KTB training sets, as described in SS3.3. We also consider Wu's (1997) stochastic inversion transduction grammar (SITG) as well as strictly left- and right-branching trees.</Paragraph>
      <Paragraph position="3"> We report the results of five-fold cross-validation with the mean and standard deviation (in parentheses).</Paragraph>
      <Paragraph position="4"> Since it is unlexicalized, the PCFG parses sequences of tags as output by the morphological analysis model.</Paragraph>
      <Paragraph position="5"> By contrast, we can build translation tables for the SITG directly from surface words--and thus not use any labeled training data at all--or from the sequence of head morphemes. Experiments showed, however, that the SITG using words consistently outperformed the SITG using morphemes. We also implemented Wu's treetransformation algorithm to turn full binary-branching SITG output into flatter trees. Finally, we can provide extra information to the SITG by giving it a set of English bracketings that it must respect when constructing the joint tree. To get an upper bound on performance, we used the true parses from the English side of the KTB.</Paragraph>
      <Paragraph position="6"> Only the PCFG, of course, can be evaluated on labeled bracketing (Table 6). Although labeled precision and recall on test data generally increase with more training data, the slightly lower performance at the highest training set size may indicate overtraining of this simple model. Unlabeled precision and recall show continued improvement with more Korean training data.</Paragraph>
      <Paragraph position="7"> Even with help from the true English trees, the unsupervised SITGs underperform PCFGs trained on as few as 32 sentences, with the exception of unlabeled recall in one experiment. It seems that even some small amount of knowledge of the language helps parsing. Crossing brackets for the flattened SITG parses are understandably lower.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Bilingual parsing
</SectionTitle>
      <Paragraph position="0"> The output of our bilingual parser contains three types of constituents: English-only (aligned to [?]), Korean-only (aligned to [?]), and bilingual. The Korean parse induced by the Korean-only and bilingual constituents is filtered so constituents with intermediate labels (generated by the binarization process) are eliminated.</Paragraph>
      <Paragraph position="1"> A second filter we consider is to keep only the (remaining) bilingual constituents corresponding to an English head word's maximal span. This filter will eliminate constituents whose English correspondent is a head word with some (but not all) of its dependents. Such partial English constituents are by-products of the parsing and do not correspond to the modeled syntax.</Paragraph>
      <Paragraph position="2"> With good word alignments, the English parser can help disambiguate Korean phrase boundaries and overcome erroneous morphological analyses (Table 5). Results without and with the second filter are shown in Table 7. Because larger training datasets lead to larger PCFGs (with more rules), the grammar constant increases. Our bilingual parser implementation is on the cusp of practicality (in terms of memory requirements); when the grammar constant increased, we were unable to parse longer sentences. Therefore the results given for bilingual parsing are on reduced test sets, where a length filter was applied: sentences with |E |+ |F |&gt; t were removed, for varying values of t.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Discussion
</SectionTitle>
      <Paragraph position="0"> While neither bilingual parser consistently beats the PCFG on its own, they offer slight, complementary improvements on small training datasets of 32 and 64 sentences (Table 7). The bilingual parser without the English head span filter gives a small recall improvement on average at similar precision. Neither of these differences is significant when measured with a paired-sample t-test.</Paragraph>
      <Paragraph position="1"> In contrast, the parser with the English head span filter sacrifices significantly on recall for a small but significant gain in precision at the 0.01 level. Crossing brackets at all levels are significantly lower with the English head span filter. We can describe this effect as a filtering of Korean constituents by the English model and word alignments. Constituents that are not strongly evident on the English side are simply removed. On small training datasets, this effect is positive: although good constituents are lost so that recall is poor compared to the PCFG, precision and crossing brackets are improved.</Paragraph>
      <Paragraph position="2"> As one would expect, as the amount of training data increases, the advantage of using a bilingual parser vanishes--there is no benefit from falling back on the English parser and word alignments to help disambiguate the Korean structure. Since we have not pruned our search space in these experiments, we can be confident that all variations are due to the influence of the translation and English syntax models.</Paragraph>
      <Paragraph position="3"> Our approach has this principal advantage: the various morphology, parsing, and alignment components can be improved or replaced easily without needing to retrain the other modules. The low dependency projection results (SS3.4), in conjunction with our modest overall gains, indicate that the alignment/translation model should receive the most attention. In all the bilingual experiments, there is a small positive correlation (0.3), for sentences at each length, between the proportion of Korean words aligned to English and measures of parsing accuracy. Improved English parsers--such as Collins' models--have also been implemented in Dyna, the dynamic programming framework used here (Eisner et al., 2004).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>