<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1206">
  <Title>Automated Multiword Expression Prediction for Grammar Engineering</Title>
  <Section position="4" start_page="36" end_page="37" type="metho">
    <SectionTitle>
2 Multiword Expressions
</SectionTitle>
    <Paragraph position="0"> The term Multiword Expressions (MWEs) has been used to describe expressions whose syntactic or semantic properties cannot be derived from their parts ((Sag et al., 2002), (Villavicencio et al., 2005)), covering a large number of related but distinct phenomena, such as phrasal verbs (e.g. come along), nominal compounds (e.g. frying pan), institutionalised phrases (e.g. bread and butter), and many others. They are used frequently in language; for English, Jackendoff (1997) estimates the number of MWEs in a speaker's lexicon to be comparable to the number of single words. This is reflected in several existing grammars and lexical resources, where almost half of the entries are Multiword Expressions. However, due to their heterogeneous characteristics, MWEs present a tough challenge for both linguistic and computational work (Sag et al., 2002). Some MWEs are fixed and do not present internal variation, such as ad hoc, while others allow different degrees of internal variability and modification, such as touch a nerve (touch/find a nerve) and spill beans (spill several/musical/mountains of beans). In terms of semantics, some MWEs are more opaque in their meaning (e.g. to kick the bucket meaning to die), while others have more transparent meanings that can be inferred from the words in the MWE (e.g. eat up, where the particle up adds a completive sense to eat). Providing a unified account for the detection of these distinct but related phenomena is therefore a real challenge for NLP systems.</Paragraph>
  </Section>
  <Section position="5" start_page="37" end_page="41" type="metho">
    <SectionTitle>
3 Detection of Errors: Overview
</SectionTitle>
    <Paragraph position="0"> van Noord (2004) reports on various errors that have been discovered semi-automatically for the Dutch Alpino Grammar (Bouma et al., 2001), using the Twente Nieuws Corpus. The idea pursued by van Noord (2004) is to locate those n-grams in the input that might be the cause of parsing failure. By processing a huge amount of data, the parsability metrics briefly presented in section 1 have been used to successfully locate various errors introduced by the tokenizer, erroneous/incomplete lexical descriptions, frozen expressions with idiosyncratic syntax, and incomplete grammatical descriptions. However, recovering from these errors has been shown to still require significant effort from the grammar developer. Moreover, no concrete data is given about the distribution of the different types of errors discovered.</Paragraph>
    <Paragraph position="1"> As also mentioned before, among the n-grams that usually cause parse failures there is a large number of MWEs missing from the lexicon, such as phrasal verbs, collocations, compound nouns and frozen expressions (e.g. by and large, centre of attention, put forward by, etc.).</Paragraph>
    <Paragraph position="2"> For the purpose of MWE detection, we are interested in seeing what the major types of error are for a typical large-scale deep grammar. In this context, we have run the error mining experiment reported by van Noord with the English Resource Grammar (ERG; (Flickinger, 2000)), a large-scale HPSG grammar for English (we have used the January 2006 release), and the British National Corpus 2.0 (BNC; (Burnard, 2000)).</Paragraph>
    <Paragraph position="3"> We have used a subset of the BNC written component. The sentences in this collection contain no more than 20 words and only ASCII characters, which amounts to about 1.8M distinct sentences.</Paragraph>
    <Paragraph position="5"> These sentences have then been fed into an efficient HPSG parser (PET; (Callmeier, 2000)) with the ERG loaded. The parser has been configured with a maximum edge number limit of 100K and has run in best-only mode, so that it does not exhaustively find all possible parses. The result for each sentence is marked as one of the following four cases:
* P means at least one parse is found for the sentence;
* L means the parser halted after the morphological analysis, having not been able to construct any lexical item for an input token;
* N means the search finished normally and no parse was found for the sentence;
* E means the search finished abnormally by exceeding the edge number limit.</Paragraph>
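As a minimal sketch of this four-way marking (the fields on the parser result object are hypothetical stand-ins, not PET's actual API):

```python
# A minimal sketch of the P/L/N/E marking described above. The fields on
# `result` (lexical_gap, edge_limit_exceeded, num_parses) are hypothetical
# stand-ins for whatever the parser actually reports.
from enum import Enum

class Outcome(Enum):
    P = "parsed"       # at least one parse found
    L = "lexical_gap"  # no lexical item could be built for some token
    N = "no_parse"     # search finished normally, no parse found
    E = "edge_limit"   # search aborted after exceeding the edge limit

def mark(result) -> Outcome:
    if result.lexical_gap:
        return Outcome.L
    if result.edge_limit_exceeded:  # 100K edges in our configuration
        return Outcome.E
    return Outcome.P if result.num_parses > 0 else Outcome.N
```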
    <Paragraph position="6"> It is interesting to notice that when the ambiguity packing mechanism (Oepen and Carroll, 2000) is used and the unpacking is turned off (for the error mining experiment only the parsability check is necessary; there is no need to record the exact parses), E does not occur at all for our test corpus. Running the parsability check over the entire collection of sentences has taken the parser less than 2 days on a 64-bit machine with a 3GHz CPU. The results are shown in Table 1.</Paragraph>
    <Paragraph position="7"> From the results shown in Table 1, one can see that the ERG has full lexical span for less than half of the sentences. Of these sentences, about 80% are successfully parsed. These numbers show that grammar coverage has improved significantly compared to the results reported by Baldwin et al. (2004) and Zhang and Kordoni (2006), mainly owing to the increase in the size of the lexicon and the new rules that handle punctuation and fragments. Obviously, L indicates unknown words in the input sentence. But for N, it is not clear where and what kind of error has occurred.</Paragraph>
    <Paragraph position="8"> In order to pinpoint the errors, we have used the error mining techniques proposed by van Noord (2004) on the grammar and corpus. We have taken the sentences marked as N (the errors in L sentences are already determined) and calculated the word sequence parsabilities against the sentences marked as P. The frequency cut-off is set to 5. The whole process has taken no more than 20 minutes, resulting in parsability scores for a total of 35K n-grams (word sequences). The distribution of n-grams in length with parsability below 0.1 is shown in Table 2.</Paragraph>
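Concretely, the parsability of an n-gram is the fraction of its occurrences that fall in parsed sentences. The sketch below is a simplification of van Noord's method (the original uses suffix arrays and selects maximal suspicious subsequences); the 5-count frequency cut appears as a parameter:

```python
# A simplified sketch of n-gram parsability in the spirit of van Noord (2004):
# parsability(w) = count(w in P sentences) / count(w in P or N sentences).
from collections import Counter

def ngrams(tokens, n_max=3):
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def parsabilities(parsed, unparsed, freq_cut=5):
    """parsed/unparsed: iterables of tokenised sentences (P and N sets)."""
    ok, total = Counter(), Counter()
    for sent in parsed:
        for g in ngrams(sent):
            ok[g] += 1
            total[g] += 1
    for sent in unparsed:
        for g in ngrams(sent):
            total[g] += 1
    return {g: ok[g] / c for g, c in total.items() if c >= freq_cut}
```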
    <Paragraph position="9">  Although pinpointing the problematic n-grams still does not tell us what the exact errors are, it does shed some light on their cause. From Table 2 we see quite a lot of uni-grams with low parsabilities. Table 3 gives some examples of the word sequences. By intuition, we make the bold assumption that the low parsability of a uni-gram is caused by missing appropriate lexical entries for the corresponding word.3 For the bi-grams and tri-grams, we do see a lot of cases where the error can be repaired by just adding a lexical entry for the whole n-gram (i.e. an MWE).</Paragraph>
    <Section position="1" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
4 Validation of the Mining Results
</SectionTitle>
      <Paragraph position="0"> In order to distinguish those n-grams that can be added into the grammar as MWE lexical entries from the other cases, we propose to validate them using evidence collected from the World Wide Web. Recently, many researchers have started using the World Wide Web as an extremely large corpus, since, as pointed out by Grefenstette (1999), the Web is the largest data set available for NLP ((Grefenstette, 1999), (Keller et al., 2002), (Kilgarriff and Grefenstette, 2003) and (Villavicencio, 2005)). For instance, Grefenstette employs the Web to do example-based machine translation of compounds from French into English. The method he employs would suffer considerably from data sparseness if it were to rely only on corpus data, so for compounds that are sparse in the BNC he also obtains frequencies from the Web. The scale of the Web can help to minimise the problem of data sparseness, which is especially acute for MWEs, and Villavicencio (2005) uses the Web to find evidence to verify automatically generated VPCs.</Paragraph>
      <Paragraph position="1"> This work builds on these, in that we propose to employ the Web as a corpus, using frequencies collected from the Web to detect MWEs among the n-grams that cause parse failures. We concentrate on the 482 most frequent candidates to verify the method.</Paragraph>
      <Paragraph position="2"> The candidate list has been pre-processed to remove systematic noise, such as entries including acronyms, names, dates and numbers, following Bouma and Villada (2002). Using Google as a search engine, we have looked for evidence on the Web for each of the candidate MWEs, searching for each candidate as an exact match. For each candidate searched, Google has provided us with a measure of frequency in the form of the number of pages in which it appears. Table 4 shows the 10 most frequent candidates; among these there are parts of formulae, frozen expressions and collocations. Table 5, on the other hand, shows the 10 least frequent candidates. From the total of candidates, 311 have been kept, while the others have been discarded as noise.</Paragraph>
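The collection step can be sketched as follows. Here page_count() is a placeholder for whatever exact-match hit count the search engine reports (the paper used the number of pages Google returns for a quoted query), and the minimum-evidence threshold is illustrative:

```python
# A hedged sketch of the Web validation step. page_count() is a placeholder;
# any search API that reports an exact-match hit count could stand in here.
def page_count(phrase: str) -> int:
    """Hypothetical: number of pages containing `phrase` as an exact match."""
    raise NotImplementedError("plug in a search API here")

def validate_candidates(candidates, min_pages=1):
    """Keep candidates with Web evidence; the paper kept 311 of 482."""
    counts = {c: page_count(c) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_pages}
```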
      <Paragraph position="3"> A manual inspection of the candidates has revealed that the list does indeed contain a large amount of MWEs and frozen expressions, like taking into account the, good and evil, by and large, put forward by and breach of contract. Some of these cases, like come into effect in, have very specific subcategorisation requirements, and this is reflected by the presence of the prepositions into and in in the n-gram. Other cases seem to be part of formulae, like but also in, as part of not only X but also Y, but what about, and the more the (part of the more the Yer).</Paragraph>
    </Section>
    <Section position="2" start_page="39" end_page="41" type="sub_section">
      <SectionTitle>
MWE candidates: Web page counts, permutation entropy and P1 (%)
</SectionTitle>
      <Paragraph position="0">
MWE                        Pages    Entropy   Prob (%)
stand by and             1350000    0.399      65.5
discharged from hospital  553000    0.001      99.9
shock of it                92300    0.541      44.6
was woken by               91400    0.001      99.9
telephone rang and         43700    0.026      99.2
glanced across at          36900    0.003      99.9
the citizens charter       22900    0.070      97.9
input is complete          13900    0.086      97.2
from of government           706    0.345       0.1
the to infinitive            561    0.445       1.4
</Paragraph>
      <Paragraph position="1"> However, among the candidates there still remain some that are not genuine MWEs, like of alcohol and and than that in, which contain very frequent words that give them a very high frequency count without their being MWEs. Therefore, to detect these cases, the remaining candidates could be further analysed using statistical techniques to try to distinguish them from the more likely MWEs. This is done by Bouma and Villada (2002), who investigated measures that have been used to identify certain kinds of MWEs, focusing on collocational prepositional phrases and on mutual information, log-likelihood and χ2 tests. One significant difference here is that this work is not constrained to a particular type of MWE, but has to deal with MWEs in general. Moreover, the statistical measures used by Bouma and Villada require single word frequencies, which can be a problem when using Google, especially for common words like of and a.</Paragraph>
      <Paragraph position="2"> In Tables 4 and 5 we present two alternative measures that, combined, can help to detect false candidates. The rationale is similar to that of the statistical tests, without the need to search for the frequency of each of the words that make up the MWE. We assume that if a candidate is just a result of the random co-occurrence of very frequent words, most probably the order of the words in the n-gram is not important. Therefore, given a candidate, such as the likes of, we measure the frequency of occurrence of all its permutations (e.g. the of likes, likes the of, etc.) and we calculate the candidate's entropy as</Paragraph>
      <Paragraph position="3"> $S = -\frac{1}{\log N}\sum_{i=1}^{N} P_i \log P_i$ </Paragraph>
      <Paragraph position="4"> where $P_i$ is the probability of occurrence of a given permutation, and N the total number of permutations. The entropy defined above has its maximum, S = 1, when all permutations are equally probable, which is a clear signature of a random nature. On the other hand, when order is very important and only a single configuration is allowed, the entropy has its minimum, S = 0. An n-gram with low entropy has good chances of being an MWE. A close inspection of Table 4 shows that the top two candidate n-grams have relatively high entropies (here we consider an entropy high when S &gt; 0.3). In the first case this can be explained by the fact that the word the can appear after the word of without compromising the MWE meaning, as in the burden of the job. In the second case it shows that the real MWE is cost effective and the word and can be either at the beginning or at the end of the trigram. In fact, for a trigram with only two acceptable permutations the entropy is $S = \log 2/\log 6 \approx 0.39$, very close to what is obtained.</Paragraph>
      <Paragraph position="5"> We also show the probability of occurrence of each candidate n-gram among its permutations (P1). Most of the candidates in the list are more frequent than their permutations. In Table 4 we find two exceptions, the last 2 n-grams, which are clearly spelling errors. Therefore a low P1 can be a good indicator of a noisy candidate. Another good predictor is the relative frequency between the candidates. Given the occurrence values for the most frequent candidates, we consider that by using a threshold of 20,000 occurrences it is possible to remove the noisier cases.</Paragraph>
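Both measures, plus the frequency threshold, can be combined into a simple filter. This sketch redefines the hypothetical page_count() stub from the earlier sketch; the 0.3 entropy bound and the 20,000-page threshold are the values discussed in the text, while the P1 bound is an illustrative assumption:

```python
# Permutation entropy and P1, as defined above, for a multiword candidate.
from itertools import permutations
from math import log

def page_count(phrase: str) -> int:
    """Hypothetical exact-match hit count (see the earlier sketch)."""
    raise NotImplementedError

def entropy_and_p1(ngram: str):
    # permutations() yields the original word order first, so counts[0]
    # is the page count of the candidate itself.
    perms = [" ".join(p) for p in permutations(ngram.split())]
    counts = [page_count(p) for p in perms]
    total = sum(counts)
    if total == 0:
        return 0.0, 0.0  # no Web evidence at all
    probs = [c / total for c in counts if c > 0]
    s = -sum(p * log(p) for p in probs) / log(len(perms))  # S in [0, 1]
    p1 = counts[0] / total
    return s, p1

def looks_like_mwe(ngram, s_max=0.3, p1_min=0.5, pages_min=20000):
    # s_max and pages_min come from the text; p1_min is an assumed cut-off.
    s, p1 = entropy_and_p1(ngram)
    return s <= s_max and p1 >= p1_min and page_count(ngram) >= pages_min
```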
      <Paragraph position="6"> We note that the grammar can also impose restrictions on the order of the elements in the n-gram, in the sense that some of the generated permutations are ungrammatical (e.g. the of likes) and will most probably have null or very low frequencies. Therefore, on top of the constraints on lexical order, there are also constraints on the constituent order of a candidate, which will be reflected in these measures.4 The remaining candidates can be semi-automatically included in the grammar by using a lexical type predictor, as described in the next section. With this information, each candidate is added as a lexical entry, with a possible manual check by a grammar writer prior to inclusion in the grammar.</Paragraph>
      <Paragraph position="7"> 4 Google ignores punctuation between the elements of the n-gram. This can lead to some hits being returned for some of the ungrammatical permuted n-grams, such as one one by in the sentence We're going to catch people one by one. One day,... from www.beertravelers.com/lists/drafttech.html. On the other hand, Google only returns the number of pages in which a given n-gram occurs, not the number of times it occurs in each page. This can result in a huge underestimation, especially for very frequent n-grams and words, which can be used more than once in a given page. Therefore, a conservative view of these frequencies must be adopted, given that for some n-grams they might be inflated and for others deflated.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="41" end_page="42" type="metho">
    <SectionTitle>
5 Automated Deep Lexical Acquisition
</SectionTitle>
    <Paragraph position="0"> In section 3, we have seen that more than 50% of the sentences contain one or more unknown words, and that about half of the other parsing failures are also due to missing lexical entries. In this section, we propose a statistical approach to lexical type prediction for unknown words, including multiword expressions.</Paragraph>
    <Section position="1" start_page="41" end_page="41" type="sub_section">
      <SectionTitle>
5.1 Atomic Lexical Types
</SectionTitle>
      <Paragraph position="0"> Lexicalist grammars are normally composed of a limited number of rules and a lexicon with rich linguistic features attached to each entry. Some grammar formalisms have a type inheritance system to encode various constraints, and a flat structure of the lexicon with each entry mapped onto one type in the inheritance hierarchy. The following discussion is based on Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994), but should be easily adaptable to other formalisms as well.</Paragraph>
      <Paragraph position="1"> The lexicon of HPSG consists of a list of well-formed Typed Feature Structures (TFSs) (Carpenter, 1992), which convey the constraints on specific words in two ways: type compatibility and feature-value consistency. Although it is possible to use both features and types to convey the constraints on lexical entries, large grammars prefer the use of types in the lexicon, because the inheritance system prevents the redundant definition of feature-values. Moreover, the feature-value constraints in the lexicon can be avoided by extending the types. Say we have n lexical entries $L_1 : t[\mathrm{F}\ a_1], \ldots, L_n : t[\mathrm{F}\ a_n]$. They share the same lexical type t, but take different values for the feature F. If $a_1, \ldots, a_n$ are the only possible values for F in the context of type t, we can extend the type t with subtypes $t_{a_1} : t[\mathrm{F}\ a_1], \ldots, t_{a_n} : t[\mathrm{F}\ a_n]$ and modify the lexical entries to use these new types, respectively. Based on the fact that large grammars normally have a very restricted number of feature-value constraints for each lexical type, the increase in the number of types is acceptable. It is also typical that the types assigned to lexical entries are maximal in the type hierarchy, which means that they have no further subtypes. We will call the maximal lexical types after extension the atomic lexical types. The lexicon is then a multi-valued mapping from word stems to atomic lexical types.</Paragraph>
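As a toy illustration of this extension, using the schematic names t, F and a1 ... an from the text (these are not actual ERG types):

```python
# A toy rendering of the type extension described above; the type name `t`,
# feature `F` and values a1..a3 are the schematic names from the text.
from collections import defaultdict

def extend_type(base: str, feature: str, values: list[str]) -> dict[str, str]:
    """Create one atomic subtype of `base` per possible value of `feature`."""
    return {v: f"{base}_{v}" for v in values}

atomic = extend_type("t", "F", ["a1", "a2", "a3"])  # {'a1': 't_a1', ...}

# The lexicon then becomes a multi-valued mapping from stems to atomic types:
lexicon: defaultdict[str, set[str]] = defaultdict(set)
lexicon["stem1"].add(atomic["a1"])  # entry formerly t constrained to [F a1]
lexicon["stem1"].add(atomic["a2"])  # a second reading of the same stem
```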
      <Paragraph position="2"> Needless to say, what we have described above applies not only to HPSG but to many other formalisms based on TFSs, which makes our assumptions about atomic lexical types all the more relevant for a wide range of systems and applications.</Paragraph>
    </Section>
    <Section position="2" start_page="41" end_page="42" type="sub_section">
      <SectionTitle>
5.2 Statistical Lexical Type Predictor
</SectionTitle>
      <Paragraph position="0"> Given that the lexicon of deep grammars can be modelled as a mapping from word stems to atomic lexical types, we now go on to design statistical methods that can automatically "guess" such mappings for unknown words.</Paragraph>
      <Paragraph position="1"> Similar to Baldwin (2005), we treat the problem as a classification task. But there is an important difference: while Baldwin (2005) makes predictions for each unknown word, we create a new lexical entry for each occurrence of an unknown word. The assumption behind this is that there should be exactly one lexical entry that corresponds to the occurrence of the word in the given context.5</Paragraph>
      <Paragraph position="2"> We use a single classifier to predict the atomic lexical type. There are normally hundreds of atomic lexical types for a large grammar, so the classification model should be able to handle a large number of output classes. We choose a Maximum Entropy-based model because it can easily handle thousands of features and a large number of possible outputs. It also has the advantages of a general feature representation and of making no independence assumption between features. With the efficient parameter estimation algorithms discussed by Malouf (2002), training the model is now very fast.</Paragraph>
      <Paragraph position="3"> For our prediction model, the probability of a lexical type t given an unknown word and its context c is:</Paragraph>
      <Paragraph position="4"> $p(t \mid c) = \frac{\exp\left(\sum_i \theta_i f_i(t,c)\right)}{\sum_{t'} \exp\left(\sum_i \theta_i f_i(t',c)\right)}$ </Paragraph>
      <Paragraph position="5"> where each feature $f_i(t,c)$ may encode arbitrary characteristics of the context. The parameters $\langle \theta_1, \theta_2, \ldots \rangle$ can be estimated by maximising the pseudo-likelihood on a training corpus (Malouf, 2002). The detailed design and feature selection for the lexical type predictor are described in Zhang and Kordoni (2006).</Paragraph>
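Schematically, the prediction step looks as follows; the feature templates are reduced to a stub (the cue names are illustrative, not the real feature set), and theta is assumed to have been estimated already, e.g. with the methods in Malouf (2002):

```python
# A schematic scoring function for the MaxEnt predictor defined above.
from math import exp

def features(t: str, c: dict) -> dict[str, float]:
    """Stub feature templates: pair the candidate type with context cues.
    The cue names (prev_word, suffix) are illustrative assumptions."""
    return {f"{t}&prev={c['prev_word']}": 1.0,
            f"{t}&suffix={c['suffix']}": 1.0}

def p_type_given_context(t, c, types, theta) -> float:
    # p(t|c) = exp(sum_i theta_i * f_i(t,c)) normalised over all types.
    def score(ty):
        return exp(sum(theta.get(k, 0.0) * v
                       for k, v in features(ty, c).items()))
    return score(t) / sum(score(ty) for ty in types)
```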
      <Paragraph position="6"> 5 In principle, this constraint can be relaxed by allowing the classifier to return more than one result, by setting a confidence threshold, for example.</Paragraph>
      <Paragraph position="7"> In the experiment described here, we have used the latest version of the Redwoods Treebank to train the lexical type predictor with morphological features and context word/POS tag features.6 We have then extracted from the BNC 6248 sentences which contain at least one of the 311 MWE candidates verified against the World Wide Web in the way described in the previous section. For each occurrence of the MWE candidates in this set of sentences, our lexical type predictor has predicted a lexical entry candidate. This has resulted in 1936 distinct entries. Only those entries with at least 5 counts have been added into the grammar, which has resulted in an extra 373 MWE lexical entries for the grammar.</Paragraph>
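The final filtering step amounts to counting the predicted entries and keeping the frequent ones; a minimal sketch, with the 5-count threshold from the text:

```python
# Keep only predicted (stem, type) entries observed at least five times.
from collections import Counter

def select_entries(predictions, min_count=5):
    """predictions: iterable of (mwe_stem, atomic_lexical_type) pairs,
    one per occurrence of an MWE candidate in the extracted sentences."""
    counts = Counter(predictions)
    return {entry for entry, n in counts.items() if n >= min_count}
```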
      <Paragraph position="8"> This addition to the grammar has resulted in a significant increase in coverage (Table 6) of 14.4%. This result is very promising, since only a subset of the candidate MWEs has been analysed; an even greater increase in coverage could be obtained if these techniques were applied to the complete set of candidates.</Paragraph>
      <Paragraph position="9"> However, we should also point out that the coverage numbers reported in Table 6 are for a set of "difficult" sentences which contain a lot of MWEs. When compared to the numbers reported in Table 1, the coverage of the parser on this data set after adding the MWE entries is still significantly lower. This indicates that not all MWEs can be correctly handled by simply adding more lexical entries; further investigation is still required.</Paragraph>
    </Section>
  </Section>
</Paper>