<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0603"> <Title>Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing</Title> <Section position="7" start_page="20" end_page="23" type="evalu"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="20" end_page="20" type="sub_section"> <SectionTitle> 4.1 Lauer's Dataset </SectionTitle>
<Paragraph position="0"> We experimented with the dataset from (Lauer, 1995) in order to produce results comparable to those of Lauer and of Keller & Lapata. The set consists of 244 unambiguous 3-noun NCs extracted from Grolier's encyclopedia; however, only 216 of these NCs are unique.</Paragraph>
<Paragraph position="1"> Lauer (1995) derived n-gram frequencies from the Grolier corpus and tested the dependency and the adjacency models on this text. To help combat data sparseness, he also incorporated a taxonomy and some additional information (see the Related Work section above). Lapata and Keller (2004) derived their statistics from the Web and achieved results close to Lauer's using simple lexical models.</Paragraph> </Section>
<Section position="2" start_page="20" end_page="21" type="sub_section"> <SectionTitle> 4.2 Biomedical Dataset </SectionTitle>
<Paragraph position="0"> We constructed a new set of noun compounds from the biomedical literature. Using the OpenNLP tools,7 we sentence-split, tokenized, POS-tagged, and shallow-parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003). We then extracted all 3-noun sequences falling in the last three positions of the noun phrases (NPs) found in the shallow parse; if an NP contained any other noun, the sequence was discarded. This allows for NCs that are modified by adjectives, determiners, and so on, but prevents extracting 3-noun NCs that are part of longer NCs. For details, see (Nakov et al., 2005).</Paragraph>
6In addition to the articles (a, an, the), we also used quantifiers (e.g., some, every) and pronouns (e.g., this, his).
<Paragraph position="1"> This procedure resulted in 418,678 distinct NC types. We manually investigated the most frequent ones, removing those with errors in tokenization (e.g., containing fragments like transplan or tation), in POS tagging (e.g., acute lung injury, where acute was wrongly tagged as a noun), or in shallow parsing (e.g., situ hybridization, which misses in). We had to consider the first 843 examples in order to obtain 500 good ones, which suggests an extraction accuracy of 59%. This number is low mainly because the tokenizer treats dash-connected words as a single token (e.g., factor-alpha), and many tokens contained other special characters (e.g., cd4+), which cannot be used in a query against a search engine and had to be discarded.</Paragraph>
<Paragraph position="2"> The 500 NCs were annotated independently by two judges, one of whom has a biomedical background; the other was one of the authors. The problematic cases were reconsidered by the two judges and, after agreement was reached, the set contained 361 left-bracketed, 69 right-bracketed, and 70 ambiguous NCs. The latter group was excluded from the experiments.8 We calculated the inter-annotator agreement on the 430 cases that were marked as unambiguous after agreement. Using the original annotators' choices, we obtained an agreement of 88% or 82%, depending on whether the annotations initially marked as ambiguous by one of the judges are counted as correct. The corresponding values of the kappa statistic were .606 (substantial agreement) and .442 (moderate agreement).</Paragraph>
<Paragraph position="3"> 8[...] inflection or with a different word variant, e.g., colon cancer cells and colon carcinoma cells.</Paragraph>
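<Paragraph> As a concrete illustration of the agreement figures above, the following sketch computes Cohen's kappa from two annotators' label lists. The label lists are toy stand-ins (the per-item annotations were not published); only the formula itself is standard:
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy labels over the categories left (L), right (R), ambiguous (A).
judge1 = ["L"] * 80 + ["R"] * 12 + ["A"] * 8
judge2 = ["L"] * 72 + ["R"] * 8 + ["A"] * 4 + ["R"] * 10 + ["L"] * 6
print(f"kappa = {cohens_kappa(judge1, judge2):.3f}")  # 0.431 for these toy labels
Kappa values between .41 and .60 are conventionally read as moderate agreement and .61 to .80 as substantial agreement, which is how the .442 and .606 above are glossed.</Paragraph>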
</Section>
<Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 4.3 Experiments </SectionTitle>
<Paragraph position="0"> The n-grams, surface features, and paraphrase counts were collected by issuing exact-phrase queries, limiting the pages to English, and requesting filtering of similar results.9 For each NC, we generated all possible word inflections (e.g., tumor and tumors) and alternative word variants (e.g., tumor and tumour). For the biomedical dataset, these were obtained automatically from the UMLS Specialist lexicon.10 For Lauer's set, we used Carroll's morphological tools.11 For bigrams, we inflected only the second word; similarly, for a prepositional paraphrase, we generated all possible inflected forms for the two parts before and after the preposition.</Paragraph> </Section>
<Section position="4" start_page="21" end_page="23" type="sub_section"> <SectionTitle> 4.4 Results and Discussion </SectionTitle>
<Paragraph position="0"> The results are shown in Tables 1 and 2. (Caption of Tables 1 and 2: correct (✓), incorrect (×), and no prediction (∅), followed by precision (P, calculated over ✓ and × only) and coverage (C, % of examples with a prediction); we use "-" for back-off to another model in case of ∅.) Since NCs are left-bracketed at least two-thirds of the time (Lauer, 1995), a straightforward baseline is to always assign left bracketing. Tables 1 and 2 suggest that the surface features perform best. The paraphrases are equally good on the biomedical dataset, but on Lauer's set their performance is lower, comparable to that of the dependency model.</Paragraph>
<Paragraph position="1"> The dependency model clearly outperforms the adjacency one on Lauer's set (as other researchers have also found), but not on the biomedical set, where the two are equally good. On Lauer's set, χ2 barely outperforms the raw frequency #, but on the biomedical set χ2 is a clear winner (by about 1.5%) for both the dependency and the adjacency model.</Paragraph>
<Paragraph position="2"> The frequencies (#) outperform or at least rival the probabilities (Pr) on both sets and for both models; this is not surprising, given the previous results of Lapata and Keller (2004). The margin over Pr is largest on the biomedical set. This may be due to the abundance of single-letter words in that set (because of terms like T cell, B cell, and vitamin D; Roman numerals like ii and iii cause similar problems), whose Web frequencies are rather unreliable: these unigram frequencies are used by Pr but not by #. Single-letter words cause potential problems for the paraphrases as well, by returning too many false positives, but they work very well with concatenations and dashes: e.g., T cell is often written as Tcell.</Paragraph>
<Paragraph position="3"> As Table 4 shows, most of the surface features that we expected to indicate right bracketing actually indicated left bracketing. Overall, the surface features were very good at predicting left bracketing, but unreliable for right-bracketed examples. This is probably in part because they look for adjacent words, i.e., they act as a kind of adjacency model.</Paragraph>
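<Paragraph> For concreteness, the sketch below shows the adjacency and dependency decision rules with a χ2 association score computed from search-engine counts. The counts, names, and corpus size are invented for illustration; this is not the authors' implementation:
def chi2(cooc, count_x, count_y, n):
    """Standard 2x2 chi-squared association for a word pair, given the
    pair's co-occurrence count, each word's count, and the corpus size n."""
    a = cooc               # x and y together
    b = count_x - cooc     # x without y
    c = count_y - cooc     # y without x
    d = n - a - b - c      # neither
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented Web-scale counts for "brain stem cell".
pair_hits = {("brain", "stem"): 120_000, ("stem", "cell"): 500_000,
             ("brain", "cell"): 5_000}
word_hits = {"brain": 2_000_000, "stem": 1_000_000, "cell": 8_000_000}
N = 10 ** 9  # assumed total number of indexed pages

def assoc(x, y):
    return chi2(pair_hits[(x, y)], word_hits[x], word_hits[y], N)

def bracket(w1, w2, w3, model):
    """Return 'left' for [[w1 w2] w3] or 'right' for [w1 [w2 w3]].
    Adjacency compares (w1,w2) against (w2,w3); dependency compares
    (w1,w2) against (w1,w3)."""
    left = assoc(w1, w2)
    right = assoc(w2, w3) if model == "adjacency" else assoc(w1, w3)
    return "left" if left >= right else "right"

print(bracket("brain", "stem", "cell", "dependency"))  # left
Replacing chi2 with the raw co-occurrence count gives the # variant discussed above; replacing it with a conditional probability gives an approximation of the Pr variant.</Paragraph>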
<Paragraph position="4"> We obtained our best overall results by combining the most reliable models, marked in bold in Tables 1, 2, and 4. Since they have independent errors, we combined them by majority vote.</Paragraph>
<Paragraph position="5"> Table 3 compares our results to those of Lauer (1995) and of Lapata and Keller (2004). It is important to note, though, that our results are directly comparable to Lauer's, while Keller & Lapata's are not, since they used half of Lauer's set for development and the other half for testing.12 We, following Lauer, used the whole set for testing. Lapata & Keller also used the AltaVista search engine, which no longer exists in its earlier form. The table does not contain the results of Girju et al. (2005), who achieved 83.10% accuracy but used a supervised algorithm and targeted bracketing in context. They further "shuffled" Lauer's set, mixing it with additional data, which makes their results even harder to compare with those in the table.</Paragraph>
<Paragraph position="6"> Note that using page hits as a proxy for n-gram frequencies can produce counter-intuitive results. Consider the bigrams w1 w4, w2 w4, and w3 w4, and a page that contains each bigram exactly once.</Paragraph>
<Paragraph position="7"> A search engine will contribute a page count of 1 for w4 instead of a frequency of 3; thus the page hits for w4 can be smaller than the sum of the page hits for the individual bigrams. See Keller and Lapata (2003) for more issues.</Paragraph>
<Paragraph position="8"> 12In fact, the differences are negligible; their system achieves pretty much the same result on the half split as on the whole set (personal communication).</Paragraph>
<Paragraph position="9"> (Caption of Table 3: [...] results on Lauer's set. The results of Keller & Lapata are on half of Lauer's set and thus are only indirectly comparable; note the different baseline.)</Paragraph>
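<Paragraph> The page-hit caveat above is easy to reproduce. In this toy simulation (not an actual search-engine query), a single "page" contains each of the three bigrams once:
# One page containing "w1 w4", "w2 w4", and "w3 w4" exactly once each.
pages = ["w1 w4 then w2 w4 then w3 w4"]

def page_hits(term):
    """Number of pages containing the term: what a search engine reports."""
    return sum(term in page for page in pages)

def true_frequency(term):
    """Number of occurrences of the term across all pages."""
    return sum(page.count(term) for page in pages)

print(page_hits("w4"), true_frequency("w4"))                   # 1 3
print(sum(page_hits(b) for b in ["w1 w4", "w2 w4", "w3 w4"]))  # 3
The sum of the bigram page hits (3) exceeds the page hits for w4 alone (1), which is exactly the inconsistency described above.</Paragraph> </Section> </Section> </Paper>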