<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0603">
  <Title>Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing</Title>
  <Section position="6" start_page="17" end_page="20" type="metho">
    <SectionTitle>
3 Models and Features
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="17" end_page="17" type="sub_section">
      <SectionTitle>
3.1 Adjacency and Dependency Models
</SectionTitle>
      <Paragraph position="0"> In related work, a distinction is often made between what is called the dependency model and the adjacency model. The main idea is as follows. For a given 3-word NC w1w2w3, there are two reasons it may take on right bracketing, [w1[w2w3]]: either (a) w2w3 is a compound (modified by w1), or (b) w1 and w2 independently modify w3. This distinction can be seen in the examples home health care (health care is a compound modified by home) versus adult male rat (adult and male independently modify rat).</Paragraph>
      <Paragraph position="1"> The adjacency model checks (a): whether w2w3 is a compound (i.e., how strongly w2 modifies w3, as opposed to w1w2 being a compound), in order to decide whether or not to predict a right bracketing. The dependency model checks (b): does w1 modify w3 (as opposed to w1 modifying w2)?</Paragraph>
      <Paragraph position="2"> Left bracketing is a bit different, since there is only one modificational choice for a 3-word NC: if w1 modifies w2, this implies that w1w2 is a compound which in turn modifies w3, as in law enforcement agent.</Paragraph>
      <Paragraph position="3"> Thus the usefulness of the adjacency model vs.</Paragraph>
      <Paragraph position="4"> the dependency model can depend in part on the mix of left and right bracketing. Below we show that the dependency model works better than the adjacency model, confirming other results in the literature. The next subsections describe several different ways to compute these measures.</Paragraph>
    </Section>
    <Section position="2" start_page="17" end_page="17" type="sub_section">
      <SectionTitle>
3.2 Using Frequencies
</SectionTitle>
      <Paragraph position="0"> The most straightforward way to compute adjacency and dependency scores is to simply count the corresponding frequencies. Lapata and Keller (2004) achieved their best accuracy (78.68%) with the dependency model and the simple symmetric score #(wi,wj).1</Paragraph>
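      <Paragraph> As an illustration (not part of the original system), the frequency-based versions of the two models reduce to simple comparisons of bigram counts. The following minimal Python sketch uses hypothetical counts; in practice they would come from corpus or search-engine statistics:

```python
def bracket_dependency(f_w1w2, f_w1w3):
    """Dependency model with raw frequencies: does w1 attach to w2 or to w3?

    f_w1w2 -- frequency of the bigram (w1, w2)
    f_w1w3 -- frequency of the bigram (w1, w3)
    Returns "left" for [[w1 w2] w3] and "right" for [w1 [w2 w3]].
    """
    return "left" if f_w1w2 > f_w1w3 else "right"


def bracket_adjacency(f_w1w2, f_w2w3):
    """Adjacency model with raw frequencies: is (w1, w2) or (w2, w3)
    the more tightly bound pair?"""
    return "right" if f_w2w3 > f_w1w2 else "left"


# Hypothetical counts for "law enforcement agent":
# #(law, enforcement) far exceeds #(law, agent), so predict left.
print(bracket_dependency(f_w1w2=120000, f_w1w3=800))
```

Both functions implement the same decision rule with different pairs of counts, which is why the mix of left and right bracketings in the test set affects their relative performance. </Paragraph>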
    </Section>
    <Section position="3" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
3.3 Computing Probabilities
</SectionTitle>
      <Paragraph position="0"> Lauer (1995) assumes that adjacency and dependency should be computed via probabilities. Since they are relatively simple to compute, we investigate them in our experiments.</Paragraph>
      <Paragraph position="1"> Consider the dependency model, as introduced above, and the NC w1w2w3. Let Pr(wi -> wj|wj) be the probability that the word wi precedes a given fixed word wj. Assuming that the distinct head-modifier relations are independent, we obtain Pr(left) = Pr(w1 -> w2|w2) x Pr(w2 -> w3|w3) and Pr(right) = Pr(w1 -> w3|w3) x Pr(w2 -> w3|w3).</Paragraph>
      <Paragraph position="3"> To choose the more likely structure, we can drop the shared factor and compare Pr(w1 -> w3|w3) to Pr(w1 -> w2|w2).</Paragraph>
      <Paragraph position="4"> The alternative adjacency model compares Pr(w2 -> w3|w3) to Pr(w1 -> w2|w2), i.e., the association strength between the last two words vs. that between the first two. If the first probability is larger than the second, the model predicts right. The probability Pr(w1 -> w2|w2) can be estimated as #(w1,w2)/#(w2), where #(w1,w2) and #(w2) are the corresponding bigram and unigram frequencies. They can be approximated as the number of pages returned by a search engine in response to queries for the exact phrase &amp;quot;w1 w2&amp;quot; and for the word w2. In our experiments below we smoothed2 each of these frequencies by adding 0.5 to avoid problems caused by nonexistent n-grams. (Footnote 1: This score worked best on training, when Keller&amp;Lapata were doing model selection. On testing, Pr (with the dependency model) worked better and achieved accuracy of 80.32%, but this result was ignored, as Pr did worse on training.)</Paragraph>
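      <Paragraph> The smoothed probability estimate and the resulting dependency-model decision can be sketched as follows (an illustration, not the original code; the hit counts are placeholders for search-engine page counts):

```python
def cond_prob(bigram_hits, unigram_hits, smooth=0.5):
    """Estimate Pr(wi -> wj | wj) ~ #(wi, wj) / #(wj) from page counts,
    adding 0.5 to each raw count to avoid zeros for unseen n-grams."""
    return (bigram_hits + smooth) / (unigram_hits + smooth)


def dependency_predict(hits_w1w2, hits_w2, hits_w1w3, hits_w3):
    """Dependency model: compare Pr(w1 -> w3|w3) to Pr(w1 -> w2|w2);
    the larger conditional probability determines the bracketing."""
    p_right = cond_prob(hits_w1w3, hits_w3)
    p_left = cond_prob(hits_w1w2, hits_w2)
    return "right" if p_right > p_left else "left"
```

The shared factor Pr(w2 -> w3|w3) is dropped, as in the text, so only the two conditionals above need to be estimated. </Paragraph>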
      <Paragraph position="5"> Unless some particular probabilistic interpretation is needed,3 there is no reason why, for a given ordered pair of words (wi,wj), we should use Pr(wi -> wj|wj) rather than Pr(wj -> wi|wi), i &lt; j. This is confirmed by the adjacency model experiments in (Lapata and Keller, 2004) on Lauer's NC set. Their results show that both ways of computing the probabilities make sense: using Altavista queries, the former achieves a higher accuracy (70.49% vs. 68.85%), but the latter is better on the British National Corpus (65.57% vs. 63.11%).</Paragraph>
    </Section>
    <Section position="4" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
3.4 Other Measures of Association
</SectionTitle>
      <Paragraph position="0"> In both models, the probability Pr(wi -> wj|wj) can be replaced by some (possibly symmetric) measure of association between wi and wj, such as chi-squared (χ2). To calculate χ2(wi,wj), we need: (A) #(wi,wj), the number of bigrams wiwj; (B) #(wi,¬wj), the number of bigrams in which the first word is wi, followed by a word other than wj; (C) #(¬wi,wj), the number of bigrams ending in wj whose first word is other than wi; (D) #(¬wi,¬wj), the number of bigrams in which the first word is not wi and the second is not wj. They are combined in the following formula: χ2 = N(AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D)).</Paragraph>
      <Paragraph position="2"> Here N = A + B + C + D is the total number of bigrams, B = #(wi) - #(wi,wj) and C = #(wj) - #(wi,wj). While it is hard to estimate D directly, we can calculate it as D = N - A - B - C. (Footnote 2: [...] right bracketing preference. Footnote 3: The best Lauer model does not work with words directly, but uses a taxonomy and further needs a probabilistic interpretation so that the hidden taxonomy variables can be summed out. Because of that summation, the term Pr(w2 -> w3|w3) does not cancel in his dependency model.)</Paragraph>
      <Paragraph position="3"> Finally, we estimate N as the total number of indexed bigrams on the Web: about 8 trillion, since Google indexes about 8 billion pages and each contains about 1,000 words on average.</Paragraph>
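      <Paragraph> Putting the pieces above together, the χ2 score can be computed from only three observable counts plus the estimate of N. A minimal Python sketch (the standard 2x2 chi-squared formula; the input counts are hypothetical):

```python
N_WEB = 8e12  # ~8 billion pages x ~1,000 words per page


def chi_squared(f_wiwj, f_wi, f_wj, n=N_WEB):
    """2x2 chi-squared association between wi and wj from Web counts.

    A = #(wi, wj)        observed bigram count
    B = #(wi) - A        wi followed by a word other than wj
    C = #(wj) - A        wj preceded by a word other than wi
    D = n - A - B - C    neither wi first nor wj second
    """
    a = f_wiwj
    b = f_wi - a
    c = f_wj - a
    d = n - a - b - c
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (a + c) * (b + d) * (c + d)
    return num / den
```

When the bigram occurs exactly as often as independence predicts, AD = BC and the score is zero; larger scores indicate a stronger association between the two words. </Paragraph>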
      <Paragraph position="4"> Other measures of word association are possible, such as mutual information (MI), which we can use with the dependency and the adjacency models, similarly to #, χ2 or Pr. However, in our experiments, χ2 worked better than other methods; this is not surprising, as χ2 is known to outperform MI as a measure of association (Yang and Pedersen, 1997).</Paragraph>
    </Section>
    <Section position="5" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
3.5 Web-Derived Surface Features
</SectionTitle>
      <Paragraph position="0"> Authors sometimes (consciously or not) disambiguate the words they write by using surface-level markers to suggest the correct meaning. We have found that exploiting these markers, when they occur, can prove to be very helpful for making bracketing predictions. The enormous size of Web search engine indexes facilitates finding such markers frequently enough to make them useful.</Paragraph>
      <Paragraph position="1"> One very productive feature is the dash (hyphen).</Paragraph>
      <Paragraph position="2"> Starting with the term cell cycle analysis, if we can find a version of it in which a dash occurs between the first two words: cell-cycle, this suggests a left bracketing for the full NC. Similarly, the dash in donor T-cell favors a right bracketing. The right-hand dashes are less reliable though, as their scope is ambiguous. In fiber optics-system, the hyphen indicates that the noun compound fiber optics modifies system. There are also cases with multiple hyphens, as in t-cell-depletion, which preclude their use.</Paragraph>
      <Paragraph position="3"> The genitive ending, or possessive marker, is another useful indicator. The phrase brain's stem cells suggests a right bracketing for brain stem cells, while brain stem's cells favors a left bracketing.4 Another highly reliable source is internal capitalization. For example, Plasmodium vivax Malaria suggests a left bracketing, while brain Stem cells would favor a right one. (We disable this feature on Roman numerals and single-letter words to prevent problems with terms like vitamin D deficiency, where the capitalization is just a convention, as opposed to a special mark suggesting that the last two terms go together.) We can also make use of embedded slashes. For example, in leukemia/lymphoma cell, the slash predicts a right bracketing, since the first word is an alternative and cannot be a modifier of the second one. In some cases we can find instances of the NC in which one or more words are enclosed in parentheses, e.g., growth factor (beta) or (growth factor) beta, both of which indicate a left structure, or (brain) stem cells, which suggests a right bracketing.</Paragraph>
      <Paragraph position="4"> Even a comma, a dot or a colon (or any special character) can act as an indicator. For example, &amp;quot;health care, provider&amp;quot; or &amp;quot;lung cancer: patients&amp;quot; are weak predictors of a left bracketing, showing that the author chose to keep two of the words together, separating out the third one.</Paragraph>
      <Paragraph position="5"> We can also exploit dashes to words external to the target NC, as in mouse-brain stem cells, which is a weak indicator of right bracketing.</Paragraph>
      <Paragraph position="6"> Unfortunately, Web search engines ignore punctuation characters, thus preventing querying directly for terms containing hyphens, brackets, apostrophes, etc. We collect them indirectly by issuing queries with the NC as an exact phrase and then post-processing the resulting summaries, looking for the surface features of interest. Search engines typically allow the user to explore up to 1000 results. We collect all results and summary texts that are available for the target NC and then search for the surface patterns using regular expressions over the text. Each match increases the score for left or right bracketing, depending on which the pattern favors.</Paragraph>
      <Paragraph position="7"> While some of the above features are clearly more reliable than others, we do not try to weight them. For a given NC, we post-process the returned Web summaries, find the number of left-predicting surface feature instances (regardless of their type), and compare it to the number of right-predicting ones to make a bracketing decision.5</Paragraph>
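      <Paragraph> The unweighted voting over snippet matches can be sketched as follows. This is an illustration only: the regular expressions cover just three left-predicting and two right-predicting patterns from the text (hyphen, genitive, punctuation), not the full feature set:

```python
import re


def surface_votes(snippets, w1, w2, w3):
    """Count left- vs. right-predicting surface markers in result snippets
    for the NC "w1 w2 w3" and return the majority decision."""
    left_pats = [
        rf"\b{w1}-{w2}\s+{w3}\b",         # e.g. cell-cycle analysis
        rf"\b{w1}\s+{w2}'s\s+{w3}\b",     # e.g. brain stem's cells
        rf"\b{w1}\s+{w2}[,.:]\s*{w3}\b",  # e.g. health care, provider
    ]
    right_pats = [
        rf"\b{w1}\s+{w2}-{w3}\b",         # e.g. donor T-cell
        rf"\b{w1}'s\s+{w2}\s+{w3}\b",     # e.g. brain's stem cells
    ]
    left = sum(len(re.findall(p, s, re.I)) for s in snippets for p in left_pats)
    right = sum(len(re.findall(p, s, re.I)) for s in snippets for p in right_pats)
    return "left" if left > right else "right" if right > left else "tie"
```

Each regex match is one unweighted vote, regardless of which feature type produced it, matching the decision rule described above. </Paragraph>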
    </Section>
    <Section position="6" start_page="19" end_page="20" type="sub_section">
      <SectionTitle>
3.6 Other Web-Derived Features
</SectionTitle>
      <Paragraph position="0"> Some features can be obtained by using the overall counts returned by the search engine. As these counts are derived from the entire Web, as opposed to a set of up to 1,000 summaries, they are of a different magnitude, and we did not want to simply add them to the surface features above. They appear as independent models in Tables 1 and 2. (Footnote 5: This appears as Surface features (sum) in Tables 1 and 2.)</Paragraph>
      <Paragraph position="1"> First, in some cases, we can query for possessive markers directly: although search engines drop the apostrophe, they keep the s, so we can query for &amp;quot;brain's&amp;quot; (but not for &amp;quot;brains' &amp;quot;). We then compare the number of times the possessive marker appeared on the second vs. the first word, to make a bracketing decision.</Paragraph>
      <Paragraph position="2"> Abbreviations are another important feature. For example, &amp;quot;tumor necrosis factor (NF)&amp;quot; suggests a right bracketing, while &amp;quot;tumor necrosis (TN) factor&amp;quot; would favor left. We would like to issue exact phrase queries for the two patterns and see which one is more frequent. Unfortunately, the search engines drop the brackets and ignore the capitalization, so we issue queries with the parentheses removed, as in &amp;quot;tumor necrosis factor nf&amp;quot;. This produces highly accurate results, although errors occur when the abbreviation is an existing word (e.g., me), a Roman digit (e.g., IV), a state (e.g., CA), etc.</Paragraph>
      <Paragraph position="3"> Another reliable feature is concatenation. Consider the NC health care reform, which is left-bracketed. Now, consider the bigram &amp;quot;health care&amp;quot;. At the time of writing, Google estimates 80,900,000 pages for it as an exact term. Now, if we try the word healthcare we get 80,500,000 hits. At the same time, carereform returns just 109. This suggests that authors sometimes concatenate words that act as compounds. We find below that comparing the frequency of the concatenation of the left bigram to that of the right (adjacency model for concatenations) often yields accurate results. We also tried the dependency model for concatenations, as well as the concatenations of two words in the context of the third one (i.e., comparing the frequencies of &amp;quot;healthcare reform&amp;quot; and &amp;quot;health carereform&amp;quot;).</Paragraph>
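      <Paragraph> The adjacency model for concatenations reduces to one comparison of hit counts. A minimal sketch, using the figures quoted in the text for health care reform:

```python
def concat_adjacency(hits_left_concat, hits_right_concat):
    """Adjacency model over concatenations: compare how often the left pair
    (e.g. "healthcare") vs. the right pair (e.g. "carereform") occurs as a
    single token; the more frequent concatenation marks the tighter pair."""
    if hits_left_concat > hits_right_concat:
        return "left"
    if hits_right_concat > hits_left_concat:
        return "right"
    return "tie"


# "healthcare" ~80,500,000 hits vs. "carereform" ~109 hits -> left bracketing
print(concat_adjacency(80_500_000, 109))
```

The dependency variant would instead compare the concatenations of (w1, w2) and (w1, w3), with the same decision rule. </Paragraph>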
      <Paragraph position="4"> We also used Google's support for &amp;quot;*&amp;quot;, which allows a single word wildcard, to see how often two of the words are present but separated from the third by some other word(s). This implicitly tries to capture paraphrases involving the two sub-concepts making up the whole. For example, we compared the frequency of &amp;quot;health care * reform&amp;quot; to that of &amp;quot;health * care reform&amp;quot;. We also used 2 and 3 stars and switched the word group order (indicated with rev.</Paragraph>
      <Paragraph position="5"> in Tables 1 and 2), e.g., &amp;quot;care reform * * health&amp;quot;. We also tried a simple reorder without inserting stars, i.e., compare the frequency of &amp;quot;reform health  care&amp;quot; to that of &amp;quot;care reform health&amp;quot;. For example, when analyzing myosin heavy chain we see that heavy chain myosin is very frequent, which provides evidence against grouping heavy and chain together as they can commute.</Paragraph>
      <Paragraph position="6"> Further, we tried to look inside the internal inflection variability. The idea is that if &amp;quot;tyrosine kinase activation&amp;quot; is left-bracketed, then the first two words probably make a whole and thus the second word can be found inflected elsewhere but the first word cannot, e.g., &amp;quot;tyrosine kinases activation&amp;quot;. Alternatively, if we find different internal inflections of the first word, this would favor a right bracketing.</Paragraph>
      <Paragraph position="7"> Finally, we tried switching the word order of the first two words. If they independently modify the third one (which implies a right bracketing), then we could expect to see also a form with the first two words switched, e.g., if we are given &amp;quot;adult male rat&amp;quot;, we would also expect &amp;quot;male adult rat&amp;quot;.</Paragraph>
    </Section>
    <Section position="7" start_page="20" end_page="20" type="sub_section">
      <SectionTitle>
3.7 Paraphrases
</SectionTitle>
      <Paragraph position="0"> Warren (1978) proposes that the semantics of the relations between words in a noun compound are often made overt by paraphrase. As an example of prepositional paraphrase, an author describing the concept of brain stem cells may choose to write it in a more expanded manner, such as stem cells in the brain. This contrast can be helpful for syntactic bracketing, suggesting that the full NC takes on right bracketing, since stem and cells are kept together in the expanded version. However, this NC is ambiguous, and can also be paraphrased as cells from the brain stem, implying a left bracketing.</Paragraph>
      <Paragraph position="1"> Some NCs' meaning cannot be readily expressed with a prepositional paraphrase (Warren, 1978). An alternative is the copula paraphrase, as in office building that/which is a skyscraper (right bracketing), or a verbal paraphrase such as pain associated with arthritis migraine (left).</Paragraph>
      <Paragraph position="2"> Other researchers have used prepositional paraphrases as a proxy for determining the semantic relations that hold between nouns in a compound (Lauer, 1995; Keller and Lapata, 2003; Girju et al., 2005).</Paragraph>
      <Paragraph position="3"> Since most NCs have a prepositional paraphrase, Lauer builds a model trying to choose between the most likely candidate prepositions: of, for, in, at, on, from, with and about (excluding like which is mentioned by Warren). This could be problematic though, since as a study by Downing (1977) shows, when no context is provided, people often come up with incompatible interpretations.</Paragraph>
      <Paragraph position="4"> In contrast, we use paraphrases in order to make syntactic bracketing assignments. Instead of trying to manually decide the correct paraphrases, we can issue queries using paraphrase patterns and find out how often each occurs in the corpus. We then add up the number of hits predicting a left versus a right bracketing and compare the counts.</Paragraph>
      <Paragraph position="5"> Unfortunately, search engines lack linguistic annotations, making general verbal paraphrases too expensive. Instead we used a small set of hand-chosen paraphrases: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for. It is however feasible to generate queries predicting left/right bracketing with/without a determiner for every preposition.6 For the copula paraphrases we combine two verb forms is and was, and three complementizers that, which and who. These are optionally combined with a preposition or a verb form, e.g. themes that are used in science fiction.</Paragraph>
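      <Paragraph> The generation of prepositional paraphrase queries can be sketched as follows. This is an illustration under simplifying assumptions: determiners are omitted, only a subset of the hand-chosen paraphrases listed above is included, and the copula variants are left out:

```python
# Subset of the hand-chosen paraphrases named in the text.
PARAPHRASES = [
    "associated with", "caused by", "contained in", "derived from",
    "found in", "located in", "made of", "related to", "used in",
]


def paraphrase_queries(w1, w2, w3):
    """Exact-phrase queries whose hit counts vote for left vs. right bracketing.

    A right bracketing keeps w2 w3 together:  "w2 w3 <para> w1"
    A left bracketing keeps w1 w2 together:   "w3 <para> w1 w2"
    """
    right = [f'"{w2} {w3} {p} {w1}"' for p in PARAPHRASES]
    left = [f'"{w3} {p} {w1} {w2}"' for p in PARAPHRASES]
    return left, right
```

For brain stem cells, this yields right-predicting queries like "stem cells found in brain" and left-predicting ones like "cells found in brain stem"; the hit counts for the two lists are then summed and compared, as described above. </Paragraph>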
    </Section>
  </Section>
</Paper>