<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1406"> <Title>Using Distributional Similarity to Identify Individual Verb Choice</Title> <Section position="4" start_page="33" end_page="35" type="metho"> <SectionTitle> 2 The Recipe Domain and our Corpora </SectionTitle> <Paragraph position="0"> To find the most expressive verb pairs, we need corpora to analyse, so the selection of a corpus is very important. As research in authorship attribution (AA) shows (Burrow, 1987), (Holmes and Forsyth, 1995), (Keuelj et al., 2003), (Peng, 2003), and (Clement and Sharp, 2003), there can be style variations within an individual author. This happens even with the same topic and genre, and for the same action expressions. Firstly, a person's writing style can change as time, genre, and topic change. Can and Patton (Can and Patton, 2004) have drawn the conclusion: A higher time gap may have positive impact in separation and categorization.</Paragraph> <Paragraph position="1"> Even within one text, the style may not be uniform. (Burrow, 1987) has pointed out that, for example, in fiction: The language of its dialogue and that of its narrative usually differ from each other in some obvious and many less obvious ways.</Paragraph> <Paragraph position="2"> These problems require us to collect high-quality corpora. The recipe domain is a good starting point in this respect. Compared with most other human-written text, its sentences are narrative, imperative and objective. For example, journal articles normally contain a large number of quotations, and they are more subjective. Furthermore, journal articles are more varied in content, even within the same topic. Secondly, most large corpora are not author-categorised. 
This requires us to collect our own individual corpora.</Paragraph> <Section position="1" start_page="34" end_page="34" type="sub_section"> <SectionTitle> 2.1 Our Corpora </SectionTitle> <Paragraph position="0"> As mentioned before, we collected a general corpus in the recipe domain from the BBC food website. To make the recipes varied enough, this corpus covers different cooking styles from western to eastern, different courses, including starters, main courses and desserts, and a number of recipes by famous cooks, such as Ainsley Harriott. Since recipes are widely available both on the Internet and in publications, it is easy to collect author-categorised corpora. Our individual recipe corpora include four individual authors so far. Two of them come from two published recipe books, and the other two we collected online. Recipe books are useful because each is written in a unique style. Table 1 shows information about both our individual corpora and our large general corpus.</Paragraph> <Paragraph position="1"> Although we are focusing on a small domain, verb variation between individual authors is a common phenomenon. Here are a few further examples from our corpora, which we want to capture. 1. BBC corpus: Preheat the oven to 200C/400F/Gas 6. (BBC online food recipes) 2. Author2: Switch on oven to 200C, 400F or Gas 6. (Recipes for Health Eating) 3. Author3: Put the oven on. (Food for Health) 1. BBC corpus: Sift the flour, baking powder and salt into a large bowl. (BBC online Recipes) 2. Author2: Sieve the flour, baking powder and bicarbonate of soda into a large mixing bowl. (Recipes for Health Eating) 3. Author3: Sieve the flour in, one-third at a time. (Food for Health)</Paragraph> </Section> <Section position="2" start_page="34" end_page="35" type="sub_section"> <SectionTitle> 2.2 Our Verb List </SectionTitle> <Paragraph position="0"> We manually built a verb list with 146 verbs in total from our BBC corpus. 
Each verb represents a unique cooking action, associated with definitions and synonyms extracted from WordNet. For example, the verb 'squeeze' contains the information shown in Figure 1. The BBC Corpus also contains a number of synonyms, such as the verbs sift and sieve. In such cases, we pick only the most frequent verb, sift in this case, as a cooking action, and we record its synonyms, such as sieve, in the latter part of our verb list.</Paragraph> </Section> <Section position="3" start_page="35" end_page="35" type="sub_section"> <SectionTitle> 2.3 Using RASP on our corpora </SectionTitle> <Paragraph position="0"> Our data consists of verb-object pairs for verbs obtained from our BBC Corpus using RASP (Briscoe and Carroll, 2002). To derive reliable results, we process our data according to the following rules. To avoid the sparse data problem and parsing mistakes, we removed the verbs that occur fewer than 3 times in our large corpus, as well as a set of spurious verbs produced by the parser. We consider direct objects and indirect objects together.</Paragraph> </Section> </Section> <Section position="5" start_page="35" end_page="35" type="metho"> <SectionTitle> 3 The Baseline Method - WordNet </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="35" end_page="35" type="sub_section"> <SectionTitle> Synonyms </SectionTitle> <Paragraph position="0"> After the individual corpus is parsed, a number of main verbs appear only in the BBC recipe corpus, but not in the individual corpus.</Paragraph> <Paragraph position="1"> Such a main verb is called a missing verb in that corpus. For example, verbs such as 'roast', 'insert', 'drizzle' appear in the BBC corpus, but not in the Food for Health corpus. 
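In code, identifying such missing verbs amounts to a set difference over the main verbs extracted from each corpus. A minimal sketch (the verb sets below are invented examples, not the real corpus data):

```python
# Identify "missing verbs": main verbs present in the general (BBC)
# corpus but absent from an individual author's corpus.
# The verb sets here are invented stand-ins for the real extracted lists.

def missing_verbs(general_verbs, individual_verbs):
    """Return, sorted, the main verbs the individual author never uses."""
    return sorted(set(general_verbs) - set(individual_verbs))

bbc_verbs = {"roast", "insert", "drizzle", "mix", "stir"}
food_for_health_verbs = {"mix", "stir", "bake"}

print(missing_verbs(bbc_verbs, food_for_health_verbs))
# ['drizzle', 'insert', 'roast']
```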
We say they are missing verbs in the Food for Health corpus.</Paragraph> <Paragraph position="2"> In this case, if the individual author expresses actions in the missing verb group, other verbs must be chosen instead. Our purpose is to find the alternatives used by the individual author. To solve this problem, our baseline measure uses WordNet synonyms. If a missing verb has synonyms in the verb list, we pick one as the candidate verb, called an available candidate. The verb alternatives for a missing verb are decided as follows. If there is more than one candidate verb for a missing verb, the most frequent synonym of the missing verb in the individual corpus is chosen as the alternative. The chosen synonym also has to be a main verb in the individual corpus. If the missing verb has no synonym, or none of the available candidates appears in the individual corpus, we assign no alternative to this missing verb. In this case, we say there is no available alternative for the missing verb. The number of available alternatives for the missing verbs and the accuracy are shown in Table 2 and Figure 2.</Paragraph> </Section> </Section> <Section position="6" start_page="35" end_page="36" type="metho"> <SectionTitle> 4 Distributional Similarity Measure </SectionTitle> <Paragraph position="0"> In this section, we introduce the idea of using distributional similarity measures, and discuss how this methodology can help us to capture verbs from individual authors.</Paragraph> <Paragraph position="1"> By calculating the co-occurrence types of target words, distributional similarity defines the similarity between target word pairs. The co-occurrence types of a target word (w) are the contexts, c, in which it occurs, and these have associated frequencies which may be used to form probability estimates (Weeds et al., 2004). In our case, the target words are the main verbs of sentences and the co-occurrence types are their objects. 
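The setup just described, with main verbs as target words and their objects as co-occurrence types, amounts to building a frequency map from each verb to the objects it governs. A toy sketch (the (verb, object) pairs are invented stand-ins for parser output):

```python
from collections import Counter, defaultdict

# Represent each main verb by the objects it co-occurs with,
# as extracted from verb-object pairs.
parsed_pairs = [
    ("preheat", "oven"), ("preheat", "oven"), ("preheat", "grill"),
    ("sift", "flour"), ("sift", "flour"), ("sift", "sugar"),
]

# cooccurrences[verb][object] = co-occurrence frequency
cooccurrences = defaultdict(Counter)
for verb, obj in parsed_pairs:
    cooccurrences[verb][obj] += 1

print(cooccurrences["preheat"]["oven"])  # 2
```

These frequencies are what the weight function D(w, c) is computed from.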
In section 6, similarity between verbs is derived from their objects, since normally there is no subject in the recipe domain. We use the Additive t-test based Co-occurrence Retrieval Model of (Weeds and Weir, 2006). This method considers, for each word w, which co-occurrence types are retrieved. In our case, objects have been extracted from both the BBC Corpus and an individual corpus. Weeds and Weir use the co-occurrence types as the features of a word (w), F(w):</Paragraph> <Paragraph position="3"> where D(w, c) is the weight associated with word w and co-occurrence type c. The t-test is used as the weight function, which is listed later.</Paragraph> <Paragraph position="4"> Weeds and Weir use the following formula to describe the set of True Positives among co-occurrence types, where w1 and w2 are considered main verbs in the corpora:</Paragraph> <Paragraph position="6"> Weeds and Weir then calculate the precision as the proportion of features of w1 which occur in both words, and the recall as the proportion of features of w2 which occur in both words. In our experiment, the precision is relative to the BBC Corpus, and the recall is relative to an individual corpus.</Paragraph> <Paragraph position="8"> Finally, Weeds and Weir combine precision and recall by the following formulas:</Paragraph> <Paragraph position="10"> where both r and b lie in [0,1]. In our experiments, we only assigned r=1. However, further experiments can be performed by assigning different values to r and b.</Paragraph> <Section position="1" start_page="36" end_page="36" type="sub_section"> <SectionTitle> 4.1 The Distributional Similarity method </SectionTitle> <Paragraph position="0"> Each missing verb in the BBC corpus is assigned the most likely verb as the available candidate from the individual corpus. 
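Setting aside the exact t-test weighting, the precision/recall computation of the retrieval model above can be sketched as follows. The weight maps here use plain co-occurrence counts in place of t-test weights, and the combination is the harmonic mean of precision and recall, a simple special case of the full parametrised model, so the resulting scores are on a different scale from the similarity values reported in this paper:

```python
def crm_similarity(weights1, weights2):
    """Co-occurrence retrieval sketch.

    weights1 / weights2 map co-occurrence types (objects) to weights
    D(w, c); a full implementation would use t-test weights.
    """
    shared = set(weights1) & set(weights2)  # true positives TP(w1, w2)
    if not shared:
        return 0.0
    # Precision: weight mass of w1's features that are shared.
    precision = sum(weights1[c] for c in shared) / sum(weights1.values())
    # Recall: weight mass of w2's features that are shared.
    recall = sum(weights2[c] for c in shared) / sum(weights2.values())
    return 2 * precision * recall / (precision + recall)  # harmonic mean

# Invented weight maps for two verbs:
w1 = {"oven": 176, "grill": 15}  # e.g. 'preheat' in the BBC corpus
w2 = {"oven": 33}                # e.g. 'switch' in an individual corpus
print(round(crm_similarity(w1, w2), 3))  # 0.959
```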
The most likely verb is always chosen according to the largest similarity under the DS measure. In our case, if the largest similarity of the verb pair is above a certain threshold (-5.0), we say the missing verb has an available candidate. Otherwise, no available candidate exists in the individual corpus. For instance, DS suggests that the verb 'switch' is the most likely exchangeable verb for the missing verb 'preheat' in the Recipes for Health Eating corpus. 'switch' appears 33 times in the individual corpus, and in all 33 occurrences 'switch' has the same object as 'preheat'. Meanwhile, 'preheat' appears 191 times in total in the BBC corpus, with the same objects as 'switch' 176 times. Using the DS formulas, the similarity value between 'preheat' and 'switch' is 11.99. The number of available candidates for the missing verbs and the accuracy are shown in Table 2 and Figure 2.</Paragraph> <Paragraph position="1"> The original DS measures assume a single corpus.</Paragraph> <Paragraph position="2"> In our case, w1 and w2 are from different corpora.</Paragraph> <Paragraph position="3"> For example, the verb 'preheat' is from the BBC corpus, and the verb 'switch' is from Recipes for Health Eating. Although the co-occurrence types are the objects of the main verb, the precision is for the general corpus (the BBC corpus), and the recall is for the individual corpus in our experiments.</Paragraph> </Section> </Section> <Section position="7" start_page="36" end_page="37" type="metho"> <SectionTitle> 5 The Combination method </SectionTitle> <Paragraph position="0"> We also combine the baseline and the DS method in the combination method. The combination method tries the baseline first. For each missing verb, if the baseline returns an available alternative, this is the final available alternative of the combination method. If not, the available alternative is calculated by the DS method. 
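The fallback order just described, baseline first and then the DS candidate checked against the similarity threshold, can be sketched as follows (the function names and data layout are illustrative, not the authors' implementation):

```python
SIM_THRESHOLD = -5.0  # minimum DS similarity for an available candidate

def choose_alternative(missing_verb, wordnet_alternative, ds_scores):
    """Combination method: WordNet baseline first, then DS fallback.

    wordnet_alternative: the baseline's chosen synonym, or None.
    ds_scores: {candidate_verb: similarity} from the DS measure.
    """
    # Step 1: trust the baseline if it produced an available alternative.
    if wordnet_alternative is not None:
        return wordnet_alternative
    # Step 2: fall back to the most similar DS candidate above threshold.
    if ds_scores:
        best, score = max(ds_scores.items(), key=lambda kv: kv[1])
        if score > SIM_THRESHOLD:
            return best
    return None  # no available alternative for this missing verb

print(choose_alternative("preheat", None, {"switch": 11.99}))  # switch
```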
If the DS method also returns no candidate for the missing verb, there is no available alternative in this case.</Paragraph> </Section> </Paper>