Using Distributional Similarity to Identify Individual Verb Choice

1 Introduction

Human text is characterised by the individual lexical choices of the specific author: it varies from author to author, and individual authors use different verbs to describe the same action. Natural language generation (NLG) systems, in contrast, normally produce uniform output without considering other lexical possibilities. Consider the following example from our corpora, the BBC corpus and the Recipes for health eating corpus.

1. BBC Corpus: Finely grate the ginger and squeeze out the juice into a shallow nonmetallic dish. (BBC online recipes)

2. Author2: Extract juice from orange and add this with the water to the saucepan. (Recipes for health eating)

Here, we can see that the two authors express the same type of action with different verbs, 'squeeze' and 'extract'. In fact, when expressing this action, the BBC corpus always uses the verb 'squeeze', and Author2 only uses the verb 'extract'. We can therefore assume that Author2 uses the verb 'extract' to describe the same action as the verb 'squeeze' in the BBC corpus. The purpose of our research is to develop an NLG system that can detect these kinds of individual writing features, such as the verb choices of individual authors, and can then generate personalised text.

The input of our personalised NLG system is an unseen recipe from the BBC food website. Our system then translates all sentences into the style of a particular author, based on features drawn from analysing an individual corpus we collected. In this paper, we address the verb choice of the individual author in this translation process.

Our system defines the writing style of an individual author by analysing an individual corpus; it is therefore a corpus-based NLG system. Lexical choice for individual authors is predicted by analysing the distributional similarity between words in a large general recipe corpus, which supplies the verbs used as action representations, and words in a specific individual recipe corpus. Firstly, we collected a large corpus in the recipe domain from the BBC online website; this large recipe corpus is used to extract feature values, for example verb choice, by analysing an individual corpus. Secondly, we collected individual corpora for a number of individual authors; each of these is used to extract feature values that may define the individual writing style. An individual author may choose the same or a different verb to describe a cooking action, and the question is how to identify the individual choice. For example, Author2 uses the verb 'extract' instead of the verb 'squeeze'. If an author does express an action with a different verb, the problem is how our system picks out verbs according to the individual choice of that author.

One way to solve this problem is to access a large-scale manually constructed thesaurus such as WordNet (Fellbaum, 1998), Roget's (Roget, 1911) or the Macquarie (Bernard, 1990) to get all synonyms of a verb and choose the most frequent one in the individual corpus.
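To illustrate this thesaurus-based option, the sketch below uses NLTK's WordNet interface to collect verb synonyms of a big-corpus verb and then picks the candidate that occurs most often in the individual corpus. This is a minimal sketch of the baseline idea, not the implementation described in this paper; the function name choose_synonym and the toy verb counts are assumptions introduced only for illustration.

    # Thesaurus-based baseline (illustrative sketch, not the authors' system):
    # look up WordNet verb synonyms and pick the one the individual author uses most.
    from collections import Counter
    from nltk.corpus import wordnet as wn

    def choose_synonym(verb, individual_verb_counts):
        """Return the WordNet synonym of `verb` that is most frequent in the
        individual corpus, or None if no synonym occurs there."""
        candidates = set()
        for synset in wn.synsets(verb, pos=wn.VERB):
            for lemma in synset.lemmas():
                candidates.add(lemma.name().replace('_', ' '))
        candidates.discard(verb)
        # Keep only synonyms attested in the individual corpus.
        attested = {v: individual_verb_counts[v] for v in candidates
                    if individual_verb_counts.get(v, 0) > 0}
        if not attested:
            return None  # e.g. 'extract' is not a WordNet synonym of 'squeeze'
        return max(attested, key=attested.get)

    # Hypothetical verb counts for an individual author (Author2).
    individual_verb_counts = Counter({'extract': 12, 'add': 40, 'grate': 7})
    print(choose_synonym('squeeze', individual_verb_counts))  # None: the baseline fails here

As the example output suggests, such a baseline can only succeed when the author's preferred verb happens to be listed as a synonym in the thesaurus.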
Another possible approach is to use a lexical knowledge-based resource such as VerbNet (Kipper et al., 2000) to obtain further possible lexical choices. However, both methods only provide a set of pre-produced lexical choices that may or may not be the words that the individual author would choose. In other words, the lexical choice of an author may not be among the synonyms extracted from one of the thesauri, and may not even belong to the same semantic class. In our example, 'squeeze' and 'extract' are neither synonyms nor coordinate terms in WordNet. In a small domain, it is possible to manually build a verb list so that each action is described by a set of possible verbs. The drawback is that this is expensive; furthermore, it still cannot capture verbs that are not included in the list. Is it possible to predict the individual verbs automatically? The distributional hypothesis (Harris, 1968) states the following: the meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities.

Over recent years, many applications (Lin, 1998; Lee, 1999; Lee, 2001; Weeds et al., 2004; Weeds and Weir, 2006) have investigated the distributional similarity of words. Similarity here means that words with similar meanings tend to appear in similar contexts. In NLG, semantic similarity is usually preferred to purely distributional similarity. In our case, however, the most important thing is to capture the most probable verb that an individual author uses to express an action. That verb can be the same verb as in the big corpus, a synonym or coordinate term of it, or any other verb that the individual author chooses for this action. When we check an individual corpus, a set of verbs in our list does not occur in it; if the corresponding actions do occur in that corpus, the individual author must be using different verbs. Distributional similarity helps us to build the links between the verbs in our list and the verbs in an individual corpus; a minimal illustration of such a context-based comparison is given at the end of this section.

The rest of this paper is organised as follows. In Section 2, we describe the recipe domain, our corpora and our verb list. Section 3 discusses our baseline system. In Section 4, we present the distributional similarity measures that we propose for analysing our corpora. The combination method is discussed in Section 5. In Section 6, we present an evaluation of our results. In Section 7, we draw conclusions and discuss future work.
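To make the idea of linking verbs through distributional similarity concrete, the following sketch represents each verb by counts of the object nouns it co-occurs with and compares verbs by cosine similarity. The choice of grammatical relation, the toy counts and the function names are assumptions for illustration only; the measures actually used in our system are the ones presented in Section 4.

    # Minimal distributional-similarity sketch: verbs are represented by counts of
    # the object nouns they co-occur with, and compared with cosine similarity.
    import math
    from collections import Counter

    def cosine(c1, c2):
        """Cosine similarity between two sparse count vectors (Counters)."""
        dot = sum(c1[f] * c2[f] for f in c1 if f in c2)
        norm1 = math.sqrt(sum(v * v for v in c1.values()))
        norm2 = math.sqrt(sum(v * v for v in c2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    # Toy co-occurrence counts (verb -> object-noun counts); real counts would be
    # extracted from the parsed recipe corpora.
    big_corpus = {
        'squeeze': Counter({'juice': 9, 'lemon': 6, 'orange': 4}),
        'chop':    Counter({'onion': 12, 'parsley': 5, 'garlic': 7}),
    }
    individual_corpus = {
        'extract': Counter({'juice': 5, 'orange': 3}),
        'dice':    Counter({'onion': 6, 'carrot': 4}),
    }

    # Link each verb in the big-corpus list to its most similar individual-corpus verb.
    for verb, context in big_corpus.items():
        best = max(individual_corpus, key=lambda v: cosine(context, individual_corpus[v]))
        print(verb, '->', best, round(cosine(context, individual_corpus[best]), 3))

On these toy counts the sketch links 'squeeze' to 'extract' because the two verbs share object nouns such as 'juice' and 'orange', which is exactly the kind of link that the thesaurus-based baseline above cannot establish.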