<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2176"> <Title>Learning Correlations between Linguistic Indicators and Semantic Constraints: Reuse of Context-Dependent Descriptions of Entities</Title> <Section position="4" start_page="1072" end_page="1074" type="metho"> <SectionTitle> 3 Language Reuse in Text Generation </SectionTitle> <Paragraph position="0"> Text generation usually involves lexical choice - that is, choosing one way of referring to an entity over another. Lexical choice refers to a variety of decisions that have to be made in text generation. For example, picking one among several equivalent (or nearly equivalent) constructions is a form of lexical choice (e.g., &quot;The Utah Jazz handed the Boston Celtics a defeat&quot; vs. &quot;The Utah Jazz defeated the Boston Celtics&quot; (Robin, 1994)). We are interested in a different aspect of the problem: namely, learning the rules that can be used for automatically selecting an appropriate description of an entity in a specific context.</Paragraph> <Paragraph position="3"> To be feasible and scalable, a technique for solving a particular case of the problem of lexical choice must involve automated learning. It is also useful if the technique can specify enough constraints on the text to be generated so that the number of possible surface realizations that match the semantic constraints is reduced significantly. The easiest case in which lexical choice can be made is when the full surface structure can be used, and when it has been automatically extracted from a corpus. 
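The description-selection task described above can be illustrated with a toy profile lookup. In this sketch, the profile rows, feature names, and values are hypothetical stand-ins for the automatically extracted profiles discussed later in the paper:

```python
# Toy profile lookup: pick the description whose semantic features
# satisfy the most constraints. The rows and feature names here are
# hypothetical, not taken from the paper's actual PROFILE data.

def select_description(profile, constraints):
    """Return the description of the row matching the most constraints."""
    def score(row):
        return sum(1 for key, value in constraints.items()
                   if row["features"].get(key) == value)
    return max(profile, key=score)["description"]

profile = [
    {"description": "Cambodian foreign minister",
     "features": {"country": "Cambodia", "position": "foreign minister"}},
    {"description": "His Excellency",
     "features": {"honorific": "yes"}},
]

constraints = {"country": "Cambodia", "position": "foreign minister"}
print(select_description(profile, constraints))
# -> Cambodian foreign minister
```

The scoring function simply counts satisfied constraints; a fuller treatment would weight constraints or break ties by other indicators.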
Of course, the constraints on the use of the structure in the generated text have to be reasonably similar to the ones in the source text.</Paragraph> <Paragraph position="4"> We have found that a natural application for the analysis of entity-description pairs is language reuse, which includes techniques of extracting shallow structure from a corpus and applying that structure to computer-generated texts.</Paragraph> <Paragraph position="5"> Language reuse involves two components: a source text written by a human and a target text that is to be automatically generated by a computer, partially making use of structures reused from the source text. The source text is the one from which particular surface structures are extracted automatically, along with the appropriate syntactic, semantic, and pragmatic constraints under which they are used. Some examples of language reuse include collocation analysis (Smadja, 1993), the use of entire factual sentences extracted from corpora (e.g., &quot;'Toy Story' is the Academy Award-winning animated film developed by Pixar&quot;), and summarization using sentence extraction (Paice, 1990; Kupiec et al., 1995). In the case of summarization through sentence extraction, the target text has the additional property of being a subtext of the source text. Other techniques that can be broadly categorized as language reuse are learning relations from on-line texts (Mitchell, 1997) and answering natural language questions using an on-line encyclopedia (Kupiec, 1993).</Paragraph> <Paragraph position="6"> Studying the concept of language reuse is rewarding because it allows generation systems to leverage texts written by humans and their deliberate choice of words, facts, and structure.</Paragraph> <Paragraph position="7"> We mentioned that for language reuse to take place, the generation system has to use the same surface structure in the same syntactic, semantic, and pragmatic context as the source text from which it was extracted. 
Obviously, all of this information is typically not available to a generation system. There are some special cases in which most of it can be automatically computed. Descriptions of entities are a particular instance of a surface structure that can be reused relatively easily. Syntactic constraints related to the use of descriptions are modest - since descriptions are always noun phrases that appear as either pre-modifiers or appositions 2, they are quite flexibly usable in any generated text in which an entity can be modified with an appropriate description. We will show in the rest of the paper how the requisite semantic (i.e., &quot;what is the meaning of the description to pick&quot;) and pragmatic constraints (i.e., &quot;what purpose does using the description achieve?&quot;) can be extracted automatically.</Paragraph> <Paragraph position="8"> Given a profile like the one shown in Table 1, and an appropriate set of semantic constraints (columns 2-7 of the table), the generation component needs to perform a profile lookup and select a row (description) that satisfies most or all semantic constraints. For example, if the semantic constraints specify that the description has to include the country and the political position of Ung Huot, the most appropriate description is &quot;Cambodian foreign minister&quot;.</Paragraph> </Section> <Section position="5" start_page="1074" end_page="1075" type="metho"> <SectionTitle> 4 Experimental Setup </SectionTitle> <Paragraph position="0"> In our experiments, we have used two widely available tools - WordNet and Ripper.</Paragraph> <Paragraph position="1"> WordNet (Miller et al., 1990) is an on-line hierarchical lexical database which contains semantic information about English words (including hypernymy relations which we use in our system). We use chains of hypernyms when we need to approximate the usage of a particular word in a description using its ancestor and sibling nodes in WordNet. 
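Following such a chain of hypernyms can be sketched as below. The hierarchy here is a two-node toy table rather than WordNet itself; the offsets are the illustrative node numbers quoted in this section, and the chain structure is an assumption for demonstration:

```python
# Toy hypernym table keyed by synset offset. The offsets match the
# example nodes quoted in the text; the table itself is a stand-in
# for the real WordNet hierarchy.

HYPERNYM = {
    "07063507": "07311393",  # administrator, decision maker -> head, chief, top dog
    "07311393": None,        # chain ends here in this toy hierarchy
}

def hypernym_chain(offset):
    """Collect synset offsets from a node up through its hypernyms."""
    chain = []
    while offset is not None:
        chain.append(offset)
        offset = HYPERNYM.get(offset)
    return chain

print(hypernym_chain("07063507"))  # ['07063507', '07311393']
```

With a full WordNet installation, the same walk is a repeated lookup of each synset's hypernym pointer until the root is reached.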
Particularly useful for our application are the synset offsets of the words in a description. The synset offset is a number that uniquely identifies a concept node (synset) in the WordNet hierarchy. Figure 3 shows that the synset offset for the concept &quot;administrator, decision maker&quot; is &quot;{07063507}&quot;, while its hypernym, &quot;head, chief, top dog&quot;, has a synset offset of &quot;{07311393}&quot;. (Footnote 2: We haven't included relative clauses in our study.)</Paragraph> <Paragraph position="3"> Ripper (Cohen, 1995) is an algorithm that learns rules from example tuples in a relation.</Paragraph> <Paragraph position="4"> Attributes in the tuples can be integers (e.g., length of an article, in words), sets (e.g., semantic features), or bags (e.g., words that appear in a sentence or document). We use Ripper to learn rules that correlate context and other linguistic indicators with the semantics of the description being extracted and subsequently reused. It is important to notice that Ripper is designed to learn rules that classify data into atomic classes (e.g., &quot;good&quot;, &quot;average&quot;, and &quot;bad&quot;). We had to modify its algorithm in order to classify data into sets of atoms. For example, a rule can have the form &quot;if CONDITION then \[{07063762} {02864326} {00017954}\]&quot;3. 
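One common way to realize such set-valued output on top of an atomic-class rule learner (a plausible reading, not necessarily the authors' exact modification to Ripper) is to keep one condition per label and predict the union of the labels whose conditions fire. The rules below are hand-written stand-ins for learned ones:

```python
# Sketch of set-valued classification built from atomic rules: each
# rule predicts a single label, and the overall prediction is the
# union of labels whose conditions fire. The keywords and labels are
# hypothetical, not learned rules from the paper.

def make_rule(keyword, label):
    """A toy rule: predict `label` when `keyword` occurs in the context bag."""
    def rule(context):
        return label if keyword in context else None
    return rule

rules = [
    make_rule("minister", "07063762"),
    make_rule("cambodia", "02864326"),
    make_rule("minister", "00017954"),
]

def predict_labels(context):
    """Union of the labels of all rules whose condition is met."""
    labels = set()
    for rule in rules:
        label = rule(context)
        if label is not None:
            labels.add(label)
    return labels

print(sorted(predict_labels({"minister", "cambodia"})))
# -> ['00017954', '02864326', '07063762']
```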
This rule states that if a certain &quot;CONDITION&quot; (which is a function of the indicators related to the description) is met, then the description is likely to contain words that are semantically related to the three WordNet nodes \[{07063762} {02864326} {00017954}\].</Paragraph> <Paragraph position="5"> The stages of our experiments are described in detail in the remainder of this section.</Paragraph> <Section position="1" start_page="1074" end_page="1074" type="sub_section"> <SectionTitle> 4.1 Semantic tagging of descriptions </SectionTitle> <Paragraph position="0"> Our system, PROFILE, processes WWW-accessible newswire on a round-the-clock basis and extracts entities (people, places, and organizations) along with related descriptions. The extraction grammar, developed in CREP (Duford, 1993), covers a variety of pre-modifier and appositional noun phrases.</Paragraph> <Paragraph position="1"> For each word w_i in a description, we use a version of WordNet to extract the synset offset of the immediate parent of w_i.</Paragraph> </Section> <Section position="2" start_page="1074" end_page="1075" type="sub_section"> <SectionTitle> 4.2 Finding linguistic cues </SectionTitle> <Paragraph position="0"> Initially, we were interested in discovering rules manually and then validating them using the learning algorithm. However, the task proved (nearly) impossible considering the sheer size of the corpus. One possible rule that we hypothesized and wanted to verify empirically at this stage was parallelism. This linguistically-motivated rule states that in a sentence with a parallel structure (consider, for instance, the sentence fragment &quot;... Alija Izetbegovic, a Muslim, Kresimir Zubak, a Croat, and Momcilo Krajisnik, a Serb... &quot;) all entities involved have similar descriptions. However, rules at such a detailed syntactic level take too long to process on a 180 MB corpus and, further, no more than a handful of such rules can be discovered manually. 
As a result, we decided to extract all indicators automatically. We would also like to note that using syntactic information on such a large corpus doesn't appear particularly feasible. We therefore limited our investigation to lexical, semantic, and contextual indicators only. The following subsection describes the attributes used.</Paragraph> </Section> <Section position="3" start_page="1075" end_page="1075" type="sub_section"> <SectionTitle> 4.3 Extracting linguistic cues automatically </SectionTitle> <Paragraph position="0"> The indicators that we use in our system are the following: * Context: (using a window of size 4, excluding the actual description used, but not the entity itself) - e.g., &quot;\['clinton' 'clinton' 'counsel' 'counsel' 'decision' 'decision' 'gore' 'gore' 'ind' 'ind' 'index' 'news' 'november' 'wednesday'\]&quot; is a bag of words found near the description of Bill Clinton in the training corpus.</Paragraph> <Paragraph position="1"> * Length of the article: - an integer.</Paragraph> <Paragraph position="2"> * Name of the entity: - e.g., &quot;Bill Clinton&quot;. * Profile: The entire profile related to a person (all descriptions of that person that are found in the training corpus).</Paragraph> <Paragraph position="3"> * Synset Offsets: - the WordNet node numbers of all words (and their parents) that appear in the profile associated with the entity that we want to describe.</Paragraph> </Section> <Section position="4" start_page="1075" end_page="1075" type="sub_section"> <SectionTitle> 4.4 Applying machine learning method </SectionTitle> <Paragraph position="0"> To learn rules, we ran Ripper on 90% (10,353) of the entities in the entire corpus. We kept the remaining 10% (or 1,151 entities) for evaluation.</Paragraph> <Paragraph position="1"> Sample rules discovered by the system are shown in Table 3.</Paragraph> </Section> </Section> </Paper>