<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1012"> <Title>Semantic Coherence Scoring Using an Ontology</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Ontology-based Scoring of SRHs </SectionTitle> <Paragraph position="0"> ONTOSCORE performs a number of processing steps, each of which is described separately in the respective subsection.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Mapping of SRH to Sets of Concepts </SectionTitle> <Paragraph position="0"> A necessary preprocessing step is to convert each SRH into a concept representation (CR). For that purpose we augmented the system's lexicon with specific concept mappings. That is, for each entry in the lexicon either zero, one or many corresponding concepts were added.</Paragraph> <Paragraph position="1"> A simple vector of the concepts, corresponding to the words in the SRH for which concepts in the lexicon exist, constitutes the resulting CR. All other words with empty concept mappings, e.g. articles, are ignored in the conversion.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract Replacement Abstract Repetition Process Abstract Imitation Process </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Mapping of CR to Graphs </SectionTitle> <Paragraph position="0"> ONTOSCORE converts the domain model, i.e. an ontology, into a directed graph with concepts as nodes and relations as edges. One additional problem that needed to be solved is that the directed subclass-of relations enable path algorithms to ascend the class hierarchy but do not let them descend, so that a significant set of possible paths is missed. 
In order to remedy that situation, the graph was enriched during its conversion with corresponding parent-of relations, which eliminated the directionality problem while avoiding cycles and 0-paths. In order to find the shortest path between two concepts, ONTOSCORE employs the single-source shortest-path algorithm of Dijkstra (Cormen et al., 1990).</Paragraph> <Paragraph position="1"> Given a concept representation $CR = \{c_1, \ldots, c_n\}$, the algorithm runs once for each concept. The Dijkstra algorithm calculates minimal paths from a source node to all other nodes. Then, the minimal paths connecting a given concept $c_i$ with every other concept in CR (excluding $c_i$ itself) are selected, resulting in an $n \times n$ matrix of the respective paths.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 The Scoring Algorithm </SectionTitle> <Paragraph position="0"> To score the minimal paths connecting all concepts with each other in a given CR, we first adopted a method proposed by Demetriou and Atwell (1994) to score the semantic coherence of alternative sentence interpretations against graphs based on the Longman Dictionary of Contemporary English (LDOCE). To construct the graph, the dictionary lemmata were represented as nodes in an isa hierarchy and their semantic relations, which were extracted automatically from the LDOCE, were represented as edges.</Paragraph> <Paragraph position="1"> As defined by Demetriou and Atwell (1994), $R = \{r_1, r_2, \ldots, r_n\}$ is the set of direct relations (both isa and semantic relations) that can connect two nodes (concepts); and $W = \{w_1, w_2, \ldots, w_n\}$ is the set of corresponding weights, where the weight of each isa relation is set to $0$ and that of each other relation to $1$. 
For each pair of concepts $c_i$, $c_j$, the set $S = \{s_1, s_2, \ldots, s_m\}$ denotes the scores of all possible paths that link the two concepts.</Paragraph> <Paragraph position="2"> The score for path $p$ ($p = 1, \ldots, m$) is given as $s_p = \sum_{i=1}^{n} w_i n_i$, where</Paragraph> <Paragraph position="4"> $n_i$ represents the number of times the relation $r_i$ occurs in path $p$. The ensuing distance between two concepts $c_i$ and $c_j$ is then defined as the minimum score derived between $c_i$ and $c_j$, i.e., $D(c_i, c_j) = \min_{1 \le p \le m} s_p$.</Paragraph> <Paragraph position="6"> The algorithm selects from the set of all paths between two concepts the one with the smallest weight, i.e. the cheapest. The distances between all concept pairs in CR are summed up to a total score. The set of concepts with the lowest aggregate score represents the combination with the highest semantic relatedness.</Paragraph> <Paragraph position="7"> Demetriou and Atwell (1994) do not provide concrete evaluation results for the method. Also, their algorithm only allows for a relative judgment stating which of a set of interpretations of a single sentence is more semantically related.</Paragraph> <Paragraph position="8"> Since our objective is to compute semantic coherence scores of arbitrary CRs on an absolute scale, certain extensions are necessary. In this application, the CRs to be scored can differ in terms of their content, the number of concepts contained therein and their mappings to the original SRH. Moreover, in order to achieve absolute values, the final score should be related to the number of concepts in an individual set and the number of words in the original SRH. 
Therefore, the results must be normalized in order to allow for evaluation, comparability and clearer interpretation of the semantic coherence scores.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Scoring Concept Representations </SectionTitle> <Paragraph position="0"> We modified the algorithm described above to make it applicable and evaluable with respect to the task at hand as well as other possible tasks. The basic idea is to calculate a score based on the path distances in $CR$.</Paragraph> <Paragraph position="1"> Since short distances indicate coherence and many concept pairs in a given $CR$ may have no connecting path, we define the distance between two concepts $c_i$ and $c_j$ that are only connected via isa relations in the knowledge base as $D_{max}$. This maximum value can also serve as an upper bound for long distances and can thus help to prune the search tree for long paths. This constant has to be set according to the structure of the knowledge base. For example, employing the ontology described above, the maximum distance between two concepts does not exceed ten and we chose in that case $D_{max} = 10$. We can now define the semantic coherence score for $CR$ as the average path length between all concept pairs: $S(CR) = \frac{\sum_{c_i, c_j \in CR,\, c_i \neq c_j} D(c_i, c_j)}{|CR|^2 - |CR|}$.</Paragraph> <Paragraph position="3"> Since the graph is directed, there are $|CR|^2 - |CR|$ pairs of concepts with possible directed connections, i.e., a path from concept $c_i$ to concept $c_j$ may be completely different from that from $c_j$ to $c_i$, or may even be missing. As a symmetric alternative, we may want to consider a path from $c_i$ to $c_j$ and a path from $c_j$ to $c_i$ to be semantically equivalent and thus model every relation in a bidirectional way. We can then compute a symmetric score in which each unordered pair of concepts contributes a single distance.</Paragraph> <Paragraph position="5"> ONTOSCORE implements both options. In the ontology currently employed by the system some reverse relations can be found, e.g. 
given $c_1$ = Broadcast and $c_2$ = Channel, there exists a path from $c_1$ to $c_2$ via the relation has-channel and a different path from $c_2$ to $c_1$ via the relation has-broadcast. However, such reverse relations are only sporadically represented in the ontology. Consequently, it is difficult to account for their influence on $S(CR)$ in general. That is why we chose the symmetric option, in which only the shortest distance $D(c_i, c_j)$ between a given pair of concepts, regardless of the direction, is taken into account.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Word/Concept Relation </SectionTitle> <Paragraph position="0"> Given the algorithm proposed above, a significant number of misclassifications for SRHs would result from cases in which an SRH contains a high proportion of function words (having no conceptual mappings in the resulting CR) and only a few content words. Let's consider the following example: The corresponding CR consists of a single concept, Information Search Process.</Paragraph> <Paragraph position="1"> ONTOSCORE would classify the CR as coherent with the highest possible score, as this is the only concept in the set. This, however, would often lead to misclassifications. We therefore included a post-processing technique that takes into account the relation between the number of ontology concepts $N_c$ in a given CR and the total number of words $N_w$ in the original SRH. This relation is defined by the ratio $\rho = N_c / N_w$. ONTOSCORE automatically classifies an SRH as being incoherent, irrespective of its semantic coherence score, if $\rho$ is less than the threshold set. The threshold may be set freely. 
The corresponding findings are presented in the evaluation section.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.6 ONTOSCORE at Work </SectionTitle> <Paragraph position="0"> Looking at an example of ONTOSCORE at work, we will examine the utterance given in Example (1). The resulting two SRHs, $SRH_1$ and $SRH_2$, are given in Examples (1a) and (1b) respectively. The human annotators considered $SRH_1$ to be coherent and labeled $SRH_2$ as incoherent. According to the concept entries in the lexicon, the SRHs are transformed into two alternative concept representations. As no ambiguous words are found in this example, $CR_1$ corresponds to $SRH_1$ and $CR_2$ corresponds to $SRH_2$.</Paragraph> <Paragraph position="2"> They are converted into a graph. According to the algorithm shown in Section 4.3, all paths between the concepts of each graph are calculated and weighted. This yields the following non-$D_{max}$ paths: In both cases the results are sufficient for a relative judgment, i.e. $SRH_2$ constitutes a less semantically coherent structure than $SRH_1$. To allow for a binary classification into semantically coherent vs. incoherent samples, a cut-off threshold must be set. The results of the corresponding experiments will be presented in Section 5.2.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.7 Word Sense Disambiguation </SectionTitle> <Paragraph position="0"> Due to lexical ambiguity, the process of transforming an n-best list of SRHs into concept representations often yields more than one CR per SRH, i.e. a given SRH could be transformed into a set of CRs $\{CR_1, \ldots, CR_n\}$. Word sense disambiguation could therefore also be performed independently, using the semantic coherence scoring described herein, as an additional application of our approach. 
However, that has not been investigated thoroughly yet.</Paragraph> <Paragraph position="1"> For example, lexicon entries for the words: and corresponding final scores:</Paragraph> <Paragraph position="3"> The examination of the resulting scores allows us to conclude that $CR_2$ constitutes the most semantically coherent representation of the initial SRH, $CR_1$ and $CR_3$ display a slightly lesser degree of semantic coherence, whereas $CR_4$, $CR_5$ and $CR_6$ are much less coherent and may, thus, be considered inadequate.</Paragraph> </Section> </Section> </Paper>