Constructing Text Sense Representations

3 TSR Trees

In this section we informally describe our two algorithms for constructing Text Sense Representation trees. The first algorithm builds "initial" TSR trees for single input words or very short phrases (Section 3.1); the second generates "derived" TSR trees for arbitrary texts from pre-computed TSR trees.

3.1 Building Initial TSR Trees

The algorithm for building initial TSR trees is based on the retrieval of pages from a "web directory" (other sources use the term "web catalogue" (Sebastiani, 2003)), a browsable taxonomy of web pages. These web pages are parsed, and category descriptions and weight values are extracted from them. The extracted information is then merged into term-specific TSR trees, which are optionally normalized and pruned.

In the following explanations we use the notions "input term", "input word" and "input phrase" as follows: an input term is any text that is used as input to an algorithm or program. An input word is any single word that is an input term. A word is defined as a sequence of alphanumeric symbols not interrupted by whitespace. An input phrase is any sequence of input words that are separated by whitespace or other non-alphanumeric characters.

Our algorithm takes single words or very short phrases as input terms and assumes that every part of the input phrase has pragmatic and semantic relevance. Input term selection is therefore a fundamental prerequisite in this context.

The output of our algorithm is a tree structure of labeled and weighted nodes. The labels are short phrases that provide some meaningful context, while the weights are simple integer numbers. Each tree node has exactly one label and one weight attached to it.

The following steps (a-f) explain how to generate initial TSR trees; a code sketch summarizing steps b-e follows the list.

a. Retrieval. The input term is redirected as input to a web directory.

Since our prototype was based on the "Open Directory Project" (ODP), consisting of a web directory and a search engine (Netscape Inc., 2004), we will refer to this particular service throughout this article and use it as our implementation data source. Nonetheless, our algorithm is not restricted to the ODP but can use other web directories such as Yahoo (Yahoo Inc., 2004) or even internet newsgroups.

The web directory in use is not assumed to meet strict requirements in terms of sensible category labels, balanced branches, etc.; it can be any taxonomic structure, provided it can be transformed into weighted paths and is large enough to cover a substantial subset of the target language.

The outcome of this redirection is an HTML-formatted list of categories, including the number of hits for each category.
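To make this outcome concrete, the sketch below turns such a category listing into weighted category terms, anticipating the tree construction step described next. It is a minimal Python sketch under our own assumptions: the line format (a colon-separated category path followed by a hit count in parentheses) and all names are illustrative only, and the real ODP markup would require a proper HTML parser rather than a regular expression.

```python
import re

# Hypothetical line format for the directory's result listing; the real
# ODP markup differs, so this pattern is purely illustrative.  Each line
# is assumed to carry a colon-separated category path followed by a hit
# count in parentheses, e.g. "Business: Accounting (42)".
TERM_RE = re.compile(r"^(?P<path>[^(]+?)\s*\((?P<hits>\d+)\)\s*$")

def parse_directory_listing(lines):
    """Convert the retrieved category listing into weighted category terms."""
    terms = []
    for line in lines:
        match = TERM_RE.match(line)
        if not match:
            continue  # skip navigation links and other page residue
        path = [part.strip() for part in match.group("path").split(":")]
        terms.append((path, int(match.group("hits"))))
    return terms

# Excerpt of a hypothetical listing for the "account" example:
listing = [
    "Business: Accounting (42)",
    "Business: Financial Services: Banking Services (17)",
    "Computers: Software: Accounting (9)",
]
print(parse_directory_listing(listing))
# [(['Business', 'Accounting'], 42), ...]
```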
b. Tree Construction. The lines of the output list returned by the web directory are then parsed and converted into a sequence of weighted category terms. Because each sequence represents a different contextual use of the word (in the symbolic sense), each sequence also represents a different sense of that word in that topical context.

Each term contains a single category path label and the number of query hits within that category. An excerpt of the terms for the "account" example is shown in Figure 1.

After that, all terms are merged into a single hierarchical tree with weighted and labeled nodes. Figure 2 provides an example. The resulting tree then represents the input text phrase. Even though the uniqueness of this representation cannot be guaranteed in theory, a clash between the representations of two different terms is highly unlikely.

The tree generation process obviously fails if the input term cannot be found within the web directory, since then no categorical context is available for that term.

c. Normalization. In order to enable uniform processing of arbitrary trees, each tree has to be "normalized" by weight adjustment: the sum of all node weights is taken as 100 percent, and all weights are then recalculated as percentages of this overall sum. The sum weight is attached to the TSR tree root as the "tree weight".

d. Node Pruning (optional). Due to the nature of the underlying web directory, there are sometimes false positives in wrong categories, i.e. when a term is used in a rather esoteric way (e.g. as part of a figure of speech). In order to sort out such "semantic noise", "insignificant" nodes can be deleted using a common heuristic. Preliminary experiments have shown that applying a certain threshold to the node weight percentage is a good heuristic. An example of a processed TSR tree is shown in Figure 4, while the corresponding unprocessed TSR tree is depicted in Figure 3. (The labels within these figures might be printed too small to read, but it is the shape of the structure that is important rather than the individual node labels.)

e. List Transformation (optional). It is possible to transform a TSR tree into a list: by iterating over the TSR tree and selecting only the nodes with the highest weight at each respective depth, a TSR list is easily created. This list represents the most common meaning of the input term. Since this meaning is applicable in most cases, sufficiently robust algorithms may use these lists successfully, e.g. for simple text classification purposes. An example list, derived from the "account" example (Figure 2), is depicted in Figure 5.

[Figure 5: List representation of the "account" example (from excerpt)]

f. External Representation (optional). Lastly, the tree is converted into an external representation in the RDF language (Lassila and Swick, 1999). We chose this particular paradigm because it is an open standard and well suited to representing graph- and tree-based data. Furthermore, a number of tools for dealing with RDF already exist - RDF is one of the basic building blocks of the Semantic Web (Lassila and Swick, 1999) - and we expect RDF-based TSR trees to be of great use in that domain (e.g. for classification and information extraction).
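Steps b through e can be summarized in a short sketch that operates on weighted category terms such as those produced by the parsing sketch above. The Node structure, the pruning threshold handling and the greedy list extraction are our own illustrative reading of the steps, not the prototype's actual code.

```python
class Node:
    """A TSR tree node: one label, one weight, children keyed by label."""
    def __init__(self, label):
        self.label = label
        self.weight = 0.0
        self.children = {}

def iter_nodes(node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.children.values():
        yield from iter_nodes(child)

def build_tsr_tree(terms):
    """Step b: merge weighted category terms into one hierarchical tree."""
    root = Node("<root>")
    for path, hits in terms:
        node = root
        for label in path:
            node = node.children.setdefault(label, Node(label))
            node.weight += hits  # shared path prefixes accumulate weight
    return root

def normalize(root):
    """Step c: recalculate node weights as percentages of the overall sum."""
    total = sum(n.weight for n in iter_nodes(root))  # root itself carries 0
    for node in iter_nodes(root):
        node.weight = 100.0 * node.weight / total
    root.weight = total  # the sum is attached to the root as "tree weight"
    return root

def prune(node, threshold):
    """Step d (optional): drop "insignificant" nodes below a weight threshold."""
    node.children = {label: child for label, child in node.children.items()
                     if child.weight >= threshold}
    for child in node.children.values():
        prune(child, threshold)
    return node

def to_tsr_list(root):
    """Step e (optional): one reading of "highest weight at each depth" is
    to greedily follow the heaviest child from the root downwards."""
    labels, node = [], root
    while node.children:
        node = max(node.children.values(), key=lambda c: c.weight)
        labels.append(node.label)
    return labels

# The "account" excerpt from above, already parsed into weighted terms:
terms = [(["Business", "Accounting"], 42),
         (["Business", "Financial Services", "Banking Services"], 17),
         (["Computers", "Software", "Accounting"], 9)]
tree = prune(normalize(build_tsr_tree(terms)), 1.0)
print(to_tsr_list(tree))  # ['Business', 'Accounting']
```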
Summary. In this section we presented the construction of computer-usable, complex TSR trees by utilizing an underlying web directory containing explicit world knowledge. The generated trees are our basic building blocks for representing the "sense" of the input term in a programmatically usable way. The construction of a TSR tree can therefore be seen as the result of a (shallow) text "understanding" process as defined in (Allen, 1995).

3.2 Constructing Derived TSR Trees

TSR trees can also be constructed by merging existing TSR trees. This process provides a means of dealing with complex phrases: through adding TSR trees (applying the set union operation), it is possible to acquire TSR trees of arbitrary text fragments, i.e. to build the TSR tree of a text by merging the TSR trees of its constituents (see the sketch at the end of this subsection).

By using the derivation process, TSR trees can be built for arbitrary input texts while maintaining comparability through the respective tree features (see Section 4.1).

Since TSR trees consist of weighted paths, out-of-context senses of single terms are eliminated in the merging process. This makes the use of TSR trees on large texts very robust (preliminary experiments have shown that virtually all errors occur in preprocessing steps such as language identification). Superficial investigation showed that TSR trees generated from complex descriptions are of higher quality than TSR trees from single terms (less "semantic noise", more expressive features).

On the other hand, the derivation process (in conjunction with a dictionary) can also be used to build the TSR tree of a description of a word that cannot be found in the web directory, as a substitute for the word itself.

It is a matter of current research whether TSR trees derived from dictionary-like descriptions of terms are in general preferable to the use of initial TSR trees (see the discussion of the "distance" feature in Section 4.2).
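A minimal sketch of this derivation step, reusing the Node class from the sketch in Section 3.1: we read "adding TSR trees (applying the set union operation)" as uniting the category paths of two trees and summing the weights of nodes they share; whether the original system also sums weights is our assumption, and the variable names in the usage comment are hypothetical.

```python
import copy

def merge_trees(a, b):
    """Derive a TSR tree for a text fragment by uniting the trees of its
    constituents; weights of shared categories are added up (one plausible
    reading of the set union operation described in the paper)."""
    merged = Node(a.label)
    merged.weight = a.weight + b.weight
    for label in set(a.children) | set(b.children):
        if label in a.children and label in b.children:
            merged.children[label] = merge_trees(a.children[label],
                                                 b.children[label])
        else:
            # category occurs in only one tree: copy its branch unchanged
            branch = a.children.get(label) or b.children[label]
            merged.children[label] = copy.deepcopy(branch)
    return merged

# e.g. derive a tree for a two-word fragment (hypothetical input trees),
# re-normalizing afterwards so the weights are percentages again:
# fragment_tree = normalize(merge_trees(tree_account, tree_balance))
```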
4 Using Text Sense Representation Trees

In this section, the term "feature" is used almost synonymously with the term "operation": a feature is a piece of information that is retrieved by applying the corresponding operation to a TSR tree.

It is important to note that even though the TSR trees themselves depend strongly on the underlying web directory, the resulting features do not show this weakness. Any NLP application implementing algorithms based on TSR trees should not rely on the tree representations themselves (in terms of tree structure or node labels), but rather on the operational TSR tree features discussed in this section.

4.1 Simple TSR Tree Features

At first, we define a set of four features that can be computed for single TSR trees (a sketch of their computation follows the list):

1. Tree Weight. The individual tree's weight can be interpreted as a quantitative measure of the input term within the web directory. By comparing the weights of individual trees, it is possible to determine which term occurs more often in written language (as defined by the web directory itself).

2. Generality. The tree's breadth factor is an indicator of the "generality" of the input term: the broader the tree, the more general the use of the word in written language and the more textual contexts the word can be used in. General terms tend not to be specific to particular web pages and hence show up in a number of pages throughout the web taxonomy. In contrast, less general terms tend to occur only on pages specific to a particular category in the web taxonomy.

3. Domain Dependency. The tree's depth factor can be interpreted as a "domain dependency indicator" of the input term. Deep structures only occur when the input term is most often located at deep subcategory levels in the web directory, which indicates restricted use of that term within a particular domain of interest.

4. Category Label. Usually the node labels themselves provide clues to the respective term's meaning. Even though these clues may be quite subjective and in some cases misleading or incomplete, in most cases they can serve as hints for human skimming of the categories involved. Since these labels are provided as English words or short phrases, they may themselves be subject to initial TSR tree building (see Section 4.3).
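The first three features can be sketched directly on the Node structure from Section 3.1 (iter_nodes is reused from there). Since the paper does not pin the breadth and depth factors down to formulas, the concrete measures below (average branching over inner nodes, maximum node depth) are our assumptions.

```python
def tree_weight(root):
    """Feature 1: the overall sum attached to the root by normalize()."""
    return root.weight

def breadth_factor(root):
    """Feature 2 ("generality"): average branching over all inner nodes,
    one plausible reading of the tree's breadth factor."""
    inner = [n for n in iter_nodes(root) if n.children]
    return (sum(len(n.children) for n in inner) / len(inner)) if inner else 0.0

def depth_factor(node, depth=0):
    """Feature 3 ("domain dependency"): taken here as the maximum node
    depth, again one assumption among several plausible ones."""
    if not node.children:
        return depth
    return max(depth_factor(child, depth + 1)
               for child in node.children.values())

def category_labels(root):
    """Feature 4: the node labels themselves, as skimming clues."""
    return [n.label for n in iter_nodes(root)][1:]  # skip the artificial root
```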
4.2 Advanced TSR Tree Features and Operations

While operations on single TSR trees provide simple text processing clues, operations on sets of trees are much more informative:

1. Difference. A number of "difference" features are available that can be used to compare individual features of a number of trees. These difference features arise from comparisons of the simple TSR tree features, hence they describe numerical differences between such values. For example, a high weight difference indicates a large difference between the respective terms' general use in language. It is important to note that the difference features are not only usable with respect to complete trees but can be applied to tree branches as well, e.g. in order to analyze tree behaviour in certain contexts.

2. Distance. The "distance" feature is computed by counting the number of "edit" operations (add node, delete node, transform node) it takes to transform one tree into another. This feature is modeled after the Levenshtein distance known from text processing (Hirschberg, 1997).

In general, this feature describes a notion of "semantic relatedness" between two input terms, i.e. a high distance value is expected between largely unrelated terms such as "air" and "bezier curve", while a low value is expected between closely related terms such as "account" and "cash".

The distance feature can be implemented by applying the set difference operation (see the sketch below): subtracting one TSR tree from another leaves a number of remaining nodes, i.e. the actual distance value.

Recent findings have shown, though, that this simple procedure is only applicable to trees with roughly the same number of nodes: obviously, the computed distance of two words can reach a high value when one word is much more common in language than the other (and is thus represented by a much larger tree). This is true even when the two words are actually synonyms of each other, or just two different lexical representations of the same word, such as "foto" and "photo". In fact, because the co-occurrence of different lexical representations of the same word in the same text is quite rare, it is very likely that a high distance will show in these situations.

It can be reasoned that these difficulties will prominently lead to the use of TSR trees derived from term descriptions rather than initial trees (see Section 3.2).
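A sketch of this set-difference implementation, under the assumption that nodes are identified by their labeled root-to-node paths and, following the evaluation results in Section 5, compared by label only rather than by weight. It reuses the Node structure from Section 3.1 and is only meaningful for trees of roughly the same size, as discussed above.

```python
def path_set(node, prefix=()):
    """Every labeled root-to-node path in a TSR tree, as a set of tuples."""
    paths = set()
    for label, child in node.children.items():
        paths.add(prefix + (label,))
        paths |= path_set(child, prefix + (label,))
    return paths

def distance(tree_a, tree_b):
    """Count the nodes remaining after subtracting the trees from one
    another, i.e. the symmetric set difference of their path sets."""
    a, b = path_set(tree_a), path_set(tree_b)
    return len(a ^ b)  # nodes (identified by their paths) in one tree only
```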
4.3 TSR Tree Translation and Corpus Related Features

In some cases, a need for "higher-level" operations arises, e.g. when two agents that use different web taxonomies cooperate. Our approach is able to deal with these situations through the translation of TSR trees via their category labels (this can be interpreted as a simple classification task).

Sometimes, information about the TSR tree features of a corpus as a whole is important. In these cases, the individual TSR trees of all items that constitute the respective corpus are merged into one "corpus" TSR tree, as sketched below. Afterwards, the corpus tree can be analyzed using the features described in Section 4.1 and Section 4.2.
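With the derivation operation from Section 3.2, the corpus tree reduces to a fold over the item trees. The sketch below reuses merge_trees, normalize and breadth_factor from the earlier sketches; the corpus variables in the usage comment are hypothetical.

```python
from functools import reduce

def corpus_tree(trees):
    """Merge the TSR trees of all corpus items into a single "corpus" tree,
    re-normalizing so its features remain comparable across corpora."""
    return normalize(reduce(merge_trees, trees))

# Hypothetical usage: compare the generality of two corpora.
# med, basic = corpus_tree(med_trees), corpus_tree(basic_trees)
# print(breadth_factor(med), breadth_factor(basic))
```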
5 Preliminary Evaluation Results

For testing, we set up some preliminary experiments (exhaustive evaluation is the goal of our current research efforts): we built a prototype system based on a Tomcat application server (cf. http://jakarta.apache.org) that was able to generate TSR trees for lists of input terms and store these trees, along with their width, weight and depth features, in an SQL database. From this database we extracted the data used in the evaluation process. We applied each feature explained in Section 4 to a set of words taken from four corpora. These corpora were constructed as follows:

The Basic corpus: the first 100 terms from a dictionary of "basic English words", such as account, brother, crush, building, cry, etc.

The Med corpus: the first 100 terms from a specialized medical dictionary.

The Comp corpus: the first 100 terms from a specialized dictionary of computer science.

The Top corpus: the 100 terms that were ranked as "top 100" by the Wortschatz engine (Quasthoff, 2000).

We expected terms of the Basic and Top corpora to show high weight and breadth values and low depth values. We also expected terms from the Med and Comp corpora to be of high depth but differing in weight and breadth. These expectations were supported by our results from generating and comparing the respective corpus TSR trees (see below). For brevity, we only present a summary of our findings here.

Single Tree Features. Comparing the outcome of applying single TSR tree features to the four corpora showed some interesting results:

1. Tree Weight. Terms from the Med corpus are often not represented within the web directory, which means that a TSR tree cannot be built for these terms. In general, terms from the Med corpus have a very low tree weight value (in most cases < 10). Strangely, some words such as "by", "and", "because" etc. from the Top corpus also have low ratings. Examining the actual web directory pages showed that these terms seldom contributed to a web page's semantic context and thus were seldom represented in the web directory. It appears that all input terms were interpreted by the ODP search engine as being semantically relevant; e.g. the word "about" only generated hits in categories about movie titles, such as Arts:Movies:Titles:A:About a Boy, Arts:Movies:Titles:A:About Adam, etc. This strongly indicates that the input to the algorithm should be a noun or a common noun phrase.

Terms from the Basic corpus and the Comp corpus are rated comparably high; e.g. some common words from the Basic corpus such as "air", "animal", etc. were assigned very high weight values (weight > 100).

2. Generality. The listing of generality values shows that indeed mostly general terms are identified by this feature. Surprisingly, some terms such as "software" and "design" were also attributed high generality. Further investigation shows that "generality" is a context-dependent feature; e.g. the term "software" is very general for the computer domain. Only at the first tree level can a domain-independent generality factor be attributed to this feature. We also found that pruning has its greatest effect on this feature; this leads to the conclusion that the generality feature should be applied to TSR trees that have not been pruned according to some threshold.

3. Domain Dependency. Except for a very few cases, all top-rated terms are in the Comp or the Med corpus, i.e. the two specialized corpora. These terms are apparently more specific in context than the lower-rated terms.

Advanced Tree Features. Even though we tested the multi-tree features on only a few test cases (about 30), we are confident that future evaluation will confirm our preliminary results.

1. Difference. Computing the difference of two or three single TSR trees turned out to be less informative than the distance value between these trees, but a small number of experiments led us to the conclusion that TSR trees of large text fragments can be compared by difference features with a conclusive outcome.

2. Distance. Using node labels and weights for comparison resulted in a 100% distance in every case. This effect derives from the fact that even though some trees were similar in structure, their respective weights differed in every case. The distance feature therefore has to be applied to node labels only, or has to introduce arithmetic means of adjusting weights. After correcting the distance algorithm accordingly, it worked as expected on trees with about the same node number (high distance between e.g. "blood" and "air", low distance between "account" and "credit"). We also achieved reasonable results on trees differing in node number by applying a methodology of filtering homonymous aspects of the respective larger TSR tree (i.e. by using the node number of the smaller tree as an upper bound and filtering first-level tree nodes). Nonetheless, we have not yet managed to find an absolute numerical expression that describes the distance feature appropriately.

TSR Tree Translation and Corpus Features.

1. Corpus Tree Features. We merged all of the terms of each respective corpus in order to generate a "corpus representation tree". These corpus representations can be used to demonstrate certain properties of the chosen corpora. Our experiments show that terms from the "general" corpora (Basic and Top) had a higher generality value than terms from the more specialized corpora (Comp and Med). The same results also confirm our hypothesis about the WWW occurrence property of the computer corpus, since it is well represented in the web directory.

6 Conclusions

In this paper, we have introduced a novel concept of representing text senses as defined in Section 1. According to the results of our preliminary evaluation, our approach shows the following advantages:

- TSR trees can be used to unambiguously represent text senses. There is a fundamental semantic relationship between TSR trees and their respective input terms.

- Their use is efficient: once retrieved and computed, TSR trees can be re-used without further modification. In that sense they can be used "stand-alone". Application of TSR tree features is very fast (one SQL SELECT statement within our prototype system).

- Meaning representation within TSR trees is robust: generating the tree of a text fragment by merging the TSR trees of its constituents reduces potential errors.

- TSR trees closely reflect the semantic context of the input terms; it is therefore possible to determine topical relationships between textual fragments (e.g. sentences, paragraphs or static-size text windows).

Nonetheless, our findings also exhibit some weaknesses and dependencies:

- If an input term cannot be found within the web directory in use, a corresponding initial TSR tree cannot be built. This is a big problem for languages that are not well represented in the web directory (there is a strong bias towards the English language). Very specialized domains (e.g. medical topics) are also underrepresented in the web directory and hence problematic for the same reason. Observations show that there is a strong bias towards computer- and business-related topics.
One approach to solving these problems would be to use derived TSR trees in place of directly acquired TSR trees. It is still a matter of current research to what degree initial TSR trees should be substituted by derived TSR trees.

- TSR tree usage usually depends on the output quality of a number of preprocessing steps, e.g. language identification, noun phrase identification, morphological analysis, etc.