<?xml version="1.0" standalone="yes"?> <Paper uid="P04-2003"> <Title>Searching for Topics in a Large Collection of Texts</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Concept-formative clusters </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Graph of a text collection </SectionTitle> <Paragraph position="0"> Let $D = \{d_1, d_2, \ldots, d_N\}$ be a collection of text documents; $N$ is the size of the collection. Now suppose that we have a function $sim(d_i, d_j) = sim(d_j, d_i) \in [0, 1]$, which gives a degree of document similarity for each pair of documents.</Paragraph> <Paragraph position="1"> Then we represent the collection as a graph.</Paragraph> <Paragraph position="2"> Definition: A labeled graph $G$ is called a graph of collection $D$ if $G = (D, E)$, where $E = \{\{d_i, d_j\} : d_i \neq d_j,\ sim(d_i, d_j) \geq w_0\}$, and each edge $e = \{d_i, d_j\} \in E$ is labeled by the number $w(e) = sim(d_i, d_j)$, called the weight of $e$; $w_0 > 0$ is a given document similarity threshold (i.e. a threshold weight of an edge).</Paragraph> <Paragraph position="3"> Now we introduce some terminology and necessary notation. Let $G = (D, E)$ be a graph of collection $D$. Each subset $X \subseteq D$ is called a cut of $G$; $\bar{X}$ stands for the complement $D \setminus X$. For disjoint cuts $X, Y \subseteq D$, $w(X, Y)$ denotes the total weight of the edges between $X$ and $Y$, and $w(X)$ denotes the total weight of the edges inside $X$; $\hat{w}(X, \bar{X})$ stands for the expected weight of the connection between cut $X$ and the rest of the collection, and $\hat{w}(X)$ analogously for the expected weight inside $X$. Each cut $X$ naturally splits the collection into three disjoint subsets.</Paragraph>
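<Paragraph position="4"> As an illustration of this construction, the following sketch (Python with NumPy; a minimal example under our own naming, not the code used in the experiments) builds the thresholded similarity graph from document vectors using cosine similarity, as in Section 3.2, and computes the cut weights $w(X)$ and $w(X, \bar{X})$ introduced above.

import numpy as np

def build_collection_graph(doc_vectors, w0):
    """Graph of a collection: nodes are document indices; an edge connects
    every pair whose cosine similarity is at least w0 and is labeled by
    that similarity (its weight)."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.maximum(norms, 1e-12)   # rows of unit length
    sim = unit @ unit.T                             # cosine similarities
    n = len(doc_vectors)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= w0:
                edges[(i, j)] = float(sim[i, j])
    return edges

def cut_weight(edges, X):
    """w(X): total weight of edges with both endpoints in the cut X."""
    return sum(w for (i, j), w in edges.items() if i in X and j in X)

def boundary_weight(edges, X):
    """w(X, complement of X): total weight of edges leaving the cut X."""
    return sum(w for (i, j), w in edges.items() if (i in X) != (j in X))
</Paragraph>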
</Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Quality of cuts </SectionTitle> <Paragraph position="0"> Now we formalize the property of &quot;being concept-formative&quot; by a positive real function called the quality of a cut. A high value of quality means that a cut must be specific and extensive. A cut $X$ is called specific if (i) the weight $w(X)$ is relatively high and (ii) the connection $w(X, \bar{X})$ between $X$ and the rest of the collection is relatively small. The first property is called compactness of cut and is defined as $cmp(X) = w(X) / \hat{w}(X)$, while the other is called exhaustivity of cut and is defined analogously from $w(X, \bar{X})$ and its expected value; both functions are positive.</Paragraph> <Paragraph position="1"> Thus, the specificity of cut $X$ can be formalized as a combination of compactness and exhaustivity, weighted by positive parameters $\lambda_1$ and $\lambda_2$ that balance the two factors -- the greater this value, the more specific the cut $X$.</Paragraph> <Paragraph position="2"> The extensity of cut $X$ is defined as a positive, increasing function $ext(X)$ of the cut size $|X|$, parameterized by a threshold size of cut.</Paragraph> <Paragraph position="3"> Definition: The total quality of cut $q(X)$ is a positive real function composed of all the factors mentioned above; the three $\lambda$ parameters serve to balance the three factors.</Paragraph> <Paragraph position="4"> To be concept-formative, a cut (i) must have a sufficiently high quality and (ii) must be locally optimal.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Local optimization of cuts </SectionTitle> <Paragraph position="0"> A cut $X \subseteq D$ is called locally optimal with regard to the quality function $q$ if no cut $X' \subseteq D$ that is only a small modification of the original $X$ has greater quality, i.e. $q(X') \leq q(X)$. Now we describe a local search procedure whose purpose is to optimize any input cut $X$; if $X$ is not locally optimal, the output of the Local Search procedure is a locally optimal cut $X^*$ which results from the original $X$ as its local modification. First we need the following definition: Definition: The potential of a document $d \in D$ with respect to a cut $X \subseteq D$ is a real function $\phi(d, X)$.</Paragraph> <Paragraph position="1"> The Local Search algorithm (Figure 1) takes as input the graph of the text collection $G$ and an initial cut $X^{(0)} \subseteq D$, and outputs a locally optimal cut $X^*$. Starting from $t = 0$, it repeatedly selects the document $d$ with maximal potential $\phi(d, X^{(t)})$; if $\phi(d, X^{(t)}) \leq 0$ the loop stops, otherwise $d$ is added to or removed from the current cut and $t$ is incremented (a sketch of this procedure is given below). The algorithm has the following properties: 1. Local Search gradually generates a sequence of cuts $X^{(0)}, X^{(1)}, \ldots$ such that (i) $q(X^{(t-1)}) < q(X^{(t)})$ for $t \geq 1$, and (ii) cut $X^{(t)}$ always arises from $X^{(t-1)}$ by adding one document to it or taking one document away from it; 2. since the quality of modified cuts cannot increase infinitely, a finite $T \geq 0$ necessarily exists such that $X^{(T)}$ is locally optimal, and consequently the program stops after at most $T$ iterations; 3. each output cut $X^*$ is locally optimal.</Paragraph>
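<Paragraph position="2"> For concreteness, a minimal Python sketch of the Local Search scheme follows. It treats the quality function $q$ from Section 2.2 as a black box and, since the exact definition of the potential is not repeated here, approximates the potential of a document by the quality gain obtained by toggling its membership; this is an illustrative simplification, not the exact formula, and all names are ours.

def local_search(docs, quality, X0):
    """Greedy local optimization of a cut: repeatedly add or remove the
    single document whose toggle increases the quality q the most, and
    stop as soon as no toggle yields an improvement."""
    X = set(X0)
    while True:
        base = quality(X)
        best_gain, best_doc = 0.0, None
        for d in docs:
            candidate = X ^ {d}          # add d if absent, remove it if present
            gain = quality(candidate) - base
            if gain > best_gain:
                best_gain, best_doc = gain, d
        if best_doc is None:             # locally optimal: no single change helps
            return X
        X ^= {best_doc}
</Paragraph>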
<Paragraph position="3"> Now we are ready to define concept-formative clusters precisely. Definition: A cut $X \subseteq D$ is called a concept-formative cluster if (i) $q(X) > q_0$, where $q_0$ is a threshold quality, and (ii) $X = X^*$, where $X^*$ is the output of the Local Search algorithm.</Paragraph> <Paragraph position="4"> The whole procedure for finding concept-formative clusters consists of two basic stages: first, a set of initial cuts is found within the whole collection, and then each of them is used as a seed for the Local Search algorithm, which locally optimizes the quality function $q$.</Paragraph> <Paragraph position="5"> Note that $\lambda_1, \lambda_2, \lambda_3$ are crucial parameters, which strongly affect the whole process of searching and consequently also the character of the resulting concept-formative clusters. We have optimized their values by a sort of machine learning, using a small manually annotated collection of texts. When the optimized $\lambda$-parameters are used, the Local Search procedure tries to simulate the behavior of a human annotator who finds topically coherent clusters in a training collection. The task of $\lambda$-optimization leads to a system of linear inequalities, which we solve via linear programming. For lack of space we cannot go into details here.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Virtual concepts </SectionTitle> <Paragraph position="0"> In this section we first show that concept-formative clusters can be viewed as fuzzy sets. In this sense, each concept-formative cluster can be characterized by a membership function. Fuzzy clustering allows for some ambiguity in the data, and its main advantage over hard clustering is that it yields much more detailed information on the structure of the data (cf. (Kaufman and Rousseeuw, 1990), chapter 4).</Paragraph> <Paragraph position="1"> Then we define virtual concepts as linear functions which estimate the degree of membership of documents in concept-formative clusters. Since virtual concepts are weighted mixtures of words represented as vectors, they can also be seen as virtual documents representing specific topics that emerge in the analyzed collection.</Paragraph> <Paragraph position="2"> Now we formalize the notion of virtual concepts. Let $v_1, v_2, \ldots, v_N \in R^m$ be vector representations of documents $d_1, d_2, \ldots, d_N$, where $m$ is the number of indexed terms. A vector $c \in R^m$ such that the linear function $v_i \cdot c$ estimates the degree of membership $\phi(d_i, X)$ is then called a virtual concept corresponding to the concept-formative cluster $X$. The task of finding virtual concepts can be solved using the Greedy Regression Algorithm (GRA), originally suggested by Semecký (2003).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Greedy Regression Algorithm </SectionTitle> <Paragraph position="0"> The GRA is directly based on multiple linear regression (see e.g. (Rice, 1994)). The GRA works in iterations and gradually increases the number of non-zero elements in the resulting vector, i.e. the number of words with non-zero weight in the resulting mixture, so this number can be explicitly restricted by a parameter. This feature of the GRA has been designed for the sake of generalization, in order not to overfit the input sample.</Paragraph> <Paragraph position="1"> The input of the GRA consists of (i) a sample set of document vectors with the corresponding values of $\phi(d, X)$, (ii) a maximum number of non-zero elements, and (iii) an error threshold.</Paragraph> <Paragraph position="2"> The GRA, which is described in Fig. 2, requires a procedure for solving multiple linear regression (MLR) with a limited number of non-zero elements in the resulting vector. Formally, the MLR is given a set $A \subseteq \{1, \ldots, m\}$ of the elements which are allowed to be non-zero in the output vector; the output of the MLR is a coefficient vector that fits the sample and fulfills $\beta_i = 0$ for any $i \in \{1, \ldots, m\} \setminus A$.</Paragraph>
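<Paragraph position="3"> The following compact Python sketch is consistent with the description above, but it is only an illustration: it replaces the JAMA-based MLR routine with an ordinary least-squares call from NumPy, and the names (V for the matrix of sample document vectors, y for the corresponding values of $\phi(d, X)$) are ours.

import numpy as np

def restricted_mlr(V, y, support):
    """MLR restricted to a set of allowed indices: fit y ~ V[:, support]
    and return a full-length coefficient vector that is zero elsewhere."""
    beta = np.zeros(V.shape[1])
    coef, *_ = np.linalg.lstsq(V[:, support], y, rcond=None)
    beta[support] = coef
    return beta

def greedy_regression(V, y, max_nonzero, err_threshold):
    """Greedy Regression Algorithm (sketch): grow the set of words allowed
    a non-zero weight one at a time, each time keeping the word that gives
    the smallest quadratic residual error of the restricted regression."""
    support, beta = [], np.zeros(V.shape[1])
    for _ in range(max_nonzero):                   # outer loop: at most k iterations
        best_err, best_j, best_beta = None, None, None
        for j in range(V.shape[1]):                # inner loop: nearly m candidates
            if j in support:
                continue
            b = restricted_mlr(V, y, support + [j])
            err = float(np.sum((V @ b - y) ** 2))
            if best_err is None or err < best_err:
                best_err, best_j, best_beta = err, j, b
        if best_j is None:                         # no candidate terms left
            break
        support.append(best_j)
        beta = best_beta
        if best_err <= err_threshold:              # error threshold reached
            break
    return beta
</Paragraph>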
<Paragraph position="4"> Implementation and time complexity. For solving multiple linear regression we use the public-domain Java package JAMA (2004), developed by the MathWorks and NIST. The computation of the inverse matrix is based on the LU decomposition, which makes it faster (Press et al., 1992).</Paragraph> <Paragraph position="5"> As for the asymptotic time complexity of the GRA, it is in $O(k \cdot m \cdot C_{MLR})$, where $C_{MLR}$ is the complexity of the MLR, since the outer loop runs at most $k$ times and the inner loop always runs nearly $m$ times. The MLR substantially consists of matrix multiplications in dimension $s \times k$ and a matrix inversion in dimension $k \times k$, which dominate its cost.</Paragraph>
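<Paragraph position="6"> Completing this estimate under the standard cost model for dense linear algebra (an assumption on our part), the two operations give $C_{MLR} \in O(s k^2 + k^3)$, and therefore the GRA runs in $O\bigl(k \cdot m \cdot (s k^2 + k^3)\bigr) = O(m s k^3 + m k^4)$, where $s$ is the size of the sample set, $k$ is the maximum number of non-zero elements and $m$ is the number of indexed terms.</Paragraph>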
<Paragraph position="7"> To reduce this high computational complexity, we make a term pre-selection using a heuristic method based on linear programming. Then the GRA does not need to deal with high-dimensional vectors in $R^m$, but works with vectors in dimension $m' \ll m$. Although the acceleration is only linear, the required time has been reduced more than ten times, which is practically significant.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Experiments </SectionTitle> <Paragraph position="0"> The experiments reported here were done on a small experimental collection of Czech documents. The texts were articles from two different newspapers and one journal. Each document was morphologically analyzed and lemmatized (Hajič, 2000) and then indexed and represented as a vector. We indexed only lemmas of nouns, adjectives, verbs, adverbs and numerals whose document frequency was greater than 10 and below an upper bound; the indexed lemmas are the $m$ dimensions of the vector space. Cosine similarity was used to compute the document similarity, with a fixed threshold $w_0$; the document pairs above this threshold form the edges of the graph of the collection.</Paragraph> <Paragraph position="1"> We computed a set of concept-formative clusters and then approximated the corresponding membership functions by virtual concepts.</Paragraph> <Paragraph position="2"> The first thing we observed was that the quadratic residual error systematically and progressively decreases in each GRA iteration. Moreover, the words in virtual concepts are clearly intelligible for humans and strongly suggest the topic. An example is given in Table 1 (words in the concept and their weights, corresponding to cluster #318).</Paragraph> <Paragraph position="3"> Another example is cluster #19, focused on &quot;pension funds&quot;, which was approximated by a virtual concept; the signs after the words indicate their positive or negative weights in the concept. Figure 3 shows the approximation of the membership function corresponding to cluster #19 by a virtual concept with a fixed number $k$ of words in the concept.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Discussion 4.1 Related work </SectionTitle> <Paragraph position="0"> A similar approach to searching for topics and employing them for document retrieval has recently been suggested by Xu and Croft (2000), who, however, try to employ the topics in the area of distributed retrieval.</Paragraph> <Paragraph position="1"> They use document clustering, treat each cluster as a topic, and define topics as probability distributions of words. They use the Kullback-Leibler divergence with some modification as a distance metric to determine the closeness of a document to a cluster. Although our virtual concepts cannot be interpreted as probability distributions, in this respect both approaches are quite similar.</Paragraph> <Paragraph position="2"> The substantial difference is in the clustering method used. Xu and Croft have chosen the K-Means algorithm, &quot;for its efficiency&quot;. In contrast to this hard clustering algorithm, (i) our method is consistently based on empirical analysis of a text collection and does not require an a priori given number of topics; (ii) in order to induce permeable topics, our concept-formative clusters are not disjoint; (iii) the specificity of our clusters is driven by training samples given by human annotators.</Paragraph> <Paragraph position="3"> Xu and Croft suggest that retrieval based on topics may be more robust in comparison with the classic vector technique: document ranking against a query is based on statistical correlation between query words and words in a document. Since a document is a small sample of text, the statistics in a document are often too sparse to reliably predict how likely the document is to be relevant to a query. In contrast, we have much more text for a topic and the statistics are more stable. By excluding clearly unrelated topics, we can avoid retrieving many of the non-relevant documents.</Paragraph>
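<Paragraph position="4"> To make the comparison concrete, the following short sketch (Python; an illustration of Xu and Croft's distance, not of our method, and without their modification) measures the closeness of a document to a cluster treated as a topic via the Kullback-Leibler divergence between their word distributions.

import numpy as np

def kl_divergence(doc_dist, topic_dist, eps=1e-9):
    """KL(document || topic) between two word distributions over the same
    vocabulary; eps smoothing avoids zero probabilities."""
    p = np.asarray(doc_dist, dtype=float) + eps
    q = np.asarray(topic_dist, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
</Paragraph>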
</Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4.2 Future work </SectionTitle> <Paragraph position="0"> As our work is still in progress, there are some open questions which we will concentrate on in the near future. The three main issues are (i) evaluation, (ii) parameter setting (which is closely connected to the previous one), and (iii) an effective implementation of the crucial algorithms (the current implementation is still experimental).</Paragraph> <Paragraph position="1"> As for the evaluation, we are building a manually annotated test collection with which we want to test the capability of our model to estimate inter-document similarity, in comparison with the classic vector model and the LSI model. So far, we have been working with a Czech collection because we also test the impact of morphology and some other NLP methods developed for Czech. The next step will be evaluation on the English TREC collections, which will enable us to rigorously assess whether our model really helps to improve IR tasks.</Paragraph> <Paragraph position="2"> The evaluation will also give us criteria for parameter setting. We expect that a positive value of the threshold parameter will significantly accelerate the computation without loss of quality, but finding the right value must be based on the evaluation. As for the most important parameters of the GRA (i.e. the size of the sample set $s$ and the number of words in a concept $k$), these should be set so that the resulting concept is a good membership estimator also for documents not included in the sample set.</Paragraph> </Section> </Paper>