<?xml version="1.0" standalone="yes"?> <Paper uid="P04-3005"> <Title>Customizing Parallel Corpora at the Document Level</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> Corpora 3 Selecting Documents from Parallel Corpora </SectionTitle> <Paragraph position="0"> While selecting and weighing entire training corpora is a problem already explored by (Rogati and Yang, 2004), in this paper we focus on a lower granularity level: individual documents in the parallel corpora. We seek to construct a custom parallel corpus, by choosing individual documents which best match the testing collection. We compute the similarity between the test collection (in German or English) and each individual document in the parallel corpora for that respective language. We have a choice of similarity metrics, but since this computation is simply retrieval with a long query, we start with the Okapi model (Robertson, 1993), as implemented by the Lemur system (Olgivie and Callan, 2001). Although the Okapi model takes into account average document length, we compare it with its length-normalized version, measuring per-word similarity. The two measures are identified in the results section by &quot;Okapi&quot; and &quot;Normalized&quot;.</Paragraph> <Paragraph position="1"> Once the similarity is computed for each document in the parallel corpora, only the top N most similar documents are kept for training. They are an approximation of the domain(s) of the test collection. Selecting N has not been an issue for this corpus (values between 10-75% were safe).</Paragraph> <Paragraph position="2"> However, more generally, this parameter can be tuned to a different test corpus as any other parameter. Alternatively, the document score can also be incorporated into the translation model, eliminating the need for thresholding.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 CLIR Method </SectionTitle> <Paragraph position="0"> We used a corpus-based approach, similar to that in (Rogati and Yang, 2003). Let L1 be the source language and L2 be the target language. The cross-lingual retrieval consists of the following steps: 1. Expanding a query in L1 using blind feedback 2. Translating the query by taking the dot product between the query vector (with weights from step 1) and a translation matrix obtained by calculating translation probabilities or term-term similarity using the parallel corpus.</Paragraph> <Paragraph position="1"> 3. Expanding the query in L2 using blind feedback 4. Retrieving documents in L2 Here, blind feedback is the process of retrieving documents and adding the terms of the top-ranking documents to the query for expansion. We used simplified Rocchio positive feedback as implemented by Lemur (Olgivie and Callan, 2001). For the results in this paper, we have used Pointwise Mutual Information (PMI) instead of IBM Model 1 (Brown et al., 1993), since (Rogati and Yang, 2004) found it to be as effective on Springer, but faster to compute.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Results and Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Empirical Settings </SectionTitle> <Paragraph position="0"> For the retrieval part of our system, we adapted Lemur (Ogilvie and Callan, 2001) to allow the use of weighted queries. 
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 CLIR Method </SectionTitle>
<Paragraph position="0"> We used a corpus-based approach, similar to that of Rogati and Yang (2003). Let L1 be the source language and L2 the target language. Cross-lingual retrieval consists of the following steps:
1. Expanding the query in L1 using blind feedback.
2. Translating the query by taking the dot product between the query vector (with weights from step 1) and a translation matrix, obtained by calculating translation probabilities or term-term similarities from the parallel corpus.
3. Expanding the query in L2 using blind feedback.
4. Retrieving documents in L2.</Paragraph>
<Paragraph position="1"> Here, blind feedback is the process of retrieving documents and adding the terms of the top-ranking documents to the query for expansion. We used simplified Rocchio positive feedback as implemented by Lemur (Ogilvie and Callan, 2001). For the results in this paper, we used Pointwise Mutual Information (PMI) instead of IBM Model 1 (Brown et al., 1993), since Rogati and Yang (2004) found it to be equally effective on Springer but faster to compute.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Results and Discussion </SectionTitle>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Empirical Settings </SectionTitle>
<Paragraph position="0"> For the retrieval part of our system, we adapted Lemur (Ogilvie and Callan, 2001) to allow the use of weighted queries. Several parameters were tuned, none of them on the test set. In our corpus-based approach, the main parameters are those used in query expansion based on pseudo-relevance feedback: the maximum number of documents and the maximum number of words to be used, and the relative weight of the expanded portion with respect to the initial query. Since the Springer training set is fairly small, setting aside a subset of the data for parameter tuning was not desirable.</Paragraph>
<Paragraph position="1"> We instead chose parameter values that were stable on the CLEF collection (Peters, 2003): 5 and 20 as the maximum numbers of documents and words, respectively. The relative weight of the expanded portion with respect to the initial query was set to 0.5. The results were evaluated using mean average precision (AvgP), a standard performance measure for IR evaluations.</Paragraph>
<Paragraph position="2"> In the following sections, DE-EN refers to retrieval where the query is in German and the documents in English, while EN-DE refers to retrieval in the opposite direction.</Paragraph> </Section>
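A minimal sketch of the four-step pipeline from Section 4 under these settings. It assumes a sparse translation matrix of the form {L1 term: {L2 term: PMI weight}} and an abstract retrieve(query, collection) function standing in for Lemur; the defaults mirror the feedback settings quoted above (5 documents, 20 terms, relative weight 0.5), and all names are illustrative:

```python
from collections import Counter

def rocchio_expand(query_vec, retrieve, collection,
                   max_docs=5, max_terms=20, fb_weight=0.5):
    """Simplified Rocchio positive feedback: add the most frequent terms
    of the top-ranked documents, down-weighted by fb_weight."""
    top_ids = retrieve(query_vec, collection)[:max_docs]
    fb_terms = Counter()
    for doc_id in top_ids:
        fb_terms.update(collection[doc_id])
    expanded = dict(query_vec)
    for term, freq in fb_terms.most_common(max_terms):
        expanded[term] = expanded.get(term, 0.0) + fb_weight * freq
    return expanded

def translate_query(query_vec, translation_matrix):
    """Step 2: dot product of the weighted L1 query with a sparse
    term-translation matrix (e.g., PMI scores)."""
    l2_query = {}
    for l1_term, weight in query_vec.items():
        for l2_term, score in translation_matrix.get(l1_term, {}).items():
            l2_query[l2_term] = l2_query.get(l2_term, 0.0) + weight * score
    return l2_query

def clir_retrieve(query_l1_tokens, translation_matrix, retrieve,
                  collection_l1, collection_l2):
    query = {t: float(c) for t, c in Counter(query_l1_tokens).items()}
    query = rocchio_expand(query, retrieve, collection_l1)   # step 1
    query = translate_query(query, translation_matrix)       # step 2
    query = rocchio_expand(query, retrieve, collection_l2)   # step 3
    return retrieve(query, collection_l2)                    # step 4
```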
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Using the Parallel Corpora Separately </SectionTitle>
<Paragraph position="0"> Can we simply choose a parallel corpus that performed very well on news stories, hoping that it is robust across domains? Natural approaches also include choosing the largest corpus available, or using all corpora together. Figure 1 shows the effect of these strategies.</Paragraph>
<Paragraph position="1"> Figure 1. CLIR results on the Springer test set using PMI with different training corpora (SPRINGER, MEDTITLE, WAC, NEWS, EUROPARL, and ALL).</Paragraph>
<Paragraph position="2"> We notice that choosing the largest collection (EUROPARL), using all available resources without weights (ALL), and even choosing a large collection in the medical domain (MEDTITLE) are all sub-optimal strategies.</Paragraph>
<Paragraph position="3"> Given these results, we believe that resource selection and weighting is necessary. Thoroughly exploring weighting strategies is beyond the scope of this paper; it would involve collection size, genre, and translation quality, in addition to a measure of domain match. Here, we start by selecting individual documents that match the domain of the test collection, and we examine the effect this choice has on domain-specific CLIR.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Using Okapi Weights to Build a Custom Parallel Corpus </SectionTitle>
<Paragraph position="0"> Figures 2 and 3 compare the two document selection strategies discussed in Section 3 to using all available documents, and to the ideal (but not truly optimal) situation in which a &quot;best&quot; resource exists and is known in advance. By &quot;best&quot;, we mean a resource that produces optimal results on the test corpus with respect to the given metric. In reality, the true &quot;best&quot; resource is unknown: as seen above, many intuitive choices for the best collection are not optimal.</Paragraph>
<Paragraph position="1"> Notice that the normalized version performs better and is more stable. Per-word similarity is important in this case because the documents are used to train translation scores: shorter parallel documents are better when building the translation matrix.</Paragraph>
<Paragraph position="2"> Our strategy accounts for a 4-7% improvement over using all resources with no weights, for both retrieval directions. It is also very close to the &quot;oracle&quot; condition, which chooses the best collection in advance. More importantly, this strategy avoids the sharp performance drop caused by using a mismatched, although otherwise very good, resource (such as EUROPARL).</Paragraph> </Section> </Section>
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Future Work </SectionTitle>
<Paragraph position="0"> We are currently exploring weighting strategies that involve collection size, genre, and estimates of translation quality, in addition to a measure of domain match. Another question we are examining is the level of granularity used when selecting resources, such as selection at the document or cluster level.</Paragraph>
<Paragraph position="1"> Similarity and overlap between the resources themselves are also worth considering when exploring the tradeoff between redundancy and noise. We are also interested in how these approaches would apply to other domains.</Paragraph> </Section> </Paper>