<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0319">
  <Title>An LSA Implementation Against Parallel Texts in French and English</Title>
  <Section position="4" start_page="1" end_page="2" type="metho">
    <SectionTitle>
3. LSA
</SectionTitle>
    <Paragraph position="0"> The LSA methodology begins with the term-by-document matrix, an n x m matrix in which each value is the frequency of the nth word in the mth document. A weighting procedure is then applied that weights each of the term frequencies (TF) by the inverse document frequency (IDF) (Salton, G. et al 1968). Singular value decomposition (SVD) is then performed on the weighted matrix. SVD factors any n x m matrix M into a set of three matrices, such that M = USV^T,</Paragraph>
    <Paragraph position="2"> where the columns of U are the left singular vectors, S is a diagonal matrix of singular values, and the columns of V are the right singular vectors (V^T denotes the transpose of V).</Paragraph>
    <Paragraph position="3"> While the SVD solution of any given matrix can re-create the original matrix exactly, its primary value lies in its capacity to infer the pattern of relationships and associations among the words in the documents when all the linguistic data in the corpus is represented on a smaller number of dimensions than that of the original matrix (Landauer et al 1998).</Paragraph>
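    <Paragraph> The steps described in this section can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes NumPy, a hypothetical toy count matrix, and the plain IDF ratio (total documents divided by documents containing the term); the names counts, weighted and k are illustrative only.

```python
import numpy as np

# Hypothetical toy counts: rows are terms, columns are documents.
counts = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 2.0],
])

n_docs = counts.shape[1]

# Document frequency: the number of documents in which each term occurs.
doc_freq = np.count_nonzero(counts, axis=1)

# IDF as a plain ratio (assumption): total documents / documents with the term.
idf = n_docs / doc_freq

# Weight every term frequency by the IDF of its term (row-wise scaling).
weighted = counts * idf[:, np.newaxis]

# Thin SVD, so that weighted = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)

# Using every singular value re-creates the weighted matrix (up to rounding).
full = U @ np.diag(s) @ Vt

# Keeping only the k largest singular values gives the reduced representation.
k = 2
reduced = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

    Here full matches weighted to within floating-point rounding, while reduced has the same shape but rank k; the choice k = 2 is arbitrary for this toy matrix.</Paragraph>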
  </Section>
  <Section position="5" start_page="2" end_page="3" type="metho">
    <SectionTitle>
4. Interpretation
</SectionTitle>
    <Paragraph position="0"> Although it is useful to think of the &quot;dimensions&quot; as representing the document vectors, it should be emphasized that a &quot;dimension&quot; is a more abstract notion, related to the contribution that a given document vector makes in explaining the relationships and associations of the linguistic data contained in the corpus. This section discusses an example of how the SVD procedure depicts the word and document relationships on the different dimensions.</Paragraph>
    <Paragraph position="1"> In Figure 1, the location of the documents is shown on Dimensions 1 and 2. Notice that on Dimension 2 (the vertical axis), all of the English documents (En) lie above zero and all of the French documents (Fn) lie below. Moreover, the corresponding French and English document pairs are almost perfectly aligned across the horizontal axis. As an abstract notion, Dimension 2 clearly represents the two languages.</Paragraph>
    <Paragraph position="2"> On Dimension 1, the documents occur from left to right in order from the smallest document to the largest, making this dimension representative of document size. No attention was paid to ordering the documents according to size when the data was input into the model. The LSA model identifies this relationship automatically and represents it, as shown in Figure 1.</Paragraph>
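    <Paragraph> A minimal sketch of how document coordinates on two dimensions can be read off an SVD, assuming NumPy and a hypothetical four-document toy corpus (not the corpus used in this paper). Raw counts are used without IDF weighting to keep the example short, and the names doc_labels and coords are illustrative.

```python
import numpy as np

# Hypothetical toy corpus: two English documents (E1, E2) and their French
# counterparts (F1, F2); E2 and F2 are twice as long as E1 and F1.
doc_labels = ["E1", "E2", "F1", "F2"]
counts = np.array([
    [1.0, 2.0, 1.0, 2.0],  # a token shared by both languages
    [1.0, 2.0, 0.0, 0.0],  # an English-only word
    [0.0, 0.0, 1.0, 2.0],  # a French-only word
])

U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Document coordinates on Dimensions 1 and 2: the first two right singular
# vectors, each scaled by its singular value.
coords = Vt[:2, :].T * s[:2]

for label, (dim1, dim2) in zip(doc_labels, coords):
    print(f"{label}: dim1={dim1:+.3f} dim2={dim2:+.3f}")
```

    In this toy case the magnitude on the first dimension grows with document length, each translation pair shares the same position on it, and the sign on the second dimension separates the two languages. The overall sign of each dimension is arbitrary in an SVD, so which language falls above zero may differ from run to run or library to library.</Paragraph>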
    <Paragraph position="3"> Note: The IDF of a term t is the ratio of the total number of documents to the number of documents in which t occurs; e.g., if t occurred in 6 of the 60 documents in the corpus, its IDF ratio is 60/6 (or 10). In Figure 2, where the two symbols cross, the data point represents a cognate.</Paragraph>
    <Paragraph position="4"> In the same way that the documents were split in Figure 1, with the English documents occurring above zero and the French documents below, so are the English and French words split in Figure 2. As shown, the English words are located above zero on the vertical axis and the French words are located below. Whether a cognate is represented as an English term appearing above zero or as a French term appearing below depends entirely on which set of language documents &quot;drives&quot; the association for the cognate. (The term cognate is used here to include words such as &quot;chose&quot;, which means &quot;thing&quot; in French and is the past tense of &quot;to choose&quot; in English; &quot;pays&quot;, which is &quot;country&quot; in French and the present tense of &quot;to pay&quot; in English; and so on.)</Paragraph>
    <Paragraph position="5"> Figure 2 shows that, in spite of the very high degree of symmetry between the document-pairs of the French and English texts (Figure 1), the cross-language patterns of association among the words in the documents are not completely symmetrical. If they were, there would be a corresponding &quot;x&quot; sign above the horizontal axis in exactly the same position as every &quot;+&quot; sign below the axis. From an MT or TA perspective, the greater the degree of cross-language symmetry among words in the documents, the easier the task of selecting the appropriate target term. When cross-language symmetry is low, however, the task of finding the appropriate target term is more difficult.</Paragraph>
  </Section>
  <Section position="6" start_page="3" end_page="3" type="metho">
    <SectionTitle>
5. Symmetry of Query Results
</SectionTitle>
    <Paragraph position="0"> For the most part, single-term queries accurately identified the most relevant same-language documents, and they did so in spite of nonsymmetrical, language-specific usage associations of the query terms. For example, in Table 1, the query using the English term &quot;aboriginals&quot; returned E22 as its most relevant document; however, the query using the corresponding French term &quot;autochtones&quot; returned F09 as its most relevant document.</Paragraph>
    <Paragraph position="1">  At first glance, these query results would seem to be undesirable. However, they are perfectly consistent with the language-specific usage patterns of these two terms.</Paragraph>
    <Paragraph position="2"> In French, because of number agreement between adjectives and nouns, the plural form of &quot;autochtone&quot; is used quite frequently in comparison to the plural form of its English counterpart &quot;aboriginal&quot;, where number agreement is not required. For example, the French usage of &quot;les (peuples) autochtones&quot; is often realized as &quot;the aboriginal people(s)&quot; in the corresponding English document. The impact of this non-symmetrical usage pattern of corresponding language terms is seen in the query results. While the most relevant French document, F09, contains 50 occurrences of the plural &quot;autochtones&quot;, its corresponding English document (returned as the second most relevant in Query E) contains two occurrences of the plural &quot;aboriginals&quot; and 49 occurrences of the singular &quot;aboriginal&quot;.</Paragraph>
    <Paragraph position="3"> Continuing with this example, Query E identified the English document E22 as the most relevant to the query term. E22 contains nine occurrences of the plural &quot;aboriginals&quot; and no occurrences of the singular &quot;aboriginal&quot;. Thus, the results of Queries E and F shown in Table 1 demonstrate that the LSA methodology is not only sensitive to the language-specific word distributions of cross-language word pairs, but is also capable of distinguishing those distributional variations in order to identify the most relevant documents for the language of the query term. In other words, given the dissimilarity in the distributions of the plural forms of each term in the cross-language word pair, the LSA methodology behaved appropriately for each of the queries.</Paragraph>
    <Paragraph position="4"> The order of the documents returned as relevant to the queries in Table 1 is also important from an MT and TA perspective, because it shows that LSA has some capability to &quot;align&quot; similar, but not exact, terms. For example, in Query E, the first three documents all contain the exact query term, &quot;aboriginals&quot;. The next five documents contain only the singular form of the query term. The remaining two documents do not contain either form of the query term. In other words, after the documents that contained the exact query term, the LSA methodology ranked documents containing the singular form of the query term as more relevant than documents that contained no form of the query term. If the LSA methodology were identifying relevant documents only on the basis of an exact match to the query term, it would have no reason to prefer documents containing the singular form of the query term over documents that contained no form of it.</Paragraph>
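    <Paragraph> The preference for near-matching documents can be illustrated with a minimal sketch. The query procedure used in this study is not specified here, so the sketch assumes a common LSA choice: the single-term query is folded into the reduced space and documents are ranked by cosine similarity. It also assumes NumPy, raw counts without IDF weighting, and a hypothetical three-document corpus; rank_documents and the other names are illustrative.

```python
import numpy as np

# Hypothetical toy corpus (terms x documents). D0 contains the exact query term,
# D1 contains only its singular form, and D2 contains neither.
terms = ["aboriginals", "aboriginal", "peoples", "budget", "tax"]
counts = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep the two largest dimensions (an arbitrary choice for this toy corpus).
k = 2
U_k, s_k, docs_k = U[:, :k], s[:k], Vt[:k, :].T


def rank_documents(term):
    """Cosine similarity between a single-term query and every document."""
    q = np.zeros(len(terms))
    q[terms.index(term)] = 1.0
    # Fold the query into the reduced space (assumed method): q_k = q U_k S_k^-1.
    q_k = (q @ U_k) / s_k
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return (docs_k @ q_k) / norms


scores = rank_documents("aboriginals")
for name, score in zip(["D0", "D1", "D2"], scores):
    print(f"{name}: {score:.3f}")
```

    In this toy corpus D1 never contains &quot;aboriginals&quot;, yet it scores well above D2, because it shares the context word &quot;peoples&quot; with D0. An exact-match lookup would give D1 and D2 the same score.</Paragraph>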
  </Section>
</Paper>