<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3403">
  <Title>Computational Measures for Language Similarity across Time in Online Communities</Title>
  <Section position="3" start_page="0" end_page="16" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> In the next sections, we review literature on language similarity or convergence. We also review literature on three computational tools: Spearman's Rank Correlation Coefficient (SCC), Zipping, and Latent Semantic Analysis (LSA).</Paragraph>
    <Section position="1" start_page="0" end_page="15" type="sub_section">
      <SectionTitle>
2.1 Language Similarity in Computer-mediated Communication
</SectionTitle>
      <Paragraph position="0"> In dyadic settings, speakers often converge to one another's speech styles, matching not only the choice of referring expressions or other words, but also structural dimensions such as syntax, sound characteristics such as accent, prosody, or phonology, and even non-verbal behaviors such as gesture (Brennan &amp; Clark, 1996; Street &amp; Giles, 1982).</Paragraph>
      <Paragraph position="1"> Some scholars suggest that this convergence or entrainment is based on a conscious need to accommodate to one's conversational partner, or serves as a strategy to maximize communication effectiveness (Street &amp; Giles, 1982). Others suggest that the alignment is an automatic response, in which echoic aspects of speech, gesture and facial expressions are unconscious reactions (Garrod &amp; Anderson, 1987; Lakin, Jefferies, Cheng, &amp; Chartrand, 2003). In short, conversational partners tend to accommodate to each other by imitating or matching the semantic, syntactic and phonological characteristics of their partners (Brennan &amp; Clark, 1996; Garrod &amp; Pickering, 2004).</Paragraph>
      <Paragraph position="2"> Many studies have concentrated on dyadic interactions, but large-scale communities also demonstrate language similarity or convergence. In fact, speech communities have a strong influence in creating and maintaining language patterns, including word choice or phonological characteristics (Labov, 2001). Language use often plays an important role in constituting a group or community identity (Eckert, 2003). For example, language 'norms' in a speech community often result in the conformity of new members in terms of accent or lexical choice (Milroy, 1980). This effect has been quite clear among non-native speakers, who quickly pick up the vernacular and speech patterns of their new situation (Chambers, 2001), but the opposite is also true, with native speakers picking up speech patterns from non-native speakers (Auer &amp; Hinskens, 2005). Linguistic innovation is particularly salient on the Internet, where words and linguistic patterns have been manipulated or reconstructed by individuals and quickly adopted by a critical mass of users (Crystal, 2001). Niederhoffer &amp; Pennebaker (2002) found that users of instant messenger tend to match each other's linguistic styles. A study of language socialization in a bilingual chat room suggests that participants developed particular linguistic patterns and that native and non-native speakers influenced each other (Lam, 2004). Similar language socialization has been found in ethnographic research of large-scale online communities as well, in which various expressions are created and shared by group members (Baym, 2000; Cherny, 1999).</Paragraph>
      <Paragraph position="3"> Other research not only confirms the creation of new linguistic patterns online, and subsequent adoption by users, but suggests that the strength of the social ties between participants influences how patterns are spread and adopted (Paolillo, 2001).</Paragraph>
      <Paragraph position="4"> However, little research has been devoted to how language changes over longer periods of time in these online communities.</Paragraph>
    </Section>
    <Section position="2" start_page="15" end_page="15" type="sub_section">
      <SectionTitle>
2.2 Computational Measures of Language
Similarity
</SectionTitle>
      <Paragraph position="0"> The unit of analysis in online communities is the (e-mail or chat) message. Therefore, measuring entrainment in online communities relies on assessing whether or not similarity between the messages of each participant increases over time. Most techniques for measuring document similarity rely on the analysis of word frequencies and their co-occurrence in two or more corpora (Kilgarriff, 2001), so we start with these techniques.</Paragraph>
    </Section>
    <Section position="3" start_page="15" end_page="16" type="sub_section">
      <SectionTitle>
Spearman's Rank Correlation Coefficient (SCC)
</SectionTitle>
      <Paragraph position="0"> SCC is particularly useful because it is easy to compute and does not depend on text size. Unlike some other statistical approaches (e.g. chi-square), SCC has been shown to be effective at determining similarity between corpora of varying sizes; SCC therefore serves as a baseline for comparison in this paper (Kilgarriff, 2001).</Paragraph>
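A rank-correlation comparison of this kind can be sketched as follows. This is a minimal illustration in the spirit of Kilgarriff (2001), not the exact procedure used in this paper. The function names (average_ranks, pearson, scc_similarity), the top_n cutoff, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from largest (rank 1) to smallest, averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; applied to ranks it yields Spearman's coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant ranking as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def scc_similarity(text_a, text_b, top_n=500):
    """Spearman correlation of word-frequency ranks over the joint top_n words."""
    counts_a = Counter(text_a.lower().split())  # assumption: whitespace tokens
    counts_b = Counter(text_b.lower().split())
    vocab = [w for w, _ in (counts_a + counts_b).most_common(top_n)]
    if not vocab:
        return 0.0  # assumption: two empty texts are treated as uncorrelated
    ranks_a = average_ranks([counts_a[w] for w in vocab])
    ranks_b = average_ranks([counts_b[w] for w in vocab])
    return pearson(ranks_a, ranks_b)
```

Because only ranks enter the correlation, the absolute sizes of the two texts do not affect the score, which is the property that makes the measure attractive for corpora of varying sizes.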
      <Paragraph position="1"> More recently, researchers have experimented with data compression algorithms as a measure of document complexity and similarity. This technique uses compression ratios as an approximation of a document's information entropy (Baronchelli, Caglioti, &amp; Loreto, 2005; Benedetto, Caglioti, &amp; Loreto, 2002). Standard Zipping algorithms have demonstrated effectiveness in a variety of document comparison and classification tasks. Behr et al. (2003) found that a document and its translation into another language compressed to approximately the same size. They suggest that this could be used as an automatic measure for testing machine translation quality. Kaltchenko (2004) argues that using compression algorithms to compute relative entropy is more relevant than using distances based on Kolmogorov complexity. Lastly, Benedetto et al. (2002) present some basic findings using GZIP for authorship attribution, determining the language of a document, and building a tree of language families from a text written in different languages. Although Zipping may be a contentious technique, these results present intriguing reasons to continue exploration of its applications.</Paragraph>
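The compression-based idea can be illustrated with a short sketch. It assumes Python's zlib module (DEFLATE, the algorithm underlying GZIP) as a stand-in for the Zipping tools cited above. The function names (compressed_size, compression_ratio, cross_cost) are hypothetical, and the appended-sample heuristic is only one simple reading of the approach described by Benedetto et al. (2002).

```python
import zlib


def compressed_size(data):
    """Number of bytes after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def compression_ratio(text):
    """Compressed size divided by raw size: a rough proxy for entropy."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return compressed_size(raw) / len(raw)


def cross_cost(reference, sample):
    """Extra compressed bytes needed when sample is appended to reference.

    A sample that resembles the reference reuses patterns the compressor
    has already seen, so it should add fewer bytes than an unrelated one.
    """
    ref = reference.encode("utf-8")
    both = ref + sample.encode("utf-8")
    return compressed_size(both) - compressed_size(ref)
```

One caveat with this sketch: DEFLATE only looks back over a limited window (32 KB), so with long reference texts only the most recent portion influences the cost of the appended sample.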
      <Paragraph position="2"> Latent Semantic Analysis is another technique used for measuring document similarity. LSA employs a vector-based model to capture the semantics of words by applying Singular Value Decomposition on a term-document matrix (Landauer, Foltz, &amp; Laham, 1998). LSA has been successfully applied to tasks such as measuring semantic similarity among corpora of texts (Coccaro &amp; Jurafsky, 1998), measuring cohesion (Foltz, Kintsch, &amp; Landauer, 1998), assessing correctness of answers in tutoring systems (Wiemer-Hastings &amp; Graesser, 2000) and dialogue act classification (Serafin &amp; Di Eugenio, 2004).</Paragraph>
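The vector-based model can be sketched on a toy scale. To stay dependency-free, this example does not call a library SVD routine. As an assumption, it recovers the leading right singular vectors by power iteration with deflation on the document-by-document matrix A^T A, which gives the same document coordinates as a truncated SVD of the term-document matrix A. It also uses raw counts with no term weighting, and all function names are hypothetical.

```python
import math


def term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w in d.lower().split():
            matrix[index[w]][j] += 1.0
    return matrix


def gram(matrix, n_docs):
    """Return A^T A, the document-by-document inner-product matrix."""
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n_docs)]
            for i in range(n_docs)]


def top_eigenpairs(sym, k, iterations=200):
    """Leading eigenpairs of a symmetric PSD matrix via power iteration."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for component in range(k):
        # Deterministic start vector, varied per component (an assumption).
        vec = [float((i + 1) ** (component + 1)) for i in range(n)]
        value = 0.0
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            value = math.sqrt(sum(x * x for x in nxt))
            if value == 0.0:
                break
            vec = [x / value for x in nxt]
        pairs.append((value, vec))
        # Deflate: subtract the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def lsa_document_vectors(docs, k=2):
    """Coordinates of each document in a k-dimensional latent space."""
    matrix = term_document_matrix(docs)
    pairs = top_eigenpairs(gram(matrix, len(docs)), k)
    # Document j gets sigma_i * v_i[j], with sigma_i = sqrt(eigenvalue_i).
    return [[math.sqrt(max(val, 0.0)) * vec[j] for val, vec in pairs]
            for j in range(len(docs))]


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Similarity between two messages is then the cosine of their latent vectors. A practical system would normally apply a weighting scheme to the counts and use an optimized SVD implementation instead of power iteration.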
      <Paragraph position="3"> To our knowledge, statistical measures like SCC, Zipping compression algorithms, or LSA have never been used to measure similarity of messages over time, nor have they been applied to online communities. However, it is not obvious how we would verify their performance, and given the nature of the task - similarity in over 15,000 e-mail messages - it is impossible to compare the computational methods to hand-coding. As a preliminary approach, we therefore decided to apply all three methods in turn to the messages in an online community to examine change in linguistic similarity over time, and to compare their results.</Paragraph>
      <Paragraph position="4"> Through the combination of lexical, phrasal and semantic similarity metrics, we hope to gain insight into the questions of whether entrainment occurs in online communities, and of what computational measures can be used to measure it.</Paragraph>
    </Section>
    <Section position="4" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
2.3 The Junior Summit
</SectionTitle>
      <Paragraph position="0"> The Junior Summit launched in 1998 as a closed online community for young people to discuss how to use technology to make the world better. 3000 children ages 10 to 16 participated in 1000 teams (some as individuals and some with friends). Participants came from 139 different countries, and could choose to write in any of 5 languages. After 2 weeks online, the young people divided into 20 topic groups of their own choosing. Each of these topic groups functioned as a smaller community within the community of the Junior Summit; after another 6 weeks, each topic group elected 5 delegates to come to the US for an in-person forum.</Paragraph>
      <Paragraph position="1"> The dataset from the Junior Summit comprises more than 40,000 e-mail messages; however, in the current paper we look at only a subset of these data - messages written in English during the 6-week topic group period. For complete details, please refer to Cassell &amp; Tversky (2005).</Paragraph>
    </Section>
  </Section>
</Paper>