<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1139"> <Title>Linguistic profiling of texts for the purpose of language verification</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Research in linguistics and language engineering thrives on the availability of data. Traditionally, corpora would be compiled with a specific purpose in mind. Such corpora characteristically were well-balanced collections of data. In the form of metadata, record was kept of the design criteria, sampling procedures, etc. Thus the researcher would have a fair idea of where his data originated from. Over past decades, data collection has been boosted by technological developments. More and more and increasingly large collections of data have been and are being compiled. It is tempting to think that the problem of data sparseness has been solved - at least for raw data or data without any annotation other than can be provided fully automatically - especially now that large amounts of data can be accessed through the internet.</Paragraph> <Paragraph position="1"> However, with data coming to us from all over the world, originating from all sorts of sources, we now possibly have a new problem on our hands: often the origins of the found data remain obscure.</Paragraph> <Paragraph position="2"> It is not always clear what exactly the implications for our research are of employing data whose origin we do not know. Is it legal to use these data, ethical, appropriate, ...? In this paper we will focus on the last point: the appropriateness of the data in the light of a specific application or research goal. More in particular, we will investigate to what extent we can devise a procedure that will enable us to identify texts produced by native speakers of the language (and thus by default those produced by non-native speakers). The present study is motivated by the fact that for many uses the (near-)nativeness of the data is a critical factor in the development of adequate resources and applications. Thus, for example, a style checker or some other writing assistant tool which has been based on erroneous materials or at least materials deviant from the language targeted, will not always respond appropriately.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Assessing (near-)nativeness </SectionTitle> <Paragraph position="0"> In the general absence of metadata which attest that texts have been produced by native speakers, there is one obvious approach that one may consider in order to assess the (near-)nativeness of texts of unknown origin and that is to exploit their specific linguistic characteristics.</Paragraph> <Paragraph position="1"> Previous studies investigating language variation (eg Biber, 1995, 1998; Biber et al., 1998; Conrad and Biber, 2001; Granger, 1998, Granger et al., 2002) have shown that language use in different genres and by different (groups of) speakers displays characteristic use of specific linguistic features (lexical, morphological, syntactic, semantic, discoursal). These studies are all based on data of known origin. 
In the present study, we take a somewhat different approach, as we aim to profile texts of unknown origin and to identify native vs. non-native language use, a task for which we coined the term language verification.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 1.2 Non-native language use </SectionTitle>
<Paragraph position="0"> Texts produced by non-native speakers will generally pass superficial inspection, i.e. they are deemed to be texts in the target language and will be treated as such. However, on closer inspection there is a wide range of features in the language use of non-natives which may have a disruptive effect on, for instance, derived language models. It is important to realize that non-native use is the complex result of different processes and conditions. First of all, there is the level of achievement. A non-native user gradually develops language skills in the target language.</Paragraph>
<Paragraph position="1"> As he/she masters certain lexical items or morpho-syntactic structures and feels confident in using them, these items and structures are bound to be overused. At the same time, other items and structures remain underused, as the user avoids them because he/she is not familiar with them or does not (yet) feel confident enough to employ them.</Paragraph>
<Paragraph position="2"> Moreover, even for speakers who have attained a relatively high degree of proficiency, the influence of the native language remains. This may lead to transfer and interference effects, which show up, for example, in the use of false friends and in word order deviations.</Paragraph>
<Paragraph position="3"> In the present paper, we report the results of experiments aimed at assessing whether texts are of (British English) native or non-native origin, using the method of linguistic profiling. The structure of the paper is as follows: in section 2, we describe the method of linguistic profiling. Next, in section 3, we describe its application in establishing the nativeness of texts, while in section 4 we investigate whether the approach holds up when we shift from one domain to another. Finally, section 5 presents the conclusions.</Paragraph>
</Section>
</Section>
</Paper>