<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2023">
  <Title>An Unsupervised System for Identifying English Inclusions in German Text</Title>
  <Section position="4" start_page="0" end_page="133" type="metho">
    <SectionTitle>
(1) &amp;quot;Security-Tool verhindert, dass Hacker über
Google Sicherheitslücken finden&amp;quot;1
</SectionTitle>
    <Paragraph position="0"> Security tool prevents hackers from finding security holes via Google.</Paragraph>
    <Paragraph position="1"> An automatic classifier of foreign inclusions would support the study of this language-mixing phenomenon, as lexical resources need to be updated to reflect this trend. As foreign inclusions carry critical content in terms of pronunciation and semantics, their correct recognition will also provide vital knowledge in applications such as polyglot TTS synthesis or MT.</Paragraph>
  </Section>
  <Section position="5" start_page="133" end_page="133" type="metho">
    <SectionTitle>
3 Data
</SectionTitle>
    <Paragraph position="0"> Our corpus is made up of a random selection of online German newspaper articles published in the Frankfurter Allgemeine Zeitung between 2001 and 2004 in the domains of (1) internet &amp; telecomms, (2) space travel and (3) European Union. These domains were chosen to examine the different use and frequency of English inclusions in German texts of a more technological, scientific and political nature.</Paragraph>
    <Paragraph position="1"> With approximately 16,000 tokens per domain, the overall corpus comprises 48,000 tokens (Table 1).</Paragraph>
    <Paragraph position="2"> We created a manually annotated gold standard using an annotation tool based on NITE XML (Carletta et al., 2003). We annotated two classes: English words and abbreviations that expand to English terms were classed as &amp;quot;English&amp;quot; (EN), and all other tokens as &amp;quot;Outside&amp;quot; (O).2 Table 1 presents the number of English inclusions annotated in each gold standard set and illustrates that English inclusions are very sparse in the EU domain (49 tokens) but considerably more frequent in the internet and space travel domains (963 and 485 tokens, respectively). The type-token ratio (TTR) signals that the English inclusions in the space travel data are less diverse than those in the internet data.</Paragraph>
    <Paragraph position="3"> English inclusions occurring within mixed-lingual compounds (Shuttleflug) or with German inflections (Receivern) are not treated here, as further morphological analysis is required to recognise them. Our aim is to address these issues in future work.</Paragraph>
  </Section>
  <Section position="6" start_page="133" end_page="135" type="metho">
    <SectionTitle>
4 System Description
</SectionTitle>
    <Paragraph position="0"> Our system is a UNIX pipeline which converts HTML documents to XML and applies a set of modules to add linguistic markup and to classify nouns as German or English. The pipeline is composed of a pre-processing module for tokenisation and POS-tagging as well as a lexicon lookup and Google lookup module for identifying English inclusions.</Paragraph>
    <Section position="1" start_page="133" end_page="133" type="sub_section">
      <SectionTitle>
4.1 Pre-processing Module
</SectionTitle>
      <Paragraph position="0"> In the pre-processing module, the downloaded Web documents are firstly cleaned up using Tidy3 to remove HTML markup and any non-textual information and then converted into XML. Subsequently, two rule-based grammars which we developed specifically for German are used to tokenise the XML documents. The grammar rules are applied with lxtransduce4, a transducer which adds or rewrites XML markup on the basis of the rules provided. Lxtransduce is an updated version of fsgmatch, the core program of LT TTT (Grover et al., 2000). The tokenised text is then POS-tagged using TnT trained on the German newspaper corpus Negra (Brants, 2000).</Paragraph>
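A minimal sketch of how this stage could be wired together is given below; the Tidy invocation is a standard one, while the tokenisation and tagging functions are simple stand-ins, since the lxtransduce grammars and the TnT model are not reproduced here.

```python
# Illustrative sketch of the pre-processing stage (section 4.1), not the
# authors' code: only the Tidy call reflects a real command line; the
# tokeniser and tagger below are trivial stand-ins for the lxtransduce
# grammars and for TnT trained on Negra.
import subprocess

def html_to_xml(html_path: str) -> str:
    """Clean up an HTML page and convert it to XHTML using HTML Tidy."""
    result = subprocess.run(["tidy", "-q", "-asxml", html_path],
                            capture_output=True, text=True)
    return result.stdout

def tokenise(text: str) -> list[str]:
    """Stand-in for the rule-based lxtransduce tokenisation grammars."""
    return text.split()

def pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Stand-in for TnT; simply tags every token as a noun (NN)."""
    return [(token, "NN") for token in tokens]
```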
    </Section>
    <Section position="2" start_page="133" end_page="134" type="sub_section">
      <SectionTitle>
4.2 Lexicon Lookup Module
</SectionTitle>
      <Paragraph position="0"> For the initial lookup, we used CELEX, a lexical database of English, German and Dutch containing full and inflected word forms as well as corresponding lemmas. CELEX lookup was only performed for tokens which TnT tagged as nouns (NN), foreign material (FM) or named entities (NE) since anglicisms representing other parts of speech are relatively infrequent in German (Yeandle, 2001).</Paragraph>
      <Paragraph position="1"> Tokens were looked up twice, once in the German and once in the English database, and parts of hyphenated compounds were checked individually. To identify capitalised English tokens, the lookup in the English database was made case-insensitive. We also made the lexicon lookup sensitive to POS tags to reduce classification errors. Tokens were found either only in the German lexicon (1), only in the English lexicon (2), in both (3), or in neither lexicon (4).</Paragraph>
      <Paragraph position="2"> (1) Most tokens found only in the German lexicon are actual German words. The remaining tokens are English words with German case inflection, such as Computern. The word Computer is used so frequently in German that it already appears in lexicons and dictionaries. To detect the base language of the latter, a second lookup can be performed to check whether the lemma of the token also occurs in the English lexicon.</Paragraph>
      <Paragraph position="3"> (2) Tokens found exclusively in the English lexicon such as Software or News are generally English words and do not overlap with German lexicon entries. These tokens are clear instances of foreign inclusions and consequently tagged as English.</Paragraph>
      <Paragraph position="4"> (3) Tokens which are found in both lexicons are words with the same orthographic characteristics in both languages. These are words without inflectional endings or words ending in s signalling either the German genitive singular or the German and English plural forms of that token, e.g. Computers.</Paragraph>
      <Paragraph position="5"> The majority of these lexical items have the same or similar semantics in both languages and represent assimilated loans and cognates where the language origin is not always immediately apparent. Only a small subgroup of them are clearly English loan words (e.g. Monster). Some tokens found in both lexicons are interlingual homographs with different semantics in the two languages, e.g. Rat (council vs. rat).</Paragraph>
      <Paragraph position="6"> Deeper semantic analysis is required to classify the language of such homographs, which we tagged as German by default.</Paragraph>
      <Paragraph position="7">  (4) All tokens found in neither lexicon are submitted to the Google lookup module.</Paragraph>
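The four lookup outcomes translate into a simple decision procedure. The sketch below is an illustrative reconstruction under stated assumptions: `german_lex`, `english_lex` and `lemmatise` are assumed interfaces rather than CELEX APIs, while the POS filter and case handling mirror the description above.

```python
# Illustrative reconstruction of the lexicon lookup decision (section 4.2).
# The lexicons and the lemmatiser are assumed interfaces, not CELEX APIs.
from typing import Callable, Set

def classify_by_lexicon(token: str, pos: str,
                        german_lex: Set[str], english_lex: Set[str],
                        lemmatise: Callable[[str], str]) -> str:
    """Return 'EN', 'O' (German/other) or 'UNKNOWN' (passed to the Google module)."""
    if pos not in {"NN", "FM", "NE"}:
        return "O"                                  # only nouns, foreign material, NEs
    in_german = token in german_lex
    in_english = token.lower() in english_lex       # case-insensitive English lookup
    if in_german and not in_english:                # case (1): German word, unless its
        if lemmatise(token).lower() in english_lex: # lemma is English (e.g. Computern)
            return "EN"
        return "O"
    if in_english and not in_german:                # case (2): clear English inclusion
        return "EN"
    if in_german and in_english:                    # case (3): cognates, assimilated loans
        return "O"                                  # and homographs -> German by default
    return "UNKNOWN"                                # case (4): neither lexicon
```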
    </Section>
    <Section position="3" start_page="134" end_page="135" type="sub_section">
      <SectionTitle>
4.3 Google Lookup Module
</SectionTitle>
      <Paragraph position="0"> The Google lookup module exploits the World Wide Web, a continuously expanding resource with documents in a multiplicity of languages. Although the bulk of information available on the Web is in English, the number of texts written in languages other than English has increased rapidly in recent years (Crystal, 2001; Grefenstette and Nioche, 2000).</Paragraph>
      <Paragraph position="1"> The exploitation of the Web as a linguistic corpus is developing into a growing trend in computational linguistics. The sheer size of the Web and the continuous addition of new material in different languages make it a valuable pool of information in terms of language in use. The Web has already been used successfully for a series of NLP tasks such as MT (Grefenstette, 1999), word sense disambiguation (Agirre and Martinez, 2000), synonym recognition (Turney, 2001), anaphora resolution (Modjeska et al., 2003) and determining frequencies for unseen bi-grams (Keller and Lapata, 2003).</Paragraph>
      <Paragraph position="2"> The Google lookup module obtains the number of hits for two searches per token, one on German Web pages and one on English ones, an advanced language preference offered by Google. Each token is classified as either German or English based on the search that returns the higher normalised score of the number of hits. This score is determined by weighting the number of raw hits by the size of the Web corpus for that language. We estimate the latter following the method proposed by Grefenstette and Nioche (2000), using the frequencies of a series of representative tokens in a standard corpus of a language to extrapolate the size of the Web corpus for that language. We assume that a German word is more frequently used in German text than in English and vice versa. As illustrated in Table 2, the German word Anbieter (provider) has a considerably higher weighted frequency in German Web documents (DE). Conversely, the English word provider occurs more often in English Web documents (EN).</Paragraph>
      <Paragraph position="3"> If both searches return zero hits, the token is classified as German by default. Word queries that return zero or a low number of hits can also be indicative of new expressions that have entered a language.</Paragraph>
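A minimal sketch of this decision follows, assuming a `search_hits(token, lang)` wrapper for the language-restricted searches and placeholder corpus-size estimates in the spirit of Grefenstette and Nioche (2000).

```python
# Illustrative sketch of the Google lookup decision (section 4.3).
# search_hits(token, lang) is an assumed wrapper around a language-restricted
# web search; the corpus-size estimates are placeholders, not real figures.
from typing import Callable

WEB_CORPUS_SIZE = {"de": 3.0e9, "en": 5.0e10}  # hypothetical token counts per language

def classify_by_web(token: str, search_hits: Callable[[str, str], int]) -> str:
    """Classify a token as 'EN' or 'O' (German) by weighted web frequency."""
    hits_de = search_hits(token, "de")
    hits_en = search_hits(token, "en")
    if hits_de == 0 and hits_en == 0:
        return "O"                                 # default to German on zero hits
    score_de = hits_de / WEB_CORPUS_SIZE["de"]     # normalise by estimated corpus size
    score_en = hits_en / WEB_CORPUS_SIZE["en"]
    return "EN" if score_en > score_de else "O"
```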
      <Paragraph position="4"> Google lookup was only performed for the tokens found in neither lexicon in order to keep computational cost to a minimum. Moreover, a preliminary experiment showed that the lexicon lookup is already sufficiently accurate for tokens contained exclusively in the German or English databases. Current Google search options are also limited in that queries cannot be treated case- or POS-sensitively.</Paragraph>
      <Paragraph position="5"> Consequently, interlingual homographs would often mistakenly be classified as English.</Paragraph>
      <Paragraph position="6"> 5 Evaluation of the Lookup System
We evaluated the system's performance for all tokens against the gold standard. While the accuracies in Table 3 represent the percentage of all correctly tagged tokens, the F-scores refer to the English tokens and are calculated giving equal weight to precision (P) and recall (R), i.e. F = 2PR/(P + R). The system yields relatively high F-scores of 72.4 and 73.1 for the internet and space travel data but only a low F-score of 38.6 for the EU data. The latter is due to the sparseness of English inclusions in that domain (Table 1). Although recall for this data is comparable to that of the other two domains, the number of false positives is high, causing low precision and F-score. As the system does not look up one-character tokens, we implemented further post-processing to classify individual characters as English if followed by a hyphen and an English inclusion. This improves the F-score by 4.8 for the internet data to 77.2 and by 0.6 for the space travel data to 73.7 as both data sets contain words like E-Mail or E-Business. Post-processing does not decrease the EU score. This indicates that domain-specific post-processing can improve performance.</Paragraph>
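The one-character post-processing step can be pictured as a single pass over the tagged token sequence. The following is an assumed reconstruction which treats the hyphen as a separate token; it is illustrative rather than the original implementation.

```python
# Illustrative sketch of the post-processing rule: a single character followed
# by a hyphen and a token tagged English (E-Mail, E-Business) is itself tagged
# English. Assumes the hyphen is a separate token in the tokenised sequence.

def postprocess(tokens: list[str], tags: list[str]) -> list[str]:
    out = list(tags)
    for i, token in enumerate(tokens):
        if len(token) == 1 and i + 2 < len(tokens):
            if tokens[i + 1] == "-" and out[i + 2] == "EN":
                out[i] = "EN"
    return out

# Example: ["E", "-", "Mail"] with tags ["O", "O", "EN"] becomes ["EN", "O", "EN"].
```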
      <Paragraph position="7"> Baseline accuracies when assuming that all tokens are German are also listed in Table 3. As F-scores are calculated based on the English tokens in the gold standard, we cannot report comparable baseline F-scores. Unsurprisingly, the baseline accuracies are relatively high as most tokens in a German text are German and the amount of foreign material is relatively small. The added classification of English inclusions yielded highly statistically significant improvements (p &lt; 0.001) over the baseline: 3.5% for the internet data and 1.5% for the space travel data. When classifying English inclusions in the EU data, accuracy decreased slightly by 0.3%.</Paragraph>
      <Paragraph position="8"> Table 3 also shows the performance of TextCat, an n-gram-based text categorisation algorithm by Cavnar and Trenkle (1994). While this language identification tool requires no lexicons, its F-scores are low for all three domains and very poor for the EU data. This confirms that the identification of English inclusions is more difficult for this domain, coinciding with the result of the lookup system. The low scores also show that this kind of document-level language identification is unsuitable for token-based language classification.</Paragraph>
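For reference, TextCat-style identification ranks character n-gram profiles and scores a text by its distance to each language profile, which leaves very little evidence for single tokens. A minimal sketch of that idea, assuming the "out-of-place" measure of Cavnar and Trenkle (1994):

```python
# Minimal sketch of TextCat-style character n-gram profiles and the
# "out-of-place" distance (Cavnar and Trenkle, 1994); illustrative only.
from collections import Counter

def ngram_profile(text: str, n_max: int = 5, top_k: int = 300) -> list[str]:
    """Most frequent character n-grams (n = 1..n_max), ranked by frequency."""
    counts = Counter()
    padded = f"_{text.lower()}_"
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place(doc_profile: list[str], lang_profile: list[str]) -> int:
    """Sum of rank displacements; n-grams missing from the language profile
    receive the maximum penalty. A lower distance means a better match."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(rank - ranks[gram]) if gram in ranks else max_penalty
               for rank, gram in enumerate(doc_profile))
```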
    </Section>
  </Section>
  <Section position="7" start_page="135" end_page="137" type="metho">
    <SectionTitle>
6 Machine Learning Experiments
</SectionTitle>
    <Paragraph position="0"> The recognition of foreign inclusions bears great similarity to classification tasks such as named entity recognition (NER), for which various machine learning techniques have proved successful. We were therefore interested in determining the performance of a trained classifier for our task. We experimented with a conditional Markov model tagger that performed well on language-independent NER (Klein et al., 2003) and the identification of gene and protein names (Finkel et al., 2005).</Paragraph>
    <Section position="1" start_page="135" end_page="136" type="sub_section">
      <SectionTitle>
6.1 In-domain Experiments
</SectionTitle>
      <Paragraph position="0"> We performed several 10-fold cross-validation experiments with different feature sets. They are referred to as in-domain (ID) experiments as the tagger is trained and tested on data from the same domain (Table 4). In the first experiment (ID1), we use the tagger's standard feature set including words, character sub-strings, word shapes, POS tags, abbreviations and NE tags (Finkel et al., 2005). The resulting F-scores are high for the internet and space travel data (84.3 and 91.4) but are extremely low for the EU data (13.3) due to the sparseness of English inclusions in that data set. ID2 involves the same setup as ID1 but eliminates all features relying on POS tags. The tagger performs similarly well for the internet and space travel data but improves by 8 points to an F-score of 21.3 for the EU data.</Paragraph>
      <Paragraph position="1"> This can be attributed to the fact that the POS-tagger does not perform with perfect accuracy, particularly on data containing foreign inclusions. Providing the tagger with this information is therefore not necessarily useful for this task, especially when the data is sparse. Nevertheless, there is a large discrepancy between the F-score for the EU data and those of the other two data sets. ID3 and ID4 are set up like ID1 and ID2 but incorporate the output of the lookup system as a gazetteer feature. The tagger benefits considerably from this lookup feature and yields better F-scores for all three domains in ID3 (internet: 90.6, space travel: 93.7, EU: 44.4).</Paragraph>
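The gazetteer feature derived from the lookup system amounts to one extra column in the tagger's per-token feature representation. The sketch below is schematic, with assumed feature names; it is not the feature extractor of the CMM tagger itself.

```python
# Schematic illustration of a per-token feature map including the lookup
# system's output as a gazetteer feature (experiments ID3/ID4). Feature names
# are assumptions, not those of the CMM tagger.

def token_features(token: str, pos: str, lookup_tag: str) -> dict[str, str]:
    """lookup_tag is 'EN' or 'O' as assigned by the lookup system."""
    shape = "".join("X" if c.isupper() else "x" if c.islower()
                    else "d" if c.isdigit() else c for c in token)
    return {
        "word": token,
        "lower": token.lower(),
        "shape": shape,              # word-shape feature
        "suffix3": token[-3:],       # character sub-string feature
        "pos": pos,                  # dropped in ID2/ID4
        "lookup": lookup_tag,        # gazetteer feature added in ID3/ID4
    }
```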
      <Paragraph position="2"> Table 4 also compares the best F-scores produced with the tagger's own feature set (ID2) to the best results of the lookup system and the baseline. While the tagger performs much better for the internet and the space travel data, it requires hand-annotated training data. The lookup system, on the other hand, is essentially unsupervised and therefore much more portable to new domains. Given the necessary lexicons, it can easily be run over new text and text in a different language or domain without further cost.</Paragraph>
    </Section>
    <Section position="2" start_page="136" end_page="137" type="sub_section">
      <SectionTitle>
6.2 Cross-domain Experiments
</SectionTitle>
      <Paragraph position="0"> The tagger achieved surprisingly high F-scores for the internet and space travel data, considering the small training data set of around 700 sentences used for each ID experiment described above. Although both domains contain a large number of English inclusions, their type-token ratio amounts to 0.29 in the internet data and 0.15 in the space travel data (Table 1), signalling that English inclusions are frequently repeated in both domains. As a result, the likelihood of the tagger encountering an unknown inclusion in the test data is relatively small.</Paragraph>
      <Paragraph position="1"> To examine the tagger's performance on a new domain containing more unknown inclusions, we ran two cross-domain (CD) experiments: CD1, training on the internet and testing on the space travel data, and CD2, training on the space travel and testing on the internet data. We chose these two domain pairs to ensure that both the training and test data contain a relatively large number of English inclusions. Table 5 shows that the F-scores for both CD experiments are much lower than those obtained when training and testing the tagger on documents from the same domain. In experiment CD1, the F-score only amounts to 54.2, while the percentage of unknown target types (UTT) in the space travel test data is 81.9%. The F-score is even lower in the second experiment at 22.2, which can be attributed to the fact that the percentage of unknown target types in the internet test data is higher still at 93.9%.</Paragraph>
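Both corpus statistics used in this discussion, the type-token ratio of the English inclusions and the share of unknown target types, are straightforward to compute; a minimal sketch:

```python
# Minimal sketch of the two statistics discussed here: the type-token ratio
# (TTR) of the English inclusions and the share of target types in the test
# data that never occur in the training data ("unknown target types").

def type_token_ratio(tokens: list[str]) -> float:
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def unknown_target_type_rate(train_inclusions: list[str],
                             test_inclusions: list[str]) -> float:
    train_types = set(train_inclusions)
    test_types = set(test_inclusions)
    if not test_types:
        return 0.0
    return len(test_types - train_types) / len(test_types)
```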
      <Paragraph position="2"> These results indicate that the tagger's high performance in the ID experiments is largely due to the fact that the English inclusions in the test data are known, i.e. the tagger learns a lexicon. It is therefore difficult to train a machine learning classifier that performs well on new data, as more and more new anglicisms enter German over time. The number of unknown tokens will increase constantly unless new annotated training data is added.</Paragraph>
    </Section>
  </Section>
</Paper>