<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1701">
  <Title>Web-based frequency dictionaries for medium density languages</Title>
  <Section position="4" start_page="4" end_page="6" type="concl">
    <SectionTitle>
3 Conclusions
</SectionTitle>
    <Paragraph position="0"> Once the disambiguation of morphological analyses is under control, lemmatization itself is a mechanical task which we perform in a database framework. This has the advantage that it supports a rich set of query primitives, so that we can easily find e.g. nouns with back vowels that show stem vowel elision and have approximately the same frequency as the stem orvos 'doctor'.</Paragraph>
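    <Paragraph>The kind of query mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the table name lemma, its columns, the tolerance parameter, and the frequency figures are all hypothetical and are not taken from the actual database; the few Hungarian stems are toy entries with made-up counts.</Paragraph>

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical lemma table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lemma ("
        "stem TEXT PRIMARY KEY, pos TEXT, "
        "back_vowel INTEGER, elision INTEGER, freq REAL)"
    )
    # Made-up frequencies, for illustration only.
    conn.executemany(
        "INSERT INTO lemma VALUES (?, ?, ?, ?, ?)",
        [
            ("orvos", "NOUN", 1, 0, 100.0),   # reference stem, no elision
            ("bokor", "NOUN", 1, 1, 95.0),    # back vowels, elision
            ("torony", "NOUN", 1, 1, 300.0),  # back vowels, elision, too frequent
            ("terem", "NOUN", 0, 1, 105.0),   # elision, but front vowels
        ],
    )
    return conn


def similar_nouns(conn, reference_stem, tolerance=0.2):
    """Back-vowel nouns with stem vowel elision whose frequency lies
    within a relative tolerance of the reference stem's frequency."""
    row = conn.execute(
        "SELECT freq FROM lemma WHERE stem = ?", (reference_stem,)
    ).fetchone()
    if row is None:
        raise KeyError(reference_stem)
    ref = row[0]
    rows = conn.execute(
        "SELECT stem FROM lemma "
        "WHERE pos = 'NOUN' AND back_vowel = 1 AND elision = 1 "
        "AND freq BETWEEN ? AND ? ORDER BY stem",
        (ref * (1 - tolerance), ref * (1 + tolerance)),
    ).fetchall()
    return [r[0] for r in rows]


print(similar_nouns(build_toy_db(), "orvos"))
```

    <Paragraph>Expressing "approximately the same frequency" as a relative band around the reference stem is one possible design choice; a band on log frequency would serve equally well.</Paragraph>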
    <Paragraph position="1"> Such a database has obvious applications both in psycholinguistic experiments (which was one of the design goals) and in settling questions of theoretical morphology. But there are always nagging doubts about the closed world assumption behind databases, famously exposed in linguistics by Chomsky's example colorless green ideas sleep furiously: how do we distinguish this from *green sleep colorless furiously ideas if the observed frequency is zero for both? Clearly, a naive empirical model that assigns zero probability to each unseen word form makes the wrong predictions. Better estimates can be achieved if unseen words which are known to be possible morphologically complex forms of seen lemmas are assigned positive probability. This can be done if the probability of a complex form is in some way predictable from the probabilities of its component parts. A simple variant of this model is the positional independence hypothesis which takes the probabilities of morphemes in separate positional classes to be independent of each other.</Paragraph>
    <Paragraph position="2"> Here we follow Antal (1961) and Kornai (1992) in establishing three positional classes in the inflectional paradigm of Hungarian nouns.</Paragraph>
    <Paragraph position="4"> The innermost class is used for number and possessive, with a total of 18 choices including the zero morpheme (no possessor and singular). The second positional class is for anaphoric possessives, with a total of three choices including the zero morpheme, and the third (outermost) class is for case endings, with a total of 19 choices including the zero morpheme (nominative), for a total of 18 × 3 × 19 = 1026 paradigmatic forms. The parameters were obtained by downhill simplex minimization of absolute errors. The average absolute error of the values computed by the independence hypothesis from the observed values is 0.000099 (mean squared error is 9.18 × 10^-7), including the 209 paradigmatic slots for which no forms were found in the webcorpus at all (but the independence model will assign positive probability to any of them as the product of the component probabilities). When checking the independence hypothesis with Φ statistics in the webcorpus for every nominal inflectional morpheme pair whose members are from different dimensions, the Φ coefficient remained less than 0.1 for each pair but three. For these three the coefficient is under 0.2 (which means that the shared variance of these pairs is between 1% and 2%), so we have no reason to discard the independence hypothesis. If we run the same test on the 150-million-word Hungarian National Corpus, which was analyzed and tagged by different tools, we get the same result (Nagy, 2005). It is very easy to construct low-probability combinations using this model. Taking a less frequent possessive ending such as the 2nd singular possessor familiar plural -odék, the anaphoric plural -éi, and a rarer case ending such as the formalis -ként, we obtain combinations such as barátodékéiként &quot;as the objects owned by your friends' company&quot;. The model predicts we need a corpus with about 4.2 × 10^12 noun tokens to see this suffix combination (not necessarily with the stem barát &quot;friend&quot;), or about ten trillion tokens.</Paragraph>
    <Section position="1" start_page="4" end_page="6" type="sub_section">
      <Paragraph position="0"> While the current corpus falls short by four orders of magnitude, this is about the contribution of the anaphoric plural (which we expect to see only once in about 40k noun tokens), so for any two of the three position classes combined, the prediction that valid inflectional combinations will actually be attested is already testable.</Paragraph>
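      <Paragraph>The positional independence model can be sketched in a few lines. The class sizes (18, 3, 19), the 4.2 × 10^12 figure, and the 40k figure come from the text; the three small distributions and their labels are hypothetical stand-ins, not the fitted parameters, and the function names are chosen only for this illustration.</Paragraph>

```python
from itertools import product
from math import prod

# Class sizes reported in the text:
# number/possessive, anaphoric possessive, case.
CLASS_SIZES = (18, 3, 19)
PARADIGM_SIZE = prod(CLASS_SIZES)

# Hypothetical toy distributions (NOT the fitted parameters), one per class.
TOY_CLASSES = (
    {"SG": 0.9, "PL": 0.1},
    {"ZERO": 0.99, "ANP": 0.01},
    {"NOM": 0.7, "ACC": 0.3},
)


def form_probability(classes, choice):
    """Probability of one paradigmatic form under positional independence:
    the product of the probabilities of the morpheme picked in each class."""
    return prod(dist[m] for dist, m in zip(classes, choice))


def expected_tokens(p):
    """Expected number of noun tokens per occurrence of a form with probability p."""
    return 1.0 / p


# Every combination gets positive probability, and the probabilities sum to one.
all_forms = list(product(*TOY_CLASSES))
total = sum(form_probability(TOY_CLASSES, f) for f in all_forms)

# A form that might never be observed still receives a nonzero estimate.
p_rare = form_probability(TOY_CLASSES, ("PL", "ANP", "ACC"))

# Arithmetic on the figures quoted in the text: dividing out the anaphoric
# plural (about once per 40k noun tokens) from the full combination.
TOKENS_FULL_COMBINATION = 4.2e12
TOKENS_PER_ANAPHORIC_PLURAL = 40_000
tokens_two_classes = TOKENS_FULL_COMBINATION / TOKENS_PER_ANAPHORIC_PLURAL

print(PARADIGM_SIZE, total, p_rare, expected_tokens(p_rare), tokens_two_classes)
```

      <Paragraph>The last quantity illustrates why combinations drawn from only two of the three classes are within reach of the current corpus: removing the anaphoric plural lowers the required corpus size by roughly four orders of magnitude.</Paragraph>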
      <Paragraph position="1"> Using the fitted distribution of the position classes, the entropy of the nominal paradigm is computed simply as the sum of the class entropies, 1.554 + 0.0096 + 2.325 or 3.888 bits. Since the nominal paradigm is considerably more complex than the verbal paradigm (which has a total of 52 forms) or the infinitival paradigm (7 forms), this value can serve as an upper bound on the inflectional entropy of Hungarian. In Table 3 we present the actual values, computed on a variety of frequency dictionaries. The smallest of these is based on a single text, the Hungarian translation of Orwell's 1984. The mid-range corpora used in this comparison are segregated by broad topic: law (EU laws and regulations), literature, movie subtitles, and software manuals. All were collected from the web as part of building a bilingual English-Hungarian corpus. Finally, the large-range corpus is the full webcorpus at the best (4% reject) quality stratum.</Paragraph>
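      <Paragraph>The reason the paradigm entropy is a plain sum is that the entropy of a product of independent distributions equals the sum of the component entropies. The sketch below checks this numerically on hypothetical toy distributions (not the fitted ones) and then adds up the three class entropies quoted in the text; all names are illustrative.</Paragraph>

```python
from itertools import product
from math import log2, prod


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def joint_entropy(classes):
    """Entropy of the joint distribution over all combinations, assuming
    the classes are independent (joint probability = product)."""
    h = 0.0
    for combo in product(*(d.values() for d in classes)):
        p = prod(combo)
        if p > 0:
            h -= p * log2(p)
    return h


# Hypothetical toy distributions, one per positional class.
TOY_CLASSES = (
    {"SG": 0.9, "PL": 0.1},
    {"ZERO": 0.99, "ANP": 0.01},
    {"NOM": 0.7, "ACC": 0.3},
)

sum_of_class_entropies = sum(entropy(d) for d in TOY_CLASSES)
entropy_of_joint = joint_entropy(TOY_CLASSES)

# Class entropies quoted in the text, in bits.
REPORTED_CLASS_ENTROPIES = (1.554, 0.0096, 2.325)
reported_total = sum(REPORTED_CLASS_ENTROPIES)

print(sum_of_class_entropies, entropy_of_joint, reported_total)
```

      <Paragraph>On the toy data the two computed values agree up to floating-point error, which is the additivity property the paradigm entropy relies on.</Paragraph>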
      <Paragraph position="2">  Our overall conclusion is that for many purposes a web-based corpus has significant advantages over more traditional corpora. First, it is cheap to collect. Second, it is sufficiently heterogeneous to ensure that language models based on it generalize better on new texts of arbitrary topics than models built on (balanced) manual corpora.</Paragraph>
      <Paragraph position="3"> As we have shown, automatically tagged and lemmatized webcorpora can be used to obtain large-coverage stem and wordform frequency dictionaries. While there is a significant portion of OOV entries (about 3% for our current MA), in the design of psycholinguistic experiments it is generally sufficient to consider stems already known to the MA, and the variety of these (over three times the stem lexicon of the standard Hungarian frequency dictionary) enables many controlled experiments hitherto impossible.</Paragraph>
    </Section>
  </Section>
</Paper>