<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0409">
  <Title>Integrating Morphology with Multi-word Expression Processing in Turkish</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Multi-word expression extraction is an important component in language processing that aims to identify segments of input text where the syntactic structure and the semantics of a sequence of words (possibly not contiguous) are usually not compositional. Idiomatic forms, support verbs, verbs with specific particle or pre/post-position uses, morphological derivations via partial or full word duplications are some examples of multi-word expressions.</Paragraph>
    <Paragraph position="1"> Further, expressions such as time-date expressions or proper nouns which can be described with simple (usually finite state) grammars, and whose internal structure is of no real importance to the overall analysis of the sentence, can also be considered under this heading. Marking multi-word expressions in text usually reduces (though not significantly) the number of actual tokens that further processing modules use as input, although this reduction may depend on the domain the text comes from. It can also reduce the multiplicative ambiguity as morphological interpretations of tokens are reduced when they are coalesced into multi-word expressions with usually a single interpretation.</Paragraph>
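    <Paragraph> As a rough illustration of the multiplicative effect, the sketch below assumes hypothetical per-token counts of morphological interpretations; the numbers and function names are invented for illustration and are not taken from any corpus or from the processor described in this paper. The number of candidate readings of a sentence is the product of the per-token counts, so coalescing several ambiguous tokens into one multi-word expression with a single interpretation removes their factors from the product.

```python
from math import prod

def sentence_readings(parse_counts):
    """Number of candidate readings: product of per-token parse counts."""
    return prod(parse_counts)

def coalesce(parse_counts, start, end, mwe_parses=1):
    """Replace tokens start..end-1 with one multi-word token.

    mwe_parses is the number of interpretations kept for the merged
    token (assumed to be 1, as is usual for a multi-word expression).
    """
    return parse_counts[:start] + [mwe_parses] + parse_counts[end:]

# Hypothetical parse counts for a five-token sentence (illustrative only).
counts = [2, 3, 4, 1, 2]
merged = coalesce(counts, 1, 3)  # tokens 1 and 2 form one expression

print(sentence_readings(counts))  # 48
print(sentence_readings(merged))  # 4
```

    In this invented case the token count drops only from five to four, while the number of candidate readings drops from 48 to 4.</Paragraph>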
    <Paragraph position="2"> Turkish presents some interesting issues for multi-word expression processing as it makes substantial use of support verbs with lexicalized direct or oblique objects subject to various morphological constraints. It also uses partial and full reduplication of forms of various parts-of-speech, across their whole domain to form what we call non-lexicalized collocations, where it is the duplication and contrast of certain morphological patterns that signal a collocation rather than the specific root words used.</Paragraph>
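    <Paragraph> A minimal sketch of how such a non-lexicalized collocation might be recognized is given below. It matches on morphological patterns while leaving the root as a variable. The tag notation, the pattern table, and the names Analysis and find_duplications are assumptions made for this illustration and do not reproduce the processor described in this paper. The sample tokens are the Turkish forms "yavaş yavaş" (slowly) and "gelir gelmez" (as soon as someone comes), with hand-written analyses.

```python
from typing import List, NamedTuple, Tuple

class Analysis(NamedTuple):
    surface: str
    root: str
    tags: str  # hypothetical tag notation, e.g. "Verb+Pos+Aor"

# Tag patterns for adjacent tokens sharing a root (assumed, illustrative).
CONTRAST_PATTERNS = {("Verb+Pos+Aor", "Verb+Neg+Aor")}

def find_duplications(tokens: List[Analysis]) -> List[Tuple[int, str]]:
    """Return (index, kind) for each adjacent pair that matches."""
    hits = []
    for i in range(len(tokens) - 1):
        first, second = tokens[i], tokens[i + 1]
        if first.root != second.root:
            continue  # the root is free, but both tokens must share it
        if first.surface == second.surface:
            hits.append((i, "full-duplication"))
        elif (first.tags, second.tags) in CONTRAST_PATTERNS:
            hits.append((i, "contrast"))
    return hits

sentence = [
    Analysis("yavaş", "yavaş", "Adj"),
    Analysis("yavaş", "yavaş", "Adj"),
    Analysis("gelir", "gel", "Verb+Pos+Aor"),
    Analysis("gelmez", "gel", "Verb+Neg+Aor"),
]
print(find_duplications(sentence))
```

    Because the pattern table mentions only tags, the same entry would match any verb root that appears in both forms, which is the sense in which the collocation is not tied to specific root words.</Paragraph>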
    <Paragraph position="3"> In this paper, we describe a multi-word expression processor for preprocessing Turkish text for various language engineering applications. In the next section, after a very short overview of relevant aspects of Turkish, we present a rather comprehensive description of the multi-word expressions we handle. We then summarize the structure of the multi-word expression processor, which employs a series of components for tokenization, morphological analysis, conservative non-statistical morphological disambiguation, and multi-word expression extraction. We finally present results from runs over a large corpus and a small gold-standard corpus.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Related Work
</SectionTitle>
      <Paragraph position="0"> Recent work on multi-word expression extraction uses three basic approaches: statistical, rule-based, and hybrid. Statistical approaches require a corpus that contains significant numbers of occurrences of multi-word expressions. But even if the corpus consists of millions of words, the frequencies of multi-word expressions are usually too low for statistical extraction. Baldwin and Villavicencio (2002) indicate that two-thirds of verb-particle constructions occur at most three times in the overall corpus, meaning that any extraction method must be able to handle extremely sparse data. They use a rule-based method to extract multi-word expressions in the form of a head verb and a single obligatory preposition, employing a tagger augmented with an existing chunking system with which they first identify the chunked particle and then look back for the verb part of the construction. [Second ACL Workshop on Multiword Expressions: Integrating Processing, July 2004, pp. 64-71]</Paragraph>
      <Paragraph position="1"> Piao et al. (2003) employ their semantic field annotator USAS, containing 37,000 words and a template list of 16,000 multi-word units, all constructed manually from various resources, in order to extract multi-word expressions. The evaluation indicates a high precision (over 90%) but the estimated recall is about 40%. Deeper investigation of the corpus indicated that two-thirds of the multi-word expressions occur in the corpus only once or twice, confirming that statistical methods that filter out low frequencies would fail.</Paragraph>
      <Paragraph position="2"> Urizar et al. (2000) describe a Basque terminology extraction system which covers multi-word term extraction as a subset. As Basque is a highly inflected agglutinative language like Turkish, morphological information is exploited to better define multi-word patterns. Their lemmatizer/tagger, EUSLEM, consists of a tokenizer followed by two sub-systems for the treatment of single-word and multi-word expressions, and a disambiguator. The proposed term extraction tool feeds the tagged text to a shallow parsing phase which consists of regular expressions representing morphosyntactic patterns. The final step uses statistical measures to eliminate incorrect candidates.</Paragraph>
      <Paragraph position="3"> The basic disadvantages of rule-based approaches are that they usually lack flexibility, and that trying to cover a high percentage of the multi-word expressions in a language with rules and predefined lists is a time-consuming and never-ending process. The LINGO group, which defines multi-word expressions as "a pain in the neck for NLP" (Sag et al., 2002), suggests hybrid approaches that use rule-based methods to identify possible multi-word expressions in a corpus and statistical methods to enhance the results obtained.</Paragraph>
    </Section>
  </Section>
</Paper>