<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1027">
  <Title>Shallow language processing architecture for Bulgarian</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 LINGUA - an architecture for language processing in Bulgarian
</SectionTitle>
    <Paragraph position="0"> LINGUA is a text processing framework for Bulgarian which automatically performs tokenisation, sentence splitting, part-of-speech tagging, parsing, clause segmentation, section-heading identification and resolution of third-person personal pronouns (Figure 1). All modules of LINGUA are original and purpose-built, except for the module for morphological analysis, which uses Krushkov's morphological analyser BULMORPH (Krushkov, 1997). The anaphora resolver is an adaptation for Bulgarian of Mitkov's knowledge-poor pronoun resolution approach (Mitkov, 1998).</Paragraph>
    <Paragraph position="1"> LINGUA was used in a number of projects covering automatic text abridging, word semantic extraction (Totkov and Tanev, 1999) and term extraction. The following sections outline the basic language processing functions, provided by the language engine.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Text segmentation: tokenisation, sentence splitting and paragraph identification
</SectionTitle>
      <Paragraph position="0"> The first stage of every text processing task is the segmentation of the text into tokens, sentences and paragraphs.</Paragraph>
      <Paragraph position="1"> LINGUA performs text segmentation by operatingwithinaninputwindowof30tokens, applying rule-based algorithm for token synthesis, sentence splitting and paragraph identification.  Tokens identified from the input text serve as input to the token stapler. The token stapler forms more complex tokens on the basis of a  token grammar. With a view to improving tokenisation, a list of abbreviations has been incorporated into LINGUA.</Paragraph>
      <Paragraph position="2">  LINGUA's sentence splitter operates to identify sentence boundaries on the basis of 9 main end-of-sentence rules and makes use of a list of abbreviations. Some of the rules consist of several finer sub-rules. The evaluation of the performance of the sentence splitter on a text of 190 sentences reports a precision of 92% and a recall of 99%. Abbreviated names such as J.S.Simpson are filtered by special constraints. The sentence splitting and tokenising rules were adapted for English. The resulting sentence splitter was then employed for identifying sentence boundaries in the Wolverhampton Corpus of Business English project.</Paragraph>
      <Paragraph position="3">  Paragraph identification is based on heuristics such as cue words, orthography and typographical markers. The precision of the paragraph splitter is about 94% and the recall is 98% (Table 3).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Morphological analysis and part-of-speech tagging
</SectionTitle>
      <Paragraph position="0"> Bulgarian morphology is complex: the paradigm of a verb, for example, has over 50 forms. Krushkov's morphological analyser BULMORPH (Krushkov, 1997) is integrated in the language engine with a view to processing Bulgarian texts at the morphological level. The level of morphological ambiguity for Bulgarian is not as high as in some other languages. As a guide, we measured the ratio of the number of all tags to the number of all words. The results show that this ratio is comparatively low: for a corpus of technical texts of 9,000 words the ratio of tags per word is 1.26, whereas for a 13,000-word corpus from the genre of fiction it is 1.32. For other languages such as Turkish this ratio is about 1.9, and for certain English corpora about 2.0.</Paragraph>
      <Paragraph position="1"> We used 33 hand-crafted rules for disambiguation. Since large tagged corpora in Bulgarian are not widely available, the development of a corpus-based probabilistic tagger was an unrealistic goal for us. However, as some studies suggest (Voutilainen, 1995), the precision of rule-based taggers may exceed that of the probabilistic ones.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Parsing
</SectionTitle>
      <Paragraph position="0"> Seeking a robust flexible solution for parsing we implemented two alternative approaches in LINGUA: a fast-working NP extractor and more general parser, which works more slowly, but delivers better results both in accuracy and coverage. AsnosyntacticallyannotatedBulgariancorporawereavailabletous, usingstatistical data to implement probabilistic algorithm was not an option.</Paragraph>
      <Paragraph position="1"> The NP extraction algorithm is capable of analysing nested NPs, NPs which contain left  modifiers, prepositional phrases and coordinating phrases. The NP extractor is based on a simple unification grammar for NPs and APs.</Paragraph>
      <Paragraph position="2"> The recall of NP extraction, measured against 352 NPs from software manuals, was 77% and the precision - 63.5%.</Paragraph>
      <Paragraph position="3"> A second, better coverage parser was implemented which employs a feature grammar based on recent formal models for Bulgarian, (Penchev, 1993), (Barkalova, 1997). All basic types of phrases such as NP, AP, PP, VP and AdvP are described in this grammar. The parser is supported by a grammar compiler, working on grammar description language for representation of non context unification grammars. For example one of the rules for synthesis of NP phrases has the form: NP(def :Y full art:F ext:+ rex:! nam:!) !AP(gender:X def:Y full art:F number: L ) NP(ext:! def:! number:L gender:X rex:!) The features and values in the rules are not fixed sets and can be changed dynamically. The flexibility of this description allows the grammar to be extended easily. The parser uses a chart bottom-up strategy, which allows for partial parsing in case full syntactic tree cannot be built over the sentence.</Paragraph>
      <Paragraph position="4"> There are currently about 1900 syntactic rules in the grammar which are encoded through 70 syntactic formulae.</Paragraph>
      <Paragraph position="5"> Small corpus of 600 phrases was syntactically annotated by hand. We used this corpus to measure the precision of the parsing algorithm (Table 3).</Paragraph>
      <Paragraph position="6"> We found that the precision of NP extraction performed by the chart parser is higher than the precision of the standalone NP extraction 74.8% vs. 63.5% while the recall improves by only 0.9% - 77.9% vs. 77% .</Paragraph>
      <Paragraph position="7">  Thesyntacticambiguityisresolvedusingsyntactic verb frames and heuristics, similar to the ones described in (Allen, 1995).</Paragraph>
      <Paragraph position="8"> The parser reaches its best performance for  NPs(74.8%precisionand77.9%recall)andlowest for VPs (33% precision, 26% recall) and Ss (20% precision and 5.9% recall) (Table 3).</Paragraph>
      <Paragraph position="9"> The overall (measured on all the 600 syntactic phrases) precision and recall, are 64.9% and 60.5% respectively. This is about 20% lower, compared with certain English parsers (Murat and Charniak, 1995), which is due to the insufficient grammar coverage, as well as the lack of reliable disambiguation algorithm. However the bracket crossing accuracy is 80%, which is comparable tosomeprobabilistic approaches. It should be noted that in our experiments we restricted the maximal number of arcs up to 35000 per sentence to speed up the parsing.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Anaphora resolution in Bulgarian
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Adaptation of Mitkov's knowledge-poor approach for Bulgarian
</SectionTitle>
      <Paragraph position="0"> The anaphora resolution module is implemented as the last stage of the language processing architecture (Figure 1). This module resolves third-person personal pronouns and is an adaptation of Mitkov's robust, knowledge-poor multilingual approach (Mitkov, 1998), whose latest implementation by R. Evans is referred to as MARS (Orasan et al., 2000). MARS does not make use of parsing, syntactic or semantic constraints; nor does it employ any form of non-linguistic knowledge. Instead, the approach relies on the efficiency of sentence splitting, part-of-speech tagging and noun phrase identification, and on the high performance of the antecedent indicators; knowledge is limited to a small noun phrase grammar, a list of (indicating) verbs and a set of antecedent indicators. The core of the approach lies in activating the antecedent indicators after filtering candidates (from the current and two preceding sentences) on the basis of gender and number agreement; the candidate with the highest composite score is proposed as the antecedent. Before that, the text is pre-processed by a sentence splitter which determines the sentence boundaries, a part-of-speech tagger which identifies the parts of speech and a simple phrasal grammar which detects the noun phrases. In the case of complex sentences, heuristic 'clause identification' rules track the clause boundaries.</Paragraph>
      <Paragraph position="1"> LINGUA performs the pre-processing, needed as an input to the anaphora resolution algorithm: sentence, paragraph and clause splitters, NP grammar, part-of-speech tagger,  section heading identification heuristics. Since one of the indicators that Mitkov's approach uses is term preference, we manually developed4 a small term bank containing 80 terms from the domains of programming languages, word processing, computer hardware and operating systems 5. This bank additionally featured 240 phrases containing these terms.</Paragraph>
      <Paragraph position="2"> The antecedent indicators employed in MARS are classified as boosting (such indicators when pointing to a candidate, reward it with a bonus since there is a good probability of it being the antecedent) or impeding (such indicators penalise a candidate since it does not appear to have high chances of being the antecedent). The majority of indicators are genre-independent and are related to coherence phenomena (such as salience and distance) or to structural matches, whereas others are genre-specific (e.g. term preference, immediate reference, sequential instructions). Most of the indicators have been adopted in LINGUA without modification from the original English version (see (Mitkov, 1998) for more details). However, we have added 3 new indicators for Bulgarian: selectional restriction pattern, adjectival NPs and name preference.</Paragraph>
      <Paragraph position="3"> The boosting indicators are First Noun Phrases: A score of +1 is assigned to the first NP in a sentence, since it is deemed 4This was done for experimental purposes. In future applications, we envisage the incorporation of automatic term extraction techniques.</Paragraph>
      <Paragraph position="4"> 5Note that MARS obtains terms automatically using TF.IDF.</Paragraph>
      <Paragraph position="5"> to be a good candidate for the antecedent.</Paragraph>
      <Paragraph position="6"> Indicating verbs: A score of +1 is assigned to those NPs immediately following the verb which is a member of a previously defined set such as discuss, present, summarise etc.</Paragraph>
      <Paragraph position="7"> Lexical reiteration: A score of +2 is assigned those NPs repeated twice or more in the paragraph in which the pronoun appears, a score of +1 is assigned to those NP, repeated once in the paragraph.</Paragraph>
      <Paragraph position="8"> Section heading preference: A score of +1 is assigned to those NPs that also appear in the heading of the section.</Paragraph>
      <Paragraph position="9"> Collocation match: A score of +2 is assigned to those NPs that have an identical collocation pattern to the pronoun.</Paragraph>
      <Paragraph position="10">  Immediate reference: A score of +2 is assigned to those NPs appearing in constructions of the form &amp;quot; ...V1 NP &lt; CB &gt; V2 it &amp;quot; , where &lt; CB &gt; is a clause boundary.</Paragraph>
      <Paragraph position="11"> Sequential instructions: A score of +2 is  applied to NPs in the NP1 position of constructions of the form: &amp;quot;To V1 NP1 ... To V2 it ...&amp;quot; Term preference: a score of +1 is applied to those NPs identified as representing domain terms.</Paragraph>
      <Paragraph position="12"> Selectional restriction pattern: a score of Text Pronouns Intrasentential: Average Average Average Average Intersentential number of distance from distance from distance from anaphors candidates the antecedent the antecedent the antecedent per anaphor in clauses in sentences in NP  +2 is applied to noun phrases occurring in collocation with the verb preceding or following the anaphor. This preference is different from the collocation match preference in that it operates on a wider range of 'selectional restriction patterns' associated with a specific verb 6 and not on exact lexical matching. If the verb preceding or following the anaphor is identified to be in a legitimate collocation with a certain candidate for antecedent, that candidate is boosted accordingly. As an illustration, assume that 'Delete file' has been identified as a legitimate collocation being a frequent expression in a domain specific corpus and consider the example 'Make sure you save the file in the new directory. You can now delete it. ' Whereas the 'standard' collocation match will not be activated here, the selectional restriction pattern will identify 'delete file' as an acceptable construction and will reward the candidate 'the file'.</Paragraph>
      <Paragraph position="13"> Adjectival NP: a score of +1 is applied to NPs which contain adjectives modifying the head. Empirical analysis shows that Bulgarian constructions of that type are more salient than NPs consisting simply of a noun. Recent experiments show that the success rate of the anaphora resolution is improved by 2.20%, using this indicator. It would be interesting to establish if this indicator is applicable for English.</Paragraph>
      <Paragraph position="14"> Name preference: a score +2 is applied to names of entities (person, organisation, product 6At the moment these patterns are extracted from a list of frequent expressions involving the verb and domain terms in a purpose-built term bank but in generally they are automatically collected from large domain-specific corpora.</Paragraph>
      <Paragraph position="15"> names).</Paragraph>
      <Paragraph position="16"> The impeding indicator is Prepositional Noun Phrases: NPs appearing in prepositional phrases are assigned a score of -1.</Paragraph>
      <Paragraph position="17"> Two indicators, Referential distance and Indefiniteness may increase or decrease a candidate's score.</Paragraph>
      <Paragraph position="18"> Referential distance gives scores of +2 and +1 for the NPs in the same and in the previous sentence respectively, and -1 for the NPs two sentences back. This indicator has strong influence on the anaphora resolution performance, especially in the genre of technical manuals.</Paragraph>
      <Paragraph position="19"> Experiments show that its switching off can decrease the success rate by 26% .</Paragraph>
      <Paragraph position="20"> Indefiniteness assigns a score of -1 to indefinite NPs, 0 to the definite (not full article) and +1 to these which are definite, containing the definite 'full' article in Bulgarian.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>