<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1602">
  <Title>Developing an Arabic Treebank: Methods, Guidelines, Procedures, and Tools</Title>
  <Section position="3" start_page="0" end_page="1" type="intro">
    <SectionTitle>
2 Issues of methodology and training with Modern Standard Arabic
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Defining the specificities of 'Modern Standard Arabic'
</SectionTitle>
      <Paragraph position="0"> Modern Standard Arabic (MSA), the natural language under investigation, is not natively spoken by Arabs, who acquire it only through formal schooling. MSA is the only form of written communication in the whole of the Arab world.</Paragraph>
      <Paragraph position="1"> Thus, there exists a living writing and reading community of MSA. However, the level of MSA acquisition by its members is far from homogeneous, and their linguistic knowledge, even at the highest levels of education, is very unequal.</Paragraph>
      <Paragraph position="2"> This problem has an impact on our corpus annotation training, routine, and results. As in other Semitic languages, inflection in MSA is mostly carried by case endings, which are represented by vocalic diacritics appended in word-final position. Note that the written form of MSA in our corpus data is a graphic representation in which short vowel markers and other pertinent signs like the 'shaddah' (consonantal gemination) are left out, as is typical in most written Arabic, especially news writing. However, this deficient graphic representation does not indicate a deficient language system. The reader reads the text and interprets its meaning by 'virtually providing' the missing grammatical information that leads to its acceptable interpretation.</Paragraph>
      <Paragraph position="3"> 2.2 How important is the missing information? Our description and analysis of MSA linguistic structures is first done in terms of individual words and then expanded to syntactic functions. Each corpus token is labeled in terms of its category and also in terms of its functions. It is marked morphologically and syntactically, and other relevant relationship features also intervene such as concord, agreement and adjacency. This redundancy decreases the importance of the absence of most vocalic features.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
2.3 The issue of vocalization
</SectionTitle>
      <Paragraph position="0"> The corpus for our annotation in the ATB requires that annotators complement the data by mentally supplying morphological information before choosing the automatic analysis. This amounts to a prerequisite 'manual/human' intervention that takes effect even before the annotation process begins. Since no automatic vocalization of unvocalized MSA newswire data is provided prior to annotation, vocalization becomes the responsibility of annotators at both layers of annotation. The part-of-speech (POS) annotators provide a first interpretation of the text/data, and a vocalized output is created for the syntactic treebank (TB) annotators, who are then responsible for either validating the interpretation under their scrutiny or challenging it and providing another interpretation. This can have drastic consequences, as in the case of the so-called 'Arabic deverbals', where the same bare graphemic structure can be either two nouns in an 'idhafa (annexation or construct state) situation' with a genitive case ending on the second noun, or a 'virtual' verb or verbal function with a noun complement in the accusative to indicate a direct object. In Example 1, genitive case is assigned under the noun interpretation, while accusative case is assigned by the same graphemic form of the word in its more verbal function (Badawi et al., 2004, cf. Section 2.10, pp. 237-246).</Paragraph>
      <Paragraph position="1"> Example 1. Neutral form: &lt;xbArh Al+nb&gt; (unvocalized). Idhafa: &lt;ixbAruhu Al+naba&gt;i 'his receipt (of) the news' [news genitive]. Verbal: &lt;ixbAruhu Al+naba&gt;a 'his telling the news' [news accusative]. (For the transliteration system of all our Arabic corpora, we use Tim Buckwalter's code, at http://www.ldc.upenn.edu/myl/morph/buckwalter.html) These are sometimes difficult decisions to make, and annotators' agreement in this case is always at its lowest. Vocalization decisions have a non-trivial impact on the overall annotation routine in terms of both accuracy and speed.</Paragraph>
      <Paragraph position="2"> Vocalization is a difficult problem, and we did not have the tools to address it when the project began. We originally decided to treat our first corpus, AFP, by having annotators supply word-internal lexical identity vocalization only, because that is how people normally read Arabic - taking the normal risks taken by all readers, with the assumption that any interpretation of the case or mood chosen would be acceptable as the interpretation of an educated native speaker annotator. In our second corpus, UMAAH, we decided that it would improve annotation and the overall usefulness of the corpus to vocalize the texts, by putting the necessary rules of syntax and vocalization at the POS level of annotation - our annotators added case endings to nouns and voice to verbs, in addition to the word-internal lexical identity vocalization. For our third corpus, ANNAHAR (currently in production), we have decided to fully vocalize the text, adding the final missing piece, mood endings for verbs. In conclusion, vocalization is a nagging but necessary &quot;nuisance&quot; because while its presence just enhances the linguistic analysis of the targeted corpus, its absence could be turned into an issue of quality of annotation and of grammatical credibility among Arab and non-Arab users.</Paragraph>
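      <Paragraph position="3"> The ambiguity in Example 1 can be illustrated with a minimal sketch, which is not part of the ATB tooling. It assumes Buckwalter transliteration, in which the short vowels, sukun, shadda, and nunation are written a, i, u, o, ~, F, N, K, and it treats '+' as a plain morpheme separator. The function and variable names are hypothetical. Only the second word of Example 1 is used, since that is where the case ending differs.

```python
# Buckwalter symbols for short vowels (a, i, u), sukun (o), shadda (~),
# and nunation (F, N, K). Assumption: these are the marks that unvocalized
# newswire text leaves out.
BUCKWALTER_DIACRITICS = frozenset("aiuo~FNK")


def strip_vocalization(token: str) -> str:
    """Return the bare graphemic form of a Buckwalter-transliterated token."""
    return "".join(ch for ch in token if ch not in BUCKWALTER_DIACRITICS)


def group_by_bare_form(readings):
    """Map each bare form to the vocalized readings that collapse onto it."""
    groups = {}
    for vocalized, gloss in readings:
        bare = strip_vocalization(vocalized)
        groups.setdefault(bare, []).append((vocalized, gloss))
    return groups


# The second word of Example 1 in its two case-marked readings.
readings = [
    ("Al+naba>i", "the news [genitive, idhafa reading]"),
    ("Al+naba>a", "the news [accusative, verbal reading]"),
]

for bare, candidates in group_by_bare_form(readings).items():
    print(bare, "can be read as:")
    for vocalized, gloss in candidates:
        print("   ", vocalized, "=", gloss)
```

Both vocalized readings reduce to the same bare form, so the unvocalized text alone cannot tell an annotator which case ending to supply; the choice has to come from the syntactic interpretation.</Paragraph>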
    </Section>
  </Section>
</Paper>