File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/n04-4038_metho.xml
Size: 5,396 bytes
Last Modified: 2025-10-06 14:08:55
<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4038"> <Title>Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 SVM Based Approach </SectionTitle>
<Paragraph position="0"> In the literature, various machine learning approaches have been applied to POS tagging and BP chunking. These problems are cast as classification problems: given a number of features extracted from a pre-defined linguistic context, the task is to predict the class of a token. Support Vector Machines (SVMs) (Vapnik, 1995) are one such class of model. SVMs are a supervised learning algorithm with the advantage of being robust: they can handle a large number of (overlapping) features while maintaining good generalization performance. Consequently, SVMs have been applied to many NLP tasks with great success (Joachims, 1998; Kudo and Matsumoto, 2000; Hacioglu and Ward, 2003).</Paragraph>
<Paragraph position="1"> We adopt a tagging perspective for the three tasks.</Paragraph>
<Paragraph position="2"> We therefore address them using the same SVM experimental setup, which comprises a standard SVM used as a multi-class classifier (Allwein et al., 2000). The three tasks differ only in their input, context, and features. None of the features utilized in our approach is explicitly language-dependent. The following subsections describe each task and its corresponding features and tag set.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Word Tokenization </SectionTitle>
<Paragraph position="0"> We approach word tokenization (segmenting off clitics) as a one-of-six classification task, in which each letter in a word is tagged with a label indicating its morphological identity.4 Therefore, a word may have from 0 to 2 proclitics and from 0 to 1 enclitic from the lists described in Section 2.</Paragraph>
<Paragraph position="1"> A word may have no clitics at all, hence the 0.</Paragraph>
<Paragraph position="2"> Input: A sequence of transliterated Arabic characters processed from left-to-right with &quot;break&quot; markers for word boundaries.</Paragraph>
<Paragraph position="3"> Context: A fixed-size window of -5/+5 characters centered at the character in focus.</Paragraph>
<Paragraph position="4"> Features: All characters and previous tag decisions within the context.</Paragraph>
<Paragraph position="5"> Tag Set: The tag set is {B-PRE1, B-PRE2, B-WRD, I-WRD, B-SUFF, I-SUFF}, where I denotes inside a segment, B denotes the beginning of a segment, PRE1 and PRE2 are proclitic tags, SUFF is an enclitic tag, and WRD is the stem plus any affixes and/or the determiner Al.</Paragraph>
<Paragraph position="6"> Table 1 illustrates the correct tagging of the example above, w-b-hsnAt-hm, 'and by their virtues'.</Paragraph> </Section>
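To make the tokenization setup above concrete, here is a minimal sketch of per-character classification with a -5/+5 character window and previous tag decisions as features, using the six-label set from Section 4.1. It uses scikit-learn's DictVectorizer and LinearSVC as a stand-in classifier and greedy left-to-right decoding; the toolkit choice, feature names, and toy training data are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the authors' code): word tokenization as
# per-character SVM classification, following the description in Section 4.1.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def char_features(chars, i, prev_tags, window=5):
    """Characters in a -5/+5 window around position i, plus previous tag decisions."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        feats["char_%+d" % off] = chars[j] if 0 <= j < len(chars) else "<pad>"
    # Previous tag decisions that fall inside the same window.
    for off in range(1, window + 1):
        j = i - off
        feats["tag_-%d" % off] = prev_tags[j] if j >= 0 else "<pad>"
    return feats

def word_to_examples(word, tags):
    """One (features, label) example per character of a transliterated word."""
    chars = list(word)
    return [(char_features(chars, i, tags[:i]), tags[i]) for i in range(len(chars))]

# Toy training data: the example from the paper, w-b-hsnAt-hm ('and by their virtues').
train = word_to_examples(
    "wbhsnAthm",
    ["B-PRE1", "B-PRE2",
     "B-WRD", "I-WRD", "I-WRD", "I-WRD", "I-WRD",
     "B-SUFF", "I-SUFF"])

X, y = zip(*train)
model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(X, y)

def segment(word):
    """Greedy left-to-right tagging, feeding predicted tags back in as features."""
    chars, tags = list(word), []
    for i in range(len(chars)):
        tags.append(model.predict([char_features(chars, i, tags)])[0])
    return list(zip(chars, tags))

print(segment("wbhsnAthm"))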
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Part of Speech Tagging </SectionTitle>
<Paragraph position="0"> We model this task as a 1-of-24 classification task, where the class labels are POS tags from the collapsed tag set, which is derived from the collapsed POS-tagged Treebank. Input: A sequence of tokens processed from left-to-right. Context: A window of -2/+2 tokens centered at the focus token.</Paragraph>
<Paragraph position="1"> Features: Every character n-gram that occurs in the focus token, the 5 tokens themselves, their 'type' from the set {alpha, numeric}, and POS tag decisions for previous tokens within the context.</Paragraph>
<Paragraph position="2"> Tag Set: The tag set comprises the 24 collapsed tags available in the Arabic TreeBank distribution.</Paragraph>
<Paragraph position="3"> This collapsed tag set is a manually reduced form of the 135 morpho-syntactic tags created by AraMorph.</Paragraph>
<Paragraph position="4"> The tag set is as follows: {CC, CD, CONJ+NEG_PART, DT, FW, IN, JJ, NN, NNP, NNPS, NNS, NO_FUNC, NUMERIC_COMMA, PRP, PRP$, PUNC, RB, UH, VBD, VBN, VBP, WP, WRB}.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Base Phrase Chunking </SectionTitle>
<Paragraph position="0"> In this task, we use a setup similar to that of (Kudo and Matsumoto, 2000), where 9 types of chunked phrases are recognized using a phrase IOB tagging scheme: Inside (I) a phrase, Outside (O) a phrase, and Beginning (B) of a phrase. Thus the task is a 1-of-19 classification task (since there are I and B tags for each chunked phrase type, plus a single O tag). The training data is derived from the Arabic TreeBank using the ChunkLink software.5 ChunkLink flattens the tree to a sequence of base (non-recursive) phrase chunks with their IOB labels.</Paragraph>
<Paragraph position="1"> The following example illustrates the tagging scheme:
Tags: O B-VP B-NP I-NP
Translit: w qAlt rwv $wArtz
Arabic: و قالت روث شوارتز
Gloss: and said Ruth Schwartz
Input: A sequence of (word, POS tag) pairs.</Paragraph>
<Paragraph position="2"> Context: A window of -2/+2 tokens centered at the focus token.</Paragraph>
<Paragraph position="3"> Features: Words and POS tags that fall within the context, along with previous IOB tag decisions within the context.</Paragraph> </Section> </Section> </Paper>
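As a closing illustration of the base phrase chunking setup in Section 4.3, the sketch below assembles per-token features from the -2/+2 word/POS window together with previously assigned IOB tags, using the paper's example sentence. The POS tags attached to the example tokens and all identifier names are assumptions made for illustration, not values taken from the paper.

# Illustrative sketch (an assumption, not the authors' code): feature extraction
# for the chunking setup of Section 4.3. Input is a sequence of (word, POS tag)
# pairs; features come from a -2/+2 token window plus preceding IOB decisions.

def chunk_features(tokens, i, prev_iob, window=2):
    """tokens: list of (word, pos) pairs; prev_iob: IOB tags assigned so far."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        word, pos = tokens[j] if 0 <= j < len(tokens) else ("<pad>", "<pad>")
        feats["word_%+d" % off] = word
        feats["pos_%+d" % off] = pos
    # IOB tags already assigned to preceding tokens inside the window.
    for off in range(1, window + 1):
        j = i - off
        feats["iob_-%d" % off] = prev_iob[j] if j >= 0 else "<pad>"
    return feats

# The paper's example: 'w qAlt rwv $wArtz' ("and said Ruth Schwartz").
# The POS tags below are illustrative guesses from the collapsed tag set.
sentence = [("w", "CC"), ("qAlt", "VBD"), ("rwv", "NNP"), ("$wArtz", "NNP")]
gold_iob = ["O", "B-VP", "B-NP", "I-NP"]

for i in range(len(sentence)):
    print(sentence[i][0], gold_iob[i], chunk_features(sentence, i, gold_iob[:i]))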