File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/98/w98-1014_metho.xml
Size: 13,066 bytes
Last Modified: 2025-10-06 14:15:15
<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1014"> <Title>REFERENCES</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> RULE BASED LEXICAL ANALYSIS OF MALTESE </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> Since no computer based dictionaries exist for Maltese, the only analysis that can be made at present is rule based. The paper describes a rule based system taking into account the mixed origins, (semitic and romance), of Maltese to categorise function and verb words. The system separates the database, the rule formalisms and the rule definitions, enabling easier analysis and quicker changes to rules when necessary.</Paragraph> <Paragraph position="1"> ~ITRODUCHON Maltese is written in Roman script using 30 letters made up of six vowels and twenty four consonants. Of particular interest are two graphemes gh and ie which though written as two letters are actually considered as one grapheme in the language. A recent survey, based on the most well known Maltese dictionary \[l\], found that the origin of the words is 40% Semitic, 40% Romance (Italian) and 20% English. The linguistic processing has therefore to take into account this mixed structure. The purpose of this work was to analyse the words to try and distinguish function words and verb words in the running text, within a text-to-speech synthesis application. This is necessary to obtain pause information at appropriate word boundaries.</Paragraph> <Paragraph position="2"> Therefore only a partial lexical analysis is being considered. The method used here involves a rule based system with separation between the rule definitions, the rule operations and the input data formalism. Two databases have the function words and the verb words respectively, to compare with the word under test. The rule formalisms are kept in small local tables pertaining to the various rules under test. In this way addition/deletion of the data is completely independent of the rule formalism, and rules can be added or amended independently of each other.</Paragraph> </Section> <Section position="3" start_page="0" end_page="102" type="metho"> <SectionTitle> MALTESE HNGUISTICS </SectionTitle> <Paragraph position="0"> As in other Semitic languages, Maltese words of semitic origin do not have a stem to which affixes are connected, but rather use transfixes.</Paragraph> <Paragraph position="1"> The stem or root is made up of a number of consonants, which can never occur in isolation, and whose order cannot be altered. Transfixes are then added to the root, sometimes also with prefixes and suffixes. Transfixes are made up of a number of vowels and may include operation on consonants such as doubling the middle consonant, (geminate). On the other hand, words of Romance origin follow the usual pattern of a stem and affixes of the inflectional and derivational type to form other lexemes. For example the semitic derivations from k,t,b are kiteb to write</Paragraph> </Section> <Section position="4" start_page="102" end_page="102" type="metho"> <SectionTitle> ~-'UNCZON WORDS </SectionTitle> <Paragraph position="0"> Function words in Maltese are classified as p~tidelli and pronomi and they are made up of pronouns, prepositions, conjunctions, interjections and adverbs. These can in turn be distinct words hawn, hekk, gSal, qabel, jien have personal pronoun suffixes g$~ alina, qablek composite hawnhekk, g~alhekk short phrases fuq il-qalb, sewwa sew In addition the definite article is added to the list of function words. The dictionary by Aquilina was used to obtain a comprehensive list of function words, for the database. Some function words use inflectional morphemes as suffixes. The definite article and function words that assimilate the definite article change their final letter for some consonants, distinguished as xemxin. Therefore morphological analysis is essential to keep the function word database to a reasonable size. The function word types are also distinguished for the purposes of the application as article il-, irpronoun fien, huma adverb hekk, sewwa conjunction u, jekk, mela prepositions assimilating the article mas-, fil-, ta' This subdivision is important as it helps in the syntactic analysis. The definite article is associated with a noun or adjective, the pronoun is associated with a noun phrase, a conjunction introduces a phrase, and the adverb indicates a verb phrase. Many function words can be of more than one type, and depend on the sentence syntax for the outcome.</Paragraph> </Section> <Section position="5" start_page="102" end_page="102" type="metho"> <SectionTitle> WORDS </SectionTitle> <Paragraph position="0"> Verb words in Maltese can be, like in other languages, conjugated in the present and past tense. There is no formal future tense, as in Italian, and auxiliary verbs are used to obtain other tenses. In general the present tense has prefixes to distinguish person, and suffixes to distinguish quantity. The past tense has suffixes to distinguish both person and quantity. In all of these the surface form of the stem can change. Like other languages there are numerous exceptions. Other verb words include the imperative, the negation of the verb and the passive and reflexive forms. Additionally the Maltese language tends to use suffixes extensively within verb words, to obtain generic accusative and dative object. For example</Paragraph> <Paragraph position="2"> She wrote it (fem.) to me The number of pronoun suffixes that can be added to every verb is considerable, \[2\].</Paragraph> </Section> <Section position="6" start_page="102" end_page="104" type="metho"> <SectionTitle> LINGUISTIC ANALYSIS </SectionTitle> <Paragraph position="0"> To keep the same formalism, the same data structure is defined for the analysis of the function words and the verb words. This consists essentially of the stem consonants, the stem word, the part of speech, and the lexical group. The lexical group relates to the VC, (vowel consonant) sequence within the word.</Paragraph> <Paragraph position="1"> This order gives rise to different manipulations of prefixes and suffixes with the stem, and therefore different sets of morphological rules.</Paragraph> <Section position="1" start_page="102" end_page="103" type="sub_section"> <SectionTitle> Function Word Analysis </SectionTitle> <Paragraph position="0"> The word under test is first checked for a valid function word. The analysis starts by looking for valid suffixes. The remaining stem is analysed for the consonants within the stem.</Paragraph> <Paragraph position="1"> The function word database is then examined for keys with the same consonants or more.</Paragraph> <Paragraph position="2"> Each corresponding entry has its lexical group which is then used for the test. The test is according to the morphological rule appropriate for that lexical group. If a match is made the word is assigned the corresponding function word category stored in the function word database. The rule syntax is as follows.</Paragraph> <Paragraph position="3"> Removing suffix I results in a stem that has operations done on stem. Stem operations are denoted as</Paragraph> <Paragraph position="5"> where Type can be (+,,-) meaning addition, no addition, deletion or (+/-) meaning delete the left string and add the right string.</Paragraph> <Paragraph position="6"> Letter Strings are ASCII strings to add or delete. In cases where no operation or only one type of operation is to be done, the rest of the field is left blank. Position is optional and denotes where in the stem the change should happen. The position is with respect to the end (right hand side of stem). Default is 1 and means abut to the stem. For example (+)('e')(2) means add to the stem letter e at position 2 from end of the stem (+/-)(a'/ieg~) means delete the part iegL~ from the stem end and substitute with a'.</Paragraph> <Paragraph position="7"> No operation is also valid. The decision for validity depends on whether the resulting stem, after stem operations, is a valid lexeme in the database.</Paragraph> </Section> <Section position="2" start_page="103" end_page="104" type="sub_section"> <SectionTitle> Verb Word Analysis </SectionTitle> <Paragraph position="0"> If not a valid function word, a further test is made to check whether it is a valid verb word.</Paragraph> <Paragraph position="1"> Verb word analysis is initially different from function word analysis since verb words earl also have prefixes. All possible consonant group sequences made up of two or more consonants from the surface word are considered, starting from the sequence with all the consonants in the surface word. Initially these groups are passed through the irregular verb list, then through the mute verb list, and then through the database. (Mute verbs are those that have a consonant in the stem that is missing in the surface form). Any consonant groupings found in the surface form that have entries in the dictionaries (irregular, mute or main) are potential candidates. For the first potential candidate the lexeme stem part defined as that part of the surface word that incorporates the lexeme consonants is isolated.</Paragraph> <Paragraph position="2"> If any prefix or suffix stems result, these are examined using the morphological rules for verbs. This results in either rejection since the affix stems cannot result in valid affixes, or valid affixes. If the affix stems have any remaining parts after valid suffix and prefix assignment, it is returned to the lexeme stem.</Paragraph> <Paragraph position="3"> The lexeme stem is now examined for a valid lexeme stem for the particular database lexeme under test. This is done by a separate rule set for each verb type. If it is validated the stem has the immediate prefix and suffix restored.</Paragraph> <Paragraph position="4"> (This should exclude all pronoun suffixes). This results in a word form (P).X.(S), where P is an optional prefix, and similarly for the suffix.</Paragraph> <Paragraph position="5"> This diminished surface word form is used to obtain, from the linguistic database for the verb type, the linguistic analysis for the word.</Paragraph> <Paragraph position="6"> The rules are based on the root which is the 3rd person singular masculine in the past tense. The keys of the database contain the consonants of the root. Sixteen verb categories are defined in the database. Each verb category has a number of formalisms, defined as VC patterns, the surface word is allowed in the stem part for that particular verb category. Each VC formal pattern has in turn, a set of rules : (prefix) + stem + (suffix) (2) where the optional prefix and suffix are defined in the rules. One entry in the rule set for verbs of category type Vl 1 is as follows</Paragraph> <Paragraph position="8"> The first column has format <prefix 1>, <prefix 2> ..<prefix i> ; <suffix 1>, <suffix 2>, ..</Paragraph> <Paragraph position="9"> The meaning is any of the prefixes is valid. The ' ; ' is a delimiter when there are both prefixes and suffixes. The second column defines the surface stem for the current rule in the verb category and the valid affixes it can take. The stem definition is in terms of consonant and vowel positions. For a given root the stem consonants are identical to the root consonants in the sequence of the root consonants. The vowels in the stem do not necessarily have to be the same vowels as in the root.</Paragraph> <Paragraph position="10"> For example consider the word under test jiksruhomlok. Initially all potential aggregate of consecutive consonants is examined as a potential verb stem. In particular for the consecutive consonants k, s, r, the database yields the verb kiser, in verb category V11. Using (2) the affixes are 'j'i' and 'uhomlok'. 'uhomlok' is found in the initial analysis as being three valid suffixes - 'u', ~om' 'lok'. )&quot; is found as a valid prefix.</Paragraph> <Paragraph position="11"> The immediate affixes are returned so that the lexeme under test is now )'iksru'. Out of all the rules defined for verb category V11 to which kiser belongs, the rule definition (3) gives a valid outcome.</Paragraph> <Paragraph position="12"> For Romance verbs a corresponding rule entry is as follows n ; a,i +< >+ verb(indicative present first person singular) (4) where the < > implies the root stem which remains the same in all verb forms for romance verbs.</Paragraph> <Paragraph position="13"> For example given the word form nippermettilek, the consecutive consonants p,p,r,m,t,t find a database entry ippermetti under the verb category for romance verbs.</Paragraph> <Paragraph position="14"> Using (2) 'n' and 'ilek' are found as valid affixes. The immediate affixes are returned and the rule (4) is then found to be satisfied.</Paragraph> </Section> </Section> class="xml-element"></Paper>