<?xml version="1.0" standalone="yes"?> <Paper uid="E06-2024"> <Title>A Suite of Shallow Processing Tools for Portuguese: LX-Suite</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Sentence chunker </SectionTitle> <Paragraph position="0"> The sentence chunker is a finite state automaton (FSA), where the state transitions are triggered by specified character sequences in the input, and the emitted symbols correspond to sentence (&lt;s&gt;) and paragraph (&lt;p&gt;) boundaries. Within this setup, a transition rule could define, for example, that a period, when followed by a space and a capital letter, marks a sentence boundary: &quot;***. A***&quot; → &quot;***.&lt;/s&gt;&lt;s&gt;A***&quot; Being a rule-based chunker, it was tailored to handle orthographic conventions that are specific to Portuguese, in particular those governing dialog excerpts. This allowed the tool to reach 99.95% recall and 99.92% precision.3</Paragraph> </Section> <Section position="4" start_page="0" end_page="179" type="metho"> <SectionTitle> 3 Tokenizer </SectionTitle> <Paragraph position="0"> Tokenization is, for the most part, a simple task, as the whitespace character is used to mark most token boundaries. Most of the other cases are also rather simple: punctuation symbols are separated from words, contracted forms are expanded, and clitics in enclisis or mesoclisis position are detached from verbs. It is worth noting that the first element of an expanded contraction is marked with a symbol (+) indicating that, originally, that token occurred as part of a contraction:4 um, dois → |um|,|dois|; da → |de+|a|; viu-o → |viu|-o|. As far as Portuguese is concerned, the non-trivial aspects of tokenization are found in the handling of ambiguous strings that, depending on their POS tag, may or may not be considered a contraction. 
For example, the word deste can be tokenized as the single token |deste| if it occurs as a verb (Eng.: [you] gave) or as the two tokens |de+|este| if it occurs as a contraction (Eng.: of this).</Paragraph> <Paragraph position="1"> It is worth noting that this problem is not a minor issue, as these strings amount to 2% of the corpus that was used and any tokenization error will have a considerable negative influence on the subsequent steps of processing, such as POS tagging. To resolve the issue of ambiguous strings, a two-stage tokenization strategy is used, where the ambiguous strings are not immediately tokenized. Instead, the decision relies on the POS tagger: the tagger must first be trained on a version of the corpus where the ambiguous strings are not tokenized, and are tagged with a composite tag when occurring as a contraction (for example P+DEM for a contraction of a preposition and a demonstrative). The tagger then runs over the text and assigns a simple or a composite tag to the ambiguous strings. A second pass with the tokenizer then looks for occurrences of tokens with a composite tag and splits them:</Paragraph> <Paragraph position="3"> This approach allowed us to successfully resolve 99.4% of the ambiguous strings. This is far better than the 78.20% baseline obtained by always treating the ambiguous strings as contractions.5</Paragraph> </Section> <Section position="5" start_page="179" end_page="179" type="metho"> <SectionTitle> 4 POS tagger </SectionTitle> <Paragraph position="0"> For the POS tagging task we used Brants' TnT tagger (Brants, 2000), a very efficient statistical tagger based on Hidden Markov Models.</Paragraph> <Paragraph position="1"> For training, we used 90% of a 280,000 token corpus, accurately hand-tagged with a tagset of ca.</Paragraph> <Paragraph position="2"> 60 tags, with inflectional feature values left aside. 
Evaluation showed an accuracy of 96.87% for this tool, obtained by averaging 10 test runs over different 10% contiguous portions of the corpus that were not used for training.</Paragraph> <Paragraph position="3"> The POS tagger we developed is currently the fastest tagger for the Portuguese language, and it is in line with state-of-the-art taggers for other languages, as discussed in (Branco and Silva, 2004).</Paragraph> </Section> <Section position="6" start_page="179" end_page="180" type="metho"> <SectionTitle> 5 Nominal featurizer </SectionTitle> <Paragraph position="0"> This tool assigns feature value tags for inflection (Gender and Number) and degree (Diminutive, Superlative and Comparative) to words from nominal morphosyntactic categories.</Paragraph> <Paragraph position="1"> Such tagging is typically done by a POS tagger, by using a tagset where the base POS tags have been extended with feature values. However, this increase in the number of tags leads to a lower tagging accuracy due to the data-sparseness problem. With our tool, we explored what could be gained by having a dedicated tool for the task of nominal featurization.</Paragraph> <Paragraph position="2"> We tried several approaches to nominal featurization. Here we report on the rule-based approach, which is the one that best highlights the difficulties in this task.</Paragraph> <Paragraph position="3"> For this tool, we built on morphological regularities and used a set of rules that, depending on the word termination, assign default feature values to words. Naturally, these rules were supplemented by a list of exceptions, which was collected by using a machine-readable dictionary (MRD) that allowed us to search words by termination.</Paragraph> <Paragraph position="4"> Nevertheless, this procedure is still not enough to assign a feature value to every token. The most direct reason is the existence of so-called invariant words, which are lexically ambiguous with respect to feature values. 
For example, the Common Noun ermita (Eng.: hermit) can be masculine or feminine, depending on the occurrence. By simply using termination rules supplemented with exceptions, such words will always be tagged with underspecified feature values:6 ermita/?S To handle such cases the featurizer makes use of feature propagation. With this mechanism, words from closed classes, whose feature values are known, propagate their values to the words from open classes following them. These words, in turn, propagate those features to other words: Special care must be taken to prevent feature propagation from reaching outside NP boundaries. For this purpose, some sequences of POS categories block feature propagation. In the example below, a PP inside an NP context, azul (an &quot;invariant&quot; adjective) might agree with faca or with the preceding word, aço. To prevent mistakes, propagation from aço to azul should be blocked. faca/FS de aço/MS azul/FS Eng.: blue (steel knife) or faca/FS de aço/MS azul/MS Eng.: (blue steel) knife For the sake of comparability with other possible similar tools, we evaluated the featurizer only over Adjectives and Common Nouns: it has 95.05% recall (leaving ca. 5% of the tokens with underspecified tags) and 99.05% precision.7</Paragraph> </Section> <Section position="7" start_page="180" end_page="181" type="metho"> <SectionTitle> 6 Nominal lemmatizer </SectionTitle> <Paragraph position="0"> Nominal lemmatization consists in assigning to Adjectives and Common Nouns a normalized form, typically the masculine singular if available. Our approach uses a list of transformation rules that change the termination of the words.</Paragraph> <Paragraph position="1"> For example, one states that any word ending in ta should have that ending transformed into to: gata ([female] cat) → gato ([male] cat) There are, however, exceptions that must be accounted for. 
The word porta, for example, is a feminine common noun, and its lemma is porta: porta (door, feminine common noun) → porta Relevant exceptions like the one above were collected by resorting to an MRD that allows words to be searched by their termination. Since dictionaries only list lemmas (and not inflected forms), it is possible to search for words with terminations matching the termination of inflected words (for example, words ending in ta). Any word found by the search can thus be considered an exception. [Footnote 7: For a much more extensive analysis, including a comparison with other approaches, see (Branco and Silva, 2005a).]</Paragraph> <Paragraph position="2"> A major difficulty in this task lies in the listing of exceptions when non-inflectional affixes are taken into account. As an example, let us consider again the word porta. This word is an exception to the rule that transforms ta into to. As expected, this word can occur prefixed, as in superporta. Therefore, this derived word should also appear in the list of exceptions to prevent it from being lemmatized into superporto by the rule. However, proceeding like this for every possible prefix leads to an explosion in the number of exceptions. To avoid this, a mechanism was used that progressively strips prefixes from words while checking the resulting word forms against the list of exceptions: A similar problem arises when tackling words with suffixes. For instance, the suffix -zinho and its inflected forms (-zinha, -zinhos and -zinhas) are used as diminutives. These suffixes should be removed by the lemmatization process. However, there are exceptions, such as the word vizinho (Eng.: neighbor), which is not a diminutive. This word has to be listed as an exception, together with its inflected forms (vizinha, vizinhos and vizinhas), which again leads to a great increase in the number of exceptions. 
To avoid this, only vizinho is explicitly listed as an exception and the inflected forms of the diminutive are progressively undone while looking for an exception:</Paragraph> <Paragraph position="4"> To ensure that exceptions will not be overlooked, when both these mechanisms work in parallel one must follow all possible paths of affix removal. A heuristic chooses as the lemma the result found in the fewest steps.8 To illustrate this, consider the word antena (Eng.: antenna). Figure 1 shows the paths followed by the lemmatization algorithm when it is faced with antenazinha (Eng.: [small] antenna). Both ante- and -zinha are possible affixes. In a first step, two search branches are opened, the first where ante- is removed and the second where -zinha is transformed into -zinho. The search proceeds under each branch until no transformation is possible, or an exception has been found. The end result is the &quot;leaf node&quot; with the shortest depth which, in this example, is antena (an exception). [Footnote 8: This can be seen as following a rationale similar to Karlsson's (1990) local disambiguation procedure.]</Paragraph> <Paragraph position="5"> This branching might seem to lead to a great performance penalty, but only a few words have affixes, and most of them have only one, in which case there is no branching at all.</Paragraph> <Paragraph position="6"> This tool achieves an accuracy of 94.75%.9</Paragraph> </Section> <Section position="8" start_page="181" end_page="181" type="metho"> <SectionTitle> 7 Verbal featurizer and lemmatizer </SectionTitle> <Paragraph position="0"> To each verbal token, this tool assigns the corresponding lemma and a tag with feature values for Mood, Tense, Person and Number.</Paragraph> <Paragraph position="1"> The tool uses a list of rules that, depending on the termination of the word, assign all possible lemma-feature pairs. 
The word diria, for example, is assigned the following lemma-feature pairs: Currently, this tool does not attempt to disambiguate among the proposed lemma-feature pairs. So, each verbal token will be tagged with all its possible lemma-feature pairs.</Paragraph> <Paragraph position="2"> The tool was evaluated over a list with ca.</Paragraph> <Paragraph position="3"> 800,000 verbal forms. It achieves 100% precision, but at 50% recall, as half of those forms are ambiguous and receive more than one lemma-feature pair.</Paragraph> </Section> </Paper>