File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/97/p97-1031_intro.xml
Size: 3,311 bytes
Last Modified: 2025-10-06 14:06:15
<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1031"> <Title>A Flexible POS Tagger Using an Automatically Acquired Language Model*</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In NLP, it is necessary to model the language in a representation suitable for the task to be performed.</Paragraph> <Paragraph position="1"> The most commonly used language models follow two main approaches. The first is the linguistic approach, in which the model is written by a linguist, generally in the form of rules or constraints (Voutilainen and Järvinen, 1995). The second is the automatic approach, in which the model is automatically obtained from corpora (either raw or annotated) [1], and consists of n-grams (Garside et al., 1987; Cutting et al., 1992), rules (Hindle, 1989) or neural nets (Schmid, 1994). In the automatic approach we can distinguish two main trends. The low-level data trend collects statistics from the training corpora in the form of n-grams, probabilities, weights, etc. The high-level data trend acquires more sophisticated information, such as context rules, constraints, or decision trees (Daelemans et al., 1996; Màrquez and Rodríguez, 1995; Samuelsson et al., 1996). 
The acquisition methods range from supervised inductive learning-from-example algorithms (Quinlan, 1986; Aha et al., 1991) to genetic algorithm strategies (Losee, 1994), through the transformation-based error-driven algorithm used in (Brill, 1995). [Footnote *: This research has been partially funded by the Spanish Research Department (CICYT) and inscribed as TIC96-1243-C03-02.] [Footnote 1: When the model is obtained from annotated corpora we talk about supervised learning; when it is obtained from raw corpora, training is considered unsupervised.]</Paragraph> <Paragraph position="2"> Still another possibility is the hybrid models, which try to combine the advantages of both approaches (Voutilainen and Padró, 1997).</Paragraph> <Paragraph position="3"> In this paper we present a hybrid approach that combines both trends of the automatic approach with the linguistic approach. We describe a POS tagger, based on the work described in (Padró, 1996), that is able to use bi/trigram information, automatically learned context constraints, and linguistically motivated, manually written constraints. The sources and kinds of constraints are unrestricted, and the language model can be easily extended. The structure of the tagger is presented in figure 1.</Paragraph> <Paragraph position="4"> [Figure 1 diagram lost in extraction; legible labels: Language Model, Corpus.]</Paragraph> <Paragraph position="5"> Figure 1: Tagger architecture.</Paragraph> <Paragraph position="6"> We also present a constraint-acquisition algorithm that uses statistical decision trees to learn context constraints from annotated corpora, and we use the acquired constraints to feed the POS tagger.</Paragraph> <Paragraph position="7"> The paper is organized as follows. In section 2 we describe our language model, in section 3 we describe the constraint acquisition algorithm, and in section 4 we present the tagging algorithm. 
Descriptions of the corpus used, the experiments performed and the results obtained can be found in sections 5 and 6.</Paragraph> </Section> </Paper>