<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1035">
<Title>Inside-Outside Estimation of a Lexicalized PCFG for German</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 2 Corpus and morphology </SectionTitle>
<Paragraph position="0"> The data for the experiment is a corpus of German subordinate clauses extracted by regular expression matching from a 200 million token newspaper corpus. Clause length ranges from four to 12 words. Apart from infinitival VPs as verbal arguments, there are no further clausal embeddings, and the clauses contain no punctuation except for a terminal period. The corpus contains 4,128,873 tokens and 450,526 clauses, which yields an average of 9.16456 tokens per clause. Tokens are automatically annotated with a list of part-of-speech (PoS) tags using a computational morphological analyser based on finite-state technology (Karttunen et al., 1994; Schiller and Stöckert, 1995).</Paragraph>
<Paragraph position="1"> A problem for practical inside-outside estimation of an inflectional language like German arises from the large number of terminal and low-level non-terminal categories in the grammar, which result from the morpho-syntactic features of words. Apart from the major class (noun, adjective, and so forth), the analyser provides an ambiguous word with a list of possible combinations of inflectional features such as gender, person, and number (cf. the top part of Fig. 1 for an example that is ambiguous between nominal and adjectival PoS; the PoS is indicated following the '+' sign).</Paragraph>
<Paragraph position="2"> In order to reduce the number of parameters to be estimated, and to reduce the size of the parse forest used in inside-outside estimation, we collapsed the inflectional readings of adjectives, adjective-derived nouns, article words, and pronouns to a single morphological feature (see Fig. 1 for an example). This reduced the number of low-level categories, as exemplified in Fig. 2: das has one reading as an article and one as a demonstrative; westdeutschen has one reading as an adjective, with its morphological feature N indicating the inflectional suffix.</Paragraph>
<Paragraph position="3"> We use the special tag UNTAGGED to indicate that the analyser failed to provide a tag for a word. The vast majority of UNTAGGED words are proper names not recognized as such. These gaps in the morphology have little effect on our experiment.</Paragraph>
</Section>
</Paper>