<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1025">
  <Title>Automatic Measurement of Syntactic Development in Child Language</Title>
  <Section position="4" start_page="197" end_page="198" type="metho">
    <SectionTitle>
3 Automatic Syntactic Analysis of Child Language Transcripts
</SectionTitle>
      <Paragraph position="0"> A necessary step in the automatic computation of IPSyn scores is to produce an automatic syntactic analysis of the transcripts being scored. We have developed a system that parses transcribed child utterances and identifies grammatical relations (GRs) according to the CHILDES syntactic annotation scheme (Sagae et al., 2004). This annotation scheme was designed specifically for child-parent dialogs, and we have found it suitable for the identification of the syntactic structures necessary in the computation of IPSyn.</Paragraph>
      <Paragraph position="1"> Our syntactic analysis system takes a sentence and produces a labeled dependency structure representing its grammatical relations. An example of the input and output associated with our system can be seen in figure 1. The specific GRs identified by the system are listed in figure 2.</Paragraph>
      <Paragraph position="2"> The three main steps in our GR analysis are: text preprocessing, unlabeled dependency identification, and dependency labeling. In the following subsections, we examine each of them in more detail.</Paragraph>
    <Section position="2" start_page="197" end_page="198" type="sub_section">
      <SectionTitle>
3.1 Text Preprocessing
</SectionTitle>
      <Paragraph position="0"> The CHAT transcription system2 is the format followed by all transcript data in the CHILDES database, and it is the input format we use for syntactic analysis. CHAT specifies ways of transcribing extra-grammatical material such as disfluency, retracing, and repetition, common in spontaneous spoken language. Transcripts of child language may contain a large amount of extra-grammatical mate-</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="198" end_page="199" type="metho">
    <SectionTitle>
SUBJ, ESUBJ, CSUBJ, XSUBJ
COMP, XCOMP
JCT, CJCT, XJCT
OBJ, OBJ2, IOBJ
PRED, CPRED, XPRED
MOD, CMOD, XMOD
AUX NEG DET QUANT POBJ PTL
CPZR COM INF VOC COORD ROOT
</SectionTitle>
    <Paragraph position="0"> Subject, expletive subject, clausal subject (finite and non[?]finite)Object, second object, indirect object Clausal complement (finite and non[?]finite) Predicative, clausal predicative (finite and non[?]finite) Adjunct, clausal adjunct (finite and non[?]finite) Nominal modifier, clausal nominal modifier (finite and non[?]finite) Auxiliary Negation Determiner Quantifier Prepositional objectVerb particle CommunicatorComplementizer Infinitival &amp;quot;to&amp;quot; Vocative Coordinated itemTop node Figure 2: Grammatical relations in the CHILDES syntactic annotation scheme.</Paragraph>
    <Paragraph position="1"> rial that falls outside of the scope of the syntactic annotation system and our GR identifier, since it is already clearly marked in CHAT transcripts. By using the CLAN tools (MacWhinney, 2000), designed to process transcripts in CHAT format, we remove disfluencies, retracings and repetitions from each sentence. Furthermore, we run each sentence through the MOR morphological analyzer (MacWhinney, 2000) and the POST part-of-speech tagger (Parisse and Le Normand, 2000). This results in fairly clean sentences, accompanied by full morphological and part-of-speech analyses.</Paragraph>
    <Section position="1" start_page="198" end_page="198" type="sub_section">
      <SectionTitle>
3.2 Unlabeled Dependency Identification
</SectionTitle>
      <Paragraph position="0"> Once we have isolated the text that should be analyzed in each sentence, we parse it to obtain unlabeled dependencies. Although we ultimately need labeled dependencies, our choice to produce unlabeled structures first (and label them in a later step) is motivated by available resources. Unlabeled dependencies can be readily obtained by processing constituent trees, such as those in the Penn Tree-bank (Marcus et al., 1993), with a set of rules to determine the lexical heads of constituents. This lexicalization procedure is commonly used in statistical parsing (Collins, 1996) and produces a dependency tree. This dependency extraction procedure from constituent trees gives us a straightforward way to obtain unlabeled dependencies: use an existing statistical parser (Charniak, 2000) trained on the Penn Treebank to produce constituent trees, and extract unlabeled dependencies using the aforementioned head-finding rules.</Paragraph>
      <Paragraph position="1"> Our target data (transcribed child language) is from a very different domain than the one of the data used to train the statistical parser (the Wall Street Journal section of the Penn Treebank), but the degradation in the parser's accuracy is acceptable. An evaluation using 2,018 words of in-domain manually annotated dependencies shows that the dependency accuracy of the parser is 90.1% on child language transcripts (compared to over 92% on section 23 of the Wall Street Journal portion of the Penn Treebank). Despite the many differences with respect to the domain of the training data, our domain features sentences that are much shorter (and therefore easier to parse) than those found in Wall Street Journal articles. The average sentence length varies from transcript to transcript, because of factors such as the age and verbal ability of the child, but it is usually less than 15 words.</Paragraph>
    </Section>
    <Section position="2" start_page="198" end_page="199" type="sub_section">
      <SectionTitle>
3.3 Dependency Labeling
</SectionTitle>
      <Paragraph position="0"> After obtaining unlabeled dependencies as described above, we proceed to label those dependencies with the GR labels listed in Figure 2.</Paragraph>
      <Paragraph position="1"> Determining the labels of dependencies is in general an easier task than finding unlabeled dependencies in text.3 Using a classifier, we can choose one of the 30 possible GR labels for each dependency, given a set of features derived from the dependencies. Although we need manually labeled data to train the classifier for labeling dependencies, the size of this training set is far smaller than what would be necessary to train a parser to find labeled dependen3Klein and Manning (2002) offer an informal argument that constituent labels are much more easily separable in multidimensional space than constituents/distituents. The same argument applies to dependencies and their labels.</Paragraph>
      <Paragraph position="2">  cies in one pass.</Paragraph>
      <Paragraph position="3"> We use a corpus of about 5,000 words with manually labeled dependencies to train TiMBL (Daelemans et al., 2003), a memory-based learner (set to use the k-nn algorithm with k=1, and gain ratio weighing), to classify each dependency with a GR label. We extract the following features for each dependency: null  tree that includes both the head and dependent.</Paragraph>
      <Paragraph position="4"> The accuracy of the classifier in labeling dependencies is 91.4% on the same 2,018 words used to evaluate unlabeled accuracy. There is no intersection between the 5,000 words used for training and the 2,018-word test set. Features were tuned on a separate development set of 582 words.</Paragraph>
      <Paragraph position="5"> When we combine the unlabeled dependencies obtained with the Charniak parser (and head-finding rules) and the labels obtained with the classifier, overall labeled dependency accuracy is 86.9%, significantly above the results reported (80%) by Sagae et al. (2004) on very similar data.</Paragraph>
      <Paragraph position="6"> Certain frequent and easily identifiable GRs, such as DET, POBJ, INF, and NEG were identified with precision and recall above 98%. Among the most difficult GRs to identify were clausal complements COMP and XCOMP, which together amount to less than 4% of the GRs seen the training and test sets.</Paragraph>
      <Paragraph position="7"> Table 1 shows the precision and recall of GRs of particular interest.</Paragraph>
      <Paragraph position="8"> Although not directly comparable, our results are in agreement with state-of-the-art results for other labeled dependency and GR parsers. Nivre (2004) reports a labeled (GR) dependency accuracy of 84.4% on modified Penn Treebank data. Briscoe and Carroll (2002) achieve a 76.5% F-score on a very rich set of GRs in the more heterogeneous and challenging Susanne corpus. Lin (1998) evaluates his MINIPAR system at 83% F-score on identification of GRs, also in data from the Susanne corpus (but using simpler GR set than Briscoe and Carroll).</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="199" end_page="200" type="metho">
    <SectionTitle>
4 Automating IPSyn
</SectionTitle>
    <Paragraph position="0"> Calculating IPSyn scores manually is a laborious process that involves identifying 56 syntactic structures (or their absence) in a transcript of 100 child utterances. Currently, researchers work with a partially automated process by using transcripts in electronic format and spreadsheets. However, the actual identification of syntactic structures, which accounts for most of the time spent on calculating IPSyn scores, still has to be done manually.</Paragraph>
    <Paragraph position="1"> By using part-of-speech and morphological analysis tools, it is possible to narrow down the number of sentences where certain structures may be found. The search for such sentences involves patterns of words and parts-of-speech (POS). Some structures, such as the presence of determiner-noun or determiner-adjective-noun sequences, can be easily identified through the use of simple patterns.</Paragraph>
    <Paragraph position="2"> Other structures, such as front or center-embedded clauses, pose a greater challenge. Not only are patterns for such structures difficult to craft, they are also usually inaccurate. Patterns that are too general result in too many sentences to be manually examined, but more restrictive patterns may miss sentences where the structures are present, making their identification highly unlikely. Without more syntactic analysis, automatic searching for structures in IPSyn is limited, and computation of IPSyn scores still requires a great deal of manual inspection.</Paragraph>
    <Paragraph position="3"> Long, Fey and Channell (2004) have developed a software package, Computerized Profiling (CP), for child language study, which includes a (mostly)  automated computation of IPSyn.4 CP is an extensively developed example of what can be achieved using only POS and morphological analysis. It does well on identifying items in IPSyn categories that do not require deeper syntactic analysis. However, the accuracy of overall scores is not high enough to be considered reliable in practical usage, in particular for older children, whose utterances are longer and more sophisticated syntactically. In practice, researchers usually employ CP as a first pass, and manually correct the automatic output. Section 5 presents an evaluation of the CP version of IPSyn.</Paragraph>
    <Paragraph position="4"> Syntactic analysis of transcripts as described in section 3 allows us to go a step further, fully automating IPSyn computations and obtaining a level of reliability comparable to that of human scoring.</Paragraph>
    <Paragraph position="5"> The ability to search for both grammatical relations and parts-of-speech makes searching both easier and more reliable. As an example, consider the following sentences (keeping in mind that there are no explicit commas in spoken language):  (a) Then [,] he said he ate.</Paragraph>
    <Paragraph position="6"> (b) Before [,] he said he ate.</Paragraph>
    <Paragraph position="7"> (c) Before he ate [,] he ran.</Paragraph>
    <Paragraph position="8"> Sentences (a) and (b) are similar, but (c) is dif- null ferent. If we were looking for a fronted subordinate clause, only (c) would be a match. However, each one of the sentences has an identical part-speechsequence. If this were an isolated situation, we might attempt to fix it by having tags that explicitly mark verbs that take clausal complements, or by adding lexical constraints to a search over part-of-speech patterns. However, even by modifying this simple example slightly, we find more problems: (d) Before [,] he told the man he was cold.</Paragraph>
    <Paragraph position="9"> (e) Before he told the story [,] he was cold.</Paragraph>
    <Paragraph position="10"> Once again, sentences (d) and (e) have identical part-of-speech sequences, but only sentence (e) features a fronted subordinate clause. These limited toy examples only scratch the surface of the difficulties in identifying syntactic structures without syntactic 4Although CP requires that a few decisions be made manually, such as the disambiguation of the lexical item &amp;quot;'s&amp;quot; as copula vs. genitive case marker, and the definition of sentence breaks for long utterances, the computation of IPSyn scores is automated to a large extent.</Paragraph>
    <Paragraph position="11"> analysis beyond part-of-speech and morphological tagging. In these sentences, searching with GRs is easy: we simply find a GR of clausal type (e.g.</Paragraph>
    <Paragraph position="12"> CJCT, COMP, CMOD, etc) where the dependent is to the left of its head.</Paragraph>
    <Paragraph position="13"> For illustration purposes of how searching for structures in IPSyn is done with GRs, let us look at how to find other IPSyn structures5: * Wh-embedded clauses: search for wh-words whose head, or transitive head (its head's head, or head's head's head...) is a dependent in GR of types [XC]SUBJ, [XC]PRED, [XC]JCT, [XC]MOD, COMP or XCOMP; * Relative clauses: search for a CMOD where the dependent is to the right of the head; * Bitransitive predicate: search for a word that is a head of both OBJ and OBJ2 relations.</Paragraph>
    <Paragraph position="14"> Although there is still room for under- and over-generalization with search patterns involving GRs, finding appropriate ways to search is often made trivial, or at least much more simple and reliable than searching without GRs. An evaluation of our automated version of IPSyn, which searches for IPSyn structures using POS, morphology and GR information, and a comparison to the CP implementation, which uses only POS and morphology information, is presented in section 5.</Paragraph>
  </Section>
class="xml-element"></Paper>