<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-2013">
  <Title>Automatic Induction of a CCG Grammar for Turkish</Title>
  <Section position="3" start_page="73" end_page="74" type="intro">
    <SectionTitle>
2 Data
</SectionTitle>
    <Paragraph position="0"> The METU-Sabanci Treebank is a subcorpus of the METU Turkish Corpus (Atalay et al., 2003; Oflazer et al., 2003). The samples in the corpus are taken from 3 daily newspapers, 87 journal issues and 201 books. The treebank has 5635 sentences and a total of 53993 tokens. The average sentence length is about 8 words. However, a Turkish word may correspond to several English words, since the morphological information in the treebank encodes additional information including part-of-speech, modality, tense, person, case, etc. The syntactic relations used to model the dependencies are the following.</Paragraph>
    <Paragraph position="1"> ETOL is used for constructions very similar to phrasal verbs in English. &quot;Collocation&quot; is used for idiomatic usages and word sequences with certain patterns. Punctuation marks do not play a role in the dependency structure unless they participate in a relation, such as the use of a comma in coordination. The label &quot;Sentence&quot; links the head of the sentence to the punctuation mark, or to a conjunct in case of coordination. So the head of the sentence is always known, which is helpful in case of scrambling. Figure 1 shows how (5) is represented in the treebank.</Paragraph>
    <Paragraph position="2"> (5) Kapinin kenarindaki duvara dayanip bize bakti bir an.</Paragraph>
    <Paragraph position="3"> (He) looked at us leaning on the wall next to the door, for a moment.</Paragraph>
    <Paragraph position="4"> The dependencies in the Turkish treebank are surface dependencies. Phenomena such as traces and pro-drop are not modelled in the treebank. A word can be dependent on only one word, but a word can have more than one dependent. The fact that the dependencies run from the head of one constituent to the head of another (Figure 2) makes it easier to recover the constituency information than in some other treebanks, e.g. the Penn Treebank, where no clue is given regarding the heads of the constituents.</Paragraph>
    <Paragraph position="6"> from deps. to the head</Paragraph>
    <Paragraph position="8"> Two principles of CCG, Head Categorial Uniqueness and Lexical Head Government, mean that both extracted and in situ arguments depend on the same category. This means that long-range dependencies must be recovered and added to the trees used in the lexicon induction process, to avoid wrong predicate-argument structures (Section 3.5).</Paragraph>
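The single-head property described above (each word depends on exactly one word, while a head may have several dependents) is what makes constituents recoverable from head-to-head links. The following is a minimal sketch of that idea, not the procedure used in the paper. The toy English sentence, the 1-based head positions, and the helper names children_of, constituent and words are all illustrative assumptions; none of them come from the treebank or its tools.

```python
from collections import defaultdict

# Toy sentence, NOT taken from the treebank: one (form, head) pair per word.
# Heads are 1-based word positions; 0 marks the head of the sentence.
# Storing a single head per word enforces the one-head constraint by construction.
SENTENCE = [
    ("the", 3),
    ("old", 3),
    ("dog", 4),
    ("barked", 0),
    ("loudly", 4),
]


def children_of(sentence):
    """Map each head position to the positions of its dependents."""
    children = defaultdict(list)
    for position, (_form, head) in enumerate(sentence, start=1):
        children[head].append(position)
    return children


def constituent(position, children):
    """Return the sorted positions dominated by the word at `position`."""
    span = [position]
    for dependent in children.get(position, []):
        span.extend(constituent(dependent, children))
    return sorted(span)


def words(positions, sentence):
    """Turn a list of positions back into word forms."""
    return [sentence[p - 1][0] for p in positions]


children = children_of(SENTENCE)
root = children[0][0]  # the word whose head is 0
noun_phrase = words(constituent(3, children), SENTENCE)
whole_sentence = words(constituent(root, children), SENTENCE)
```

In this sketch the constituent headed by a word is simply the word plus everything reachable through its dependents, so the head of every recovered constituent is known without extra annotation.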
  </Section>
</Paper>